Showing content from https://github.com/apache/arrow/issues/34583 below:

[Python] Pyarrow DictionaryArray.dictionary_decode mangling strings · Issue #34583 · apache/arrow · GitHub

Describe the bug, including details regarding any error messages, version, and platform.

I have a NumPy array of strings containing UK postcodes (2,603,678 entries) and another NumPy array of indices (1,785,499,246 entries) into it.

In the output file I want to replace the indices with the strings, so I created a DictionaryArray from them as follows:

postcode_dict = pa.DictionaryArray.from_arrays(pcds_id, pcds)

Where pcds_id contains the indices and pcds contains the postcode strings.

A UK postcode format is A[A]N[N|A] NAA so varies between 6 and 8 characters in length.

A = alpha, N = numeric, | = or, [] = optional.
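The format above can be written as a regular expression, which gives a quick way to spot values that are not valid postcodes. This is only a sketch of the pattern as stated here (uppercase letters, single space); the names POSTCODE_RE and looks_like_postcode are hypothetical helpers, and real postcode validation has further rules not covered.

```python
import re

# A[A]N[N|A] NAA: one or two letters, a digit, an optional digit or letter,
# a space, then a digit followed by two letters.
POSTCODE_RE = re.compile(r"[A-Z]{1,2}[0-9][0-9A-Z]? [0-9][A-Z]{2}")

def looks_like_postcode(value: str) -> bool:
    """Return True if the whole string matches the format described above."""
    return POSTCODE_RE.fullmatch(value) is not None

for candidate in ["AB1 0AA", "ZE3 9XP", " 1UGAL1"]:
    print(repr(candidate), looks_like_postcode(candidate))
```

The shortest match is 6 characters and the longest is 8, in line with the range given above.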

The dictionary creation works fine:

<pyarrow.lib.DictionaryArray object at 0x7fc6b758b8b0>

-- dictionary:
  [
    "AB1 0AA",
    "AB1 0AB",
    "AB1 0AD",
    "AB1 0AE",
    "AB1 0AF",
    "AB1 0AG",
    "AB1 0AJ",
    "AB1 0AL",
    "AB1 0AN",
    "AB1 0AP",
    ...
    "ZE3 9JP",
    "ZE3 9JR",
    "ZE3 9JS",
    "ZE3 9JT",
    "ZE3 9JU",
    "ZE3 9JW",
    "ZE3 9JX",
    "ZE3 9JY",
    "ZE3 9JZ",
    "ZE3 9XP"
  ]
-- indices:
  [
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    ...
    2603676,
    2603676,
    2603677,
    2603677,
    2603677,
    2603677,
    2603677,
    2603677,
    2603677,
    2603677
  ]

However, applying dictionary_decode to this array results in mangled strings towards the end of the output:

<pyarrow.lib.StringArray object at 0x7ed1b2e2f760>
[
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  ...
  " 1UGAL1",
  " 1UGAL1",
  " 1UGAL1",
  " 1UGAL1",
  " 1UGAL1",
  " 1UGAL1",
  " 1UGAL1",
  " 1UGAL1",
  " 1UGAL1",
  " 1UGAL1"
]
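The report does not state a cause. One check that may be worth doing, offered as an assumption and not a confirmed diagnosis, is whether the fully decoded column still fits in Arrow's default string type, which stores value offsets as 32-bit integers. The sketch below uses only the row count and the 6 to 8 character length range given above, and assumes one byte per character because postcodes are ASCII.

```python
# Largest byte offset a 32-bit signed offset can address.
INT32_MAX = 2**31 - 1

n_rows = 1_785_499_246     # number of indices, from the report
min_len, max_len = 6, 8    # postcode length range, from the report

# Lower and upper bounds on the decoded string data, one byte per character.
min_bytes = n_rows * min_len
max_bytes = n_rows * max_len

exceeds_int32_offsets = min_bytes > INT32_MAX

print(f"decoded data: between {min_bytes:,} and {max_bytes:,} bytes")
print(f"32-bit offset limit: {INT32_MAX:,} bytes")
print(f"exceeds limit even at minimum length: {exceeds_int32_offsets}")
```

If this limit is what is being hit, Arrow's large_string type, which uses 64-bit offsets, would be the natural thing to look at. Whether that avoids the problem in this version is not established by the report.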

However, if I convert the array to pandas it formats correctly:

postcode_dict.to_pandas()

0             AB1 0AA
1             AB1 0AA
2             AB1 0AA
3             AB1 0AA
4             AB1 0AA
               ...   
1785499241    ZE3 9XP
1785499242    ZE3 9XP
1785499243    ZE3 9XP
1785499244    ZE3 9XP
1785499245    ZE3 9XP
Length: 1785499246, dtype: category
Categories (2603678, object): ['AB1 0AA', 'AB1 0AB', 'AB1 0AD', 'AB1 0AE', ..., 'ZE3 9JX', 'ZE3 9JY', 'ZE3 9JZ', 'ZE3 9XP']

Source arrays in NPZ format inside a zip file:

postcode_dict.zip

DictionaryArray in Parquet format inside a zip file on Google Drive (too large for GitHub):

postcode_dict_parquet.zip

I'm using a RAPIDS Docker image from NVIDIA NGC (the workflow has GPU dependencies, namely cuGraph), and the PyArrow version is '10.0.1'.

Component(s)

Python

Data Source Attribution & License

Data is a spatial weights matrix (crosstab of graph distances) built from UK Office for National Statistics ONS Postcode Directory February 2023

