I have a NumPy array of strings containing UK postcodes (2,603,678 entries) and another NumPy array of indices into it (1,785,499,246 entries).
In the output file I want to replace the indices with the strings, so I created a DictionaryArray from them as follows:
postcode_dict = pa.DictionaryArray.from_arrays(pcds_id, pcds)
Where pcds_id contains the indices and pcds contains the postcode strings.
The UK postcode format is A[A]N[N|A] NAA, so the length varies between 6 and 8 characters.
A = alpha, N = numeric, | = or, [] = optional.
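The format above can be written as a regular expression, which gives a quick way to tell a well-formed postcode from a mangled one. This is a minimal sketch that follows only the pattern as described here, not the full official postcode rules; the names POSTCODE_RE and is_postcode are hypothetical.

```python
import re

# A[A]N[N|A] NAA: one or two letters, a digit, an optional digit or letter,
# a space, then a digit followed by two letters.
POSTCODE_RE = re.compile(r"[A-Z]{1,2}[0-9][0-9A-Z]? [0-9][A-Z]{2}")


def is_postcode(value: str) -> bool:
    """Return True if the whole string matches the format described above."""
    return POSTCODE_RE.fullmatch(value) is not None


print(is_postcode("AB1 0AA"))
print(is_postcode(" 1UGAL1"))
```

A check like this could be run over a decoded array to find where the values stop looking like postcodes.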
The dictionary creation works fine:
<pyarrow.lib.DictionaryArray object at 0x7fc6b758b8b0>
-- dictionary:
[
"AB1 0AA",
"AB1 0AB",
"AB1 0AD",
"AB1 0AE",
"AB1 0AF",
"AB1 0AG",
"AB1 0AJ",
"AB1 0AL",
"AB1 0AN",
"AB1 0AP",
...
"ZE3 9JP",
"ZE3 9JR",
"ZE3 9JS",
"ZE3 9JT",
"ZE3 9JU",
"ZE3 9JW",
"ZE3 9JX",
"ZE3 9JY",
"ZE3 9JZ",
"ZE3 9XP"
]
-- indices:
[
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
...
2603676,
2603676,
2603677,
2603677,
2603677,
2603677,
2603677,
2603677,
2603677,
2603677
]
However, applying dictionary_decode to this array mangles the strings towards the end of the output:
<pyarrow.lib.StringArray object at 0x7ed1b2e2f760>
[
"AB1 0AA",
"AB1 0AA",
"AB1 0AA",
"AB1 0AA",
"AB1 0AA",
"AB1 0AA",
"AB1 0AA",
"AB1 0AA",
"AB1 0AA",
"AB1 0AA",
...
" 1UGAL1",
" 1UGAL1",
" 1UGAL1",
" 1UGAL1",
" 1UGAL1",
" 1UGAL1",
" 1UGAL1",
" 1UGAL1",
" 1UGAL1",
" 1UGAL1"
]
If I convert the array to pandas instead, it formats correctly:
postcode_dict.to_pandas()
0 AB1 0AA
1 AB1 0AA
2 AB1 0AA
3 AB1 0AA
4 AB1 0AA
...
1785499241 ZE3 9XP
1785499242 ZE3 9XP
1785499243 ZE3 9XP
1785499244 ZE3 9XP
1785499245 ZE3 9XP
Length: 1785499246, dtype: category
Categories (2603678, object): ['AB1 0AA', 'AB1 0AB', 'AB1 0AD', 'AB1 0AE', ..., 'ZE3 9JX', 'ZE3 9JY', 'ZE3 9JZ', 'ZE3 9XP']
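One difference between the two paths is that to_pandas returns a categorical, which keeps the integer codes and the 2,603,678 categories separate, while dictionary_decode materializes every string. The sketch below gives a rough estimate of the materialized size, assuming one byte per character and the 6 to 8 character lengths given above. Arrow's string type uses 32-bit offsets; whether that limit is what causes the mangling here is an assumption and is not confirmed in this report.

```python
# Rough size of the fully decoded character data (assumes 1 byte per character).
n_entries = 1_785_499_246
min_len, max_len = 6, 8

min_bytes = n_entries * min_len
max_bytes = n_entries * max_len

# Largest offset a signed 32-bit integer can hold.
int32_max = 2**31 - 1

exceeds_int32 = min_bytes > int32_max
print(f"decoded data: {min_bytes:,} to {max_bytes:,} bytes")
print(f"int32 limit:  {int32_max:,} bytes, exceeded: {exceeds_int32}")
```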
Source arrays in NPZ format inside a zip file:
DictionaryArray in Parquet inside a zip file on Google Drive (too large for GitHub):
I'm using a RAPIDS Docker file from NVIDIA NGC (the workflow has a GPU dependency, cuGraph) and the PyArrow version is '10.0.1'.
Component(s): Python
Data Source Attribution & License: Data is a spatial weights matrix (crosstab of graph distances) built from the UK Office for National Statistics (ONS) Postcode Directory, February 2023.