I created a random array and wrote it repeatedly to an Arrow IPC file so that the resulting file was too large to fit in RAM, then read it back with memory mapping. Slicing the table works without any problem, but when I try to access rows at an arbitrary list of indices using `take`, RAM usage keeps growing until the computer hangs. The reproducing code is below (the array length and the number of writes may need to be adjusted to your disk space and RAM size):
```python
import numpy as np
import pyarrow as pa
from pyarrow import feather

rng = np.random.default_rng(1337)
data = rng.normal(size=(1000000,))
table = pa.table({'data': data})

sink = pa.output_stream('data.feather')
schema = pa.schema([('data', pa.float64())])
with pa.ipc.new_file(sink, schema) as writer:
    for i in range(1000):
        writer.write_table(table)

table = feather.read_table('data.feather', memory_map=True)
print(table.take([0]))
```

Component(s): Python
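For contrast, this is the kind of slice access mentioned above that works fine on the same memory-mapped file. A minimal sketch, not part of the original reproduction; the slice offset and length are arbitrary:

```python
from pyarrow import feather

# Re-open the file written by the snippet above with memory mapping.
table = feather.read_table('data.feather', memory_map=True)

# Slicing returns a zero-copy view over the memory map and completes
# without any noticeable memory growth, unlike table.take([0]) above.
print(table.slice(500, 5))
```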