pyarrow.parquet.ParquetFile

Bases: object

Reader interface for a single Parquet file.

Parameters
source : str, pathlib.Path, pyarrow.NativeFile, or file-like object
    Readable source. For passing bytes or a buffer-like file containing a Parquet file, use pyarrow.BufferReader.
metadata : FileMetaData, default None
    Use existing metadata object, rather than reading from file.
common_metadata : FileMetaData, default None
    Will be used in reads for pandas schema metadata if not found in the main file's metadata; no other uses at the moment.
read_dictionary : list
    List of column names to read directly as DictionaryArray.
binary_type : pyarrow.DataType, default None
    If given, Parquet binary columns will be read as this datatype. This setting is ignored if a serialized Arrow schema is found in the Parquet metadata.
list_type : subclass of pyarrow.DataType, default None
    If given, non-MAP repeated columns will be read as an instance of this datatype (either pyarrow.ListType or pyarrow.LargeListType). This setting is ignored if a serialized Arrow schema is found in the Parquet metadata.
memory_map : bool, default False
    If the source is a file path, use a memory map to read the file, which can improve performance in some environments.
buffer_size : int, default 0
    If positive, perform read buffering when deserializing individual column chunks. Otherwise IO calls are unbuffered.
pre_buffer : bool, default False
    Coalesce and issue file reads in parallel to improve performance on high-latency filesystems (e.g. S3). If True, Arrow will use a background I/O thread pool.
coerce_int96_timestamp_unit : str, default None
    Cast timestamps that are stored in INT96 format to a particular resolution (e.g. 'ms'). Setting to None is equivalent to 'ns' and therefore INT96 timestamps will be inferred as timestamps in nanoseconds.
decryption_properties : FileDecryptionProperties, default None
    File decryption properties for Parquet Modular Encryption.
thrift_string_size_limit : int, default None
    If not None, override the maximum total string size allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.
thrift_container_size_limit : int, default None
    If not None, override the maximum total size of containers allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.
filesystem : FileSystem, default None
    If nothing passed, will be inferred based on path. Path will try to be found in the local on-disk filesystem; otherwise it will be parsed as a URI to determine the filesystem.
page_checksum_verification : bool, default False
    If True, verify the checksum for each page read from the file.
arrow_extensions_enabled : bool, default True
    If True, read Parquet logical types as Arrow extension types where possible (e.g., read JSON as the canonical arrow.json extension type or UUID as the canonical arrow.uuid extension type).
Examples
Generate an example PyArrow Table and write it to Parquet file:
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
Create a ParquetFile object from the Parquet file:
>>> parquet_file = pq.ParquetFile('example.parquet')
Read the data:
>>> parquet_file.read()
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[2,2,4,4,5,100]]
animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]
Create a ParquetFile object with the 'animal' column read as DictionaryArray:
>>> parquet_file = pq.ParquetFile('example.parquet',
...                               read_dictionary=["animal"])
>>> parquet_file.read()
pyarrow.Table
n_legs: int64
animal: dictionary<values=string, indices=int32, ordered=0>
----
n_legs: [[2,2,4,4,5,100]]
animal: [  -- dictionary:
["Flamingo","Parrot",...,"Brittle stars","Centipede"]  -- indices:
[0,1,2,3,4,5]]
Methods
Attributes
iter_batches(batch_size=65536, row_groups=None, columns=None, use_threads=True, use_pandas_metadata=False)

Read streaming batches from a Parquet file.

Parameters

batch_size : int, default 65536
    Maximum number of records to yield per batch. Batches may be smaller if there aren't enough rows in the file.
row_groups : list
    Only these row groups will be read from the file.
columns : list
    If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. 'a' will select 'a.b', 'a.c', and 'a.d.e'.
use_threads : bool, default True
    Perform multi-threaded column reads.
use_pandas_metadata : bool, default False
    If True and file has custom pandas schema metadata, ensure that index columns are also loaded.

Yields

pyarrow.RecordBatch
    Contents of each batch as a record batch.
Examples
Generate an example Parquet file:
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')
>>> for i in parquet_file.iter_batches():
...     print("RecordBatch")
...     print(i.to_pandas())
...
RecordBatch
   n_legs         animal
0       2       Flamingo
1       2         Parrot
2       4            Dog
3       4          Horse
4       5  Brittle stars
5     100      Centipede
metadata

Return the Parquet metadata.
num_row_groups

Return the number of row groups of the Parquet file.
Examples
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')
>>> parquet_file.num_row_groups
1
read(columns=None, use_threads=True, use_pandas_metadata=False)

Read a Table from Parquet format.

Parameters

columns : list
    If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. 'a' will select 'a.b', 'a.c', and 'a.d.e'.
use_threads : bool, default True
    Perform multi-threaded column reads.
use_pandas_metadata : bool, default False
    If True and file has custom pandas schema metadata, ensure that index columns are also loaded.

Returns

pyarrow.Table
    Content of the file as a table (of columns).
Examples
Generate an example Parquet file:
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')
Read a Table:
>>> parquet_file.read(columns=["animal"])
pyarrow.Table
animal: string
----
animal: [["Flamingo","Parrot",...,"Brittle stars","Centipede"]]
read_row_group(i, columns=None, use_threads=True, use_pandas_metadata=False)

Read a single row group from a Parquet file.

Parameters

i : int
    Index of the individual row group that we want to read.
columns : list
    If not None, only these columns will be read from the row group. A column name may be a prefix of a nested field, e.g. 'a' will select 'a.b', 'a.c', and 'a.d.e'.
use_threads : bool, default True
    Perform multi-threaded column reads.
use_pandas_metadata : bool, default False
    If True and file has custom pandas schema metadata, ensure that index columns are also loaded.

Returns

pyarrow.Table
    Content of the row group as a table (of columns).
Examples
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')
>>> parquet_file.read_row_group(0)
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[2,2,4,4,5,100]]
animal: [["Flamingo","Parrot",...,"Brittle stars","Centipede"]]
read_row_groups(row_groups, columns=None, use_threads=True, use_pandas_metadata=False)

Read multiple row groups from a Parquet file.

Parameters

row_groups : list
    Only these row groups will be read from the file.
columns : list
    If not None, only these columns will be read from the row group. A column name may be a prefix of a nested field, e.g. 'a' will select 'a.b', 'a.c', and 'a.d.e'.
use_threads : bool, default True
    Perform multi-threaded column reads.
use_pandas_metadata : bool, default False
    If True and file has custom pandas schema metadata, ensure that index columns are also loaded.

Returns

pyarrow.Table
    Content of the row groups as a table (of columns).
Examples
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')
>>> parquet_file.read_row_groups([0,0])
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[2,2,4,4,5,...,2,4,4,5,100]]
animal: [["Flamingo","Parrot","Dog",...,"Brittle stars","Centipede"]]
scan_contents(columns=None, batch_size=65536)

Read contents of file for the given columns and batch size.

Parameters

columns : list of integers, default None
    Select columns to read; if None, scan all columns.
batch_size : int, default 65536
    Number of rows to read at a time internally.

Returns

num_rows : int
    Number of rows in file.
Notes
This function's primary purpose is benchmarking. The scan is executed on a single thread.
Examples
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')
>>> parquet_file.scan_contents()
6
schema

Return the Parquet schema, unconverted to Arrow types.
schema_arrow

Return the inferred Arrow schema, converted from the whole Parquet file's schema.
Examples
Generate an example Parquet file:
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet')
>>> parquet_file = pq.ParquetFile('example.parquet')
Read the Arrow schema:
>>> parquet_file.schema_arrow
n_legs: int64
animal: string