class pyarrow.parquet.ParquetDataset

Bases: object

Encapsulates details of reading a complete Parquet dataset, possibly consisting of multiple files and partitions in subdirectories.
Parameters:

path_or_paths : str or List[str]
    A directory name, single file name, or list of file names.
filesystem : FileSystem, default None
    If nothing passed, will be inferred based on path. The path will first be looked up in the local on-disk filesystem; otherwise it will be parsed as a URI to determine the filesystem.
schema : pyarrow.parquet.Schema
    Optionally provide the Schema for the Dataset, in which case it will not be inferred from the source.
filters : pyarrow.compute.Expression or List[Tuple] or List[List[Tuple]], default None
    Rows which do not match the filter predicate will be removed from scanned data. Partition keys embedded in a nested directory structure will be exploited to avoid loading files at all if they contain no matching rows. Within-file level filtering and different partitioning schemes are supported.

    Predicates are expressed using an Expression or using the disjunctive normal form (DNF), like [[('x', '=', 0), ...], ...]. DNF allows arbitrary boolean logical combinations of single column predicates. The innermost tuples each describe a single column predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective, multiple-column predicate. Finally, the outermost list combines these filters as a disjunction (OR).

    Predicates may also be passed as List[Tuple]. This form is interpreted as a single conjunction. To express OR in predicates, one must use the (preferred) List[List[Tuple]] notation.
    Each tuple has format: (key, op, value) and compares the key with the value. The supported op are: '=' or '==', '!=', '<', '>', '<=', '>=', 'in' and 'not in'. If the op is 'in' or 'not in', the value must be a collection such as a list, a set or a tuple.
    Examples:

    Using the Expression API:

        import pyarrow.compute as pc

        pc.field('x') == 0
        pc.field('y').isin(['a', 'b', 'c'])
        ~pc.field('y').isin({'a', 'b'})

    Using the DNF format:

        ('x', '=', 0)
        ('y', 'in', ['a', 'b', 'c'])
        ('z', 'not in', {'a', 'b'})
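To make the AND/OR semantics above concrete, here is a plain-Python sketch (not part of pyarrow; the `matches` helper and `OPS` table are hypothetical) of how a List[List[Tuple]] filter is interpreted against a single row: inner lists are ANDed, the outer list is ORed.

```python
from operator import eq, ne, lt, gt, le, ge

# Map the documented op strings to comparison functions.
OPS = {
    '=': eq, '==': eq, '!=': ne, '<': lt, '>': gt, '<=': le, '>=': ge,
    'in': lambda value, coll: value in coll,
    'not in': lambda value, coll: value not in coll,
}

def matches(row, dnf):
    """Evaluate a DNF filter (List[List[Tuple]]) against a dict row:
    any() over the outer list (OR), all() over each inner list (AND)."""
    return any(
        all(OPS[op](row[key], value) for key, op, value in conjunction)
        for conjunction in dnf
    )

# (n_legs == 4) OR (animal in {'Flamingo', 'Parrot'})
dnf = [[('n_legs', '=', 4)], [('animal', 'in', {'Flamingo', 'Parrot'})]]
matches({'n_legs': 4, 'animal': 'Dog'}, dnf)             # True
matches({'n_legs': 2, 'animal': 'Parrot'}, dnf)          # True
matches({'n_legs': 5, 'animal': 'Brittle stars'}, dnf)   # False
```

A List[Tuple] filter such as [('n_legs', '>', 2), ('n_legs', '<=', 4)] corresponds to wrapping it in one inner list, i.e. a single conjunction.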
read_dictionary : list, default None
    List of names or column paths (for nested types) to read directly as DictionaryArray. Only supported for BYTE_ARRAY storage. To read a flat column as dictionary-encoded pass the column name. For nested types, you must pass the full column "path", which could be something like level1.level2.list.item. Refer to the Parquet file's schema to obtain the paths.
binary_type : pyarrow.DataType, default None
    If given, Parquet binary columns will be read as this datatype. This setting is ignored if a serialized Arrow schema is found in the Parquet metadata.
list_type : subclass of pyarrow.DataType, default None
    If given, non-MAP repeated columns will be read as an instance of this datatype (either pyarrow.ListType or pyarrow.LargeListType). This setting is ignored if a serialized Arrow schema is found in the Parquet metadata.
memory_map : bool, default False
    If the source is a file path, use a memory map to read the file, which can improve performance in some environments.
buffer_size : int, default 0
    If positive, perform read buffering when deserializing individual column chunks. Otherwise IO calls are unbuffered.
partitioning : pyarrow.dataset.Partitioning or str or list of str, default "hive"
    The partitioning scheme for a partitioned dataset. The default of "hive" assumes directory names with key=value pairs like "/year=2009/month=11". In addition, a scheme like "/2009/11" is also supported, in which case you need to specify the field names or a full schema. See the pyarrow.dataset.partitioning() function for more details.
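The "hive" convention described above can be illustrated with a small plain-Python sketch (the `parse_hive_path` helper is hypothetical, not part of pyarrow): partition keys are read straight out of key=value directory segments.

```python
def parse_hive_path(path):
    """Extract hive-style partition keys from a file path.
    Segments without '=' (e.g. the file name) carry no partition info."""
    keys = {}
    for segment in path.strip('/').split('/'):
        if '=' in segment:
            key, _, value = segment.partition('=')
            keys[key] = value
    return keys

parse_hive_path('/year=2009/month=11/part-0.parquet')
# {'year': '2009', 'month': '11'}
```

A scheme like "/2009/11" carries the same values but no key names in the path, which is why pyarrow then needs the field names or a schema to be supplied explicitly.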
ignore_prefixes : list, optional
    Files matching any of these prefixes will be ignored by the discovery process. This is matched to the basename of a path. By default this is ['.', '_']. Note that discovery happens only if a directory is passed as source.
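A minimal sketch of the basename-prefix matching described above (the `keep_path` helper is an illustration, not pyarrow's actual discovery code); with the default prefixes, sidecar files such as `_metadata` or `.DS_Store` are skipped.

```python
import os

def keep_path(path, ignore_prefixes=('.', '_')):
    """Return True if the file would survive discovery:
    the prefix check applies to the basename only, not the full path."""
    basename = os.path.basename(path)
    return not any(basename.startswith(p) for p in ignore_prefixes)

keep_path('dataset/year=2021/part-0.parquet')  # True
keep_path('dataset/_metadata')                 # False
keep_path('dataset/.DS_Store')                 # False
```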
pre_buffer : bool, default True
    Coalesce and issue file reads in parallel to improve performance on high-latency filesystems (e.g. S3, GCS). If True, Arrow will use a background I/O thread pool. If using a filesystem layer that itself performs readahead (e.g. fsspec's S3FS), disable readahead for best results. Set to False if you want to prioritize minimal memory usage over maximum speed.
coerce_int96_timestamp_unit : str, default None
    Cast timestamps that are stored in INT96 format to a particular resolution (e.g. 'ms'). Setting to None is equivalent to 'ns' and therefore INT96 timestamps will be inferred as timestamps in nanoseconds.
decryption_properties : FileDecryptionProperties or None
    File-level decryption properties. The decryption properties can be created using CryptoFactory.file_decryption_properties().
thrift_string_size_limit : int, default None
    If not None, override the maximum total string size allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.
thrift_container_size_limit : int, default None
    If not None, override the maximum total size of containers allocated when decoding Thrift structures. The default limit should be sufficient for most Parquet files.
page_checksum_verification : bool, default False
    If True, verify the page checksum for each page read from the file.
arrow_extensions_enabled : bool, default True
    If True, read Parquet logical types as Arrow extension types where possible (e.g., read JSON as the canonical arrow.json extension type or UUID as the canonical arrow.uuid extension type).
Examples
Generate an example PyArrow Table and write it to a partitioned dataset:
>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_to_dataset(table, root_path='dataset_v2',
...                     partition_cols=['year'])
create a ParquetDataset object from the dataset source:
>>> dataset = pq.ParquetDataset('dataset_v2/')
and read the data:
>>> dataset.read().to_pandas()
   n_legs         animal  year
0       5  Brittle stars  2019
1       2       Flamingo  2020
2       4            Dog  2021
3     100      Centipede  2021
4       2         Parrot  2022
5       4          Horse  2022
create a ParquetDataset object with filter:
>>> dataset = pq.ParquetDataset('dataset_v2/',
...                             filters=[('n_legs', '=', 4)])
>>> dataset.read().to_pandas()
   n_legs animal  year
0       4    Dog  2021
1       4  Horse  2022
files

A list of absolute Parquet file paths in the Dataset source.
Examples
Generate an example dataset:
>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_to_dataset(table, root_path='dataset_v2_files',
...                     partition_cols=['year'])
>>> dataset = pq.ParquetDataset('dataset_v2_files/')
List the files:
>>> dataset.files
['dataset_v2_files/year=2019/...-0.parquet', ...
filesystem

The filesystem type of the Dataset source.
fragments

A list of the Dataset source fragments or pieces with absolute file paths.
Examples
Generate an example dataset:
>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_to_dataset(table, root_path='dataset_v2_fragments',
...                     partition_cols=['year'])
>>> dataset = pq.ParquetDataset('dataset_v2_fragments/')
List the fragments:
>>> dataset.fragments
[<pyarrow.dataset.ParquetFileFragment path=dataset_v2_fragments/...
partitioning

The partitioning of the Dataset source, if discovered.
read(columns=None, use_threads=True, use_pandas_metadata=False)

Read (multiple) Parquet files as a single pyarrow.Table.

Parameters:

columns : List[str]
    Names of columns to read from the dataset. The partition fields are not automatically included.
use_threads : bool, default True
    Perform multi-threaded column reads.
use_pandas_metadata : bool, default False
    If True and file has custom pandas schema metadata, ensure that index columns are also loaded.

Returns:

pyarrow.Table
    Content of the file as a table (of columns).
Examples
Generate an example dataset:
>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_to_dataset(table, root_path='dataset_v2_read',
...                     partition_cols=['year'])
>>> dataset = pq.ParquetDataset('dataset_v2_read/')
Read the dataset:
>>> dataset.read(columns=["n_legs"])
pyarrow.Table
n_legs: int64
----
n_legs: [[5],[2],[4,100],[2,4]]
read_pandas(**kwargs)

Read dataset including pandas metadata, if any. Other arguments are passed through to read(); see its docstring for further details.

Parameters:

**kwargs
    Additional options for read()
Examples
Generate an example parquet file:
>>> import pyarrow as pa
>>> import pandas as pd
>>> df = pd.DataFrame({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                    'n_legs': [2, 2, 4, 4, 5, 100],
...                    'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                               "Brittle stars", "Centipede"]})
>>> table = pa.Table.from_pandas(df)
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'table_V2.parquet')
>>> dataset = pq.ParquetDataset('table_V2.parquet')
Read the dataset with pandas metadata:
>>> dataset.read_pandas(columns=["n_legs"])
pyarrow.Table
n_legs: int64
----
n_legs: [[2,2,4,4,5,100]]
>>> dataset.read_pandas(columns=["n_legs"]).schema.pandas_metadata
{'index_columns': [{'kind': 'range', 'name': None, 'start': 0, ...}
schema

Schema of the Dataset.
Examples
Generate an example dataset:
>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> import pyarrow.parquet as pq
>>> pq.write_to_dataset(table, root_path='dataset_v2_schema',
...                     partition_cols=['year'])
>>> dataset = pq.ParquetDataset('dataset_v2_schema/')
Read the schema:
>>> dataset.schema
n_legs: int64
animal: string
year: dictionary<values=int32, indices=int32, ordered=0>