Bases: _Weakrefable
A materialized scan operation with context and options bound.
A scanner is the class that glues the scan tasks, data fragments and data sources together.
Methods
Attributes
Count rows matching the scanner filter.
int
The schema with which batches will be read from fragments.
Create a Scanner from an iterator of batches.
This creates a scanner which can be used only once. It is intended to support writing a dataset (which takes a scanner) from a source which can be read only once (e.g. a RecordBatchReader or generator).
stream
object
The iterator of Batches. This can be a pyarrow RecordBatchReader, any object that implements the Arrow PyCapsule Protocol for streams, or an actual Python iterator of RecordBatches.
Schema
The schema of the batches (required when passing a Python iterator).
list[str] or dict[str, Expression], default None
The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names do not exist in the dataset's Schema.
Expression, default None
Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise, the loaded RecordBatches are filtered before they are yielded.
int, default 131_072
The maximum row count for scanned record batches. If scanned record batches are overflowing memory, lower this value to reduce their size.
int, default 16
The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
int, default 4
The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
FragmentScanOptions, default None
Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
bool, default True
If enabled, maximum parallelism is used, as determined by the number of available CPU cores.
bool, default True
If enabled, metadata may be cached when scanning to speed up repeated scans.
MemoryPool, default None
For memory allocations, if required. If not specified, uses the default pool.
Create a Scanner from a Dataset.
Dataset
Dataset to scan.
list[str] or dict[str, Expression], default None
The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names do not exist in the dataset's Schema.
Expression, default None
Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise, the loaded RecordBatches are filtered before they are yielded.
int, default 131_072
The maximum row count for scanned record batches. If scanned record batches are overflowing memory, lower this value to reduce their size.
int, default 16
The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
int, default 4
The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
FragmentScanOptions, default None
Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
bool, default True
If enabled, maximum parallelism is used, as determined by the number of available CPU cores.
bool, default True
If enabled, metadata may be cached when scanning to speed up repeated scans.
MemoryPool, default None
For memory allocations, if required. If not specified, uses the default pool.
Create a Scanner from a Fragment.
Fragment
Fragment to scan.
Schema, optional
The schema of the fragment.
list[str] or dict[str, Expression], default None
The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names do not exist in the dataset's Schema.
Expression, default None
Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise, the loaded RecordBatches are filtered before they are yielded.
int, default 131_072
The maximum row count for scanned record batches. If scanned record batches are overflowing memory, lower this value to reduce their size.
int, default 16
The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
int, default 4
The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
FragmentScanOptions, default None
Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
bool, default True
If enabled, maximum parallelism is used, as determined by the number of available CPU cores.
bool, default True
If enabled, metadata may be cached when scanning to speed up repeated scans.
MemoryPool, default None
For memory allocations, if required. If not specified, uses the default pool.
Load the first N rows of the dataset.
int
The number of rows to load.
Table
The materialized schema of the data, accounting for projections.
This is the schema of any data returned from the scanner.
Consume a Scanner in record batches with corresponding fragments.
TaggedRecordBatch
Select rows of data by index.
Will only consume as many batches of the underlying dataset as needed. Otherwise, this is equivalent to to_table().take(indices).
Array or array-like
Indices of rows to select in the dataset.
Table
Consume a Scanner in record batches.
RecordBatch
Consume this scanner as a RecordBatchReader.
Convert a Scanner into a Table.
Use this convenience utility with care. This will serially materialize the Scan result in memory before creating the Table.
Table