Bases: _Weakrefable
Options for reading CSV files.
use_threads : bool, optional (default True)
Whether to use multiple threads to accelerate reading
block_size : int, optional
How many bytes to process at a time from the input stream. This determines multi-threading granularity as well as the size of individual record batches or table chunks. The minimum valid value for block size is 1.
skip_rows : int, optional (default 0)
The number of rows to skip before the column names (if any) and the CSV data.
skip_rows_after_names : int, optional (default 0)
The number of rows to skip after the column names. This number can be larger than the number of rows in one block, and empty rows are counted. The order of application is as follows:
- skip_rows is applied (if non-zero);
- column names are read (unless column_names is set);
- skip_rows_after_names is applied (if non-zero).
column_names : list, optional
The column names of the target table. If empty, fall back on autogenerate_column_names.
autogenerate_column_names : bool, optional (default False)
Whether to autogenerate column names if column_names is empty. If true, column names will be of the form "f0", "f1", ... If false, column names will be read from the first CSV row after skip_rows.
encoding : str, optional (default "utf8")
The character encoding of the CSV data. Columns that cannot decode using this encoding can still be read as Binary.
Examples
Define example data:
>>> import io
>>> s = "1,2,3\nFlamingo,2,2022-03-01\nHorse,4,2022-03-02\nBrittle stars,5,2022-03-03\nCentipede,100,2022-03-04"
>>> print(s)
1,2,3
Flamingo,2,2022-03-01
Horse,4,2022-03-02
Brittle stars,5,2022-03-03
Centipede,100,2022-03-04
Ignore the first numbered row and substitute it with defined or autogenerated column names:
>>> from pyarrow import csv
>>> read_options = csv.ReadOptions(
...     column_names=["animals", "n_legs", "entry"],
...     skip_rows=1)
>>> csv.read_csv(io.BytesIO(s.encode()), read_options=read_options)
pyarrow.Table
animals: string
n_legs: int64
entry: date32[day]
----
animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
n_legs: [[2,4,5,100]]
entry: [[2022-03-01,2022-03-02,2022-03-03,2022-03-04]]

>>> read_options = csv.ReadOptions(autogenerate_column_names=True,
...                                skip_rows=1)
>>> csv.read_csv(io.BytesIO(s.encode()), read_options=read_options)
pyarrow.Table
f0: string
f1: int64
f2: date32[day]
----
f0: [["Flamingo","Horse","Brittle stars","Centipede"]]
f1: [[2,4,5,100]]
f2: [[2022-03-01,2022-03-02,2022-03-03,2022-03-04]]
Remove the first 2 rows of the data:
>>> read_options = csv.ReadOptions(skip_rows_after_names=2)
>>> csv.read_csv(io.BytesIO(s.encode()), read_options=read_options)
pyarrow.Table
1: string
2: int64
3: date32[day]
----
1: [["Brittle stars","Centipede"]]
2: [[5,100]]
3: [[2022-03-03,2022-03-04]]
Methods
Attributes
autogenerate_column_names
Whether to autogenerate column names if column_names is empty. If true, column names will be of the form "f0", "f1", ... If false, column names will be read from the first CSV row after skip_rows.
block_size
How many bytes to process at a time from the input stream. This determines multi-threading granularity as well as the size of individual record batches or table chunks.
column_names
The column names of the target table. If empty, fall back on autogenerate_column_names.
encoding: object
pyarrow.csv.ReadOptions
skip_rows
The number of rows to skip before the column names (if any) and the CSV data. See skip_rows_after_names for a description of how the two options interact.
skip_rows_after_names
The number of rows to skip after the column names. This number can be larger than the number of rows in one block, and empty rows are counted. The order of application is as follows:
- skip_rows is applied (if non-zero);
- column names are read (unless column_names is set);
- skip_rows_after_names is applied (if non-zero).
use_threads
Whether to use multiple threads to accelerate reading.