Session(
context: typing.Optional[bigframes._config.bigquery_options.BigQueryOptions] = None,
clients_provider: typing.Optional[bigframes.session.clients.ClientsProvider] = None,
)
Establishes a BigQuery connection to capture a group of job activities related to DataFrames.
Parameters Name Description context
bigframes._config.bigquery_options.BigQueryOptions
Configuration adjusting how to connect to BigQuery and related APIs. Note that some options are ignored if clients_provider
is set.
clients_provider
bigframes.session.clients.ClientsProvider
An object providing client library objects.
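For example, a session can be constructed with explicit BigQuery options; a minimal sketch (the project and location values are illustrative assumptions, and the import paths follow the signature above):
>>> from bigframes._config.bigquery_options import BigQueryOptions
>>> from bigframes.session import Session
>>> options = BigQueryOptions(project="my-project", location="US")  # hypothetical project and location
>>> session = Session(context=options)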
Properties
bqclient: API documentation for bqclient property.
bqconnectionclient: API documentation for bqconnectionclient property.
bqconnectionmanager: API documentation for bqconnectionmanager property.
bqstoragereadclient: API documentation for bqstoragereadclient property.
The sum of all bytes processed by BigQuery jobs using this session.
cloudfunctionsclient: API documentation for cloudfunctionsclient property.
objects: API documentation for objects property.
resourcemanagerclient: API documentation for resourcemanagerclient property.
session_id: API documentation for session_id property.
The sum of all slot time used by BigQuery jobs in this session.
Methods
__del__: Automatic cleanup of internal resources.
__enter__
__exit__
close: Delete resources that were created with this session's session_id. This includes BigQuery tables, remote functions, and cloud functions serving the remote functions.
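Because __enter__ and __exit__ are defined, a Session can be used as a context manager so that close is called automatically; a minimal sketch, assuming default credentials and a default project are available in the environment:
>>> from bigframes.session import Session
>>> with Session() as session:  # close() runs automatically on exit
...     df = session.read_gbq("bigquery-public-data.ml_datasets.penguins")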
from_glob_path
from_glob_path(
path: str, *, connection: Optional[str] = None, name: Optional[str] = None
) -> dataframe.DataFrame
Create a BigFrames DataFrame that contains a BigFrames Blob column from a global wildcard path. This operation creates a temporary BQ Object Table under the hood and requires bigquery.connections.delegate permission or BigQuery Connection Admin role. If you have an existing BQ Object Table, use read_gbq_object_table().
Note: BigFrames Blob is still experimental. It may not work as expected and is subject to change in the future. Parameters Name Description path
str
The wildcard global path, such as "gs://
connection
str or None, default None
Connection to connect with remote service. str of the format <PROJECT_NUMBER/PROJECT_ID>.
name
str
The column name of the Blob column.
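A minimal sketch of calling this method (the session helper, bucket path, connection, and column name are illustrative assumptions):
>>> import bigframes.pandas as bpd
>>> session = bpd.get_global_session()  # assumed helper for obtaining the current Session
>>> df = session.from_glob_path(
...     "gs://my-bucket/images/*.png",       # hypothetical wildcard path
...     connection="my-project.us.my-conn",  # hypothetical connection
...     name="image_blob",                   # hypothetical blob column name
... )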
read_csv
read_csv(
filepath_or_buffer: str | IO["bytes"],
*,
sep: Optional[str] = ",",
header: Optional[int] = 0,
names: Optional[
Union[MutableSequence[Any], np.ndarray[Any, Any], Tuple[Any, ...], range]
] = None,
index_col: Optional[
Union[
int,
str,
Sequence[Union[str, int]],
bigframes.enums.DefaultIndexKind,
Literal[False],
]
] = None,
usecols: Optional[
Union[
MutableSequence[str],
Tuple[str, ...],
Sequence[int],
pandas.Series,
pandas.Index,
np.ndarray[Any, Any],
Callable[[Any], bool],
]
] = None,
dtype: Optional[Dict] = None,
engine: Optional[
Literal["c", "python", "pyarrow", "python-fwf", "bigquery"]
] = None,
encoding: Optional[str] = None,
write_engine: constants.WriteEngineType = "default",
**kwargs
) -> dataframe.DataFrame
Loads data from a comma-separated values (csv) file into a DataFrame.
The CSV file data will be persisted as a temporary BigQuery table, which can be automatically recycled after the Session is closed.
Note: usingengine="bigquery"
will not guarantee the same ordering as the file. Instead, set a serialized index column as the index and sort by that in the resulting DataFrame. Only files stored on your local machine or in Google Cloud Storage are supported. Note: For non-bigquery engine, data is inlined in the query SQL if it is small enough (roughly 5MB or less in memory). Larger size data is loaded to a BigQuery table instead. Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> gcs_path = "gs://cloud-samples-data/bigquery/us-states/us-states.csv"
>>> df = bpd.read_csv(filepath_or_buffer=gcs_path)
>>> df.head(2)
name post_abbr
0 Alabama AL
1 Alaska AK
<BLANKLINE>
[2 rows x 2 columns]
Parameters Name Description filepath_or_buffer
str
A local or Google Cloud Storage (gs://) path with engine="bigquery"; otherwise passed to pandas.read_csv.
sep
Optional[str], default ","
the separator for fields in a CSV file. For the BigQuery engine, the separator can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF-8. Both engines support sep=" "
to specify tab character as separator. Default engine supports having any number of spaces as separator by specifying sep="\s+"
. Separators longer than 1 character are interpreted as regular expressions by the default engine. BigQuery engine only supports single character separators.
header
Optional[int], default 0
row number to use as the column names. - None
: Instructs autodetect that there are no headers and data should be read starting from the first row. - 0
: If using engine="bigquery"
, Autodetect tries to detect headers in the first row. If they are not detected, the row is read as data. Otherwise data is read starting from the second row. When using default engine, pandas assumes the first row contains column names unless the names
argument is specified. If names
is provided, then the first row is ignored, second row is read as data, and column names are inferred from names
. - N > 0
: If using engine="bigquery"
, Autodetect skips N rows and tries to detect headers in row N+1. If headers are not detected, row N+1 is just skipped. Otherwise row N+1 is used to extract column names for the detected schema. When using default engine, pandas will skip N rows and assumes row N+1 contains column names unless the names
argument is specified. If names
is provided, row N+1 will be ignored, row N+2 will be read as data, and column names are inferred from names
.
names
default None
a list of column names to use. If the file contains a header row and you want to pass this parameter, then header=0
should be passed as well so the first (header) row is ignored. Only to be used with default engine.
index_col
default None
column(s) to use as the row labels of the DataFrame, either given as string name or column index. index_col=False
can be used with the default engine only to enforce that the first column is not used as the index. Using column index instead of column name is only supported with the default engine. The BigQuery engine only supports having a single column name as the index_col
. Neither engine supports having a multi-column index.
usecols
default None
List of column names to use. The BigQuery engine only supports having a list of string column names. Column indices and callable functions are only supported with the default engine. Using the default engine, the column names in usecols
can be defined to correspond to column names provided with the names
parameter (ignoring the document's header row of column names). The order of the column indices/names in usecols
is ignored with the default engine. The order of the column names provided with the BigQuery engine will be consistent in the resulting dataframe. If using a callable function with the default engine, only column names that evaluate to True by the callable function will be in the resulting dataframe.
dtype
data type for data or columns
Data type for data or columns. Only to be used with default engine.
engine
Optional[Literal["c", "python", "pyarrow", "python-fwf", "bigquery"]], default None
Type of engine to use. If engine="bigquery"
is specified, then BigQuery's load API will be used. Otherwise, the engine will be passed to pandas.read_csv
.
encoding
Optional[str], default None
The character encoding of the data. The default encoding is UTF-8
for both engines. The default engine accepts a wide range of encodings. Refer to the Python documentation for a comprehensive list: https://docs.python.org/3/library/codecs.html#standard-encodings The BigQuery engine only supports UTF-8
and ISO-8859-1
.
write_engine
str
How data should be written to BigQuery (if at all). See bigframes.pandas.read_pandas
for a full description of supported values.
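For instance, a minimal sketch of loading the same sample file with the BigQuery engine and an explicit index column (the index_col choice is an illustrative assumption):
>>> import bigframes.pandas as bpd
>>> df = bpd.read_csv(
...     "gs://cloud-samples-data/bigquery/us-states/us-states.csv",
...     engine="bigquery",
...     index_col="name",  # a single column name, as required by the BigQuery engine
... )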
read_gbq
read_gbq(
query_or_table: str,
*,
index_col: Iterable[str] | str | bigframes.enums.DefaultIndexKind = (),
columns: Iterable[str] = (),
configuration: Optional[Dict] = None,
max_results: Optional[int] = None,
filters: third_party_pandas_gbq.FiltersType = (),
use_cache: Optional[bool] = None,
col_order: Iterable[str] = ()
) -> dataframe.DataFrame
Loads a DataFrame from BigQuery.
BigQuery tables are an unordered, unindexed data source. To add support for pandas compatibility, the following indexing options are supported via the index_col parameter:
(Empty iterable, default) A default index. Behavior may change. Explicitly set index_col if your application makes use of specific index values. If a table has primary key(s), those are used as the index; otherwise a sequential index is generated.
(bigframes.enums.DefaultIndexKind.SEQUENTIAL_INT64) Add an arbitrary sequential index and ordering. Warning: This uses an analytic windowed operation that prevents filtering push down. Avoid using on large clustered or partitioned tables.
Set the index_col argument to one or more columns. Unique values for the row labels are recommended. Duplicate labels are possible, but note that joins on a non-unique index can duplicate rows via pandas-like outer join behavior.
Select row_number() OVER (ORDER BY ...) AS rowindex in your SQL query and set index_col='rowindex' to preserve the desired ordering.
If your query doesn't have an ordering, select GENERATE_UUID() AS rowindex in your SQL and set index_col='rowindex' for the best performance.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
If the input is a table ID:
>>> df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
Read table path with wildcard suffix and filters:
>>> df = bpd.read_gbq_table("bigquery-public-data.noaa_gsod.gsod19*", filters=[("_table_suffix", ">=", "30"), ("_table_suffix", "<=", "39")])
Preserve ordering in a query input.
>>> df = bpd.read_gbq('''
... SELECT
... -- Instead of an ORDER BY clause on the query, use
... -- ROW_NUMBER() to create an ordered DataFrame.
... ROW_NUMBER() OVER (ORDER BY AVG(pitchSpeed) DESC)
... AS rowindex,
...
... pitcherFirstName,
... pitcherLastName,
... AVG(pitchSpeed) AS averagePitchSpeed
... FROM `bigquery-public-data.baseball.games_wide`
... WHERE year = 2016
... GROUP BY pitcherFirstName, pitcherLastName
... ''', index_col="rowindex")
>>> df.head(2)
pitcherFirstName pitcherLastName averagePitchSpeed
rowindex
1 Albertin Chapman 96.514113
2 Zachary Britton 94.591039
<BLANKLINE>
[2 rows x 3 columns]
Reading data with columns
and filters
parameters:
>>> columns = ['pitcherFirstName', 'pitcherLastName', 'year', 'pitchSpeed']
>>> filters = [('year', '==', 2016), ('pitcherFirstName', 'in', ['John', 'Doe']), ('pitcherLastName', 'in', ['Gant']), ('pitchSpeed', '>', 94)]
>>> df = bpd.read_gbq(
... "bigquery-public-data.baseball.games_wide",
... columns=columns,
... filters=filters,
... )
>>> df.head(1)
pitcherFirstName pitcherLastName year pitchSpeed
0 John Gant 2016 95
<BLANKLINE>
[1 rows x 4 columns]
Parameters Name Description query_or_table
str
A SQL string to be executed or a BigQuery table to be read. The table must be specified in the format of project.dataset.tablename
or dataset.tablename
. Can also take a wildcard table name, such as project.dataset.table_prefix*
. In that case, all the matched tables will be read as one DataFrame.
index_col
Iterable[str], str, bigframes.enums.DefaultIndexKind
Name of result column(s) to use for index in results DataFrame. If an empty iterable, such as ()
, a default index is generated. Do not depend on specific index values in this case. New in bigframes version 1.3.0: If index_cols
is not set, the primary key(s) of the table are used as the index. New in bigframes version 1.4.0: Support bigframes.enums.DefaultIndexKind
to override default index behavior.
columns
Iterable[str]
List of BigQuery column names in the desired order for results DataFrame.
configuration
dict, optional
Query config parameters for job processing. For example: configuration = {'query': {'useQueryCache': False}}. For more information see BigQuery REST API Reference https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query.
max_results
Optional[int], default None
If set, limit the maximum number of rows to fetch from the query results.
filters
Union[Iterable[FilterType], Iterable[Iterable[FilterType]]], default ()
To filter out data. Filter syntax: [[(column, op, val), …],…] where op is [==, >, >=, <, <=, !=, in, not in, LIKE]. The innermost tuples are transposed into a set of filters applied through an AND operation. The outer Iterable combines these sets of filters through an OR operation. A single Iterable of tuples can also be used, meaning that no OR operation between set of filters is to be conducted. If using wildcard table suffix in query_or_table, can specify '_table_suffix' pseudo column to filter the tables to be read into the DataFrame.
use_cache
Optional[bool], default None
Caches query results if set to True
. When None
, it behaves as True
, but should not be combined with useQueryCache
in configuration
to avoid conflicts.
col_order
Iterable[str]
Alias for columns, retained for backwards compatibility.
Exceptions Type Description
bigframes.exceptions.DefaultIndexWarning: Using the default index is discouraged, such as with clustered or partitioned tables without primary keys.
ValueError: When both columns and col_order are specified.
ValueError: If configuration is specified when directly reading from a table.
read_gbq_function
read_gbq_function(function_name: str, is_row_processor: bool = False)
Loads a BigQuery function from BigQuery.
Then it can be applied to a DataFrame or Series.
Note: The return type of the function must be explicitly specified in the function's original definition even if not otherwise required.
BigQuery Utils provides many public functions under the bqutil project on Google Cloud Platform (See: https://github.com/GoogleCloudPlatform/bigquery-utils/tree/master/udfs#using-the-udfs). You can check out Community UDFs to use community-contributed functions (See: https://github.com/GoogleCloudPlatform/bigquery-utils/tree/master/udfs/community#community-udfs).
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
Use the cw_lower_case_ascii_only function from Community UDFs.
>>> func = bpd.read_gbq_function("bqutil.fn.cw_lower_case_ascii_only")
You can run it on scalar input. Usually you would do so to verify that it works as expected before applying to all values in a Series.
>>> func('AURÉLIE')
'aurÉlie'
You can apply it to a BigQuery DataFrames Series.
>>> df = bpd.DataFrame({'id': [1, 2, 3], 'name': ['AURÉLIE', 'CÉLESTINE', 'DAPHNÉ']})
>>> df
id name
0 1 AURÉLIE
1 2 CÉLESTINE
2 3 DAPHNÉ
<BLANKLINE>
[3 rows x 2 columns]
>>> df1 = df.assign(new_name=df['name'].apply(func))
>>> df1
id name new_name
0 1 AURÉLIE aurÉlie
1 2 CÉLESTINE cÉlestine
2 3 DAPHNÉ daphnÉ
<BLANKLINE>
[3 rows x 3 columns]
You can even use a function with multiple inputs. For example, cw_regexp_replace_5 from Community UDFs.
>>> func = bpd.read_gbq_function("bqutil.fn.cw_regexp_replace_5")
>>> func('TestStr123456', 'Str', 'Cad$', 1, 1)
'TestCad$123456'
>>> df = bpd.DataFrame({
... "haystack" : ["TestStr123456", "TestStr123456Str", "TestStr123456Str"],
... "regexp" : ["Str", "Str", "Str"],
... "replacement" : ["Cad$", "Cad$", "Cad$"],
... "offset" : [1, 1, 1],
... "occurrence" : [1, 2, 1]
... })
>>> df
haystack regexp replacement offset occurrence
0 TestStr123456 Str Cad$ 1 1
1 TestStr123456Str Str Cad$ 1 2
2 TestStr123456Str Str Cad$ 1 1
<BLANKLINE>
[3 rows x 5 columns]
>>> df.apply(func, axis=1)
0 TestCad$123456
1 TestStr123456Cad$
2 TestCad$123456Str
dtype: string
Another use case is to define your own remote function and use it later. For example, define the remote function:
>>> @bpd.remote_function()
... def tenfold(num: int) -> float:
... return num * 10
Then, read back the deployed BQ remote function:
>>> tenfold_ref = bpd.read_gbq_function(
... tenfold.bigframes_remote_function,
... )
>>> df = bpd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
>>> df
a b c
0 1 3 5
1 2 4 6
<BLANKLINE>
[2 rows x 3 columns]
>>> df['a'].apply(tenfold_ref)
0 10.0
1 20.0
Name: a, dtype: Float64
It also supports row processing by using is_row_processor=True
. Please note, row processor implies that the function has only one input parameter.
>>> @bpd.remote_function()
... def row_sum(s: bpd.Series) -> float:
... return s['a'] + s['b'] + s['c']
>>> row_sum_ref = bpd.read_gbq_function(
... row_sum.bigframes_remote_function,
... is_row_processor=True,
... )
>>> df = bpd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
>>> df
a b c
0 1 3 5
1 2 4 6
<BLANKLINE>
[2 rows x 3 columns]
>>> df.apply(row_sum_ref, axis=1)
0 9.0
1 12.0
dtype: Float64
Parameters Name Description function_name
str
The function's name in BigQuery in the format project_id.dataset_id.function_name
, or dataset_id.function_name
to load from the default project, or function_name
to load from the default project and the dataset associated with the current session.
is_row_processor
bool, default False
Whether the function is a row processor. This is set to True for a function which receives an entire row of a DataFrame as a pandas Series.
Returns Type Description collections.abc.Callable
A function object pointing to the BigQuery function read from BigQuery. The object is similar to the one created by the remote_function decorator, including the bigframes_remote_function property, but not including the bigframes_cloud_function property.
read_gbq_model
read_gbq_model(model_name: str)
Loads a BigQuery ML model from BigQuery.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
Read an existing BigQuery ML model.
>>> model_name = "bigframes-dev.bqml_tutorial.penguins_model"
>>> model = bpd.read_gbq_model(model_name)
Parameters Name Description model_name
str
the model's name in BigQuery in the format project_id.dataset_id.model_id
, or just dataset_id.model_id
to load from the default project.
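The returned object can then be used like other BigQuery DataFrames ML estimators; a minimal sketch, assuming the loaded model exposes a predict method and reusing the model and penguins table from the examples above:
>>> df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
>>> predictions = model.predict(df)  # assumed estimator interface on the loaded model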
read_gbq_object_table
read_gbq_object_table(
object_table: str, *, name: Optional[str] = None
) -> dataframe.DataFrame
Read an existing object table to create a BigFrames Blob DataFrame. Use the connection of the object table for the connection of the blob. This function doesn't retrieve the object table data. If you want to read the data, use read_gbq() instead.
Note: BigFrames Blob is still experimental. It may not work as expected and is subject to change in the future. Parameters Name Description object_table
str
Name of the object table, of the form <PROJECT_ID>.<DATASET_ID>.<TABLE_ID>.
name
str or None
the returned blob column name.
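A minimal sketch of reading an existing object table (the table name, column name, and session helper are illustrative assumptions):
>>> import bigframes.pandas as bpd
>>> session = bpd.get_global_session()  # assumed helper for obtaining the current Session
>>> df = session.read_gbq_object_table(
...     "my-project.my_dataset.my_object_table",  # hypothetical object table
...     name="blob_col",                          # hypothetical blob column name
... )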
read_gbq_query
read_gbq_query(
query: str,
*,
index_col: Iterable[str] | str | bigframes.enums.DefaultIndexKind = (),
columns: Iterable[str] = (),
configuration: Optional[Dict] = None,
max_results: Optional[int] = None,
use_cache: Optional[bool] = None,
col_order: Iterable[str] = (),
filters: third_party_pandas_gbq.FiltersType = ()
) -> dataframe.DataFrame
Turn a SQL query into a DataFrame.
Note: Because the results are written to a temporary table, ordering by ORDER BY
is not preserved. A unique index_col
is recommended. Use row_number() over ()
if there is no natural unique index or you want to preserve ordering.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
Simple query input:
>>> df = bpd.read_gbq_query('''
... SELECT
... pitcherFirstName,
... pitcherLastName,
... pitchSpeed,
... FROM `bigquery-public-data.baseball.games_wide`
... ''')
Preserve ordering in a query input.
>>> df = bpd.read_gbq_query('''
... SELECT
... -- Instead of an ORDER BY clause on the query, use
... -- ROW_NUMBER() to create an ordered DataFrame.
... ROW_NUMBER() OVER (ORDER BY AVG(pitchSpeed) DESC)
... AS rowindex,
...
... pitcherFirstName,
... pitcherLastName,
... AVG(pitchSpeed) AS averagePitchSpeed
... FROM `bigquery-public-data.baseball.games_wide`
... WHERE year = 2016
... GROUP BY pitcherFirstName, pitcherLastName
... ''', index_col="rowindex")
>>> df.head(2)
pitcherFirstName pitcherLastName averagePitchSpeed
rowindex
1 Albertin Chapman 96.514113
2 Zachary Britton 94.591039
<BLANKLINE>
[2 rows x 3 columns]
See also: Session.read_gbq.
Exceptions Type Description
ValueError: When both columns and col_order are specified.
read_gbq_table
read_gbq_table(
query: str,
*,
index_col: Iterable[str] | str | bigframes.enums.DefaultIndexKind = (),
columns: Iterable[str] = (),
max_results: Optional[int] = None,
filters: third_party_pandas_gbq.FiltersType = (),
use_cache: bool = True,
col_order: Iterable[str] = ()
) -> dataframe.DataFrame
Turn a BigQuery table into a DataFrame.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
Read a whole table, with arbitrary ordering or ordering corresponding to the primary key(s).
>>> df = bpd.read_gbq_table("bigquery-public-data.ml_datasets.penguins")
See also: Session.read_gbq.
Exceptions Type Description
ValueError: When both columns and col_order are specified.
read_gbq_table_streaming
read_gbq_table_streaming(table: str) -> streaming_dataframe.StreamingDataFrame
Turn a BigQuery table into a StreamingDataFrame.
Note: The bigframes.streaming module is a preview feature, and subject to change.
Examples:
import bigframes.streaming as bst
import bigframes.pandas as bpd
bpd.options.display.progress_bar = None
sdf = bst.read_gbq_table("bigquery-public-data.ml_datasets.penguins")
read_json
read_json(
path_or_buf: str | IO["bytes"],
*,
orient: Literal[
"split", "records", "index", "columns", "values", "table"
] = "columns",
dtype: Optional[Dict] = None,
encoding: Optional[str] = None,
lines: bool = False,
engine: Literal["ujson", "pyarrow", "bigquery"] = "ujson",
write_engine: constants.WriteEngineType = "default",
**kwargs
) -> dataframe.DataFrame
Convert a JSON string to DataFrame object.
Note: usingengine="bigquery"
will not guarantee the same ordering as the file. Instead, set a serialized index column as the index and sort by that in the resulting DataFrame. Note: For non-bigquery engine, data is inlined in the query SQL if it is small enough (roughly 5MB or less in memory). Larger size data is loaded to a BigQuery table instead. Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> gcs_path = "gs://bigframes-dev-testing/sample1.json"
>>> df = bpd.read_json(path_or_buf=gcs_path, lines=True, orient="records")
>>> df.head(2)
id name
0 1 Alice
1 2 Bob
<BLANKLINE>
[2 rows x 2 columns]
Parameters Name Description path_or_buf
a valid JSON str, path object or file-like object
A local or Google Cloud Storage (gs://
) path with engine="bigquery"
otherwise passed to pandas.read_json.
orient
str, optional
If engine="bigquery"
orient only supports "records". Indication of expected JSON string format. Compatible JSON strings can be produced by to_json()
with a corresponding orient value. The set of possible orients is: - 'split'
: dict like {index -> [index], columns -> [columns], data -> [values]}
- 'records'
: list like [{column -> value}, ... , {column -> value}]
- 'index'
: dict like {index -> {column -> value}}
- 'columns'
: dict like {column -> {index -> value}}
- 'values'
: just the values array
dtype
bool or dict, default None
If True, infer dtypes; if a dict of column to dtype, then use those; if False, then don't infer dtypes at all, applies only to the data. For all orient
values except 'table'
, default is True.
encoding
str, default is 'utf-8'
The encoding to use to decode py3 bytes.
lines
bool, default False
Read the file as a json object per line. If using engine="bigquery"
lines only supports True.
engine
{{"ujson", "pyarrow", "bigquery"}}, default "ujson"
Type of engine to use. If engine="bigquery"
is specified, then BigQuery's load API will be used. Otherwise, the engine will be passed to pandas.read_json
.
write_engine
str
How data should be written to BigQuery (if at all). See bigframes.pandas.read_pandas
for a full description of supported values.
Exceptions Type Description
bigframes.exceptions.DefaultIndexWarning: Using the default index is discouraged, such as with clustered or partitioned tables without primary keys.
ValueError: lines is only valid when orient is records.
read_pandas
Loads a DataFrame from a pandas DataFrame.
The pandas DataFrame will be persisted as a temporary BigQuery table, which can be automatically recycled after the Session is closed.
Note: Data is inlined in the query SQL if it is small enough (roughly 5MB or less in memory). Larger data is loaded to a BigQuery table instead. Examples:
>>> import bigframes.pandas as bpd
>>> import pandas as pd
>>> bpd.options.display.progress_bar = None
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> pandas_df = pd.DataFrame(data=d)
>>> df = bpd.read_pandas(pandas_df)
>>> df
col1 col2
0 1 3
1 2 4
<BLANKLINE>
[2 rows x 2 columns]
Parameters Name Description pandas_dataframe
pandas.DataFrame, pandas.Series, or pandas.Index
a pandas DataFrame/Series/Index object to be loaded.
write_engine
str
How data should be written to BigQuery (if at all). Supported values: * "default": (Recommended) Select an appropriate mechanism to write data to BigQuery. Depends on data size and supported data types. * "bigquery_inline": Inline data in BigQuery SQL. Use this when you know the data is small enough to fit within BigQuery's 1 MB query text size limit. * "bigquery_load": Use a BigQuery load job. Use this for larger data sizes. * "bigquery_streaming": Use the BigQuery streaming JSON API. Use this if your workload is such that you exhaust the BigQuery load job quota and your data cannot be embedded in SQL due to size or data type limitations.
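For example, a larger pandas DataFrame could be loaded explicitly through a load job; a minimal sketch, continuing the imports from the example above:
>>> import pandas as pd
>>> import bigframes.pandas as bpd
>>> big_pandas_df = pd.DataFrame({"col1": range(100_000), "col2": range(100_000)})
>>> df = bpd.read_pandas(big_pandas_df, write_engine="bigquery_load")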
Exceptions Type Description ValueError
When the object is not a pandas DataFrame.
read_parquet
read_parquet(
path: str | IO["bytes"],
*,
engine: str = "auto",
write_engine: constants.WriteEngineType = "default"
) -> dataframe.DataFrame
Load a Parquet object from the file path (local or Cloud Storage), returning a DataFrame.
Note: This method will not guarantee the same ordering as the file. Instead, set a serialized index column as the index and sort by that in the resulting DataFrame. Note: For non-"bigquery" engines, data is inlined in the query SQL if it is small enough (roughly 5MB or less in memory). Larger data is loaded to a BigQuery table instead. Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> gcs_path = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"
>>> df = bpd.read_parquet(path=gcs_path, engine="bigquery")
Parameters Name Description path
str
Local or Cloud Storage path to Parquet file.
engine
str
One of 'auto', 'pyarrow', 'fastparquet'
, or 'bigquery'
. Parquet library to parse the file. If set to 'bigquery'
, order is not preserved. Default, 'auto'
.
read_pickle
read_pickle(
filepath_or_buffer: FilePath | ReadPickleBuffer,
compression: CompressionOptions = "infer",
storage_options: StorageOptions = None,
*,
write_engine: constants.WriteEngineType = "default"
)
Load pickled BigFrames object (or any object) from file.
Note: If the content of the pickle file is a Series and its name attribute is None, the name will be set to '0' by default. Note: Data is inlined in the query SQL if it is small enough (roughly 5MB or less in memory). Larger data is loaded to a BigQuery table instead. Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> gcs_path = "gs://bigframes-dev-testing/test_pickle.pkl"
>>> df = bpd.read_pickle(filepath_or_buffer=gcs_path)
Parameters Name Description filepath_or_buffer
str, path object, or file-like object
String, path object (implementing os.PathLike[str]), or file-like object implementing a binary readlines() function. Also accepts URL. URL is not limited to S3 and GCS.
compression
str or dict, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer' and 'filepath_or_buffer' is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2' (otherwise no compression). If using 'zip' or 'tar', the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary compression={'method': 'zstd', 'dict_data': my_compression_dict}.
storage_options
dict, default None
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.
write_engine
str
How data should be written to BigQuery (if at all). See bigframes.pandas.read_pandas
for a full description of supported values.
remote_function
remote_function(
input_types: typing.Union[None, type, typing.Sequence[type]] = None,
output_type: typing.Optional[type] = None,
dataset: typing.Optional[str] = None,
bigquery_connection: typing.Optional[str] = None,
reuse: bool = True,
name: typing.Optional[str] = None,
packages: typing.Optional[typing.Sequence[str]] = None,
cloud_function_service_account: typing.Optional[str] = None,
cloud_function_kms_key_name: typing.Optional[str] = None,
cloud_function_docker_repository: typing.Optional[str] = None,
max_batching_rows: typing.Optional[int] = 1000,
cloud_function_timeout: typing.Optional[int] = 600,
cloud_function_max_instances: typing.Optional[int] = None,
cloud_function_vpc_connector: typing.Optional[str] = None,
cloud_function_memory_mib: typing.Optional[int] = 1024,
cloud_function_ingress_settings: typing.Optional[
typing.Literal["all", "internal-only", "internal-and-gclb"]
] = None,
)
Decorator to turn a user defined function into a BigQuery remote function. Check out the code samples at: https://cloud.google.com/bigquery/docs/remote-functions#bigquery-dataframes.
Note: The input_types=Series scenario is in preview. It currently only supports dataframes with column types Int64/Float64/boolean/string/binary[pyarrow].
Warning: To use remote functions with BigFrames 2.0 and onwards, please (preferred) set an explicit user-managed cloud_function_service_account, or (discouraged) set cloud_function_service_account to "default" to use the Compute Engine service account. See https://cloud.google.com/functions/docs/securing/function-identity.
Note: Please make sure the following is set up before using this API: Have the below APIs enabled for your project:
This can be done from the cloud console (change PROJECT_ID
to yours): https://console.cloud.google.com/apis/enableflow?apiid=bigqueryconnection.googleapis.com,cloudfunctions.googleapis.com,run.googleapis.com,cloudbuild.googleapis.com,artifactregistry.googleapis.com,cloudresourcemanager.googleapis.com&project=PROJECT_ID
Or from the gcloud CLI:
$ gcloud services enable bigqueryconnection.googleapis.com cloudfunctions.googleapis.com run.googleapis.com cloudbuild.googleapis.com artifactregistry.googleapis.com cloudresourcemanager.googleapis.com
Have the following IAM roles enabled for you:
PROJECT_NUMBER-compute@developer.gserviceaccount.com
Either the user has setIamPolicy privilege on the project, or a BigQuery connection is pre-created with necessary IAM role set:
To set up IAM, follow https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions#grant_permission_on_function
Alternatively, the IAM could also be setup via the gcloud CLI:
$ gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:CONNECTION_SERVICE_ACCOUNT_ID" --role="roles/run.invoker"
.
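For example, a minimal sketch of defining and applying a remote function (the explicit service account setting follows the warning above; the actual value depends on your project setup):
>>> import bigframes.pandas as bpd
>>> @bpd.remote_function(
...     cloud_function_service_account="default",  # discouraged; prefer an explicit user-managed service account
... )
... def add_one(x: int) -> int:
...     return x + 1
>>> df = bpd.DataFrame({'a': [1, 2, 3]})
>>> result = df['a'].apply(add_one)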
Parameters Name Description input_types
type or sequence(type), Optional
For scalar user defined function it should be the input type or sequence of input types. The supported scalar input types are bool
, bytes
, float
, int
, str
. For row processing user defined function (i.e. functions that receive a single input representing a row in form of a Series), type Series
should be specified.
output_type
type, Optional
Data type of the output in the user defined function. If the user defined function returns an array, then list[type]
should be specified. The supported output types are bool
, bytes
, float
, int
, str
, list[bool]
, list[float]
, list[int]
and list[str]
.
dataset
str, Optional
Dataset in which to create a BigQuery remote function. It should be in <project_id>.<dataset_name>
or <dataset_name>
format. If this parameter is not provided then session dataset id is used.
bigquery_connection
str, Optional
Name of the BigQuery connection. You should either have the connection already created in the location
you have chosen, or you should have the Project IAM Admin role to enable the service to create the connection for you if you need it. If this parameter is not provided then the BigQuery connection from the session is used.
reuse
bool, Optional
Reuse the remote function if it already exists. True
by default, which will result in reusing an existing remote function and corresponding cloud function that was previously created (if any) for the same udf. Please note that for an unnamed (i.e. created without an explicit name
argument) remote function, the BigQuery DataFrames session id is attached in the cloud artifacts names. So for the effective reuse across the sessions it is recommended to create the remote function with an explicit name
. Setting it to False
would force creating a unique remote function. If the required remote function does not exist then it would be created irrespective of this param.
name
str, Optional
Explicit name of the persisted BigQuery remote function. Use it with caution, because more than one user working in the same project and dataset could overwrite each other's remote functions if they use the same persistent name. When an explicit name is provided, any session specific clean up ( bigframes.session.Session.close
/ bigframes.pandas.close_session
/ bigframes.pandas.reset_session
/ bigframes.pandas.clean_up_by_session_id
) does not clean up the function, and leaves it for the user to manage the function and the associated cloud function directly.
packages
str[], Optional
Explicit name of the external package dependencies. Each dependency is added to the requirements.txt
as is, and can be of the form supported in https://pip.pypa.io/en/stable/reference/requirements-file-format/.
cloud_function_service_account
str, Optional
Service account to use for the cloud functions. If not provided then the default service account would be used. See https://cloud.google.com/functions/docs/securing/function-identity for more details. Please make sure the service account has the necessary IAM permissions configured as described in https://cloud.google.com/functions/docs/reference/iam/roles#additional-configuration.
cloud_function_kms_key_name
str, Optional
Customer managed encryption key to protect cloud functions and related data at rest. This is of the format projects/PROJECT_ID/locations/LOCATION/keyRings/KEYRING/cryptoKeys/KEY. Read https://cloud.google.com/functions/docs/securing/cmek for more details including granting necessary service accounts access to the key.
cloud_function_docker_repository
str, Optional
Docker repository created with the same encryption key as cloud_function_kms_key_name
to store encrypted artifacts created to support the cloud function. This is of the format projects/PROJECT_ID/locations/LOCATION/repositories/REPOSITORY_NAME. For more details see https://cloud.google.com/functions/docs/securing/cmek#before_you_begin.
max_batching_rows
int, Optional
The maximum number of rows to be batched for processing in the BQ remote function. Default value is 1000. A lower number can be passed to avoid timeouts in case the user code is too complex to process large number of rows fast enough. A higher number can be used to increase throughput in case the user code is fast enough. None
can be passed to let BQ remote functions service apply default batching. See for more details https://cloud.google.com/bigquery/docs/remote-functions#limiting_number_of_rows_in_a_batch_request.
cloud_function_timeout
int, Optional
The maximum amount of time (in seconds) BigQuery should wait for the cloud function to return a response. See for more details https://cloud.google.com/functions/docs/configuring/timeout. Please note that even though the cloud function (2nd gen) itself allows setting up to 60 minutes of timeout, BigQuery remote function can wait only up to 20 minutes, see for more details https://cloud.google.com/bigquery/quotas#remote_function_limits. By default BigQuery DataFrames uses a 10 minute timeout. None
can be passed to let the cloud functions default timeout take effect.
cloud_function_max_instances
int, Optional
The maximum instance count for the cloud function created. This can be used to control how many cloud function instances can be active at any given point in time. A lower setting can help control the spike in the billing. A higher setting can help support processing larger scale data. When not specified, the cloud function's default setting applies. For more details see https://cloud.google.com/functions/docs/configuring/max-instances.
cloud_function_vpc_connector
str, Optional
The VPC connector you would like to configure for your cloud function. This is useful if your code needs access to data or service(s) that are on a VPC network. See for more details https://cloud.google.com/functions/docs/networking/connecting-vpc.
cloud_function_memory_mib
int, Optional
The amounts of memory (in mebibytes) to allocate for the cloud function (2nd gen) created. This also dictates a corresponding amount of allocated CPU for the function. By default a memory of 1024 MiB is set for the cloud functions created to support BigQuery DataFrames remote function. If you want to let the default memory of cloud functions be allocated, pass None
. See for more details https://cloud.google.com/functions/docs/configuring/memory.
cloud_function_ingress_settings
str, Optional
Ingress settings control what traffic can reach the function. Options are: all
, internal-only
, or internal-and-gclb
. If no setting is provided, all
will be used by default and a warning will be issued. See for more details https://cloud.google.com/functions/docs/networking/network-settings#ingress_settings.
Returns Type Description collections.abc.Callable
A remote function object pointing to the cloud assets created in the background to support the remote execution. The cloud assets can be located through the following properties set in the object: bigframes_cloud_function - the Google Cloud Function deployed for the user defined code; bigframes_remote_function - the BigQuery remote function capable of calling into bigframes_cloud_function.
udf
udf(
*,
input_types: typing.Union[None, type, typing.Sequence[type]] = None,
output_type: typing.Optional[type] = None,
dataset: typing.Optional[str] = None,
bigquery_connection: typing.Optional[str] = None,
name: typing.Optional[str] = None,
packages: typing.Optional[typing.Sequence[str]] = None
)
Decorator to turn a Python udf into a BigQuery managed function.
Note: Please have the following IAM roles enabled for you:
Parameters Name Description input_types
type or sequence(type), Optional
For scalar user defined function it should be the input type or sequence of input types. The supported scalar input types are bool
, bytes
, float
, int
, str
.
output_type
type, Optional
Data type of the output in the user defined function. If the user defined function returns an array, then list[type]
should be specified. The supported output types are bool
, bytes
, float
, int
, str
, list[bool]
, list[float]
, list[int]
and list[str]
.
dataset
str, Optional
Dataset in which to create a BigQuery managed function. It should be in <project_id>.<dataset_name>
or <dataset_name>
format. If this parameter is not provided then session dataset id is used.
bigquery_connection
str, Optional
Name of the BigQuery connection. You should either have the connection already created in the location
you have chosen, or you should have the Project IAM Admin role to enable the service to create the connection for you if you need it. If this parameter is not provided then the BigQuery connection from the session is used.
name
str, Optional
Explicit name of the persisted BigQuery managed function. Use it with caution, because more than one user working in the same project and dataset could overwrite each other's managed functions if they use the same persistent name. When an explicit name is provided, any session specific clean up ( bigframes.session.Session.close
/ bigframes.pandas.close_session
/ bigframes.pandas.reset_session
/ bigframes.pandas.clean_up_by_session_id
) does not clean up the function, and leaves it for the user to manage the function and the associated cloud function directly.
packages
str[], Optional
Explicit name of the external package dependencies. Each dependency is added to the requirements.txt
as is, and can be of the form supported in https://pip.pypa.io/en/stable/reference/requirements-file-format/.
Returns Type Description collections.abc.Callable
A managed function object pointing to the cloud assets created in the background to support the remote execution. The cloud assets can be located through the following properties set in the object: bigframes_bigquery_function - the BigQuery managed function deployed for the user defined code.
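A minimal sketch of defining and applying a managed function with this decorator (the session helper and dataset name are illustrative assumptions):
>>> import bigframes.pandas as bpd
>>> session = bpd.get_global_session()  # assumed helper for obtaining the current Session
>>> @session.udf(dataset="my_dataset")  # hypothetical dataset; the session's BigQuery connection is used by default
... def squared(x: float) -> float:
...     return x * x
>>> df = bpd.DataFrame({'x': [1.0, 2.0, 3.0]})
>>> result = df['x'].apply(squared)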