API documentation for the pandas package.
NamedAgg(column, aggfunc)
option_context
Context manager to temporarily set options in the with statement context.
You need to invoke as option_context(pat, val, [(pat, val), ...]).
Examples:
>>> import bigframes
>>> with bigframes.option_context('display.max_rows', 10, 'display.max_columns', 5):
... pass
Packages
Functions
concat
Concatenate BigQuery DataFrames objects along a particular axis.
Allows optional set logic along the other axes.
Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.
Note: It is not recommended to build DataFrames by adding single rows in a for loop. Build a list of rows and make a DataFrame in a single concat.
Examples:
>>> import bigframes.pandas as pd
>>> pd.options.display.progress_bar = None
Combine two Series.
>>> s1 = pd.Series(['a', 'b'])
>>> s2 = pd.Series(['c', 'd'])
>>> pd.concat([s1, s2])
0 a
1 b
0 c
1 d
dtype: string
Clear the existing index and reset it in the result by setting the ignore_index option to True.
>>> pd.concat([s1, s2], ignore_index=True)
0 a
1 b
2 c
3 d
dtype: string
Combine two DataFrame objects with identical columns.
>>> df1 = pd.DataFrame([['a', 1], ['b', 2]],
... columns=['letter', 'number'])
>>> df1
letter number
0 a 1
1 b 2
<BLANKLINE>
[2 rows x 2 columns]
>>> df2 = pd.DataFrame([['c', 3], ['d', 4]],
... columns=['letter', 'number'])
>>> df2
letter number
0 c 3
1 d 4
<BLANKLINE>
[2 rows x 2 columns]
>>> pd.concat([df1, df2])
letter number
0 a 1
1 b 2
0 c 3
1 d 4
<BLANKLINE>
[4 rows x 2 columns]
Combine DataFrame objects with overlapping columns and return everything. Columns outside the intersection will be filled with missing values (displayed as <NA>).
>>> df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
... columns=['letter', 'number', 'animal'])
>>> df3
letter number animal
0 c 3 cat
1 d 4 dog
<BLANKLINE>
[2 rows x 3 columns]
>>> pd.concat([df1, df3])
letter number animal
0 a 1 <NA>
1 b 2 <NA>
0 c 3 cat
1 d 4 dog
<BLANKLINE>
[4 rows x 3 columns]
Combine DataFrame objects with overlapping columns and return only those that are shared by passing inner to the join keyword argument.
>>> pd.concat([df1, df3], join="inner")
letter number
0 a 1
1 b 2
0 c 3
1 d 4
<BLANKLINE>
[4 rows x 2 columns]
Parameters
Name Description
objs
Objects to concatenate. Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised.
axis
The axis to concatenate along.
join
How to handle indexes on other axis (or axes).
ignore_index
If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.
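The difference between the default outer join and join="inner" in the examples above can be sketched in plain Python. This is an illustration of the column logic only, assuming frames are represented as dictionaries of column name to list; `concat_columns_sketch` is a hypothetical helper, not part of bigframes.

```python
def concat_columns_sketch(frames, join="outer"):
    """Stack column-oriented frames (dict of column name -> list) row-wise."""
    if join not in ("outer", "inner"):
        raise ValueError("join must be 'outer' or 'inner'")
    # Column order follows first appearance across the inputs.
    names = []
    for frame in frames:
        for name in frame:
            if name not in names:
                names.append(name)
    if join == "inner":
        # Keep only the columns present in every input.
        names = [name for name in names if all(name in f for f in frames)]
    result = {name: [] for name in names}
    for frame in frames:
        n_rows = len(next(iter(frame.values()), []))
        for name in names:
            # Missing columns are padded with None (shown as <NA> above).
            result[name].extend(frame.get(name, [None] * n_rows))
    return result


df1 = {"letter": ["a", "b"], "number": [1, 2]}
df3 = {"letter": ["c", "d"], "number": [3, 4], "animal": ["cat", "dog"]}
outer = concat_columns_sketch([df1, df3])
inner = concat_columns_sketch([df1, df3], join="inner")
```

Here `outer` keeps the `animal` column and pads the rows from `df1` with `None`, while `inner` drops it because `df1` does not have it.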
cut
cut(
x: bigframes.series.Series, bins: int, *, labels: typing.Optional[bool] = None
) -> bigframes.series.Series
Bin values into discrete intervals.
Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.
labels=False implies you just want the bins back.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> s = bpd.Series([0, 1, 5, 10])
>>> s
0 0
1 1
2 5
3 10
dtype: Int64
Cut with an integer (equal-width bins):
>>> bpd.cut(s, bins=4)
0 {'left_exclusive': -0.01, 'right_inclusive': 2.5}
1 {'left_exclusive': -0.01, 'right_inclusive': 2.5}
2 {'left_exclusive': 2.5, 'right_inclusive': 5.0}
3 {'left_exclusive': 7.5, 'right_inclusive': 10.0}
dtype: struct<left_exclusive: double, right_inclusive: double>[pyarrow]
Cut with an integer (equal-width bins) and labels=False:
>>> bpd.cut(s, bins=4, labels=False)
0 0
1 0
2 1
3 3
dtype: Int64
Cut with pd.IntervalIndex, requires importing pandas for IntervalIndex:
>>> import pandas as pd
>>> interval_index = pd.IntervalIndex.from_tuples([(0, 1), (1, 5), (5, 20)])
>>> bpd.cut(s, bins=interval_index)
0 <NA>
1 {'left_exclusive': 0, 'right_inclusive': 1}
2 {'left_exclusive': 1, 'right_inclusive': 5}
3 {'left_exclusive': 5, 'right_inclusive': 20}
dtype: struct<left_exclusive: int64, right_inclusive: int64>[pyarrow]
Cut with an iterable of tuples:
>>> bins_tuples = [(0, 1), (1, 4), (5, 20)]
>>> bpd.cut(s, bins=bins_tuples)
0 <NA>
1 {'left_exclusive': 0, 'right_inclusive': 1}
2 <NA>
3 {'left_exclusive': 5, 'right_inclusive': 20}
dtype: struct<left_exclusive: int64, right_inclusive: int64>[pyarrow]
Cut with an iterable of ints:
>>> bins_ints = [0, 1, 5, 20]
>>> bpd.cut(s, bins=bins_ints)
0 <NA>
1 {'left_exclusive': 0, 'right_inclusive': 1}
2 {'left_exclusive': 1, 'right_inclusive': 5}
3 {'left_exclusive': 5, 'right_inclusive': 20}
dtype: struct<left_exclusive: int64, right_inclusive: int64>[pyarrow]
Parameters
Name Description
x
The input Series to be binned. Must be 1-dimensional.
bins
The criteria to bin by.
- int: Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.
- pd.IntervalIndex or Iterable of tuples: Defines the exact bins to be used. It's important to ensure that these bins are non-overlapping.
- Iterable of numerics: Defines the exact bins by using the interval between each item and its following item. The items must be monotonically increasing.
labels
Specifies the labels for the returned bins. Must be the same length as the resulting bins. If False, returns only integer indicators of the bins. This affects the type of the output container.
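The binning rules can be sketched in plain Python to show how the integer codes in the labels=False example arise. This is an illustrative sketch, not the bigframes implementation: `equal_width_edges` and `cut_codes` are hypothetical helpers, and lowering only the first edge by 0.1% of the range is an assumption taken from the bins=4 example output (-0.01 for the range 0 to 10).

```python
from bisect import bisect_left


def equal_width_edges(values, bins):
    """Return bins + 1 edges spanning the range of values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    edges[-1] = hi  # avoid floating-point drift on the last edge
    # Assumption: lower only the first edge by 0.1% of the range so that
    # the minimum value falls inside the first left-exclusive bin.
    edges[0] = lo - 0.001 * (hi - lo)
    return edges


def cut_codes(values, edges):
    """Integer bin index per value, or None when the value is in no bin.

    Bins are left-exclusive and right-inclusive: (edges[i], edges[i + 1]].
    """
    codes = []
    for x in values:
        i = bisect_left(edges, x) - 1
        codes.append(i if 0 <= i < len(edges) - 1 else None)
    return codes


data = [0, 1, 5, 10]
codes_equal_width = cut_codes(data, equal_width_edges(data, 4))
codes_explicit = cut_codes(data, [0, 1, 5, 20])
```

With explicit edges `[0, 1, 5, 20]` the value 0 lands in no bin because the first interval excludes its left edge, which corresponds to the `<NA>` in the first row of the iterable-of-ints example.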
get_dummies
get_dummies(
data: typing.Union[bigframes.dataframe.DataFrame, bigframes.series.Series],
prefix: typing.Optional[typing.Union[typing.List, dict, str]] = None,
prefix_sep: typing.Optional[typing.Union[typing.List, dict, str]] = "_",
dummy_na: bool = False,
columns: typing.Optional[typing.List] = None,
drop_first: bool = False,
dtype: typing.Optional[typing.Any] = None,
) -> bigframes.dataframe.DataFrame
Convert categorical variable into dummy/indicator variables.
Each variable is converted in as many 0/1 variables as there are different values. Columns in the output are each named after a value; if the input is a DataFrame, the name of the original variable is prepended to the value.
Examples:
>>> import bigframes.pandas as pd
>>> pd.options.display.progress_bar = None
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
a b c
0 True False False
1 False True False
2 False False True
3 True False False
<BLANKLINE>
[4 rows x 3 columns]
>>> s1 = pd.Series(['a', 'b', None])
>>> pd.get_dummies(s1)
a b
0 True False
1 False True
2 False False
<BLANKLINE>
[3 rows x 2 columns]
>>> pd.get_dummies(s1, dummy_na=True)
a b <NA>
0 True False False
1 False True False
2 False False True
<BLANKLINE>
[3 rows x 3 columns]
>>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})
>>> pd.get_dummies(df, prefix=['col1', 'col2'])
C col1_a col1_b col2_a col2_b col2_c
0 1 True False False True False
1 2 False True True False False
2 3 True False False False True
<BLANKLINE>
[3 rows x 6 columns]
>>> pd.get_dummies(pd.Series(list('abcaa')))
a b c
0 True False False
1 False True False
2 False False True
3 True False False
4 True False False
<BLANKLINE>
[5 rows x 3 columns]
>>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
b c
0 False False
1 True False
2 False True
3 False False
4 False False
<BLANKLINE>
[5 rows x 2 columns]
Parameters
Name Description
data
Data of which to get dummy indicators.
prefix
String to append DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.
prefix_sep
Separator/delimiter to use, appended to prefix. Or pass a list or dictionary as with prefix.
dummy_na
Add a column to indicate NaNs. If False, NaNs are ignored.
columns
Column names in the DataFrame to be encoded. If columns is None then only the columns with string dtype will be converted.
drop_first
Whether to get k-1 dummies out of k categorical levels by removing the first level.
dtype
Data type for new columns. Only a single dtype is allowed.
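The effect of dummy_na and drop_first can be sketched for a single column in plain Python. This is only an illustration of the encoding, assuming missing values are represented as `None`; `get_dummies_sketch` is a hypothetical helper and does not cover prefixes or DataFrame inputs.

```python
def get_dummies_sketch(values, dummy_na=False, drop_first=False):
    """One-hot encode a list of values into {level: [bool, ...]}."""
    # Levels are the sorted distinct non-null values.
    levels = sorted({v for v in values if v is not None})
    columns = {level: [v == level for v in values] for level in levels}
    if dummy_na:
        # Extra indicator column for missing values.
        columns[None] = [v is None for v in values]
    if drop_first and columns:
        # Drop the first level, leaving k - 1 indicators for k levels.
        del columns[next(iter(columns))]
    return columns


plain = get_dummies_sketch(list("abca"))
with_na = get_dummies_sketch(["a", "b", None], dummy_na=True)
dropped = get_dummies_sketch(list("abcaa"), drop_first=True)
```

A row whose indicators are all False after drop_first belongs to the dropped first level, as in rows 0, 3, and 4 of the drop_first example above.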
merge
merge(
left: bigframes.dataframe.DataFrame,
right: bigframes.dataframe.DataFrame,
how: typing.Literal["inner", "left", "outer", "right", "cross"] = "inner",
on: typing.Optional[str] = None,
*,
left_on: typing.Optional[str] = None,
right_on: typing.Optional[str] = None,
sort: bool = False,
suffixes: tuple[str, str] = ("_x", "_y")
) -> bigframes.dataframe.DataFrame
Merge DataFrame objects with a database-style join.
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.
Note: A named Series object is treated as a DataFrame with a single named column.
Warning: If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results.
Parameters
Name Description
on
Columns to join on. It must be found in both DataFrames. Either on or left_on + right_on must be passed in.
left_on
Columns to join on in the left DataFrame. Either on or left_on + right_on must be passed in.
right_on
Columns to join on in the right DataFrame. Either on or left_on + right_on must be passed in.
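The warning about null keys can be illustrated with a small plain-Python inner join. This is a sketch under the assumption that rows are dictionaries and null keys are `None`; `inner_merge_sketch` and its `null_equals_null` flag are hypothetical and exist only to contrast the two behaviours.

```python
def inner_merge_sketch(left, right, on, null_equals_null=True):
    """Inner-join two lists of row dicts on a single key column."""
    merged = []
    for l_row in left:
        for r_row in right:
            l_key, r_key = l_row[on], r_row[on]
            if l_key is None or r_key is None:
                # The documented merge pairs null keys; SQL joins never do.
                matches = null_equals_null and l_key is None and r_key is None
            else:
                matches = l_key == r_key
            if matches:
                merged.append({**l_row, **r_row})
    return merged


left = [{"key": "a", "x": 1}, {"key": None, "x": 2}]
right = [{"key": "a", "y": 10}, {"key": None, "y": 20}]
merge_style = inner_merge_sketch(left, right, on="key")
sql_style = inner_merge_sketch(left, right, on="key", null_equals_null=False)
```

`merge_style` contains two rows, including the pair of null-keyed rows, while `sql_style` contains only the row with key "a".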
qcut
qcut(
x: bigframes.series.Series,
q: int,
*,
labels: typing.Optional[bool] = None,
duplicates: typing.Literal["drop", "error"] = "error"
) -> bigframes.series.Series
Quantile-based discretization function.
Discretize variable into equal-sized buckets based on rank or based on sample quantiles. For example 1000 values for 10 quantiles would produce a Categorical object indicating quantile membership for each data point.
Parameters
Name Description
x
The input Series to be binned. Must be 1-dimensional.
q
Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles.
labels
Used as labels for the resulting bins. Must be of the same length as the resulting bins. If False, return only integer indicators of the bins. If True, raises an error.
duplicates
If bin edges are not unique, raise ValueError or drop non-uniques.
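As a rough illustration of quantile-based bucketing with labels=False, the following plain-Python sketch derives bin edges from sample quantiles and assigns integer codes. It assumes linear interpolation between data points and right-inclusive bins with the minimum folded into the first bin; `qcut_codes` is a hypothetical helper, not the bigframes implementation.

```python
from bisect import bisect_left
from statistics import quantiles


def qcut_codes(values, q, duplicates="error"):
    """Assign each value to a quantile-based bucket, numbered from 0."""
    # Interior cut points, interpolated linearly between data points.
    inner = quantiles(values, n=q, method="inclusive")
    edges = [min(values), *inner, max(values)]
    if len(set(edges)) != len(edges):
        if duplicates == "error":
            raise ValueError("Bin edges must be unique.")
        edges = sorted(set(edges))  # duplicates="drop"
    codes = []
    for x in values:
        # Right-inclusive bins; the minimum is folded into the first bin.
        codes.append(max(bisect_left(edges, x) - 1, 0))
    return codes


quartile_codes = qcut_codes(list(range(8)), 4)
```

For eight evenly spaced values and q=4, each quartile receives two values. Heavily repeated values can make two edges coincide, which is the situation the duplicates parameter addresses.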
read_csv
read_csv(
filepath_or_buffer: typing.Union[str, typing.IO[bytes]],
*,
sep: typing.Optional[str] = ",",
header: typing.Optional[int] = 0,
names: typing.Optional[
typing.Union[
typing.MutableSequence[typing.Any],
numpy.ndarray[typing.Any, typing.Any],
typing.Tuple[typing.Any, ...],
range,
]
] = None,
index_col: typing.Optional[
typing.Union[
int, str, typing.Sequence[typing.Union[str, int]], typing.Literal[False]
]
] = None,
usecols: typing.Optional[
typing.Union[
typing.MutableSequence[str],
typing.Tuple[str, ...],
typing.Sequence[int],
pandas.core.series.Series,
pandas.core.indexes.base.Index,
numpy.ndarray[typing.Any, typing.Any],
typing.Callable[[typing.Any], bool],
]
] = None,
dtype: typing.Optional[typing.Dict] = None,
engine: typing.Optional[
typing.Literal["c", "python", "pyarrow", "python-fwf", "bigquery"]
] = None,
encoding: typing.Optional[str] = None,
**kwargs
) -> bigframes.dataframe.DataFrame
Loads DataFrame from comma-separated values (csv) file locally or from Cloud Storage.
The CSV file data will be persisted as a temporary BigQuery table, which can be automatically recycled after the Session is closed.
Note: using engine="bigquery" will not guarantee the same ordering as the file. Instead, set a serialized index column as the index and sort by that in the resulting DataFrame.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> gcs_path = "gs://cloud-samples-data/bigquery/us-states/us-states.csv"
>>> df = bpd.read_csv(filepath_or_buffer=gcs_path)
>>> df.head(2)
name post_abbr
0 Alabama AL
1 Alaska AK
<BLANKLINE>
[2 rows x 2 columns]
Parameters
Name Description
filepath_or_buffer
A local or Google Cloud Storage (gs://) path when engine="bigquery"; otherwise, the value is passed to pandas.read_csv.
sep
The separator for fields in a CSV file. For the BigQuery engine, the separator can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF-8. Both engines support sep="\t" to specify the tab character as the separator. The default engine supports any number of spaces as the separator by specifying sep="\s+". Separators longer than 1 character are interpreted as regular expressions by the default engine. The BigQuery engine only supports single-character separators.
header
Row number to use as the column names.
- None: Instructs autodetect that there are no headers and data should be read starting from the first row.
- 0: If using engine="bigquery", autodetect tries to detect headers in the first row. If they are not detected, the row is read as data. Otherwise data is read starting from the second row. When using the default engine, pandas assumes the first row contains column names unless the names argument is specified. If names is provided, then the first row is ignored, the second row is read as data, and column names are taken from names.
- N > 0: If using engine="bigquery", autodetect skips N rows and tries to detect headers in row N+1. If headers are not detected, row N+1 is just skipped. Otherwise row N+1 is used to extract column names for the detected schema. When using the default engine, pandas skips N rows and assumes row N+1 contains column names unless the names argument is specified. If names is provided, row N+1 is ignored, row N+2 is read as data, and column names are taken from names.
names
A list of column names to use. If the file contains a header row and you want to pass this parameter, then header=0 should be passed as well so the first (header) row is ignored. Only to be used with default engine.
index_col
Column(s) to use as the row labels of the DataFrame, either given as string name or column index. index_col=False can be used with the default engine only to enforce that the first column is not used as the index. Using column index instead of column name is only supported with the default engine. The BigQuery engine only supports having a single column name as the index_col. Neither engine supports having a multi-column index.
usecols
List of column names to use. The BigQuery engine only supports having a list of string column names. Column indices and callable functions are only supported with the default engine. Using the default engine, the column names in usecols can be defined to correspond to column names provided with the names parameter (ignoring the document's header row of column names). The order of the column indices/names in usecols is ignored with the default engine. The order of the column names provided with the BigQuery engine will be consistent in the resulting dataframe. If using a callable function with the default engine, only column names that evaluate to True by the callable function will be in the resulting dataframe.
dtype
Data type for data or columns. Only to be used with default engine.
engine
Type of engine to use. If engine="bigquery" is specified, then BigQuery's load API will be used. Otherwise, the engine will be passed to pandas.read_csv.
encoding
The character encoding of the data. The default encoding is UTF-8 for both engines. The default engine accepts a wide range of encodings. Refer to the Python documentation for a comprehensive list: https://docs.python.org/3/library/codecs.html#standard-encodings. The BigQuery engine only supports UTF-8 and ISO-8859-1.
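The interaction between header and names described for the default engine can be sketched with the standard csv module. This is an illustration of those rules only, not of BigQuery autodetection; `read_rows_sketch` is a hypothetical helper and the sample text is made up to resemble the example file above.

```python
import csv
import io


def read_rows_sketch(text, header=0, names=None):
    """Split CSV text into (column_names, data_rows) for header=None or N >= 0."""
    rows = list(csv.reader(io.StringIO(text)))
    if header is None:
        # No header row: every row is data.
        columns = names if names is not None else range(len(rows[0]))
        return list(columns), rows
    # Skip `header` rows; the next row is treated as the header row.
    header_row, data = rows[header], rows[header + 1:]
    # With names, the header row is dropped and names label the columns.
    columns = names if names is not None else header_row
    return list(columns), data


sample = "name,post_abbr\nAlabama,AL\nAlaska,AK\n"
default_cols, default_rows = read_rows_sketch(sample)
renamed_cols, renamed_rows = read_rows_sketch(sample, names=["state", "code"])
```

Passing names together with header=0 replaces the column labels while still discarding the file's own header row, so `default_rows` and `renamed_rows` hold the same two data rows.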
read_gbq(
query_or_table: str,
*,
index_col: typing.Union[typing.Iterable[str], str] = (),
columns: typing.Iterable[str] = (),
configuration: typing.Optional[typing.Dict] = None,
max_results: typing.Optional[int] = None,
filters: typing.Union[
typing.Iterable[
typing.Tuple[
str,
typing.Literal[
"in", "not in", "<", "<=", "==", "!=", ">=", ">", "LIKE"
],
typing.Any,
]
],
typing.Iterable[
typing.Iterable[
typing.Tuple[
str,
typing.Literal[
"in", "not in", "<", "<=", "==", "!=", ">=", ">", "LIKE"
],
typing.Any,
]
]
],
] = (),
use_cache: typing.Optional[bool] = None,
col_order: typing.Iterable[str] = ()
) -> bigframes.dataframe.DataFrame
Loads a DataFrame from BigQuery.
BigQuery tables are an unordered, unindexed data source. To support pandas compatibility, the following indexing options are supported:
Set the index_col argument to one or more columns. Unique values for the row labels are recommended. Duplicate labels are possible, but note that joins on a non-unique index can duplicate rows and operations like cumsum() that window across a non-unique index can have some nondeterminism.
Use row_number() OVER (ORDER BY ...) AS rowindex in your SQL query and set index_col='rowindex' to preserve the desired ordering.
If your query doesn't have an ordering, select GENERATE_UUID() AS rowindex in your SQL and set index_col='rowindex' for the best performance.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
If the input is a table ID:
>>> df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
Read table path with wildcard suffix and filters:
>>> df = bpd.read_gbq_table("bigquery-public-data.noaa_gsod.gsod19*", filters=[("_table_suffix", ">=", "30"), ("_table_suffix", "<=", "39")])
Preserve ordering in a query input.
>>> df = bpd.read_gbq('''
... SELECT
... -- Instead of an ORDER BY clause on the query, use
... -- ROW_NUMBER() to create an ordered DataFrame.
... ROW_NUMBER() OVER (ORDER BY AVG(pitchSpeed) DESC)
... AS rowindex,
...
... pitcherFirstName,
... pitcherLastName,
... AVG(pitchSpeed) AS averagePitchSpeed
... FROM `bigquery-public-data.baseball.games_wide`
... WHERE year = 2016
... GROUP BY pitcherFirstName, pitcherLastName
... ''', index_col="rowindex")
>>> df.head(2)
pitcherFirstName pitcherLastName averagePitchSpeed
rowindex
1 Albertin Chapman 96.514113
2 Zachary Britton 94.591039
<BLANKLINE>
[2 rows x 3 columns]
Reading data with columns and filters parameters:
>>> columns = ['pitcherFirstName', 'pitcherLastName', 'year', 'pitchSpeed']
>>> filters = [('year', '==', 2016), ('pitcherFirstName', 'in', ['John', 'Doe']), ('pitcherLastName', 'in', ['Gant'])]
>>> df = bpd.read_gbq(
... "bigquery-public-data.baseball.games_wide",
... columns=columns,
... filters=filters,
... )
>>> df.head(1)
pitcherFirstName pitcherLastName year pitchSpeed
0 John Gant 2016 82
<BLANKLINE>
[1 rows x 4 columns]
Parameters
Name Description
query_or_table
A SQL string to be executed or a BigQuery table to be read. The table must be specified in the format of project.dataset.tablename or dataset.tablename. Can also take a wildcard table name, such as project.dataset.table_prefix*. In that case, all the matched tables are read as one DataFrame.
index_col
Name of result column(s) to use for index in results DataFrame. New in bigframes version 1.3.0: If index_col is not set, the primary key(s) of the table are used as the index.
columns
List of BigQuery column names in the desired order for results DataFrame.
configuration
Query config parameters for job processing. For example: configuration = {'query': {'useQueryCache': False}}. For more information see BigQuery REST API Reference https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query.
max_results
If set, limit the maximum number of rows to fetch from the query results.
filters
To filter out data. Filter syntax: [[(column, op, val), …],…] where op is [==, >, >=, <, <=, !=, in, not in, LIKE]. The innermost tuples are transposed into a set of filters applied through an AND operation. The outer Iterable combines these sets of filters through an OR operation. A single Iterable of tuples can also be used, meaning that no OR operation between set of filters is to be conducted. If using wildcard table suffix in query_or_table, can specify '_table_suffix' pseudo column to filter the tables to be read into the DataFrame.
use_cache
Caches query results if set to True. When None, it behaves as True, but should not be combined with useQueryCache in configuration to avoid conflicts.
col_order
Alias for columns, retained for backwards compatibility.
read_gbq_function
read_gbq_function(function_name: str)
Parameter
Name Description
function_name
The function's name in BigQuery in the format project_id.dataset_id.function_name, or dataset_id.function_name to load from the default project, or function_name to load from the default project and the dataset associated with the current session.
read_gbq_model(model_name: str)
Loads a BigQuery ML model from BigQuery.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
Read an existing BigQuery ML model.
>>> model_name = "bigframes-dev.bqml_tutorial.penguins_model"
>>> model = bpd.read_gbq_model(model_name)
Parameter
Name Description
model_name
The model's name in BigQuery in the format project_id.dataset_id.model_id, or just dataset_id.model_id to load from the default project.
read_gbq_query(
query: str,
*,
index_col: typing.Union[typing.Iterable[str], str] = (),
columns: typing.Iterable[str] = (),
configuration: typing.Optional[typing.Dict] = None,
max_results: typing.Optional[int] = None,
use_cache: typing.Optional[bool] = None,
col_order: typing.Iterable[str] = ()
) -> bigframes.dataframe.DataFrame
Turn a SQL query into a DataFrame.
Note: Because the results are written to a temporary table, ordering by ORDER BY is not preserved. A unique index_col is recommended. Use row_number() over () if there is no natural unique index or you want to preserve ordering.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
Simple query input:
>>> df = bpd.read_gbq_query('''
... SELECT
... pitcherFirstName,
... pitcherLastName,
... pitchSpeed,
... FROM `bigquery-public-data.baseball.games_wide`
... ''')
Preserve ordering in a query input.
>>> df = bpd.read_gbq_query('''
... SELECT
... -- Instead of an ORDER BY clause on the query, use
... -- ROW_NUMBER() to create an ordered DataFrame.
... ROW_NUMBER() OVER (ORDER BY AVG(pitchSpeed) DESC)
... AS rowindex,
...
... pitcherFirstName,
... pitcherLastName,
... AVG(pitchSpeed) AS averagePitchSpeed
... FROM `bigquery-public-data.baseball.games_wide`
... WHERE year = 2016
... GROUP BY pitcherFirstName, pitcherLastName
... ''', index_col="rowindex")
>>> df.head(2)
pitcherFirstName pitcherLastName averagePitchSpeed
rowindex
1 Albertin Chapman 96.514113
2 Zachary Britton 94.591039
<BLANKLINE>
[2 rows x 3 columns]
See also: Session.read_gbq.
read_gbq_table(
query: str,
*,
index_col: typing.Union[typing.Iterable[str], str] = (),
columns: typing.Iterable[str] = (),
max_results: typing.Optional[int] = None,
filters: typing.Union[
typing.Iterable[
typing.Tuple[
str,
typing.Literal[
"in", "not in", "<", "<=", "==", "!=", ">=", ">", "LIKE"
],
typing.Any,
]
],
typing.Iterable[
typing.Iterable[
typing.Tuple[
str,
typing.Literal[
"in", "not in", "<", "<=", "==", "!=", ">=", ">", "LIKE"
],
typing.Any,
]
]
],
] = (),
use_cache: bool = True,
col_order: typing.Iterable[str] = ()
) -> bigframes.dataframe.DataFrame
Turn a BigQuery table into a DataFrame.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
Read a whole table, with arbitrary ordering or ordering corresponding to the primary key(s).
>>> df = bpd.read_gbq_table("bigquery-public-data.ml_datasets.penguins")
See also: Session.read_gbq.
read_json(
path_or_buf: typing.Union[str, typing.IO[bytes]],
*,
orient: typing.Literal[
"split", "records", "index", "columns", "values", "table"
] = "columns",
dtype: typing.Optional[typing.Dict] = None,
encoding: typing.Optional[str] = None,
lines: bool = False,
engine: typing.Literal["ujson", "pyarrow", "bigquery"] = "ujson",
**kwargs
) -> bigframes.dataframe.DataFrame
Convert a JSON string to DataFrame object.
Note: using engine="bigquery" will not guarantee the same ordering as the file. Instead, set a serialized index column as the index and sort by that in the resulting DataFrame.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> gcs_path = "gs://bigframes-dev-testing/sample1.json"
>>> df = bpd.read_json(path_or_buf=gcs_path, lines=True, orient="records")
>>> df.head(2)
id name
0 1 Alice
1 2 Bob
<BLANKLINE>
[2 rows x 2 columns]
Parameters
Name Description
path_or_buf
A local or Google Cloud Storage (gs://) path when engine="bigquery"; otherwise, the value is passed to pandas.read_json.
orient
If engine="bigquery", orient only supports "records". Indication of expected JSON string format. Compatible JSON strings can be produced by to_json() with a corresponding orient value. The set of possible orients is:
- 'split': dict like {index -> [index], columns -> [columns], data -> [values]}
- 'records': list like [{column -> value}, ... , {column -> value}]
- 'index': dict like {index -> {column -> value}}
- 'columns': dict like {column -> {index -> value}}
- 'values': just the values array
dtype
If True, infer dtypes; if a dict of column to dtype, then use those; if False, then don't infer dtypes at all, applies only to the data. For all orient values except 'table', default is True.
encoding
The encoding to use to decode py3 bytes.
lines
Read the file as a JSON object per line. If using engine="bigquery", lines only supports True.
engine
Type of engine to use. If engine="bigquery" is specified, then BigQuery's load API will be used. Otherwise, the engine will be passed to pandas.read_json.
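The combination orient="records" with lines=True corresponds to newline-delimited JSON, where each line is one record object. A minimal sketch of turning such text into columns with the standard json module follows; `parse_json_lines` is a hypothetical helper and the sample text is invented for illustration, not the content of the file used in the example above.

```python
import json


def parse_json_lines(text):
    """Parse newline-delimited JSON records into {column: [values]}."""
    columns = {}
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    for i, record in enumerate(records):
        for name, value in record.items():
            # Back-fill None for rows seen before this column first appeared.
            columns.setdefault(name, [None] * i).append(value)
        for name in columns:
            # Pad columns that this record did not provide.
            if len(columns[name]) <= i:
                columns[name].append(None)
    return columns


sample = '{"id": 1, "name": "Alice"}\n{"id": 2, "name": "Bob"}\n'
parsed = parse_json_lines(sample)
```

Each key becomes a column and each line becomes a row; keys missing from a record are filled with `None`.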
Loads DataFrame from a pandas DataFrame.
The pandas DataFrame will be persisted as a temporary BigQuery table, which can be automatically recycled after the Session is closed.
Examples:
>>> import bigframes.pandas as bpd
>>> import pandas as pd
>>> bpd.options.display.progress_bar = None
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> pandas_df = pd.DataFrame(data=d)
>>> df = bpd.read_pandas(pandas_df)
>>> df
col1 col2
0 1 3
1 2 4
<BLANKLINE>
[2 rows x 2 columns]
Parameter
Name Description
pandas_dataframe
a pandas DataFrame/Series/Index object to be loaded.
read_parquet
read_parquet(
path: typing.Union[str, typing.IO[bytes]], *, engine: str = "auto"
) -> bigframes.dataframe.DataFrame
Load a Parquet object from the file path (local or Cloud Storage), returning a DataFrame.
Note: This method will not guarantee the same ordering as the file. Instead, set a serialized index column as the index and sort by that in the resulting DataFrame.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> gcs_path = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"
>>> df = bpd.read_parquet(path=gcs_path, engine="bigquery")
Parameters
Name Description
path
Local or Cloud Storage path to Parquet file.
engine
One of 'auto', 'pyarrow', 'fastparquet', or 'bigquery'. Parquet library to parse the file. If set to 'bigquery', order is not preserved. Default, 'auto'.
read_pickle(
filepath_or_buffer: FilePath | ReadPickleBuffer,
compression: CompressionOptions = "infer",
storage_options: StorageOptions = None,
)
Load pickled BigFrames object (or any object) from file.
Note: If the content of the pickle file is a Series and its name attribute is None, the name will be set to '0' by default.
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> gcs_path = "gs://bigframes-dev-testing/test_pickle.pkl"
>>> df = bpd.read_pickle(filepath_or_buffer=gcs_path)
Parameters
Name Description
filepath_or_buffer
String, path object (implementing os.PathLike[str]), or file-like object implementing a binary readlines() function. Also accepts URL. URL is not limited to S3 and GCS.
compression
For on-the-fly decompression of on-disk data. If 'infer' and 'filepath_or_buffer' is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2' (otherwise no compression). If using 'zip' or 'tar', the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary compression={'method': 'zstd', 'dict_data': my_compression_dict}.
storage_options
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.
remote_function
remote_function(
input_types: typing.Union[type, typing.Sequence[type]],
output_type: type,
dataset: typing.Optional[str] = None,
bigquery_connection: typing.Optional[str] = None,
reuse: bool = True,
name: typing.Optional[str] = None,
packages: typing.Optional[typing.Sequence[str]] = None,
cloud_function_service_account: typing.Optional[str] = None,
cloud_function_kms_key_name: typing.Optional[str] = None,
cloud_function_docker_repository: typing.Optional[str] = None,
max_batching_rows: typing.Optional[int] = 1000,
cloud_function_timeout: typing.Optional[int] = 600,
)
Parameters
Name Description
input_types
Input data type, or sequence of input data types in the user defined function.
output_type
Data type of the output in the user defined function.
dataset
Dataset in which to create a BigQuery remote function. It should be in <project_id>.<dataset_name> or <dataset_name> format. If this parameter is not provided then session dataset id is used.
bigquery_connection
Name of the BigQuery connection. You should either have the connection already created in the location you have chosen, or you should have the Project IAM Admin role to enable the service to create the connection for you if you need it. If this parameter is not provided then the BigQuery connection from the session is used.
reuse
Reuse the remote function if it already exists. True by default, which results in reusing an existing remote function and corresponding cloud function (if any) that was previously created for the same udf. Setting it to False forces creating a unique remote function. If the required remote function does not exist, it is created irrespective of this parameter.
name
Explicit name of the persisted BigQuery remote function. Use it with caution, because two users working in the same project and dataset could overwrite each other's remote functions if they use the same persistent name.
packages
Explicit name of the external package dependencies. Each dependency is added to the requirements.txt as is, and can be of the form supported in https://pip.pypa.io/en/stable/reference/requirements-file-format/.
cloud_function_service_account
Service account to use for the cloud functions. If not provided then the default service account would be used. See https://cloud.google.com/functions/docs/securing/function-identity for more details. Please make sure the service account has the necessary IAM permissions configured as described in https://cloud.google.com/functions/docs/reference/iam/roles#additional-configuration.
cloud_function_kms_key_name
Customer managed encryption key to protect cloud functions and related data at rest. This is of the format projects/PROJECT_ID/locations/LOCATION/keyRings/KEYRING/cryptoKeys/KEY. Read https://cloud.google.com/functions/docs/securing/cmek for more details including granting necessary service accounts access to the key.
cloud_function_docker_repository
Docker repository created with the same encryption key as cloud_function_kms_key_name
to store encrypted artifacts created to support the cloud function. This is of the format projects/PROJECT_ID/locations/LOCATION/repositories/REPOSITORY_NAME. For more details see https://cloud.google.com/functions/docs/securing/cmek#before_you_begin.
max_batching_rows
The maximum number of rows to be batched for processing in the BigQuery remote function. Default value is 1000. A lower number can be passed to avoid timeouts when the user code is too complex to process a large number of rows fast enough. A higher number can be used to increase throughput when the user code is fast enough. None
can be passed to let the BigQuery remote functions service apply its default batching. For more details, see https://cloud.google.com/bigquery/docs/remote-functions#limiting_number_of_rows_in_a_batch_request.
cloud_function_timeout
The maximum amount of time (in seconds) BigQuery should wait for the cloud function to return a response. For more details, see https://cloud.google.com/functions/docs/configuring/timeout. Note that even though the cloud function (2nd gen) itself allows setting a timeout of up to 60 minutes, a BigQuery remote function can wait only up to 20 minutes; for more details, see https://cloud.google.com/bigquery/quotas#remote_function_limits. By default BigQuery DataFrames uses a 10 minute timeout. None
can be passed to let the cloud functions default timeout take effect.
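The batching and timeout options above interact: slower user code usually calls for smaller batches, and the timeout is bounded by what BigQuery will wait for. The following is a minimal sketch using only the standard library. The helper name `remote_function_options` and its client-side checks are assumptions made for illustration, not part of the library; the library may validate these values differently or not at all.

```python
from typing import Any, Dict, Optional

# Per the text above: BigQuery waits at most 20 minutes for a response.
BIGQUERY_MAX_WAIT_SECONDS = 20 * 60


def remote_function_options(
    max_batching_rows: Optional[int] = 1000,
    cloud_function_timeout: Optional[int] = 600,
) -> Dict[str, Any]:
    """Hypothetical helper that collects the tuning options described above."""
    if max_batching_rows is not None and max_batching_rows < 1:
        raise ValueError("max_batching_rows must be positive or None")
    if (
        cloud_function_timeout is not None
        and cloud_function_timeout > BIGQUERY_MAX_WAIT_SECONDS
    ):
        # Assumed client-side check; the library may handle this differently.
        raise ValueError("BigQuery waits at most 20 minutes for a response")
    return {
        "reuse": True,
        "max_batching_rows": max_batching_rows,
        "cloud_function_timeout": cloud_function_timeout,
    }


# Slow user code: smaller batches, default 10 minute timeout.
options = remote_function_options(max_batching_rows=100)
print(options)
```

With a configured BigQuery DataFrames session, a dictionary like this could presumably be unpacked into the call whose parameters are listed above (for example `bpd.remote_function(input_types=[int], output_type=int, **options)`); that call needs cloud credentials and is not run here.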
to_datetime(
arg: typing.Union[
int,
float,
str,
datetime.datetime,
typing.Iterable,
pandas.core.series.Series,
pandas.core.frame.DataFrame,
typing.Mapping,
bigframes.series.Series,
bigframes.dataframe.DataFrame,
],
*,
utc: bool = False,
format: typing.Optional[str] = None,
unit: typing.Optional[str] = None
) -> typing.Union[
pandas._libs.tslibs.timestamps.Timestamp, datetime.datetime, bigframes.series.Series
]
This function converts a scalar, array-like or Series to a datetime object.
Note: BigQuery only supports precision up to microseconds (us). Therefore, when working with timestamps that have a finer granularity than microseconds, be aware that the additional precision will not be represented in BigQuery.
Note: The format strings for specifying datetime representations in BigQuery and pandas are not completely identical. Ensure that the format string provided is compatible with BigQuery (https://cloud.google.com/bigquery/docs/reference/standard-sql/format-elements#format_elements_date_time).
Examples:
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
Converting a Scalar to datetime:
>>> scalar = 123456.789
>>> bpd.to_datetime(scalar, unit = 's')
Timestamp('1970-01-02 10:17:36.789000')
Converting a List of Strings without Timezone Information:
>>> list_str = ["01-31-2021 14:30", "02-28-2021 15:45"]
>>> bpd.to_datetime(list_str, format="%m-%d-%Y %H:%M", utc=True)
0 2021-01-31 14:30:00+00:00
1 2021-02-28 15:45:00+00:00
dtype: timestamp[us, tz=UTC][pyarrow]
Converting a Series of Strings with Timezone Information:
>>> series_str = bpd.Series(["01-31-2021 14:30+08:00", "02-28-2021 15:45+00:00"])
>>> bpd.to_datetime(series_str, format="%m-%d-%Y %H:%M%Z", utc=True)
0 2021-01-31 06:30:00+00:00
1 2021-02-28 15:45:00+00:00
dtype: timestamp[us, tz=UTC][pyarrow]
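For comparison, here is a minimal standard-library sketch of the same conversion for a single string. It illustrates the note about format strings not being identical across systems: Python's `datetime.strptime` uses `%z` for a numeric UTC offset, whereas the example above passes `%Z`. This uses only the Python standard library, not BigQuery DataFrames, and the variable names are illustrative.

```python
from datetime import datetime, timezone

raw = "01-31-2021 14:30+08:00"

# Python's strptime parses a numeric offset such as +08:00 with %z.
local = datetime.strptime(raw, "%m-%d-%Y %H:%M%z")

# Normalize to UTC, mirroring what utc=True asks for in the example above.
as_utc = local.astimezone(timezone.utc)
print(as_utc)  # 2021-01-31 06:30:00+00:00
```

The result matches the first row of the Series example, which suggests checking each format directive against the target system's documentation before reusing a format string.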
Parameters Name Description arg
The object to convert to a datetime.
utc
Control timezone-related parsing, localization and conversion. If True, the function always returns a timezone-aware UTC-localized timestamp or series. If False (default), inputs will not be coerced to UTC.
format
The strftime-style format string used to parse the time, e.g. "%d/%m/%Y".
unit
The unit of arg when arg is an integer or float number. One of D, s, ms, us, or ns.
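To make the unit parameter concrete, the sketch below interprets a number as a count of the given unit since the Unix epoch, using only the standard library. The helper `epoch_offset_to_datetime` is hypothetical, and dropping sub-microsecond precision by truncation is an assumption chosen to mirror the microsecond limit noted earlier; the library's actual rounding behavior is not stated in this page.

```python
from datetime import datetime, timedelta

# Nanoseconds per unit, for the unit codes listed above.
NS_PER_UNIT = {"D": 86_400 * 10**9, "s": 10**9, "ms": 10**6, "us": 10**3, "ns": 1}

EPOCH = datetime(1970, 1, 1)


def epoch_offset_to_datetime(value, unit):
    """Hypothetical helper: treat `value` as a count of `unit` since the epoch."""
    total_ns = round(value * NS_PER_UNIT[unit])
    # Keep microsecond precision only; truncating the remainder is an assumption.
    return EPOCH + timedelta(microseconds=total_ns // 1000)


print(epoch_offset_to_datetime(123456.789, unit="s"))  # 1970-01-02 10:17:36.789000
print(epoch_offset_to_datetime(1_500, unit="ns"))      # 1970-01-01 00:00:00.000001
```

The first call reproduces the scalar example shown earlier for to_datetime; the second shows 1500 nanoseconds collapsing to a single microsecond under the truncation assumption.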
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-08-12 UTC.