For most data types, pandas uses NumPy arrays as the concrete objects contained within an Index, Series, or DataFrame.
For some data types, pandas extends NumPy's type system. String aliases for these types can be found at dtypes.
pandas and third-party libraries can extend NumPy's type system (see Extension types). The top-level array() method can be used to create a new array, which may be stored in a Series, Index, or as a column in a DataFrame.
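As a minimal sketch of the array() workflow described above, the same extension array can back a Series, an Index, or a DataFrame column. The values, the column name "a", and the choice of the nullable "Int64" dtype are illustrative assumptions.

```python
import pandas as pd

# Build an extension array; "Int64" is pandas' nullable integer dtype.
arr = pd.array([1, 2, None], dtype="Int64")

ser = pd.Series(arr)            # stored in a Series
idx = pd.Index(arr)             # stored in an Index
df = pd.DataFrame({"a": arr})   # stored as a DataFrame column

print(type(arr).__name__, ser.dtype, df["a"].dtype)
```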
Warning
This feature is experimental, and the API can change in a future release without warning.
The arrays.ArrowExtensionArray is backed by a pyarrow.ChunkedArray with a pyarrow.DataType instead of a NumPy array and data type. The .dtype of an arrays.ArrowExtensionArray is an ArrowDtype.
PyArrow provides array and data type support similar to NumPy's, including first-class nullability support for all data types, immutability, and more.
The table below shows the equivalent pyarrow-backed (pa), pandas extension, and numpy (np) types that are recognized by pandas. Pyarrow-backed types below need to be passed into ArrowDtype to be recognized by pandas, e.g. pd.ArrowDtype(pa.bool_()).
Note
Pyarrow-backed string support is provided by both pd.StringDtype("pyarrow") and pd.ArrowDtype(pa.string()). pd.StringDtype("pyarrow") is described below in the string section and will be returned if the string alias "string[pyarrow]" is specified. pd.ArrowDtype(pa.string()) generally has better interoperability with ArrowDtype of different types.
While individual values in an arrays.ArrowExtensionArray are stored as PyArrow objects, scalars are returned as Python scalars corresponding to the data type, e.g. a PyArrow int64 will be returned as a Python int, or NA for missing values.
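The scalar behaviour described above can be sketched as follows. This assumes the optional pyarrow dependency is installed in a version your pandas release supports. The example is skipped otherwise, and the sample values are illustrative.

```python
import pandas as pd

try:
    import pyarrow as pa
except ImportError:
    pa = None  # pyarrow is an optional dependency of pandas

if pa is not None:
    # Wrap a pyarrow type in ArrowDtype so pandas recognizes it.
    ser = pd.Series([1, 2, None], dtype=pd.ArrowDtype(pa.int64()))
    print(ser.dtype)
    print(type(ser[0]))      # scalars come back as Python objects
    print(ser[2] is pd.NA)   # missing values come back as NA
```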
For more information, please see the PyArrow user guide.
Datetimes
NumPy cannot natively represent timezone-aware datetimes. pandas supports this with the arrays.DatetimeArray extension array, which can hold timezone-naive or timezone-aware values.
Timestamp, a subclass of datetime.datetime, is pandas' scalar type for timezone-naive or timezone-aware datetime data. NaT is the missing value for datetime data.
A collection of timestamps may be stored in an arrays.DatetimeArray. For timezone-aware data, the .dtype of an arrays.DatetimeArray is a DatetimeTZDtype. For timezone-naive data, np.dtype("datetime64[ns]") is used.
If the data are timezone-aware, then every value in the array must have the same timezone.
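A short sketch of the naive and aware cases described above. The dates and the "UTC" timezone are illustrative, and the exact datetime64 unit of the naive dtype may vary between pandas versions.

```python
import datetime

import pandas as pd

# .array on a DatetimeIndex exposes the underlying DatetimeArray.
naive = pd.date_range("2024-01-01", periods=3).array
aware = pd.date_range("2024-01-01", periods=3, tz="UTC").array

print(type(naive).__name__, naive.dtype)   # a NumPy datetime64 dtype
print(type(aware).__name__, aware.dtype)   # a DatetimeTZDtype

# Scalars are Timestamp objects, which subclass datetime.datetime.
ts = aware[0]
print(isinstance(ts, datetime.datetime))

# Missing datetime values are represented by NaT.
with_missing = pd.to_datetime(["2024-01-01", None])
print(with_missing[1] is pd.NaT)
```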
Timedeltas
NumPy can natively represent timedeltas. pandas provides Timedelta for symmetry with Timestamp. NaT is the missing value for timedelta data.
A collection of Timedelta may be stored in a TimedeltaArray.
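For illustration, a TimedeltaArray can be obtained from the index returned by to_timedelta. The durations below are made up for the example.

```python
import pandas as pd

# None is converted to NaT, the missing value for timedelta data.
tds = pd.to_timedelta(["1 days", "2 hours", None]).array

print(type(tds).__name__)
print(tds[0] == pd.Timedelta(days=1))
print(tds[2] is pd.NaT)
```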
pandas represents spans of time as Period objects.
A collection of Period may be stored in an arrays.PeriodArray. Every period in an arrays.PeriodArray must have the same freq.
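A minimal sketch of a PeriodArray with a single shared frequency. The monthly frequency and the start date are illustrative choices.

```python
import pandas as pd

# Three consecutive monthly periods; all share the same freq.
periods = pd.period_range("2024-01", periods=3, freq="M").array

print(type(periods).__name__, periods.dtype)
print(periods[0] == pd.Period("2024-01", freq="M"))
```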
Arbitrary intervals can be represented as Interval objects.
A collection of intervals may be stored in an arrays.IntervalArray.
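As a sketch, an IntervalArray can be built from break points with IntervalArray.from_breaks. The break points are illustrative, and the intervals are closed on the right, which is the default.

```python
import pandas as pd

# Breaks 0, 1, 2, 3 yield the intervals (0, 1], (1, 2], (2, 3].
intervals = pd.arrays.IntervalArray.from_breaks([0, 1, 2, 3])

first = intervals[0]
print(first == pd.Interval(0, 1))
print(1 in first, 0 in first)  # right endpoint included, left excluded
```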
numpy.ndarray cannot natively represent integer data with missing values. pandas provides this through arrays.IntegerArray.
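The difference can be sketched as below: NumPy falls back to floating point when a missing value is present, while the nullable "Int64" dtype keeps the integers. The sample values are illustrative.

```python
import numpy as np
import pandas as pd

# NumPy has no integer NA, so the array becomes float64.
np_arr = np.array([1, 2, np.nan])

# pandas' nullable integer array keeps integers and uses NA.
int_arr = pd.array([1, 2, None], dtype="Int64")

print(np_arr.dtype, int_arr.dtype)
print(int_arr[2] is pd.NA)
print(int_arr.sum())  # NA is skipped by default
```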
pandas defines a custom data type for representing data that can take only a limited, fixed set of values. The dtype of a Categorical can be described by a CategoricalDtype.
Categorical data can be stored in a pandas.Categorical.
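A minimal sketch of constructing a Categorical directly. The labels and category list are illustrative.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "c", "a"], categories=["a", "b", "c"])

# Values are stored as integer codes that index into the categories.
print(cat.categories.tolist())
print(cat.codes.tolist())
```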
The alternative Categorical.from_codes() constructor can be used when you already have the categories and integer codes.
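For example, assuming the codes and categories are already at hand (the names "low", "mid", and "high" are made up for illustration):

```python
import pandas as pd

codes = [0, 1, 0, 2]
categories = ["low", "mid", "high"]

# Each code is a position in the categories list.
cat = pd.Categorical.from_codes(codes, categories=categories, ordered=True)

print(list(cat))
```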
The dtype information is available on the Categorical.
np.asarray(categorical) works by implementing the array interface. Be aware that this converts the Categorical back to a NumPy array, so categories and order information are not preserved!
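A short sketch of both points: the dtype carries the categories and ordering, and converting with np.asarray drops them. The sample values are illustrative.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["b", "a", "b"], categories=["a", "b", "c"], ordered=True)

# The CategoricalDtype holds the categories and the ordered flag.
print(cat.dtype.categories.tolist(), cat.dtype.ordered)

# The NumPy array keeps only the values; category metadata is gone.
as_numpy = np.asarray(cat)
print(type(as_numpy).__name__, as_numpy.tolist())
```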
A Categorical can be stored in a Series or DataFrame. To create a Series of dtype category, use cat = s.astype(dtype) or Series(..., dtype=dtype), where dtype is either the string 'category' or an instance of CategoricalDtype.
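Both ways of specifying the dtype can be sketched as follows. The data and the category list are illustrative.

```python
import pandas as pd

s = pd.Series(["a", "b", "a"])

# Option 1: the string alias; categories are inferred from the data.
by_alias = s.astype("category")

# Option 2: an explicit CategoricalDtype with chosen categories and order.
dtype = pd.CategoricalDtype(categories=["a", "b", "c"], ordered=True)
by_dtype = pd.Series(["a", "b", "a"], dtype=dtype)

print(by_alias.dtype)
print(by_dtype.dtype)
```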
If the Series is of dtype CategoricalDtype, Series.cat can be used to change the categorical data. See Categorical accessor for more.
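As an illustration of the accessor, the sketch below inspects and renames categories. The replacement names "alpha" and "beta" are hypothetical.

```python
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")

# Inspect the categories and the integer codes through .cat.
print(s.cat.categories.tolist())
print(s.cat.codes.tolist())

# rename_categories returns a new Series; s itself is unchanged.
renamed = s.cat.rename_categories({"a": "alpha", "b": "beta"})
print(renamed.tolist())
```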
More methods are available on Categorical.
Data where a single value is repeated many times (e.g. 0 or NaN) may be stored efficiently as an arrays.SparseArray.
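A minimal sketch with 0 as the repeated value. Only the entries that differ from the fill value are stored explicitly, and the data is illustrative.

```python
import pandas as pd

sparse = pd.arrays.SparseArray([0, 0, 1, 0, 0, 2], fill_value=0)

print(sparse.fill_value)          # the repeated value that is not stored
print(sparse.npoints)             # number of explicitly stored values
print(sparse.sp_values.tolist())  # the stored values themselves
```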
The Series.sparse accessor may be used to access sparse-specific attributes and methods if the Series contains sparse values. See Sparse accessor and the user guide for more.
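For instance, a Series holding a SparseArray exposes density and fill-value information through the accessor. The values are illustrative.

```python
import pandas as pd

s = pd.Series(pd.arrays.SparseArray([0, 0, 1, 0], fill_value=0))

print(s.sparse.density)      # fraction of values stored explicitly
print(s.sparse.fill_value)

# Convert back to a regular, densely stored Series.
dense = s.sparse.to_dense()
print(dense.tolist())
```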
When working with text data, where each valid element is a string or missing, we recommend using StringDtype (with the alias "string").
The Series.str accessor is available for Series backed by an arrays.StringArray. See String handling for more.
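A short sketch using the "string" alias and the .str accessor. The sample strings are illustrative, and the concrete backing array may differ depending on the storage your pandas installation selects.

```python
import pandas as pd

s = pd.Series(["a", None, "bc"], dtype="string")

print(s.dtype)
print(pd.isna(s[1]))  # the missing element stays missing

# String methods propagate missing values instead of raising.
upper = s.str.upper()
print(upper.tolist())
```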
The boolean dtype (with the alias "boolean") provides support for storing boolean data (True, False) with missing values, which is not possible with a bool numpy.ndarray.
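The contrast with NumPy can be sketched as below. The values are illustrative.

```python
import numpy as np
import pandas as pd

# With a missing value, NumPy cannot keep a bool dtype.
np_arr = np.array([True, False, None])
print(np_arr.dtype)

# The nullable boolean array stores True, False, and NA.
bool_arr = pd.array([True, False, None], dtype="boolean")
print(bool_arr.dtype, bool_arr[2] is pd.NA)

# Logical operations treat NA as unknown, so NA | True is True.
either = bool_arr | True
print(either.tolist())
```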