Note
IntegerArray is currently experimental. Its API or implementation may change without warning. Uses pandas.NA
as the missing value.
In Working with missing data, we saw that pandas primarily uses NaN
to represent missing data. Because NaN
is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter much. But if your integer column is, say, an identifier, casting to float can be problematic. Some integers cannot even be represented as floating point numbers.
pandas can represent integer data with possibly missing values using arrays.IntegerArray
. This is an extension type implemented within pandas.
In [1]: arr = pd.array([1, 2, None], dtype=pd.Int64Dtype()) In [2]: arr Out[2]: <IntegerArray> [1, 2, <NA>] Length: 3, dtype: Int64
Or the string alias "Int64"
(note the capital "I"
) to differentiate from NumPyâs 'int64'
dtype:
In [3]: pd.array([1, 2, np.nan], dtype="Int64") Out[3]: <IntegerArray> [1, 2, <NA>] Length: 3, dtype: Int64
All NA-like values are replaced with pandas.NA
.
In [4]: pd.array([1, 2, np.nan, None, pd.NA], dtype="Int64") Out[4]: <IntegerArray> [1, 2, <NA>, <NA>, <NA>] Length: 5, dtype: Int64
This array can be stored in a DataFrame
or Series
like any NumPy array.
In [5]: pd.Series(arr) Out[5]: 0 1 1 2 2 <NA> dtype: Int64
You can also pass the list-like object to the Series
constructor with the dtype.
Warning
Currently pandas.array()
and pandas.Series()
use different rules for dtype inference. pandas.array()
will infer a nullable-integer dtype
In [6]: pd.array([1, None]) Out[6]: <IntegerArray> [1, <NA>] Length: 2, dtype: Int64 In [7]: pd.array([1, 2]) Out[7]: <IntegerArray> [1, 2] Length: 2, dtype: Int64
For backwards-compatibility, Series
infers these as either integer or float dtype.
In [8]: pd.Series([1, None]) Out[8]: 0 1.0 1 NaN dtype: float64 In [9]: pd.Series([1, 2]) Out[9]: 0 1 1 2 dtype: int64
We recommend explicitly providing the dtype to avoid confusion.
In [10]: pd.array([1, None], dtype="Int64") Out[10]: <IntegerArray> [1, <NA>] Length: 2, dtype: Int64 In [11]: pd.Series([1, None], dtype="Int64") Out[11]: 0 1 1 <NA> dtype: Int64
In the future, we may provide an option for Series
to infer a nullable-integer dtype.
Operations involving an integer array will behave similar to NumPy arrays. Missing values will be propagated, and the data will be coerced to another dtype if needed.
In [12]: s = pd.Series([1, 2, None], dtype="Int64") # arithmetic In [13]: s + 1 Out[13]: 0 2 1 3 2 <NA> dtype: Int64 # comparison In [14]: s == 1 Out[14]: 0 True 1 False 2 <NA> dtype: boolean # slicing operation In [15]: s.iloc[1:3] Out[15]: 1 2 2 <NA> dtype: Int64 # operate with other dtypes In [16]: s + s.iloc[1:3].astype("Int8") Out[16]: 0 <NA> 1 4 2 <NA> dtype: Int64 # coerce when needed In [17]: s + 0.01 Out[17]: 0 1.01 1 2.01 2 <NA> dtype: Float64
These dtypes can operate as part of a DataFrame
.
In [18]: df = pd.DataFrame({"A": s, "B": [1, 1, 3], "C": list("aab")}) In [19]: df Out[19]: A B C 0 1 1 a 1 2 1 a 2 <NA> 3 b In [20]: df.dtypes Out[20]: A Int64 B int64 C object dtype: object
These dtypes can be merged, reshaped & casted.
In [21]: pd.concat([df[["A"]], df[["B", "C"]]], axis=1).dtypes Out[21]: A Int64 B int64 C object dtype: object In [22]: df["A"].astype(float) Out[22]: 0 1.0 1 2.0 2 NaN Name: A, dtype: float64
Reduction and groupby operations such as sum()
work as well.
In [23]: df.sum(numeric_only=True) Out[23]: A 3 B 5 dtype: Int64 In [24]: df.sum() Out[24]: A 3 B 5 C aab dtype: object In [25]: df.groupby("B").A.sum() Out[25]: B 1 3 3 0 Name: A, dtype: Int64Scalar NA Value#
arrays.IntegerArray
uses pandas.NA
as its scalar missing value. Slicing a single element thatâs missing will return pandas.NA
In [26]: a = pd.array([1, None], dtype="Int64") In [27]: a[1] Out[27]: <NA>
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4