PyMongoArrow supports a majority of the BSON types. Because Arrow and Polars provide first-class support for Lists and Structs, this includes embedded arrays and documents.
Support for additional types will be added in subsequent releases.
BSON Type
Type Identifiers
String
py.str
Instance of pyarrow.string
Embedded document
py.dict
Instance of pyarrow.struct
Embedded array
Instance of pyarrow.list_
ObjectId
py.bytes
bson.ObjectId
Instance of pymongoarrow.types.ObjectIdType
Instance of pymongoarrow.pandas_types.PandasObjectId
Decimal128
bson.Decimal128
Instance of pymongoarrow.types.Decimal128Type
Instance of pymongoarrow.pandas_types.PandasDecimal128
Instance of pyarrow.Decimal128
Boolean
~py.bool
Instance of ~pyarrow.bool_
64-bit binary floating point
py.float
Instance of pyarrow.float64
32-bit integer
Instance of pyarrow.int32
64-bit integer
~py.int
bson.int64.Int64
Instance of pyarrow.int64
UTC datetime
py.datetime.datetime
Instance of ~pyarrow.timestamp
with ms
resolution
Binary data
bson.Binary
Instance of pymongoarrow.types.BinaryType
Instance of pymongoarrow.pandas_types.PandasBinary
JavaScript code
bson.Code
Instance of pymongoarrow.types.CodeType
Instance of pymongoarrow.pandas_types.PandasCode
null
Instance of pyarrow.null
PyMongoArrow supports Decimal128
on only little-endian systems. On big-endian systems, it uses null
instead.
Use type identifiers to specify that a field is of a certain type during pymongoarrow.api.Schema
declaration. For example, if your data has fields f1
and f2
bearing types 32-bit integer and UTC datetime, and an _id
that is an ObjectId
, you can define your schema as follows:
schema = Schema({ '_id': ObjectId, 'f1': pyarrow.int32(), 'f2': pyarrow.timestamp('ms')})
Unsupported data types in a schema cause a ValueError
identifying the field and its data type.
The schema used for an embedded array must use the pyarrow.list_()
type, to specify the type of the array elements. For example,
from pyarrow import list_, float64schema = Schema({'_id': ObjectId, 'location': {'coordinates': list_(float64())}})
PyMongoArrow implements the ObjectId
, Decimal128
, Binary data
, and JavaScript code
types as extension types for PyArrow and Pandas. For arrow tables, values of these types have the appropriate pymongoarrow
extension type, such as pymongoarrow.types.ObjectIdType
. You can obtain the appropriate bson
Python object by using the .as_py()
method, or by calling .to_pylist()
on the table.
>>> from pymongo import MongoClient>>> from bson import ObjectId>>> from pymongoarrow.api import find_arrow_all>>> client = MongoClient()>>> coll = client.test.test>>> coll.insert_many([{"_id": ObjectId(), "foo": 100}, {"_id": ObjectId(), "foo": 200}])<pymongo.results.InsertManyResult at 0x1080a72b0>>>> table = find_arrow_all(coll, {})>>> tablepyarrow.Table_id: extension<arrow.py_extension_type<ObjectIdType>>foo: int32----_id: [[64408B0D5AC9E208AF220142,64408B0D5AC9E208AF220143]]foo: [[100,200]]>>> table["_id"][0]<pyarrow.ObjectIdScalar: ObjectId('64408b0d5ac9e208af220142')>>>> table["_id"][0].as_py()ObjectId('64408b0d5ac9e208af220142')>>> table.to_pylist()[{'_id': ObjectId('64408b0d5ac9e208af220142'), 'foo': 100}, {'_id': ObjectId('64408b0d5ac9e208af220143'), 'foo': 200}]
When converting to pandas, the extension type columns have an appropriate pymongoarrow
extension type, such as pymongoarrow.pandas_types.PandasDecimal128
. The value of the element in the dataframe is the appropriate bson
type.
>>> from pymongo import MongoClient>>> from bson import Decimal128>>> from pymongoarrow.api import find_pandas_all>>> client = MongoClient()>>> coll = client.test.test>>> coll.insert_many([{"foo": Decimal128("0.1")}, {"foo": Decimal128("0.1")}])<pymongo.results.InsertManyResult at 0x1080a72b0>>>> df = find_pandas_all(coll, {})>>> df _id foo0 64408bf65ac9e208af220144 0.11 64408bf65ac9e208af220145 0.1>>> df["foo"].dtype<pymongoarrow.pandas_types.PandasDecimal128 at 0x11fe0ae90>>>> df["foo"][0]Decimal128('0.1')>>> df["_id"][0]ObjectId('64408bf65ac9e208af220144')
Polars does not support Extension Types.
In Arrow and Polars, all Arrays are nullable. Pandas has experimental nullable data types, such as Int64
. You can instruct Arrow to create a pandas DataFrame using nullable dtypes with the following Apache documentation code.
>>> dtype_mapping = {... pa.int8(): pd.Int8Dtype(),... pa.int16(): pd.Int16Dtype(),... pa.int32(): pd.Int32Dtype(),... pa.int64(): pd.Int64Dtype(),... pa.uint8(): pd.UInt8Dtype(),... pa.uint16(): pd.UInt16Dtype(),... pa.uint32(): pd.UInt32Dtype(),... pa.uint64(): pd.UInt64Dtype(),... pa.bool_(): pd.BooleanDtype(),... pa.float32(): pd.Float32Dtype(),... pa.float64(): pd.Float64Dtype(),... pa.string(): pd.StringDtype(),... }... df = arrow_table.to_pandas(... types_mapper=dtype_mapping.get, split_blocks=True, self_destruct=True... )... del arrow_table
Defining a conversion for pa.string()
also converts Arrow strings to NumPy strings, and not objects.
Pending ARROW-179 , extension types, such as ObjectId
, that appear in nested documents are not converted to the corresponding PyMongoArrow extension type, but instead have the raw Arrow type, FixedSizeBinaryType(fixed_size_binary[12])
.
These values can be consumed as-is, or converted individually to the desired extension type, such as _id = out['nested'][0]['_id'].cast(ObjectIdType())
.
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4