This tutorial is intended as an introduction to working with PyMongoArrow. The tutorial assumes the reader is familiar with basic PyMongo and MongoDB concepts.
Ensure that you have the PyMongoArrow distribution installed. In the Python shell, the following should run without raising an exception:
>>> import pymongoarrow as pma
This tutorial also assumes that a MongoDB instance is running on the default host and port. After you have downloaded and installed MongoDB, ensure that a mongod instance is running before you continue.
The pymongoarrow.monkey module provides an interface to patch PyMongo in place, adding PyMongoArrow functionality directly to Collection instances:
from pymongoarrow.monkey import patch_all

patch_all()
After you run the monkey.patch_all() method, new instances of the Collection class contain the PyMongoArrow APIs, such as the pymongoarrow.api.find_pandas_all() method.
You can also use any of the PyMongoArrow APIs by importing them from the pymongoarrow.api module. If you do, you must pass the instance of the Collection on which the operation is to be run as the first argument when calling the API method.
The following code uses PyMongo to add sample data to your cluster:
from datetime import datetime
from pymongo import MongoClient

client = MongoClient()
client.db.data.insert_many([
    {'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1),
     'account': {'name': 'Customer1', 'account_number': 1}, 'txns': ['A']},
    {'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11),
     'account': {'name': 'Customer2', 'account_number': 2}, 'txns': ['A', 'B']},
    {'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9),
     'account': {'name': 'Customer3', 'account_number': 3}, 'txns': ['A', 'B', 'C']},
    {'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31),
     'account': {'name': 'Customer4', 'account_number': 4}, 'txns': ['A', 'B', 'C', 'D']},
])
PyMongoArrow relies on a data schema to marshal query result sets into tabular form. If you don't provide this schema, PyMongoArrow infers one from the data. You can define the schema by creating a Schema object and mapping the field names to type-specifiers, as shown in the following example:
from datetime import datetime
from pymongoarrow.api import Schema

schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime})
MongoDB uses embedded documents to represent nested data. PyMongoArrow offers first-class support for these documents:
schema = Schema({
    '_id': int,
    'amount': float,
    'account': {
        'name': str,
        'account_number': int,
    },
})
PyMongoArrow also supports lists and nested lists:
from pyarrow import list_, string

schema = Schema({'txns': list_(string())})
polars_df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema)
Tip
PyMongoArrow includes multiple permissible type-identifiers for each supported BSON type. For a full list of these data types and their associated type-identifiers, see Data Types.
The following code example shows how to load all records that have a non-zero value for the amount field as a pandas.DataFrame object:
df = client.db.data.find_pandas_all({'amount': {'$gt': 0}}, schema=schema)
You can also load the same result set as a pyarrow.Table instance:
arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)
Or as a polars.DataFrame instance:
df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema)
Or as a dictionary of NumPy arrays:
ndarrays = client.db.data.find_numpy_all({'amount': {'$gt': 0}}, schema=schema)
When using NumPy, the return value is a dictionary where the keys are field names and the values are the corresponding numpy.ndarray instances.
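A minimal sketch of that return shape, using a hand-built dictionary in place of an actual find_numpy_all() result:

```python
import numpy as np

# Hypothetical stand-in for a find_numpy_all() result: a dict mapping
# field names to equal-length NumPy arrays (one per column).
ndarrays = {
    '_id': np.array([1, 2, 3]),
    'amount': np.array([21.0, 16.0, 3.0]),
}

# Each column is an ordinary ndarray, so NumPy operations apply directly.
total = ndarrays['amount'].sum()
print(total)  # 40.0
```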
In all of the preceding examples, you can omit the schema as shown in the following example:
arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}})
If you omit the schema, PyMongoArrow tries to automatically apply a schema based on the data contained in the first batch.
Running an aggregate operation is similar to running a find operation, but it takes a pipeline: a sequence of aggregation stages to perform.
The following is a simple example of the aggregate_pandas_all() method that outputs a new dataframe in which all _id values are grouped together and their amount values summed:
df = client.db.data.aggregate_pandas_all([
    {'$group': {'_id': None, 'total_amount': {'$sum': '$amount'}}},
])
You can also run aggregate operations on embedded documents. The following example unwinds values in the nested txns field, counts the number of each value, then returns the results as a dictionary of NumPy ndarray objects, sorted in descending order:
pipeline = [
    {'$unwind': '$txns'},
    {'$group': {'_id': '$txns', 'count': {'$sum': 1}}},
    {'$sort': {'count': -1}},
]
ndarrays = client.db.data.aggregate_numpy_all(pipeline)
You can use the write() method to write objects of the following types to MongoDB:
Arrow Table
Pandas DataFrame
NumPy ndarray
Polars DataFrame
from pymongoarrow.api import write
from pymongo import MongoClient

coll = MongoClient().db.my_collection
write(coll, df)
write(coll, arrow_table)
write(coll, ndarrays)
Note
NumPy arrays are specified as dict[str, ndarray].
The write() method optionally accepts a boolean exclude_none parameter. If you set this parameter to True, the driver doesn't write empty values to the database. If you set this parameter to False or omit it, the driver writes None for each empty field.
The code in the following example writes an Arrow Table to MongoDB twice. One of the values in the 'b' field is set to None.
The first call to the write() method omits the exclude_none parameter, so it defaults to False. All values in the Table, including None, are written to the database. The second call to the write() method sets exclude_none to True, so the empty value in the 'b' field is ignored.
from pyarrow import Table

data_a = [1, 2, 3]
data_b = [1, None, 3]
data = Table.from_pydict(
    {
        "a": data_a,
        "b": data_b,
    },
)

coll.drop()
write(coll, data)
col_data = list(coll.find({}))

coll.drop()
write(coll, data, exclude_none=True)
col_data_exclude_none = list(coll.find({}))

print(col_data)
print(col_data_exclude_none)
{'_id': ObjectId('...'), 'a': 1, 'b': 1}
{'_id': ObjectId('...'), 'a': 2, 'b': None}
{'_id': ObjectId('...'), 'a': 3, 'b': 3}
{'_id': ObjectId('...'), 'a': 1, 'b': 1}
{'_id': ObjectId('...'), 'a': 2}
{'_id': ObjectId('...'), 'a': 3, 'b': 3}
After you load a result set, you can write it to any format that the package supports.
For example, to write the table referenced by the variable arrow_table to a Parquet file named example.parquet, run the following code:
import pyarrow.parquet as pq

pq.write_table(arrow_table, 'example.parquet')
Pandas also supports writing DataFrame instances to a variety of formats, including CSV and HDF. To write the data frame referenced by the variable df to a CSV file named out.csv, run the following code:
df.to_csv('out.csv', index=False)
The Polars API is a mix of the two preceding examples:
import polars as pl

df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]})
df.write_parquet('example.parquet')
Note
Nested data is supported for Parquet read and write operations, but is not well supported by Arrow or Pandas for CSV read and write operations.