This tutorial is intended as an introduction to working with PyMongoArrow. The tutorial assumes the reader is familiar with basic PyMongo and MongoDB concepts.
Ensure that you have the PyMongoArrow distribution installed. In the Python shell, the following should run without raising an exception:
>>> import pymongoarrow as pma
This tutorial also assumes that a MongoDB instance is running on the default host and port. After you have downloaded and installed MongoDB, ensure that a mongod instance is running before you continue.
The pymongoarrow.monkey module provides an interface to patch PyMongo in place, adding PyMongoArrow functionality directly to Collection instances:
from pymongoarrow.monkey import patch_all

patch_all()
After you run the monkey.patch_all() method, new instances of the Collection class contain the PyMongoArrow APIs, such as the pymongoarrow.api.find_pandas_all() method.
You can also use any of the PyMongoArrow APIs by importing them from the pymongoarrow.api module. If you do, you must pass the instance of the Collection on which the operation is to be run as the first argument when calling the API method.
The following code uses PyMongo to add sample data to your cluster:
from datetime import datetime
from pymongo import MongoClient

client = MongoClient()
client.db.data.insert_many([
    {'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1),
     'account': {'name': 'Customer1', 'account_number': 1}, 'txns': ['A']},
    {'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11),
     'account': {'name': 'Customer2', 'account_number': 2}, 'txns': ['A', 'B']},
    {'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9),
     'account': {'name': 'Customer3', 'account_number': 3}, 'txns': ['A', 'B', 'C']},
    {'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31),
     'account': {'name': 'Customer4', 'account_number': 4}, 'txns': ['A', 'B', 'C', 'D']},
])
PyMongoArrow relies on a data schema to marshal query result sets into tabular form. If you don't provide this schema, PyMongoArrow infers one from the data. You can define the schema by creating a Schema object and mapping the field names to type-specifiers, as shown in the following example:
from datetime import datetime
from pymongoarrow.api import Schema

schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime})
MongoDB uses embedded documents to represent nested data. PyMongoArrow offers first-class support for these documents:
schema = Schema({
    '_id': int,
    'amount': float,
    'account': {
        'name': str,
        'account_number': int,
    },
})
PyMongoArrow also supports lists and nested lists:
from pyarrow import list_, string

schema = Schema({'txns': list_(string())})
polars_df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema)
Tip
PyMongoArrow includes multiple permissible type-identifiers for each supported BSON type. For a full list of these data types and their associated type-identifiers, see Data Types.
The following code example shows how to load all records that have a non-zero value for the amount field as a pandas.DataFrame object:
df = client.db.data.find_pandas_all({'amount': {'$gt': 0}}, schema=schema)
You can also load the same result set as a pyarrow.Table instance:
arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)
Or as a polars.DataFrame instance:
df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema)
Or as a dictionary of NumPy arrays:
ndarrays = client.db.data.find_numpy_all({'amount': {'$gt': 0}}, schema=schema)
When using NumPy, the return value is a dictionary where the keys are field names and the values are the corresponding numpy.ndarray instances.
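A minimal sketch of that return shape, using a hand-built dictionary in place of an actual find_numpy_all() result:

```python
import numpy as np

# Hypothetical stand-in for a find_numpy_all() result: a dict mapping
# field names to equal-length NumPy arrays (one per column).
ndarrays = {
    '_id': np.array([1, 2, 3]),
    'amount': np.array([21.0, 16.0, 3.0]),
}

# Each column is an ordinary ndarray, so NumPy operations apply directly.
total = ndarrays['amount'].sum()
print(total)  # 40.0
```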
In all of the preceding examples, you can omit the schema as shown in the following example:
arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}})
If you omit the schema, PyMongoArrow tries to automatically apply a schema based on the data contained in the first batch.
Running an aggregate operation is similar to running a find operation, but it takes a pipeline: a sequence of aggregation stages to perform.
The following is a simple example of the aggregate_pandas_all() method that outputs a new dataframe in which all _id values are grouped together and their amount values summed:
df = client.db.data.aggregate_pandas_all([
    {'$group': {'_id': None, 'total_amount': {'$sum': '$amount'}}},
])
You can also run aggregate operations on embedded documents. The following example unwinds values in the nested txns field, counts the number of each value, then returns the results as a dictionary of NumPy ndarray objects, sorted in descending order:
pipeline = [
    {'$unwind': '$txns'},
    {'$group': {'_id': '$txns', 'count': {'$sum': 1}}},
    {'$sort': {'count': -1}},
]
ndarrays = client.db.data.aggregate_numpy_all(pipeline)
You can use the write() method to write objects of the following types to MongoDB:
Arrow Table
Pandas DataFrame
NumPy ndarray
Polars DataFrame
from pymongoarrow.api import write
from pymongo import MongoClient

coll = MongoClient().db.my_collection
write(coll, df)
write(coll, arrow_table)
write(coll, ndarrays)
Note
NumPy arrays are specified as dict[str, ndarray].
The write() method optionally accepts a boolean exclude_none parameter. If you set this parameter to True, the driver doesn't write empty values to the database. If you set this parameter to False or omit it, the driver writes None for each empty field.
The code in the following example writes an Arrow Table to MongoDB twice. One of the values in the 'b' field is set to None.
The first call to the write() method omits the exclude_none parameter, so it defaults to False. All values in the Table, including None, are written to the database. The second call to the write() method sets exclude_none to True, so the empty value in the 'b' field is ignored.
from pyarrow import Table

data_a = [1, 2, 3]
data_b = [1, None, 3]
data = Table.from_pydict(
    {
        "a": data_a,
        "b": data_b,
    },
)

coll.drop()
write(coll, data)
col_data = list(coll.find({}))

coll.drop()
write(coll, data, exclude_none=True)
col_data_exclude_none = list(coll.find({}))

print(col_data)
print(col_data_exclude_none)
{'_id': ObjectId('...'), 'a': 1, 'b': 1}
{'_id': ObjectId('...'), 'a': 2, 'b': None}
{'_id': ObjectId('...'), 'a': 3, 'b': 3}
{'_id': ObjectId('...'), 'a': 1, 'b': 1}
{'_id': ObjectId('...'), 'a': 2}
{'_id': ObjectId('...'), 'a': 3, 'b': 3}
After you load a result set, you can write it to any format that the package supports.
For example, to write the table referenced by the variable arrow_table to a Parquet file named example.parquet, run the following code:
import pyarrow.parquet as pq

pq.write_table(arrow_table, 'example.parquet')
Pandas also supports writing DataFrame instances to a variety of formats, including CSV and HDF. To write the data frame referenced by the variable df to a CSV file named out.csv, run the following code:
df.to_csv('out.csv', index=False)
The Polars API is a mix of the two preceding examples:
import polars as pl

df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]})
df.write_parquet('example.parquet')
Note
Nested data is supported for Parquet read and write operations, but is not well supported by Arrow or Pandas for CSV read and write operations.