
English SDK for Apache Spark

note

This article covers the English SDK for Apache Spark, which is not directly supported by Databricks. To provide feedback, ask questions, and report issues, use the Issues tab in the English SDK for Apache Spark repository on GitHub.

The English SDK for Apache Spark takes English instructions and compiles them into Spark objects. Its goal is to make Spark more user-friendly and accessible, so that you can focus on extracting insights from your data.

The following example describes how to use a Databricks Python notebook to call the English SDK for Apache Spark. The example uses a plain English question to guide the English SDK for Apache Spark to run a SQL query on a table in your Databricks workspace.

Requirements

This walkthrough assumes that you have a Databricks Python notebook attached to a compute resource and an OpenAI API key (set in Step 3). The query in Step 5 reads the samples.nyctaxi.trips table from your workspace.

Step 1: Install the Python package for the English SDK for Apache Spark

In the notebook's first cell, run the following code, which installs the latest version of the Python package for the English SDK for Apache Spark on the attached compute resource:

%pip install pyspark-ai --upgrade

Step 2: Restart the Python kernel to use the updated package

In the notebook's second cell, run the following code, which restarts the Python kernel to use the updated Python package for the English SDK for Apache Spark and its updated package dependencies:

Python

dbutils.library.restartPython()

Step 3: Set your OpenAI API key

In the notebook's third cell, run the following code, which sets an environment variable named OPENAI_API_KEY to the value of your OpenAI API key. The English SDK for Apache Spark uses this OpenAI API key to authenticate with OpenAI. Replace <your-openai-api-key> with the value of your OpenAI API key:

Python

import os

os.environ['OPENAI_API_KEY'] = '<your-openai-api-key>'

important

In this example, for speed and ease of use, you hard-code your OpenAI API key into the notebook. In production scenarios, it is a security best practice not to hard-code your OpenAI API key into your notebooks. One alternative approach is to set this environment variable on the attached cluster. See Environment variables.
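
Another common alternative, not shown in this article, is to store the key in a Databricks secret scope and read it with Databricks Utilities. The following sketch assumes a secret scope named openai-scope that contains a key named api-key; both names are placeholders that you would replace with your own:

Python

import os

# Read the OpenAI API key from a Databricks secret scope instead of hard-coding it.
# The scope and key names below are placeholders for this illustration.
os.environ['OPENAI_API_KEY'] = dbutils.secrets.get(scope='openai-scope', key='api-key')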

Step 4: Set and activate the LLM

In the notebook's fourth cell, run the following code, which sets the LLM that you want the English SDK for Apache Spark to use and then activates the SDK with the selected model. This example uses GPT-4. By default, the English SDK for Apache Spark looks for an environment variable named OPENAI_API_KEY and uses its value to authenticate with OpenAI and to access GPT-4:

Python

from langchain.chat_models import ChatOpenAI
from pyspark_ai import SparkAI

chatOpenAI = ChatOpenAI(model='gpt-4')

spark_ai = SparkAI(llm=chatOpenAI)
spark_ai.activate()

tip

Because GPT-4 is the default LLM, you can simplify this code as follows:

Python

from pyspark_ai import SparkAI

spark_ai = SparkAI()
spark_ai.activate()

Step 5: Create a source DataFrame

In the notebook's fifth cell, run the following code, which selects all of the data in the samples.nyctaxi.trips table from your Databricks workspace and stores this data in a DataFrame that is optimized to work with the English SDK for Apache Spark. This DataFrame is represented here by the variable df:

Python

df = spark_ai._spark.sql("SELECT * FROM samples.nyctaxi.trips")
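
The English SDK for Apache Spark repository also describes a create_df helper on SparkAI that asks the LLM to build a DataFrame from an English description or a URL. The sketch below is illustrative only; the exact method signature and behavior may differ, so verify it against the repository before relying on it:

Python

# Illustrative only: ask the LLM to create a DataFrame from an English description.
# Verify create_df and its arguments against the English SDK for Apache Spark repository.
auto_df = spark_ai.create_df("2022 USA national auto sales by brand")
auto_df.show()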

Step 6: Query the DataFrame by using a plain English question

In the notebook's sixth cell, run the following code, which asks the English SDK for Apache Spark to print the average trip distance, to the nearest tenth, for each day during January of 2016.

Python

df.ai.transform("What was the average trip distance for each day during the month of January 2016? Print the averages to the nearest tenth.").display()

The English SDK for Apache Spark prints its analysis and final answer as follows:

> Entering new AgentExecutor chain...
Thought: This can be achieved by using the date function to extract the date from the timestamp and then grouping by the date.
Action: query_validation
Action Input: SELECT DATE(tpep_pickup_datetime) as pickup_date, ROUND(AVG(trip_distance), 1) as avg_trip_distance FROM spark_ai_temp_view_2a0572 WHERE MONTH(tpep_pickup_datetime) = 1 AND YEAR(tpep_pickup_datetime) = 2016 GROUP BY pickup_date ORDER BY pickup_date
Observation: OK
Thought:I now know the final answer.
Final Answer: SELECT DATE(tpep_pickup_datetime) as pickup_date, ROUND(AVG(trip_distance), 1) as avg_trip_distance FROM spark_ai_temp_view_2a0572 WHERE MONTH(tpep_pickup_datetime) = 1 AND YEAR(tpep_pickup_datetime) = 2016 GROUP BY pickup_date ORDER BY pickup_date

> Finished chain.

The English SDK for Apache Spark runs its final answer and prints the results as follows:

+-----------+-----------------+
|pickup_date|avg_trip_distance|
+-----------+-----------------+
| 2016-01-01| 3.1|
| 2016-01-02| 3.0|
| 2016-01-03| 3.2|
| 2016-01-04| 3.0|
| 2016-01-05| 2.6|
| 2016-01-06| 2.6|
| 2016-01-07| 3.0|
| 2016-01-08| 2.9|
| 2016-01-09| 2.8|
| 2016-01-10| 3.0|
| 2016-01-11| 2.8|
| 2016-01-12| 2.9|
| 2016-01-13| 2.7|
| 2016-01-14| 3.3|
| 2016-01-15| 3.0|
| 2016-01-16| 3.0|
| 2016-01-17| 2.7|
| 2016-01-18| 2.9|
| 2016-01-19| 3.1|
| 2016-01-20| 2.8|
+-----------+-----------------+
only showing top 20 rows
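
The transform method is just one of the DataFrame helpers that the English SDK for Apache Spark adds when you activate it. The project repository also describes plot, explain, and verify helpers; the following calls are a rough sketch of how the repository documents them, and you should confirm the exact usage there:

Python

# Rough sketch of other DataFrame helpers described in the English SDK for
# Apache Spark repository; confirm the exact usage in the repository.
df.ai.plot("a line chart of the daily average trip distance in January 2016")
df.ai.explain()  # Asks the LLM to explain, in English, what the DataFrame computes.
df.ai.verify("expect no trips to have a negative trip_distance")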

Next steps

Additional resources
