Starting here? This lesson is part of a full-length tutorial in using Python for Data Analysis. Check out the beginning.
Goals of this lessonIn this lesson, you'll learn how to create and use a DataFrame, a Python data structure that is similar to a database or spreadsheet table. You'll learn how to:
DataFrame
DataFrame
DataFrame
The Python data analysis tools that you'll learn throughout this tutorial are very useful, but they become immensely valuable when they are applied to real data (and real problems). In this lesson, you'll be using tools from pandas, one of the go-to libraries for data manipulation, to conduct analysis of web traffic, which can help drive valuable decisions for a business.
Pandas DataFramesPandas has a few powerful data structures:
A DataFrame is a table much like in SQL or Excel. It's similar in structure, too, making it possible to use similar operations such as aggregation, filtering, and pivoting. However, because DataFrames are built in Python, it's possible to use Python to program more advanced operations and manipulations than SQL and Excel can offer. As a bonus, the creators of pandas have focused on making the DataFrame operate very quickly, even over large datasets.
DataFrames are particularly useful because powerful methods are built into them. In Python, methods are associated with objects, so you need your data to be in the DataFrame to use these methods. DataFrames can load data through a number of different data structures and files, including lists and dictionaries, csv files, excel files, and database records (more on that here).
Loading data into a Mode Python NotebookMode is an analytics platform that brings together a SQL editor, Python notebook, and data visualization builder. Throughout this tutorial, you can use Mode for free to practice writing and running Python code.
For this lesson, you’ll be using web traffic data from Watsi, an organization that allows people to fund healthcare costs for people around the world. To access the data, you’ll need to use a bit of SQL. Here’s how:
datasets[0].head(n=5)
. Run this code so you can see the first five rows of the dataset.datasets
is a list object. Nested inside this list is a DataFrame containing the results generated by the SQL query you wrote. To learn more about how to access SQL queries in Mode Python Notebooks, read this documentation.
Now you’re all ready to go.
Creating a Pandas DataFrame Prepping a DataFrameIn Mode Python Notebooks, the first cell is automatically populated with the following code to access the data produced by the SQL query:
datasets[0].head(n=5)
The datasets
object is a list, where each item is a DataFrame corresponding to one of the SQL queries in the Mode report. So datasets[0]
is a dataframe object within the datasets
list. You can see that the above command produces a table showing the first 5 rows of the results of your SQL query.
Mode is able to do this because it comes pre-loaded with pandas. Still, you should get in the habit of giving librarias aliases so that you can refer to them easily throughout your code. Pandas is typically aliased as pd
:
You should also assign the DataFrame as a variable. Since you'll only be working with one DataFrame in this lesson, you can keep it simple and just call it data
:
Input
data = datasets[0] # assign SQL query results to the data variable
One final step before we're ready to start analysis: text cleanup. There are a few missing values in this dataset (in SQL you'd refer to them as null
). For the sake of making this easier to look at, use the fillna()
method to replace missing values with empty strings:
Input
data = data.fillna('') # replace missing values with strings for easier text processing
About this dataset
As mentioned above, in this lesson you'll be working with web traffic data from a nonprofit called Watsi. Every row in this dataset corresponds to a person visiting a watsi.org page (this is known as a pageview). The general flow of pageviews is referred to as web traffic.
Every pageview (row in the dataset) is composed of:
'referrer'
The url that referred the user to the site (if available). For example, if someone arrived at the page through a Facebook link, referrer
would be https://www.facebook.com'timestamp'
The time the event occurred'title'
The title of the page the user visited on the Watsi website'url'
The url the user visited. For example, https://watsi.org/team/the-meteor-chef'user_agent'
The software the user used to accessed the site, including platform, browser, and extensions'user_id'
A unique id for each user (normally they’d be numbers—we've turned them into anonymous names instead)'referrer_domain'
The domain of the url that referred the user to the site. For example, "facebook.com"'website_section'
The section of the website visited. For example, the section of https://watsi.org/team/the-meteor-chef is "team"'platform'
The device platform the user visited from. Possible values are "Desktop"
and "Mobile"
Through their website, Watsi enables direct funding of medical care. Take the time to understand what that looks like in practice. Visit some of the URLs you see in this dataset to familiarize yourself with the structure of the site and content, such as Mary's patient profile. Google Watsi and consider why people might engage with the service. Context is important - it'll help you make educated inferences in your analysis of the data.
Data samplingThis dataset contains 5,000 rows, which were sampled from a 500,000 row dataset spanning the same time period. Throughout these analyses, the number of events you count will be about 100 times smaller than they actually were, but the proportions of events will still generally be reflective of that larger dataset. In this case, a sample is fine because our purpose is to learn methods of data analysis with Python, not to create 100% accurate recommendations to Watsi.
Selecting columns in a DataFrameAs you learned in the previous lesson, you can select a value in a list or dictionary using brackets:
cities[0]
(gets item at place 0 in the list "cities")city_population['Tokyo']
(gets values associated with the key 'Tokyo'
in the dictionary city_population
)Similarly, you can use brackets to select a column in the DataFrame:
Output
0 https://watsi.org/
1 https://watsi.org/team/the-meteor-chef
2 https://watsi.org/gift-cards
3 https://watsi.org/
4 https://watsi.org/
Name: url, dtype: object
Selecting the column gives you access to the whole column, but will only show a preview. Below the column, the column name and data type (dtype) are printed for easy reference.
The url
column you got back has a list of numbers on the left. This is called the index, which uniquely identifies rows in the DataFrame. You will use the index to select individual rows, similar to how you selected rows from a list in an earlier lesson. A unique identifier is often necessary to refer to specific records in the dataset. For example, the DMV uses license plates to identify specific vehicles, instead of "Blue 1999 Honda Civic in California," which may or may not uniquely identify a car.
Selecting columns will be important to much of the analysis you do throughout the tutorials, especially in grouping and counting events.
Selecting rows in a DataFrameSelecting rows is useful for exploring the data and getting familiar with what values you might see. You can select rows by using brackets and row indexes. For example, you can select the first three rows of the DataFrame with the following code:
Output
referrer timestamp title url user_agent user_id referrer_domain website_section platform 0 https://www.google.com/ 2016-02-05 00:48:23 Watsi | Fund medical treatments for people aro... https://watsi.org/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4... CHAROLETTE S google Desktop 1 https://themeteorchef.com/snippets/making-use-... 2016-02-24 23:12:10 Watsi | The Meteor Chef https://watsi.org/team/the-meteor-chef Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... WARREN Q themeteorchef.com team Desktop 2 https://watsi.org/ 2015-12-25 17:59:35 Watsi | Give the gift of health with a Watsi G... https://watsi.org/gift-cards Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1... MITCHEL O watsi.org gift-cards DesktopThe ":3" between the brackets effectively means "up to index 3". Similarly, you could select everything "from index 4 up to (but not including) index 7":
Output
referrer timestamp title url user_agent user_id referrer_domain website_section platform 4 https://watsi.org/fund-treatments 2016-02-14 19:30:08 Watsi | Fund medical treatments for people aro... https://watsi.org/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2... ANDREE N watsi.org Desktop 5 http://blog.watsi.org/ 2015-10-15 06:04:40 Watsi | Fund a medical treatment on Watsi. 100... https://watsi.org/fund-treatments Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4... SHAREN Y watsi.org fund-treatments Desktop 6 2015-12-25 10:23:43 Watsi | Redeem your Watsi Gift Card https://watsi.org/redeem Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) G... BRICE Z redeem DesktopFinally, you can select "everything from index 4997 onward":
Output
referrer timestamp title url user_agent user_id referrer_domain website_section platform 4997 http://aplus.com/a/kid-has-flexible-neck?c=654... 2016-01-03 02:48:38 Watsi | Fund medical treatments for people aro... https://watsi.org/ Mozilla/5.0 (iPhone; CPU iPhone OS 7_0_4 like ... NOELLE P aplus.com iPhone 4998 https://watsi.org/fund-treatments?page=2 2016-02-07 23:47:53 Watsi | Success! Sarah from Kenya raised $1,12... https://watsi.org/profile/6705ce017f7e-sarah Mozilla/5.0 (iPad; CPU OS 9_2 like Mac OS X) A... JERICA F watsi.org profile iPad 4999 https://watsi.org/ 2015-11-17 16:38:25 Watsi | Fund a medical treatment on Watsi. 100... https://watsi.org/fund-treatments?page=4 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5... MARIANNA I watsi.org fund-treatments Desktop Selecting a specific rowTo select a specific row, you must use the .iloc
or .loc
method, with the row's index in brackets. Which to use depends on what you are trying to return.
.iloc
is integer index-based. If you .iloc[1]
, it will return to you the row at the 1st index regardless of the index’s name. .loc
however is label-based, meaning that .iloc[1]
will only return the row at the first index if "1" is the index’s label. In the case of this dataframe .iloc[1]
and .loc[1]
will return the same row.
Output
referrer https://themeteorchef.com/snippets/making-use-...
timestamp 2016-02-24 23:12:10
title Watsi | The Meteor Chef
url https://watsi.org/team/the-meteor-chef
user_agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...
user_id WARREN Q
referrer_domain themeteorchef.com
website_section team
platform Desktop
Name: 1, dtype: object
Output
referrer https://themeteorchef.com/snippets/making-use-...
timestamp 2016-02-24 23:12:10
title Watsi | The Meteor Chef
url https://watsi.org/team/the-meteor-chef
user_agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...
user_id WARREN Q
referrer_domain themeteorchef.com
website_section team
platform Desktop
Name: 1, dtype: object
This is different from selecting columns. When selecting a column, you'll use data[]
, and when selecting a row, you'll use data.iloc[]
or data.loc[]
. To learn more about the differences between .iloc
and .loc
, check out pandas documentation.
Just as you can select from rows or columns, you can also select from both rows and columns at the same time. For example, you can select the first three rows of the title
column by naming both the column and rows in square brackets:
Output
0 Watsi | Fund medical treatments for people aro...
1 Watsi | The Meteor Chef
2 Watsi | Give the gift of health with a Watsi G...
Name: title, dtype: object
Think about this as listing the row and column selections one after another. Putting together a column selection and a row selection:
data['title']
data[:3]
You get the combined selection:
data['title'][:3]
The brackets selecting the column and selecting the rows are separate, and the selections are applied from left to right (in this last example, the column is selected, then it is filtered down to the first 3 rows).
In fact, selecting the rows and then the column yields the same result:
Output
0 Watsi | Fund medical treatments for people aro...
1 Watsi | The Meteor Chef
2 Watsi | Give the gift of health with a Watsi G...
Name: title, dtype: object
Practice Problem
Select records from rows 10 to 15 in the 'referrer'
column.
There are at least two possible answers!
Input
data['referrer'][10:15] # select the column, then the rows
Output
10
11 https://www.facebook.com/
12 https://watsi.org/profile/ef9cce891dc4-anibal
13
14 https://watsi.org/donor/settings/payment
Name: referrer, dtype: object
Input
data[10:15]['referrer'] # select the rows, then the column
Output
10
11 https://www.facebook.com/
12 https://watsi.org/profile/ef9cce891dc4-anibal
13
14 https://watsi.org/donor/settings/payment
Name: referrer, dtype: object
Lesson summary:
In this lesson, you learned to:
DataFrame
with dataDataFrame
DataFrame
DataFrame
In the next lesson, you'll learn how to count values and plot a bar chart.
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4