Pretty Plotting Pandas

This is the last session/chapter of our Basic Python & Data Wrangling crash course. Make sure to finish the previous exercise and the one from this session before the next session begins. After that, we'll dive into actual machine learning concepts and how to apply them in code.

In this session we will use two Kaggle tutorials to get familiar with Pandas and with plotting in Seaborn (a high-level interface to Matplotlib).

You can work with the Jupyter notebooks live on Kaggle, but in the end make sure you can run the same code (except for the exercise checks) on your own machine.

After this session you should have the basic building blocks to solve exercise 2.

Demo example

The following script demonstrates how to calculate percentiles and how to create additional data frames based on those percentiles, using the data set “Radverkehrszählungen Wien” (bicycle traffic counts in Vienna).

import pandas as pd
import matplotlib.pyplot as plt

# This is our original data source
# uri = 'https://www.wien.gv.at/gogv/l9ogdradverkehrszaehlungen'
# For continuous working it is more feasible to download the file once,
# and work with the file, instead of creating an HTTP request every time
file = '/home/jackie/Downloads/radverkehrszaehlungen.csv'

# read in the data frame from the CSV file with an applicable encoding
# and also parse the dates as dates
df = pd.read_csv(file, delimiter=';', encoding='latin-1', parse_dates=[0])

# counts written with a dot as thousands separator (e.g. 1.234) were
# read as small floats, so scale them back up by a factor of 1000
df.loc[df.Donaukanal < 10, 'Donaukanal'] = df.Donaukanal * 1000

# calculate the percentiles
q1 = df.Donaukanal.quantile(0.1)
q9 = df.Donaukanal.quantile(0.9)

# then use those to get our three data frames
df10 = df.loc[df.Donaukanal <= q1]
df90 = df.loc[df.Donaukanal >= q9]
df_rest = df.loc[(df.Donaukanal > q1) & (df.Donaukanal < q9)]

# now produce the scatter plots
df10.plot.scatter(x='Datum', y='Donaukanal', title='<= 10th percentile', figsize=(17, 8))
df90.plot.scatter(x='Datum', y='Donaukanal', title='>= 90th percentile', figsize=(17, 8))
df_rest.plot.scatter(x='Datum', y='Donaukanal', title='main 80%', figsize=(17, 8))
# and display them
plt.show()
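
If you do not have the CSV file at hand, the same two steps (repairing the thousands notation and splitting by percentiles) can be tried on a small made-up data frame. The following is only a minimal sketch: the values and the names `toy`, `mask`, `p10`, `p90`, `low`, `middle` and `high` are invented for illustration and are not part of the original data set.

```python
import pandas as pd

# Synthetic stand-in for the bicycle counts (made-up values, not real data).
# The last two entries mimic counts written as "1.200" and "1.500" that
# were read as the small floats 1.2 and 1.5.
toy = pd.DataFrame({
    'Datum': pd.date_range('2020-01-01', periods=10, freq='D'),
    'Donaukanal': [100.0, 200.0, 300.0, 400.0, 500.0,
                   600.0, 700.0, 800.0, 1.2, 1.5],
})

# Scale only the rows selected by the boolean mask
mask = toy['Donaukanal'] < 10
toy.loc[mask, 'Donaukanal'] = toy.loc[mask, 'Donaukanal'] * 1000

# 10th and 90th percentile of the corrected column
p10 = toy['Donaukanal'].quantile(0.1)
p90 = toy['Donaukanal'].quantile(0.9)

# Split into three frames, using the same comparisons as the script above
low = toy[toy['Donaukanal'] <= p10]
high = toy[toy['Donaukanal'] >= p90]
middle = toy[(toy['Donaukanal'] > p10) & (toy['Donaukanal'] < p90)]

# Each row should land in exactly one of the three frames
print(len(low), len(middle), len(high), len(toy))
```

Because the three conditions do not overlap and together cover all values, the three lengths should add up to the length of the original frame. Checking this is a quick way to verify that no rows were lost or counted twice.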