Session 5: Pretty Plotting Pandas

This is the last session of our Basic Python & Data Wrangling crash course. Make sure to finish the previous exercise and the one from this session by the beginning of the next session. After that we’ll start to dive into actual machine learning concepts and how to apply them in code.

In this session we will use two Kaggle tutorials to get familiar with Pandas and with plotting in Seaborn (a high-level interface to Matplotlib):

You can work with the Jupyter notebooks live on Kaggle, but in the end make sure you can run the same code (except for the exercise checks) on your own machine.

Exercise 2 - Pretty plotting with Pandas

For this exercise you first have to find your own data set. I suggest looking for one on https://data.gv.at, maybe even one specifically related to Vienna, e.g. from the data sets published by the City of Vienna: https://www.data.gv.at/auftritte/?organisation=stadt-wien. In the end it is up to you which data set you use. It should be something that interests you and that you are curious to find out more about. You might keep working with this data set when we start to apply some machine learning.

Now create a script that uses Pandas to investigate this data set. Make sure the data is in a usable state: depending on the data set, some preparation might be needed before you can analyse it.
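What "preparation" means depends entirely on your data set. As a rough sketch only: the following uses a made-up, semicolon-separated sample with decimal commas and one missing value (the column names Datum, Bezirk and Wert are invented for illustration) to show a few typical first steps.

```python
import io
import pandas as pd

# Made-up sample standing in for a downloaded CSV file (not real data)
raw = io.StringIO(
    "Datum;Bezirk;Wert\n"
    "01.01.2023;Wieden;1,5\n"
    "02.01.2023;Neubau;\n"
    "03.01.2023;Wieden;2,25\n"
)

# Assumption: the file uses ';' as separator and ',' as decimal mark
df = pd.read_csv(raw, delimiter=';', decimal=',')

# Convert the date strings (day.month.year) into proper timestamps
df['Datum'] = pd.to_datetime(df['Datum'], format='%d.%m.%Y')

# Check the column types and count the missing values per column
print(df.dtypes)
print(df.isna().sum())

# One simple option: drop the rows where the value of interest is missing
df_clean = df.dropna(subset=['Wert'])
print(df_clean)
```

Dropping rows is just one way to deal with missing values; whether it is appropriate depends on what the gaps in your data mean.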

Decide which column is your index that should be used for printing and pick at least two further columns that seem specifically interesting to you. If any of those columns is not numeric, create an additional column and find a way to assign a numeric value based on the original column (how you transform it is up to you).
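Two possible ways to derive a numeric column from a text column are sketched here. The data frame and the column name Bezirk are made up for illustration; which transformation makes sense for your data is for you to decide.

```python
import pandas as pd

# Made-up example data; 'Bezirk' is a hypothetical text column
df = pd.DataFrame({'Bezirk': ['Wieden', 'Neubau', 'Wieden', 'Josefstadt']})

# Option 1: let Pandas number the distinct values in order of first appearance
codes, uniques = pd.factorize(df['Bezirk'])
df['Bezirk_code'] = codes

# Option 2: assign the numbers yourself when they carry a meaning
# (here: the district numbers of these three Viennese districts)
district_numbers = {'Wieden': 4, 'Neubau': 7, 'Josefstadt': 8}
df['Bezirk_nr'] = df['Bezirk'].map(district_numbers)

print(df)
```

With the first option the numbers are arbitrary labels, so statistics like the mean say little about them; the second option lets you choose values that are meaningful to compare.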

For the resulting columns print some general statistics:

min, max, and mean
the following percentiles: 10th, 25th, 75th, 90th
the standard deviation
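The statistics listed above can be computed directly on a Pandas column. A minimal sketch with made-up numbers (replace the series with one of your own columns):

```python
import pandas as pd

# Made-up numbers; replace with one of your own numeric columns
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], name='value')

# Individual statistics
print('min: ', values.min())
print('max: ', values.max())
print('mean:', values.mean())
print('std: ', values.std())  # sample standard deviation (ddof=1 by default)

# Several percentiles at once; the result is indexed by the quantile
percentiles = values.quantile([0.10, 0.25, 0.75, 0.90])
print(percentiles)

# describe() collects most of these numbers in a single call
summary = values.describe(percentiles=[0.10, 0.25, 0.75, 0.90])
print(summary)
```

Note that quantile() expects fractions between 0 and 1, so the 10th percentile is quantile(0.1).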

Then create one general plot for each of the two columns, visualising the data.

Additionally create the following plots for at least one of the two columns:

of all data points below the 10th percentile
of all data points above the 90th percentile
of all remaining data points

The type of plot you produce is up to you: choose whatever helps most to make sense of your specific data set. It might be helpful to play around with the plot types, and in some cases it might be useful to create two different plot types for the same data. In that case, add those additional plots.

Bonus challenge for the plotting pros: Create a plot combining all of the above data, where the different characteristics are encoded e.g. by color or, in the case of a scatter plot, by dot size.
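One possible approach to such a combined plot, sketched with randomly generated data: label every row with its percentile group and draw all groups into the same axes, one colour per group. The column names, group labels and colours are arbitrary choices for this illustration, not a required solution.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data: 100 days with random values (not real measurements)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'day': pd.date_range('2023-01-01', periods=100),
    'value': rng.integers(100, 1000, size=100),
})

p10 = df['value'].quantile(0.1)
p90 = df['value'].quantile(0.9)

# Label every row with the group it belongs to
df['group'] = 'main 80%'
df.loc[df['value'] <= p10, 'group'] = '<= 10th percentile'
df.loc[df['value'] >= p90, 'group'] = '>= 90th percentile'

# Draw all groups into the same axes, one colour per group
colors = {
    '<= 10th percentile': 'tab:blue',
    'main 80%': 'tab:gray',
    '>= 90th percentile': 'tab:red',
}
fig, ax = plt.subplots(figsize=(17, 8))
for name, part in df.groupby('group'):
    ax.scatter(part['day'], part['value'], color=colors[name], label=name)
ax.set_xlabel('day')
ax.set_ylabel('value')
ax.legend()

# Call plt.show() to display the figure, or fig.savefig(...) to write it to a file
```

Seaborn's scatterplot function with its hue parameter would be an alternative that does the grouping and the legend for you.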

Exercise submission & deadline: Hand in this assignment by Sun, April 23rd 2023 23:42 to get full points, or ask for an extension beforehand. Late submissions (without an extension requested beforehand) will only get half the points.

To hand in the exercise, name your script in the following format: f'{student_id}_{firstname}_{lastname}.py' and upload it to the Exercise 2 folder of our base cloud course folder.

Demo example

The following script demonstrates how to calculate percentiles and to create additional data frames based on those percentiles with the data set “Radverkehrszählungen Wien”.

import pandas as pd
import matplotlib.pyplot as plt

# This is our original data source
# uri = 'https://www.wien.gv.at/gogv/l9ogdradverkehrszaehlungen'
# When you run the script repeatedly it is more practical to download the
# file once and work with the local copy, instead of sending an HTTP
# request every time
file = '/home/jackie/Downloads/radverkehrszaehlungen.csv'

# read in the data frame from the CSV file with an applicable encoding
# and also parse the dates as dates
df = pd.read_csv(file, delimiter=';', encoding='latin-1', parse_dates=[0])

# counts that were written with a dot as thousands separator have been
# parsed as small decimal numbers (below 10), so scale them back up
df.loc[df.Donaukanal < 10, 'Donaukanal'] = df.Donaukanal * 1000

# calculate the 10th and 90th percentiles
p10 = df.Donaukanal.quantile(0.1)
p90 = df.Donaukanal.quantile(0.9)

# then use those to get our three data frames
df10 = df.loc[df.Donaukanal <= p10]
df90 = df.loc[df.Donaukanal >= p90]
df_rest = df.loc[(df.Donaukanal > p10) & (df.Donaukanal < p90)]

# now produce the scatter plots
df10.plot.scatter(x='Datum', y='Donaukanal', title='<= 10th percentile', figsize=(17, 8))
df90.plot.scatter(x='Datum', y='Donaukanal', title='>= 90th percentile', figsize=(17, 8))
df_rest.plot.scatter(x='Datum', y='Donaukanal', title='main 80%', figsize=(17, 8))
# and display them
plt.show()