Exercises

Exercise 1 - Basic Python applied to data wrangling

Download the data set from the Fictional scenario section in the Initial self-learning phase chapter: ai_bs_trainingset.csv

Inspect the file, then create a script that reads it and calculates and prints the following statistics for each available type of publication:

  • minimum words

  • maximum words

  • average words

  • minimum mentions of AI

  • maximum mentions of AI

  • average mentions of AI

  • minimum bs_factor

  • maximum bs_factor

  • average bs_factor

Then also output these stats for the overall data set.

To read in the file you can use this code snippet from the Reading local .csv and initial analysis section:

filename = 'ai_bs_trainingset.csv'

# read in the file line by line and transform it to a list of lists
lines = []
with open(filename, 'r') as input_file:  # don't shadow the built-in input()
    for line in input_file:
        # as we know the values are separated by ; we split them first
        line_parts = line.split(';')
        # the last item contains a newline character, which we want to strip
        line_parts[-1] = line_parts[-1].rstrip('\n')
        # now we can append it to our lines list
        lines.append(line_parts)
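With `lines` filled this way, the per-type statistics can be computed with plain Python. The following is only a sketch: which columns hold the publication type and the numeric values depends on the actual file, so inspect the header row first (the column indexes passed in below are assumptions, not the real layout).

```python
def stats_per_type(lines, type_col, value_col):
    """Return {publication_type: (min, max, average)} for one numeric column.

    `lines` is a list of lists as produced by the reading snippet above,
    with the first row being the header. `type_col` and `value_col` are the
    indexes of the publication-type column and the numeric column.
    """
    groups = {}
    for row in lines[1:]:  # skip the header row
        pub_type = row[type_col]
        groups.setdefault(pub_type, []).append(float(row[value_col]))
    return {t: (min(v), max(v), sum(v) / len(v)) for t, v in groups.items()}
```

Calling this once per numeric column (words, AI mentions, bs_factor) covers the per-type stats; for the overall stats, collect all values into a single list and apply `min`, `max` and the same average formula.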

Alternatively, you can use the requests library and read the dataset directly from https://tantemalkah.at/2023/machine-learning/bs-dataset/
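A sketch of the requests variant, assuming the URL serves the raw ;-separated text (check the response in your browser first, as this is an assumption about the page):

```python
import requests


def parse_csv_text(text):
    """Split ;-separated text into a list of lists.

    splitlines() already drops the newline characters, so no extra
    rstrip('\n') is needed here.
    """
    return [line.split(';') for line in text.splitlines()]


def fetch_csv_lines(url):
    """Fetch a ;-separated CSV from a URL and return a list of lists."""
    response = requests.get(url)
    response.raise_for_status()  # fail early on HTTP errors
    return parse_csv_text(response.text)
```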

Bonus challenge for the pythonically gifted:

Adapt the script so that it asks the user for a URL to the CSV file that should be fetched. Then output all available columns to the user and let them choose one or more of the columns (e.g. by entering a comma-separated string of column indexes) for which the above stats should be calculated.
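The column-picking part of the bonus could be shaped like this (`pick_columns` is a hypothetical helper, not part of the course material):

```python
def pick_columns(header, user_input):
    """Turn a comma-separated index string like '1,3' into column names."""
    indexes = [int(part.strip()) for part in user_input.split(',')]
    return [header[i] for i in indexes]
```

In the script itself you would combine this with `input()`, e.g. first `input('URL to CSV file: ')`, then print the header with its indexes, then `input('Columns to analyse (comma-separated indexes): ')`.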

Exercise submission & deadline: Hand in this assignment by Sun, 21st April 2024, 23:42 to get the full points. Later submissions will get you only half the points.

To hand in the exercise, name your script in the following format: f'{student_id}_{firstname}_{lastname}.py' and upload it to the Exercise 1 folder of our base cloud course folder.

Exercise 2 - Pretty plotting with Pandas

For this exercise you will have to find your own data set first. I would suggest finding one on https://data.gv.at, maybe even one specifically related to Vienna, e.g. from the data sets published by the City of Vienna: https://www.data.gv.at/auftritte/?organisation=stadt-wien. In the end it is up to you which data set you want to use. It should be something that interests you and that you are curious to find out more about. You might keep working with this data set when we start to apply some machine learning.

Now create a script that uses Pandas to investigate this data set. Make sure to have the data in a useful state; depending on the data set, some preparation might be needed before you can analyse it.

Decide which column is your index that should be used for printing, and pick at least two further columns that seem particularly interesting to you. If any of those columns is not a number-based column, create an additional column and find a way to assign a numeric value based on the original column (how you transform it is up to you).
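One way to derive a numeric column from a text column is `pd.factorize`, which assigns a distinct integer to every distinct value. A minimal sketch (the column name `'district'` is a placeholder for whatever your data set contains, and factorize is only one of several reasonable transformations):

```python
import pandas as pd


def add_numeric_column(df, text_col):
    """Add a '<text_col>_code' column with one integer per distinct value.

    pd.factorize numbers values in order of first appearance, so the first
    distinct value becomes 0, the next new one 1, and so on.
    """
    df[f'{text_col}_code'] = pd.factorize(df[text_col])[0]
    return df
```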

For the resulting columns print some general statistics:

  • min, max, and mean

  • the following percentiles: 10th, 25th, 75th, 90th

  • the standard deviation
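These statistics map directly onto pandas Series methods. A sketch collecting them for one column (you could equally use `df.describe(percentiles=[.1, .25, .75, .9])`, which prints most of them in one go):

```python
import pandas as pd


def column_stats(series):
    """Return min/max/mean, the requested percentiles, and the std dev."""
    return {
        'min': series.min(),
        'max': series.max(),
        'mean': series.mean(),
        'p10': series.quantile(0.10),
        'p25': series.quantile(0.25),
        'p75': series.quantile(0.75),
        'p90': series.quantile(0.90),
        'std': series.std(),  # sample standard deviation (ddof=1)
    }
```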

Then for both columns create one general plot visualising the data.

Additionally create the following plots for at least one of the two columns:

  • of all data points below the 10th percentile

  • of all data points above the 90th percentile

  • of all remaining data points
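Splitting a column at its 10th and 90th percentiles could be sketched like this, assuming a numeric pandas Series (each returned slice can then be plotted on its own, e.g. with `.plot(kind='hist')`):

```python
import pandas as pd


def split_by_percentiles(series, low=0.10, high=0.90):
    """Return (below_low, middle, above_high) slices of a numeric Series."""
    p_low = series.quantile(low)
    p_high = series.quantile(high)
    below = series[series < p_low]
    above = series[series > p_high]
    middle = series[(series >= p_low) & (series <= p_high)]
    return below, middle, above
```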

The type of plot you produce is up to you: choose whatever best helps to make sense of your specific data set. It might be helpful to play around with the plot types, and in some cases it can be useful to create two different plot types for the same data. In that case, add those additional plots.

Bonus challenge for the plotting pros: Create a plot combining all of the above data, where the different characteristics are encoded, e.g. by color or dot size in the case of a scatter plot.

Exercise submission & deadline: Hand in this assignment by Sun, 5th May 2024, 23:42 to get the full points. Later submissions will get you only half the points.

To hand in the exercise, name your script in the following format: f'{student_id}_{firstname}_{lastname}.py' and upload it to the Exercise 2 folder of our base cloud course folder.

Exercise 3 - Markovian funsense or linear depression?

The idea of this exercise is that you extend one of the two examples we worked on in the Markov chains and linear regression session. So choose one of the following two options:

  1. Extend the Markov-chain-based nonsense text generator so that it uses words instead of characters to generate a new text.

  2. Extend the linear-regression algorithm by combining the training and the plotting script. The result will be one script that shows some plots about the data you want to process, then goes through the training phase, and immediately afterwards also plots the resulting line. Use some real-life data set for this. Maybe you can re-use the data set from exercise 2, or find another one that shows some form of linearity.
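For option 1, the core of a word-based generator can be very small. This is a sketch independent of the session's exact code, so the function names here are placeholders, not the course material:

```python
import random


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = {}
    for current, following in zip(words, words[1:]):
        chain.setdefault(current, []).append(following)
    return chain


def generate(chain, start, length=20):
    """Generate a text of up to `length` words, starting from `start`."""
    word = start
    result = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break  # dead end: this word never had a follower
        # duplicates in the follower list make frequent transitions
        # proportionally more likely, just like counting them would
        word = random.choice(followers)
        result.append(word)
    return ' '.join(result)
```

The main change compared to the character-based version is the unit of the chain: `text.split()` turns the corpus into words, and everything else stays structurally the same.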

Exercise submission & deadline: Hand in this assignment by Sun, 16th June 2024, 23:42 to get the full points. Later submissions will get you only half the points.

To hand in the exercise, name your script in the following format: f'{student_id}_{firstname}_{lastname}.py' and upload it to the Exercise 3 folder of our base cloud course folder.