Session 4: Reading and analysing .csv data sets

In this session we will read in .csv files and do some statistical calculations on them. We will also create some first simple plots of the data we read in. To do all of that we will use the requests library and matplotlib.

Make sure to check out these libraries' sites. The matplotlib site in particular has a nice Getting started area that gives a good overview of the different plot types and how to use them in their simplest forms.

At the end of this session your first coding exercise is introduced.

Fictional scenario

In this session we start to use a data set that is based on a purely fictional scenario. Based on this scenario the following training set .csv file is made available: ai_bs_trainingset.csv

This file contains statistics for a vast range of publications somehow relating to “AI”. Our fictional research team has categorised these publications into 4 different types:

  • newspaper articles

  • journal articles

  • book publications

  • other types of publications

For all of those, a word count was created, as well as a count of how often “AI” or any semantically similar term was mentioned. Our research team then meticulously analysed each publication for its bs_factor, which was added to the data set as well.

The resulting data set is reflected in the file linked above. We will use this file in the following snippets and will reuse it in future sessions.
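As the snippets below assume, the file is semicolon-separated, with a header line followed by one row per publication: the bs_factor, the word count, the number of AI mentions, and the publication type. The header field names and values in the following excerpt are made up purely for illustration (check the actual file for the real ones):

bs_factor;words;ai_mentions;type
0.87;542;23;news
0.12;9311;4;journal
0.55;67000;31;book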

Reading local .csv and initial analysis

filename = 'ai_bs_trainingset.csv'

# read in the file line by line and transform it to a list of lists
lines = []
with open(filename, 'r') as input_file:
    for line in input_file:
        # as we know the values are separated by ; we split them first
        line_parts = line.split(';')
        # the last item contains a newline character, which we want to strip
        line_parts[-1] = line_parts[-1].rstrip('\n')
        # now we can append it to our lines list
        lines.append(line_parts)

# remove and print header line
header = lines.pop(0)
print(f'here are the csv file header fields: {header}')

# find out which types we have
types = []
for item in lines:
    types.append(item[3])
# or do the same thing as above in a more pythonic way
types = [item[3] for item in lines]
# get the unique values for publication types
publication_types = set(types)
print(f'the following publication types can be found: {publication_types}')
# now count how many items of each type there are
type_stats = {item_type: 0 for item_type in publication_types}
for item in lines:
    type_stats[item[3]] += 1
print(f'the following counts per publication type could be found: {type_stats}')
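# note: the standard library offers a one-line shortcut for this kind of tally:
# from collections import Counter; type_stats = dict(Counter(types))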

# collect average word count and ai mentions in newspaper publications
stats = {'news': {'total_words': 0, 'total_ai_mentions': 0, 'count': 0}}
for item in lines:
    if item[3] == 'news':
        stats['news']['count'] += 1
        stats['news']['total_words'] += int(item[1])
        stats['news']['total_ai_mentions'] += int(item[2])

avg_words_news = stats['news']['total_words'] / stats['news']['count']
avg_ai_mentions_news = stats['news']['total_ai_mentions'] / stats['news']['count']
print(f'news articles have on average ~{round(avg_words_news, 2)} words and mention AI ~{round(avg_ai_mentions_news)} times')

As a file: snippets/data_prep1.py
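Side note: splitting the lines manually like above is a good exercise, but Python's standard library also ships a csv module that handles delimiters and line endings for you. A minimal sketch of the same read-in step (not part of the course snippets, just for reference):

import csv

filename = 'ai_bs_trainingset.csv'

# csv.reader splits each row on the given delimiter and strips line endings
with open(filename, 'r', newline='') as input_file:
    reader = csv.reader(input_file, delimiter=';')
    lines = list(reader)

# remove and print the header line, just like before
header = lines.pop(0)
print(f'here are the csv file header fields: {header}')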

Fetching online .csv and basic plots

For this example we are not using the same local file. What the URL below returns is a subset of our fictional research data, so it might differ a little from the data in the file you downloaded, but the structure is the same and the statistics should generally match too.

import requests
import sys
import matplotlib.pyplot as plt

file_url = 'https://tantemalkah.at/2023/machine-learning/bs-dataset/'

# read in the file line by line and transform it to a list of lists
r = requests.get(file_url)
if r.status_code != 200:
    print(f'Something seems to be wrong. HTTP response status code was: {r.status_code}')
    sys.exit(1)
# first split the whole response text string on newlines
unsplit_lines = r.text.split('\n')
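# note: r.text.splitlines() would avoid the trailing empty
# line that we have to pop a few lines below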
# then create a list of lists where the lines are split on ;
lines = [line.split(';') for line in unsplit_lines]

# remove and print header line
header = lines.pop(0)
print(f'here are the csv file header fields: {header}')
# and also pop the last line, because it is an empty one
lines.pop(-1)

# now we could do exactly the same things as in data_prep1.py

# we can also create simple plots of the data
# let's try that with words against the bs_factor
words = [int(item[1]) for item in lines]
bs_factors = [float(item[0]) for item in lines]
fig, ax = plt.subplots()
ax.scatter(words, bs_factors, s=1)
plt.show()

# ok, what a mess, we'll have to make some sense of that later. but let's
# also plot the ai_mentions first
mentions = [int(item[2]) for item in lines]
fig, ax = plt.subplots()
ax.scatter(mentions, bs_factors, s=1)
plt.show()

# doesn't look so much better, but let's try a mentions-per-word ratio
mentions_per_word = [mentions[i] / words[i] for i in range(len(lines))]
fig, ax = plt.subplots()
ax.scatter(mentions_per_word, bs_factors, s=1)
plt.show()

# stunning, who would have thought that there is such a clear correlation?
# no more manual bs paper analysis, a linreg machine learning alg might just
# do that for us splendidly

As a file: snippets/data_prep2.py
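By the way, all the plots above come out without any axis labels. To make them more self-explanatory, matplotlib's Axes object provides methods for labels and a title. A minimal sketch, reusing the variables from the snippet above:

fig, ax = plt.subplots()
ax.scatter(mentions_per_word, bs_factors, s=1)
# label the axes and add a title, so others can read the plot too
ax.set_xlabel('AI mentions per word')
ax.set_ylabel('bs_factor')
ax.set_title('mentions-per-word ratio vs. bs_factor')
plt.show()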

Exercise 1 - Basic Python applied to data wrangling

Create a script that fetches the .csv file contents from https://tantemalkah.at/2023/machine-learning/bs-dataset/ (using the requests library) and then calculates and prints the following statistics for all available types of publications:

  • minimum words

  • maximum words

  • average words

  • minimum mentions of AI

  • maximum mentions of AI

  • average mentions of AI

  • minimum bs_factor

  • maximum bs_factor

  • average bs_factor

Then also output these stats for the overall data set.
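Hint: once you have the values of one column for one publication type collected in a list, Python's built-in min(), max(), sum() and len() are all you need for these statistics. For example, with made-up numbers:

values = [3, 1, 4, 1, 5]
print(min(values), max(values), sum(values) / len(values))
# prints: 1 5 2.8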

Bonus challenge for the pythonically gifted:

Adapt the script so that it asks the user for a URL to the CSV file that should be grabbed. Then output all available columns to the user and let them choose one or more of the columns (e.g. by entering a comma-separated string of column indexes), for which the above stats should then be calculated.
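If you are unsure where to start with the user interaction, here is a sketch of how the column selection could be read in (prompt wording and variable names are just one possible choice, not a required interface):

# ask the user which columns to analyse, e.g. '1,2'
raw = input('Which column indexes do you want stats for (comma separated)? ')
selected = [int(part.strip()) for part in raw.split(',')]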

Exercise submission & deadline: Hand in this assignment by Sun, April 2nd 2023, 23:42 to get the full points - or ask for an extension beforehand. Late submissions (if you have not asked for an extension beforehand) will only get you half the points.

To hand in the exercise, name your script in the following format: f'{student_id}_{firstname}_{lastname}.py' and upload it to the Exercise 1 folder of our base cloud course folder.