Initial self-learning phase

Before we start experimenting with machine learning, we need to get some Python basics going. How much effort this takes will differ a lot between participants, depending on their prior coding experience. The goal of this mostly self-directed learning phase is to recognise basic Python code when you see it and be able to make sense of it. The goal is not to know Python by heart (most people only get there after several years of full-time work with it), but to be familiar enough with it to look for solutions online and to adapt whatever you find so that it works for you.

So this is a kind of super dense crash course in Python. Until our first real machine learning session, your task is to work through the materials listed in this section until you are able to finish the first exercise. You will be provided with code snippets and video walkthroughs, and the course lecturer is there to help you get unstuck. But it is your responsibility to communicate proactively. Ask your colleagues, use the base chat channel. Usually you are not the only one who has a question about how a particular piece of code works, and others will be glad that somebody else asked it.

Especially when it comes to coding, you might have the feeling that only you are not getting it. But actually it is the other way around: only very few get it the first time they see it, and usually they have some prior knowledge that makes what they see quite relatable. Here is a quote from a former lawyer on his bumpy way into software development:

“I tried to teach myself to code THREE times. In 2014, in 2015, and in 2017. And all three times I quit because I tried to jump too high, set myself up for failure, and then assumed I was not smart enough. But actually, I had just tried to run before I’d learned to walk.” — Zubin Pratap, freeCodeCamp alum who went on to become a software engineer at Google, Source: freeCodeCamp Newsletter

And as with drawing or any other craft or art form, getting good at coding is not about knowing magic, but about practice, practice, and more practice, as this comic by Sarah Andersen illustrates with a very poignant example.

So always remember: if you are stuck somewhere, probably someone else is too. So please just ask around. The lecturer is also always happy to help you (actually, it kind of is their job), but they can only do so if you ask them.

Now that we have set the scene, let’s go into setting up and experimenting with Python code.

Setup & Python basics

In this part we will make sure that everyone can run Python on their devices and then do a fast-paced run-through of the code pieces below to explain the basic Python (and programming) concepts.

As this might be a lot in a very short amount of time for people who are new not only to Python but to coding in general, please also take a look at the Python Basics resources on our References page. The w3schools Python Tutorial in particular might be a good starting point if you are still puzzled about some of the concepts after our quick run-through.

There are three recordings in which I talk you through the following code snippets (and the setup initially, with some remarks about code editors):

Setup & Basics 1 (~51min, 163MB)
Basics 2 (~33min, 100MB)
Basics 3 (~50min, 156MB)

Some comments and the code snippets shown in the videos can be found in the following subsections.

Setup

If you already have Python set up and also have an editor/IDE you like to work with, then you are ready to go. If not, the easiest way to start is to install Thonny, which is a beginner-friendly IDE (itself written in Python).

Thonny was developed by the University of Tartu specifically for the context of teaching and learning Python. Thonny is not only an editor/IDE, it also brings a full Python 3 installation with it, as well as a version of pip - the package installer for Python. So with Thonny you have everything you need to get started, and we will use this as a reference tool throughout the course.
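To check that the setup works, you could run a tiny script like the following in Thonny (or in whatever editor you use). It is only a sanity-check sketch; the variable name `version_text` is made up for illustration.

```python
import sys

# sys.version_info holds the version of the interpreter running this script
version_text = f'{sys.version_info.major}.{sys.version_info.minor}'
print(f'This script is running on Python {version_text}')

# the course snippets use f-strings, which need Python 3.6 or newer
if sys.version_info >= (3, 6):
    print('Good to go!')
else:
    print('Please install a newer Python 3 version.')
```

If this prints a version number and "Good to go!", the snippets in the following subsections should run on your machine as well.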

Python basics crash course I

Here’s the code we’ll run through and work with in this part:

basic.py

'''
This is a multi-line comment at the beginning of this Python
script. Its aim is to introduce you to the most basic concepts
in Python. We'll walk through this live in the course. If you
look at this in advance and everything here makes perfect sense
to you, and you could recreate it yourself from scratch, feel
free to skip the first hour of our Setup & Python basics session.
'''

# also a comment, only a single line comment
# although it is followed by just another one
# no machine code will be executed up to (down) here

# but now, let's define some variables
my_boolean = True
my_boolean = False  # well, now we've overwritten the content
my_boolean = 'Not a boolean at all, but a string now!'
my_boolean = True or False  # ahm ... what's that?

# let's check and print the content to the console
print(my_boolean)

print('How can we explain that?')
my_explanation = 'Well, True or False is always ' + str(my_boolean)
print(my_explanation)

# combine printing and evaluation
print('But True and False is always', True and False)

# Let's get some user input
print()  # do an empty line first
user_name = input('Hey! What\'s your name? ')
print(f'Hi {user_name}, nice to talk to you.\n')  # using a format string and adding an empty line at the end


# before we can do more awesome stuff we need some better ideas
# of what types we can use here
my_string1 = 'Well, we already know this one'
my_string2 = "Double-quotes are also fine"
my_string3 = ''
my_string4 = 'The last one was an empty string. Still a string!'

my_multiline_string = '''This string starts here,
but spans several lines.
Quite convenient.'''

my_tedious_multiline_string = 'This string starts here,\n'
my_tedious_multiline_string += 'but spans several lines.\n'
my_tedious_multiline_string += 'Quite inconvenient doing it this way.'

print(my_multiline_string)
print(my_tedious_multiline_string, '\n')

print('So' + 'yes,' + 'strings' + 'can' + 'be' + 'concatenated!')  # what's up with the spaces?

# ok, we got strings, what about numbers and calculations?
my_int = 42
my_float = 42.0000000000001
my_float2 = 42.0
print(f'The type of {my_int} is {type(my_int)}.')
print(f'The type of {my_float2} is {type(my_float2)}.')

# let's do some calculations
print(5 * (12-8) + -15)
print(98 + (59872 / (13*8)) * -51)
print(72 % 8)  # this is the modulo operation
print(73 % 8)  # it returns the remainder of a division

# we can even "multiply" strings
print('The essence is ' + 'bla' * 3)

# now, comparing things is a very boolean thing to do
print('2 == 2? ', 2 == 2)
print('2 == 3? ', 2 == 3)
print('2 < 3? ', 2 < 3)
print('2 <= 2? ', 2 <= 2)
print('42 > 0? ', 42 > 0)
print('"lala" == "lala"', "lala" == "lala")

# comparing can be fun, IF you can do something with it
user_guess = input(f'Hey {user_name}, give me a number: ')
if user_guess == '42':
    print('Awesome, it seems there is nothing else left to learn for you')

user_guess = input('Gimme another one: ')
if user_guess == '42':
    print('Sneaky you, you are really intent on being stuck here, are you?')
else:
    print('Sorry, your number is just wrong!')

user_guess = input('Only one more: ')
if user_guess == '42':
    print('This is getting boring: ')
elif user_name == 'jackie':
    print(f'Well {user_name}, whatever you say seems to be the right thing.')
else:
    print('Not it! Well, maybe another time')

# so here we actually compared strings, not really numbers (as in int)
# but as we already have 100 lines of code here, let's try it out before we continue with basic2.py

As a file: snippets/basic.py
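The closing comment of basic.py points out that we compared strings, not numbers. Here is a small sketch of what that means in practice. The value of `user_guess` is hard-coded here as a stand-in for what `input()` would return, so that the script runs without asking for anything.

```python
# a simulated answer, standing in for what input() would return
user_guess = '42'

print(type(user_guess))        # input() always gives us a string
print(user_guess == '42')      # string compared with string
print(user_guess == 42)        # string compared with int: never equal
print(int(user_guess) == 42)   # converted first, then compared as numbers

# strings are compared character by character, so these differ as strings ...
print('042' == '42')
# ... although they stand for the same number
print(int('042') == int('42'))
```

This is why the guessing game only accepts exactly `42` and not, for example, `042` or `42.0`. Type conversion, which is picked up in basic2.py, is the way around that.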

basic2.py

# as we are getting into error handling now, let's define a DEMO var
DEMO = True

try:
    print(23 + '23')
except Exception as err:
    if DEMO:
        print("Ugh, there was an error. But never mind, we'll just continue")
    else:
        raise err

print(str(23) + '23')
print(23 + int('23'))

# Ok, so type conversion is an important thing. Let's try that with our
# number guesser
guess = input('Hey user, gimme a number: ')
# You might get an error here, depending on your input
try:
    if int(guess) == 42:
        print('Correct!')
    else:
        print('Not it!')
except:
    print("Ugh, there was an error. But never mind, we'll just continue")

# Errors sometimes seem ugly and annoying,
# but really, don't be a snob, they can be your friends
# Turn off the DEMO mode and see what you get.
# Try to fix the code one way or another.

# Now which of the following produce an error?
try:
    print(12 + 12)
    print('12' + '12')
    print('12' + 12)
    print('12 + 12')
    print(2 * 5)
    print('2' * '5')
    print('2' * 5)
    print('2 * 5')
except Exception as err:
    if DEMO:
        print("Ugh, there was an error. But never mind, we'll just continue")
    else:
        raise err

# we've already seen type conversion, here some more examples:
variable_1 = 12
variable_2 = '12'
str(variable_1)
int(variable_2)
float(variable_2)

# now we can actually check the user input more properly
guess = input('Hey user, gimme a number again: ')
try:
    guess = int(guess)
except:
    print("Come on, why can't you give me a number?")
else:
    if guess > 10000:
        print("That's quite a large number")
    else:
        print('Nothing special about this number')
# This does not handle ALL numbers. Can you fix it?

print('\nAnd now for something completely different!\n')
print('A side note on string formatting:')

name = 'jackie'
day = 'Thursday'

# simple string concatenation, as we already know it
formatted = 'Hello ' + name + ', what a ' + day + '!'
print(formatted)

# the old style - good to know but please don't use it
formatted = 'Hello %s, what a %s!' % (name, day)
print(formatted)

# the new style
formatted = 'Hello {}, what a {}!'.format(name, day)
print(formatted)

# f-strings : the best, since Python 3.6
formatted = f'Hello {name}, what a {day}!'
print(formatted)

# Still, everything is so boringly determined here. Let's add some randomness
# and use some standard library features
from random import randint
from datetime import datetime

number = randint(0, 5)
now = datetime.now()
weekday = now.weekday()

print('Today, right now, this moment is', now.strftime('%A, %Y-%m-%d %H:%M')) 
if abs(weekday - number) == 0:
    print('Wow! This is a perfect day!')
elif abs(weekday - number) <= 2:
    print('A day as every other.')
else:
    print('This day really sucks!')
  
# check out https://www.w3schools.com/python/python_datetime.asp
# for more things you can do with dates

# enough for now. try to get this script running. play around with it.
# in DEMO and outside DEMO mode. fix the errors. and then ready yourself
# for part 3

As a file: snippets/basic2.py
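One possible answer to the "This does not handle ALL numbers" question in basic2.py is sketched below. The helper name `parse_number` and the list of test inputs are made up for illustration, and this is just one way to do it. It first tries `int()`, then falls back to `float()`, and it catches only `ValueError` instead of using a bare `except`, so other kinds of errors would still show up.

```python
def parse_number(text):
    '''Return text as an int or a float, or None if it is not a number.'''
    try:
        return int(text)
    except ValueError:
        pass  # not an int, but it could still be a float
    try:
        return float(text)
    except ValueError:
        return None


# simulated user inputs, so that the sketch runs without typing anything
for raw in ['42', '-7', '3.14', ' 12 ', 'forty-two', '']:
    number = parse_number(raw)
    if number is None:
        print(f'{raw!r} is not a number')
    elif number > 10000:
        print(f'{raw!r} is quite a large number')
    else:
        print(f'{raw!r} was read as {number!r}')
```

Note that `float()` also accepts texts like `'1e5'`, `'inf'` or `'nan'`, which may or may not be what you want from a user.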

basic3.py

# before we venture into machine land, we really should talk about
# lists, dicts, loops and functions

item = True
my_list = ['contains', 5, 'different', 'items like', item]
my_tuple = ('contains', 4, 'immutable', 'things')
my_list_from_tuple = list(my_tuple) + ['now mutable!']  # lists can be concatenated too
print(my_list_from_tuple, len(my_list_from_tuple))
# let's correct the number of items then
my_list_from_tuple[1] = len(my_list_from_tuple)
print(my_list_from_tuple)

# lists are indexed starting with 0. and they can be sliced
print(my_list[2:4])
print(my_list[2:])
print(my_list[:3])
print(my_list[1:2])  # isn't this a bit silly, as we could just use my_list[1] then?
print(my_list[1])    # hm ... see the difference?

# lists can be extended
print('extending my_list')
my_list.append('something')
my_list.extend([1, 2, 3])
my_list.insert(2, '(number is actually incorrect)')
print(my_list)

# and we can update single or multiple items
print('updating my_list')
my_list[1] = len(my_list)  # we've already seen this above
my_list[2:5] = ['(now correct)', 'very different', 'items']
my_list[7:] = [42, 10]  # this replaces part of the list with another (shorter) list
print(my_list)

# and we can also specifically delete items
print('deleting from my_list')
my_list.remove(10)
popped_item = my_list.pop(4)
print(f'popped_item: {popped_item} ; the list now looks like: {my_list}')
del my_list[1]
print(my_list)
my_list.clear()
print(my_list)
#del my_list
print(my_list)  # with the del above uncommented, this would throw an error, because the list is gone then
# so keep the del commented out (or remove this last print), before you move on to explore dictionaries
# and take a look at those many other awesome list methods:
# https://www.w3schools.com/python/python_lists_methods.asp


my_dict = {'name': 'jackie', 'xp': 42, 'zombie': True}  # after such a speed run, what did you expect?
print(my_dict)
print(my_dict['name'])

# dicts contain key:value pairs. the values can also be dicts
my_dict = {
    'name': 'jackie',
    'data': {
        'xp': 42,
        'zombie': True,  # machine braaaaaains!
    },
}
print(my_dict)
print(my_dict['data'])
print(type(my_dict['data']))
print(my_dict['data']['zombie'])
print(my_dict.get('name'))  # a different way to access 
print(my_dict.get('something'))  # because my_dict['something'] would throw an error

print('\nsome more ways to access dict keys and values:')
print(my_dict.values())
print(my_dict.keys())
print(my_dict.items())

print('\nadding and changing dict items')
my_dict['name'] = 'mafalda'
my_dict['data']['zombie'] = False
my_dict['a_new_key'] = 'a new value'  # adding something is easy as that
print(my_dict)
# but we can update (change/add) several key-value-pairs at once too
my_dict.update({'name': None, 'data': 'useless', 'hidden_gem': True})
print(my_dict)

print('\ndeleting works similar to lists')
popped_item = my_dict.pop('a_new_key')
print(f'popped_item: {popped_item} ; the dict now looks like: {my_dict}')
del my_dict['hidden_gem']
print(my_dict)
my_dict.clear()
print(my_dict)
#del my_dict
print(my_dict)
# well, this is the same as with lists, the same as with everything actually.
# if you delete it, it's gone. so fix this before you continue.
# also check out all of those dictionary methods:
# https://www.w3schools.com/python/python_dictionaries_methods.asp

print('\nOk, now, finally it is time to do some looping')
print("We'll loop through lists and dicts just for the fun of it")

my_list = ['contains', 5, 'different', 'items', True, 0.0, 1.234, False, None, {}, [], '']
for item in my_list:
    print('We now can do stuff with', item)
    if type(item) == str:
        print(f'This item is a string and of {len(item)} length.\n')
    elif type(item) == int:
        print('This item is an int:', item, '\n')
    else:
        output = 'This item appears to be something else.'
        if item:
            output += ' It looks like a truthy one.'
        else:
            output += ' Totally falsy of course.'
        print(output + '\n')
# here our loop ends

print('Looping through dicts works quite similarly')
my_dict = {'name': 'jackie', 'xp': 42, 'zombie': True, 'data': {'some': 'thing'}}
for key in my_dict:
    print('Currently at key', key, 'which has the value', my_dict[key])
# but sometimes it is more convenient to get the key and value as separate variables in the loop
for key, value in my_dict.items():
    print(f'In this loop we can access {key} and directly get {value}.')
    if type(value) == dict:
        print("A dict in a dict?!? Are you nuts? I won't look at this any further.")
    else:
        print('Boring')

# and because those loops actually work with all iterables we can also do this
for n in range(23):
    print(n)
else:
    print('imagine all the possibilities!')  # wait, what?!?
    # yep, that's a thing too, you can use an else after a for loop
    # look it up: https://www.w3schools.com/python/python_for_loops.asp


print('''

You've made it to the end!
I think this is enough for today.
Get some fresh air.
Take a deep breath.
Think about happy things.
Try not to become a robot until next Thursday.
Then we collect the missing pieces to become ... ah, create universal discrete state machines.
''')
from time import sleep
sleep(4)
print('\n\nOh...')
sleep(3)
print('by the way,')
sleep(3)
print('did you know this?\n')
sleep(2)
import this

As a file: snippets/basic3.py
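The `else` after a `for` loop at the end of basic3.py becomes more interesting in combination with `break` (which the loops snippet covers in detail): the `else` block only runs if the loop was never left via `break`. Here is a minimal sketch, with a made-up function name `find_first_negative`:

```python
def find_first_negative(numbers):
    '''Return the first negative number in numbers, or None if there is none.'''
    for n in numbers:
        if n < 0:
            print('found a negative number:', n)
            break
    else:
        # this branch only runs if the loop was never left via break
        print('no negative numbers found')
        return None
    return n


first = find_first_negative([3, 1, -4, 1, -5])
second = find_first_negative([3, 1, 4])
print(first, second)
```

In the `range(23)` example there is no `break`, so the `else` block there always runs.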

Loops and functions

There are three recordings in which I run through the following code snippets:

Loops (~26min, 78MB)
Functions (~34min, 107MB)
Recursion (~24min, 79MB)

Loops

# In the last session we ended with for-loops that could loop
# through a list, a dict, or a range. But what if we do not want
# to loop over something in particular, but just want to loop,
# maybe with a specific stop condition of our own.

# Well, let's do a simple while loop
# and we will use this number as a loop condition
number = 1
# now execute the loop as long as the number is smaller than 100
while number < 100:
    print(f'counting from {number} to {number+1}')
    # and maybe doing some more serious stuff here
    # but don't forget to update the loop condition, unless you want to loop forever
    number += 1

# Thanks to Al Sweigart for the following one
name = ''
while name != 'your name':
    print('Please type your name.')
    name = input()
print('Thank you!')


# now we add some more loop control structures with break and continue
boredom_threshold = 42  
number = 0  
while number < 100:  
    # increase the loop counter  
    number += 1  
    # we only want to deal with even numbers, so we just skip odd ones  
    if number % 2 != 0:  
        continue  
    # and in case we get past our boredom threshold, we just end the loop  
    if number > boredom_threshold:  
        break  
    print(f'we have reached {number}')  
    # and maybe do some more important stuff here as well

As a file: snippets/loops.py
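For comparison, here is a sketch that collects the same even numbers twice: once with the `while`/`continue`/`break` pattern from loops.py, and once with `range(start, stop, step)`. The list names `evens_while` and `evens_for` are made up for this example.

```python
boredom_threshold = 42

# version 1: while loop with continue and break, as in loops.py
evens_while = []
number = 0
while number < 100:
    number += 1
    if number % 2 != 0:
        continue  # skip odd numbers
    if number > boredom_threshold:
        break  # stop once we are past the threshold
    evens_while.append(number)

# version 2: range(start, stop, step) produces the even numbers directly
# (the stop value itself is excluded, hence the + 1)
evens_for = list(range(2, boredom_threshold + 1, 2))

print(evens_while == evens_for)
print(evens_for[:5], '...', evens_for[-1])
```

When you know in advance what you want to loop over, a `for` loop is usually the shorter option. `while` with `break` and `continue` fits better when the stop condition only emerges while looping, as in the name prompt above.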

Functions

'''
Functions
=========
... are blocks of reusable code, that can be run by invoking/calling the function
wherever we want to use that code. Functions can have zero or more arguments,
through which we can parameterise the function to not always do exactly the
same thing. Functions can also return some value, that we can use when the
function code has finished running.

But let's start with some examples:'''

# we already used functions, e.g. the print function
print()  # this prints an empty line
print()  # again, does exactly the same thing
print('hey there!')  # now we use one argument (a string), so the function prints something else
print('again, but with some other argument')

# now print can take many arguments
print('it will print all those', 3, 'arguments, separated by a space')
# print even allows for some keyword arguments. check it out: https://devdocs.io/python~3.8/library/functions#print
print('a', 'set', 'of', 'random', 'data', 23, 43, sep=';', end=';# i know, csv does not do comments, but we add this at the end of the line anyways\n')

# or here a function we already used that returns something
the_return_value = input('Hey, gimme something: ')


'''
Defining our own functions
==========================
We are not limited to using the functions that are already there.
We can define our own functions. How awesome is that? This way we
could rewrite the whole language basically. But let's start with
some simple examples.
'''

# a function that just prints some lines of text
def print_greeting():
    print('Hello user!')
    print('What a lovely day.')
    print('Have fun hacking around.')
    
# now every time we want to print those three lines we don't have to write
# them all out again but just do:
print_greeting()
print_greeting()  # see, just does the same thing again

# let's overwrite the function and make it a bit more versatile
# by introducing an argument
def print_greeting(name):
    print('Hello', name, end='!\n')
    print('What a lovely day.')
    print('Have fun hacking around.')

# now we can use the same function for different users
print_greeting('Ada')
print_greeting('Grace')
print_greeting('Hedy')

# let's add some more arguments with defaults
def print_greeting(name, daytime='day', activity='hacking around'):
    print('Hello', name, end='!\n')
    print(f'What a lovely {daytime}.')
    print(f'Have fun {activity}.')

# now we can use the same function for more personalised greetings
print_greeting('Ada')
print_greeting('Grace', 'morning')
print_greeting('Hedy', 'evening', 'frequency hopping')
print_greeting('Margaret', activity='launching space ships with code')


# now some other functions that actually return something
def my_own_much_better_addition(num1, num2):
    result = num1 + num2  # actually not really better, just an addition
    return result

# we now can call the function and it will provide some result
my_calc = my_own_much_better_addition(23, 42)
print(my_calc)

# something more interesting that returns a more interesting data structure too
def get_user_details():
    name = input('Hey user, what\'s your name? ')
    activity = input('And what is your favourite activity? ')
    return {
        "name": name,
        "activity": activity,
    }

# which we now can use every time a user logs on, to later print the greeting
user_details = get_user_details()
print_greeting(user_details['name'], activity=user_details['activity'])


# And now for an important but not that often used concept:
'''*****************************
   ***                       ***
   ***   R E C U R S I O N   ***
   ***                       ***
   *****************************'''
# do you know the factorials?
# mathematically written: n! = n * (n-1)!
# more background: https://en.wikipedia.org/wiki/Factorial
# here's a list of the first few factorials there are (for 0 and 1 it is defined axiomatically)
factorials = [1, 1, 2, 6, 24, 120, 720, 5040, 40320, 362880]  

# so how could we get more of them?
for _n in range(100):  
    factorials.append(factorials[-1] * len(factorials))  
print(f'wow, now we have the first {len(factorials)} in a list:', factorials)


# this is what a function could look like that uses such a list, to return n!
def factorial_list(n):
    # for 0 and 1 the output is just defined as 1
    if n in [0, 1]:
        return 1
    # if n is bigger than 1, we'll create our list
    f_list = [1, 1]
    for i in range(2, n+1):
        f_list.append(f_list[-1]*i)
    # now we can return the last item of the list which is n!
    return f_list[-1]  


# but there is a much smoother way to do that, without any list
# just by calling the same function within the function itself (:= as a recursion)
def factorial(n):  
    if n in [0, 1]:  
        return 1  
    return n * factorial(n-1)  


# we can check now, whether those two functions really do the same thing and
# produce the correct factorials for numbers from 0 to 9
for i in range(10):  
    print(f'factorial({i})      : {factorial(i)}')  
    print(f'factorial_list({i}) : {factorial_list(i)}')

# Ok, so this was an example of what recursion is and can do.
# You might not need it very soon or at all. But keep in mind
# that there are some problems which are better solvable with
# recursion. E.g. a dict, that contains a `data` dict, which
# again can contain a `data` dict and so on. And you don't know
# how deep this nesting goes. This is one case where recursion
# could come in as quite a handy feature.

'''
Well, this was a lot. Still, there is a lot more. But now
you know all the basics you need to write your own functions.
At some point you might want to split up your functions into
different files, and in your main script import those functions
from those "modules".
But let's leave it for now. If you want to know more, take a look
e.g. at those:
* https://www.w3schools.com/python/python_functions.asp
* https://www.w3schools.com/python/python_modules.asp

Or you want to go full functional? Then take this:
* https://www.w3schools.com/python/python_lambda.asp
But I'd recommend not to overdo it. You don't need to be fully
functional to be an esteemed member of our coding club. ;)
'''

As a file: snippets/functions.py
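The nested `data` dict mentioned in the recursion comments can be sketched like this. The function name `data_depth` and the example dict are made up for illustration. The function calls itself once per nesting level, so we do not need to know in advance how deep the nesting goes.

```python
def data_depth(item):
    '''Count how many nested `data` dicts are inside item.'''
    inner = item.get('data')
    # stop condition: there is no further `data` dict on this level
    if not isinstance(inner, dict):
        return 0
    # otherwise count this level and look one level deeper
    return 1 + data_depth(inner)


nested = {'name': 'jackie', 'data': {'xp': 42, 'data': {'zombie': True, 'data': {}}}}
print(data_depth(nested))
print(data_depth({'name': 'mafalda'}))
```

As with `factorial(n)`, the important part is the stop condition. Without it the function would keep calling itself until Python gives up with a `RecursionError`.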

Reading and analysing .csv data sets

In this part we will read in .csv files and do some statistical calculations. Additionally we will create some first simple plots of the data we read in. To do all of that we will use the requests library and matplotlib.

Make sure to check out these sites. The matplotlib site in particular has a nice Getting started area that gives a good overview of the different plot types and how to use them in their most simple forms.

At the end of this session your first coding exercise is introduced.

Fictional scenario

In this session we start to use a data set that is based on a purely fictional scenario. Based on this scenario the following training set .csv file is made available: ai_bs_trainingset.csv

This file contains statistics for a vast range of publications somehow relating to “AI”. Our fictional research team has categorised these publications into 4 different types:

newspaper articles
journal articles
book publications
other types of publications

For all of those, a word count was created, as well as a count of how often “AI” or any semantically similar term was mentioned. Then our research team meticulously analysed each publication for its bs_factor, which was added to the data set as well.

The resulting data set is reflected in the file linked above. We will use this file in the following snippets and will reuse it in future sessions.

Reading local .csv and initial analysis

filename = 'ai_bs_trainingset.csv'

# read in the file line by line and transform it to a list of lists
lines = []
with open(filename, 'r') as input_file:  # not named `input`, so we don't shadow the built-in input()
    for line in input_file:
        # as we know the values are separated by ; we split them first
        line_parts = line.split(';')
        # the last item contains a newline character, which we want to strip
        line_parts[-1] = line_parts[-1].rstrip('\n')
        # now we can append it to our lines list
        lines.append(line_parts)

# remove and print header line
header = lines.pop(0)
print(f'here are the csv file header fields: {header}')

# find out which types we have
types = []
for item in lines:
    types.append(item[3])
# or do the same thing as above in a more pythonic way
types = [item[3] for item in lines]
# get the unique values for publication types
publication_types = set(types)
print(f'the following publications types can be found: {publication_types}')
# now check how many items of each type there are
type_stats = {item_type: 0 for item_type in publication_types}
for item in lines:
    type_stats[item[3]] += 1
print(f'the following amounts of publication types could be found: {type_stats}')

# collect average word count and ai mentions in newspaper publications
stats = {'news': {'total_words': 0, 'total_ai_mentions': 0, 'count': 0}}
for item in lines:
    if item[3] == 'news':
        stats['news']['count'] += 1
        stats['news']['total_words'] += int(item[1])
        stats['news']['total_ai_mentions'] += int(item[2])

avg_words_news = stats['news']['total_words'] / stats['news']['count']
avg_ai_mentions_news = stats['news']['total_ai_mentions'] / stats['news']['count']
print(f'news articles have on average ~{round(avg_words_news, 2)} words and mention AI ~{round(avg_ai_mentions_news)} times on average')

As a file: snippets/data_prep1.py
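The averages above are only collected for the `news` type. The following sketch extends the same idea to all publication types and uses Python's built-in `csv` module for the splitting. The sample rows, the header names and the type labels `journal` and `other` are made up for illustration; only the column order (bs_factor, words, AI mentions, type) follows the indices used in the snippets.

```python
import csv
import io

# made-up sample rows in the assumed column order
# (bs_factor; words; ai_mentions; type), not the real data set
sample_csv = '''bs_factor;words;ai_mentions;type
0.5;1000;10;news
0.7;2000;30;news
0.2;8000;40;journal
0.9;500;20;other
'''

# csv.reader does the splitting and newline stripping for us
reader = csv.reader(io.StringIO(sample_csv), delimiter=';')
header = next(reader)
lines = [row for row in reader if row]

# collect the totals per publication type, whatever types show up
stats = {}
for item in lines:
    entry = stats.setdefault(item[3], {'total_words': 0, 'total_ai_mentions': 0, 'count': 0})
    entry['count'] += 1
    entry['total_words'] += int(item[1])
    entry['total_ai_mentions'] += int(item[2])

for item_type, entry in stats.items():
    avg_words = entry['total_words'] / entry['count']
    avg_mentions = entry['total_ai_mentions'] / entry['count']
    print(f'{item_type}: ~{round(avg_words, 2)} words, ~{round(avg_mentions, 2)} AI mentions on average')
```

To run this against the real file, you could replace the `io.StringIO(sample_csv)` part with a file opened via `open(filename, newline='')`.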

Fetching online .csv and basic plots

For this example we are not using the same file. What the URL returns is a subset of our fictional research data. So it might differ a little from the data in the file you downloaded, but the structure is the same and generally the statistics should match too.

import requests
import sys
import matplotlib.pyplot as plt

file_url = 'https://tantemalkah.at/2023/machine-learning/bs-dataset/'

# read in the file line by line and transform it to a list of lists
r = requests.get(file_url)
if r.status_code != 200:
    print(f'Something seems to be wrong. HTTP response status code was: {r.status_code}')
    sys.exit(1)
# first split the whole response text string on newlines
unsplit_lines = r.text.split('\n')
# then create a list of lists where the lines are split on ;
lines = [line.split(';') for line in unsplit_lines]

# remove and print header line
header = lines.pop(0)
print(f'here are the csv file header fields: {header}')
# and also pop the last line, because it is an empty one
lines.pop(-1)

# now we could do exactly the same things as in data_prep1.py

# we can also create simple plots of the data
# let's try that with words against the bs_factor
words = [int(item[1]) for item in lines]
bs_factors = [float(item[0]) for item in lines]
fig, ax = plt.subplots()
ax.scatter(words, bs_factors, s=1)
plt.show()

# ok, what a mess, we'll have to make some sense of that later. but let's
# also plot the ai_mentions first
mentions = [int(item[2]) for item in lines]
fig, ax = plt.subplots()
ax.scatter(mentions, bs_factors, s=1)
plt.show()

# doesn't look so much better, but let's try to use a mentions per words ratio
mentions_per_word = [mentions[i] / words[i] for i in range(len(lines))]
fig, ax = plt.subplots()
ax.scatter(mentions_per_word, bs_factors, s=1)
plt.show()

# stunning, who would have thought that there is such a clear correlation?
# no more manual bs paper analysis, a linreg machine learning alg might just
# do that for us splendidly

As a file: snippets/data_prep2.py
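Popping the last line only works as long as the response ends with exactly one empty line. A slightly more forgiving way to split the response text is sketched below. The function name `parse_csv_text` is made up, and `sample_text` is a made-up stand-in for `r.text` (including the header names), so that the sketch runs without a network connection.

```python
def parse_csv_text(text, delimiter=';'):
    '''Split raw CSV text into a header list and a list of data rows.'''
    # splitlines() copes with \n as well as \r\n line endings,
    # and the if-clause skips empty lines wherever they appear
    rows = [line.split(delimiter) for line in text.splitlines() if line.strip()]
    header = rows[0]
    return header, rows[1:]


# made-up stand-in for r.text; column names and values are assumptions
sample_text = 'bs_factor;words;ai_mentions;type\r\n0.5;1000;10;news\r\n0.8;500;20;other\r\n\r\n'
header, lines = parse_csv_text(sample_text)

words = [int(item[1]) for item in lines]
mentions = [int(item[2]) for item in lines]
# zip pairs up the two lists, so we do not need to juggle indices
mentions_per_word = [m / w for m, w in zip(mentions, words)]
print(header)
print(mentions_per_word)
```

In data_prep2.py you would call `parse_csv_text(r.text)` after the status code check and continue with the plots as before.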