Session 7: Getting into linear regression

In this session we will familiarise ourselves with the idea of linear regression and start to devise an algorithm from scratch that learns to find the best-fitting line for a data set with a roughly linear characteristic. We will use our ai_bs_trainingset.csv from the fictional scenario in Session 4: Reading and analysing .csv data sets.

Some theory first

To get a first insight into how linear regression is supposed to work, we will use the first three chapters of Google’s Machine Learning Crash Course:

If at the end of this session you are still a bit puzzled about how all this works, take a look at the Linear Regression section of Kylie Ying’s Machine Learning for Everybody course, which is linked in the References section; the section starts at 2:10:12 in the video.

Coding ML from scratch

Before we start to use TensorFlow, we will create our very own implementation of a linear regression model, in order to see how these things are (or could be) built. Those details are usually hidden away by the libraries we use in practice, although the libraries provide highly optimised implementations that will usually be much more performant than anything we fiddle together from scratch.
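Just to illustrate what such a library hides from us, this is roughly how the same kind of fit could look with scikit-learn. This is only a minimal sketch for comparison (it assumes scikit-learn is installed and is not part of this session’s snippets):

import pandas as pd
from sklearn.linear_model import LinearRegression

# read the training data and derive the ai_mentions / words feature, as in the snippets below
df = pd.read_csv('ai_bs_trainingset.csv', delimiter=';')
df['ai_per_words'] = df['ai_mentions'] / df['words']
df_journal = df.loc[df.type == 'journal']

# fit bs_factor ≈ b + w1 * ai_per_words and print the learned parameters
model = LinearRegression()
model.fit(df_journal[['ai_per_words']], df_journal['bs_factor'])
print(f'b: {model.intercept_}  w1: {model.coef_[0]}')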

Inspecting the data

import pandas as pd
import matplotlib.pyplot as plt

# our training data
file = 'ai_bs_trainingset.csv'

# read in the data frame from the CSV file and add a column for ai_mentions / words
df = pd.read_csv(file, delimiter=';')
df['ai_per_words'] = df['ai_mentions'] / df['words']

# create two subsets for news and journal entries only
df_news = df.loc[df.type == 'news']
df_journal = df.loc[df.type == 'journal']

# plot the ai_per_words against bs_factor for our three different dataframes
df.plot.scatter(x='ai_per_words', y='bs_factor', title='all types', figsize=(12, 8))
df_news.plot.scatter(x='ai_per_words', y='bs_factor', title='news', figsize=(12, 8))
df_journal.plot.scatter(x='ai_per_words', y='bs_factor', title='journal', figsize=(12, 8))

# and display them
plt.show()


# come back to the part below, once you have trained your model

# whether to include a plot of predicted data from a trained model
PREDICTION = False
# some configurations for the plotting of a trained model
MODEL_B = 0.6
MODEL_W1 = 2.0000000000000004
START_X = 0.0
STEP_SIZE = 0.001
STEPS = 85

if PREDICTION:
    prediction = []
    feature = []
    x = START_X
    for i in range(STEPS):
        prediction.append(MODEL_B + MODEL_W1 * x)
        feature.append(x)
        x += STEP_SIZE
    df_prediction = pd.DataFrame({'ai_mentions_per_word': feature, 'predicted_bs_factor': prediction})
    df_prediction.plot.scatter(x='ai_mentions_per_word', y='predicted_bs_factor', title='model prediction', figsize=(12,8))
    plt.show()

As a file: snippets/linreg_plots.py
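If you want to see the data and the model prediction in a single figure rather than two separate plots, you could also overlay the fitted line directly on the journal scatter plot. A minimal sketch, assuming it is appended to the snippet above (so df_journal, MODEL_B, MODEL_W1, START_X, STEP_SIZE, STEPS and plt already exist) and that numpy is available:

import numpy as np

# scatter plot of the journal data with the model line y = MODEL_B + MODEL_W1 * x drawn on top
ax = df_journal.plot.scatter(x='ai_per_words', y='bs_factor', title='journal + model', figsize=(12, 8))
xs = np.linspace(START_X, START_X + STEPS * STEP_SIZE, 100)
ax.plot(xs, MODEL_B + MODEL_W1 * xs, color='red')
plt.show()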

Training the model

import pandas as pd

VERBOSE = True

# our training data
file = 'ai_bs_trainingset.csv'

# read in the data frame from the CSV file and add a column for ai_mentions / words
df = pd.read_csv(file, delimiter=';')
df['ai_per_words'] = df['ai_mentions'] / df['words']

# now randomly shuffle our data set (when in doubt, this is always good practice before training and validating)
df = df.sample(frac=1).reset_index(drop=True)

# for this first model we only use the journal entries
df = df.loc[df.type == 'journal'].reset_index(drop=True)

labels = df['bs_factor']
features = df['ai_per_words']

b = 0.0
w1 = 0.0
learning_rate = 0.1
epochs = 20


def calculate_mse(x, y, b, w1):
    """Calculate the L2 loss for a data set with given model parameters

    In this case we work with the following model: y = b + w1 * x1
    So we have just one feature (x1) which should be used to predict the label (y)
    As an error function the L2 loss is used, calculated as:
    loss = sum of (ŷ - y)² over every sample in the dataset
    (despite the function name, this is the plain sum of squared errors, not the mean)

    :param x: the x values aka feature
    :param y: the y values aka labels
    :param b: the bias of our linear regression model
    :param w1: the weight of our linear regression model
    :return: the L2 loss (sum of all squared errors) for the provided data set
    """
    loss = 0
    for i in range(len(y)):
        y_hat = b + w1 * x[i]
        loss += (y_hat - y[i]) ** 2
    return loss


for e in range(epochs):
    # calculate the error (loss) for the current model
    error = calculate_mse(features, labels, b, w1)
    # calculate the error (loss) for adapted weights
    error_w1_plus = calculate_mse(features, labels, b, w1 + learning_rate)
    error_w1_minus = calculate_mse(features, labels, b, w1 - learning_rate)
    # calculate the error (loss) for adapted bias
    error_b_plus = calculate_mse(features, labels, b + learning_rate, w1)
    error_b_minus = calculate_mse(features, labels, b - learning_rate, w1)

    # in case we want to see what is going on, provide some output
    if VERBOSE:
        print(f'-------------------------\nEpoch {e}:')
        print(f'current b/w1: {b:{2}.{3}} / {w1:{2}.{3}}   error: {error}')

    # now adjust the values accordingly
    if error_w1_plus < error_w1_minus:
        w1 += learning_rate
    else:
        w1 -= learning_rate
    if error_b_plus < error_b_minus:
        b += learning_rate
    else:
        b -= learning_rate

    # and also output the adapted values in verbose mode
    if VERBOSE:
        print(f'adapted b/w1: {b:{2}.{3}} / {w1:{2}.{3}}')

# print the results
print(f'\n\n-------------------------\nCalculation completed after {e+1} epochs ')
print(f'The final model: y_hat = {b} + {w1} * x1')
print(f'Have fun predicting some bs!')

As a file: snippets/linreg_training.py
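Note that the loop above does not compute a gradient: it simply tries b and w1 shifted by ± learning_rate and keeps whichever direction gives the smaller loss, so the parameters always move in fixed steps. The crash course chapters describe gradient descent instead, where the step size depends on how steep the loss is. If you want to experiment with that, here is a minimal sketch of how the same model could be trained with gradient descent; calculate_gradients is a hypothetical helper, not part of the snippet above, and the learning rate and number of epochs will likely need tuning for this data set:

def calculate_gradients(x, y, b, w1):
    """Gradients of the summed squared error with respect to b and w1"""
    grad_b = 0.0
    grad_w1 = 0.0
    for i in range(len(y)):
        error = (b + w1 * x[i]) - y[i]
        grad_b += 2 * error          # derivative of (y_hat - y)^2 with respect to b
        grad_w1 += 2 * error * x[i]  # derivative of (y_hat - y)^2 with respect to w1
    return grad_b, grad_w1

b, w1 = 0.0, 0.0
learning_rate = 0.01   # likely needs tuning, since the gradients are sums over all samples
for e in range(200):
    grad_b, grad_w1 = calculate_gradients(features, labels, b, w1)
    b -= learning_rate * grad_b
    w1 -= learning_rate * grad_w1
print(f'gradient descent model: y_hat = {b} + {w1} * x1')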

Comparisons and improvements

When creating the ai bs datasets I initially played around with different versions that produced different degrees of “linearity” in the data, including one dataset with a quite strong linear relationship. In the end I decided that something that spreads a little more and has more outliers would be better, because it makes interpreting the learning process and the results more interesting. I also accidentally uploaded that strongly linear dataset at first, instead of the final version. By now the dataset above should be the final one, where the data is spread out a bit more. You can still use the initial one if you want to compare how the algorithm works and how well the line fits in that case: ai_bs_trainingset_superlinear.csv

Jaro also provided an improved version of the above script as a Jupyter Notebook, which you can find in the following gist: https://gist.github.com/anuejn/9d329c9b62e499305202a597bd023ec9
It not only makes use of the dataframe features when calculating the error (no need for an explicit for loop), but also adds the very nice feature of updating the plots inside the notebook, so you can actually watch how the line gets continually fitted. To run it you either have to set up Jupyter Notebooks locally or use an online notebook service, e.g. on kaggle.com.
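For illustration, a vectorised error calculation in that spirit could look roughly like this; a minimal sketch assuming the df, b and w1 from the training script above, not a copy of the notebook:

# vectorised L2 loss: pandas applies the arithmetic to whole columns at once,
# so no explicit loop over the rows is needed
def calculate_loss_vectorised(x, y, b, w1):
    y_hat = b + w1 * x            # predictions for all samples at once
    return ((y_hat - y) ** 2).sum()

loss = calculate_loss_vectorised(df['ai_per_words'], df['bs_factor'], b, w1)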