Session 7: Getting into linear regression
In this session we will familiarise ourselves with the idea of linear regression
and start to devise an algorithm from scratch that learns to find the line that
best approximates a data set with roughly linear characteristics.
We will use our ai_bs_trainingset.csv from the fictional scenario in
Session 4: Reading and analysing .csv data sets.
Some theory first
To get a first insight into how linear regression is supposed to work, we will use the first three chapters of Google’s Machine Learning Crash Course.
If you are still a bit puzzled at the end of this session about how all this works, take a look at the Linear Regression section of Kylie Ying’s Machine Learning for Everybody course, which is linked in the References section. The section starts at 2:10:12 of the full video.
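As a small orientation before the code further down, here is a minimal sketch of the model and loss used in this session: a line y_hat = b + w1 * x1 and the L2 loss (the sum of squared errors). The function names and the three data points are made up for illustration and are not part of the course material.

```python
# Minimal sketch: a one-feature linear model and its L2 loss.
# The names predict/l2_loss and the data points are illustrative assumptions.

def predict(x, b, w1):
    """Return the model output y_hat = b + w1 * x for a single feature value."""
    return b + w1 * x


def l2_loss(xs, ys, b, w1):
    """Return the sum of squared errors over all (x, y) pairs."""
    return sum((predict(x, b, w1) - y) ** 2 for x, y in zip(xs, ys))


# three points that lie exactly on the line y = 1 + 2 * x
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]

print(l2_loss(xs, ys, b=1.0, w1=2.0))  # the matching line: no error at all
print(l2_loss(xs, ys, b=0.0, w1=2.0))  # every prediction is off by 1
```

The better the line matches the points, the smaller the loss. Training means searching for the values of b and w1 that make this number as small as possible.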
Coding ML from scratch
Before we start using TensorFlow, we will write our very own implementation of a linear regression model to see how these things are (or could be) built. In practice, such details are usually hidden away by the libraries we use, which provide highly optimised implementations that will usually perform much better than anything we fiddle together from scratch.
Inspecting the data
import pandas as pd
import matplotlib.pyplot as plt
# our training data
file = 'ai_bs_trainingset.csv'
# read in the data frame from the CSV file and add a column for ai_mentions / words
df = pd.read_csv(file, delimiter=';')
df['ai_per_words'] = df['ai_mentions'] / df['words']
# create two subsets for news and journal entries only
df_news = df.loc[df.type == 'news']
df_journal = df.loc[df.type == 'journal']
# plot the ai_per_words against bs_factor for our three different dataframes
df.plot.scatter(x='ai_per_words', y='bs_factor', title='all types', figsize=(12, 8))
df_news.plot.scatter(x='ai_per_words', y='bs_factor', title='news', figsize=(12, 8))
df_journal.plot.scatter(x='ai_per_words', y='bs_factor', title='journal', figsize=(12, 8))
# and display them
plt.show()
# come back to the part below, once you trained your model
# whether to include a plot of predicted data from a trained model
PREDICTION = False
# some configurations for the plotting of a trained model
MODEL_B = 0.6
MODEL_W1 = 2.0000000000000004
START_X = 0.0
STEP_SIZE = 0.001
STEPS = 85
if PREDICTION:
    prediction = []
    feature = []
    x = START_X
    for i in range(STEPS):
        prediction.append(MODEL_B + MODEL_W1 * x)
        feature.append(x)
        x += STEP_SIZE
    df_prediction = pd.DataFrame({'ai_mentions_per_word': feature, 'predicted_bs_factor': prediction})
    df_prediction.plot.scatter(x='ai_mentions_per_word', y='predicted_bs_factor', title='model prediction', figsize=(12,8))
    plt.show()
As a file: snippets/linreg_plots.py
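A side note on the numbers in the script above: repeatedly adding a small float (x += STEP_SIZE, or adding a step of 0.1 to a weight in every epoch) accumulates tiny rounding errors. This is probably why a value such as MODEL_W1 can end up with a long tail of digits instead of a clean 2.0. The following sketch builds the same x grid by multiplication, which avoids the accumulation. Using MODEL_W1 = 2.0 and 20 additions of 0.1 are assumptions made for this illustration.

```python
# Sketch: build the prediction grid by multiplication instead of repeated addition.
# MODEL_W1 = 2.0 is a rounded value assumed for this illustration.
START_X = 0.0
STEP_SIZE = 0.001
STEPS = 85
MODEL_B = 0.6
MODEL_W1 = 2.0

# each x is computed directly from its index, so rounding errors do not add up
feature = [START_X + i * STEP_SIZE for i in range(STEPS)]
prediction = [MODEL_B + MODEL_W1 * x for x in feature]

# for comparison: accumulate 0.1 twenty times, the way a weight grows step by step
w1 = 0.0
for _ in range(20):
    w1 += 0.1

# the printed value may differ very slightly from 2.0 due to floating point rounding
print(w1)
print(feature[-1], prediction[-1])
```

For plotting, the difference is invisible. It only matters if you compare floats for exact equality or wonder where the odd trailing digits come from.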
Training the model
import pandas as pd
VERBOSE = True
# our training data
file = 'ai_bs_trainingset.csv'
# read in the data frame from the CSV file and add a column for ai_mentions / words
df = pd.read_csv(file, delimiter=';')
df['ai_per_words'] = df['ai_mentions'] / df['words']
# now randomly shuffle our data set (in case of doubt always a good practice before starting to train and validate)
df = df.sample(frac=1).reset_index(drop=True)
df = df.loc[df.type == 'journal'].reset_index(drop=True)
labels = df['bs_factor']
features = df['ai_per_words']
b = 0.0
w1 = 0.0
learning_rate = 0.1
epochs = 20
def calculate_mse(x, y, b, w1):
    """Calculate the L2 loss for a data set with given model parameters

    Note: despite the function name, this returns the sum (not the mean) of the squared errors.

    In this case we work with the following model: y = b + w1 * x1
    So we have just one feature (x1) which should be used to predict the label (y)

    As an error function the L2 loss is used and calculated as:
    loss = sum of all (ŷ - y)² for every x in the dataset

    :param x: the x values aka feature
    :param y: the y values aka labels
    :param b: the bias of our linear regression model
    :param w1: the weight of our linear regression model
    :return: the L2 loss (sum of all squared errors) for the provided data set
    """
    loss = 0
    for i in range(len(y)):
        y_hat = b + w1 * x[i]
        loss += (y_hat - y[i]) ** 2
    return loss
for e in range(epochs):
    # calculate the error (loss) for the current model
    error = calculate_mse(features, labels, b, w1)
    # calculate the error (loss) for adapted weights
    error_w1_plus = calculate_mse(features, labels, b, w1 + learning_rate)
    error_w1_minus = calculate_mse(features, labels, b, w1 - learning_rate)
    # calculate the error (loss) for adapted bias
    error_b_plus = calculate_mse(features, labels, b + learning_rate, w1)
    error_b_minus = calculate_mse(features, labels, b - learning_rate, w1)
    # in case we want to see what is going on, provide some output
    if VERBOSE:
        print(f'-------------------------\nEpoch {e}:')
        print(f'current b/w1: {b:{2}.{3}} / {w1:{2}.{3}} error: {error}')
    # now adjust the values accordingly
    if error_w1_plus < error_w1_minus:
        w1 += learning_rate
    else:
        w1 -= learning_rate
    if error_b_plus < error_b_minus:
        b += learning_rate
    else:
        b -= learning_rate
    # and also output the adapted values in verbose mode
    if VERBOSE:
        print(f'adapted b/w1: {b:{2}.{3}} / {w1:{2}.{3}}')
# print the results
print(f'\n\n-------------------------\nCalculation completed after {e+1} epochs ')
print(f'The final model: y_hat = {b} + {w1} * x1')
print(f'Have fun predicting some bs!')
As a file: snippets/linreg_training.py
Comparisons and improvements
When creating the AI BS data sets, I initially played around with several versions that showed different degrees
of “linearity”. One of them was very strongly linear. I then decided it would be better to have data that spreads
a little more and has more outliers, because this makes it more interesting to interpret the learning process and
the results we get. At first I also accidentally uploaded the strongly linear data set instead of the final version.
By now, the data set above should be the final one, in which the data is spread out a bit more. You can still use
the initial one if you want to compare how the algorithm behaves and how well the line fits in that case:
ai_bs_trainingset_superlinear.csv
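To put a number on how well a line fits, one option is to compare the trained b and w1 with the closed-form ordinary least squares solution, which gives the line with the smallest possible L2 loss for one feature. This is a different technique from the step-wise search in the training script and is shown only as a reference point. The function names and the sample points are assumptions for illustration. With the real data you would pass in the ai_per_words and bs_factor columns as lists.

```python
# Sketch: closed-form ordinary least squares for one feature, as a reference
# to compare a trained model against. Names and data points are illustrative.

def least_squares_line(xs, ys):
    """Return (b, w1) of the line y = b + w1 * x with minimal L2 loss."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    w1 = covariance / variance
    b = mean_y - w1 * mean_x
    return b, w1


def l2_loss(xs, ys, b, w1):
    """Return the sum of squared errors for the line y = b + w1 * x."""
    return sum((b + w1 * x - y) ** 2 for x, y in zip(xs, ys))


# made-up points that lie exactly on y = 1 + 2 * x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

best_b, best_w1 = least_squares_line(xs, ys)
print(f'reference line: y_hat = {best_b} + {best_w1} * x1')

# compare the loss of some other (b, w1) pair with the loss of the reference line
print(l2_loss(xs, ys, 0.6, 2.0), l2_loss(xs, ys, best_b, best_w1))
```

The closer the loss of your trained model is to the loss of the reference line, the better the search has done on that data set.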
Jaro provided an improved version of the above script as a Jupyter notebook, which you can find in the
following gist: https://gist.github.com/anuejn/9d329c9b62e499305202a597bd023ec9
It not only uses the dataframe's features to calculate the error (no need for an explicit for loop), but it
also adds the very nice feature of updating plots in the notebook. This way you can actually see how the line is
fitted step by step. To run it, you have to either set up Jupyter notebooks locally or use an online notebook service, e.g. on kaggle.com.
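The idea of calculating the error without a for loop can be sketched as follows. This is not the code from the notebook, only a minimal illustration of the approach, assuming pandas is installed. The function name and the sample values are made up.

```python
# Sketch: L2 loss computed with pandas Series arithmetic instead of a for loop.
# The function name and the sample values are assumptions for illustration.
import pandas as pd


def l2_loss_vectorized(x: pd.Series, y: pd.Series, b: float, w1: float) -> float:
    """Return the sum of squared errors for the model y_hat = b + w1 * x."""
    y_hat = b + w1 * x                      # applied to every row at once
    return float(((y_hat - y) ** 2).sum())  # square all errors, then add them up


features = pd.Series([0.0, 1.0, 2.0])
labels = pd.Series([1.0, 3.0, 5.0])

print(l2_loss_vectorized(features, labels, b=1.0, w1=2.0))
print(l2_loss_vectorized(features, labels, b=0.0, w1=2.0))
```

In the training script, df['ai_per_words'] and df['bs_factor'] are already Series, so they could be passed to such a function directly.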