# Session 7: Getting into linear regression

In this session we will familiarise ourselves with the idea of linear regression and start to devise an algorithm from scratch that can learn to find the best approximation for a line in a data set that shows a roughly linear characteristic. We will use our [](ai_bs_trainingset.csv) from the fictional scenario in [](session4.md).

## Some theory first

To get a first insight into how linear regression is supposed to work, we will use the first three chapters of Google's [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course):

* [Framing](https://developers.google.com/machine-learning/crash-course/framing)
* [Descending into ML](https://developers.google.com/machine-learning/crash-course/descending-into-ml)
* [Reducing Loss](https://developers.google.com/machine-learning/crash-course/reducing-loss)

If at the end of this session you are still a bit puzzled about how all this works, take a look at the Linear Regression section of Kylie Ying's Machine Learning for Everybody course, which is linked in the [](references.md) section and [starts at 2:10:12](https://www.youtube.com/watch?v=i_LwzRVP7bg&t=7812s) of the whole video.

## Coding ML from scratch

Before we start to use TensorFlow, we will create our very own implementation of a linear regression model, in order to see how these things are (or could be) built. In practice such details are hidden away by the libraries we use, which provide highly optimised implementations that will usually be much more performant than anything we fiddle together from scratch.

### Inspecting the data

```python
import pandas as pd
import matplotlib.pyplot as plt

# our training data
file = 'ai_bs_trainingset.csv'

# read in the data frame from the CSV file and add a column for ai_mentions / words
df = pd.read_csv(file, delimiter=';')
df['ai_per_words'] = df['ai_mentions'] / df['words']

# create two subsets for news and journal entries only
df_news = df.loc[df.type == 'news']
df_journal = df.loc[df.type == 'journal']

# plot the ai_per_words against bs_factor for our three different dataframes
df.plot.scatter(x='ai_per_words', y='bs_factor', title='all types', figsize=(12, 8))
df_news.plot.scatter(x='ai_per_words', y='bs_factor', title='news', figsize=(12, 8))
df_journal.plot.scatter(x='ai_per_words', y='bs_factor', title='journal', figsize=(12, 8))

# and display them
plt.show()

# come back to the part below once you have trained your model

# whether to include a plot of predicted data from a trained model
PREDICTION = False

# some configurations for the plotting of a trained model
MODEL_B = 0.6
MODEL_W1 = 2.0000000000000004
START_X = 0.0
STEP_SIZE = 0.001
STEPS = 85

if PREDICTION:
    prediction = []
    feature = []
    x = START_X
    for i in range(STEPS):
        prediction.append(MODEL_B + MODEL_W1 * x)
        feature.append(x)
        x += STEP_SIZE
    df_prediction = pd.DataFrame({'ai_mentions_per_word': feature, 'predicted_bs_factor': prediction})
    df_prediction.plot.scatter(x='ai_mentions_per_word', y='predicted_bs_factor', title='model prediction', figsize=(12, 8))
    plt.show()
```

As a file: [](snippets/linreg_plots.py)
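If you want a quick numerical check to go along with the plots, you can ask pandas for the Pearson correlation coefficient between the feature and the label; values close to 1 or -1 indicate a strong linear relationship. This is an optional extra, not part of the script above:

```python
# optional sanity check: Pearson correlation between feature and label,
# assuming df, df_news and df_journal from the script above
print(f"all types: {df['ai_per_words'].corr(df['bs_factor']):.3f}")
print(f"news:      {df_news['ai_per_words'].corr(df_news['bs_factor']):.3f}")
print(f"journal:   {df_journal['ai_per_words'].corr(df_journal['bs_factor']):.3f}")
```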
### Training the model

```python
import pandas as pd

VERBOSE = True

# our training data
file = 'ai_bs_trainingset.csv'

# read in the data frame from the CSV file and add a column for ai_mentions / words
df = pd.read_csv(file, delimiter=';')
df['ai_per_words'] = df['ai_mentions'] / df['words']

# now randomly shuffle our data set (in case of doubt, always a good practice
# before starting to train and validate)
df = df.sample(frac=1).reset_index(drop=True)
df = df.loc[df.type == 'journal'].reset_index(drop=True)

labels = df['bs_factor']
features = df['ai_per_words']

b = 0.0
w1 = 0.0
learning_rate = 0.1
epochs = 20


def calculate_mse(x, y, b, w1):
    """Calculate the L2 loss for a data set with given model parameters

    In this case we work with the following model:

        y = b + w1 * x1

    So we have just one feature (x1) which should be used to predict the
    label (y). As an error function the L2 loss is used, calculated as:

        loss = sum of all (ŷ - y)² for every x in the dataset

    :param x: the x values aka features
    :param y: the y values aka labels
    :param b: the bias of our linear regression model
    :param w1: the weight of our linear regression model
    :return: the L2 loss (sum of all squared errors) for the provided data set
    """
    loss = 0
    for i in range(len(y)):
        y_hat = b + w1 * x[i]
        loss += (y_hat - y[i]) ** 2
    return loss


for e in range(epochs):
    # calculate the error (loss) for the current model
    error = calculate_mse(features, labels, b, w1)

    # calculate the error (loss) for adapted weights
    error_w1_plus = calculate_mse(features, labels, b, w1 + learning_rate)
    error_w1_minus = calculate_mse(features, labels, b, w1 - learning_rate)

    # calculate the error (loss) for adapted bias
    error_b_plus = calculate_mse(features, labels, b + learning_rate, w1)
    error_b_minus = calculate_mse(features, labels, b - learning_rate, w1)

    # in case we want to see what is going on, provide some output
    if VERBOSE:
        print(f'-------------------------\nEpoch {e}:')
        print(f'current b/w1: {b:{2}.{3}} / {w1:{2}.{3}} error: {error}')

    # now adjust the values accordingly
    if error_w1_plus < error_w1_minus:
        w1 += learning_rate
    else:
        w1 -= learning_rate

    if error_b_plus < error_b_minus:
        b += learning_rate
    else:
        b -= learning_rate

    # and also output the adapted values in verbose mode
    if VERBOSE:
        print(f'adapted b/w1: {b:{2}.{3}} / {w1:{2}.{3}}')

# print the results
print(f'\n\n-------------------------\nCalculation completed after {e+1} epochs')
print(f'The final model: y_hat = {b} + {w1} * x1')
print(f'Have fun predicting some bs!')
```

As a file: [](snippets/linreg_training.py)
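Note that the loop above does not compute an actual gradient: it simply evaluates the loss one `learning_rate` step to either side of each parameter and moves in whichever direction lowers the loss. If you have read the *Reducing Loss* chapter, you might want to try a proper gradient descent update instead. Here is a minimal sketch of what that could look like for our model (my own variation, not part of the course scripts); it uses the partial derivatives of the summed squared error, ∂loss/∂b = 2 · Σ(ŷ − y) and ∂loss/∂w1 = 2 · Σ(ŷ − y) · x:

```python
# a sketch of a gradient descent variant of the training loop; assumes
# features, labels, b, w1 and epochs are set up as in the script above
learning_rate = 0.01  # gradient steps usually need a smaller rate than above

for e in range(epochs):
    # predictions and residuals for all rows at once
    y_hat = b + w1 * features
    residuals = y_hat - labels

    # partial derivatives of the summed squared error
    grad_b = 2 * residuals.sum()
    grad_w1 = 2 * (residuals * features).sum()

    # step against the gradient to reduce the loss; you may need to tune
    # the learning rate for your data set to keep the updates stable
    b -= learning_rate * grad_b
    w1 -= learning_rate * grad_w1

print(f'Gradient descent result: y_hat = {b} + {w1} * x1')
```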
## Comparisons and improvements

When creating the *ai bs datasets* I initially played around with different versions that generated different degrees of "linearity" in the data. One version showed quite strong linearity, but I decided it would be better to have something that spreads a little more and has more outliers, because that makes it more interesting to interpret the learning process and the results we get. (I also initially uploaded this strongly linear data set by accident, instead of the final version.) By now the data set above should be the final one, where the data is spread out a bit more. But you can still use the initial one if you want to compare how the algorithm works and how well the line fits in that case: [](ai_bs_trainingset_superlinear.csv)

Jaro also provided an improved version of the above script as a Jupyter Notebook, which you can find under the following gist: https://gist.github.com/anuejn/9d329c9b62e499305202a597bd023ec9

It not only makes use of the dataframe features when calculating the error (no need for a whole for loop, as sketched below), but also adds the very nice feature of updating plots in the notebook, so you can actually watch how the line gets continually fitted. To run it, you either have to set up Jupyter Notebook locally or use an online notebook service, e.g. on kaggle.com.
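To give a rough idea of what such a vectorised error calculation looks like, here is `calculate_mse` rewritten to operate on whole pandas Series at once (my own illustration, not code taken from the notebook):

```python
def calculate_mse_vectorised(x, y, b, w1):
    """Same L2 loss as calculate_mse above, computed without an explicit loop.

    x and y are expected to be pandas Series (or NumPy arrays), so the
    arithmetic is applied element-wise to all rows at once.
    """
    y_hat = b + w1 * x               # predictions for every row
    return ((y_hat - y) ** 2).sum()  # sum of squared errors
```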