# Session 7: Getting into linear regression

In this session we will familiarise ourselves with the idea of linear regression and start to devise an algorithm from scratch that can learn to find the best approximation for a line in a data set that shows a roughly linear characteristic. We will use our [](ai_bs_trainingset.csv) from the fictional scenario in [](session4.md).

## Some theory first

To get a first insight into how linear regression is supposed to work, we will use the first three chapters of Google's [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course):

* [Framing](https://developers.google.com/machine-learning/crash-course/framing)
* [Descending into ML](https://developers.google.com/machine-learning/crash-course/descending-into-ml)
* [Reducing Loss](https://developers.google.com/machine-learning/crash-course/reducing-loss)

If at the end of this session you are still a bit puzzled about how all this works, take a look at the Linear Regression section of Kylie Ying's Machine Learning for Everybody course, which is linked in the [](references.md) section and [starts at 2:10:12](https://www.youtube.com/watch?v=i_LwzRVP7bg&t=7812s) of the whole video.

## Coding ML from scratch

Before we start to use TensorFlow, we will create our very own implementation of a linear regression model, in order to see how these things are (or could be) built. In practice such details are hidden away by the libraries we use, which provide highly optimised implementations that will usually be much more performant than anything we fiddle together from scratch.

### Inspecting the data

```python
import pandas as pd
import matplotlib.pyplot as plt

# our training data
file = 'ai_bs_trainingset.csv'

# read in the data frame from the CSV file and add a column for ai_mentions / words
df = pd.read_csv(file, delimiter=';')
df['ai_per_words'] = df['ai_mentions'] / df['words']

# create two subsets for news and journal entries only
df_news = df.loc[df.type == 'news']
df_journal = df.loc[df.type == 'journal']

# plot the ai_per_words against bs_factor for our three different dataframes
df.plot.scatter(x='ai_per_words', y='bs_factor', title='all types', figsize=(12, 8))
df_news.plot.scatter(x='ai_per_words', y='bs_factor', title='news', figsize=(12, 8))
df_journal.plot.scatter(x='ai_per_words', y='bs_factor', title='journal', figsize=(12, 8))

# and display them
plt.show()

# come back to the part below once you have trained your model

# whether to include a plot of predicted data from a trained model
PREDICTION = False

# some configurations for the plotting of a trained model
MODEL_B = 0.6
MODEL_W1 = 2.0000000000000004
START_X = 0.0
STEP_SIZE = 0.001
STEPS = 85

if PREDICTION:
    prediction = []
    feature = []
    x = START_X
    for i in range(STEPS):
        prediction.append(MODEL_B + MODEL_W1 * x)
        feature.append(x)
        x += STEP_SIZE
    df_prediction = pd.DataFrame({'ai_mentions_per_word': feature, 'predicted_bs_factor': prediction})
    df_prediction.plot.scatter(x='ai_mentions_per_word', y='predicted_bs_factor', title='model prediction', figsize=(12, 8))
    plt.show()
```

As a file: [](snippets/linreg_plots.py)
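If you want a quick numerical check to go along with the plots, you can ask pandas for the Pearson correlation coefficient between the feature and the label; values close to 1 or -1 indicate a strong linear relationship. This is an optional extra, not part of the script above:

```python
# optional sanity check: Pearson correlation between feature and label,
# assuming df, df_news and df_journal from the script above
print(f"all types: {df['ai_per_words'].corr(df['bs_factor']):.3f}")
print(f"news:      {df_news['ai_per_words'].corr(df_news['bs_factor']):.3f}")
print(f"journal:   {df_journal['ai_per_words'].corr(df_journal['bs_factor']):.3f}")
```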
### Training the model

```python
import pandas as pd

VERBOSE = True

# our training data
file = 'ai_bs_trainingset.csv'

# read in the data frame from the CSV file and add a column for ai_mentions / words
df = pd.read_csv(file, delimiter=';')
df['ai_per_words'] = df['ai_mentions'] / df['words']

# now randomly shuffle our data set (in case of doubt, always a good practice
# before starting to train and validate)
df = df.sample(frac=1).reset_index(drop=True)
df = df.loc[df.type == 'journal'].reset_index(drop=True)

labels = df['bs_factor']
features = df['ai_per_words']

b = 0.0
w1 = 0.0
learning_rate = 0.1
epochs = 20


def calculate_mse(x, y, b, w1):
    """Calculate the L2 loss for a data set with given model parameters

    In this case we work with the following model:

        y = b + w1 * x1

    So we have just one feature (x1) which should be used to predict the
    label (y). As an error function the L2 loss is used, calculated as:

        loss = sum of all (ŷ - y)² for every x in the dataset

    :param x: the x values aka features
    :param y: the y values aka labels
    :param b: the bias of our linear regression model
    :param w1: the weight of our linear regression model
    :return: the L2 loss (sum of all squared errors) for the provided data set
    """
    loss = 0
    for i in range(len(y)):
        y_hat = b + w1 * x[i]
        loss += (y_hat - y[i]) ** 2
    return loss


for e in range(epochs):
    # calculate the error (loss) for the current model
    error = calculate_mse(features, labels, b, w1)

    # calculate the error (loss) for adapted weights
    error_w1_plus = calculate_mse(features, labels, b, w1 + learning_rate)
    error_w1_minus = calculate_mse(features, labels, b, w1 - learning_rate)

    # calculate the error (loss) for adapted bias
    error_b_plus = calculate_mse(features, labels, b + learning_rate, w1)
    error_b_minus = calculate_mse(features, labels, b - learning_rate, w1)

    # in case we want to see what is going on, provide some output
    if VERBOSE:
        print(f'-------------------------\nEpoch {e}:')
        print(f'current b/w1: {b:{2}.{3}} / {w1:{2}.{3}} error: {error}')

    # now adjust the values accordingly
    if error_w1_plus < error_w1_minus:
        w1 += learning_rate
    else:
        w1 -= learning_rate

    if error_b_plus < error_b_minus:
        b += learning_rate
    else:
        b -= learning_rate

    # and also output the adapted values in verbose mode
    if VERBOSE:
        print(f'adapted b/w1: {b:{2}.{3}} / {w1:{2}.{3}}')

# print the results
print(f'\n\n-------------------------\nCalculation completed after {e+1} epochs')
print(f'The final model: y_hat = {b} + {w1} * x1')
print(f'Have fun predicting some bs!')
```

As a file: [](snippets/linreg_training.py)
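Note that the loop above does not compute an actual gradient: it simply evaluates the loss one `learning_rate` step to either side of each parameter and moves in whichever direction lowers the loss. If you have read the *Reducing Loss* chapter, you might want to try a proper gradient descent update instead. Here is a minimal sketch of what that could look like for our model (my own variation, not part of the course scripts); it uses the partial derivatives of the summed squared error, ∂loss/∂b = 2 · Σ(ŷ − y) and ∂loss/∂w1 = 2 · Σ(ŷ − y) · x:

```python
# a sketch of a gradient descent variant of the training loop; assumes
# features, labels, b, w1 and epochs are set up as in the script above
learning_rate = 0.01  # gradient steps usually need a smaller rate than above

for e in range(epochs):
    # predictions and residuals for all rows at once
    y_hat = b + w1 * features
    residuals = y_hat - labels

    # partial derivatives of the summed squared error
    grad_b = 2 * residuals.sum()
    grad_w1 = 2 * (residuals * features).sum()

    # step against the gradient to reduce the loss; you may need to tune
    # the learning rate for your data set to keep the updates stable
    b -= learning_rate * grad_b
    w1 -= learning_rate * grad_w1

print(f'Gradient descent result: y_hat = {b} + {w1} * x1')
```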
## Comparisons and improvements

When creating the *ai bs datasets* I initially played around with different versions that generated different degrees of "linearity" in the data. One version showed quite strong linearity, but I decided it would be better to have something that spreads a little more and has more outliers, because that makes it more interesting to interpret the learning process and the results we get. (I also initially uploaded this strongly linear data set by accident, instead of the final version.) By now the data set above should be the final one, where the data is spread out a bit more. But you can still use the initial one if you want to compare how the algorithm works and how well the line fits in that case: [](ai_bs_trainingset_superlinear.csv)

Jaro also provided an improved version of the above script as a Jupyter Notebook, which you can find under the following gist: https://gist.github.com/anuejn/9d329c9b62e499305202a597bd023ec9

It not only makes use of the dataframe features when calculating the error (no need for a whole for loop, as sketched below), but also adds the very nice feature of updating plots in the notebook, so you can actually watch how the line gets continually fitted. To run it, you either have to set up Jupyter Notebook locally or use an online notebook service, e.g. on kaggle.com.
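To give a rough idea of what such a vectorised error calculation looks like, here is `calculate_mse` rewritten to operate on whole pandas Series at once (my own illustration, not code taken from the notebook):

```python
def calculate_mse_vectorised(x, y, b, w1):
    """Same L2 loss as calculate_mse above, computed without an explicit loop.

    x and y are expected to be pandas Series (or NumPy arrays), so the
    arithmetic is applied element-wise to all rows at once.
    """
    y_hat = b + w1 * x               # predictions for every row
    return ((y_hat - y) ** 2).sum()  # sum of squared errors
```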