Markov chains and linear regression

In this session we will delve into our first and quite simple machine learning algorithms, still far from doing any (allegedly) fancy deep neural network stuff.

First we will have some fun with Markov chains, to generate random text art based on some learning input.

Second, we will look into linear regression as a very simple computational way to predict certain values.

After this session you will have the basic building blocks to solve exercise 3.

Markov chains to generate nonsense(?) texts

A Markov chain - named after Andrey Markov, a Russian mathematician who published his first paper on the topic in 1906 - is a statistical tool that can be used for a lot of different things. The English Wikipedia article linked above tells us:

“A Markov chain or Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.”

This can be used neatly for text generation. The “sequence of possible events” is a sequence of characters or words appearing in a text (on which the chain/algorithm is trained). The stochastic model then simply describes, for a given sequence of N characters/words (N being a parameter for how far back the algorithm looks), which character/word follows with which probability.
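To make this more concrete, here is a tiny toy illustration (not part of the script below, purely for illustration) of what such a transition table looks like for the text 'abracadabra' with N = 1:

toy_text = 'abracadabra'
depth = 1
transitions = {}
for i in range(len(toy_text) - depth):
    precursor = toy_text[i:i+depth]
    next_char = toy_text[i+depth]
    transitions.setdefault(precursor, [])
    if next_char not in transitions[precursor]:
        transitions[precursor].append(next_char)
print(transitions)
# -> {'a': ['b', 'c', 'd'], 'b': ['r'], 'r': ['a'], 'c': ['a'], 'd': ['a']}

So after an 'a' the generator could continue with 'b', 'c' or 'd'; after every other character there is only one possibility.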

This can be used in a very efficient way for autocomplete suggestions, or to generate somewhat syntactically well-formed nonsense text. We will work on the funnier (and easier) side of creating nonsense text. And we will work along chapter two of Noack & Sanner (2023), listed in the References, which is available as an e-book at the Angewandte library from within the Angewandte network (also e.g. with a VPN connection). It contains some nice visualisations explaining how the Markov chain/process works in general, and specifically when working with texts.

In order to make use of the following script, download the text file of Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley from Project Gutenberg and save it in the same folder as the script, under the name ‘frankenstein.txt’. Then start playing around with the values in the script. And try using other texts or create your own input text.

import random

TEXT_FILE = 'frankenstein.txt'  # The name of the file with the training text
DEPTH = 6  # How many characters should be used for prediction of the next one
STARTER = 'The machine'  # Needs to be at least DEPTH characters long
LENGTH = 500  # Maximum length of characters for the resulting text.
DEBUG = False  # Print debug info (e.g. the transitions)

# read in the file line by line into a single string
text = ''
with open(TEXT_FILE, 'r', encoding='utf-8') as file:  # the Project Gutenberg file is UTF-8 encoded
    for line in file:
        text += line
        # possible variation:
        # text += line.lower()

# now check for all DEPTH long transitions to a next character in the text
transitions = {}
for i in range(len(text) - DEPTH):
    precursor = text[i:i+DEPTH]
    next_char = text[i+DEPTH]
    # if we haven't found this precursor yet, add it to the dict
    if precursor not in transitions:
        transitions[precursor] = [next_char]  # a set would handle duplicates automatically, but random.choice() below needs a sequence
    # otherwise check if the next character was already parsed, and add a new one if not
    else:
        if next_char not in transitions[precursor]:
            transitions[precursor].append(next_char)

if DEBUG:
    for i in transitions:
        print(f'{repr(i)}', transitions[i])

# now generate a text from the collected transitions and the STARTER
generated_text = STARTER
while len(generated_text) < LENGTH:
    precursor = generated_text[-DEPTH:]
    if precursor not in transitions:
        break
    generated_text += random.choice(transitions[precursor])

# print our new machinic work of literary random art
print(generated_text)

As a file: snippets/markov.py
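One detail worth noting: the script above stores each possible next character only once per precursor, so every observed continuation becomes equally likely. If you want the generated text to follow the character frequencies of the training text more closely, a small variation is to keep the duplicates, so that random.choice() picks common continuations more often. A minimal sketch of the changed collection step (the rest of the script stays the same):

# keep every occurrence, so frequent continuations appear more often in the list
transitions = {}
for i in range(len(text) - DEPTH):
    precursor = text[i:i+DEPTH]
    next_char = text[i+DEPTH]
    transitions.setdefault(precursor, []).append(next_char)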

Getting into linear regression

Now we move towards a bit more abstraction, away from working with words and characters and towards working directly with numbers (in computing we are, at least indirectly, always working with numbers anyway).

In this section we will familiarise ourselves with the idea of linear regression and start to devise an algorithm from scratch that can learn to find the best approximation for a line in a data set that shows some linear characteristic. We will use our ai_bs_trainingset.csv from the Fictional scenario in the self-learning chapter.

Some theory first

To get a first insight into how linear regression is supposed to work, we will use the first three chapters of Google’s Machine Learning Crash Course:

If at the end of this session you are still a bit puzzled how all this works, take a look at the Linear Regression section of Kylie Ying’s Machine Learning for Everybody course (linked in the References section), which starts at 2:10:12 of the whole video.
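Either way, the core idea we will implement below is simple: the model is a line, y_hat = b + w1 * x1, that predicts the label (the bs_factor) from a single feature (the AI mentions per word), and how good the line is gets measured by the squared error between prediction and actual label. A tiny worked example (the numbers are made up purely for illustration):

# purely illustrative values, not taken from the actual data set
b, w1 = 0.5, 2.0        # model parameters: bias and weight
x1 = 0.02               # feature: ai mentions per word
y = 0.6                 # actual label: bs_factor
y_hat = b + w1 * x1     # prediction: 0.54
squared_error = (y_hat - y) ** 2  # (0.54 - 0.6)² ≈ 0.0036
print(y_hat, squared_error)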

Coding ML from scratch

Before we start to use TensorFlow we will create our very own implementation of a linear regression model, in order to see how those things are (or could be) built. In practice this machinery is usually hidden away by the libraries we use (which also provide very optimised implementations that will usually be much more performant than anything we fiddle together from scratch).

Inspecting the data

import pandas as pd
import matplotlib.pyplot as plt

# our training data
file = 'ai_bs_trainingset.csv'

# read in the data frame from the CSV file and add a column for ai_mentions / words
df = pd.read_csv(file, delimiter=';')
df['ai_per_words'] = df['ai_mentions'] / df['words']

# create two subsets for news and journal entries only
df_news = df.loc[df.type == 'news']
df_journal = df.loc[df.type == 'journal']

# plot the ai_per_words against bs_factor for our three different dataframes
df.plot.scatter(x='ai_per_words', y='bs_factor', title='all types', figsize=(12, 8))
df_news.plot.scatter(x='ai_per_words', y='bs_factor', title='news', figsize=(12, 8))
df_journal.plot.scatter(x='ai_per_words', y='bs_factor', title='journal', figsize=(12, 8))

# and display them
plt.show()


# come back to the part below, once you have trained your model

# whether to include a plot of predicted data from a trained model
PREDICTION = False
# some configurations for the plotting of a trained model
MODEL_B = 0.6
MODEL_W1 = 2.0000000000000004
START_X = 0.0
STEP_SIZE = 0.001
STEPS = 85

if PREDICTION:
    prediction = []
    feature = []
    x = START_X
    for i in range(STEPS):
        prediction.append(MODEL_B + MODEL_W1 * x)
        feature.append(x)
        x += STEP_SIZE
    df_prediction = pd.DataFrame({'ai_mentions_per_word': feature, 'predicted_bs_factor': prediction})
    df_prediction.plot.scatter(
        x='ai_mentions_per_word',
        y='predicted_bs_factor',
        title='model prediction',
        figsize=(12,8),  # use this to set the size of the graph
        xlim=(0.0, 0.05),  # use this to limit the x axis
        ylim=(0, 1),  # use this to limit the y axis
    )
    plt.show()

As a file: snippets/linreg_plots.py
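If you want to see the data and a model prediction in one figure instead of two separate plots, you can draw the line on top of the scatter plot. A minimal sketch, reusing the df_journal dataframe and the MODEL_B / MODEL_W1 / START_X / STEP_SIZE / STEPS values from the script above (pandas plotting returns a matplotlib Axes object that we can draw onto):

ax = df_journal.plot.scatter(x='ai_per_words', y='bs_factor', title='journal + model', figsize=(12, 8))
xs = [START_X + i * STEP_SIZE for i in range(STEPS)]
ys = [MODEL_B + MODEL_W1 * x for x in xs]
ax.plot(xs, ys, color='red')  # the model line on top of the data points
plt.show()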

Training the model

import pandas as pd

VERBOSE = True

# our training data
file = 'ai_bs_trainingset.csv'

# read in the data frame from the CSV file and add a column for ai_mentions / words
df = pd.read_csv(file, delimiter=';')
df['ai_per_words'] = df['ai_mentions'] / df['words']

# now randomly shuffle our data set (in case of doubt always a good practice before starting to train and validate)
df = df.sample(frac=1).reset_index(drop=True)

# use only the journal entries for training (change or remove this line to train on another subset)
df = df.loc[df.type == 'journal'].reset_index(drop=True)

labels = df['bs_factor']
features = df['ai_per_words']

# initial model parameters (bias and weight)
b = 0.0
w1 = 0.0
# the fixed step size by which b and w1 get adjusted each epoch, and the number of epochs to train
learning_rate = 0.1
epochs = 20


def calculate_mse(x, y, b, w1):
    """Calculate the L2 loss for a data set with given model parameters

    In this case we work with the following model: y = b + w1 * x1
    So we have just one feature (x1) which should be used to predict the label (y)
    As an error function the L2 loss is used and calculated as:
    loss = sum of all (ŷ - y)² for every x in the dataset

    :param x: the x values aka feature
    :param y: the y values aka labels
    :param b: the bias of our linear regression model
    :param w1: the weight of our linear regression model
    :return: the L2 loss (sum of all squared errors) for the provided data set
    """
    loss = 0
    for i in range(len(y)):
        y_hat = b + w1 * x[i]
        loss += (y_hat - y[i]) ** 2
    return loss


for e in range(epochs):
    # calculate the error (loss) for the current model
    error = calculate_mse(features, labels, b, w1)
    # calculate the error (loss) for adapted weights
    error_w1_plus = calculate_mse(features, labels, b, w1 + learning_rate)
    error_w1_minus = calculate_mse(features, labels, b, w1 - learning_rate)
    # calculate the error (loss) for adapted bias
    error_b_plus = calculate_mse(features, labels, b + learning_rate, w1)
    error_b_minus = calculate_mse(features, labels, b - learning_rate, w1)

    # in case we want to see what is going on, provide some output
    if VERBOSE:
        print(f'-------------------------\nEpoch {e}:')
        print(f'current b/w1: {b:{2}.{3}} / {w1:{2}.{3}}   error: {error}')

    # now adjust the values accordingly
    if error_w1_plus < error_w1_minus:
        w1 += learning_rate
    else:
        w1 -= learning_rate
    if error_b_plus < error_b_minus:
        b += learning_rate
    else:
        b -= learning_rate

    # and also output the adapted values in verbose mode
    if VERBOSE:
        print(f'adapted b/w1: {b:{2}.{3}} / {w1:{2}.{3}}')

# print the results
print(f'\n\n-------------------------\nCalculation completed after {e+1} epochs ')
print(f'The final model: y_hat = {b} + {w1} * x1')
print(f'Have fun predicting some bs!')

As a file: snippets/linreg_training.py
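Once the training script has printed its final b and w1, using the model to make a prediction is just one line of arithmetic. For example (with hypothetical parameter values, roughly in the same ballpark as the MODEL_B / MODEL_W1 used for plotting above; yours will differ depending on the data and the training run):

# hypothetical values, replace them with the output of your own training run
b = 0.6
w1 = 2.0
ai_per_words = 0.015  # feature value of a new, unseen text
predicted_bs_factor = b + w1 * ai_per_words
print(predicted_bs_factor)  # -> roughly 0.63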

Comparisons and improvements

When creating the ai bs datasets I initially played around with different versions that produced different degrees of “linearity” in the data, and one of those datasets showed quite strong linearity. I then decided it would be better to have something that spreads a little more and has more outliers, because that makes it more interesting to interpret the learning process and the results we get. Also, I initially uploaded this strongly linear dataset by accident, instead of the final version. By now the dataset above should be the final one, where the data is spread out a bit more. But you can still use the initial one if you want to compare how the algorithm works and how well the line fits in that case: ai_bs_trainingset_superlinear.csv

In the 2023 instance of this course Jaro provided an improved version of the above script as a Jupyter Notebook, which you can find under the following gist: https://gist.github.com/anuejn/9d329c9b62e499305202a597bd023ec9
This not only makes use of the dataframe features when calculating the error (no need for a whole for loop), but it also adds the very nice feature of updating plots in the notebook. This way you can actually see how the line gets continually fitted. You have to either set up Jupyter notebooks locally or use some online notebook service, e.g. on kaggle.com.
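For reference, the “no for loop” part mentioned above comes down to letting pandas do the arithmetic on whole columns at once. A rough sketch of what such a vectorised loss function could look like (the notebook itself may differ in the details):

def calculate_mse(x, y, b, w1):
    # x and y are pandas Series; the arithmetic below is applied element-wise
    return ((b + w1 * x - y) ** 2).sum()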