K-Nearest-Neighbours and K-Means Clustering

In this session we will explore two more simple but versatile machine learning algorithms: K-Nearest-Neighbours classification (KNN for short), a supervised learning method, and K-Means clustering, an unsupervised one.

We will work analogously to what Noack & Sanner (2023) (listed in the References) present in their chapters 8 & 9 on “K-nächste-Nachbarn” (K-nearest neighbours) and “K-means-Clustering”. Their interactive JavaScript-based examples can be found at https://www.maschinennah.de/ki-buch/

We will now create simple Python versions of these algorithms.
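
Both algorithms measure how close two data points are using the Euclidean distance, d = √(dx² + dy²), where dx and dy are the differences between the two points in the two feature dimensions; this is exactly what the get_distance functions below compute.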

K-Nearest-Neighbours (KNN) classification

To make this work, we need two files in the same folder, knn_data.py and knn.py; we can then run the program with python knn.py.

Here is the dataset file, knn_data.py:

# adapted from the dataset of the example in chapter 8 of "Künstliche Intelligenz verstehen"
# https://www.maschinennah.de/ki-buch/
dataset = [
  {'cuteness': 0.21, 'fluffiness': 0.91, 'type': 'tarantula'},
  {'cuteness': 0.27, 'fluffiness': 0.80, 'type': 'tarantula'},
  {'cuteness': 0.15, 'fluffiness': 0.75, 'type': 'tarantula'},
  {'cuteness': 0.37, 'fluffiness': 0.87, 'type': 'tarantula'},
  {'cuteness': 0.29, 'fluffiness': 0.70, 'type': 'tarantula'},

  {'cuteness': 0.90, 'fluffiness': 0.90, 'type': 'bunny'},
  {'cuteness': 0.86, 'fluffiness': 0.80, 'type': 'bunny'},
  {'cuteness': 0.75, 'fluffiness': 0.84, 'type': 'bunny'},
  {'cuteness': 0.95, 'fluffiness': 0.75, 'type': 'bunny'},
  {'cuteness': 0.70, 'fluffiness': 0.65, 'type': 'bunny'},

  {'cuteness': 0.31, 'fluffiness': 0.22, 'type': 'shark'},
  {'cuteness': 0.14, 'fluffiness': 0.13, 'type': 'shark'},
  {'cuteness': 0.21, 'fluffiness': 0.06, 'type': 'shark'},
  {'cuteness': 0.11, 'fluffiness': 0.25, 'type': 'shark'},
  {'cuteness': 0.33, 'fluffiness': 0.11, 'type': 'shark'},

  {'cuteness': 0.90, 'fluffiness': 0.10, 'type': 'hedgehog'},
  {'cuteness': 0.80, 'fluffiness': 0.17, 'type': 'hedgehog'},
  {'cuteness': 0.70, 'fluffiness': 0.11, 'type': 'hedgehog'},
  {'cuteness': 0.75, 'fluffiness': 0.26, 'type': 'hedgehog'},
  {'cuteness': 0.92, 'fluffiness': 0.22, 'type': 'hedgehog'}
]

And this is the algorithm, knn.py:

from knn_data import dataset
from math import sqrt


def get_distance(item1, item2):
    """Returns the distance between to data points.

    The items have to be dictionaries containing values for cuteness and fluffiness.
    """
    dx = item1['cuteness'] - item2['cuteness']
    dy = item1['fluffiness'] - item2['fluffiness']
    return sqrt(dx**2 + dy**2)


def get_neighbours(item, k, dataset):
    """Returns the k nearest neighbours of item in dataset.

    :param item: The item to find the nearest neighbours for
    :param k: The number of nearest neighbours to find
    :param dataset: The dataset to find neighbours in
    :return: a list with the k nearest neighbours
    """
    all_diffs = [(get_distance(item, d), d) for d in dataset]
    sorted_diffs = sorted(all_diffs, key=lambda x: x[0])
    sorted_dataset = [x[1] for x in sorted_diffs]
    return sorted_dataset[0:k]


def classify(item, k, dataset):
    """Returns the classification of an unknown item.

    :param item: The item to classify
    :param k: The number of neighbours used to classify the item
    :param dataset: The dataset used to classify the item
    :return: Either the classified type of the item or "inconclusive"
    """
    neighbours = get_neighbours(item, k, dataset)
    rank = {}
    for neighbour in neighbours:
        if neighbour['type'] not in rank:
            rank[neighbour['type']] = 1
        else:
            rank[neighbour['type']] += 1
    ranked = sorted(rank.items(), key=lambda x: x[1], reverse=True)
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return ranked[0][0]
    else:
        return 'inconclusive'


# now let's define some new items whose type we don't know yet
new_items = [
    {'cuteness': 0, 'fluffiness': 0},
    {'cuteness': 1, 'fluffiness': 0},
    {'cuteness': 0, 'fluffiness': 1},
    {'cuteness': 1, 'fluffiness': 1},
    {'cuteness': 0.5, 'fluffiness': 0.5},
    {'cuteness': 0.5, 'fluffiness': 0.46},
    {'cuteness': 0.61, 'fluffiness': 0.5},
]

# and for each item print the prediction based on the dataset
for item in new_items:
    print(classify(item, 5, dataset))
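
To get a picture similar to the book's interactive examples, we could also visualise the dataset together with the new items. The following is a minimal sketch, assuming matplotlib is installed (e.g. via pip install matplotlib); it can be appended to knn.py:

# minimal visualisation sketch, assuming matplotlib is installed
import matplotlib.pyplot as plt

# draw the known data points, one colour per type
for animal in sorted({d['type'] for d in dataset}):
    points = [d for d in dataset if d['type'] == animal]
    plt.scatter([p['cuteness'] for p in points],
                [p['fluffiness'] for p in points], label=animal)

# draw the unknown items as black crosses
plt.scatter([i['cuteness'] for i in new_items],
            [i['fluffiness'] for i in new_items],
            color='black', marker='x', label='new items')

plt.xlabel('cuteness')
plt.ylabel('fluffiness')
plt.legend()
plt.show()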

K-Means clustering

To make this work, we again need two files in the same folder, kmeans_data.py and kmeans.py; we can then run the program with python kmeans.py.

Here is the dataset file, kmeans_data.py:

# adapted from the dataset of the example in chapter 9 of "Künstliche Intelligenz verstehen"
# https://www.maschinennah.de/ki-buch/
# every data point represents a temperature (in °C) and an associated precipitation amount (in mm/month)
dataset = [
    [17.6, 0.0], [20.2, 0.0], [19.0, 0.0], [19.9, 0.0], [19.1, 0.0], [18.8, 0.0], [20.3, 0.8], [20.5, 1.0], [21.0, 0.1],
    [20.3, 0.9], [20.9, 0.1], [2.0, 31.0], [1.3, 59.0], [-9.4, 0.0], [-3.6, 21.0], [1.7, 83.0], [2.2, 68.0],
    [-0.1, 4.0], [1.5, 42.0], [1.9, 96.0], [2.1, 72.0], [1.1, 46.0], [0.8, 30.0], [2.0, 18.0], [-21.8, 0.6],
    [-18.5, 1.0], [-7.5, 0.0], [-1.1, 2.0], [-0.7, 1.0], [-9.0, 13.9], [-0.2, 1.0], [-0.3, 0.5], [-6.3, 24.7],
    [-3.1, 11.0], [-31.9, 3.0], [-1.0, 10.0], [-31.7, 1.3], [-35.8, 0.0], [-11.9, 6.1], [-12.5, 8.2], [-14.3, 6.8],
    [-41.8, 0.0], [-6.0, 19.2], [-9.0, 91.3], [-20.2, 59.0], [-10.4, 19.5], [-6.0, 29.2], [-1.1, 40.0], [-0.2, 69.0],
    [-0.3, 45.0], [-2.4, 97.0], [1.6, 57.7], [0.2, 120.1], [1.3, 76.1], [-0.7, 69.1], [-0.8, 66.5], [27.0, 246.0],
    [27.0, 199.0], [27.0, 203.0], [26.7, 109.0], [27.1, 233.0], [27.1, 178.0], [26.6, 147.0], [26.6, 256.0],
    [27.0, 292.0], [27.0, 474.0], [26.8, 547.0], [26.7, 316.0], [26.9, 255.0], [24.9, 251.0], [24.9, 111.0],
    [9.5, 150.0], [26.3, 135.0], [28.1, 127.0], [18.8, 85.0], [12.2, 67.0], [25.8, 309.0], [25.6, 290.0], [27.1, 200.0],
    [18.0, 105.0], [27.3, 164.0], [27.9, 481.9], [7.9, 178.9], [21.8, 197.8], [25.1, 152.0], [27.1, 129.8],
    [20.7, 85.0], [23.6, 283.5], [21.0, 195.7], [25.8, 276.1], [6.0, 92.6], [25.9, 315.0], [25.8, 368.4], [27.5, 83.8],
    [13.2, 42.6], [16.3, 106.5], [25.9, 0.5], [19.8, 47.0], [19.5, 120.6], [19.8, 58.7], [20.4, 41.0], [19.1, 60.2],
    [19.4, 48.0], [17.6, 118.0], [16.8, 54.4], [17.6, 34.0], [18.3, 17.0], [15.3, 254.0], [17.0, 18.0], [15.8, 66.0],
    [15.1, 39.4], [17.7, 41.0], [15.6, 278.5], [16.2, 55.4], [14.5, 119.0], [13.4, 134.4], [10.5, 130.0], [10.1, 116.0],
    [16.6, 38.0], [22.3, 116.0], [22.0, 112.2], [10.5, 36.0], [9.7, 64.0], [7.5, 77.0], [8.4, 96.8], [7.6, 93.1],
    [5.8, 55.0], [11.7, 188.8], [5.9, 111.8], [8.8, 96.1], [10.2, 145.6], [11.4, 152.2], [8.1, 103.5]
]

And this is the algorithm, kmeans.py:

from kmeans_data import dataset
from math import sqrt
import random


class KMeansClustering:
    def __init__(self, dataset, k=5):
        """Initialise the algorithm with a dataset and the k value."""
        self.k = k
        self.dataset = dataset
        self.cluster_data = [[] for _ in range(k)]
        # k random points from the dataset are chosen as the initial centers
        indexes = random.sample(range(len(self.dataset)), k)
        self.cluster_centers = [dataset[idx] for idx in indexes]

    def cluster(self):
        """Put each data point into the nearest of the k clusters"""
        self.cluster_data = [[] for _ in range(self.k)]
        for item in self.dataset:
            min_center = 0
            min_distance = self.get_distance(item, self.cluster_centers[min_center])
            for idx in range(1, self.k):
                distance = self.get_distance(item, self.cluster_centers[idx])
                if distance < min_distance:
                    min_center = idx
                    min_distance = distance
            self.cluster_data[min_center].append(item)

    def set_new_means(self):
        """Calculate new cluster means based on the current cluster data."""
        for idx in range(self.k):
            count = len(self.cluster_data[idx])
            if count == 0:
                # guard against empty clusters: keep the old center
                continue
            x_sum = sum(item[0] for item in self.cluster_data[idx])
            y_sum = sum(item[1] for item in self.cluster_data[idx])
            self.cluster_centers[idx] = [x_sum / count, y_sum / count]

    def get_distance(self, item1, item2):
        """Returns the distance between to data points."""
        dx = item1[0] - item2[0]
        dy = item1[1] - item2[1]
        return sqrt(dx**2 + dy**2)


# Now let's use our algorithm to iteratively find better-fitting clusters
MAX_EPOCHS = 20
K = 5

# Initialise the algorithm and create the first clusters (with random centers)
kmeans = KMeansClustering(dataset, K)
kmeans.cluster()

# Now continually improve the clustering
for n in range(MAX_EPOCHS + 1):
    print('--------------')
    print(f'Epoch {n} / {MAX_EPOCHS}, Cluster centers: {kmeans.cluster_centers}')
    print(f'Distribution of cluster data points: {[len(cluster) for cluster in kmeans.cluster_data]}')
    old_centers = [center[:] for center in kmeans.cluster_centers]
    kmeans.set_new_means()
    # if the new cluster centers are the same as the old ones, we have
    # reached a stable result and can stop further calculations
    if kmeans.cluster_centers == old_centers:
        break
    kmeans.cluster()

# print the final result
print()
print('Result:')
print('=======')
print('centers:', kmeans.cluster_centers)
print('clusters:')
for cluster in kmeans.cluster_data:
    print(cluster)

# TODO: here we could use matplotlib to visualise the final result
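
As the final TODO suggests, we could visualise the result with matplotlib. Here is a minimal sketch, again assuming matplotlib is installed; it can replace the TODO comment at the end of kmeans.py:

# minimal visualisation sketch, assuming matplotlib is installed
import matplotlib.pyplot as plt

# draw each cluster's data points in their own colour
for idx, cluster in enumerate(kmeans.cluster_data):
    plt.scatter([point[0] for point in cluster],
                [point[1] for point in cluster], label=f'cluster {idx}')

# mark the final cluster centers with black crosses
plt.scatter([center[0] for center in kmeans.cluster_centers],
            [center[1] for center in kmeans.cluster_centers],
            color='black', marker='x', label='centers')

plt.xlabel('temperature (°C)')
plt.ylabel('precipitation (mm/month)')
plt.legend()
plt.show()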