K-Nearest-Neighbours and K-Means Clustering
In this session we will explore two more rather simple but versatile machine learning algorithms: K-Nearest-Neighbours classification (in short: KNN) and K-Means clustering.
We will work analog to what Noack & Sanner (2023) (listed in the References) present in their chapters 8 & 9 on “K-nächste-Nachbarn” and “K-means-Clustering”. Their interactive Javascript-based examples can be found on https://www.maschinennah.de/ki-buch/
We will now create simple Python versions of these algorithms.
KNN (K-nearest-neighbour) classification
To make this work, we need two files in the same folder, and then run python knn.py
:
the dataset:
snippets/knn_data.py
the algorithm:
snippets/knn.py
Here is the dataset file:
# adopted from dataset of example in chapter 8 of "Künstliche Intelligenz verstehen"
# https://www.maschinennah.de/ki-buch/
dataset = [
{'cuteness': 0.21, 'fluffiness': 0.91, 'type': 'tarantula'},
{'cuteness': 0.27, 'fluffiness': 0.80, 'type': 'tarantula'},
{'cuteness': 0.15, 'fluffiness': 0.75, 'type': 'tarantula'},
{'cuteness': 0.37, 'fluffiness': 0.87, 'type': 'tarantula'},
{'cuteness': 0.29, 'fluffiness': 0.70, 'type': 'tarantula'},
{'cuteness': 0.90, 'fluffiness': 0.90, 'type': 'bunny'},
{'cuteness': 0.86, 'fluffiness': 0.80, 'type': 'bunny'},
{'cuteness': 0.75, 'fluffiness': 0.84, 'type': 'bunny'},
{'cuteness': 0.95, 'fluffiness': 0.75, 'type': 'bunny'},
{'cuteness': 0.70, 'fluffiness': 0.65, 'type': 'bunny'},
{'cuteness': 0.31, 'fluffiness': 0.22, 'type': 'shark'},
{'cuteness': 0.14, 'fluffiness': 0.13, 'type': 'shark'},
{'cuteness': 0.21, 'fluffiness': 0.06, 'type': 'shark'},
{'cuteness': 0.11, 'fluffiness': 0.25, 'type': 'shark'},
{'cuteness': 0.33, 'fluffiness': 0.11, 'type': 'shark'},
{'cuteness': 0.90, 'fluffiness': 0.10, 'type': 'hedgehog'},
{'cuteness': 0.80, 'fluffiness': 0.17, 'type': 'hedgehog'},
{'cuteness': 0.70, 'fluffiness': 0.11, 'type': 'hedgehog'},
{'cuteness': 0.75, 'fluffiness': 0.26, 'type': 'hedgehog'},
{'cuteness': 0.92, 'fluffiness': 0.22, 'type': 'hedgehog'}
]
And this is the algorithm:
from knn_data import dataset
from math import sqrt
def get_distance(item1, item2):
"""Returns the distance between to data points.
The items have to be dictionaries containing values for cuteness and fluffiness.
"""
dx = item1['cuteness'] - item2['cuteness']
dy = item1['fluffiness'] - item2['fluffiness']
return sqrt(dx**2 + dy**2)
def get_neighbours(item, k, dataset):
"""Returns the k nearest neighbours of item in dataset.
:param item: The item to find the nearest neighbours for
:param k: The number of nearest neighbours to find
:param dataset: The dataset to find neighbours in
:return: a list with the k nearest neighbours
"""
all_diffs = [(get_distance(item, d), d) for d in dataset]
sorted_diffs = sorted(all_diffs, key=lambda x: x[0])
sorted_dataset = [x[1] for x in sorted_diffs]
return sorted_dataset[0:k]
def classify(item, k, dataset):
"""Returns the classification of an unknown item.
:param item: The item to classify
:param k: The number of neighbours used to classify the item
:param dataset: The dataset used to classify the item
:return: Either the classified type of the item or "inconclusive"
"""
neighbours = get_neighbours(item, k, dataset)
rank = {}
for neighbour in neighbours:
if neighbour['type'] not in rank:
rank[neighbour['type']] = 1
else:
rank[neighbour['type']] += 1
ranked = sorted(rank.items(), key=lambda x: x[1], reverse=True)
if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
return ranked[0][0]
else:
return 'inconclusive'
# now let's define some new items, which we don't know the type of yet
new_items = [
{'cuteness': 0, 'fluffiness': 0},
{'cuteness': 1, 'fluffiness': 0},
{'cuteness': 0, 'fluffiness': 1},
{'cuteness': 1, 'fluffiness': 1},
{'cuteness': 0.5, 'fluffiness': 0.5},
{'cuteness': 0.5, 'fluffiness': 0.46},
{'cuteness': 0.61, 'fluffiness': 0.5},
]
# and for each item print the prediction based on the dataset
for item in new_items:
print(classify(item, 5, dataset))
K-means clustering
To make this work, we need two files in the same folder, and then run python kmeans.py
:
the dataset:
snippets/kmeans_data.py
the algorithm:
snippets/kmeans.py
Here is the dataset file:
# adopted from dataset of example in chapter 9 of "Künstliche Intelligenz verstehen"
# https://www.maschinennah.de/ki-buch/
# every data point represents a temperature (in °C) and an associated precipitation amount (in mm/month)
dataset = [
[17.6, 0.0], [20.2, 0.0], [19.0, 0.0], [19.9, 0.0], [19.1, 0.0], [18.8, 0.0], [20.3, 0.8], [20.5, 1.0], [21.0, 0.1],
[20.3, 0.9], [20.9, 0.1], [2.0, 31.0], [1.3, 59.0], [-9.4, 0.0], [-3.6, 21.0], [1.7, 83.0], [2.2, 68.0],
[-0.1, 4.0], [1.5, 42.0], [1.9, 96.0], [2.1, 72.0], [1.1, 46.0], [0.8, 30.0], [2.0, 18.0], [-21.8, 0.6],
[-18.5, 1.0], [-7.5, 0.0], [-1.1, 2.0], [-0.7, 1.0], [-9.0, 13.9], [-0.2, 1.0], [-0.3, 0.5], [-6.3, 24.7],
[-3.1, 11.0], [-31.9, 3.0], [-1.0, 10.0], [-31.7, 1.3], [-35.8, 0.0], [-11.9, 6.1], [-12.5, 8.2], [-14.3, 6.8],
[-41.8, 0.0], [-6.0, 19.2], [-9.0, 91.3], [-20.2, 59.0], [-10.4, 19.5], [-6.0, 29.2], [-1.1, 40.0], [-0.2, 69.0],
[-0.3, 45.0], [-2.4, 97.0], [1.6, 57.7], [0.2, 120.1], [1.3, 76.1], [-0.7, 69.1], [-0.8, 66.5], [27.0, 246.0],
[27.0, 199.0], [27.0, 203.0], [26.7, 109.0], [27.1, 233.0], [27.1, 178.0], [26.6, 147.0], [26.6, 256.0],
[27.0, 292.0], [27.0, 474.0], [26.8, 547.0], [26.7, 316.0], [26.9, 255.0], [24.9, 251.0], [24.9, 111.0],
[9.5, 150.0], [26.3, 135.0], [28.1, 127.0], [18.8, 85.0], [12.2, 67.0], [25.8, 309.0], [25.6, 290.0], [27.1, 200.0],
[18.0, 105.0], [27.3, 164.0], [27.9, 481.9], [7.9, 178.9], [21.8, 197.8], [25.1, 152.0], [27.1, 129.8],
[20.7, 85.0], [23.6, 283.5], [21.0, 195.7], [25.8, 276.1], [6.0, 92.6], [25.9, 315.0], [25.8, 368.4], [27.5, 83.8],
[13.2, 42.6], [16.3, 106.5], [25.9, 0.5], [19.8, 47.0], [19.5, 120.6], [19.8, 58.7], [20.4, 41.0], [19.1, 60.2],
[19.4, 48.0], [17.6, 118.0], [16.8, 54.4], [17.6, 34.0], [18.3, 17.0], [15.3, 254.0], [17.0, 18.0], [15.8, 66.0],
[15.1, 39.4], [17.7, 41.0], [15.6, 278.5], [16.2, 55.4], [14.5, 119.0], [13.4, 134.4], [10.5, 130.0], [10.1, 116.0],
[16.6, 38.0], [22.3, 116.0], [22.0, 112.2], [10.5, 36.0], [9.7, 64.0], [7.5, 77.0], [8.4, 96.8], [7.6, 93.1],
[5.8, 55.0], [11.7, 188.8], [5.9, 111.8], [8.8, 96.1], [10.2, 145.6], [11.4, 152.2], [8.1, 103.5]
]
And this is the algorithm:
from kmeans_data import dataset
from math import sqrt
import random
class KMeansClustering:
k = 5
dataset = []
cluster_data = []
cluster_centers = []
def __init__(self, dataset, k=5):
"""Initialise the algorithm with a dataset and the k value"""
self.k = k
self.dataset = dataset
# k random points from the data set have to be chosen as initial centers
indexes = random.sample(range(len(self.dataset)), k)
self.cluster_centers = [dataset[idx] for idx in indexes]
def cluster(self):
"""Put each data point into the nearest of the k clusters"""
self.cluster_data = [[] for _ in range(self.k)]
for item in self.dataset:
min_center = 0
min_distance = self.get_distance(item, self.cluster_centers[min_center])
for idx in range(1, self.k):
distance = self.get_distance(item, self.cluster_centers[idx])
if distance < min_distance:
min_center = idx
min_distance = distance
self.cluster_data[min_center].append(item)
def set_new_means(self):
"""Calculate new cluster means based on the current cluster data."""
for idx in range(self.k):
count = len(self.cluster_data[idx])
x_sum = sum([item[0] for item in self.cluster_data[idx]])
y_sum = sum([item[1] for item in self.cluster_data[idx]])
self.cluster_centers[idx] = [x_sum / count, y_sum / count]
def get_distance(self, item1, item2):
"""Returns the distance between to data points."""
dx = item1[0] - item2[0]
dy = item1[1] - item2[1]
return sqrt(dx**2 + dy**2)
# Now lets use our algorithm to iteratively find better-fitting clusters
MAX_EPOCHS = 20
K = 5
# Initialise the algorithm and create the first clusters (with random centers)
kmeans = KMeansClustering(dataset, K)
kmeans.cluster()
# Now continually improve the clustering
for n in range(MAX_EPOCHS+1):
print('--------------')
print(f'Epoch {n} / {MAX_EPOCHS}, Cluster centers: {kmeans.cluster_centers}')
print(f'Distribution of cluster data points: {[len(cluster) for cluster in kmeans.cluster_data]}')
kmeans.set_new_means()
# TODO: here we should insert a check if the new cluster centers are the same as the old ones
# in which case we reached a stable result and can stop further calculations
kmeans.cluster()
# print the final result
print()
print('Result:')
print('=======')
print('centers:', kmeans.cluster_centers)
print('clusters:')
for cluster in kmeans.cluster_data:
print(cluster)
# TODO: here we could use matplotlib to visualise the final result