# K-Nearest-Neighbours and K-Means Clustering In this session we will explore two more rather simple but versatile machine learning algorithms: K-Nearest-Neighbours classification (in short: KNN) and K-Means clustering. We will work analog to what Noack & Sanner (2023) (listed in the [](./references.md)) present in their chapters 8 & 9 on "K-nächste-Nachbarn" and "K-means-Clustering". Their interactive Javascript-based examples can be found on https://www.maschinennah.de/ki-buch/ We will now create simple Python versions of these algorithms. ## KNN (K-nearest-neighbour) classification To make this work, we need two files in the same folder, and then run `python knn.py`: * the dataset: [](snippets/knn_data.py) * the algorithm: [](snippets/knn.py) Here is the dataset file: ```python # adopted from dataset of example in chapter 8 of "Künstliche Intelligenz verstehen" # https://www.maschinennah.de/ki-buch/ dataset = [ {'cuteness': 0.21, 'fluffiness': 0.91, 'type': 'tarantula'}, {'cuteness': 0.27, 'fluffiness': 0.80, 'type': 'tarantula'}, {'cuteness': 0.15, 'fluffiness': 0.75, 'type': 'tarantula'}, {'cuteness': 0.37, 'fluffiness': 0.87, 'type': 'tarantula'}, {'cuteness': 0.29, 'fluffiness': 0.70, 'type': 'tarantula'}, {'cuteness': 0.90, 'fluffiness': 0.90, 'type': 'bunny'}, {'cuteness': 0.86, 'fluffiness': 0.80, 'type': 'bunny'}, {'cuteness': 0.75, 'fluffiness': 0.84, 'type': 'bunny'}, {'cuteness': 0.95, 'fluffiness': 0.75, 'type': 'bunny'}, {'cuteness': 0.70, 'fluffiness': 0.65, 'type': 'bunny'}, {'cuteness': 0.31, 'fluffiness': 0.22, 'type': 'shark'}, {'cuteness': 0.14, 'fluffiness': 0.13, 'type': 'shark'}, {'cuteness': 0.21, 'fluffiness': 0.06, 'type': 'shark'}, {'cuteness': 0.11, 'fluffiness': 0.25, 'type': 'shark'}, {'cuteness': 0.33, 'fluffiness': 0.11, 'type': 'shark'}, {'cuteness': 0.90, 'fluffiness': 0.10, 'type': 'hedgehog'}, {'cuteness': 0.80, 'fluffiness': 0.17, 'type': 'hedgehog'}, {'cuteness': 0.70, 'fluffiness': 0.11, 'type': 'hedgehog'}, {'cuteness': 0.75, 'fluffiness': 0.26, 'type': 'hedgehog'}, {'cuteness': 0.92, 'fluffiness': 0.22, 'type': 'hedgehog'} ] ``` And this is the algorithm: ```python from knn_data import dataset from math import sqrt def get_distance(item1, item2): """Returns the distance between to data points. The items have to be dictionaries containing values for cuteness and fluffiness. """ dx = item1['cuteness'] - item2['cuteness'] dy = item1['fluffiness'] - item2['fluffiness'] return sqrt(dx**2 + dy**2) def get_neighbours(item, k, dataset): """Returns the k nearest neighbours of item in dataset. :param item: The item to find the nearest neighbours for :param k: The number of nearest neighbours to find :param dataset: The dataset to find neighbours in :return: a list with the k nearest neighbours """ all_diffs = [(get_distance(item, d), d) for d in dataset] sorted_diffs = sorted(all_diffs, key=lambda x: x[0]) sorted_dataset = [x[1] for x in sorted_diffs] return sorted_dataset[0:k] def classify(item, k, dataset): """Returns the classification of an unknown item. :param item: The item to classify :param k: The number of neighbours used to classify the item :param dataset: The dataset used to classify the item :return: Either the classified type of the item or "inconclusive" """ neighbours = get_neighbours(item, k, dataset) rank = {} for neighbour in neighbours: if neighbour['type'] not in rank: rank[neighbour['type']] = 1 else: rank[neighbour['type']] += 1 ranked = sorted(rank.items(), key=lambda x: x[1], reverse=True) if len(ranked) == 1 or ranked[0][1] > ranked[1][1]: return ranked[0][0] else: return 'inconclusive' # now let's define some new items, which we don't know the type of yet new_items = [ {'cuteness': 0, 'fluffiness': 0}, {'cuteness': 1, 'fluffiness': 0}, {'cuteness': 0, 'fluffiness': 1}, {'cuteness': 1, 'fluffiness': 1}, {'cuteness': 0.5, 'fluffiness': 0.5}, {'cuteness': 0.5, 'fluffiness': 0.46}, {'cuteness': 0.61, 'fluffiness': 0.5}, ] # and for each item print the prediction based on the dataset for item in new_items: print(classify(item, 5, dataset)) ``` ## K-means clustering To make this work, we need two files in the same folder, and then run `python kmeans.py`: * the dataset: [](snippets/kmeans_data.py) * the algorithm: [](snippets/kmeans.py) Here is the dataset file: ```python # adopted from dataset of example in chapter 9 of "Künstliche Intelligenz verstehen" # https://www.maschinennah.de/ki-buch/ # every data point represents a temperature (in °C) and an associated precipitation amount (in mm/month) dataset = [ [17.6, 0.0], [20.2, 0.0], [19.0, 0.0], [19.9, 0.0], [19.1, 0.0], [18.8, 0.0], [20.3, 0.8], [20.5, 1.0], [21.0, 0.1], [20.3, 0.9], [20.9, 0.1], [2.0, 31.0], [1.3, 59.0], [-9.4, 0.0], [-3.6, 21.0], [1.7, 83.0], [2.2, 68.0], [-0.1, 4.0], [1.5, 42.0], [1.9, 96.0], [2.1, 72.0], [1.1, 46.0], [0.8, 30.0], [2.0, 18.0], [-21.8, 0.6], [-18.5, 1.0], [-7.5, 0.0], [-1.1, 2.0], [-0.7, 1.0], [-9.0, 13.9], [-0.2, 1.0], [-0.3, 0.5], [-6.3, 24.7], [-3.1, 11.0], [-31.9, 3.0], [-1.0, 10.0], [-31.7, 1.3], [-35.8, 0.0], [-11.9, 6.1], [-12.5, 8.2], [-14.3, 6.8], [-41.8, 0.0], [-6.0, 19.2], [-9.0, 91.3], [-20.2, 59.0], [-10.4, 19.5], [-6.0, 29.2], [-1.1, 40.0], [-0.2, 69.0], [-0.3, 45.0], [-2.4, 97.0], [1.6, 57.7], [0.2, 120.1], [1.3, 76.1], [-0.7, 69.1], [-0.8, 66.5], [27.0, 246.0], [27.0, 199.0], [27.0, 203.0], [26.7, 109.0], [27.1, 233.0], [27.1, 178.0], [26.6, 147.0], [26.6, 256.0], [27.0, 292.0], [27.0, 474.0], [26.8, 547.0], [26.7, 316.0], [26.9, 255.0], [24.9, 251.0], [24.9, 111.0], [9.5, 150.0], [26.3, 135.0], [28.1, 127.0], [18.8, 85.0], [12.2, 67.0], [25.8, 309.0], [25.6, 290.0], [27.1, 200.0], [18.0, 105.0], [27.3, 164.0], [27.9, 481.9], [7.9, 178.9], [21.8, 197.8], [25.1, 152.0], [27.1, 129.8], [20.7, 85.0], [23.6, 283.5], [21.0, 195.7], [25.8, 276.1], [6.0, 92.6], [25.9, 315.0], [25.8, 368.4], [27.5, 83.8], [13.2, 42.6], [16.3, 106.5], [25.9, 0.5], [19.8, 47.0], [19.5, 120.6], [19.8, 58.7], [20.4, 41.0], [19.1, 60.2], [19.4, 48.0], [17.6, 118.0], [16.8, 54.4], [17.6, 34.0], [18.3, 17.0], [15.3, 254.0], [17.0, 18.0], [15.8, 66.0], [15.1, 39.4], [17.7, 41.0], [15.6, 278.5], [16.2, 55.4], [14.5, 119.0], [13.4, 134.4], [10.5, 130.0], [10.1, 116.0], [16.6, 38.0], [22.3, 116.0], [22.0, 112.2], [10.5, 36.0], [9.7, 64.0], [7.5, 77.0], [8.4, 96.8], [7.6, 93.1], [5.8, 55.0], [11.7, 188.8], [5.9, 111.8], [8.8, 96.1], [10.2, 145.6], [11.4, 152.2], [8.1, 103.5] ] ``` And this is the algorithm: ```python from kmeans_data import dataset from math import sqrt import random class KMeansClustering: k = 5 dataset = [] cluster_data = [] cluster_centers = [] def __init__(self, dataset, k=5): """Initialise the algorithm with a dataset and the k value""" self.k = k self.dataset = dataset # k random points from the data set have to be chosen as initial centers indexes = random.sample(range(len(self.dataset)), k) self.cluster_centers = [dataset[idx] for idx in indexes] def cluster(self): """Put each data point into the nearest of the k clusters""" self.cluster_data = [[] for _ in range(self.k)] for item in self.dataset: min_center = 0 min_distance = self.get_distance(item, self.cluster_centers[min_center]) for idx in range(1, self.k): distance = self.get_distance(item, self.cluster_centers[idx]) if distance < min_distance: min_center = idx min_distance = distance self.cluster_data[min_center].append(item) def set_new_means(self): """Calculate new cluster means based on the current cluster data.""" for idx in range(self.k): count = len(self.cluster_data[idx]) x_sum = sum([item[0] for item in self.cluster_data[idx]]) y_sum = sum([item[1] for item in self.cluster_data[idx]]) self.cluster_centers[idx] = [x_sum / count, y_sum / count] def get_distance(self, item1, item2): """Returns the distance between to data points.""" dx = item1[0] - item2[0] dy = item1[1] - item2[1] return sqrt(dx**2 + dy**2) # Now lets use our algorithm to iteratively find better-fitting clusters MAX_EPOCHS = 20 K = 5 # Initialise the algorithm and create the first clusters (with random centers) kmeans = KMeansClustering(dataset, K) kmeans.cluster() # Now continually improve the clustering for n in range(MAX_EPOCHS+1): print('--------------') print(f'Epoch {n} / {MAX_EPOCHS}, Cluster centers: {kmeans.cluster_centers}') print(f'Distribution of cluster data points: {[len(cluster) for cluster in kmeans.cluster_data]}') kmeans.set_new_means() # TODO: here we should insert a check if the new cluster centers are the same as the old ones # in which case we reached a stable result and can stop further calculations kmeans.cluster() # print the final result print() print('Result:') print('=======') print('centers:', kmeans.cluster_centers) print('clusters:') for cluster in kmeans.cluster_data: print(cluster) # TODO: here we could use matplotlib to visualise the final result ```