
What is k-means clustering? | Machine Learning


As previously mentioned, many clustering algorithms don't scale to the datasets used in machine learning, which often have millions of examples. For example, agglomerative or divisive hierarchical clustering algorithms look at all pairs of points and have complexities of \(O(n^2 \log(n))\) and \(O(n^2)\), respectively.

This course focuses on k-means because it scales as \(O(nk)\), where \(k\) is the number of clusters chosen by the user. This algorithm groups points into \(k\) clusters by minimizing the distances between each point and its cluster's centroid (see Figure 1).
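
In symbols (a standard formulation, not spelled out on this page): given points \(x_1, \dots, x_n\) and centroids \(\mu_1, \dots, \mu_k\), k-means seeks centroid positions that minimize the within-cluster sum of squared distances,

\[ \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - \mu_j \rVert^2 . \]

Each iteration of the algorithm below either decreases this objective or leaves it unchanged, which is why the procedure converges.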

As a result, k-means effectively treats data as composed of a number of roughly circular distributions, and tries to find clusters corresponding to these distributions. But real-world data contains outliers and density-based clusters and might not match the assumptions underlying k-means.
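
As a quick illustration of that limitation, the sketch below (assuming scikit-learn is installed; it is not part of this course) builds a two-moons dataset, whose clusters are defined by density rather than by roughly circular blobs, and shows that k-means with \(k = 2\) tends to split each crescent rather than separate them:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_moons

    # Two interleaved crescents: density-based clusters, not circular blobs.
    X, true_labels = make_moons(n_samples=500, noise=0.05, random_state=0)

    # k-means with k=2 looks for two roughly circular clusters.
    pred_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # Fraction of points whose k-means label disagrees with the true crescent
    # (allowing for a label swap) -- typically well above zero here.
    mismatch = min(np.mean(pred_labels != true_labels),
                   np.mean(pred_labels != 1 - true_labels))
    print(f"Misclustered fraction: {mismatch:.2f}")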

k-means clustering algorithm

The algorithm follows these steps (a minimal code sketch follows the list):

  1. Provide an initial guess for \(k\), which can be revised later. For this example, we choose \(k = 3\).

  2. Randomly choose \(k\) centroids.

    Figure 1: k-means at initialization.
  3. Assign each point to the nearest centroid to get \(k\) initial clusters.

    Figure 2: Initial clusters.
  4. For each cluster, calculate a new centroid by taking the mean position of all points in the cluster. The arrows in Figure 3 show the change in centroid positions.

    Figure 3: Recomputed centroids.
  5. Reassign each point to the nearest new centroid.

    Figure 4: Clusters after reassignment.
  6. Repeat steps 4 and 5, recalculating centroids and cluster membership, until points no longer change clusters. For large datasets, you can stop the algorithm before convergence using another criterion, such as a maximum number of iterations.
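
The following is a minimal NumPy sketch of steps 2 through 6. The function name kmeans, the seed parameter, and the exact convergence check are illustrative choices, not part of the course:

    import numpy as np

    def kmeans(points, k, max_iters=100, seed=0):
        """Minimal k-means sketch: returns (centroids, labels)."""
        rng = np.random.default_rng(seed)
        # Step 2: randomly choose k of the points as the initial centroids.
        centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
        labels = None
        for _ in range(max_iters):
            # Steps 3 and 5: assign each point to its nearest centroid.
            distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = distances.argmin(axis=1)
            # Step 6: stop once no point changes clusters.
            if labels is not None and np.array_equal(new_labels, labels):
                break
            labels = new_labels
            # Step 4: recompute each centroid as the mean of its cluster's points.
            for j in range(k):
                members = points[labels == j]
                if len(members) > 0:
                    centroids[j] = members.mean(axis=0)
        return centroids, labels

For example, kmeans(data, k=3) on a float array of shape (num_points, num_features) returns the final centroid positions and each point's cluster index.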

Because the initial centroid positions are chosen at random, k-means can return significantly different results on successive runs. To address this, run k-means multiple times and choose the result with the best quality metrics. (We'll describe quality metrics later in this course.) Choosing better initial centroid positions requires a more advanced version of k-means.
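
As a sketch of that "run it several times" advice, the hypothetical helper below reuses the kmeans function from the previous example and keeps the run with the lowest within-cluster sum of squared distances, one reasonable stand-in for the quality metrics described later in the course:

    import numpy as np

    def best_kmeans(points, k, n_runs=10):
        """Run kmeans (from the previous sketch) n_runs times; keep the best run."""
        best_result, best_cost = None, np.inf
        for seed in range(n_runs):
            centroids, labels = kmeans(points, k, seed=seed)
            # Quality here = within-cluster sum of squared distances (lower is better).
            cost = ((points - centroids[labels]) ** 2).sum()
            if cost < best_cost:
                best_result, best_cost = (centroids, labels), cost
        return best_result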

A deep understanding of the math isn't necessary, but for those who are curious: k-means is a special case of the expectation-maximization (EM) algorithm. See the lecture notes on the topic from UPenn.
