Machine learning datasets can have millions of examples, but not all clustering algorithms scale efficiently. Many clustering algorithms compute the similarity between all pairs of examples, which means their runtime increases as the square of the number of examples \(n\), denoted as \(O(n^2)\) in complexity notation. \(O(n^2)\) algorithms are not practical for datasets with millions of examples.
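To make the quadratic scaling concrete, here is a minimal sketch (not from the course; the data and function name are illustrative) that computes the distance between every pair of examples with NumPy:

```python
import numpy as np

def pairwise_distances(examples: np.ndarray) -> np.ndarray:
    """Compute the distance between every pair of examples.

    The nested loop visits n * (n - 1) / 2 pairs, so runtime grows
    quadratically with the number of examples n.
    """
    n = len(examples)
    distances = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(examples[i] - examples[j])
            distances[i, j] = distances[j, i] = d
    return distances

# 1,000 examples -> ~500,000 pairs; 1,000,000 examples -> ~5e11 pairs.
examples = np.random.rand(1_000, 2)
print(pairwise_distances(examples).shape)  # (1000, 1000)
```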
The k-means algorithm has a complexity of \(O(n)\), meaning that the algorithm scales linearly with \(n\). This algorithm will be the focus of this course.
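As a rough illustration of that linear scaling (a sketch assuming scikit-learn, which the course itself doesn't prescribe): each k-means iteration assigns every example to its nearest centroid exactly once, so the work per iteration is proportional to n.

```python
import numpy as np
from sklearn.cluster import KMeans

# Each iteration assigns all n examples to the nearest of k centroids,
# then recomputes the centroids: O(n * k) work per iteration.
examples = np.random.rand(100_000, 2)  # illustrative random data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(examples)
print(kmeans.cluster_centers_)   # the k learned centroids
print(kmeans.labels_[:10])       # cluster assignment per example
```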
Types of clustering

For an exhaustive list of approaches to clustering, see Xu, D. & Tian, Y., "A Comprehensive Survey of Clustering Algorithms," Ann. Data Sci. 2, 165 (2015). Each approach is best suited to a particular data distribution. This course briefly discusses four common approaches.
Centroid-based clustering

The centroid of a cluster is the arithmetic mean of all the points in the cluster. Centroid-based clustering organizes the data into non-hierarchical clusters. These algorithms are efficient but sensitive to initial conditions and to outliers. Of them, k-means is the most widely used; it requires users to define the number of centroids, k, and works well with clusters of roughly equal size.
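As a small illustration of the definition above (a sketch with made-up points), the centroid is just the coordinate-wise mean of a cluster's members:

```python
import numpy as np

# Hypothetical 2-D points assigned to one cluster.
cluster_points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])

# The centroid is the arithmetic mean of the points, taken per coordinate.
centroid = cluster_points.mean(axis=0)
print(centroid)  # [3. 2.]
```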
Figure 1: Example of centroid-based clustering.

Density-based clustering

Density-based clustering connects contiguous areas of high example density into clusters, which allows it to discover any number of clusters of any shape. Outliers are not assigned to clusters. These algorithms have difficulty with clusters of differing densities and with high-dimensional data.
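DBSCAN is a common density-based algorithm; a minimal sketch using scikit-learn (an assumption, not the course's own code, with illustrative data) shows how it discovers clusters and marks outliers:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one far-away outlier (illustrative data).
rng = np.random.default_rng(0)
examples = np.concatenate([
    rng.normal(loc=0.0, scale=0.2, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.2, size=(50, 2)),
    [[20.0, 20.0]],
])

# eps sets the neighborhood radius; min_samples sets the density threshold.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(examples)
print(set(labels))  # e.g. {0, 1, -1}; -1 marks the outlier, left unclustered
```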
Figure 2: Example of density-based clustering.

Distribution-based clustering

This clustering approach assumes that the data is drawn from probabilistic distributions, such as Gaussian distributions. In Figure 3, a distribution-based algorithm clusters the data into three Gaussian distributions. As distance from a distribution's center increases, the probability that a point belongs to that distribution decreases; the bands in the figure show this decrease. If you aren't comfortable assuming a particular underlying distribution for your data, use a different approach.
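A Gaussian mixture model is the usual distribution-based example. This sketch (assuming scikit-learn, with made-up data) fits three Gaussians and reports each point's membership probabilities, which fall off with distance from a component's center:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Three illustrative Gaussian blobs.
rng = np.random.default_rng(0)
examples = np.concatenate([
    rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0.0, 4.0, 8.0)
])

gmm = GaussianMixture(n_components=3, random_state=0).fit(examples)
probs = gmm.predict_proba(examples[:3])  # membership probability per component
print(probs.round(3))
```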
Figure 3: Example of distribution-based clustering.

Hierarchical clustering

Hierarchical clustering creates a tree of clusters. Unsurprisingly, it is well suited to hierarchical data, such as taxonomies; see Comparison of 61 Sequenced Escherichia coli Genomes by Oksana Lukjancenko, Trudy Wassenaar & Dave Ussery for an example. Any number of clusters can be chosen by cutting the tree at the right level.
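A minimal sketch of this idea using SciPy's agglomerative tools (an assumption; the paper above works with genomic data, while the points here are made up): build the tree with linkage, then cut it with fcluster to get a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative points forming two obvious groups.
examples = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])

# Build the cluster tree bottom-up by repeatedly merging nearest clusters.
tree = linkage(examples, method="ward")

# Cut the tree to obtain exactly two clusters.
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]
```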
Figure 4: Example of a hierarchical tree clustering animals.