Showing content from https://sergio-gomez.github.io/mdendro/ below:

A quick introduction to mdendro

Basics

Let us start by using the linkage function to calculate the complete linkage agglomerative hierarchical clustering (AHC) of the UScitiesD dataset, a matrix of distances between a few US cities:

library(mdendro)
lnk <- linkage(UScitiesD, method = "complete")

Now we can plot the resulting dendrogram:

plot(lnk)

The summary of this dendrogram is:

summary(lnk)

## Call:
## linkage(prox = UScitiesD,
##         type.prox = "distance",
##         digits = 0,
##         method = "complete",
##         group = "variable")
## 
## Number of objects: 10 
## 
## Binary dendrogram: TRUE
## 
## Descriptive measures:
##       cor       sdr        ac        cc        tb 
## 0.8077859 1.0000000 0.7738478 0.3055556 0.9316262

In particular, you can recognize the calculated descriptors:

cor: cophenetic correlation coefficient
sdr: space distortion ratio
ac: agglomerative coefficient
cc: chaining coefficient
tb: tree balance
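
As a rough cross-check of the first descriptor, the cophenetic correlation coefficient can be recomputed with base R alone, as the Pearson correlation between the original distances and the cophenetic distances of the tree. This is only a sketch: it assumes that hclust builds the same complete linkage tree as linkage for this dataset (the comparison with other packages in this document supports that), and the variable names are illustrative.

```r
# Complete linkage tree built with base R (stats), not with mdendro
hcl <- hclust(UScitiesD, method = "complete")

# Cophenetic distance: the height at which two cities are first merged
coph <- cophenetic(hcl)

# Pearson correlation between original and cophenetic distances
cor_manual <- cor(as.vector(UScitiesD), as.vector(coph))
print(cor_manual)
```

The printed value should agree with the cor entry of the summary shown above.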

It is possible to work with similarity data without having to convert them to distances, provided they are in the range [0.0, 1.0]. A typical example is a matrix of non-negative correlations:

sim <- as.dist(Harman23.cor$cov)
lnk <- linkage(sim, type.prox = "sim")
plot(lnk, main = "Harman23")
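
Before passing similarities to linkage, it may be worth confirming that they really lie in [0.0, 1.0]. The following sketch uses base R only. The helper in_unit_range is a hypothetical name introduced here for illustration, not a function of mdendro.

```r
# Off-diagonal correlations as a "dist"-like object, as in the text
sim <- as.dist(Harman23.cor$cov)

# Hypothetical helper: TRUE when every value lies in [0, 1]
in_unit_range <- function(x) {
  x <- as.vector(x)
  all(x >= 0 & x <= 1)
}

print(range(as.vector(sim)))
print(in_unit_range(sim))
```

A correlation matrix with negative entries would fail this check and would need to be transformed first.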

There is also the option to choose between unweighted (default) and weighted methods:

par(mfrow = c(1, 2))
cars <- round(dist(scale(mtcars)), digits = 3)
nodePar <- list(cex = 0, lab.cex = 0.4)

lnk1 <- linkage(cars, method = "arithmetic")
plot(lnk1, main = "unweighted", nodePar = nodePar)

lnk2 <- linkage(cars, method = "arithmetic", weighted = TRUE)
plot(lnk2, main = "weighted", nodePar = nodePar)

When there are tied minimum distances in the agglomeration process, you may either ignore them and proceed by choosing a random pair (pair-group methods) or agglomerate them all at once (variable-group multidendrograms). The linkage function supports both approaches, with multidendrograms being the default:

par(mfrow = c(1, 2))
cars <- round(dist(scale(mtcars)), digits = 1)
nodePar <- list(cex = 0, lab.cex = 0.4)

lnk1 <- linkage(cars, method = "complete")
plot(lnk1, main = "multidendrogram", nodePar = nodePar)

lnk2 <- linkage(cars, method = "complete", group = "pair")
plot(lnk2, main = "pair-group", nodePar = nodePar)
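
To see why rounding to one decimal place makes ties likely in this example, you can count the distinct values among the pairwise distances. This base R sketch only inspects the initial distance matrix, so it is a proxy for the ties actually met during agglomeration, not an exact count of them.

```r
# Same distances as in the text, rounded to 1 and to 3 decimal places
d <- dist(scale(mtcars))
cars1 <- as.vector(round(d, digits = 1))
cars3 <- as.vector(round(d, digits = 3))

# Number of pairwise distances and number of distinct values among them
n_pairs <- length(cars1)
print(c(pairs = n_pairs,
        distinct_1_digit = length(unique(cars1)),
        distinct_3_digits = length(unique(cars3))))

# Number of pairs that share the smallest distance
n_min_ties <- sum(cars1 == min(cars1))
print(n_min_ties)
```

With far fewer distinct values than pairs, repeated distances are unavoidable at one decimal place.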

While multidendrograms are unique, you may obtain structurally different pair-group dendrograms by just reordering the data. As a consequence, the descriptors are invariant to permutations for multidendrograms, but not for pair-group dendrograms:

cars <- round(dist(scale(mtcars)), digits = 1)
lnk1 <- linkage(cars, method = "complete")
lnk2 <- linkage(cars, method = "complete", group = "pair")

# apply random permutation to data
set.seed(666)
ord <- sample(attr(cars, "Size"))
cars_p <- as.dist(as.matrix(cars)[ord, ord])
lnk1p <- linkage(cars_p, method = "complete")
lnk2p <- linkage(cars_p, method = "complete", group = "pair")

# compare original and permuted cophenetic correlation
c(lnk1$cor, lnk1p$cor)
## [1] 0.7782257 0.7782257
c(lnk2$cor, lnk2p$cor)
## [1] 0.7780010 0.7780994

# compare original and permuted tree balance
c(lnk1$tb, lnk1p$tb)
## [1] 0.9564568 0.9564568
c(lnk2$tb, lnk2p$tb)
## [1] 0.9472909 0.9424148
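
A quick sanity check, sketched with base R, confirms that the permutation only relabels the objects and leaves the set of distances untouched, so any change in the descriptors must come from how ties are broken. Here hclust is used as a stand-in for a pair-group method, and coph_cor is a hypothetical helper introduced for illustration.

```r
# Rounded distances between cars, as in the text
cars <- round(dist(scale(mtcars)), digits = 1)

# Random permutation of the objects
set.seed(666)
ord <- sample(attr(cars, "Size"))
cars_p <- as.dist(as.matrix(cars)[ord, ord])

# The permuted object holds exactly the same distances, only reordered
same_values <- identical(sort(as.vector(cars)), sort(as.vector(cars_p)))
print(same_values)

# Hypothetical helper: cophenetic correlation of a base R hclust tree
coph_cor <- function(d) {
  cor(as.vector(d), as.vector(cophenetic(hclust(d, method = "complete"))))
}

# No expected values are given: whether the two numbers differ depends
# on how hclust happens to break ties for each ordering
print(c(original = coph_cor(cars), permuted = coph_cor(cars_p)))
```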

In multidendrograms, the ranges (rectangles) show the heterogeneity between distances within the group, but they are optional in the plots:

par(mfrow = c(1, 2))
cars <- round(dist(scale(mtcars)), digits = 1)
nodePar <- list(cex = 0, lab.cex = 0.4)
lnk <- linkage(cars, method = "complete")
plot(lnk, col.rng = "bisque", main = "with ranges", nodePar = nodePar)
plot(lnk, col.rng = NULL, main = "without ranges", nodePar = nodePar)

Plots that include ranges are only available when you use the plot.linkage function from mdendro directly. However, you can still take advantage of other dendrogram plotting packages, such as dendextend and ape:

par(mfrow = c(1, 2))
cars <- round(dist(scale(mtcars)), digits = 1)
lnk <- linkage(cars, method = "complete")

lnk.dend <- as.dendrogram(lnk)
plot(dendextend::set(lnk.dend, "branches_k_color", k = 4),
     main = "dendextend package",
     nodePar = list(cex = 0.4, lab.cex = 0.5))

lnk.hcl <- as.hclust(lnk)
pal4 <- c("red", "forestgreen", "purple", "orange")
clu4 <- cutree(lnk.hcl, 4)
plot(ape::as.phylo(lnk.hcl),
     type = "fan",
     main = "ape package",
     tip.color = pal4[clu4],
     cex = 0.5)

You can also use the linkage function to plot heatmaps containing multidendrograms:

heatmap(scale(mtcars), hclustfun = linkage)

Linkage methods

The list of available AHC linkage methods is the following: single, complete, arithmetic, geometric, harmonic, versatile, ward, centroid and flexible. Their equivalences with the methods in other packages can be found below. The default method is arithmetic, which is also commonly known as average linkage or UPGMA.

par(mfrow = c(3, 3))
methods <- c("single", "complete", "arithmetic",
             "geometric", "harmonic", "versatile",
             "ward", "centroid", "flexible")

for (m in methods) {
  lnk <- linkage(UScitiesD, method = m)
  plot(lnk, cex = 0.6, main = m)
}

Two of the methods, versatile and flexible, depend on a parameter that takes values in (-Inf, +Inf) for versatile and in [-1.0, +1.0] for flexible. Some examples follow:

par(mfrow = c(2, 3))

vals <- c(-10.0, 0.0, 10.0)
for (v in vals) {
  lnk <- linkage(UScitiesD, method = "versatile", par.method = v)
  plot(lnk, cex = 0.6, main = sprintf("versatile (%.1f)", v))
}

vals <- c(-0.8, 0.0, 0.8)
for (v in vals) {
  lnk <- linkage(UScitiesD, method = "flexible", par.method = v)
  plot(lnk, cex = 0.6, main = sprintf("flexible (%.1f)", v))
}
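
To build some intuition for a parameter that ranges over (-Inf, +Inf), the sketch below computes a power (generalized) mean of the distances between two groups of cities. It rests on an assumption that the text here does not spell out, namely that the versatile parameter acts as the exponent of such a mean. The helper power_mean and the choice of groups are illustrative only, and the code does not call mdendro.

```r
# Hypothetical helper: power (generalized) mean of positive values x
power_mean <- function(x, p) {
  if (p == -Inf) return(min(x))
  if (p == Inf) return(max(x))
  if (p == 0) return(exp(mean(log(x))))   # geometric mean (limit p -> 0)
  mean(x^p)^(1 / p)
}

# Distances between two illustrative groups of objects
# (objects 1 and 6 versus objects 8 and 9 of UScitiesD)
m <- as.matrix(UScitiesD)
between <- as.vector(m[c(1, 6), c(8, 9)])

# The mean grows from the minimum to the maximum as p increases
ps <- c(-Inf, -10, -1, 0, 1, 10, Inf)
vals <- sapply(ps, function(p) power_mean(between, p))
print(data.frame(p = ps, mean = vals))
```

Under this assumption, the two extremes of the parameter reduce to the smallest and the largest between-group distance, and the value 1 gives the arithmetic mean.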

The descriptors depend on those parameters. The mdendro package provides two functions to explore this dependence: descval, which returns just the numerical values, and descplot, which also draws the corresponding plot. For example, using versatile linkage:

par(mfrow = c(2, 3))
measures <- c("cor", "sdr", "ac", "cc", "tb")
vals <- c(-Inf, (-20:+20), +Inf)
for (m in measures)
  descplot(UScitiesD, method = "versatile",
           measure = m, par.method = vals,
           type = "o",  main = m, col = "blue")

Similarly for the flexible method:

par(mfrow = c(2, 3))
measures <- c("cor", "sdr", "ac", "cc", "tb")
vals <- seq(from = -1, to = +1, by = 0.1)
for (m in measures)
  descplot(UScitiesD, method = "flexible",
           measure = m, par.method = vals,
           type = "o",  main = m, col = "blue")

Comparison with other packages

For comparison, the same AHC can be found using functions hclust and agnes, where the default plots just show some differences in aesthetics:

library(mdendro)
lnk <- linkage(UScitiesD, method = "complete")

library(cluster)
agn <- agnes(UScitiesD, method = "complete")

# library(stats)   # unneeded, stats included by default
hcl <- hclust(UScitiesD, method = "complete")

par(mfrow = c(1, 3))
plot(lnk)
plot(agn, which.plots = 2)
plot(hcl)

Converting to class dendrogram, we can see all three are structurally equivalent:

lnk.dend <- as.dendrogram(lnk)
agn.dend <- as.dendrogram(agn)
hcl.dend <- as.dendrogram(hcl)
identical(lnk.dend, agn.dend)
## [1] TRUE

par(mfrow = c(1, 2))
plot(lnk.dend, main = "lnk.dend = agn.dend", cex = 0.7)
plot(hcl.dend, main = "hcl.dend", cex = 0.7)

The cophenetic (ultrametric) matrix is readily available as component coph of the returned linkage object, and coincides with those obtained using the other functions:

hcl.coph <- cophenetic(hcl)
agn.coph <- cophenetic(agn)

all(lnk$coph == hcl.coph)
## [1] TRUE
all(lnk$coph == agn.coph)
## [1] TRUE
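
The term ultrametric means that the cophenetic distances satisfy d(i, j) <= max(d(i, k), d(k, j)) for every triple of objects. A minimal base R sketch of that check follows. The helper is_ultrametric is a hypothetical name introduced here, and the tree is built with hclust rather than linkage.

```r
# Hypothetical helper: TRUE when d satisfies the ultrametric inequality
# d(i, j) <= max(d(i, k), d(k, j)) for every triple of objects
is_ultrametric <- function(d, tol = 1e-9) {
  m <- as.matrix(d)
  for (k in seq_len(nrow(m))) {
    # bound[i, j] = max(m[i, k], m[k, j])
    bound <- outer(m[, k], m[k, ], pmax)
    if (any(m > bound + tol)) return(FALSE)
  }
  TRUE
}

hcl <- hclust(UScitiesD, method = "complete")
print(is_ultrametric(cophenetic(hcl)))   # cophenetic distances
print(is_ultrametric(UScitiesD))         # original distances
```

The cophenetic distances pass the check, while the original distances between cities do not.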

The coincidence also applies to the cophenetic correlation coefficient and the agglomerative coefficient, with the advantage that linkage has them already calculated:

hcl.cor <- cor(UScitiesD, hcl.coph)
all.equal(lnk$cor, hcl.cor)
## [1] TRUE

all.equal(lnk$ac, agn$ac)
## [1] TRUE

The source figure compares the computational efficiency of the three functions, both in linear scale (left panel) and in log-log scale (right panel). It shows that the time cost of functions linkage and hclust is quadratic, whereas that of function agnes is cubic.
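
A comparison of this kind could be sketched as follows, here only for hclust and with arbitrary problem sizes and random data chosen for illustration. The helper time_hclust is hypothetical. The same loop could be repeated with linkage or agnes in place of hclust, and the measured timings will vary between machines.

```r
# Hypothetical helper: elapsed seconds for hclust on n random 2-D points
time_hclust <- function(n) {
  d <- dist(matrix(rnorm(2 * n), ncol = 2))
  unname(system.time(hclust(d, method = "complete"))["elapsed"])
}

set.seed(1)
sizes <- c(250, 500, 1000, 2000)
timings <- data.frame(n = sizes, seconds = sapply(sizes, time_hclust))
print(timings)

# The slope of log(time) against log(n) estimates the exponent of the
# time cost; timings of zero (below clock resolution) are dropped first
ok <- timings$seconds > 0
if (sum(ok) >= 2) {
  fit <- lm(log(seconds) ~ log(n), data = timings[ok, ])
  print(coef(fit)[["log(n)"]])
}
```

On a log-log scale a quadratic cost appears as a slope close to 2 and a cubic cost as a slope close to 3, although small problem sizes give noisy estimates.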
