GitHub - AbdalkarimA/iClusterVB

iClusterVB allows for fast integrative clustering and feature selection for high dimensional data.

Using a variational Bayes approach, its key features address the limitations of traditional clustering methods: clustering of mixed-type data, automated determination of the number of clusters, and feature selection in high-dimensional settings. It also offers a potentially faster alternative to MCMC algorithms, making iClusterVB a valuable tool for contemporary data analysis.

You can install iClusterVB from CRAN with:

install.packages("iClusterVB")

You can install the development version of iClusterVB from GitHub with:

# install.packages("devtools")
devtools::install_github("AbdalkarimA/iClusterVB")
iClusterVB - The Main Function

Mandatory arguments

Optional arguments

We will demonstrate the clustering and feature selection performance of iClusterVB using a simulated dataset comprising $N = 240$ individuals and $R = 3$ data views with different data types. Two views were continuous and one was count – a setup commonly found in genomics data, where gene or mRNA expression (continuous) and DNA copy number (count) are observed. The true number of clusters ( $K$ ) was set to 4, with balanced cluster proportions ( $\pi_1 = 0.25, \pi_2 = 0.25, \pi_3 = 0.25, \pi_4 = 0.25$ ). Each data view consisted of $p_r = 500$ features ( $r = 1, \dots, 3$ ), totaling $p = \sum_{r=1}^3 p_r = 1500$ features across all views. Within each view, only 50 features (10%) were relevant for clustering, while the remaining features were noise. The relevant features were distributed across clusters as described in the table below:

| Data View | Cluster | Distribution |
|-----------|---------|--------------|
| 1 (Continuous) | Cluster 1 | $\mathcal{N}(10, 1)$ (Relevant) |
| 1 (Continuous) | Cluster 2 | $\mathcal{N}(5, 1)$ (Relevant) |
| 1 (Continuous) | Cluster 3 | $\mathcal{N}(-5, 1)$ (Relevant) |
| 1 (Continuous) | Cluster 4 | $\mathcal{N}(-10, 1)$ (Relevant) |
| 1 (Continuous) | All clusters | $\mathcal{N}(0, 1)$ (Noise) |
| 2 (Continuous) | Cluster 1 | $\mathcal{N}(-10, 1)$ (Relevant) |
| 2 (Continuous) | Cluster 2 | $\mathcal{N}(-5, 1)$ (Relevant) |
| 2 (Continuous) | Cluster 3 | $\mathcal{N}(5, 1)$ (Relevant) |
| 2 (Continuous) | Cluster 4 | $\mathcal{N}(10, 1)$ (Relevant) |
| 2 (Continuous) | All clusters | $\mathcal{N}(0, 1)$ (Noise) |
| 3 (Count) | Cluster 1 | $\text{Poisson}(50)$ (Relevant) |
| 3 (Count) | Cluster 2 | $\text{Poisson}(35)$ (Relevant) |
| 3 (Count) | Cluster 3 | $\text{Poisson}(20)$ (Relevant) |
| 3 (Count) | Cluster 4 | $\text{Poisson}(10)$ (Relevant) |
| 3 (Count) | All clusters | $\text{Poisson}(2)$ (Noise) |

Distribution of relevant and noise features across clusters in each data view
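For illustration only, data with this structure could be generated along the following lines. This is a sketch, not the code used to create the bundled `sim_data` object; all names (`view1`, `view3`, etc.) are illustrative:

```r
set.seed(1)
N <- 240; K <- 4; p <- 500; n_rel <- 50
clust <- rep(1:K, each = N / K)  # balanced true memberships

# View 1: Gaussian. Start with N(0, 1) noise everywhere, then shift
# the first 50 (relevant) features by a cluster-specific mean.
means1 <- c(10, 5, -5, -10)
view1 <- matrix(rnorm(N * p), N, p)
view1[, 1:n_rel] <- view1[, 1:n_rel] + means1[clust]

# View 3: Poisson counts. Noise features have rate 2; relevant
# features have a cluster-specific rate.
rates <- c(50, 35, 20, 10)
view3 <- matrix(rpois(N * p, lambda = 2), N, p)
view3[, 1:n_rel] <- rpois(N * n_rel, lambda = rates[clust])
```

View 2 follows the same pattern as view 1 with the cluster means reversed.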

The simulated dataset is included as a list in the package.

library(iClusterVB)

# Input data must be a list

dat1 <- list(gauss_1 = sim_data$continuous1_data,
             gauss_2 = sim_data$continuous2_data,
             multinomial_1 = sim_data$binary_data)

dist <- c("gaussian", "gaussian",
          "multinomial")
set.seed(123)
fit_iClusterVB <- iClusterVB(
  mydata = dat1,
  dist = dist,
  K = 8,
  initial_method = "VarSelLCM",
  VS_method = 1, # Variable Selection is on
  max_iter = 100,
  per = 100
)
#> ------------------------------------------------------------
#> Pre-processing and initializing the model
#> ------------------------------------------------------------
#> ------------------------------------------------------------
#> Running the CAVI algorithm
#> ------------------------------------------------------------
#> iteration = 100 elbo = -21293757.232508
Comparing to True Cluster Membership
table(fit_iClusterVB$cluster, sim_data$cluster_true)
#>    
#>      1  2  3  4
#>   4  0  0 60  0
#>   5  0 60  0  0
#>   6  0  0  0 60
#>   8 60  0  0  0
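The table above shows that each estimated cluster maps one-to-one onto a true cluster; only the labels differ. A label-invariant summary of this agreement is the adjusted Rand index, available for example from the mclust package (not an iClusterVB dependency):

```r
# install.packages("mclust")
mclust::adjustedRandIndex(fit_iClusterVB$cluster, sim_data$cluster_true)
#> [1] 1
```

An index of 1 indicates perfect recovery of the true partition up to label switching.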
# We can obtain a summary using summary()
summary(fit_iClusterVB)
#> Total number of individuals:
#> [1] 240
#> 
#> User-inputted maximum number of clusters: 8
#> Number of clusters determined by algorithm: 4
#> 
#> Cluster Membership:
#>  4  5  6  8 
#> 60 60 60 60 
#> 
#> # of variables above the posterior inclusion probability of 0.5 for View 1 - gaussian
#> [1] "54 out of a total of 500"
#> 
#> # of variables above the posterior inclusion probability of 0.5 for View 2 - gaussian
#> [1] "59 out of a total of 500"
#> 
#> # of variables above the posterior inclusion probability of 0.5 for View 3 - multinomial
#> [1] "500 out of a total of 500"

Probability of Inclusion Plots
# The `piplot` function can be used to visualize the probability of inclusion

piplot(fit_iClusterVB)

Heat maps to visualize the clusters
# The `chmap` function can be used to display heat maps for each data view

list_of_plots <- chmap(fit_iClusterVB, rho = 0,
      cols = c("green", "blue",
               "purple", "red"),
      scale = "none")
# The `grid.arrange` function from gridExtra can be used to display all the 
# plots together
gridExtra::grid.arrange(grobs = list_of_plots, ncol = 2, nrow = 2)

