Type: Package
Encoding: UTF-8
Title: Robust Trimmed Clustering
Version: 2.1-0
VersionNote: Released 2.0-5 on 2024-10-15 on CRAN
Maintainer: Valentin Todorov <valentin.todorov@chello.at>
Description: Provides functions for robust trimmed clustering. The methods are described in Garcia-Escudero (2008) <doi:10.1214/07-AOS515>, Fritz et al. (2012) <doi:10.18637/jss.v047.i12>, Garcia-Escudero et al. (2011) <doi:10.1007/s11222-010-9194-z> and others.
Depends: R(≥ 3.6.2)
Imports: Rcpp (≥ 1.0.7), doParallel, parallel, foreach, MASS, rlang
Suggests: mclust, cluster, sn
LazyLoad: yes
License: GPL-3
LinkingTo: Rcpp, RcppArmadillo
NeedsCompilation: yes
RoxygenNote: 7.3.2
URL: https://github.com/valentint/tclust
BugReports: https://github.com/valentint/tclust/issues
Repository: CRAN
Author: Valentin Todorov ORCID iD [aut, cre], Luis Angel García Escudero [aut], Agustín Mayo Iscar [aut], Javier Crespo Guerrero [aut], Heinrich Fritz [aut]
Packaged: 2025-02-19 10:03:10 UTC; valen
Date/Publication: 2025-02-19 11:40:02 UTC

Classification Trimmed Likelihood Curves

Description

The function applies tclust several times on a given dataset while parameters alpha and k are altered. The resulting object gives an idea of the optimal trimming level and number of clusters considering a particular dataset.

Usage

ctlcurves(
  x,
  k = 1:4,
  alpha = seq(0, 0.2, len = 6),
  restr.fact = 50,
  parallel = FALSE,
  trace = 1,
  ...
)

Arguments

x

A matrix or data frame of dimension n x p, containing the observations (row-wise).

k

A vector of cluster numbers to be checked. By default cluster numbers from 1 to 5 are examined.

alpha

A vector containing the alpha levels to be checked. By default alpha levels from 0 to 0.2 (continuously increased by 0.01), are checked.

restr.fact

The restriction factor passed to tclust.

parallel

A logical value, to be passed further to tclust().

trace

Defines the tracing level, which is set to 1 by default. Tracing level 2 gives additional information on the current iteration.

...

Further arguments (as e.g. restr), passed to tclust

Details

These curves show the values of the trimmed classification (log-)likelihoods when altering the trimming proportion alpha and the number of clusters k. The careful examination of these curves provides valuable information for choosing these parameters in a clustering problem. For instance, an appropriate k to be chosen is one that we do not observe a clear increase in the trimmed classification likelihood curve for k with respect to the k+1 curve for almost all the range of alpha values. Moreover, an appropriate choice of parameter alpha may be derived by determining where an initial fast increase of the trimmed classification likelihood curve stops for the final chosen k. A more detailed explanation can be found in García-Escudero et al. (2011).

Value

The function returns an S3 object of type ctlcurves containing the following components:

References

García-Escudero, L.A.; Gordaliza, A.; Matrán, C. and Mayo-Iscar, A. (2011), "Exploring the number of groups in robust model-based clustering." Statistics and Computing, 21 pp. 585-599, <doi:10.1007/s11222-010-9194-z>

Examples




 #--- EXAMPLE 1 ------------------------------------------

 sig <- diag (2)
 cen <- rep (1, 2)
 x <- rbind(MASS::mvrnorm(108, cen * 0,   sig),
 	       MASS::mvrnorm(162, cen * 5,   sig * 6 - 2),
 	       MASS::mvrnorm(30, cen * 2.5, sig * 50))

 ctl <- ctlcurves(x, k = 1:4)
 ctl

   ##  ctl-curves 
 plot(ctl)  ##  --> selecting k = 2, alpha = 0.08

   ##  the selected model 
 plot(tclust(x, k = 2, alpha = 0.08, restr.fact = 7))

 #--- EXAMPLE 2 ------------------------------------------

 data(geyser2)
 ctl <- ctlcurves(geyser2, k = 1:5)
 ctl
 
   ##  ctl-curves 
 plot(ctl)  ##  --> selecting k = 3, alpha = 0.08

   ##  the selected model
 plot(tclust(geyser2, k = 3, alpha = 0.08, restr.fact = 5))


 #--- EXAMPLE 3 ------------------------------------------
 
 data(swissbank)
 ctl <- ctlcurves(swissbank, k = 1:5, alpha = seq (0, 0.3, by = 0.025))
 ctl
 
   ##  ctl-curves 
 plot(ctl)  ##  --> selecting k = 2, alpha = 0.1
 
   ##  the selected model
 plot(tclust(swissbank, k = 2, alpha = 0.1, restr.fact = 50))
 



Discriminant Factor analysis for tclust objects

Description

Analyzes a tclust-object by calculating discriminant factors and comparing the quality of the actual cluster assignments to that of the second best possible assignment for each observation. Cluster assignments of observations with large discriminant factors are considered "doubtful" decisions. Silhouette plots give a graphical overview of the discriminant factors distribution (see plot.DiscrFact). More details can be found in García-Escudero et al. (2011).

Usage

DiscrFact(x, threshold = 1/10)

Arguments

x

A tclust object.

threshold

A cluster assignment or a trimming decision for an observation with a discriminant factor larger than log(threshold) is considered a "doubtful" decision.

Value

The function returns an S3 object of type DiscrFact containing the following components:

References

García-Escudero, L.A.; Gordaliza, A.; Matrán, C. and Mayo-Iscar, A. (2011), "Exploring the number of groups in robust model-based clustering." Statistics and Computing, 21 pp. 585-599, <doi:10.1007/s11222-010-9194-z>


Function to perform the E-step for a Gaussian mixture distribution

Description

Compute the log PDF for each observation, the posterior probabilities and the objective function (total log-likelihood) for a Gaussian mixture distribution

Arguments

ll

Rcpp::NumericMatrix, n-by-k where n is the number of observations and k is the number of clusters.

Details

Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. Mixture models are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population-identity information. Mixture modeling approaches assume that data at hand $y_1, ..., y_n in R^p come from a probability distribution with density given by the sum of k components

\sum_{j=1}^k \pi_j \phi( \cdot, \theta_j)

with \phi( \cdot, \theta_j) being the p-variate (generally multivariate normal) densities with parameters \theta_j, j=1, \ldots, k. Generally \theta_j= (\mu_j, \Sigma_j) where \mu_j is the population mean and \Sigma_j is the covariance matrix for component j. \pi_j is the (prior) probability of component j. The objective function is obj is equal to

obj = \log \left( \prod_{i=1}^n \sum_{j=1}^k \pi_j \phi (y_i; \; \theta_j) \right)

or

obj = \sum_{i=1}^n \log \left( \sum_{j=1}^k \pi_j \phi (y_i; \; \theta_j) \right)

where k is the number of components of the mixture and \pi_j are the component probabilitites and \theta_j are the parameters of the j-th mixture component.

Value

The function returns a list with the following elements:

References

McLachlan, G.J.; Peel, D. (2000). Finite Mixture Models. Wiley. ISBN 0-471-00626-2

Examples

##      Generate two Gaussian normal distributions
##      and do not produce plots

       mu1 = c(1,2)
       sigma1 = matrix(c(2, 0, 0, .5), nrow=2, byrow=TRUE)    #[2 0; 0 .5];
       mu2 = c(-3, -5)
       sigma2 = matrix(c(1, 0, 0, 1), nrow=2, byrow=TRUE)
       n1 = 100
       n2 = 200
       Y = rbind(MASS::mvrnorm(n1, mu1, sigma1), 
                 MASS::mvrnorm(n2, mu2, sigma2))
       k = 2
       pi = c(1/3, 2/3)
       mu = rbind(mu1, mu2)
       sigma = array(0, dim=c(2,2,2))
       sigma[,,1] = sigma1
       sigma[,,2] = sigma2
       
       ll = matrix(0, nrow=n1+n2, ncol=2)
       for(j in 1:k)
           ll[,j] = log(pi[j]) +  tclust:::dmvnrm(Y, mu[j,], sigma[,,j])

       dd = tclust:::estepRR(ll)
       dd$obj
       dd$logpdf
       dd$postprob

Flea

Description

Flea-beetle measurements

Usage

data(flea)

Format

A data frame with 74 rows and 7 variables: six explanatory and one response variable - species. The variables are as follows:

References

A. A. Lubischew (1962), On the Use of Discriminant Functions in Taxonomy, Biometrics, 184 pp.455–477.

Examples

 data(flea)
 head(flea)


Old Faithful Geyser Data

Description

A bivariate data set obtained from the Old Faithful Geyser, containing the eruption length and the length of the previous eruption for 271 eruptions of this geyser in minutes.

Usage

data(geyser2)

Format

A data frame containing 272 observations in 2 variables. The variables are as follows:

Source

This particular data structure can be obtained by applying the following code to the "Old Faithful Geyser" (faithful data set (Härdle 1991) in the package datasets):
f1 <- faithful[,1]
geyser2 <- cbind(f1[-length(f1)], f1[-1])
colnames(geyser2) <- c("Eruption length",
"Previous eruption length")

References

García-Escudero, L.A. and Gordaliza, A. (1999). Robustness properties of k-means and trimmed k-means. Journal of the American Statistical Assoc., Vol.94, No.447, 956–969.

Härdle, W. (1991). Smoothing Techniques with Implementation in S., New York: Springer.


LG5data data

Description

A data set in dimension 10 with three clusters around affine subspaces of common intrinsic dimension. A 10% background noise is added uniformly distributed in a rectangle containing the three main clusters.

Usage

data(LG5data)

Format

The first 10 columns are the variables. The last column is the true classification vector where symbol "0" stands for the contaminating data points.

Examples

#--- EXAMPLE 1 ------------------------------------------ 
data (LG5data)
x <- LG5data[, 1:10]
clus <- rlg(x, d = c(2,2,2), alpha=0.1, trace=TRUE)
plot(x, col=clus$cluster+1)

M5data data

Description

A bivariate data set obtained from three normal bivariate distributions with different scales and proportions 1:2:2. One of the components is very overlapped with another one. A 10% background noise is added uniformly distributed in a rectangle containing the three normal components and not very overlapped with the three mixture components. A precise description of the M5 data set can be found in García-Escudero et al. (2008).

Usage

data(M5data)

Format

The first two columns are the two variables. The last column is the true classification vector where symbol "0" stands for the contaminating data points.

Source

García-Escudero, L.A.; Gordaliza, A.; Matrán, C. and Mayo-Iscar, A. (2008), "A General Trimming Approach to Robust Cluster Analysis". Annals of Statistics, Vol.36, pp. 1324-1345.

Examples

#--- EXAMPLE 1 ------------------------------------------ 
data (M5data) 
x <- M5data[, 1:2] 
clus <- tclust(x, k=3, alpha=0.1, nstart=200, niter1=3, niter2=17, 
   nkeep=10, opt="HARD", equal.weights=FALSE, restr.fact=50, trace=TRUE) 
plot (x, col=clus$cluster+1)

Pinus nigra dataset

Description

To study the growth of the wood mass in a cultivated forest of Pinus nigra located in the north of Palencia (Spain), a sample of 362 trees was studied. The data set is made of measurements of heights (in meters), in variable "HT", and diameters (in millimetres), in variable "Diameter", of these trees. The presence of three linear groups can be guessed apart from a small group of trees forming its own cluster with larger heights and diameters one isolated tree with the largest diameter but small height. More details on the interpretation of this dataset in García-Escudero et al (2010).

Usage

data(pine)

Format

A data frame containing 362 observations in 2 variables. The variables are as follows:

References

García-Escudero, L. A., Gordaliza, A., Mayo-Iscar, A., and San Martín, R. (2010). Robust clusterwise linear regression through trimming. Computational Statistics & Data Analysis, 54(12), 3057–3069.


The plot method for objects of class ctlcurves

Description

The plot method for class ctlcurves: This function implements a series of plots, which display characteristic values of the each model, computed with different values for k and alpha.

Usage

## S3 method for class 'ctlcurves'
plot(
  x,
  what = c("obj", "min.weights", "doubtful"),
  main,
  xlab,
  ylab,
  xlim,
  ylim,
  col,
  lty = 1,
  ...
)

Arguments

x

The ctlcurves object to be shown

what

A string indicating which type of plot shall be drawn. See the details section for more information.

main

A character-string containing the title of the plot.

xlab, ylab, xlim, ylim

Arguments passed to plot().

col

A single value or vector of line colors passed to lines.

lty

A single value or vector of line colors passed to lines.

...

Arguments to be passed to or from other methods.

Details

These curves show the values of the trimmed classification (log-)likelihoods when altering the trimming proportion alpha and the number of clusters k. The careful examination of these curves provides valuable information for choosing these parameters in a clustering problem. For instance, an appropriate k to be chosen is one that we do not observe a clear increase in the trimmed classification likelihood curve for k with respect to the k+1 curve for almost all the range of alpha values. Moreover, an appropriate choice of parameter alpha may be derived by determining where an initial fast increase of the trimmed classification likelihood curve stops for the final chosen k. A more detailed explanation can be found in García-Escudero et al. (2011).

This function implements a series of plots, which display characteristic values of the each model, computed with different values for k and alpha.

"obj"

Objective function values.

"min.weights"

The minimum cluster weight found for each computed model. This plot is intended to spot spurious clusters, which in general yield quite small weights.

"doubtful"

The number of "doubtful" decisions identified by DiscrFact.

References

García-Escudero, L.A.; Gordaliza, A.; Matrán, C. and Mayo-Iscar, A. (2011), "Exploring the number of groups in robust model-based clustering." Statistics and Computing, 21 pp. 585-599, <doi:10.1007/s11222-010-9194-z>

Examples


 #--- EXAMPLE 1 ------------------------------------------

 
 sig <- diag (2)
 cen <- rep (1, 2)
 x <- rbind(MASS::mvrnorm(108, cen * 0,   sig),
 	       MASS::mvrnorm(162, cen * 5,   sig * 6 - 2),
 	       MASS::mvrnorm(30, cen * 2.5, sig * 50))

 (ctl <- ctlcurves(x, k = 1:4))

 plot(ctl)
 


The plot method for objects of class DiscrFact

Description

The plot method for class DiscrFact: Next to a plot of the tclust object which has been used for creating the DiscrFact object, a silhouette plot indicates the presence of groups with a large amount of doubtfully assigned observations. A third plot similar to the standard tclust plot serves to highlight the identified doubtful observations.

Usage

## S3 method for class 'DiscrFact'
plot(
  x,
  enum.plots = FALSE,
  xlab = "Discriminant Factor",
  ylab = "Clusters",
  print.DiscrFact = TRUE,
  xlim,
  col.nodoubt = grey(0.8),
  by.cluster = FALSE,
  ...
)

Arguments

x

An object of class DiscrFact as returned from DiscrFact()

enum.plots

A logical value indicating whether the plots shall be enumerated in their title ("(a)", "(b)", "(c)").

xlab, ylab, xlim

Arguments passed to funcion plot.tclust()

print.DiscrFact

A logical value indicating whether each clusters mean discriminant factor shall be plotted

col.nodoubt

Color of all observations not considered as to be assigned doubtfully.

by.cluster

Logical value indicating whether optional parameters pch and col (if present) refer to observations (FALSE) or clusters (TRUE)

...

Arguments to be passed to or from other methods

Details

plot_DiscrFact_p2 displays a silhouette plot based on the discriminant factors of the observations. A solution with many large discriminant factors is not reliable. Such clusters can be identified with this silhouette plot. Thus plot_DiscrFact_p3 displays the dataset, highlighting observations with discriminant factors greater than the given threshold. The function plot.DiscrFact() combines the standard plot of a tclust object, and the two plots introduced here.

References

García-Escudero, L.A.; Gordaliza, A.; Matrán, C. and Mayo-Iscar, A. (2011), "Exploring the number of groups in robust model-based clustering." Statistics and Computing, 21 pp. 585-599, <doi:10.1007/s11222-010-9194-z>

Examples

 sig <- diag (2)
 cen <- rep (1, 2)
 x <- rbind(MASS::mvrnorm(360, cen * 0,   sig),
 	       MASS::mvrnorm(540, cen * 5,   sig * 6 - 2),
 	       MASS::mvrnorm(100, cen * 2.5, sig * 50))

 clus.1 <- tclust(x, k = 2, alpha=0.1, restr.fact=12)
 clus.2 <- tclust(x, k = 3, alpha=0.1, restr.fact=1)

 dsc.1 <- DiscrFact(clus.1)
 plot(dsc.1)

 dsc.2 <- DiscrFact(clus.2)
 plot(dsc.2)


Plot an 'rlg' object

Description

Different plots for the results of 'rlg' analysis, stored in an rlg object, see Details.

Usage

## S3 method for class 'rlg'
plot(
  x,
  which = c("all", "scores", "loadings", "eigenvalues"),
  sort = TRUE,
  ask = (which == "all" && dev.interactive(TRUE)),
  ...
)

Arguments

x

An rlg object to plot.

which

Select the required plot.

sort

Whether to sort.

ask

if TRUE, the user is asked before each plot, see par(ask=.). Default is ask = which=="all" && dev.interactive().

...

Other parameters to be passed to the lower level functions.

Examples

 data (LG5data)
 x <- LG5data[, 1:10]
 clus <- rlg(x, d = c(2,2,2), alpha=0.1)
 plot(clus, which="eigenvalues") 
 plot(clus, which="scores") 


Plot Method for tclust and tkmeans Objects

Description

One and two dimensional structures are treated separately (e.g. tolerance intervals/ellipses are displayed). Higher dimensional structures are displayed by plotting the two first Fisher's canonical coordinates (evaluated by tclust::discr_coords) and derived from the final cluster assignments (trimmed observations are not taken into account). plot.tclust.Nd can be called with one or two-dimensional tclust- or tkmeans-objects too. The function fails, if store.x = FALSE is specified in the tclust() or tkmeans() call, because the original data matrix is required here.

Usage

## S3 method for class 'tclust'
plot(x, ...)

## S3 method for class 'tkmeans'
plot(x, ...)

Arguments

x

The tclust or tkmeans object to be displayed

...

Further (optional) arguments which specify the details of the resulting plot (see section "Further Arguments").

Details

The plot method for classes tclust and tkmeans.

Further Arguments

Examples

 #--- EXAMPLE 1------------------------------
 sig <- diag (2)
 cen <- rep (1, 2)
 x <- rbind(MASS::mvrnorm(360, cen * 0,   sig),
 	       MASS::mvrnorm(540, cen * 5,   sig * 6 - 2),
 	       MASS::mvrnorm(100, cen * 2.5, sig * 50))
 # Two groups and 10\% trimming level
 a <- tclust(x, k = 2, alpha = 0.1, restr.fact = 12)
 plot (a)
 plot (a, labels = "observation")
 plot (a, labels = "cluster")
 plot (a, by.cluster = TRUE)
 #--- EXAMPLE 2------------------------------
 sig <- diag (2)
 cen <- rep (1, 2)
 x <- rbind(MASS::mvrnorm(360, cen * 0,   sig),
 	       MASS::mvrnorm(540, cen * 5,   sig),
 	       MASS::mvrnorm(100, cen * 2.5, sig))
 # Two groups and 10\% trimming level
 a <- tkmeans(x, k = 2, alpha = 0.1)
 plot (a)
 plot (a, labels = "observation")
 plot (a, labels = "cluster")
 plot (a, by.cluster = TRUE)


The plot method for objects of class tclustIC

Description

The plot method for class tclustIC: This function implements a series of plots, which display characteristic values of each model, computed with different values for k and c for a fixed alpha.

Usage

## S3 method for class 'tclustIC'
plot(x, whichIC, main, xlab, ylab, xlim, ylim, col, lty, ...)

Arguments

x

The tclustIC object to be shown

whichIC

A string indicating which information criterion will be used. See the details section for more information.

main

A character-string containing the title of the plot.

xlab, ylab, xlim, ylim

Arguments passed to plot().

col

A single value or vector of line colors passed to lines.

lty

A single value or vector of line types passed to lines.

...

Arguments to be passed to or from other methods.

References

Cerioli, A., Garcia-Escudero, L.A., Mayo-Iscar, A. and Riani M. (2017). Finding the Number of Groups in Model-Based Clustering via Constrained Likelihoods, Journal of Computational and Graphical Statistics, pp. 404-416, https://doi.org/10.1080/10618600.2017.1390469.

Examples


 
 sig <- diag (2)
 cen <- rep (1, 2)
 x <- rbind(MASS::mvrnorm(108, cen * 0,   sig),
 	       MASS::mvrnorm(162, cen * 5,   sig * 6 - 2),
 	       MASS::mvrnorm(30, cen * 2.5, sig * 50))

 (out <- tclustIC(x, whichIC="ALL"))

 plot(out)
 


Calculates Rand type Indices to compare two partitions

Description

Calculates Rand type Indices to compare two partitions

Usage

randIndex(c1, c2 = NULL, noisecluster = NULL)

Arguments

c1

labels of the first partition or contingency table. A numeric vector or factor containining the class labels of the first partition or a 2-dimensional numeric matrix which contains the cross-tabulation of cluster assignments.

c2

labels of the second partition. A numeric vector or a factor containining the class labels of the second partition. The length of the vector c2 must be equal to the length of the vector c1. The second parameter is required only if c1 is not a 2-dimensional numeric matrix.

noisecluster

label or number associated to the 'noise class' or 'noise level'. Number or character label which denotes the points which do not belong to any cluster. These points are not takern into account for the computation of the Rand type indexes. The default is to consider all points.

Value

A list with Rand type indexes:

Examples

##  1. randindex with the contingency table as input.
T <- matrix(c(1, 1, 0, 1, 2, 1, 0, 0, 4), nrow=3)
(ARI <- randIndex(T))

##  2. randindex with the two vectors as input.
c <- matrix(c(1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), ncol=2, byrow=TRUE)
## c1 = numeric vector containing the labels of the first partition
c1 <- c[,1]
## c2 = numeric vector containing the labels of the second partition
c2 <- c[,2]

(ARI <- randIndex(c1,c2))

##  3. Compare ARI for iris data (true classification against tclust classification)
library(tclust)
c1 <- iris$Species  # first partition c1 is the true partition
out <- tclust(iris[, 1:4], k=3, alpha=0, restr.fact=100)
c2 <- out$cluster   # second partition c2 is the output of tclust clustering procedure

randIndex(c1,c2)

##  4. Compare ARI for iris data (exclude unassigned units from tclust).

c1 <- iris$Species      # first partition c1 is the true partition
out <- tclust(iris[,1:4], k=3, alpha=0.1, restr.fact=100)
c2 <- out$cluster       #  second partition c2 is the output of tclust clustering procedure

## Units inside c2 which contain number 0 are referred to trimmed observations
noisecluster <- 0
randIndex(c1, c2, noisecluster=0)

Robust Linear Grouping

Description

The function rlg() searches for clusters around affine subspaces of dimensions given by vector d (the length of that vector is the number of clusters). For instance d=c(1,2) means that we are clustering around a line and a plane. For robustifying the estimation, a proportion alpha of observations is trimmed. In particular, the trimmed k-means method is represented by the rlg method, if d=c(0,0,..0) (a vector of length k with zeroes).

Usage

rlg(
  x,
  d,
  alpha = 0.05,
  nstart = 500,
  niter1 = 3,
  niter2 = 20,
  nkeep = 5,
  scale = FALSE,
  parallel = FALSE,
  n.cores = -1,
  trace = FALSE
)

Arguments

x

A matrix or data.frame of dimension n x p, containing the observations (rowwise).

d

A numeric vector of length equal to the number of clusters to be detected. Each component of vector d indicates the intrinsic dimension of the affine subspace where observations on that cluster are going to be clustered. All the elements of vector d should be smaller than the problem dimension minus 1.

alpha

The proportion of observations to be trimmed.

nstart

The number of random initializations to be performed.

niter1

The number of concentration steps to be performed for the nstart initializations.

niter2

The maximum number of concentration steps to be performed for the nkeep solutions kept for further iteration. The concentration steps are stopped, whenever two consecutive steps lead to the same data partition.

nkeep

The number of iterated initializations (after niter1 concentration steps) with the best values in the target function that are kept for further iterations

scale

A robust centering and scaling (using the median and MAD) is done if TRUE.

parallel

A logical value, specifying whether the nstart initializations should be done in parallel.

n.cores

The number of cores to use when paralellizing, only taken into account if parallel=T.

trace

Defines the tracing level, which is set to 0 by default. Tracing level 1 gives additional information on the stage of the iterative process.

Details

The procedure allows to deal with robust clustering around affine subspaces with an alpha proportion of trimming level by minimizing the trimmed sums of squared orthogonal residuals. Each component of vector d indicates the intrinsic dimension of the affine subspace where observations on that cluster are going to be clustered. Therefore a component equal to 0 on that vector implies clustering around centres, equal to 1 around lines, equal to 2 around planes and so on. The procedure so allows simultaneous clustering and dimensionality reduction.

This iterative algorithm performs "concentration steps" to improve the current cluster assignments. For approximately obtaining the global optimum, the procedure is randomly initialized nstart times and niter1 concentration steps are performed for them. The nkeep most “promising” iterations, i.e. the nkeep iterated solutions with the initial best values for the target function, are then iterated until convergence or until niter2 concentration steps are done.

Value

Returns an object of class rlg which is basically a list with the following elements:

Author(s)

Javier Crespo Guerrero, Jesús Fernández Iglesias, Luis Angel Garcia Escudero, Agustin Mayo Iscar.

References

García‐Escudero, L. A., Gordaliza, A., San Martin, R., Van Aelst, S., & Zamar, R. (2009). Robust linear clustering. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71, 301-318.

Examples

##--- EXAMPLE 1 ------------------------------------------
data (LG5data)
x <- LG5data[, 1:10]
clus <- rlg(x, d = c(2,2,2), alpha=0.1)
plot(x, col=clus$cluster+1)
plot(clus, which="eigenvalues") 
plot(clus, which="scores") 

##--- EXAMPLE 2 ------------------------------------------
 data (pine) 
 clus <- rlg(pine, d = c(1,1,1), alpha=0.035)
 plot(pine, col=clus$cluster+1)
 

Simulate contaminated data set for applying rlg

Description

Simulate alpha*100% contaminated data set for applying rlg by generating a k=3 components with equal size and # common underlying dimension q_1=q_2=q_3=q

Usage

simula.rlg(q = 2, p = 10, n = 200, var = 0.01, sep.means = 0, alpha = 0.05)

Arguments

q

intrinsic dimension

p

dimension (p >= 2 and p > q)

n

number of observations

var

The smaller 'var' the smaller the scatter around the lower dimensional space

sep.means

Parameter controlling the location vectors separation

alpha

contamination level

Value

a list with the following items

Examples

 res <- simula.rlg(q=5, p=200, n=150, var=0.01, sep.means=0.00)
 plot(res$x,col=res$true+1)


Simulate contaminated data set for applying TCLUST

Description

Simulate 10% contaminated data set for applying TCLUST

Usage

simula.tclust(n, p = 4, k = 3, type = 2, balanced = 1)

Arguments

n

number of observations

p

dimension (p>=2 and p>q)

k

number of clusters (only k=3 and k=6 are allowed!!!)

type

1 (spherical for rest.fact=1) or 2 (elliptical for rest.fact=9^2)

balanced

1 (all clusters equal size) or 2 [proportions (25,30,35)% if k=3 and (12.5,15,17.5,12.5,15,17.5)% if k=6]

Value

a list with the following items

Examples

res <- simula.tclust(n=400,k=3,p=8,type=2,balanced=1)
plot(res$x,col=res$true+1)


The summary method for objects of class DiscrFact

Description

The summary method for class DiscrFact.

Usage

## S3 method for class 'DiscrFact'
summary(object, hide.emtpy = TRUE, show.clust, show.alt, ...)

Arguments

object

An object of class DiscrFact as returned from DiscrFact().

hide.emtpy

A logical value specifying whether clusters without doubtful assignment shall be hidden.

show.clust

A logical value specifying whether the number of doubtful assignments per cluster shall be displayed.

show.alt

A logical value specifying whether the alternative cluster assignment shall be displayed.

...

Arguments passed to or from other methods.

References

García-Escudero, L.A.; Gordaliza, A.; Matrán, C. and Mayo-Iscar, A. (2011), "Exploring the number of groups in robust model-based clustering." Statistics and Computing, 21 pp. 585-599, <doi:10.1007/s11222-010-9194-z>

Examples

 sig <- diag (2)
 cen <- rep (1, 2)
 x <- rbind(MASS::mvrnorm(360, cen * 0,   sig),
 	       MASS::mvrnorm(540, cen * 5,   sig * 6 - 2),
 	       MASS::mvrnorm(100, cen * 2.5, sig * 50)
 )

 clus.1 <- tclust(x, k = 2, alpha=0.1, restr.fact=12)
 clus.2 <- tclust(x, k = 3, alpha=0.1, restr.fact=1)

 dsc.1 <- DiscrFact(clus.1)
 summary(dsc.1)

 dsc.2 <- DiscrFact(clus.2)
 summary(dsc.2)


Swiss banknotes data

Description

Six variables measured on 100 genuine and 100 counterfeit old Swiss 1000-franc bank notes (Flury and Riedwyl, 1988).

Usage

data(swissbank)

Format

A data frame containing 200 observations in 6 variables. The variables are as follows:

Details

Observations 1–100 are the genuine bank notes and the other 100 observations are the counterfeit bank notes.

Source

Flury, B. and Riedwyl, H. (1988). Multivariate Statistics, A Practical Approach, Cambridge University Press.


TCLUST method for robust clustering

Description

This function searches for k (or less) clusters with different covariance structures in a data matrix x. Relative cluster scatter can be restricted when restr="eigen" by constraining the ratio between the largest and the smallest of the scatter matrices eigenvalues by a constant value restr.fact. Relative cluster scatters can be also restricted with restr="deter" by constraining the ratio between the largest and the smallest of the scatter matrices' determinants.

For robustifying the estimation, a proportion alpha of observations is trimmed. In particular, the trimmed k-means method is represented by the tclust() method, by setting parameters restr.fact=1, opt="HARD" and equal.weights=TRUE.

Usage

tclust(
  x,
  k,
  alpha = 0.05,
  nstart = 500,
  niter1 = 3,
  niter2 = 20,
  nkeep = 5,
  iter.max,
  equal.weights = FALSE,
  restr = c("eigen", "deter"),
  restr.fact = 12,
  cshape = 1e+10,
  opt = c("HARD", "MIXT"),
  center = FALSE,
  scale = FALSE,
  store_x = TRUE,
  parallel = FALSE,
  n.cores = -1,
  zero_tol = 1e-16,
  drop.empty.clust = TRUE,
  trace = 0
)

Arguments

x

A matrix or data.frame of dimension n x p, containing the observations (row-wise).

k

The number of clusters initially searched for.

alpha

The proportion of observations to be trimmed.

nstart

The number of random initializations to be performed.

niter1

The number of concentration steps to be performed for the nstart initializations.

niter2

The maximum number of concentration steps to be performed for the nkeep solutions kept for further iteration. The concentration steps are stopped, whenever two consecutive steps lead to the same data partition.

nkeep

The number of iterated initializations (after niter1 concentration steps) with the best values in the target function that are kept for further iterations

iter.max

(deprecated, use the combination nkeep, niter1 and niter2) The maximum number of concentration steps to be performed. The concentration steps are stopped, whenever two consecutive steps lead to the same data partition.

equal.weights

A logical value, specifying whether equal cluster weights shall be considered in the concentration and assignment steps.

restr

Restriction type to control relative cluster scatters. The default value is restr="eigen", so that the maximal ratio between the largest and the smallest of the scatter matrices eigenvalues is constrained to be smaller then or equal to restr.fact (Garcia-Escudero, Gordaliza, Matran, and Mayo-Iscar, 2008). Alternatively, restr="deter" imposes that the maximal ratio between the largest and the smallest of the scatter matrices determinants is smaller or equal than restr.fact (see Garcia-Escudero, Mayo-Iscar and Riani, 2020)

restr.fact

The constant restr.fact >= 1 constrains the allowed differences among group scatters in terms of eigenvalues ratio (if restr="eigen") or determinant ratios (if restr="deter"). Larger values imply larger differences of group scatters, a value of 1 specifies the strongest restriction.

cshape

constraint to apply to the shape matrices, cshape >= 1, (see Garcia-Escudero, Mayo-Iscar and Riani, 2020)). This options only works if restr=='deter'. In this case the default value is cshape=1e10 to ensure the procedure is (virtually) affine equivariant. On the other hand, cshape values close to 1 would force the clusters to be almost spherical (without necessarily the same scatters if restr.fact is strictly greater than 1).

opt

Define the target function to be optimized. A classification likelihood target function is considered if opt="HARD" and a mixture classification likelihood if opt="MIXT".

center

Optional centering of the data: a function or a vector of length p which can optionally be specified for centering x before calculation

scale

Optional scaling of the data: a function or a vector of length p which can optionally be specified for scaling x before calculation

store_x

A logical value, specifying whether the data matrix x shall be included in the result object. By default this value is set to TRUE, because some of the plotting functions depend on this information. However, when big data matrices are handled, the result object's size can be decreased noticeably when setting this parameter to FALSE.

parallel

A logical value, specifying whether the nstart initializations should be done in parallel.

n.cores

The number of cores to use when paralellizing, only taken into account if parallel=TRUE.

zero_tol

The zero tolerance used. By default set to 1e-16.

drop.empty.clust

Logical value specifying, whether empty clusters shall be omitted in the resulting object. (The result structure does not contain center and covariance estimates of empty clusters anymore. Cluster names are reassigned such that the first l clusters (l <= k) always have at least one observation.

trace

Defines the tracing level, which is set to 0 by default. Tracing level 1 gives additional information on the stage of the iterative process.

Details

The procedure allows to deal with robust clustering with an alpha proportion of trimming level and searching for k clusters. We are considering classification trimmed likelihood when using opt=”HARD” so that “hard” or “crisp” clustering assignments are done. On the other hand, mixture trimmed likelihood are applied when using opt=”MIXT” so providing a kind of clusters “posterior” probabilities for the observations. Relative cluster scatter can be restricted when restr="eigen" by constraining the ratio between the largest and the smallest of the scatter matrices eigenvalues by a constant value restr.fact. Setting restr.fact=1, yields the strongest restriction, forcing all clusters to be spherical and equally scattered. Relative cluster scatters can be also restricted with restr="deter" by constraining the ratio between the largest and the smallest of the scatter matrices' determinants.

This iterative algorithm performs "concentration steps" to improve the current cluster assignments. For approximately obtaining the global optimum, the procedure is randomly initialized nstart times and niter1 concentration steps are performed for them. The nkeep most “promising” iterations, i.e. the nkeep iterated solutions with the initial best values for the target function, are then iterated until convergence or until niter2 concentration steps are done.

The parameter restr.fact defines the cluster scatter matrices restrictions, which are applied on all clusters during each concentration step. It restricts the ratio between the maximum and minimum eigenvalue of all clusters' covariance structures to that parameter. Setting restr.fact=1, yields the strongest restriction, forcing all clusters to be spherical and equally scattered.

Cluster components with similar sizes are favoured when considering equal.weights=TRUE while equal.weights=FALSE admits possible different prior probabilities for the components and it can easily return empty clusters when the number of clusters is greater than apparently needed.

Value

The function returns the following values:

Author(s)

Javier Crespo Guerrero, Luis Angel Garcia Escudero, Agustin Mayo Iscar.

References

Fritz, H.; Garcia-Escudero, L.A.; Mayo-Iscar, A. (2012), "tclust: An R Package for a Trimming Approach to Cluster Analysis". Journal of Statistical Software, 47(12), 1-26. URL http://www.jstatsoft.org/v47/i12/

Garcia-Escudero, L.A.; Gordaliza, A.; Matran, C. and Mayo-Iscar, A. (2008), "A General Trimming Approach to Robust Cluster Analysis". Annals of Statistics, Vol.36, 1324–1345.

García-Escudero, L. A., Gordaliza, A. and Mayo-Íscar, A. (2014). A constrained robust proposal for mixture modeling avoiding spurious solutions. Advances in Data Analysis and Classification, 27–43.

García-Escudero, L. A., and Mayo-Íscar, A. and Riani, M. (2020). Model-based clustering with determinant-and-shape constraint. Statistics and Computing, 30, 1363–1380.]

Examples


 
 ##--- EXAMPLE 1 ------------------------------------------
 sig <- diag(2)
 cen <- rep(1,2)
 x <- rbind(MASS::mvrnorm(360, cen * 0,   sig),
            MASS::mvrnorm(540, cen * 5,   sig * 6 - 2),
            MASS::mvrnorm(100, cen * 2.5, sig * 50))
 
 ## Two groups and 10\% trimming level
 clus <- tclust(x, k = 2, alpha = 0.1, restr.fact = 8)
 
 plot(clus)
 plot(clus, labels = "observation")
 plot(clus, labels = "cluster")
 
 ## Three groups (one of them very scattered) and 0\% trimming level
 clus <- tclust(x, k = 3, alpha=0.0, restr.fact = 100)
 
 plot(clus)
 
 ##--- EXAMPLE 2 ------------------------------------------
 data(geyser2)
 (clus <- tclust(geyser2, k = 3, alpha = 0.03))
 
 plot(clus)
 


 ##--- EXAMPLE 3 ------------------------------------------
 data(M5data)
 x <- M5data[, 1:2]
 
 clus.a <- tclust(x, k = 3, alpha = 0.1, restr.fact =  1,
                   restr = "eigen", equal.weights = TRUE)
 clus.b <- tclust(x, k = 3, alpha = 0.1, restr.fact =  50,
                    restr = "eigen", equal.weights = FALSE)
 clus.c <- tclust(x, k = 3, alpha = 0.1, restr.fact =  1,
                   restr = "deter", equal.weights = TRUE)
 clus.d <- tclust(x, k = 3, alpha = 0.1, restr.fact = 50,
                   restr = "deter", equal.weights = FALSE)
 
 pa <- par(mfrow = c (2, 2))
 plot(clus.a, main = "(a)")
 plot(clus.b, main = "(b)")
 plot(clus.c, main = "(c)")
 plot(clus.d, main = "(d)")
 par(pa)
 
 ##--- EXAMPLE 4 ------------------------------------------

 data (swissbank)
 ## Two clusters and 8\% trimming level
 (clus <- tclust(swissbank, k = 2, alpha = 0.08, restr.fact = 50))
 
 ## Pairs plot of the clustering solution
 pairs(swissbank, col = clus$cluster + 1)
 ## Two coordinates
 plot(swissbank[, 4], swissbank[, 6], col = clus$cluster + 1,
      xlab = "Distance of the inner frame to lower border",
      ylab = "Length of the diagonal")
 plot(clus)
 
 ## Three clusters and 0\% trimming level
 clus<- tclust(swissbank, k = 3, alpha = 0.0, restr.fact = 110)
 
 ## Pairs plot of the clustering solution
 pairs(swissbank, col = clus$cluster + 1)
 
 ## Two coordinates
 plot(swissbank[, 4], swissbank[, 6], col = clus$cluster + 1, 
       xlab = "Distance of the inner frame to lower border", 
       ylab = "Length of the diagonal")
 
 plot(clus)
 
 ##--- EXAMPLE 5 ------------------------------------------
  data(M5data)
  x <- M5data[, 1:2]
  
  ## Classification trimmed likelihood approach
  clus.a <- tclust(x, k = 3, alpha = 0.1, restr.fact =  50,
                     opt="HARD", restr = "eigen", equal.weights = FALSE)
 ## Mixture trimmed likelihood approach
  clus.b <- tclust(x, k = 3, alpha = 0.1, restr.fact =  50,
                     opt="MIXT", restr = "eigen", equal.weights = FALSE)
 
 ## Hard 0-1 cluster assignment (all 0 if trimmed unit)
 head(clus.a$posterior)
 
 ## Posterior probabilities cluster assignment for the
 ##  mixture approach (all 0 if trimmed unit)
 head(clus.b$posterior)
 



Performs cluster analysis by calling tclust for different number of groups k and restriction factors c

Description

Computes the values of BIC (MIXMIX), ICL (MIXCLA) or CLA (CLACLA), for different values of k (number of groups) and different values of c (restriction factor), for a prespecified level of trimming (the last two letters in the name stand for 'Information Criterion').

Usage

tclustIC(
  x,
  kk = 1:5,
  cc = c(1, 2, 4, 8, 16, 32, 64, 128),
  alpha = 0.05,
  whichIC = c("ALL", "MIXMIX", "MIXCLA", "CLACLA"),
  parallel = FALSE,
  n.cores = -1,
  trace = FALSE,
  ...
)

Arguments

x

A matrix or data frame of dimension n x p, containing the observations (row-wise).

kk

an integer vector specifying the number of mixture components (clusters) for which the information criteria are be calculated. By default kk=1:5.

cc

an vector specifying the values of the restriction factor which have to be considered. By default cc=c(1, 2, 4, 8, 16, 32, 64, 128).

alpha

The proportion of observations to be trimmed.

whichIC

A character value which specifies which information criteria must be computed for each k (number of groups) and each value of the restriction factor c. Possible values for whichIC are:

  • "MIXMIX": a mixture model is fitted and for computing the information criterion the mixture likelihood is used. This option corresponds to the use of the Bayesian Information criterion (BIC). In output just the matrix MIXMIX is given.

  • "MIXCLA": a mixture model is fitted but to compute the information criterion the classification likelihood is used. This option corresponds to the use of the Integrated Complete Likelihood (ICL). In the output just the matrix MIXCLA is given.

  • "CLACLA": everything is based on the classification likelihood. This information criterion will be called CLA. In the output just the matrix CLACLA is given.

  • "ALL": both classification and mixture likelihood are used. In this case all three information criteria CLA, ICL and BIC are computed. In the output all three matrices MIXMIX, MIXCLA and CLACLA are given.

parallel

A logical value, specifying whether the calls to tclust should be done in parallel.

n.cores

The number of cores to use when paralellizing, only taken into account if parallel=TRUE.

trace

Whether to print intermediate results. Default is trace=FALSE.

...

Further arguments (as e.g. restr), passed to tclust

Value

The functions print() and summary() are used to obtain and print a summary of the results. The function returns an S3 object of type tclustIC containing the following components:

References

Cerioli, A., Garcia-Escudero, L.A., Mayo-Iscar, A. and Riani M. (2017). Finding the Number of Groups in Model-Based Clustering via Constrained Likelihoods, Journal of Computational and Graphical Statistics, pp. 404-416, https://doi.org/10.1080/10618600.2017.1390469.

See Also

tclust

Examples


 #--- EXAMPLE 1 ------------------------------------------
 
 data(geyser2)
 (out <- tclustIC(geyser2, whichIC="MIXMIX", alpha=0.1))
 summary(out)
 ## Find the smallest value inside the table and write the corresponding
 ## values of k (number of groups) and c (restriction factor)
 inds <- which(out$MIXMIX == min(out$MIXMIX), arr.ind=TRUE)
 vals <- out$MIXMIX[inds]
 cat("\nThe smallest value of the IC is ", vals, 
     " and takes place for k=", out$kk[inds[1]], " and c=",   
     out$cc[inds[2]], "\n")
 

 #--- EXAMPLE 2 ------------------------------------------
 
 data(flea)
 Y <- as.matrix(flea[, 1:(ncol(flea)-1)])    # select only the numeric variables
 rownames(Y) <- 1:nrow(Y)
 head(Y)

 (out <- tclustIC(Y, whichIC="CLACLA", alpha=0.1))
 summary(out)
 ## Find the smallest value inside the table and write the corresponding
 ## values of k (number of groups) and c (restriction factor)
 inds <- which(out$CLACLA == min(out$CLACLA), arr.ind=TRUE)
 vals <- out$CLACLA[inds]
 cat("\nThe Smallest value of the IC is ", vals, 
     " and takes place for k=", out$kk[inds[1]], " and c=",   
     out$cc[inds[2]], "\n")
 

 #--- EXAMPLE 3 ------------------------------------------
 
 data(swissbank)
 (out <- tclustIC(swissbank, whichIC="ALL"))
 
 plot(out)  ##  --> selecting k=3, c=128
 
 ##  the selected model
 plot(tclust(swissbank, k = 3, alpha = 0.1, restr.fact = 128))
 
 


TKMEANS method for robust K-means clustering

Description

This function searches for k (or less) spherical clusters in a data matrix x, whereas the ceiling(alpha n) most outlying observations are trimmed.

Usage

tkmeans(
  x,
  k,
  alpha = 0.05,
  nstart = 500,
  niter1 = 3,
  niter2 = 20,
  nkeep = 5,
  iter.max,
  points = NULL,
  center = FALSE,
  scale = FALSE,
  store_x = TRUE,
  parallel = FALSE,
  n.cores = -1,
  zero_tol = 1e-16,
  drop.empty.clust = TRUE,
  trace = 0
)

Arguments

x

A matrix or data.frame of dimension n x p, containing the observations (row-wise).

k

The number of clusters initially searched for.

alpha

The proportion of observations to be trimmed.

nstart

The number of random initializations to be performed.

niter1

The number of concentration steps to be performed for the nstart initializations.

niter2

The maximum number of concentration steps to be performed for the nkeep solutions kept for further iteration. The concentration steps are stopped, whenever two consecutive steps lead to the same data partition.

nkeep

The number of iterated initializations (after niter1 concentration steps) with the best values in the target function that are kept for further iterations

iter.max

(deprecated, use the combination nkeep, niter1 and niter2) The maximum number of concentration steps to be performed. The concentration steps are stopped, whenever two consecutive steps lead to the same data partition.

points

Optional initial mean vectors, NULL or a matrix with k vectors used as means to initialize the algorithm. If initial mean vectors are specified, nstart should be 1 (otherwise the same initial means are used for all runs).

center

Optional centering of the data: a function or a vector of length p which can optionally be specified for centering x before calculation

scale

Optional scaling of the data: a function or a vector of length p which can optionally be specified for scaling x before calculation

store_x

A logical value, specifying whether the data matrix x shall be included in the result object. By default this value is set to TRUE, because some of the plotting functions depend on this information. However, when big data matrices are handled, the result object's size can be decreased noticeably when setting this parameter to FALSE.

parallel

A logical value, specifying whether the nstart initializations should be done in parallel.

n.cores

The number of cores to use when paralellizing, only taken into account if parallel=TRUE.

zero_tol

The zero tolerance used. By default set to 1e-16.

drop.empty.clust

Logical value specifying, whether empty clusters shall be omitted in the resulting object. (The result structure does not contain center estimates of empty clusters anymore. Cluster names are reassigned such that the first l clusters (l <= k) always have at least one observation.

trace

Defines the tracing level, which is set to 0 by default. Tracing level 1 gives additional information on the stage of the iterative process.

Value

The function returns the following values:

Author(s)

Valentin Todorov, Luis Angel Garcia Escudero, Agustin Mayo Iscar.

References

Cuesta-Albertos, J. A.; Gordaliza, A. and Matrán, C. (1997), "Trimmed k-means: an attempt to robustify quantizers". Annals of Statistics, Vol. 25 (2), 553-576.

Examples


 
 ##--- EXAMPLE 1 ------------------------------------------
 sig <- diag(2)
 cen <- rep(1,2)
 x <- rbind(MASS::mvrnorm(360, cen * 0,   sig),
            MASS::mvrnorm(540, cen * 5,   sig),
            MASS::mvrnorm(100, cen * 2.5, sig))
 
 ## Two groups and 10\% trimming level
 (clus <- tkmeans(x, k = 2, alpha = 0.1))

 plot(clus)
 plot(clus, labels = "observation")
 plot(clus, labels = "cluster")

 #--- EXAMPLE 2 ------------------------------------------
 data(geyser2)
 (clus <- tkmeans(geyser2, k = 3, alpha = 0.03))
 plot(clus)
 

Wholesale customers dataset

Description

The data set refers to clients of a wholesale distributor. It includes the annual spending in monetary units on diverse product categories.

Usage

data(wholesale)

Format

A data frame containing 440 observations in 8 variables (6 numerical and two categorical). The variables are as follows:

Source

Abreu, N. (2011). Analise do perfil do cliente Recheio e desenvolvimento de um sistema promocional. Mestrado em Marketing, ISCTE-IUL, Lisbon. url=https://api.semanticscholar.org/CorpusID:124027622

Examples

#--- EXAMPLE 1 ------------------------------------------ 
data (wholesale) 
x <- wholesale[, -c(1, ncol(wholesale))] 
clus <- tclust(x, k=3, alpha=0.1, nstart=200, niter1=3, niter2=17, 
   nkeep=10, opt="HARD", equal.weights=FALSE, restr.fact=50, trace=TRUE) 
 plot (x, col=clus$cluster+1)
 plot(clus)