Adrian Raftery: Model-Based Clustering Research

Cluster analysis is the automatic numerical grouping of objects into cohesive groups based on measured characteristics. It was invented in the late 1950s by Sokal, Sneath and others, and has developed mainly as a set of heuristic methods. More recently it has been found that basing cluster analysis on a probability model can be useful both for understanding when existing methods are likely to be successful, and for suggesting new and better methods. It also provides answers to several questions that often arise in practice but are hard to answer with heuristic methods: How many clusters are there? Which clustering method should be used? How should we deal with outliers?
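As a rough illustration of how a probability model can address "How many clusters are there?", the sketch below fits one-dimensional Gaussian mixtures by EM for several candidate numbers of clusters and compares them by BIC, using the sign convention in which larger values are better. It is a minimal standard-library Python sketch, not the mclust implementation; the function names (`fit_gmm_1d`, `bic`), the simulated data, the quantile-based starting values and the variance floor are all assumptions made for this example.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_density(x, weights, means, variances):
    """Density of the Gaussian mixture at x, floored to avoid log(0)."""
    total = sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
    return max(total, 1e-300)


def fit_gmm_1d(data, k, n_iter=200, min_var=1e-3):
    """Fit a k-component 1-D Gaussian mixture by EM; return the log-likelihood."""
    n = len(data)
    ordered = sorted(data)
    # Assumed deterministic start: means at evenly spaced quantiles,
    # a shared variance equal to the overall variance, equal weights.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    variances = [grand_var] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each component.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = max(sum(dens), 1e-300)
            resp.append([d / total for d in dens])
        # M-step: weighted updates of mixing proportions, means and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)  # floor guards against collapsing components
    return sum(math.log(mixture_density(x, weights, means, variances)) for x in data)


def bic(loglik, k, n):
    """BIC = 2 * log-likelihood - (number of parameters) * log(n); larger is better."""
    n_params = 3 * k - 1  # (k - 1) mixing proportions + k means + k variances
    return 2.0 * loglik - n_params * math.log(n)


# Simulated data (an assumption for the example): two well-separated groups.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(150)] + [rng.gauss(6.0, 1.0) for _ in range(150)]

# Compare candidate numbers of clusters and keep the one with the largest BIC.
bic_by_k = {k: bic(fit_gmm_1d(data, k), k, len(data)) for k in range(1, 5)}
best_k = max(bic_by_k, key=bic_by_k.get)
print(bic_by_k)
print("number of clusters chosen by BIC:", best_k)
```

Because the data are simulated from two well-separated groups, the one-component model should score clearly worse than the two-component model. The same comparison can be extended to choose among different cluster shapes as well as different numbers of clusters.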

Most clustering methods have been developed for the situation where the groups to be identified are well separated "blobs" in p-space. I have been interested in the case where the groups are defined by their shape, may be clustered around lines or even thin nonlinear curves, and may even intersect. Examples are groups of boundary pixels in images, groups of earthquakes clustered along seismic faults, and stars grouped in galaxies.

For a review of model-based clustering, see our 2019 book, Model-Based Clustering and Classification for Data Science, with Applications in R, as well as Fraley and Raftery (2002). Free software to carry it out, MCLUST, is available for R. For more information on the software, see our 2023 book, Model-Based Clustering, Classification, and Density Estimation Using mclust in R.

More recent research projects in this area include model-based clustering for social networks, variable selection for model-based clustering, merging Gaussian mixture components to represent non-Gaussian clusters, and Bayesian model averaging for model-based clustering.

Books

Scrucca, L., Fraley, C., Murphy, T.B. and Raftery, A.E. (2023). Model-Based Clustering, Classification, and Density Estimation Using mclust in R. Chapman & Hall / CRC Press.

Bouveyron, C., Celeux, G., Murphy, T.B. and Raftery, A.E. (2019). Model-Based Clustering and Classification for Data Science, with Applications in R. Cambridge University Press (Cambridge Series in Statistical and Probabilistic Mathematics). Free download here.

Papers

Gormley, I.C., Murphy, T.B. and Raftery, A.E. (2023). Model-Based Clustering. Annual Review of Statistics and Its Application 10:573-595.

Scrucca, L. and Raftery, A.E. (2018). clustvarsel: A package implementing variable selection for model-based clustering in R. Journal of Statistical Software 84(1):1-28.

Young, W.C., Raftery, A.E. and Yeung, K.Y. (2017). Model-based clustering with data correction for removing artifacts in gene expression data. Annals of Applied Statistics 11:1998-2026. (Open access).

Scrucca, L., Fop, M., Murphy, T.B. and Raftery, A.E. (2016). mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models. R Journal 8:289-317.

Russell, N., Murphy, T.B. and Raftery, A.E. (2015). Bayesian model averaging in model-based clustering and density estimation. Technical Report no. 635, Department of Statistics, University of Washington. Also arXiv:1506.09035.

Scrucca, L. and Raftery, A.E. (2015). Improved initialisation of model-based clustering using a Gaussian hierarchical partition. Advances in Data Analysis and Classification 9:447-460.

Scrucca, L. and Raftery, A.E. (2014). clustvarsel: A Package Implementing Variable Selection for Model-based Clustering in R. Technical Report no. 629, Department of Statistics, University of Washington. Also arXiv:1411.0606.

Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C. and Raftery, A.E. (2014). Comparing Model Selection and Regularization Approaches to Variable Selection in Model-Based Clustering. Journal de la Société Française de Statistique, 155(2):57-71.

Raftery, A.E., Niu, X., Hoff, P.D. and Yeung, K.Y. (2012). Fast Inference for the Latent Space Network Model Using a Case-Control Approximate Likelihood. Journal of Computational and Graphical Statistics, 21:909-919.

Baudry, J.-P., Raftery, A.E., Celeux, G., Lo, K. and Gottardo, R. (2010). Combining Mixture Components for Clustering. Journal of Computational and Graphical Statistics 19:332-353.

Murphy, T.B., Dean, N. and Raftery, A.E. (2010). Variable Selection and Updating In Model-Based Discriminant Analysis for High Dimensional Data with Food Authenticity Applications. Annals of Applied Statistics 4:396-421.

Dean, N. and Raftery, A.E. (2010). Latent Class Analysis Variable Selection. Annals of the Institute of Statistical Mathematics 62:11-35.

Steele, R.J., Wang, N. and Raftery, A.E. (2010). Inference from multiple imputation for missing data using mixtures of normals. Statistical Methodology 7:351-365.

Steele, R.J. and Raftery, A.E. (2010). Performance of Bayesian Model Selection Criteria for Gaussian Mixture Models. In Frontiers of Statistical Decision Making and Bayesian Analysis (edited by M.-H. Chen et al.), pages 113-130, New York: Springer. Earlier version.

Krivitsky, P., Handcock, M.S., Raftery, A.E. and Hoff, P. (2009). Representing Degree Distributions, Clustering, and Homophily in Social Networks With Latent Cluster Random Effects Models. Social Networks 31:204-213.

Handcock, M.S., Raftery, A.E. and Tantrum, J. (2007). Model-based clustering for social networks (with Discussion). Journal of the Royal Statistical Society, Series A, 170, 301-354.

Oh, M.-S. and Raftery, A.E. (2007). Model-based Clustering with Dissimilarities: A Bayesian Approach. Journal of Computational and Graphical Statistics, 16, 559-585.

Fraley, C. and Raftery, A.E. (2007). Bayesian Regularization for Normal Mixture Estimation and Model-Based Clustering. Journal of Classification, 24, 155-181.

Fraley, C. and Raftery, A.E. (2007). Model-based methods of classification: Using the mclust software in chemometrics. Journal of Statistical Software, 18, paper i06.

Fraley, C. and Raftery, A.E. (2006). MCLUST Version 3 for R: Normal Mixture Modeling and Model-Based Clustering. Technical Report no. 504, Department of Statistics, University of Washington.

Raftery, A.E. and Dean, N. (2006). Variable Selection for Model-Based Clustering. Journal of the American Statistical Association, 101, 168-178.

Steele, R., Raftery, A.E. and Emond, M. (2006). Computing Normalizing Constants for Finite Mixture Models via Incremental Mixture Importance Sampling (IMIS). Journal of Computational and Graphical Statistics, 15, 712-734.

Forbes, F., Peyrard, N., Fraley, C., Georgian-Smith, D., Goldhaber, D.M., and Raftery, A.E. (2006). Model-Based Region-of-Interest Selection in Dynamic Breast MRI. Journal of Computer Assisted Tomography, 30, 675-687.

Fraley, C. and Raftery, A.E. (2006). Some applications of model-based clustering in chemistry. R News, 6, no. 3, 17-23.

Fraley, C. and Raftery, A.E. (2006). Model-based microarray image analysis. R News, 6, no. 5, 60-63.

Fraley, C., Raftery, A.E. and Wehrens, R. (2005). Incremental Model-Based Clustering for Large Datasets with Small Clusters. Journal of Computational and Graphical Statistics, 14, 529-546.

Murtagh, F., Raftery, A.E. and Starck, J.L. (2005). Bayesian inference for multiband image segmentation via model-based cluster trees. Image and Vision Computing, 23, 587-596.

Dean, N. and Raftery, A.E. (2005). Normal uniform mixture differential gene expression detection for cDNA microarrays. BMC Bioinformatics, 6, 173. (doi:10.1186/1471-2105-6-173).

Li, Q., Fraley, C., Bumgarner, R.E., Yeung, K.Y. and Raftery, A.E. (2005). Donuts, Scratches and Blanks: Robust Model-Based Segmentation of Microarray Images. Bioinformatics, 21(12), 2875-2882 (doi:10.1093/bioinformatics/bti447).

Walsh, D.C.I. and Raftery, A.E. (2005). Classification of mixtures of spatial point processes via partial Bayes factors. Journal of Computational and Graphical Statistics, 14, 139-154.

Wehrens, R., Buydens, L.M.C., Fraley, C. and Raftery, A.E. (2004). Model-Based Clustering for Image Segmentation and Large Datasets Via Sampling. Journal of Classification, 21, 231-253.

Fraley, C. and Raftery, A.E. (2002). Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association, 97, 611-631.

Fraley, C. and Raftery, A.E. (2002). MCLUST: Software for Model-Based Clustering, Density Estimation and Discriminant Analysis. Technical Report no. 415, Department of Statistics, University of Washington.

Murtagh, F., Raftery, A.E. and Starck, J.-L. (2001). Bayesian Inference for Color Quantization via Model-Based Clustering Trees. Technical Report no. 402, Department of Statistics, University of Washington.

Yeung, K.Y., Fraley, C., Murua, A., Raftery, A.E. and Ruzzo, W.L. (2001). Model-based clustering and data transformations for gene expression data. Bioinformatics, 17, 977-987.
This paper was identified by ISI Science Citation Index/Web of Science as one of the most highly-cited papers in Gene Expression Data. Here is a commentary on the paper by lead author Ka Yee Yeung, published by ISI in its publication Fast Moving Fronts.

Stanford, D.C. and Raftery, A.E. (2000). Principal curve clustering with noise. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 601-609.

Fraley, C. and Raftery, A.E. (1999). MCLUST: Software for Model-Based Cluster Analysis. Journal of Classification, 16, 297-306.

Campbell, J.G., Fraley, C., Stanford, D., Murtagh, F. and Raftery, A.E. (1999). Model-based methods for textile fault detection. International Journal of Imaging Systems and Technology, 10, 339-346.

Mukherjee, S., Feigelson, E.D., Babu, G.J., Murtagh, F., Fraley, C. and Raftery, A.E. (1998). Three types of gamma ray bursts. Astrophysical Journal, 508, 314-327.

Fraley, C. and Raftery, A.E. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. Computer Journal, 41, 578-588.

Dasgupta, A. and Raftery, A.E. (1998). Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association, 93, 294-302.

Campbell, J.G., Fraley, C., Murtagh, F. and Raftery, A.E. (1997). Linear flaw detection in woven textiles using model-based clustering. Pattern Recognition Letters, 18, 1539-1548.

Bensmail, H., Celeux, G., Raftery, A.E. and Robert, C. (1997). Inference in model-based cluster analysis. Statistics and Computing, 7, 1-10.

Banfield, J.D. and Raftery, A.E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803-821.

Banfield, J.D. and Raftery, A.E. (1992). Ice floe identification in satellite images using mathematical morphology and clustering about principal curves. Journal of the American Statistical Association, 87, 7-16.

Murtagh, F. and Raftery, A.E. (1984). Fitting straight lines to point patterns. Pattern Recognition, 17, 479-483.

These papers are being made available here to facilitate the timely dissemination of scholarly work; copyright and all related rights are retained by the copyright holders.

Updated May 10, 2023.

Copyright 2005-2023 by Adrian E. Raftery; all rights reserved.