High Dimensional Sparse Count Data Clustering Using Finite Mixture Models


Download High Dimensional Sparse Count Data Clustering Using Finite Mixture Models PDF/ePub or read online books in Mobi eBooks. Click Download or Read Online button to get High Dimensional Sparse Count Data Clustering Using Finite Mixture Models book now. This website allows unlimited access to, at the time of writing, more than 1.5 million titles, including hundreds of thousands of titles in various foreign languages.

Download

High-dimensional Sparse Count Data Clustering Using Finite Mixture Models


High-dimensional Sparse Count Data Clustering Using Finite Mixture Models

Author: Nuha E. Zamzami

language: en

Publisher:

Release Date: 2021


DOWNLOAD





Due to the massive amount of available digital data, automating its analysis and modeling for different purposes and applications has become an urgent need. One of the most challenging tasks in machine learning is clustering, which is defined as the process of assigning observations sharing similar characteristics to subgroups. Such a task is significant, especially in implementing complex algorithms to deal with high-dimensional data. Thus, the advancement of computational power in statistical-based approaches is increasingly becoming an interesting and attractive research domain. Among the successful methods, mixture models have been widely acknowledged and successfully applied in numerous fields as they have been providing a convenient yet flexible formal setting for unsupervised and semi-supervised learning. An essential problem with these approaches is to develop a probabilistic model that represents the data well by taking into account its nature. Count data are widely used in machine learning and computer vision applications where an object, e.g., a text document or an image, can be represented by a vector corresponding to the appearance frequencies of words or visual words, respectively. Thus, they usually suffer from the well-known curse of dimensionality as objects are represented with high-dimensional and sparse vectors, i.e., a few thousand dimensions with a sparsity of 95 to 99%, which decline the performance of clustering algorithms dramatically. Moreover, count data systematically exhibit the burstiness and overdispersion phenomena, which both cannot be handled with a generic multinomial distribution, typically used to model count data, due to its dependency assumption. This thesis is constructed around six related manuscripts, in which we propose several approaches for high-dimensional sparse count data clustering via various mixture models based on hierarchical Bayesian modeling frameworks that have the ability to model the dependency of repetitive word occurrences. In such frameworks, a suitable distribution is used to introduce the prior information into the construction of the statistical model, based on a conjugate distribution to the multinomial, e.g. the Dirichlet, generalized Dirichlet, and the Beta-Liouville, which has numerous computational advantages. Thus, we proposed a novel model that we call the Multinomial Scaled Dirichlet (MSD) based on using the scaled Dirichlet as a prior to the multinomial to allow more modeling flexibility. Although these frameworks can model burstiness and overdispersion well, they share similar disadvantages making their estimation procedure is very inefficient when the collection size is large. To handle high-dimensionality, we considered two approaches. First, we derived close approximations to the distributions in a hierarchical structure to bring them to the exponential-family form aiming to combine the flexibility and efficiency of these models with the desirable statistical and computational properties of the exponential family of distributions, including sufficiency, which reduce the complexity and computational efforts especially for sparse and high-dimensional data. Second, we proposed a model-based unsupervised feature selection approach for count data to overcome several issues that may be caused by the high dimensionality of the feature space, such as over-fitting, low efficiency, and poor performance. Furthermore, we handled two significant aspects of mixture based clustering methods, namely, parameters estimation and performing model selection. We considered the Expectation-Maximization (EM) algorithm, which is a broadly applicable iterative algorithm for estimating the mixture model parameters, with incorporating several techniques to avoid its initialization dependency and poor local maxima. For model selection, we investigated different approaches to find the optimal number of components based on the Minimum Message Length (MML) philosophy. The effectiveness of our approaches is evaluated using challenging real-life applications, such as sentiment analysis, hate speech detection on Twitter, topic novelty detection, human interaction recognition in films and TV shows, facial expression recognition, face identification, and age estimation.

Mixture Models and Applications


Mixture Models and Applications

Author: Nizar Bouguila

language: en

Publisher: Springer

Release Date: 2019-08-13


DOWNLOAD





This book focuses on recent advances, approaches, theories and applications related to mixture models. In particular, it presents recent unsupervised and semi-supervised frameworks that consider mixture models as their main tool. The chapters considers mixture models involving several interesting and challenging problems such as parameters estimation, model selection, feature selection, etc. The goal of this book is to summarize the recent advances and modern approaches related to these problems. Each contributor presents novel research, a practical study, or novel applications based on mixture models, or a survey of the literature. Reports advances on classic problems in mixture modeling such as parameter estimation, model selection, and feature selection; Present theoretical and practical developments in mixture-based modeling and their importance in different applications; Discusses perspectives and challenging future works related to mixture modeling.

Intelligent Information and Database Systems


Intelligent Information and Database Systems

Author: Ngoc Thanh Nguyen

language: en

Publisher: Springer Nature

Release Date: 2021-04-04


DOWNLOAD





This book constitutes the refereed proceedings of the 13th Asian Conference on Intelligent Information and Database Systems, ACIIDS 2021, held in Phuket, Thailand, in April 2021.* The 67 full papers accepted for publication in these proceedings were carefully reviewed and selected from 291 submissions. The papers of the first volume are organized in the following topical sections: data mining methods and applications; machine learning methods; decision support and control systems; natural language processing; cybersecurity intelligent methods; computer vision techniques; computational imaging and vision; advanced data mining techniques and applications; intelligent and contextual systems; commonsense knowledge, reasoning and programming in artificial intelligence; data modelling and processing for industry 4.0; innovations in intelligent systems. *The conference was held virtually.