Ensemble of deep learning prediction models for data analytics

Zicari, Paolo; Fortino, Giancarlo; Folino, Gianluigi

Created by

Zicari, Paolo

Fortino, Giancarlo

Folino, Gianluigi

Metadata

URI

https://hdl.handle.net/10955/5627

Description

Format

Università della Calabria. Dipartimento di Ingegneria Informatica, Modellistica, Elettronica e Sistemistica. Dottorato di Ricerca in Information and Communication Technologies, Ciclo XXXV.
The abundance of available unstructured or raw text requires the automatic extraction of information for different tasks. One of the most relevant, Text Classification, extracts this information by assigning informative labels from a pre-defined set to raw texts. Deep Learning (DL) offers promising solutions to the automatic text classification problem. Despite the great potential of DL-based text classifiers, current solutions are exposed to a number of challenging issues that frequently occur when text categorization is used in real-life applications. First, a large amount of labelled data is usually necessary to train a deep model adequately, while labelling texts is time-consuming, expensive, and very often requires specific knowledge. Moreover, configuring the structure and hyper-parameters of a Deep Neural Network (DNN) architecture is a difficult task, which entails long and careful design and tuning activities to make the DNN perform well. In addition, classes are often imbalanced in typical scenarios. These issues entail a high risk of obtaining a DNN-based classifier that overfits the training data and relies on non-general, biased and unreliable classification patterns. Furthermore, the black-box nature of a DNN model makes it hard to reason about which features of a data instance drove the model to its classification decision. Starting from the general problem of text classification, the work in this thesis focuses on some challenging aspects of using an ensemble of deep learning methods to classify raw texts.
More in detail, this work follows two lines of research. The first is the analysis, exploration, study and testing of algorithms and learning models for novel Ensemble Deep Learning (EDL) techniques aimed at classification and explanation tasks. The second is the investigation of semi-supervised strategies based on pseudo-labelling for improving classifier prediction performance when labelled data are scarce. To this end, this thesis proposes a complete framework based on the paradigm of ensembles of deep learning algorithms. The framework is designed as an instrument for exploring, validating and testing the proposed deep ensemble techniques in real-life applications, and it covers the entire classification process: preprocessing, learning model building, explanation of the results, self-training for scarce labelled data, human-in-the-loop validation and model refinement. Even though the methods proposed in this work could be used in any field of interest, the problem of extracting information from raw text was specialised for two application contexts: automatic customer support ticket classification and fake news detection. The first scenario addresses the need of the Customer Care Department of most companies to answer customer requests submitted as tickets through several common channels such as email, text messages, social media posts, etc. Ticket classification is necessary for automatic answer generation and for routing to the appropriate human operator. The second scenario deals with the critical problem of discerning fake news from the vast amount of information circulating on the Web: the rapid growth of information dissemination and sharing on social media has made limiting the spread of misinformation a pressing issue, and it raises the challenging problem of processing long texts such as news articles.
In these research areas, the ensemble paradigm has been adopted only recently; thus, discovering the possible advantages of applying this technique is challenging. Experimental tests conducted on real data collected by two Customer Relationship Management (CRM) systems have demonstrated the framework's effectiveness in different ticket categorisation tasks and the practical value of the associated explanations. In addition, experiments conducted on two fake news datasets have demonstrated the effectiveness of the proposed semi-supervised, self-training, ensemble-based strategy for improving performance when only a few labelled data are available.
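
The abstract does not detail the self-training procedure, so the following is only a minimal sketch of ensemble-based pseudo-labelling under stated assumptions, not the thesis's actual method. A toy naive Bayes classifier (`TinyNaiveBayes`, a hypothetical stand-in for the deep models) is trained in several variants, the class probabilities are averaged (soft voting), and unlabelled texts whose averaged confidence reaches a threshold are added to the training set with their predicted label. All names, the threshold value, and the example tickets are illustrative assumptions.

```python
from collections import Counter
import math


class TinyNaiveBayes:
    """Toy multinomial naive Bayes; a stand-in for a deep text classifier."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive smoothing, varied to diversify members

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict_proba(self, text):
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for c in self.classes:
            denom = sum(self.word_counts[c].values()) + self.alpha * len(self.vocab)
            score = math.log(self.class_counts[c] / total_docs)
            for word in text.lower().split():
                if word in self.vocab:  # unseen words are ignored
                    score += math.log((self.word_counts[c][word] + self.alpha) / denom)
            log_scores[c] = score
        peak = max(log_scores.values())  # subtract the max for numerical stability
        exp_scores = {c: math.exp(s - peak) for c, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {c: v / norm for c, v in exp_scores.items()}


def ensemble_proba(models, text):
    """Soft voting: average the class probabilities of all ensemble members."""
    probs = [m.predict_proba(text) for m in models]
    return {c: sum(p[c] for p in probs) / len(models) for c in models[0].classes}


def self_train(make_models, labelled, unlabelled, threshold=0.7, max_rounds=5):
    """Add confidently pseudo-labelled texts to the training set, then retrain."""
    texts = [t for t, _ in labelled]
    labels = [y for _, y in labelled]
    pool = list(unlabelled)
    models = [m.fit(texts, labels) for m in make_models()]
    for _ in range(max_rounds):
        remaining, added = [], 0
        for text in pool:
            probs = ensemble_proba(models, text)
            best = max(probs, key=probs.get)
            if probs[best] >= threshold:  # confident enough: accept pseudo-label
                texts.append(text)
                labels.append(best)
                added += 1
            else:
                remaining.append(text)
        pool = remaining
        if added == 0:  # nothing new was learned, stop early
            break
        models = [m.fit(texts, labels) for m in make_models()]
    return models, pool


# Illustrative ticket data (invented for this sketch).
labelled = [
    ("cannot log in to my account", "access"),
    ("password reset link not working", "access"),
    ("invoice shows wrong amount", "billing"),
    ("charged twice on my card", "billing"),
]
unlabelled = ["reset my password please", "wrong amount charged on invoice", "hello there"]

models, leftover = self_train(
    lambda: [TinyNaiveBayes(alpha) for alpha in (0.5, 1.0, 2.0)],
    labelled,
    unlabelled,
)
print("still unlabelled:", leftover)
```

Texts that never reach the confidence threshold stay in the returned pool; in a human-in-the-loop setting such as the one the abstract mentions, these would be natural candidates for manual review.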

Subject

Deep learning; Ensemble; Text classification; Fake detection; Ticket classification

Relation

ING-INF/02;