Data mining techniques for large and complex data

Narvaez Vilema, Miryan Estela; Crupi, Felice; Angiulli, Fabrizio

URI

http://hdl.handle.net/10955/1875
https://doi.org/10.13126/unical.it/dottorati/1875

Description

PhD Program in Information and Communication Engineering for Pervasive Intelligent Environments, Cycle XXIX; During these three years of research I dedicated myself to the study and design of data mining techniques for large quantities of data. Particular attention was devoted to training set condensing techniques for the nearest-neighbor classification rule and to techniques for node anomaly detection in networks. The first part of this thesis focuses on the design and experimental evaluation of strategies to reduce the size of the subset extracted by condensing techniques. Training set condensing techniques aim to determine a subset of the original training set that allows all training set examples to be classified correctly. The subset extracted by these techniques is also known as a consistent subset. The result of the research was the development of various subset selection strategies, designed to determine, during the training phase, the most promising subset based on different methods of estimating test accuracy. Among them, the PACOPT strategy uses the Pessimistic Error Estimate (PEE) to estimate generalization as a trade-off between training set accuracy and model complexity. The experimental phase used the FCNN condensing technique as its reference. Among the condensing methods based on the nearest neighbor decision rule (NN rule), FCNN (Fast Condensed NN) is one of the most advantageous techniques, particularly in terms of time performance. We showed that the designed selection strategies preserve the accuracy of a consistent subset. We also demonstrated that the proposed selection strategies significantly reduce the size of the model. Comparison with notable training-set reduction techniques for the NN rule shows that the strategies introduced here achieve state-of-the-art performance.
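The notion of a consistent subset can be illustrated with a minimal sketch. The code below is not FCNN and not one of the thesis's selection strategies; it uses the classic Hart-style condensing loop as a stand-in, and the function names (`nearest_label`, `condense`, `is_consistent`) and the toy data are assumptions made purely for illustration.

```python
import math

def nearest_label(subset, point):
    """Label of the subset example closest to `point` (1-NN rule, Euclidean distance)."""
    closest = min(subset, key=lambda example: math.dist(example[0], point))
    return closest[1]

def condense(training_set):
    """Hart-style condensing (illustrative stand-in, not FCNN).

    Grows a subset until every training example is classified correctly
    by the 1-NN rule applied to that subset. Assumes the training set has
    no identical points carrying different labels.
    """
    subset = [training_set[0]]
    changed = True
    while changed:
        changed = False
        for point, label in training_set:
            misclassified = nearest_label(subset, point) != label
            # The membership guard keeps the loop finite on conflicting duplicates.
            if misclassified and (point, label) not in subset:
                subset.append((point, label))
                changed = True
    return subset

def is_consistent(subset, training_set):
    """True if the subset classifies all training examples correctly."""
    return all(nearest_label(subset, p) == y for p, y in training_set)

# Hypothetical toy data: two well-separated classes in the plane.
training = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
    ((5.0, 5.0), "b"), ((5.1, 4.9), "b"), ((4.8, 5.2), "b"),
]
subset = condense(training)
```

On this toy data one example per class is enough to classify all six training examples, which is the size reduction that condensing aims for.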
The second part of the thesis is directed towards the design of analysis tools for network-structured data. Anomaly detection is an area that has received much attention in recent years. It has a wide variety of applications, including fraud detection and network intrusion detection. Techniques for anomaly detection in static graphs assume that networks do not change and can represent only a single snapshot of the data. As real-world networks are constantly changing, the focus has shifted to dynamic graphs, which evolve over time. We present a technique for node anomaly detection in networks where arcs are annotated with their time of creation. The technique aims at singling out anomalies by simultaneously taking into account information concerning both the structure of the network and the order in which connections have been established. The latter information is obtained from the timestamps associated with arcs. A set of temporal structures is induced by checking certain conditions on the order of arc appearance, denoting different kinds of user behavior. The distribution of these structures is computed for each node and used to detect anomalies. We point out that the approach investigated here is substantially different from techniques dealing with dynamic networks. Indeed, our aim is not to determine the points in time at which a certain portion of the network (typically a community or a subgraph) exhibited a significant change, as is usually done by dynamic-graph anomaly detection techniques. Rather, our primary aim is to analyze each individual node by taking its temporal footprint into account.; Università della Calabria
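The idea of computing, for each node, a distribution over temporal structures can be sketched as follows. The abstract does not specify the structures or the scoring used in the thesis, so everything here is a hypothetical stand-in: the two structures (`in_before_out`, `out_before_in`), the function names, the scoring by deviation from the mean profile, and the toy arcs are all assumptions for illustration only.

```python
from collections import Counter, defaultdict

STRUCTURES = ("in_before_out", "out_before_in")  # hypothetical structure types

def temporal_profiles(arcs):
    """arcs: iterable of (source, target, timestamp).

    For each node, every (incoming arc, outgoing arc) pair is classified by
    which arc was created first; the result is the normalised distribution
    of the two hypothetical structures. Nodes with no such pair are omitted.
    """
    incoming, outgoing = defaultdict(list), defaultdict(list)
    for src, dst, t in arcs:
        outgoing[src].append(t)
        incoming[dst].append(t)
    profiles = {}
    for node in set(incoming) | set(outgoing):
        counts = Counter()
        for t_in in incoming[node]:
            for t_out in outgoing[node]:
                if t_in < t_out:
                    counts["in_before_out"] += 1
                elif t_in > t_out:
                    counts["out_before_in"] += 1
        total = sum(counts.values())
        if total:
            profiles[node] = {s: counts[s] / total for s in STRUCTURES}
    return profiles

def anomaly_scores(profiles):
    """Score each node by how far its profile lies from the mean profile."""
    mean = sum(p["in_before_out"] for p in profiles.values()) / len(profiles)
    return {node: abs(p["in_before_out"] - mean) for node, p in profiles.items()}

# Hypothetical toy network: a chain a -> b -> c -> d -> e created in order,
# plus node z, which created an outgoing arc before receiving any incoming one.
arcs = [
    ("a", "b", 1), ("b", "c", 2), ("c", "d", 3), ("d", "e", 4),
    ("z", "q", 0), ("e", "z", 5),
]
profiles = temporal_profiles(arcs)
scores = anomaly_scores(profiles)
most_anomalous = max(scores, key=scores.get)
```

In this toy network, node `z` is the only one whose outgoing arc precedes its incoming arc, so it receives the highest score. Note that the score is computed per node from its own temporal profile, not per time window, which mirrors the distinction the abstract draws from dynamic-graph techniques.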

Subject

Data mining

Relation

ING-INF/06;