Data mining techniques for large and complex data

Narvaez Vilema, Miryan Estela; Crupi, Felice; Angiulli, Fabrizio

Please use this identifier to cite or link to this item: https://hdl.handle.net/10955/1875

Full metadata record

DC Field	Value	Language
dc.contributor.author	Narvaez Vilema, Miryan Estela
dc.contributor.author	Crupi, Felice
dc.contributor.author	Angiulli, Fabrizio
dc.date.accessioned	2020-02-19T11:15:06Z
dc.date.available	2020-02-19T11:15:06Z
dc.date.issued	2017-11-13
dc.identifier.uri	http://hdl.handle.net/10955/1875
dc.identifier.uri	https://doi.org/10.13126/unical.it/dottorati/1875	en
dc.description	Dottorato di Ricerca in Information and Communication Engineering For Pervasive Intelligent Environments, Ciclo XXIX	en_US
dc.description.abstract	During these three years of research I dedicated myself to the study and design of data mining techniques for large quantities of data. Particular attention was devoted to training set condensing techniques for the nearest-neighbor classification rule and to techniques for node anomaly detection in networks. The first part of this thesis focuses on the design of strategies to reduce the size of the subset extracted by condensing techniques, and on their experimental evaluation. Training set condensing techniques aim to determine a subset of the original training set that correctly classifies all the training set examples. The subset extracted by these techniques is known as a consistent subset. The result of the research was the development of various subset selection strategies, designed to determine during the training phase the most promising subset based on different methods of estimating test accuracy. Among them, the PACOPT strategy uses the Pessimistic Error Estimate (PEE) to estimate generalization as a trade-off between training set accuracy and model complexity. The experimental phase used the FCNN condensation technique as its reference. Among the condensation methods based on the nearest neighbor decision rule (NN rule), FCNN (Fast Condensed NN) is one of the most advantageous techniques, particularly in terms of time performance. We showed that the designed selection strategies preserve the accuracy of a consistent subset. We also demonstrated that the proposed selection strategies significantly reduce the size of the model. Comparison with notable training-set reduction techniques for the NN rule shows that the strategies introduced here achieve state-of-the-art performance. The second part of the thesis is directed towards the design of analysis tools for network-structured data.
Anomaly detection is an area that has received much attention in recent years. It has a wide variety of applications, including fraud detection and network intrusion detection. Techniques for anomaly detection in static graphs assume that networks do not change and can represent only a single snapshot of the data. As real-world networks are constantly changing, the focus has shifted to dynamic graphs, which evolve over time. We present a technique for node anomaly detection in networks where arcs are annotated with their time of creation. The technique aims at singling out anomalies by simultaneously taking into account information concerning both the structure of the network and the order in which connections have been established. The latter information is obtained from the timestamps associated with arcs. A set of temporal structures is induced by checking certain conditions on the order of arc appearance, which denote different kinds of user behavior. The distribution of these structures is computed for each node and used to detect anomalies. We point out that the approach investigated here is substantially different from techniques dealing with dynamic networks. Indeed, our aim is not to determine the points in time at which a certain portion of the network (typically a community or a subgraph) exhibited a significant change, as is usually done by dynamic-graph anomaly detection techniques. Rather, our primary aim is to analyze each individual node by taking its temporal footprint into account.	en_US
dc.description.sponsorship	Università della Calabria	en_US
dc.language.iso	en	en_US
dc.relation.ispartofseries	ING-INF/06;
dc.subject	Data mining	en_US
dc.title	Data mining techniques for large and complex data	en_US
dc.type	Thesis	en_US
Appears in Collections:	Dipartimento di Ingegneria Informatica, Modellistica, Elettronica e Sistemistica - Tesi di Dottorato
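
The consistent-subset property described in the abstract can be illustrated with a minimal sketch. This is not the FCNN technique or any of the selection strategies from the thesis; it is a simple Hart-style condensing loop, shown only to make the definition concrete. The function names (nearest_label, condense, is_consistent) and the data layout (a list of (features, label) pairs) are assumptions made for illustration, and the sketch assumes that no two training examples share identical features with different labels.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_label(subset, point):
    """Return the label of the example in `subset` closest to `point` (1-NN rule)."""
    return min(subset, key=lambda example: dist(example[0], point))[1]


def condense(training_set):
    """Grow a subset until it classifies every training example correctly.

    Hart-style condensing (an illustrative stand-in, not FCNN): repeatedly
    scan the training set and add every example that the current subset
    misclassifies, until a full pass adds nothing.
    """
    subset = [training_set[0]]
    changed = True
    while changed:
        changed = False
        for example in training_set:
            features, label = example
            # The membership check guards against looping forever on
            # contradictory data (same features, different labels).
            if nearest_label(subset, features) != label and example not in subset:
                subset.append(example)
                changed = True
    return subset


def is_consistent(subset, training_set):
    """True if `subset` classifies all of `training_set` correctly under the NN rule."""
    return all(nearest_label(subset, x) == y for x, y in training_set)


if __name__ == "__main__":
    # Two well-separated toy classes (hypothetical data).
    training = [
        ((0.0, 0.0), "a"), ((0.1, 0.0), "a"), ((0.0, 0.1), "a"),
        ((1.0, 1.0), "b"), ((0.9, 1.0), "b"), ((1.0, 0.9), "b"),
    ]
    subset = condense(training)
    print(len(subset), is_consistent(subset, training))
```

On well-separated data such as the toy example, the condensed subset is much smaller than the training set while still classifying every training example correctly, which is the property the selection strategies in the abstract build on.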

Files in This Item:

File	Description	Size	Format
thesisNarvaezEstela.pdf		1.6 MB	Adobe PDF
