Effective Histogram-based Techniques for Summarizing Multi-dimensional Data
Created by
Mazzeo, Giuseppe Massimiliano
Saccà, Domenico
Talia, Domenico
Description
Doctoral thesis in Systems and Computer Engineering (Ingegneria dei sistemi ed informatica), XIX Cycle; This thesis is an effort towards the definition of effective summarization techniques for multi-dimensional data. The state-of-the-art techniques are examined and the issues related to their inability to effectively summarize multi-dimensional data are pointed out. The attention is particularly focused on
histogram-based summarization techniques, which are the most flexible, the
most studied and the most adopted in commercial systems. In particular, hierarchical
binary partitions are studied as a basis for effective multi-dimensional
histograms, focusing the attention on two aspects which turn out to be crucial
for histogram accuracy: the representation model and the strategy adopted for
partitioning data into buckets. As regards the former, a specific space-efficient representation model is proposed, where bucket boundaries are represented
implicitly by storing the partition tree. Histograms adopting this representation model (referred to as Hierarchical Binary Histograms, HBH) can store a larger number of buckets within a given amount of memory w.r.t. histograms using a “flat” explicit storage of bucket boundaries. On top
of that, the introduction of a constraint on the hierarchical partition scheme
is studied, allowing each bucket to be partitioned only by splits lying on a regular grid defined on it: histograms adopting such a constrained partitioning paradigm will be referred to as Grid Hierarchical Binary Histograms (GHBH).
The grid-constrained partitioning of GHBHs can be exploited to further enhance the physical representation efficiency of HBHs. As
regards the construction of effective partitions, some new heuristics are introduced,
guiding the data summarization by locating inhomogeneous regions of
the data where a finer-grain partition is needed.
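To make this construction concrete, the following Python sketch (hypothetical code, not the thesis implementation: the names ghbh_sketch, region, sse, the SSE-based heuristic and the grid degree GRID are illustrative assumptions) shows a greedy grid-constrained hierarchical binary partitioning of a multi-dimensional frequency array, where bucket boundaries are kept implicit in the split log rather than stored explicitly.

import numpy as np

GRID = 4  # hypothetical grid degree: candidate cuts lie on the GRID-1 inner lines

def region(data, lo, hi):
    """Extract the sub-array delimited by the (implicit) bucket boundaries."""
    return data[tuple(slice(a, b) for a, b in zip(lo, hi))]

def sse(block):
    """Inhomogeneity measure: sum of squared errors w.r.t. the bucket mean."""
    return float(((block - block.mean()) ** 2).sum()) if block.size else 0.0

def best_grid_split(data, lo, hi):
    """Cheapest grid-aligned binary split (cost, axis, cut) of bucket [lo, hi)."""
    best = None
    for axis in range(len(lo)):
        for k in range(1, GRID):
            cut = lo[axis] + (hi[axis] - lo[axis]) * k // GRID
            if cut <= lo[axis] or cut >= hi[axis]:
                continue
            left_hi = list(hi); left_hi[axis] = cut
            right_lo = list(lo); right_lo[axis] = cut
            cost = sse(region(data, lo, left_hi)) + sse(region(data, right_lo, hi))
            if best is None or cost < best[0]:
                best = (cost, axis, cut)
    return best

def ghbh_sketch(data, n_buckets):
    """Greedy top-down construction: repeatedly split the most inhomogeneous
    leaf along its best grid-aligned line until the bucket budget is spent.
    The split log plays the role of the partition tree: boundaries are never
    stored explicitly, they can be recovered by replaying the splits."""
    leaves = [(tuple([0] * data.ndim), data.shape)]
    splits = []
    while len(leaves) < n_buckets:
        i = max(range(len(leaves)), key=lambda j: sse(region(data, *leaves[j])))
        lo, hi = leaves[i]
        found = best_grid_split(data, lo, hi)
        if found is None:
            break                                  # no splittable leaf remains
        _, axis, cut = found
        left_hi = list(hi); left_hi[axis] = cut
        right_lo = list(lo); right_lo[axis] = cut
        leaves[i:i + 1] = [(lo, tuple(left_hi)), (tuple(right_lo), hi)]
        splits.append((i, axis, cut))
    buckets = [(lo, hi, float(region(data, lo, hi).sum())) for lo, hi in leaves]
    return splits, buckets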
The two physical representation schemes adopted by HBH and GHBH can
be viewed as a form of lossless compression to be used on top of the summarization
accomplished by histograms (which is a form of lossy compression).
The combination of these two forms of compression is shown to result in a significant improvement of histogram effectiveness. On the one hand, the proposed
compression-based representation models provide a mechanism for efficiently
locating the buckets involved in query estimation, thus reducing the amount
of time needed to estimate queries w.r.t. traditional flat representation models.
On the other hand, applying lossless compression on top of summarization
reduces the loss of information due to summarization, as it enables a larger
amount of summary data to be stored within a given storage space bound:
this turns out to yield lower error rates of query estimates.
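For concreteness, here is a minimal sketch of the estimation step, under the standard assumption that data are uniformly distributed within each bucket; it reuses the (lo, hi, sum) buckets produced by the hypothetical ghbh_sketch above. In the actual HBH/GHBH representations the buckets involved in a query would be located by descending the partition tree and pruning disjoint subtrees, while this sketch simply scans a flat bucket list.

def estimate_sum(buckets, q_lo, q_hi):
    """Estimate a range-sum query under the uniformity assumption: each
    bucket contributes its stored sum scaled by the fraction of its volume
    overlapping the query range [q_lo, q_hi)."""
    est = 0.0
    for lo, hi, total in buckets:
        frac = 1.0
        for a in range(len(lo)):
            overlap = min(hi[a], q_hi[a]) - max(lo[a], q_lo[a])
            if overlap <= 0:
                frac = 0.0                 # bucket disjoint from the query
                break
            frac *= overlap / (hi[a] - lo[a])
        est += total * frac
    return est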
By means of experiments, a thorough analysis of different classes of histograms
based on hierarchical partitions is provided: the accuracy obtained by combining different heuristics (both the new proposals and the “classical” heuristics of two well-known techniques, namely MHIST and Min-Skew) with either the traditional MBR-based representation model or the novel tree-based ones (both the unconstrained and the grid-constrained one)
is studied. These results provide an insight into the value of compression in
the context of histograms based on hierarchical partitions. Interestingly, it
is shown that the impact of both HBH and GHBH representation models
on the accuracy of query estimates is not simply orthogonal to the adopted
heuristic. Thus, the best combination of these different features is identified, which turns out to be the grid-constrained hierarchical partitioning of GHBHs guided by one of the new heuristics.
GHBH is compared with state-of-the-art techniques (MHIST, Min-Skew,
GENHIST, as well as wavelet-based summarization approaches), showing that the new technique results in much lower error rates and a satisfactory degree of accuracy even in high-dimensionality scenarios.
Another important contribution of this thesis consists in the proposal of a
new approach for constructing effective histograms. The superiority of GHBH
w.r.t. the other histogram-based techniques has been found to depend primarily on the more accurate criterion it adopts for guiding the partitioning of the data domain. In fact, traditional techniques for constructing histograms often
yield partitions where dense and sparse regions are put together in the same
bucket, thus yielding poor accuracy in estimating queries on summarized data.
Although GHBH adopts a criterion which in theory avoids this situation, there is an intrinsic limit in all top-down partitioning techniques. That is, histograms obtained by iteratively splitting blocks, starting from the block coinciding with the whole data domain, may be unable to actually reach all the dense regions in order to isolate them. In fact, each split increases the number of buckets by one and, as the number of buckets is bounded, the number of splits that can be performed is bounded as well: a budget of B buckets permits only B-1 binary splits. Therefore, in large domains where data are particularly skewed, the number of available splits may not be large enough for a top-down split sequence to reach all the dense regions. Thus, it can happen that GHBH starts partitioning the data domain in the right direction, the one leading to the isolation of dense regions, but at a certain point the budget of available buckets, and thus of available splits, is exhausted.
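A small illustration of this saturation effect, reusing the hypothetical ghbh_sketch defined earlier (the domain size, spike count and budget are arbitrary choices for the example): a budget of 16 buckets allows only 15 splits, which cannot isolate 32 well-separated dense cells.

import numpy as np

rng = np.random.default_rng(0)
freq = np.zeros((256, 256))
cells = rng.integers(0, 256, size=(32, 2))
freq[cells[:, 0], cells[:, 1]] = 1000.0      # 32 scattered dense cells

splits, buckets = ghbh_sketch(freq, 16)      # 16 buckets -> at most 15 splits
print(len(splits), "splits performed for", len(buckets), "buckets")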
This problem could be avoided by adopting a bottom-up strategy, which
first locates the dense regions of the data, and then aggregates them into
buckets according to some suitable strategy. The problem of searching dense
regions is very close to the data clustering problem, that is the problem of
grouping database objects into a set of meaningful classes.
Histogram construction is thus enhanced by exploiting the capability of clustering techniques to locate dense regions. A new technique, namely CHIST (Clustering-based Histograms), is proposed for constructing multi-dimensional histograms on the basis of a well-known density-based clustering algorithm, namely DBSCAN. The CHIST algorithm first invokes DBSCAN to partition the data into dense and sparse regions, and then further refines this partitioning by adopting a grid-based paradigm.
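The following sketch conveys this two-phase, bottom-up construction, assuming scikit-learn's DBSCAN as the clustering step; the name chist_sketch and the parameter values are illustrative assumptions, and the thesis' CHIST handles the sparse region and the memory budget more carefully than this outline does.

import numpy as np
from sklearn.cluster import DBSCAN

def chist_sketch(points, eps=2.0, min_samples=5, grid=4):
    """Bottom-up construction: DBSCAN separates dense regions from noise,
    then each cluster's bounding box is refined on a regular grid,
    storing one count per non-empty cell."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    buckets = []
    for c in set(labels) - {-1}:               # -1 marks sparse (noise) points
        cluster = points[labels == c]
        lo, hi = cluster.min(axis=0), cluster.max(axis=0)
        extent = np.maximum(hi - lo, 1e-9)     # avoid zero-width boxes
        cells = np.minimum((cluster - lo) / extent * grid, grid - 1).astype(int)
        for cell, count in zip(*np.unique(cells, axis=0, return_counts=True)):
            buckets.append((lo + cell * extent / grid,
                            lo + (cell + 1) * extent / grid,
                            int(count)))
    sparse = points[labels == -1]              # summarized separately in CHIST
    return buckets, sparse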
CHIST is compared to GHBH and it is shown to provide lower error rates,
especially in “critical” settings, that is when query selectivity is particularly
low or the compression ratio is very high. It is worth remarking that in these
settings, experiments comparing GHBH to the other techniques showed that
GHBH still provides acceptable error rates, while those provided by other
techniques are completely unacceptable. CHIST is also extended to the case where the data to be summarized are dynamic. In this case, re-executing the clustering algorithm at each data update could be prohibitive, due to the high computational cost of this task. Thus, on the basis of the Incremental DBSCAN algorithm, a strategy for efficiently propagating data updates to the histogram is proposed. By means of experiments it is shown that, for small update batches (i.e., batches 100 times smaller than the overall data size), the incremental approach runs 100 to 200 times faster than the from-scratch re-computation of the histogram, while accuracy remains almost unaffected.
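The following hypothetical outline (not the thesis algorithm: Incremental DBSCAN maintains clusters under insertions and deletions in a considerably more sophisticated way) only conveys the intuition of update propagation: updates falling inside an existing bucket adjust its stored count in place, and only the remaining points trigger local re-clustering.

def apply_updates(buckets, updates):
    """Hypothetical propagation of a small update batch: an update falling
    inside an existing bucket only adjusts that bucket's count, while the
    remaining points are returned for local re-clustering, so that the
    clustering never has to be re-run on the whole data set."""
    pending = []
    for point, delta in updates:               # delta: +1 insertion, -1 deletion
        for i, (lo, hi, count) in enumerate(buckets):
            if all(lo[a] <= point[a] < hi[a] for a in range(len(point))):
                buckets[i] = (lo, hi, count + delta)
                break
        else:
            pending.append((point, delta))     # falls outside every bucket
    return pending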
Università della Calabria
Subject
multi-dimensional data compression; OLAP applications
Relation
SSD ING-INF/05;