Data Cleaning. Ihab F. Ilyas

Читать онлайн.
Название Data Cleaning
Автор произведения Ihab F. Ilyas
Жанр Базы данных
Серия
Издательство Базы данных
Год выпуска 0
isbn 9781450371544



Скачать книгу

avoid searching the exponentially large number of regions and computing S(D) for each region in a brute-force manner, evolutionary algorithms are used [Aggarwal and Yu 2001] to select regions with low S(D). In evolutionary methods, every solution or candidate to an optimization is treated as an individual in an evolutionary system. The “fitness” of an individual is exactly the objective function value of the corresponding solution. An evolutionary algorithm uses mechanisms inspired by biological evolution, including mutation, recombination, and selection. Accordingly, the candidate solutions to an optimization problem are mutated, recombined, and selected according to the objective function until a convergence criterion is met. Every candidate solution is usually encoded as a string. Algorithm 2.3 describes the procedure for detecting outliers in a high-dimensional dataset using the evolutionary method. For a d-dimensional dataset, the encoding string will be of length d and contain k specified positions, where k is a user provided input specifying the number of dimensions of interest. The fitness for the corresponding solution is computed using the sparsity coefficient S(D). The evolutionary search procedure starts with a population of p randomly selected solutions and iteratively uses the process of selection, crossover, and mutation until a convergence criterion is met (e.g., after a maximum number of iterations). The selection process works by ranking current solutions according to the S(D), and select higher ranked solution with a higher probability; the crossover process uses a two-point crossover procedure. At the end of the evolutionary algorithm, all data points contained in the final solutions are reported as outliers.

Image

      Example 2.10 Consider a dataset with d = 5 dimensions in total and ϕ = 10 ranges for each dimension, and we are interested in subspaces with k = 3 dimensions. Every candidate solution will be encoded using a string of length 5, and every position in a string represents which range is selected for every dimension. For instance, a candidate solution, encoded as 3*2*1, represents a region with three dimensions, where the first dimension takes the third range, the third dimension takes the second range, and the fifth dimension takes the first range.

      Performing a two-point crossover between two solutions 3*2*1 and 1*33* after the third position will result in two new solutions 3*23* and 1*3*1; and the mutation process randomly flips a position to a number between 1 to φ with a predefined mutation probability.

      Algorithm 2.3 is an example of the first use case (cf. Section 2.5.1). OutRank [Muller et al. 2008] is another example of the first use case, which relies on subspace clustering approaches to find dense clusters in subspaces [Assent et al. 2007, Assent et al. 2008] and treats data points that are not in those dense clusters as outliers. Clusters are frequent objects (as opposed to outliers), and hence are amenable to the downward closure property. The outlyingness of an object is calculated based on how often the object is part of dense clusters, the dimensionality of the dense clusters, and the size of the dense clusters. In OutRank, outliers are side-products of a subspace clustering algorithm, which may not be satisfactory. For example, in a dataset where no subspace clusters are present, all objects may be reported as subspace outliers.

      Instead of searching for relevant subspaces first and then finding outliers within those subspaces, HOS-Miner [Zhang et al. 2004] and SOD [Kriegel et al. 2009] are designed to find subspaces in which a given data point is an outlier. For a data point x, HOS-Miner aims at finding all subspaces such that the sum of its k-nearest neighbor distances in that subspace is at least δ. This approach does not normalize the distances with the number of dimensions, and hence a subspace with more dimensions is more likely to be in the output. One advantage of HOS-Miner is that the outlier definition it adopts exhibits the downward closure property—any subspace of a non-outlying subspace is also not outlying and every superset of an outlying subspace is also outlying. Only minimal subspaces, in which the given data point χ is an outlier, are interesting. HOS-Miner users an X-Tree to perform k-nearest neighbor searches efficiently. HOS-Miner also uses the fixed threshold δ to discern outliers in all subspaces, which could be problematic since these scores are rather incomparable in different subspaces. SOD [Kriegel et al. 2009] also aims at finding subspaces in which a given data point x is an outlier. Given the data point x, a set of reference data points S (x) are determined, which represent the proximity of the current data point x. Based on S (x), the relevant subspace for S (x) is determined as the set of dimensions in which the variance is small among points in S (x). If the query point x deviates considerably from the S(x) in the relevant subspace, then x is said to be an outlier in that subspace.

      While subspace outlier detection seems a viable approach for detecting outliers in high-dimensional data, it still faces many challenges: (1) In spite of recent advances, the combinatorial nature of the problem requires more efficient search procedures of possible subspaces. (2) Almost all the current techniques adopt their own definitions of outliers in subspaces mainly for the purpose of effective enumeration or search of potential subspaces. For example, the sparsity coefficient is used as the fitness function in the evolutionary algorithm [Aggarwal and Yu 2001], and HOS-Miner [Zhang et al. 2004] is able to exploit the downward closure property based on its unique definition of subspace outliers. How to efficiently search possible subspaces under a general definition of multivariate outliers remains an open question. (3) A point might be considered an outlier in different subspaces. Therefore, one may combine results from these (incomparable) subspaces and rank subspace outliers effectively. This can also be seen as an opportunity since results from multiple subspaces may indicate more robust outliers.

      The notion of contextual outliers was first studied in the context of time-series data [Weigend et al. 1995, Salvador et al. 2004] and spatial data [Shekhar et al. 2001, Kou et al. 2006], where the contextual attribute is the time dimension or the spatial dimension. For example, a certain temperature might not be considered high through the year, but is considered high for the winter months. Recently, many proposals have been made on discovering contextual outliers with general environmental attributes; some target the data exploration use case [Wei et al. 2003, Song et al. 2007, Tang et al. 2013, Liang and Parthasarathy 2016] (cf. Section 2.5.1), whereas some target the explanation and diagnosis use case [Angiulli et al. 2009, Zhu et al. 2004]. For proposals targeting the data exploration case, some assume that the environmental attributes and the behavioral attributes are given as input [Song et al. 2007, Liang and Parthasarathy 2016], while others explore the space of possible environmental and behavioral attributes to identify contextual outliers [Wei et al. 2003, Tang et al. 2013]. The goal of the proposals targeting the explanation and diagnosis use case is to discover the environmental attributes and the behavioral attributes in which a given object is an outlier. In what follows, we give an example algorithm for when the environmental attributes and the behavioral attributes are given [Song et al. 2007], and an example algorithm for when they are not [Wei et al. 2003].

      Song et al. [2007] learn a correlation between the provided environmental attributes and the provided behavioral attributes, and define contextual outliers as those data points that deviate from the learned correlation. The correlation is modeled using a Gaussian mixture model, and the parameters of the model are learned using expectation-maximization methods. Example 2.11 shows an application where the correlation between the environmental attributes and the behavioral attributes can be used to detect interesting contextual outliers.

Image