Data Cleaning. Ihab F. Ilyas

Читать онлайн.
Название Data Cleaning
Автор произведения Ihab F. Ilyas
Жанр Базы данных
Серия
Издательство Базы данных
Год выпуска 0
isbn 9781450371544



Скачать книгу

contextual outlier detection algorithm would identify income to be the environmental attributes and tax to be the behavioral attributes. t5 is then reported as a contextual outlier with respect to the tax attribute within the context income > 100; in other words, among the tuples with income > 1000, t5 has an abnormal tax.

      In Example 2.9, the same outlier could be detected by both subspace outlier detection techniques and contextual outlier detection techniques. Subspace outlier detection techniques usually need to enumerate all potentially interesting subspaces. Contextual outlier detection techniques not only need to enumerate possible environmental attributes and behavioral attributes, but also need to enumerate contexts based on the environmental attributes. Contextual outliers are generally more interpretable than subspace outliers (e.g., t5 in Example 2.9).

      We first lay out two commonly found use cases for high-dimensional outlier detections in Section 2.5.1. We then discuss in detail subspace outlier detection techniques in Section 2.5.2 and contextual outlier detection techniques in Section 2.5.3.

      Techniques for detecting outliers in high-dimensional data often find two general use cases, regardless of whether they are looking for subspace outliers or contextual outliers.

      Use Case 1: High-dimensional outlier detection for data exploration: One of the main characteristics of big data exploration tasks, in contrast to querying, is the fact that analysts do not necessarily know what they are looking for. Traditional outlier detection techniques are limited since they require users to specify the interesting attributes. Consider an analyst performing market research who wishes to determine which companies are unusually profitable. Since companies from different sectors may not be comparable in profits, the analyst might answer this query by running traditional outlier detection queries multiple times, each time on a subset of companies belonging to a specific sector (e.g., technology and media) to identify the most profitable companies within each sector. There are potentially a large number of interesting subgroups of which technology and media companies are only two possible subgroups. In addition, other grouping criteria (e.g., based on location) might also reveal outliers. Instead of relying on analysts to come up with subspaces and contexts, analysts need tools that can discover subspaces and contexts which contain outliers. In the previous example, while Company X might have “normal” reported profit compared to all technology companies—with normal defined, for example, as being within two standard deviations from the mean—it may have very unusual (outlying) profit when compared to companies with less than 10,000 employees. In this case, if a tool could produce [Business = “technology” л EmployeeCount < 1000], this may be an automatically discovered context of interest to the analyst.

      Use Case 2: High-dimensional outlier detection for targeted explanation and diagnosis: Analysts often would like to answer the question “What is so special about this entity or record?,” a key question in explaining analytics results or diagnosing reported errors or anomalies. For example, an analyst performing troubleshooting at a call center may wish to understand why an important customer in industrial manufacturing is calling the troubleshooting hotline. Using conventional tools, the analyst might check whether a few of the customer’s key performance metrics (e.g., quality of service) are abnormal. However, this approach is brittle; for example, high-value clients may naturally exhibit deviations from the overall population. Instead, high-dimensional outlier analysis can take into account other dimensions with the performance metric dimensions to reveal the subspaces or the contexts in which the client is meaningfully outlying. For example, a tool might discover that the client is experiencing an unusually high rate of poor-quality service compared to users in his location using his hardware make and model.

      Problem formulations for high-dimensional outlier detection generally fall into one of the two use cases, as we show in the following when we discuss in detail subspace outlier detection techniques and contextual outlier detection techniques.

      To better appreciate the challenges in detecting subspace outliers, consider the four different two-dimensional views of a hypothetical dataset in Figure 2.6 as an example [Aggarwal 2013]. We see that Point A is considered an outlier in View 1 and Point B is considered an outlier in View 4, whereas neither point is considered an outlier in View 2 and View 3. The example shows that different data points may be considered outliers in different subspaces. For datasets with high dimensionality, it is very likely that only a small fraction of all possible subspaces contains interesting outliers.

Image

      Detecting outliers in subspaces is a challenging problem mainly due to the fact that the number of possible subspaces is exponential with respect to the number of dimensions in the data. It is worth noting that there have been many proposals for finding clusters in subspaces [Parsons et al. 2004], and many techniques exploit the downward closure properties (a cluster is dense is subspace AB only if it is dense in subspace A and subspace B). Finding outliers in subspaces is even more challenging than finding clusters in subspaces. This is because outliers, by definition, are rare events in contrast to clusters, which are dense regions; hence, the downward closure property is usually not applicable to detecting outliers in subspaces. An effective outlier detection method needs to search the subspaces in such a way that the outliers are revealed as quickly as possible. In what follows, we present in detail a first algorithm [Aggarwal and Yu 2001] that achieves this goal, and we briefly discuss other proposals [Zhang et al. 2004, Muller et al. 2008, Kriegel et al. 2009].

      Aggarwal and Yu [2001] discover outliers in subspaces by finding localized regions of the data in subspaces with abnormally low density. Data points contained in low density regions are identified as outliers. To determine abnormally low density regions in subspaces, a grid-based approach is used to perform a discretization of the data. Each dimension of the data is divided into ϕ ranges of equi-depth, and hence each range contains image of the records. These ranges from one dimension form the localized regions in subspaces. Consider a k-dimensional region that is created by picking grid ranges from k different dimensions. The expected fraction of records in that region is fk, assuming that the attributes are statistically independent of each other. Of course, the attributes are far from statistically independent of each other, and thus the distribution of points in that region would differ significantly from the expected behavior. It is precisely those regions with abnormally low density that are useful for identifying outliers.

      Formally, under the independence assumption, the presence of any data point in a k-dimensional region is a Bernoulli random variable with probability fk. Therefore, the expected fraction and standard deviation of data points in the k-dimensional region are N · fk and image, respectively, where N is the total number of data points. Let n(D) be the actual number of data points in a k-dimensional region D. The sparsity coefficient S(D) of D is defined as follows:

image

      Assuming n(D) fits a normal distribution, the normal distribution tables can be used to quantify the probabilistic level of significance of its deviation. Aggarwal and Yu [2001] aim to find regions with low S(D).