Outliers in Data Mining

Outliers come up in almost any discussion of data analysis. Data mining is the process of analyzing data to extract the information it holds. In a data set, some objects deviate markedly from the rest. These deviating objects are called outliers. They are often produced by measurement errors.

Outliers can also be caused by computational errors or by the incorrect entry of an object. A closely related concept is noise:

  • Noise is random error or variance in a measured variable.
  • When such errors are detected early in a data set, it is advisable to remove the noise.

Types of outliers

Outliers are mainly categorized as univariate or multivariate.

  • Univariate outliers occur in a one-dimensional feature space, that is, in the values of a single variable.
  • Multivariate outliers occur in an n-dimensional feature space, that is, in the combination of values across several variables.

Based on how they arise, outliers can also be classified into:

  • Point outliers
  • Contextual outliers
  • Collective outliers
  1. Point outliers: single data points that lie far away from the rest of the data distribution.
  2. Contextual outliers: data instances that are anomalous only within a specific context or condition, such as background noise in a speech recognition signal. The attributes of a data object fall into two groups, contextual attributes and behavioral attributes. The former define the context, whereas the latter describe the object's characteristics.
  3. Collective outliers: groups of data points that behave anomalously together, even though the individual points may not look unusual on their own. Such a group can signal novelty in the data.

Analysis of Outliers

Outliers are often discarded when data mining methods are applied. They are still valuable in certain applications, such as fraud detection, mainly because rare events can carry more interesting information than regularly occurring ones.

Outlier detection plays a major role in many applications:

  1. Fraud detection in the insurance, credit card, and healthcare sectors.
  2. Fraud detection in telecom.
  3. Cyber security, for detecting any form of intrusion.
  4. Medical analysis.
  5. Fault detection in safety-critical systems.
  6. Marketing, where outlier analysis helps in recognizing customers' spending patterns.
  7. Analysis of unusual responses to medical treatments.

Outlier detection techniques

Many techniques, based on different approaches, are used to detect anomalous behavior in a data set. A few of them are described below.

1. Sorting

Sorting is one of the easiest ways of detecting outliers in data mining.

The method consists of sorting the data by magnitude in any data manipulation tool.

Inspecting the sorted data can reveal objects whose values are much higher or much lower than the rest. These objects can be treated as outliers.
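
The sorting approach can be sketched in a few lines of Python. This is only an illustration: the function name extremes_by_sorting, the parameter k, and the sample readings are assumptions made for this example, and deciding whether an extreme value really is an outlier is still left to the analyst.

```python
def extremes_by_sorting(values, k=3):
    """Return the k smallest and k largest values for manual inspection."""
    ordered = sorted(values)          # ascending order of magnitude
    return ordered[:k], ordered[-k:]  # both tails of the sorted data


# Hypothetical sample data: most readings are close to 13.
readings = [12, 14, 13, 15, 11, 98, 14, 13, -40, 12]

lowest, highest = extremes_by_sorting(readings, k=2)
print("lowest:", lowest)    # candidates at the low end
print("highest:", highest)  # candidates at the high end
```

Here -40 and 98 stand out at the two ends of the sorted list, while 11 and 15 sit close to the rest of the data.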

2. Data graphing for detecting outliers

This technique plots all the data points on a graph, which lets the observer see which points diverge from the other objects in the data set, so outliers are easy to spot. Plots commonly used for outlier detection in data mining include the histogram, the scatter plot, and the box plot. In a histogram, the bulk of the observations falls on one side and the outliers appear apart from it on the other. A scatter plot shows two numerical variables together, which makes the degree of association between them easy to see.
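
A box plot conventionally draws its whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles and shows anything outside them as individual points. The sketch below applies that same rule numerically instead of drawing the plot. The function name iqr_outliers, the multiplier k, and the sample readings are assumptions made for this example, and the exact quartile values depend on the interpolation method chosen.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot whisker rule."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample data: most readings are close to 13.
readings = [12, 14, 13, 15, 11, 98, 14, 13, -40, 12]
print(iqr_outliers(readings))
```

For these readings, the values 98 and -40 fall outside the whiskers and would appear as separate points on a box plot.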

3. Z-Score for detecting outliers

The Z-score measures how far a data point deviates from the sample mean, expressed in standard deviations. A Z-score of 2 indicates an observation lying two standard deviations above the mean, while a Z-score of -2 indicates an observation lying two standard deviations below the mean.
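
As a minimal sketch, the Z-score of each value can be computed as (value - mean) / standard deviation, and values whose absolute Z-score exceeds a cutoff can be flagged. The cutoff of 2, the use of the population standard deviation, the function names, and the sample readings are all assumptions made for this example. A cutoff of 3 is also common in practice.

```python
import statistics


def z_scores(values):
    """Return the Z-score of each value: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (an assumption)
    if std == 0:
        return [0.0 for _ in values]  # all values identical: nothing deviates
    return [(v - mean) / std for v in values]


def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Hypothetical sample data: most readings are close to 13.
readings = [12, 14, 13, 15, 11, 98, 14, 13, 12, 14]
print(z_score_outliers(readings))
```

One caveat with this approach is that extreme values inflate the mean and the standard deviation themselves, so a very large outlier can mask smaller ones.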

4. DBSCAN

DBSCAN stands for density-based spatial clustering of applications with noise. This clustering method is useful for visualizing and understanding data, and it can be used to represent graphically the relationships between features and the trends in a data set.

This density-based clustering algorithm identifies neighboring objects by the density within an n-dimensional sphere of a given radius. A cluster is then a set of points in the feature space that are connected by density. Points that belong to no cluster are labeled as noise and can be treated as outliers.
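
The idea can be illustrated with a small pure-Python sketch of DBSCAN. It is a simplified teaching version and not a library implementation: the names dbscan, eps (the sphere radius), min_pts (the minimum number of points inside the sphere, counting the point itself), and the sample points are assumptions made for this example, and the neighbor search is a brute-force scan that would be too slow for large data sets.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""

    def neighbors(i):
        # All points (including i itself) within distance eps of point i.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels


# Hypothetical 2-D data: two dense groups and one isolated point.
points = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1),
    (9.0, 0.5),
]
labels = dbscan(points, eps=0.5, min_pts=3)
outliers = [p for p, label in zip(points, labels) if label == NOISE]
print(labels)
print(outliers)
```

With these settings the two dense groups become clusters 0 and 1, and the isolated point (9.0, 0.5) is labeled as noise. The result is sensitive to the choice of eps and min_pts, which usually have to be tuned for each data set.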

Questions

1. What are outliers?

2. What are the advantages of outlier analysis?
