Outliers come up in almost any discussion of data analysis. Outlier analysis is the part of data mining that examines a dataset for the information it holds about unusual objects. Some objects in a dataset deviate markedly from the rest, and these deviating objects are termed outliers. They are often generated by measurement errors.
Computational errors and incorrect data entry can also cause outliers. A closely related notion is noise:
- A random error or variance in a measured variable is termed noise.
- It is advisable to detect such errors in the dataset early and remove the noise.
Types of outliers
Outliers are mainly categorized as univariate outliers and multivariate outliers.
- Univariate outliers occur in a one-dimensional feature space, that is, in the values of a single variable.
- Multivariate outliers occur in an n-dimensional feature space, that is, in the combination of values of several variables.
Based on their nature, outliers may also be classified into:
- Point outliers
- Contextual outliers
- Collective outliers
- Point outliers: single data points that lie far away from the rest of the data distribution.
- Contextual outliers: data instances that are anomalous only within a specific context or condition, for example background noise in speech recognition. The attributes of a data object are of two types, contextual attributes and behavioral attributes. The former define the context, whereas the latter describe the object's characteristics.
- Collective outliers: a group of data points that behave anomalously together, even if each point looks ordinary on its own. Such a group can indicate some novelty in the data.
Analysis of Outliers
Outliers are often discarded when data mining methods are applied. They are still valuable in certain applications such as fraud detection, mainly because rarely occurring events can carry more interesting information than regularly occurring ones.
Outlier detection plays a major role in many applications:
- Fraud detection in the insurance, credit card, and healthcare sectors.
- Fraud detection in telecom.
- Cyber security, for detecting any form of intrusion.
- Medical analysis.
- Fault detection in safety-critical systems.
- Marketing, where outlier analysis helps in recognizing customers' spending patterns.
- Analysis of unusual responses to medical treatments.
Outlier detection techniques
Many techniques, based on different approaches, are applied to detect anomalous behavior in a dataset. A few of them are described here.
1. Sorting
Sorting is one of the easiest ways of detecting outliers in data mining.
The method consists of sorting the data by magnitude in any data manipulation tool.
Inspecting the sorted data can reveal objects whose values are much higher or much lower than the rest. These objects are treated as outliers.
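The sorting approach can be sketched in a few lines of Python. The sample values and the choice of inspecting three values at each end are assumptions made for illustration only; in practice the judgment of what counts as "much higher" or "much lower" is made by the analyst.

```python
# Hypothetical sample: daily order amounts with one suspiciously large entry.
amounts = [23.5, 19.0, 25.1, 21.7, 980.0, 22.4, 20.3, 24.8]

# Sort by magnitude so that extreme values end up at the two ends.
ordered = sorted(amounts)

# Inspect a few values at each end (k = 3 is an arbitrary choice).
k = 3
lowest = ordered[:k]
highest = ordered[-k:]

print("lowest :", lowest)
print("highest:", highest)
```

In this sample the value 980.0 stands out at the high end of the sorted list, while the low end shows nothing unusual.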
2. Data graphing for detecting outliers
This technique plots all the data points on a graph, which lets the observer see which points diverge from the other objects in the dataset, so outliers are easier to spot. The plots used for detecting outliers include the histogram, the scatter plot, and the box plot. In a histogram, the bulk of the observations falls on one side while the outliers appear on the other. A scatter plot takes two numerical variables and shows the degree of association between them.
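As a rough, text-only stand-in for these plots, the sketch below prints a crude histogram and computes the fences that a box plot commonly draws. The sample values, the bin width of 10, and the 1.5 x IQR fence rule are assumptions for illustration; a real analysis would normally use a plotting library.

```python
import statistics
from collections import Counter

# Hypothetical sample; 45 is far from the other values.
values = [12, 13, 13, 14, 15, 15, 16, 17, 18, 45]

# Crude text histogram with bins of width 10 (bin width is an arbitrary choice).
bins = Counter((v // 10) * 10 for v in values)
for start in sorted(bins):
    print(f"{start:>3}-{start + 9:<3} | {'#' * bins[start]}")

# Box-plot style fences, assuming the common 1.5 * IQR convention.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are the ones a box plot would draw separately.
flagged = [v for v in values if v < lower_fence or v > upper_fence]
print("flagged:", flagged)
```

With this sample, nine values land in the 10-19 bin and only 45 falls outside the fences, which mirrors what the histogram and box plot would show visually.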
3. Z-score for detecting outliers
The Z-score measures how far a data point deviates from the sample mean, expressed in standard deviations. A Z-score of 2 indicates an object lying two standard deviations above the mean, while a Z-score of -2 indicates an observation lying two standard deviations below the mean.
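A minimal sketch of the calculation, z = (x - mean) / standard deviation, is given here. The sample values, the use of the population standard deviation, and the cutoff of 2 are assumptions for illustration; other cutoffs such as 3 are also common.

```python
import statistics

# Hypothetical sample; 30 is far from the other values.
values = [10, 12, 11, 13, 12, 11, 10, 12, 11, 30]

mean = statistics.mean(values)
# Population standard deviation; statistics.stdev (sample) is another option.
std = statistics.pstdev(values)

# Z-score: number of standard deviations each value lies from the mean.
z_scores = [(v - mean) / std for v in values]

threshold = 2.0  # assumed cutoff
outliers = [v for v, z in zip(values, z_scores) if abs(z) > threshold]

for v, z in zip(values, z_scores):
    print(f"{v:>3}  z = {z:+.2f}")
print("outliers:", outliers)
```

Taking the absolute value of the Z-score flags points on both sides of the mean, so unusually low values are caught as well as unusually high ones.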
4. DBSCAN
DBSCAN is a clustering approach whose name stands for density-based spatial clustering of applications with noise. Clustering methods of this kind are useful for visualizing and understanding data, and DBSCAN can be used to represent graphically the relationships between features and the trends in a dataset.
This density-based clustering algorithm identifies the neighbors of an object by the density inside an n-dimensional sphere of a given radius. A cluster is then the set of points in the feature space that are connected by density. Points that belong to no cluster are labeled as noise and can be treated as outliers.
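The idea can be sketched in plain Python. This is a simplified teaching version, not the implementation of any particular library; the function names, the parameters eps (the radius) and min_pts (the minimum number of points that makes a neighborhood dense), and the sample points are all assumptions for illustration.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i], including i."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not dense
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels

# Hypothetical 2-D data: two dense groups and one isolated point.
points = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1),
    (9.0, 1.0),
]

labels = dbscan(points, eps=0.5, min_pts=3)
outliers = [p for p, label in zip(points, labels) if label == -1]
print("labels  :", labels)
print("outliers:", outliers)
```

With these sample points, the two dense groups become two clusters and the isolated point (9.0, 1.0) is labeled -1, that is, noise. The result depends heavily on eps and min_pts, so both usually need tuning for a given dataset.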
Questions
1. What are outliers?
2. What are the advantages of outliers?