Data Cleansing Using Pandas

Data scientists spend a large amount of their time cleaning datasets and getting them down to a form with which they can work. A lot of data scientists argue that the initial steps of obtaining and cleaning data constitute 80% of the job. Therefore, if you are just stepping into this field or planning to step into this field, it is important to be able to deal with messy data, whether that means missing values, inconsistent formatting, malformed records, or nonsensical outliers.

In this article, we will cover a few pandas functions that are used to clean data.

Functions Used for Data Cleaning

After reading the dataset into a DataFrame using pd.read_csv(), we will clean the data using different functions.
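As a minimal sketch of that first step, the snippet below reads CSV data into a DataFrame. The CSV text is a made-up stand-in for a real file (an assumption for illustration); in practice you would pass a file path to pd.read_csv().

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "A,B\n1,4\n2,5\n"

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```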

Why Should we Rename Columns and Index

If your data was generated by a computer program, it probably has computer-generated column names, too. Those can be hard to read and understand while working, so if you want to rename a column to something more user-friendly, you can do it using df.rename().

Consider the following DataFrame

   A  B
0  1  4
1  2  5

df.rename(columns={"A": "a", "B": "b"})

Output:

   a  b
0  1  4
1  2  5

We can also rename the index labels using .rename(). Continuing with the renamed DataFrame:

df.rename(index={0: "x", 1: "y"})

Output:

   a  b
x  1  4
y  2  5
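The two renames can be combined into one call. The sketch below rebuilds the small example DataFrame and stores the result in a new variable (the name renamed is only illustrative), because .rename() returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [4, 5]})

# rename() returns a new DataFrame; the original keeps its labels,
# so the result has to be assigned to be kept.
renamed = df.rename(columns={"A": "a", "B": "b"}, index={0: "x", 1: "y"})

print(renamed)
print(df.columns.tolist())  # the original column names are unchanged
```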

Missing data is always a problem in real-life scenarios. Areas like machine learning and data mining face severe issues in the accuracy of their model predictions because of the poor quality of data caused by missing values. In these areas, missing value treatment is a major point of focus to make their models more accurate and valid.

When and Why Is Data Missed?

Consider an online survey for a product. People often do not share all the information asked of them. Some share their experience but not how long they have been using the product; others share how long they have been using the product and their experience, but not their contact information. In one way or another, part of the data is always missing, and this is very common in real-world datasets.

In pandas, missing data is represented by two values:

None: None is a Python singleton object that is often used for missing data in Python code. 

NaN: NaN (an acronym for Not a Number), is a special floating point value recognized by all systems that use the standard IEEE floating-point representation.
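A short sketch of how pandas treats the two markers. The Series below are made-up examples; the point is that both None and NaN are reported as missing by .isnull().

```python
import numpy as np
import pandas as pd

# In a numeric Series, None is converted to NaN and the dtype becomes float64.
numeric = pd.Series([1.0, None, np.nan])
print(numeric.dtype)
print(numeric.isnull().tolist())

# In an object Series, None and NaN are both stored as-is,
# and both are still detected as missing.
mixed = pd.Series(["a", None, np.nan], dtype=object)
print(mixed.isnull().tolist())
```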

Let us now see how to identify missing values in our Data Set

Consider the following Data Set

        one       two     three
a -1.359063  1.613255 -0.669396
b       NaN       NaN       NaN
c  0.885117  0.609271  0.330818
d       NaN       NaN       NaN
e -0.136086  1.132808  0.496091
f  0.210065  0.533174  0.111560
g       NaN       NaN       NaN
h  1.027689  0.630037  0.727022

Now we will find the missing values in the dataset using the .isnull() function.

df.isnull()

Output:

     one    two  three
a  False  False  False
b   True   True   True
c  False  False  False
d   True   True   True
e  False  False  False
f  False  False  False
g   True   True   True
h  False  False  False

.isnull() checks every cell for missing values and returns a boolean DataFrame of the same shape, with True where a value is missing (NaN or None) and False otherwise.

We can also use the .notnull() function, which is the opposite of .isnull(): it marks the values that are not null.

df.notnull()

Output:

     one    two  three
a   True   True   True
b  False  False  False
c   True   True   True
d  False  False  False
e   True   True   True
f   True   True   True
g  False  False  False
h   True   True   True

.notnull() checks every cell and returns a boolean DataFrame with True for every non-null value and False for every null value.
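The boolean result is most useful when it is summarized. The sketch below rebuilds the example dataset and uses .isnull() and .notnull() to count missing values per column and to select rows; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    {
        "one": [-1.359063, nan, 0.885117, nan, -0.136086, 0.210065, nan, 1.027689],
        "two": [1.613255, nan, 0.609271, nan, 1.132808, 0.533174, nan, 0.630037],
        "three": [-0.669396, nan, 0.330818, nan, 0.496091, 0.111560, nan, 0.727022],
    },
    index=list("abcdefgh"),
)

# True counts as 1 when summed, so this gives the missing count per column.
missing_per_column = df.isnull().sum()

# Rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]

# Rows in which every value is present.
complete_rows = df[df.notnull().all(axis=1)]

print(missing_per_column)
print(rows_with_missing.index.tolist())
print(complete_rows.index.tolist())
```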

How to Drop Rows with NaN Values

There are several options for handling missing values, each with its pros and cons. The right choice depends largely on the nature of the data and of the missing values. The two main options are:

  • Drop the missing values
  • Fill the missing values

Drop the missing values

The .dropna() function drops rows (or, with axis=1, columns) that contain null values.

Consider the following Data Set

        one       two     three
a -1.359063  1.613255 -0.669396
b       NaN       NaN       NaN
c  0.885117  0.609271  0.330818
d       NaN       NaN       NaN
e -0.136086  1.132808  0.496091
f  0.210065  0.533174  0.111560
g       NaN       NaN       NaN
h  1.027689  0.630037  0.727022

df.dropna()

Output:

        one       two     three
a -1.359063  1.613255 -0.669396
c  0.885117  0.609271  0.330818
e -0.136086  1.132808  0.496091
f  0.210065  0.533174  0.111560
h  1.027689  0.630037  0.727022
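By default .dropna() removes every row that has at least one missing value. The sketch below, on a small made-up DataFrame, shows how the how and axis parameters change that behavior.

```python
import numpy as np
import pandas as pd

# Small illustrative frame: row b is entirely missing, row c is partly missing.
df = pd.DataFrame(
    {
        "one": [1.0, np.nan, 3.0],
        "two": [4.0, np.nan, np.nan],
        "three": [7.0, np.nan, 9.0],
    },
    index=["a", "b", "c"],
)

# Default (how="any"): drop rows containing at least one NaN.
any_dropped = df.dropna()

# how="all": drop only rows in which every value is NaN.
all_dropped = df.dropna(how="all")

# axis=1: drop columns instead of rows (here, after removing the empty row).
columns_dropped = all_dropped.dropna(axis=1)

print(any_dropped.index.tolist())
print(all_dropped.index.tolist())
print(columns_dropped.columns.tolist())
```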

Fill the missing values

The pandas df.replace() function is used to replace a string, regex, list, dictionary, series, number, etc. in a DataFrame. It is a very rich function with many variations.

Using .replace(), we can replace all the NaN values with any value we like.

Now we will replace all the NaN values with 0:

import numpy as np
df.replace(np.nan, 0)  # (original value, replacement value)

Output:

        one       two     three
a -1.359063  1.613255 -0.669396
b  0.000000  0.000000  0.000000
c  0.885117  0.609271  0.330818
d  0.000000  0.000000  0.000000
e -0.136086  1.132808  0.496091
f  0.210065  0.533174  0.111560
g  0.000000  0.000000  0.000000
h  1.027689  0.630037  0.727022

You can also replace NaN with the mean, median, or mode of a column, or impute values using machine learning models.
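As a rough illustration of filling with a statistic, the sketch below uses .fillna(), the pandas method dedicated to missing values, on a small made-up DataFrame. Filling with the column mean is just one possible choice, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, np.nan, 3.0], "two": [10.0, 20.0, np.nan]})

# Same result as df.replace(np.nan, 0).
zero_filled = df.fillna(0)

# df.mean() returns one mean per column (NaN is skipped);
# fillna() then fills each column with its own mean.
mean_filled = df.fillna(df.mean())

# The median works the same way.
median_filled = df.fillna(df.median())

print(mean_filled)
```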

We will learn about statistical summary calculations in the next article.

Now we will learn how to convert the data types of the variables.

When doing data analysis, it is important to make sure you are using the correct data types; otherwise, you may get unexpected results or errors. In the case of pandas, it will correctly infer data types in many cases and you can move on with your analysis without any further thought on the topic.

Despite how well pandas works, at some point in your data analysis processes, you will likely need to explicitly convert data from one type to another. 

Now we will discuss the basic pandas data types, how they map to Python and NumPy data types, and the options for converting from one pandas type to another.

Pandas Data Types

A data type is essentially an internal construct that a programming language uses to understand how to store and manipulate data. For instance, a program needs to understand that you can add two numbers together like 5 + 10 to get 15. Or, if you have two strings such as “cat” and “dog” you could concatenate (add) them together to get “catdog.”

A potentially confusing point about pandas data types is that there is some overlap between pandas and NumPy. This table summarizes the key points:

Pandas dtype  NumPy type                                                      Usage
object        string_, unicode_, mixed types                                  Text or mixed numeric
int64         int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64  Integer numbers
float64       float_, float16, float32, float64                               Floating point numbers
bool          bool_                                                           True/False values
datetime64    datetime64[ns]                                                  Date and time values
category      NA                                                              Finite list of text values
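To see these dtypes in practice, the sketch below builds a small made-up DataFrame with one column per type and prints df.dtypes. The column names and values are illustrative assumptions. The text column is created with an explicit object dtype, and the exact datetime unit shown (for example ns) can vary between pandas versions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": pd.Series(["Krishna", "Ram"], dtype=object),  # text
        "visits": [1, 2],                                     # integers
        "height": [5.40, 5.11],                               # floats
        "active": [True, False],                              # booleans
        "joined": pd.to_datetime(["2020-01-01", "2021-06-15"]),
        "group": pd.Series(["x", "y"], dtype="category"),
    }
)

# One dtype per column.
print(df.dtypes)
```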

Now we will focus on the following pandas data types and learn how to convert them from one form to another form 

  • object
  • int64
  • float64

Consider a DataFrame

          Age  Height  Weight
Krishna  25.5    5.40    45.5
Ram      45.0    5.11    50.0

Age is normally an integer value, but if we look closely at the DataFrame, the Age values are stored as floats.

We can check the datatypes of every column by using .dtypes

df.dtypes

Age       float64
Height    float64
Weight    float64
dtype: object

This shows that Age is a float. To convert it to an integer, we use .astype():

df['Age'].astype('int32')

Krishna    25
Ram        45
Name: Age, dtype: int32

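The Age values are now integers (the fractional part of 25.5 is dropped). Note that .astype() returns a new Series, so the result must be assigned back to df['Age'] to keep the change.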
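A minimal sketch of the conversion on the example DataFrame, followed by an object-to-number conversion. The raw Series of text values is a made-up example, and pd.to_numeric with errors="coerce" is shown as one possible way to handle text that cannot be parsed.

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [25.5, 45.0], "Height": [5.40, 5.11], "Weight": [45.5, 50.0]},
    index=["Krishna", "Ram"],
)

# astype() returns a new Series, so assign it back to keep the change.
# Converting float to int drops the fractional part (25.5 becomes 25).
df["Age"] = df["Age"].astype("int32")
print(df.dtypes)

# Hypothetical text column: values that cannot be parsed become NaN
# when errors="coerce" is used.
raw = pd.Series(["10", "20", "n/a"])
numbers = pd.to_numeric(raw, errors="coerce")
print(numbers)
```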

We will see the rest of the pandas applications in the next article.
