Data scientists spend a large amount of their time cleaning datasets and getting them down to a form with which they can work. A lot of data scientists argue that the initial steps of obtaining and cleaning data constitute 80% of the job. Therefore, if you are just stepping into this field or planning to step into this field, it is important to be able to deal with messy data, whether that means missing values, inconsistent formatting, malformed records, or nonsensical outliers.
In this article, we will cover a few pandas libraries that are used to clean the data.
Functions Used for Data Cleaning
After reading the data set into a data frame using .read_csv( ) we will try to clean the Data using different functions
Why Should we Rename Columns and Index
If your data was generated by a computer program, it probably has some computer-generated column names, too. Those can be hard to read and understand while working, so if you want to rename a column to something more user-friendly, you can do it using df.rename()
Consider the following DataFrame
A | B | |
0 | 1 | 4 |
1 | 2 | 5 |
df.rename(columns={"A": "a", "B": "b"})
Output:
a | b | |
0 | 1 | 4 |
1 | 2 | 5 |
We can also rename the index using .rename()
df.rename(index={0: "x", 1: "y"})
a | b | |
x | 1 | 4 |
y | 2 | 5 |
Missing data is always a problem in real-life scenarios. Areas like machine learning and data mining face severe issues in the accuracy of their model predictions because of the poor quality of data caused by missing values. In these areas, missing value treatment is a major point of focus to make their models more accurate and valid.
When and Why Is Data Missed?
Let us consider an online survey for a product. Many times, people do not share all the information related to them. Few people share their experience, but not how long they are using the product; few people share how long they are using the product, their experience but not their contact information. Thus, in some or the other way, a part of data is always missing, and this is very common in real-time.
In Pandas missing data is represented by two value:
None: None is a Python singleton object that is often used for missing data in Python code.
NaN: NaN (an acronym for Not a Number), is a special floating point value recognized by all systems that use the standard IEEE floating-point representation.
Let us now see how to identify missing values in our Data Set
Consider the following Data Set
one | two | three | |
a | -1.359063 | 1.613255 | -0.669396 |
b | NaN | NaN | NaN |
c | 0.885117 | 0.609271 | 0.330818 |
d | NaN | NaN | NaN |
e | -0.136086 | 1.132808 | 0.496091 |
f | 0.210065 | 0.533174 | 0.111560 |
g | NaN | NaN | NaN |
h | 1.027689 | 0.630037 | 0.727022 |
Now we will find the missing values in the data set using the function .isnull( )
df.isnull()
Output:
one | two | three | |
a | False | False | FALSE |
b | True | True | True |
c | False | False | FALSE |
d | True | True | True |
e | False | False | FALSE |
f | False | False | FALSE |
g | True | True | True |
h | False | False | FALSE |
.isnull() checks every column for NULL values and a boolean series is returned by the isnull() method which stores True for NaN value and False for a Not null value.
We can also use .notnull() function to find the null values. It is opposite of .isnull()
df.notnull()
Output:
one | two | three | |
a | True | True | True |
b | False | False | FALSE |
c | True | True | True |
d | False | False | FALSE |
e | True | True | True |
f | True | True | True |
g | False | False | FALSE |
h | True | True | True |
.notnull() checks every column is checked for NULL values and a boolean series is returned by the notnull() method which stores True for every NON-NULL value and False for a null value.
How to Drop rows with Nan values
There are several options for handling missing values each with its PROS and CONS. However, the choice of what should be done is largely dependent on the nature of our data and the missing values. Below is a summary highlight of several options we have for handling missing values.
- Drop the missing values
- Fill the missing values
Drop the missing values
.dropna() function this function drop Rows Columns off datasets with Null values
Consider the following Data Set
one | two | three | |
a | -1.359063 | 1.613255 | -0.669396 |
b | NaN | NaN | NaN |
c | 0.885117 | 0.609271 | 0.330818 |
d | NaN | NaN | NaN |
e | -0.136086 | 1.132808 | 0.496091 |
f | 0.210065 | 0.533174 | 0.111560 |
g | NaN | NaN | NaN |
h | 1.027689 | 0.630037 | 0.727022 |
df.dropna()
Output:
one | two | three | |
a | -1.359063 | 1.613255 | -0.669396 |
c | 0.885117 | 0.609271 | 0.330818 |
e | -0.136086 | 1.132808 | 0.496091 |
f | 0.210065 | 0.533174 | 0.111560 |
h | 1.027689 | 0.630037 | 0.727022 |
Fill the missing values
Pandas df.replace() function is used to replace a string, regex, list, dictionary, series, number etc. from a dataframe. This is a very rich function as it has many variations.
Using this .replace() function we can replace we replace all the NaN with whatever value’s we like
Now we will replace all the NaN with ‘0’
df.replace(np.nan,0) //(orginal value,replaced value)
Output:
one | two | three | |
a | -1.359063 | 1.613255 | -0.669396 |
b | 0 | 0 | 0 |
c | 0.885117 | 0.609271 | 0.330818 |
d | 0 | 0 | 0 |
e | -0.136086 | 1.132808 | 0.496091 |
f | 0.210065 | 0.533174 | 0.111560 |
g | 0 | 0 | 0 |
h | 1.027689 | 0.630037 | 0.727022 |
You can also replace the NaN with mean, median, mode and we can also impute values using machine learning models .
We will learn about statistical summary calculations in next article’s
Now we will learn how to convert the datatypes of the variables
When doing data analysis, it is important to make sure you are using the correct data types; otherwise, you may get unexpected results or errors. In the case of pandas, it will correctly infer data types in many cases and you can move on with your analysis without any further thought on the topic.
Despite how well pandas works, at some point in your data analysis processes, you will likely need to explicitly convert data from one type to another.
Now we will discuss the basic pandas data types, how they map to python and numpy data types, and the options for converting from one pandas type to another.
Pandas Data Types
A data type is essentially an internal construct that a programming language uses to understand how to store and manipulate data. For instance, a program needs to understand that you can add two numbers together like 5 + 10 to get 15. Or, if you have two strings such as “cat” and “dog” you could concatenate (add) them together to get “catdog.”
A possible confusing point about pandas data types is that there is some overlap between pandas, and numpy. This table summarizes the key points:
Pandas dtype | NumPy type | Usage |
object | string_, unicode_, mixed types | Text or mixed numeric |
int64 | int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64 | Integer numbers |
float64 | float_, float16, float32, float64 | Floating point numbers |
bool | bool_ | True/False values |
datetime64 | datetime64[ns] | Date and time values |
category | NA | Finite list of text values |
Now we will focus on the following pandas data types and learn how to convert them from one form to another form
- object
- int64
- float64
Consider a DataFrame
Age | Height | Weight | |
Krishna | 25.5 | 5.40 | 45.5 |
Ram | 45.0 | 5.11 | 50.0 |
As we all know that age is an integer value but if we closely observe the data frame age values is in float form
We can check the datatypes of every column by using .dtypes
df.dtypes
Age float64 Height float64 Weight float64 dtype: object
It is showing that Age is a float in-order to convert it into int we will use .astype()
df[‘Age’].astype(‘int32’)
Ram 25 Krishna 45 Name: Age, dtype: int32
Now we can see that Age column is now converted into int datatype
We will see the rest of the pandas applications in the next article……