Python Pandas Tutorial

Pandas is one of the most critical libraries every data scientist using Python must understand. While packages for numerical computing and visualization such as NumPy, Matplotlib, Seaborn, and Plotly may get all the hype, pandas still stands as the bedrock of most data science projects. In this tutorial, we will discuss how to carry out important operations with pandas. This is by no means a comprehensive guide to the library: pandas is so broad that its documentation runs into thousands of pages, and textbooks dedicate entire chapters to it.

You will, however, get to learn the most important parts of pandas. You'll discover the most useful techniques for data manipulation with pandas and the operations you must understand before taking on any data science project. Here's a rundown of what you'll learn.

  • What is Pandas
  • Why you need Pandas
  • How to install Pandas
  • The Series and DataFrame Objects
  • Working with Dates
  • How to inspect your data
  • Understanding indexing, slicing, and masking
  • Using loc and iloc indexers
  • Handling missing values
  • How to combine two datasets

Sure, you’re excited. Let’s get started. 

What is Pandas

Pandas is a software library written in Python for data wrangling, manipulation, and analysis. The name pandas was coined from the econometric term 'panel data', which refers to data that contains observations for the same sample over a period of time.

Pandas is open-source, high-performance, and easy to use for data manipulation. It is also a great library for time-series data. The library is built on top of NumPy, which means you need NumPy installed on your PC to use pandas. If you have Anaconda installed on your machine, pandas comes preinstalled. If, however, you do not use Anaconda, you may need to install pandas manually. We will talk about how to install pandas on your PC momentarily. In the meantime, it is paramount for you to understand why you need pandas.

Why You Need Pandas

Pandas is perhaps one of the most popular libraries for data analysis. It is used across the industry for most data science projects. But why is that? Let's see some of the reasons pandas shines.

  • Easy data representation: pandas allows you to represent data in a manner that makes it easy to understand. Understanding the kind of data you are dealing with is a prerequisite to wrangling or manipulating it.
  • A wide range of features: pandas has a wide range of features, methods, and classes, which makes the library very powerful for data analysis. With pandas, you can filter data according to some conditions, split data according to set conditions, and much more.
  • Easy to write: Yes, pandas can do really powerful things, but you do not need long lines of code to perform these operations. For instance, you can get an overview of your data with just one line of code: you can check the mean, median, quartiles, minimum, maximum, etc. with just the describe() method.
  • Pandas makes data flexible: With pandas, you can wrangle your data as you wish. Using its myriad features, you can edit and customize your data just how you want it. You can merge, concatenate, slice, or reshape the data to your taste. This helps you distill valuable information from your data in no time.
  • It is great for large datasets: Pandas has no problem dealing with large datasets. In fact, pandas was created to handle large datasets efficiently.

You see why it is such a powerful library. We'll get into how to use it in just a moment. Right now, let's make sure you have pandas up and running on your machine.

How to Install Pandas

Pandas is a very easy package to install. As mentioned earlier, if you have Anaconda installed on your PC, pandas comes pre-installed as well. If not, simply open your command prompt and type

pip install pandas

or

conda install pandas

Alternatively, you could run the command from a Jupyter notebook or IPython shell rather than your command prompt or another terminal program. Just type

!pip install pandas

Be aware that you need to be connected to the internet when you run any of the commands above. It may take some time to download the package and its dependencies and install them on your PC. Once you have successfully installed pandas, you can confirm that it is up and running by typing

import pandas as pd
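
If the import succeeds without an error, pandas is ready to use. You can also confirm which version you have installed:

#print the installed pandas version
print(pd.__version__)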

The Two Key Pandas Data Structures: Series and DataFrame

Pandas Series Object

A Series is a one-dimensional data structure in pandas. Like any other form of data, the values can be integers, strings, Booleans, or floating-point numbers. Series are used when you are dealing with one-dimensional data; a Series hence cannot hold data with more than one column. For data with multiple columns, a different pandas data structure, called a DataFrame, is recommended.

Let's see how to create data using a Series. First, we import the pandas library as pd. You can easily create a Series from a list or array as follows.

#import the pandas library
import pandas as pd

#create a series of numbers
data = pd.Series([2, 4, 6, 8, 10, 12, 14, 16, 18], dtype='float64')

#print the data
print(data)

Output:

0     2.0
1     4.0
2     6.0
3     8.0
4    10.0
5    12.0
6    14.0
7    16.0
8    18.0
dtype: float64

As seen above, the values are printed in the right column. But what are the numbers in the left column? They are called indices. When the index of your Series is not specified, pandas automatically assigns integer indices from 0 to the end of the list. You can specify the index you want by passing the index argument when you call the Series object. See an example here.

#import the pandas library
import pandas as pd

#create a series of numbers
data = pd.Series([2, 4, 6, 8, 10, 12, 14, 16, 18], dtype='float64', index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])

#print the data
print(data)

Output:
a     2.0
b     4.0
c     6.0
d     8.0
e    10.0
f    12.0
g    14.0
h    16.0
i    18.0
dtype: float64

You can access the index and values of the Series with the index and values attributes, respectively. In the example below, let's print out the index and values of the Series we already created.

#print the index of the series
print('The index of the series is ', data.index)

print()

#print the values of the series
print('The values of the series are ', data.values)

Output:
The index of the series is Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'], dtype='object') 

The values of the series are [ 2. 4. 6. 8. 10. 12. 14. 16. 18.]
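
You can also look up a single value by its explicit index label:

#access a single value by its index label
print(data['c'])

Output:
6.0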

Another method of creating a Series is to define a dictionary and pass it as an argument to the pd.Series() object. Each key of the dictionary becomes an index label, while the corresponding value becomes the data value for that label. Let's take an example where we create a Series with indices a, b, c, d, e.

#create a dictionary 
data_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

#create the series
data = pd.Series(data_dict, dtype='float64')

#print the data
print(data)

Output:
a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
dtype: float64

Pandas DataFrame Object

A DataFrame is a 2-dimensional data structure with flexible row and column names. It is instantiated using the pd.DataFrame() class. You can think of a DataFrame as a tabular representation of data in rows and columns, just as you have in spreadsheets. A DataFrame accommodates more than one column. A DataFrame can be created from a NumPy array: you first create the NumPy array and then pass it as an argument to the pd.DataFrame() object.

#import numpy library
import numpy as np
#create a numpy array
numpy_array = np.array([[1, 2, 3], [4, 5, 6]])

#convert the numpy array into a DataFrame
df = pd.DataFrame(numpy_array)

#print the dataframe
print(df)

Output:
   0  1  2
0  1  2  3
1  4  5  6

In the DataFrame created in the above example, the leftmost column is the index while the first row shows the column names. Pandas automatically assigns numbers to the index and column names when they are not defined. You can specify the names of individual columns by passing the columns parameter when creating the DataFrame. Let's say we want to name the 3 columns in the last example 'column 1', 'column 2', and 'column 3'. You can do it with the code below.

#import numpy library
import numpy as np
#create a numpy array
numpy_array = np.array([[1, 2, 3], [4, 5, 6]])

#convert the numpy array into a DataFrame
df = pd.DataFrame(numpy_array, columns=['column 1', 'column 2', 'column 3'])

#print the dataframe
print(df)

Output:
   column 1  column 2  column 3
0         1         2         3
1         4         5         6

Another way of creating a DataFrame is to first create a dictionary and pass it as an argument to the pd.DataFrame() object. Let's say we want to record the names of students in a class, their gender, and their age. We can use this method to create such data very quickly.

#create a dictionary firstly
data_dict = {'Name': ['John', 'Mike', 'Krish', 'Ann', 'Tobi'],
            'Age': [22, 19, 25, 22, 20],
            'Gender': ['Male', 'Male', 'Male', 'Female', 'Female']}

#create a dataframe with the dictionary
df = pd.DataFrame(data_dict)
#print the dataframe
print(df)
Output:
    Name  Age  Gender
0   John   22    Male
1   Mike   19    Male
2  Krish   25    Male
3    Ann   22  Female
4   Tobi   20  Female

Notice that the index is the default numeric index running from 0 to the number of rows minus one. Sometimes, you may want to replace the index with a particular column in the dataset. You can do this with the set_index() method. Let's say we want the Name column to be the index of the data; the following command accomplishes that.

#change the index column to the Name column
#setting inplace to True modifies df directly instead of returning a copy
df.set_index('Name', inplace=True)

#print the dataframe
print(df)
Output:
       Age  Gender
Name              
John    22    Male
Mike    19    Male
Krish   25    Male
Ann     22  Female
Tobi    20  Female
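
If you later want the default numeric index back, the reset_index() method reverses this operation, moving the Name index back into a regular column:

#move the Name index back into a regular column
df.reset_index(inplace=True)

#print the dataframe
print(df)

Output:
    Name  Age  Gender
0   John   22    Male
1   Mike   19    Male
2  Krish   25    Male
3    Ann   22  Female
4   Tobi   20  Female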

Working with Dates

Most times, you'll be working with data that involves dates. Having to type dates one after the other can be very stressful, and for a large dataset, practically impossible. But not to worry: pandas provides a convenient way of generating dates within a particular range. This is done with the date_range() method.

This method takes in 3 fundamental parameters: the start date, end date, and frequency. 

pd.date_range(
    start=None,
    end=None,
    freq=None)

By default, the frequency is set to daily, denoted by 'D'. Let's say we want to generate dates from 1st January to 31st March 2020 at a weekly interval. The code below does this.

#create a Sunday weekly date range from 1st January 2020 to 31st March 2020
dates = pd.date_range(start='01/01/2020', end='03/31/2020', freq='W')

#print the date
print(dates)

Output:
DatetimeIndex(['2020-01-05', '2020-01-12', '2020-01-19', '2020-01-26',
               '2020-02-02', '2020-02-09', '2020-02-16', '2020-02-23',
               '2020-03-01', '2020-03-08', '2020-03-15', '2020-03-22',
               '2020-03-29'],
              dtype='datetime64[ns]', freq='W-SUN')

It is important to point out that 'D' stands for a daily interval, 'W' for weekly, 'M' for monthly, and 'Y' for yearly. You can also create a date range without defining the end date explicitly. However, you will then need to define the number of dates you want to generate from the start date using the periods parameter. Let's create the same date range as in the last example, but this time using the periods parameter.

#create a Sunday weekly date range from 1st January 2020 to 31st March 2020
dates = pd.date_range(start='01/01/2020', periods=13, freq='W')

#print the date
print(dates)

Output:
DatetimeIndex(['2020-01-05', '2020-01-12', '2020-01-19', '2020-01-26',
               '2020-02-02', '2020-02-09', '2020-02-16', '2020-02-23',
               '2020-03-01', '2020-03-08', '2020-03-15', '2020-03-22',
               '2020-03-29'],
              dtype='datetime64[ns]', freq='W-SUN')

The periods argument was set to 13 because we wanted to print 13 weeks from the start date. 
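
As a quick illustration of another frequency, 'M' generates month-end dates (note that on recent pandas versions this alias is written 'ME' instead):

#create month-end dates for the first three months of 2020
month_ends = pd.date_range(start='01/01/2020', periods=3, freq='M')

#print the dates
print(month_ends)

Output:
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31'], dtype='datetime64[ns]', freq='M')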

Inspecting your Data

When you are building a machine learning model, you need to know the kind of data you're working with. A proper understanding of the data and good domain knowledge give you intuition on how to wrangle the data. If, for instance, you are dealing with data that has 2 columns, the expenses and revenue of a company, you don't need a philosopher to tell you to subtract one column from the other to derive the profit or loss of that company.

In a nutshell, you need to print out the data and inspect it. It's easy to print the entire data if it's a small dataset, but if the dataset is large, printing all of it is impractical. What most data scientists do is print the beginning and end of the dataset. This is usually enough to apprise you of the kind of data you are dealing with.

To print the beginning of your data, you can use the head() method. The method prints the first five rows of your data by default. Let us see an example.

Let's start by creating a random dataset using NumPy's random module. Since no random seed is set, the numbers in your own run will differ from those shown here.

#create a randomly generated 10-by-5 matrix, rounded to 1 decimal place
random_numbers = np.round(np.random.randn(10, 5) * 10, 1)
#create a dataframe from the matrix
df = pd.DataFrame(random_numbers, 
            index=pd.date_range(start='01/01/2020', freq='D', periods=10),
            columns=list('ABCDE'))

#print the dataframe
print(df)

Output:
               A     B     C     D     E
2020-01-01 -18.1   2.2  -3.8   1.0  15.5
2020-01-02  15.2  -4.0  -1.9   9.7  -1.3
2020-01-03   5.3   0.3   4.7  -0.3   1.4
2020-01-04   8.9 -22.0 -10.5  -9.2   6.5
2020-01-05  -6.2   9.2  -7.1 -11.7   2.0
2020-01-06  28.0   7.7  -1.1 -13.8  -2.4
2020-01-07   7.7  -2.9   0.2  -6.2 -11.6
2020-01-08  11.5  -2.0   0.7  -6.6  21.8
2020-01-09 -11.0  -9.5 -15.2  -3.7  -0.3
2020-01-10   9.7 -11.0  -5.8   1.7 -12.5

Now that we have our data, let's print the first five rows using the head() method.

#print the first 5 rows
print(df.head())
Output:
              A     B     C     D     E
2020-01-01 -18.1   2.2  -3.8   1.0  15.5
2020-01-02  15.2  -4.0  -1.9   9.7  -1.3
2020-01-03   5.3   0.3   4.7  -0.3   1.4
2020-01-04   8.9 -22.0 -10.5  -9.2   6.5
2020-01-05  -6.2   9.2  -7.1 -11.7   2.0

By default, the head() method prints the first 5 rows. 

Signature: df.head(n=5)
Docstring:
Return the first `n` rows.

You can change the number of rows you want to print by passing the number of rows as the parameter of the head() method. Let’s say you want to print the first 8 rows this time. We can simply write the following command. 

#print the first 8 rows
print(df.head(8))

Output:
               A     B     C     D     E
2020-01-01 -18.1   2.2  -3.8   1.0  15.5
2020-01-02  15.2  -4.0  -1.9   9.7  -1.3
2020-01-03   5.3   0.3   4.7  -0.3   1.4
2020-01-04   8.9 -22.0 -10.5  -9.2   6.5
2020-01-05  -6.2   9.2  -7.1 -11.7   2.0
2020-01-06  28.0   7.7  -1.1 -13.8  -2.4
2020-01-07   7.7  -2.9   0.2  -6.2 -11.6
2020-01-08  11.5  -2.0   0.7  -6.6  21.8

In contrast, we can print the ending of our data using the tail() method. For the same data above, let’s print the last 5 rows of the dataset. 

#print the last 5 rows
print(df.tail())

Output:
               A     B     C     D     E
2020-01-06  28.0   7.7  -1.1 -13.8  -2.4
2020-01-07   7.7  -2.9   0.2  -6.2 -11.6
2020-01-08  11.5  -2.0   0.7  -6.6  21.8
2020-01-09 -11.0  -9.5 -15.2  -3.7  -0.3
2020-01-10   9.7 -11.0  -5.8   1.7 -12.5

Similarly, you are at liberty to tweak the number of rows printed based on your preference. Let's say we want to print only the last 3 rows this time; we can do that by passing 3 to the tail() method.

#print the last 3 rows
print(df.tail(3))

Output:
              A     B     C    D     E
2020-01-08  11.5  -2.0   0.7 -6.6  21.8
2020-01-09 -11.0  -9.5 -15.2 -3.7  -0.3
2020-01-10   9.7 -11.0  -5.8  1.7 -12.5

Describing your Data 

Pandas can help you generate fundamental descriptive statistics in a single line of code. Say you need the mean, standard deviation, minimum value, maximum value, etc. for your dataset; it is as easy as calling the describe() method on your data.

#describe your data
df_describe = df.describe()

#print the descriptive statistics
print(df_describe)

Output:
               A          B         C          D          E
count  10.000000  10.000000  10.00000  10.000000  10.000000
mean    5.100000  -3.200000  -3.98000  -3.910000   1.910000
std    13.478048   9.250345   5.85772   7.093416  10.674835
min   -18.100000 -22.000000 -15.20000 -13.800000 -12.500000
25%    -3.325000  -8.125000  -6.77500  -8.550000  -2.125000
50%     8.300000  -2.450000  -2.85000  -4.950000   0.550000
75%    11.050000   1.725000  -0.12500   0.675000   5.375000
max    28.000000   9.200000   4.70000   9.700000  21.800000
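
If you only need a single statistic, you can call the corresponding method directly. For instance, the mean of column A (the printed value may carry some floating-point rounding noise):

#compute the mean of column A directly
print(df['A'].mean())

Output:
5.1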

Data Indexing, Slicing, and Masking

Many times, you will be required to extract a particular row, set of rows, columns, or set of columns from the entire data. That is what slicing, masking, and indexing entail. 

To extract a single column from the data, you call the data variable name and pass the name of the column in square brackets. As an example, if the DataFrame variable is named df and you want to print a column named 'Customer ID', you simply type df['Customer ID']. This is called data indexing.

Let's see an example with the data we already created. If you want to print column 'B' of the data, you can type the following command.

#print column B of the data
print(df['B'])

Output:
2020-01-01     2.2
2020-01-02    -4.0
2020-01-03     0.3
2020-01-04   -22.0
2020-01-05     9.2
2020-01-06     7.7
2020-01-07    -2.9
2020-01-08    -2.0
2020-01-09    -9.5
2020-01-10   -11.0
Freq: D, Name: B, dtype: float64

If you want to print more than one column, you pass the column names as a list. Let's say you want to print columns B, C, D, and E; you pass ['B', 'C', 'D', 'E'] inside the indexing brackets. This method is called fancy indexing.

See the code that does this below. 

#print column 'B', 'C', 'D', and 'E' of the data

print(df[['B', 'C', 'D', 'E']])

Output:
               B     C     D     E
2020-01-01   2.2  -3.8   1.0  15.5
2020-01-02  -4.0  -1.9   9.7  -1.3
2020-01-03   0.3   4.7  -0.3   1.4
2020-01-04 -22.0 -10.5  -9.2   6.5
2020-01-05   9.2  -7.1 -11.7   2.0
2020-01-06   7.7  -1.1 -13.8  -2.4
2020-01-07  -2.9   0.2  -6.2 -11.6
2020-01-08  -2.0   0.7  -6.6  21.8
2020-01-09  -9.5 -15.2  -3.7  -0.3
2020-01-10 -11.0  -5.8   1.7 -12.5

Notice the 2 square brackets this time. The outer brackets do the indexing, while the inner brackets hold the list of column names; because a list of columns is passed, the result is a 2-dimensional DataFrame rather than a Series.

There is another way to access a chunk of data from an entire dataset: data slicing. If you want to print a range of rows from the dataset, you can use df[start:stop] to specify the range you want.

The value before the colon defines the starting row and the one after defines the ending row, where 'df' is your defined DataFrame variable name. Note that when slicing with labels, as below, the stop value is included in the result. Let's say we want to print 2nd January to 8th January from the data we defined in the previous example; we can use slicing to print these rows nicely.

#print rows from 2nd January till 8th January
print(df['2020-01-02':'2020-01-08'])

Output:
               A     B     C     D     E
2020-01-02  15.2  -4.0  -1.9   9.7  -1.3
2020-01-03   5.3   0.3   4.7  -0.3   1.4
2020-01-04   8.9 -22.0 -10.5  -9.2   6.5
2020-01-05  -6.2   9.2  -7.1 -11.7   2.0
2020-01-06  28.0   7.7  -1.1 -13.8  -2.4
2020-01-07   7.7  -2.9   0.2  -6.2 -11.6
2020-01-08  11.5  -2.0   0.7  -6.6  21.8

Finally, let's discuss masking. Masking is another way of filtering out specific data from your dataset. You pass a condition and pandas returns a Boolean for each entry: True when the condition is met and False otherwise. Let's put this into perspective with an example. If you want to check which dates are after 4th January, you can pass the condition like this.

#check for dates that are after 4th January
print(df.index > '2020-01-04')
Output:
[False False False False  True  True  True  True  True  True]

As we can see, it returns True for the dates after 4th January. We can take it a step further by printing the data that satisfies this condition: we simply index the DataFrame with the returned Boolean array.

#print the rows for dates after 4th January
print(df[df.index > '2020-01-04'])

Output:
               A     B     C     D     E
2020-01-05  -6.2   9.2  -7.1 -11.7   2.0
2020-01-06  28.0   7.7  -1.1 -13.8  -2.4
2020-01-07   7.7  -2.9   0.2  -6.2 -11.6
2020-01-08  11.5  -2.0   0.7  -6.6  21.8
2020-01-09 -11.0  -9.5 -15.2  -3.7  -0.3
2020-01-10   9.7 -11.0  -5.8   1.7 -12.5

This is a powerful tool that must be understood.
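
Masks are not limited to the index. You can also build them from column values and combine conditions with & (and) or | (or); note the parentheses around each condition. Here is a small sketch using the same df:

#keep only the rows where column A is positive and column B is negative
mask = (df['A'] > 0) & (df['B'] < 0)
print(df[mask])

Output:
               A     B     C     D     E
2020-01-02  15.2  -4.0  -1.9   9.7  -1.3
2020-01-04   8.9 -22.0 -10.5  -9.2   6.5
2020-01-07   7.7  -2.9   0.2  -6.2 -11.6
2020-01-08  11.5  -2.0   0.7  -6.6  21.8
2020-01-10   9.7 -11.0  -5.8   1.7 -12.5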

Using Indexers: loc and iloc 

Before we delve into how to use the loc and iloc attributes, we need to understand what implicit and explicit indices are.

When you create a pandas Series or DataFrame, pandas automatically assigns index numbers from 0 up to the length of the data. These are called implicit indices. In the example below, where the index was not explicitly defined, 0, 1, 2, and 3 are the implicit indices.

#create a pandas Series
df = pd.Series(['a', 'b', 'c', 'd'])
#print the Series
print(df)

Output:
0    a
1    b
2    c
3    d
dtype: object

You can, however, change the index labels by specifying the index parameter when creating the Series or DataFrame. When you explicitly define the index labels, they are called explicit indices. Let's now specify explicit indices in the example above.

#create a pandas Series with explicit indices
df = pd.Series(['a', 'b', 'c', 'd'], index=[1, 2, 3, 4])
#print the Series
print(df)

Output:
1    a
2    b
3    c
4    d
dtype: object

Now, the explicit indices for the data are 1, 2, 3, 4 while the implicit indices remain 0, 1, 2, 3. 

If you attempt to slice or index this Series, whose implicit and explicit indices are both integers, there may be confusion as to whether the implicit or the explicit index will be used.

When indexing, pandas uses the explicit index. Look at the code below.

#create a pandas Series with explicit indices
df = pd.Series(['a', 'b', 'c', 'd'], index=[1, 2, 3, 4])
#print the indexed series
print(df[1])

Output:

a

On the other hand, when slicing, pandas uses the implicit index. Look at the code below.

#create a pandas Series with explicit indices
df = pd.Series(['a', 'b', 'c', 'd'], index=[1, 2, 3, 4])
#print the sliced series
print(df[1:3])

Output:

2    b
3    c
dtype: object

As seen, position 1 now yields 'b' (using the implicit, position-based index), whereas df[1] returned 'a' (using the explicit index) in the earlier example.

To avoid this confusion, we use loc and iloc for indexing and slicing. These special indexer attributes let you state explicitly whether an operation should use the implicit or the explicit index.

So when do you use loc or iloc? If you want to index or slice based on the explicit indices, use the loc attribute. Let's see an example.

#create a pandas Series with explicit indices
df = pd.Series(['a', 'b', 'c', 'd'], index=[1, 2, 3, 4])
#print the sliced series based on explicit indices
print(df.loc[1:3])

Output:
1    a
2    b
3    c
dtype: object

If, on the other hand, you want the implicit indices to be used, use the iloc attribute. Note that the loc slice above included the stop label 3, whereas iloc follows Python's usual convention of excluding the stop position. Let's see an example.

#create a pandas Series with explicit indices
df = pd.Series(['a', 'b', 'c', 'd'], index=[1, 2, 3, 4])
#print the sliced series based on implicit indices
print(df.iloc[1:3])

Output:
2    b
3    c
dtype: object

All of this applies to DataFrames as well. For a DataFrame, the row and column labels are the explicit indices, while the underlying integer positions are the implicit indices. Let's see an example.

#create a randomly generated 10-by-5 matrix, rounded to 1 decimal place
random_numbers = np.round(np.random.randn(10, 5) * 10, 1)
#create a dataframe from the matrix
df = pd.DataFrame(random_numbers, 
            index=pd.date_range(start='01/01/2020', freq='D', periods=10),
            columns=list('ABCDE'))

#print the sliced dataframe based on explicit indices using loc
print(df.loc['2020-01-02':'2020-01-08', 'A':'C'])

Output:
               A     B     C
2020-01-02   2.8  14.7  13.4
2020-01-03   5.3  -5.5  -0.0
2020-01-04  12.8  -8.0 -15.1
2020-01-05   2.8  -1.4  -3.8
2020-01-06   5.6 -21.1  14.2
2020-01-07  11.2  -5.4  10.4
2020-01-08  -7.7  -8.1  15.2

You can as well make use of the iloc attribute; this time, you pass the implicit (integer position) indices. Take a look.

#create a randomly generated 10-by-5 matrix, rounded to 1 decimal place
random_numbers = np.round(np.random.randn(10, 5) * 10, 1)
#create a dataframe from the matrix
df = pd.DataFrame(random_numbers, 
            index=pd.date_range(start='01/01/2020', freq='D', periods=10),
            columns=list('ABCDE'))

#print the sliced dataframe based on implicit indices using iloc
print(df.iloc[1:8, 0:3])

Output:
              A     B     C
2020-01-02  -3.3  -5.2   0.7
2020-01-03   0.6 -13.4  -4.1
2020-01-04   2.7   5.5  23.3
2020-01-05  -7.6   4.6  -3.3
2020-01-06  10.7  16.3   4.6
2020-01-07  -4.9  -5.5   2.4
2020-01-08  -4.6  10.3  -1.0
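
The same indexers also retrieve single values. As a small sketch (the printed number depends on the random data), both lines below select the same top-left cell:

#select a single value by explicit row and column labels
print(df.loc['2020-01-01', 'A'])

#select the same value by implicit (integer) positions
print(df.iloc[0, 0])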

Handling Missing Values

Most data you see in tutorials is clean and homogeneous, but that's not the case with real-world data. Real-world data is largely inconsistent and has several missing values. Missing values can mar the learning of your machine learning algorithm. As a data scientist, you must know how to handle missing values in your data; it is perhaps the first step in data preprocessing.

The pandas library has built-in methods that can check for missing values, replace missing values, or drop missing values completely. They are:

  • isnull(): This method checks if missing values are present in your data and returns a Boolean for each entry. It returns True wherever there's a missing value and False where there's not.
  • notnull(): This is the exact opposite of the isnull() method. It returns True where there is no missing value and False where there is (demonstrated after the isnull() example below).
  • dropna(): This is one of the most used methods for dealing with missing values. The dropna() method removes missing values from the dataset completely; the entire row or column containing a missing value is expunged.
  • fillna(): This method fills the missing values with a defined value. Sometimes, you may want to replace missing values with the mean or median along the column or row (see the sketch at the end of this section).

Let's take some examples where we implement these methods. We will start by creating a dataset containing missing values, using NumPy's np.nan.

#create some random dataframe
df = pd.DataFrame([[1, 3, np.nan, 1, 5],
                  [1, 4, 1, np.nan, 3],
                   [3, 2, 5, 1, 2],
                  [1, 2, np.nan, 3, 6]],
                 columns = list('ABCDE'))
#print the dataframe
print(df)

Output:
   A  B    C    D  E
0  1  3  NaN  1.0  5
1  1  4  1.0  NaN  3
2  3  2  5.0  1.0  2
3  1  2  NaN  3.0  6

First, let's check whether missing values exist in the dataset. Remember, we use the isnull() method for this.

#check if missing values are present
print(df.isnull())

Output:
       A      B      C      D      E
0  False  False   True  False  False
1  False  False  False   True  False
2  False  False  False  False  False
3  False  False   True  False  False
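
As seen, isnull() prints True in entries where there are missing values. The notnull() method, as mentioned, is simply the inverse:

#check which entries are not missing
print(df.notnull())

Output:
      A     B      C      D     E
0  True  True  False   True  True
1  True  True   True  False  True
2  True  True   True   True  True
3  True  True  False   True  True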

Now, let's drop the rows or columns with missing values using the dropna() method. There is one parameter to be mindful of when using dropna(): the axis parameter, which specifies the direction along which missing values are dropped. If axis is set to 0, rows containing missing values are dropped; if it is set to 1, columns containing missing values are dropped.

In the dataset created, if we want to drop the rows containing missing values, we run the command below. 

#drop rows with missing values
print(df.dropna(axis=0))

Output:
   A  B    C    D  E
2  3  2  5.0  1.0  2

If we want to drop columns with missing values, we run this 

#drop columns with missing values
print(df.dropna(axis=1))
Output:
   A  B  E
0  1  3  5
1  1  4  3
2  3  2  2
3  1  2  6

Now let's assume we want to fill the missing values with a number, say 0. We will use the fillna() method this time. The code below does this.

#fill missing values with 0
print(df.fillna(0))

Output:
   A  B    C    D  E
0  1  3  0.0  1.0  5
1  1  4  1.0  0.0  3
2  3  2  5.0  1.0  2
3  1  2  0.0  3.0  6
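
As mentioned earlier, you can also fill missing values with a statistic rather than a constant. A small sketch that fills each column's missing entries with that column's mean:

#fill missing values in each column with that column's mean
print(df.fillna(df.mean()))

Output:
   A  B    C         D  E
0  1  3  3.0  1.000000  5
1  1  4  1.0  1.666667  3
2  3  2  5.0  1.000000  2
3  1  2  3.0  3.000000  6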

Combining Two Datasets

Most studies will require you to combine datasets from more than one source before preprocessing them. It is therefore necessary to understand how to use pandas to combine datasets. That's what we will discuss here.

You can combine two datasets using the pd.concat() method. Its signature, shown below, is from the version of pandas used in this tutorial; newer pandas versions have removed the join_axes parameter.

Signature:
pd.concat(
    objs,
    axis=0,
    join='outer',
    join_axes=None,
    ignore_index=False,
    keys=None,
    levels=None,
    names=None,
    verify_integrity=False,
    sort=None,
    copy=True,
)
Docstring:
Concatenate pandas objects along a particular axis with optional set logic
along the other axes.

Can also add a layer of hierarchical indexing on the concatenation axis,
which may be useful if the labels are the same (or overlapping) on
the passed axis number.

Let's say we want to combine two Series, A and B. We will start by creating them.

#create the dataset
A = pd.Series([1, 2, 3, 4])
B = pd.Series([5, 6, 7, 8])
#combine the two datasets
df_combined = pd.concat([A, B])

#print the combined data
print(df_combined)

Output:
0    1
1    2
2    3
3    4
0    5
1    6
2    7
3    8
dtype: int64

Notice that the datasets were combined along the rows (the index). This is because the axis parameter is set to 0 by default. To combine along columns instead, you need to set axis=1.

#create the dataset
A = pd.Series([1, 2, 3, 4])
B = pd.Series([5, 6, 7, 8])
#combine the two datasets
df_combined = pd.concat([A, B], axis=1)

#print the combined data
print(df_combined)

Output:
   0  1
0  1  5
1  2  6
2  3  7
3  4  8
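
Also notice, in the very first example, that the original row labels 0 to 3 repeat in the combined result. If you would rather have a fresh running index, pass ignore_index=True:

#combine the two datasets with a fresh integer index
df_combined = pd.concat([A, B], ignore_index=True)

#print the combined data
print(df_combined)

Output:
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
dtype: int64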

Combination with Joins

In some situations, you may wish to combine datasets with different dimensions. How does pandas determine the values to return for entries with no data available? This is controlled by the join parameter. By default, it is set to 'outer', which means entries with no available data are filled with NaN. Let's see an example.

#create the dataset
A = pd.DataFrame([[1, 2, 3, 4],
                [1, 3, 1, 5]])
B = pd.DataFrame([[5, 6, 7],
             [2, 1, 1],
             [5, 7, 2]])
#combine the two datasets
df_combined = pd.concat([A, B], join='outer')

#print A
print('Matrix A\n', A)
print()
#print B
print('Matrix B\n', B)
print()
#print the combined data
print(df_combined)

Output:
Matrix A
    0  1  2  3
0  1  2  3  4
1  1  3  1  5

Matrix B
    0  1  2
0  5  6  7
1  2  1  1
2  5  7  2

   0  1  2    3
0  1  2  3  4.0
1  1  3  1  5.0
0  5  6  7  NaN
1  2  1  1  NaN
2  5  7  2  NaN

If the join parameter is set to 'inner', the combined DataFrame will contain only the entries present in both individual DataFrames. In other words, rows or columns that would contain NaN are removed completely.

#create the dataset
A = pd.DataFrame([[1, 2, 3, 4],
                [1, 3, 1, 5]])
B = pd.DataFrame([[5, 6, 7],
             [2, 1, 1],
             [5, 7, 2]])
#combine the two datasets
df_combined = pd.concat([A, B], join='inner')

#print A
print('Matrix A\n', A)
print()
#print B
print('Matrix B\n', B)
print()
#print the combined data
print('Combined dataframe\n', df_combined)

Output:
Matrix A
    0  1  2  3
0  1  2  3  4
1  1  3  1  5

Matrix B
    0  1  2
0  5  6  7
1  2  1  1
2  5  7  2

Combined dataframe
    0  1  2
0  1  2  3
1  1  3  1
0  5  6  7
1  2  1  1

Conclusion

Let's wrap up by summarizing what you've learned in this tutorial. We started by explaining what pandas is and why you need it. We went on to discuss the two key data structures in pandas, i.e. the Series and the DataFrame. After understanding how to create a Series and a DataFrame, you discovered how to work with dates in your data. We said that you should always inspect your data with the head() or tail() method. You can also get quick insights into your data using the describe() method.

You have learned data slicing, indexing, and masking for data extraction. We finally discussed how to deal with missing values and how to combine datasets. These are key operations you must learn to carry out with pandas. 

If you have any questions, please drop them in the comment section. We will be on hand to respond to them.
