Getting Started with Pandas

Pandas is a popular Python package for data science, and with good reason: it offers powerful, expressive, and flexible data structures that make data manipulation and analysis easy. The DataFrame is one of these structures.

What is Pandas?

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open-source data analysis & data manipulation tool available in any language.

Pandas is well-suited for many different kinds of data:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet.
  • Ordered and unordered (not necessarily fixed-frequency) time-series data.
  • Any other form of observational/statistical data sets. The data need not be labeled at all to be placed into a pandas data structure.

Pandas Data Structures

There are two main types of data structures in pandas:

  1. Series – 1D labeled, homogeneously-typed array

  2. DataFrame – General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns
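As a minimal sketch of the difference between the two structures, the snippet below builds one of each. The names and ages are illustrative sample values, not data from a real source.

```python
import pandas as pd

# A Series: one-dimensional, labeled, with a single dtype for all values.
ages = pd.Series([28, 34, 29], index=["Tom", "Jack", "Steve"], name="Age")

# A DataFrame: two-dimensional, labeled; each column may have its own dtype.
people = pd.DataFrame({"Name": ["Tom", "Jack", "Steve"], "Age": [28, 34, 29]})

print(ages.ndim, people.ndim)  # 1 2
print(ages["Jack"])            # 34
print(people.shape)            # (3, 2)

# Each column of a DataFrame is itself a Series.
age_column = people["Age"]
print(type(age_column).__name__)  # Series
```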

Things that pandas can do:

  • Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
  • Powerful, flexible group-by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  • Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes
  • Robust IO tools for loading data from flat files (CSV and delimited)
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting, and lagging.
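A few of the features listed above can be sketched in a handful of lines. The example below illustrates automatic alignment, missing-data handling, and a group-by aggregation; all values and labels are made-up sample data chosen only for illustration.

```python
import pandas as pd

# Automatic alignment: values are matched by label, not by position.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])
total = s1 + s2  # labels present in only one Series produce NaN

# Missing data: count the NaN values, then treat a missing value as 0 instead.
n_missing = int(total.isna().sum())
filled = s1.add(s2, fill_value=0)

# Split-apply-combine: total sale per region.
sales = pd.DataFrame({
    "Region": ["West", "North", "West", "North"],
    "Sale": [2500, 3096, 500, 904],
})
per_region = sales.groupby("Region")["Sale"].sum()

print(total)
print(n_missing)
print(filled)
print(per_region)
```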

“Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.”

How to install Pandas?

Installing Pandas using Anaconda

The easiest way to install pandas is to install it as part of the Anaconda distribution, a cross-platform distribution for data analysis and scientific computing. This is the recommended installation method for most users.

To install this package with conda, run the following command in a terminal (or in the Anaconda Prompt on Windows):

conda install -c anaconda pandas

Installing Pandas using pip

If you have Python and pip already installed on a system, then the installation of Pandas is very easy. 

Install it using this command: 

pip install pandas
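Whichever installation method you choose, a quick way to check that it worked is to import the package and print its version. This is only a sanity check; the version string you see depends on your environment.

```python
import pandas as pd

# If the import succeeds, pandas is installed; the version string varies by setup.
print(pd.__version__)
```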

Pandas data frame representation

This is how a pandas DataFrame looks:

[Figure: a pandas DataFrame]

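How To Create a Pandas DataFrame?

A pandas DataFrame can be created using the following constructor:

pd.DataFrame(data, index=row_labels, columns=column_labels)

Here data holds the values, while the optional index and columns arguments supply the row labels and column labels.

A pandas DataFrame can also be created from various inputs, such as: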

  • Lists
  • dict
  • Series
  • Numpy ndarrays
  • Another DataFrame
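Lists and dicts get their own examples in this article, so here is a minimal sketch of the remaining inputs: Series, NumPy ndarrays, and another DataFrame. The variable names and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# From a dict of Series: rows are aligned on the union of the Series labels.
one = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
two = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
from_series = pd.DataFrame({"one": one, "two": two})

# From a 2D NumPy ndarray: pass the column labels explicitly.
arr = np.array([[1, 2], [3, 4], [5, 6]])
from_array = pd.DataFrame(arr, columns=["x", "y"])

# From another DataFrame: copy=True gives an independent copy of the data.
from_frame = pd.DataFrame(from_array, copy=True)

print(from_series)
print(from_array)
print(from_frame)
```

Note that the row labeled d has no value in the Series named one, so that cell of from_series is NaN.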

Creating an Empty DataFrame

The simplest DataFrame you can create is an empty one. This is how you do it.

Example:

import pandas as pd
df = pd.DataFrame()
df

Output:

Empty DataFrame
Columns: []
Index: []

Create a DataFrame from Lists

The DataFrame can be created using a single list or a list of lists.

Example:

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)

Output:

     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13

Create a DataFrame from Dict of ndarrays / Lists

All the ndarrays must be of the same length. If the index is passed, then the length of the index should equal the length of the arrays.

If no index is passed, then by default, the index will be a range(n), where n is the array length.

Example:

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)

Output:

        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42
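The two rules about the index can be checked with a short sketch that reuses the same data dictionary. The variable names df_default and length_error are introduced here only for illustration.

```python
import pandas as pd

data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}

# No index passed: pandas uses the default integer labels 0..n-1.
df_default = pd.DataFrame(data)
print(list(df_default.index))  # [0, 1, 2, 3]

# An index whose length differs from the arrays raises a ValueError.
length_error = None
try:
    pd.DataFrame(data, index=['rank1', 'rank2'])
except ValueError as exc:
    length_error = exc

print(type(length_error).__name__)  # ValueError
```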

Create a DataFrame from List of Dicts

A list of Dictionaries can be passed as input data to create a DataFrame. The dictionary keys are by default taken as column names.

The following example shows how to create a DataFrame by passing a list of dictionaries.

Example:

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

Output:

   a   b     c
0  1   2   NaN
1  5  10  20.0

Note that NaN (Not a Number) is filled in wherever a dictionary has no value for a column.
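Once NaN values appear, you will usually want to detect or handle them. The following is a minimal sketch using the same data; replacing missing values with 0 is just one possible choice, assumed here for illustration.

```python
import pandas as pd

data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)

# Boolean mask: True wherever a value is missing.
print(df.isna())

# Column c contains a NaN, so pandas stores it as floating point.
print(df['c'].dtype)  # float64

# Replace missing values with 0 (returns a new DataFrame).
filled = df.fillna(0)

# Or keep only the rows that have no missing values.
complete = df.dropna()

print(filled)
print(complete)
```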

How to select a specific column from the DataFrame

Consider the following DataFrame:

       Age   Name Sex
rank1   28    Tom   M
rank2   34   Jack   M
rank3   29  Steve   M
rank4   42  Ricky   M

We can access a specific column by writing df['Name'], which returns that column as a Series. To get a one-column DataFrame instead, pass a list of column names, df[['Name']], and the result is displayed as:

        Name
rank1    Tom
rank2   Jack
rank3  Steve
rank4  Ricky
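A runnable sketch of column selection on the DataFrame above might look like this. It shows that single brackets return a Series, while a list of names returns a DataFrame.

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [28, 34, 29, 42],
     'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
     'Sex': ['M', 'M', 'M', 'M']},
    index=['rank1', 'rank2', 'rank3', 'rank4'],
)

names = df['Name']            # single brackets: a Series
names_frame = df[['Name']]    # list of names: a one-column DataFrame
subset = df[['Name', 'Age']]  # several columns, in the order requested

print(type(names).__name__)        # Series
print(type(names_frame).__name__)  # DataFrame
print(subset)
```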

How To Select an Index or Column From a Pandas DataFrame

To select rows in a DataFrame, we use the following indexers:

  • iloc[]
  • loc[]

Example:

   one  three  two
a  1.0   10.0    1
b  2.0   20.0    2
c  3.0   30.0    3
d  NaN    NaN    4

The .iloc[] indexer selects rows (or columns) by integer position. The end of a position slice is excluded, so 0:2 returns the first two rows:

df.iloc[0:2]

   one  three  two
a  1.0   10.0    1
b  2.0   20.0    2

The .loc[] indexer selects rows (or columns) by label. Selecting a single row returns a Series whose labels are the column names:

df.loc['a']

one       1.0
three    10.0
two       1.0
Name: a, dtype: float64
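The sketch below rebuilds the example DataFrame and compares position-based and label-based selection. One detail worth noticing: a label slice with .loc[] includes both endpoints, unlike a position slice with .iloc[].

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'one': [1.0, 2.0, 3.0, np.nan],
     'three': [10.0, 20.0, 30.0, np.nan],
     'two': [1, 2, 3, 4]},
    index=['a', 'b', 'c', 'd'],
)

first_two = df.iloc[0:2]     # positions 0 and 1; the end position is excluded
a_to_b = df.loc['a':'b']     # labels a through b; both ends are included
row_a = df.loc['a']          # a single row, returned as a Series
cell = df.loc['b', 'three']  # row label, then column label
same_cell = df.iloc[1, 1]    # row position, then column position

print(first_two)
print(row_a)
print(cell, same_cell)  # 20.0 20.0
```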

How To Add an Index, Row or Column to a Pandas DataFrame

Now that you have learned how to select a value from a DataFrame, it’s time to get to the real work and add an index, row, or column to it!

Add New Column to DataFrame

Consider a new DataFrame with sales data from different regions. For now it holds data for two regions, West and North.

  Region  Company     Product      Month  Sale
0   West   Costco  Dinner_set  September  2500
1  North  Walmart     Grocery       July  3096

Pandas allows you to add a new column, Purchase, to this DataFrame. Note that assign() returns a new DataFrame rather than modifying the original, so the result is assigned back to df:

purchase = [3000, 4000]
df = df.assign(Purchase=purchase)

And this is the resultant DataFrame

Region  Company  Product    Month     Sale Purchase
0 West    Costco   Dinner_set September 2500 3000    
1 North   Walmart  Grocery    July      3096 4000
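As a self-contained sketch, the following rebuilds the two-row sales table and adds the column in two ways: with assign(), which returns a new DataFrame, and with bracket assignment, which changes the DataFrame in place. The Margin column is a hypothetical derived column added only to illustrate column arithmetic.

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['West', 'North'],
    'Company': ['Costco', 'Walmart'],
    'Product': ['Dinner_set', 'Grocery'],
    'Month': ['September', 'July'],
    'Sale': [2500, 3096],
})

# assign() leaves df untouched and returns a new DataFrame.
with_purchase = df.assign(Purchase=[3000, 4000])
print('Purchase' in df.columns)  # False

# Bracket assignment adds the column to df itself.
df['Purchase'] = [3000, 4000]

# A new column can also be computed from existing ones (hypothetical example).
df['Margin'] = df['Purchase'] - df['Sale']
print(df)
```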

Add New Row to DataFrame

To add the values for one more region, South, we put them in a list containing a dictionary, with the column names as keys and the new values as the corresponding values. The keys must match the existing column names exactly. We then turn the list into a DataFrame and concatenate it with the existing one:

new_row = [{'Region': 'South', 'Company': 'D-mart', 'Product': 'Tables',
            'Month': 'December', 'Sale': 1500, 'Purchase': 3500}]
df = pd.concat([df, pd.DataFrame(new_row)], ignore_index=True)

And this is the resultant DataFrame:

Region  Company  Product    Month     Sale Purchase
0 West    Costco   Dinner_set September 2500 3000    
1 North   Walmart  Grocery    July      3096 4000    
2 South   D-mart   Tables     December  1500 3500
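Here is a minimal, self-contained sketch of adding the South row with pd.concat. It is one common approach rather than the only one; the older DataFrame.append method was removed in pandas 2.0, so it is not used here.

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['West', 'North'],
    'Company': ['Costco', 'Walmart'],
    'Product': ['Dinner_set', 'Grocery'],
    'Month': ['September', 'July'],
    'Sale': [2500, 3096],
    'Purchase': [3000, 4000],
})

# The new row: a list holding one dict whose keys match the column names.
new_row = [{'Region': 'South', 'Company': 'D-mart', 'Product': 'Tables',
            'Month': 'December', 'Sale': 1500, 'Purchase': 3500}]

# ignore_index=True renumbers the rows 0, 1, 2 instead of repeating label 0.
df = pd.concat([df, pd.DataFrame(new_row)], ignore_index=True)
print(df)
```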

Add New Index to DataFrame

The index of a DataFrame holds a label for each row.

Now we will consider this DataFrame and set one of its columns as the index.

Region  Company  Product    Month     Sale Purchase
0 West    Costco   Dinner_set September 2500 3000    
1 North   Walmart  Grocery    July      3096 4000    
2 South   D-mart   Tables     December  1500 3500

This is how we do it:

df = df.set_index('Region')

This is how the DataFrame looks after setting “Region” as the index:

        Company     Product      Month  Sale  Purchase
Region
West     Costco  Dinner_set  September  2500      3000
North   Walmart     Grocery       July  3096      4000
South    D-mart      Tables   December  1500      3500

The column 'Region' is now the index of the DataFrame.

Resetting the Index of Your DataFrame

When your index doesn’t look entirely the way you want it to, you can opt to reset it. You can do this with reset_index(), which moves the current index back into a regular column and restores the default integer index.
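Setting and resetting the index can be sketched together. The snippet uses a trimmed-down version of the sales table (three columns only, to keep it short) and shows the drop=True option, which discards the old index instead of turning it into a column.

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['West', 'North', 'South'],
    'Company': ['Costco', 'Walmart', 'D-mart'],
    'Sale': [2500, 3096, 1500],
})

# Use the Region column as the row labels.
indexed = df.set_index('Region')
print(indexed.loc['North', 'Sale'])  # 3096

# reset_index() turns the index back into a column and restores 0, 1, 2.
restored = indexed.reset_index()

# drop=True throws the old index away instead of keeping it as a column.
dropped = indexed.reset_index(drop=True)

print(restored)
print(dropped)
```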

How to Delete Rows or Columns From a Pandas Data Frame

Now that you have seen how to select and add indices, rows, and columns in your DataFrame, it is time to see how to delete them.

Consider this DataFrame:

Region  Company  Product    Month     Sale Purchase
0 West    Costco   Dinner_set September 2500 3000    
1 North   Walmart  Grocery    July      3096 4000    
2 South   D-mart   Tables     December  1500 3500

Dropping columns with the column name

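To get rid of columns from your DataFrame, you can use the drop() method. By default drop() looks for row labels, so pass the name through the columns argument (or use axis=1):

df = df.drop(columns='Purchase')

Dropping Rows by index label

To get rid of a row from your DataFrame, you can also use the drop() method, passing the row's index label. Here the label is the integer 0, not the string '0':

df.drop(0, inplace=True)

We use inplace=True if we want to modify the DataFrame directly instead of getting a new one back. In that case drop() returns None, so the result must not be assigned back to df.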
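The sketch below puts both kinds of drop side by side on a trimmed-down version of the sales table (three columns, chosen to keep the example short) and shows what inplace=True returns.

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['West', 'North', 'South'],
    'Sale': [2500, 3096, 1500],
    'Purchase': [3000, 4000, 3500],
})

# Drop a column: returns a new DataFrame, df itself is unchanged.
no_purchase = df.drop(columns='Purchase')

# Drop a row by its index label: also returns a new DataFrame.
no_first = df.drop(index=0)

# With inplace=True, df is modified directly and the call returns None.
result = df.drop(0, inplace=True)

print(no_purchase)
print(no_first)
print(result)          # None
print(list(df.index))  # [1, 2]
```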

We will discuss the rest of the Pandas functions and other features in the next article.
