Getting Started with Pandas

h2k infosys, September 9, 2020


Pandas is a popular Python package for data science, and with good reason: it offers powerful, expressive, and flexible data structures that make data manipulation and analysis easy. The DataFrame is one of these structures.

What is Pandas?

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open-source data analysis & data manipulation tool available in any language.

Pandas is well-suited for many different kinds of data:

Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet

Ordered and unordered (not necessarily fixed-frequency) time-series data.

Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels

Any other form of observational/statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

Pandas Data Structures

There are two main types of data structures in pandas:

1. Series – 1D labeled homogeneously-typed array

2. DataFrame – General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns
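As a minimal sketch of the two structures, the snippet below builds one of each. The names and values are illustrative and are not taken from a real data set.

```python
import pandas as pd

# A Series: one-dimensional, labeled, with a single dtype.
ages = pd.Series([28, 34, 29], index=["tom", "jack", "steve"], name="Age")

# A DataFrame: two-dimensional, labeled, and each column may have its own dtype.
people = pd.DataFrame(
    {"Name": ["Tom", "Jack", "Steve"], "Age": [28, 34, 29]},
    index=["tom", "jack", "steve"],
)

print(ages.ndim, people.ndim)  # number of dimensions of each structure
print(people.dtypes)           # one dtype per column
print(type(people["Age"]))     # each column of a DataFrame is itself a Series
```

A useful way to think about it: a DataFrame behaves like a collection of Series that share the same index.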

Things that pandas can do:

Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data

Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects

Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations

Powerful, flexible group-by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data

Easy conversion of ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects

Intelligent label-based slicing, fancy indexing, and subsetting of large data sets

Intuitive merging and joining data sets

Flexible reshaping and pivoting of data sets

Hierarchical labeling of axes

Robust IO tools for loading data from flat files (CSV and delimited)

Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting, and lagging.
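A few of the features listed above can be sketched in a handful of lines. This is only an illustration under assumed data: the labels, regions, and sale figures are made up for the example.

```python
import pandas as pd

# Automatic alignment: values are matched by label, not by position.
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])
total = s1 + s2  # labels "a" and "d" have no partner, so they become NaN

# Missing data handling: count the gaps, then fill them.
n_missing = total.isna().sum()
filled = total.fillna(0.0)

# Split-apply-combine: total sale per region (hypothetical figures).
sales = pd.DataFrame(
    {"Region": ["West", "North", "West"], "Sale": [2500, 3096, 1200]}
)
per_region = sales.groupby("Region")["Sale"].sum()

print(total, filled, per_region, sep="\n\n")
```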

“Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.”

How to install Pandas?

Installing Pandas using Anaconda

The easiest way to install pandas is to install it as part of the Anaconda distribution, a cross-platform distribution for data analysis and scientific computing. This is the recommended installation method for most users.

To install this package with conda, run the following command in a terminal (or in the Anaconda Prompt on Windows):

conda install -c anaconda pandas

Installing Pandas using pip

If you have Python and pip already installed on a system, then the installation of Pandas is very easy.

Install it using this command:

pip install pandas
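Whichever method you used, a quick way to check that the installation worked is to import the package and print its version. The exact version string will depend on your environment.

```python
import pandas as pd

# If the import succeeds, pandas is installed; the version string shows which release.
version = pd.__version__
print(version)
```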

Pandas data frame representation

A pandas DataFrame looks like a table: every row has an index label, and every column has a name.

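How To Create a Pandas DataFrame?

A pandas DataFrame can be created using the following constructor:

pd.DataFrame(data, index=row_labels, columns=column_labels)

Here data holds the values, index supplies the row labels, and columns supplies the column labels. Both index and columns are optional.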

A pandas DataFrame can also be created using various inputs like

Lists
dict
Series
Numpy ndarrays
Another DataFrame
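The list-based and dict-based inputs are covered in the sections that follow. For the remaining inputs, here is a small sketch; the variable names and values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# From a NumPy ndarray: each row of the array becomes a row of the DataFrame.
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_from_array = pd.DataFrame(arr, columns=["x", "y"])

# From a dict of Series: the Series are aligned on their index labels,
# and labels missing from one Series produce NaN in that column.
s_one = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s_two = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
df_from_series = pd.DataFrame({"one": s_one, "two": s_two})

# From another DataFrame: copy=True asks for an independent copy of the data.
df_copy = pd.DataFrame(df_from_array, copy=True)

print(df_from_array, df_from_series, df_copy, sep="\n\n")
```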

Creating an Empty DataFrame

The most basic DataFrame you can create is an empty one. This is how you do it.

Example:

import pandas as pd
df = pd.DataFrame()
print(df)

Output:

Empty DataFrame
Columns: []
Index: []

Create a DataFrame from Lists

The DataFrame can be created using a single list or a list of lists.

Example:

import pandas as pd
# Each inner list is one row; columns= names the two fields
data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)

Output:

     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13
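The single-list case works the same way, except that the list becomes one column. A minimal sketch, with an illustrative column name:

```python
import pandas as pd

# A single flat list becomes one column; each element is one row.
ages = [10, 12, 13]
df = pd.DataFrame(ages, columns=["Age"])
print(df)
```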

Create a DataFrame from Dict of ndarrays / Lists

All the ndarrays must be of the same length. If the index is passed, then the length of the index should equal the length of the arrays.

If no index is passed, then by default the index will be range(n), where n is the array length.

Example:

import pandas as pd
# Dict keys become column names; each list holds one column's values
data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
print(df)

Output:

        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42
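The two index rules above can be checked directly. The sketch below reuses the same data; the flag name mismatch_rejected is just a helper introduced for this example.

```python
import pandas as pd

data = {"Name": ["Tom", "Jack", "Steve", "Ricky"], "Age": [28, 34, 29, 42]}

# No index passed: pandas generates the labels 0, 1, ..., n-1.
df_default = pd.DataFrame(data)
print(df_default.index)

# An index whose length does not match the arrays is rejected.
mismatch_rejected = False
try:
    pd.DataFrame(data, index=["rank1", "rank2"])
except ValueError as err:
    mismatch_rejected = True
    print("ValueError:", err)
```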

Create a DataFrame from List of Dicts

A list of Dictionaries can be passed as input data to create a DataFrame. The dictionary keys are by default taken as column names.

The following example shows how to create a DataFrame by passing a list of dictionaries.

Example:

import pandas as pd
# Each dict is one row; the second row has an extra key 'c'
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

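Output:

   a   b     c
0  1   2   NaN
1  5  10  20.0

Note that NaN (Not a Number) is filled in wherever a value is missing. Because NaN is a floating-point value, column c is stored as float, which is why 20 is displayed as 20.0.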
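Once NaN values are in a DataFrame, you will usually want to find or replace them. A short sketch using the same data; filling with 0 is an arbitrary choice made for the example.

```python
import pandas as pd

data = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
df = pd.DataFrame(data)

print(df.isna())       # True wherever a value is missing
print(df["c"].dtype)   # the column is stored as float so that it can hold NaN

filled = df.fillna(0)  # returns a new DataFrame with NaN replaced by 0
print(filled)
```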

How to select a specific column from the DataFrame

Consider the following DataFrame

       Age   Name  Sex
rank1   28    Tom    M
rank2   34   Jack    M
rank3   29  Steve    M
rank4   42  Ricky    M

Now we can access a specific column from the DataFrame by writing df['Name'], which returns the column as a Series. To get the result as a one-column DataFrame instead, pass a list of column names, df[['Name']], which displays as:

        Name
rank1    Tom
rank2   Jack
rank3  Steve
rank4  Ricky
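Here is a runnable sketch of column selection on the DataFrame above. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Age": [28, 34, 29, 42],
        "Name": ["Tom", "Jack", "Steve", "Ricky"],
        "Sex": ["M", "M", "M", "M"],
    },
    index=["rank1", "rank2", "rank3", "rank4"],
)

names = df["Name"]            # a single label returns a Series
names_frame = df[["Name"]]    # a list of labels returns a DataFrame
subset = df[["Name", "Age"]]  # several columns at once, in the order given

print(type(names).__name__, type(names_frame).__name__)
print(subset)
```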

How To Select an Index or Column From a Pandas DataFrame

In order to select rows in a DataFrame, we will use the following indexers:

iloc[]
loc[]

Example:

   one  three  two
a  1.0   10.0    1
b  2.0   20.0    2
c  3.0   30.0    3
d  NaN    NaN    4

.iloc[] is used to get rows (or columns) at particular integer positions. The end of a position slice is excluded:
df.iloc[0:2]

   one  three  two
a  1.0   10.0    1
b  2.0   20.0    2

.loc[] is used to get rows (or columns) with particular labels from the index. Selecting a single label returns that row as a Series:
df.loc['a']

one       1.0
three    10.0
two       1.0
Name: a, dtype: float64
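The sketch below rebuilds the example DataFrame and runs a few more selections with both indexers. The variable names are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "one": [1.0, 2.0, 3.0, np.nan],
        "three": [10.0, 20.0, 30.0, np.nan],
        "two": [1, 2, 3, 4],
    },
    index=["a", "b", "c", "d"],
)

first_two = df.iloc[0:2]     # positions 0 and 1; the end position is excluded
row_a = df.loc["a"]          # the row labeled "a", returned as a Series
a_to_c = df.loc["a":"c"]     # label slices include both endpoints
cell = df.loc["b", "three"]  # row label and column label give a single value
first_col = df.iloc[:, 0]    # all rows, first column by position

print(first_two, row_a, a_to_c, cell, first_col, sep="\n\n")
```

The difference in slicing is worth remembering: position slices with iloc stop before the end, while label slices with loc include it.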

How To Add an Index, Row or Column to a Pandas DataFrame

Now that you have learned how to select a value from a DataFrame, it’s time to get to the real work and add an index, row, or column to it!

Add New Column to DataFrame

Consider a new DataFrame with sales data from different regions. To start, we have data from two regions, West and North:

  Region  Company     Product      Month  Sale
0   West   Costco  Dinner_set  September  2500
1  North  Walmart     Grocery       July  3096

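Pandas allows you to add a new column, Purchase, to this DataFrame. Note that assign() returns a new DataFrame rather than modifying df, so assign the result back:

purchase = [3000, 4000]
df = df.assign(Purchase=purchase)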

And this is the resultant DataFrame

  Region  Company     Product      Month  Sale  Purchase
0   West   Costco  Dinner_set  September  2500      3000
1  North  Walmart     Grocery       July  3096      4000
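As a runnable sketch, the snippet below shows assign() next to plain bracket assignment, which adds the column in place. The Margin column is a hypothetical derived column introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Region": ["West", "North"],
        "Company": ["Costco", "Walmart"],
        "Product": ["Dinner_set", "Grocery"],
        "Month": ["September", "July"],
        "Sale": [2500, 3096],
    }
)

# assign() returns a new DataFrame and leaves df untouched.
with_purchase = df.assign(Purchase=[3000, 4000])
df_unchanged = "Purchase" not in df.columns

# Bracket assignment adds the column to df itself.
df["Purchase"] = [3000, 4000]

# A new column can also be computed from existing ones (hypothetical example).
df["Margin"] = df["Sale"] - df["Purchase"]
print(df)
```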

Add New Row to DataFrame

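This is a data dictionary with the values for one region, South, that we want to add to the DataFrame above. The data is a list containing one dictionary, whose keys are the column names and whose values are the entries for the new row. Wrapping it in a DataFrame and passing both to pd.concat() appends the row:

new_row = [{'Region': 'South', 'Company': 'D-mart', 'Product': 'Tables',
            'Month': 'December', 'Sale': 1500, 'Purchase': 3500}]
# ignore_index=True renumbers the rows 0, 1, 2
df = pd.concat([df, pd.DataFrame(new_row)], ignore_index=True)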

And this is the resultant DataFrame

  Region  Company     Product      Month  Sale  Purchase
0   West   Costco  Dinner_set  September  2500      3000
1  North  Walmart     Grocery       July  3096      4000
2  South   D-mart      Tables   December  1500      3500
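For a self-contained version of this step, here is a sketch that uses pd.concat(), one common way to append rows. It also shows what happens when ignore_index is left out. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Region": ["West", "North"],
        "Company": ["Costco", "Walmart"],
        "Product": ["Dinner_set", "Grocery"],
        "Month": ["September", "July"],
        "Sale": [2500, 3096],
        "Purchase": [3000, 4000],
    }
)
new_rows = pd.DataFrame(
    [
        {
            "Region": "South",
            "Company": "D-mart",
            "Product": "Tables",
            "Month": "December",
            "Sale": 1500,
            "Purchase": 3500,
        }
    ]
)

# ignore_index=True renumbers the rows 0, 1, 2 in the result.
combined = pd.concat([df, new_rows], ignore_index=True)

# Without it, each input keeps its own labels, so label 0 appears twice.
kept_labels = pd.concat([df, new_rows])

print(combined)
print(kept_labels.index.tolist())
```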

Add New Index to DataFrame

The index of a DataFrame holds a label for each row.

Now we will consider this DataFrame, and try to set a column as an index.

  Region  Company     Product      Month  Sale  Purchase
0   West   Costco  Dinner_set  September  2500      3000
1  North  Walmart     Grocery       July  3096      4000
2  South   D-mart      Tables   December  1500      3500

This is how we do it:
df = df.set_index('Region')

This is how the DataFrame looks after setting 'Region' as the index:

        Company     Product      Month  Sale  Purchase
Region
West     Costco  Dinner_set  September  2500      3000
North   Walmart     Grocery       July  3096      4000
South    D-mart      Tables   December  1500      3500

The column 'Region' is now the index of the DataFrame.
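A short sketch of the same step, trimmed to three columns to keep it small. Once Region is the index, rows can be looked up by region name with loc[].

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Region": ["West", "North", "South"],
        "Company": ["Costco", "Walmart", "D-mart"],
        "Sale": [2500, 3096, 1500],
    }
)

indexed = df.set_index("Region")  # returns a new DataFrame; df keeps its index

print(indexed.index.name)            # the index now carries the column's name
print("Region" in indexed.columns)   # Region is no longer a regular column
south = indexed.loc["South"]         # look up a row by its region label
print(south)
```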

Resetting the Index of Your DataFrame

When your index doesn’t look entirely the way you want it to, you can opt to reset it. You can easily do this with reset_index(), which moves the current index back into a regular column and restores the default integer index.
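As a minimal sketch, assuming a DataFrame indexed by Region as in the previous section (trimmed to two columns here):

```python
import pandas as pd

indexed = pd.DataFrame(
    {
        "Region": ["West", "North", "South"],
        "Sale": [2500, 3096, 1500],
    }
).set_index("Region")

# Region becomes a regular column again and the rows are numbered 0, 1, 2.
restored = indexed.reset_index()

# drop=True discards the old index instead of keeping it as a column.
dropped = indexed.reset_index(drop=True)

print(restored)
print(dropped)
```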

How to Delete Rows or Columns From a Pandas Data Frame

Now that you have seen how to select and add indices, rows, and columns in your DataFrame, it is time to see how to delete them.

Consider a DataFrame

  Region  Company     Product      Month  Sale  Purchase
0   West   Costco  Dinner_set  September  2500      3000
1  North  Walmart     Grocery       July  3096      4000
2  South   D-mart      Tables   December  1500      3500

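Dropping columns by column name

To get rid of columns from your DataFrame, you can use the drop() method. Pass the name through the columns parameter (or use axis=1); otherwise drop() looks for a row with that label:
df = df.drop(columns='Purchase')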

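Dropping rows by index label

To get rid of a row from your DataFrame, you can also use the drop() method, this time with the index label of the row:
df.drop(0, inplace=True)

We use inplace=True if we want to modify the DataFrame directly. In that case drop() returns None, so do not assign the result back to df. Note that the row label here is the integer 0, not the string '0'.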
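Both kinds of drop can be tried in one short sketch. The DataFrame is trimmed to three columns, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Region": ["West", "North", "South"],
        "Sale": [2500, 3096, 1500],
        "Purchase": [3000, 4000, 3500],
    }
)

# Without inplace, drop() returns a new DataFrame and leaves df as it was.
no_purchase = df.drop(columns="Purchase")
no_first_row = df.drop(index=0)

# With inplace=True, df itself is modified and the call returns None.
result = df.drop(index=0, inplace=True)

print(no_purchase)
print(no_first_row)
print(result)
print(df)
```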

We will discuss the rest of the Pandas functions and other features in the next article.
