Pandas is a popular Python package for data science, and with good reason: it offers powerful, expressive, and flexible data structures that make data manipulation and analysis easy. The DataFrame is one of these structures.
What is Pandas?
Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open-source data analysis & data manipulation tool available in any language.
Pandas is well-suited for many different kinds of data:
- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time-series data
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational/statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure
Pandas Data Structures
There are two main types of data structures in pandas:
- Series – 1D labeled, homogeneously-typed array
- DataFrame – general 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns
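As a minimal sketch of the two structures (the labels and values below are made up purely for illustration):

```python
import pandas as pd

# A Series: one-dimensional, labeled, with a single dtype
ages = pd.Series([28, 34, 29], index=["tom", "jack", "steve"], name="Age")

# A DataFrame: two-dimensional, with labeled rows and columns;
# each column may have its own dtype
people = pd.DataFrame(
    {"Age": [28, 34, 29], "City": ["Pune", "Delhi", "Goa"]},
    index=["tom", "jack", "steve"],
)

print(ages.ndim, people.ndim)  # number of dimensions of each object
print(people.dtypes)           # one dtype per column
```

Each column of a DataFrame is itself a Series, so the two structures share most of their behavior.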
Things that pandas can do:
- Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
- Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
- Powerful, flexible group by functionality to perform split apply combine operations on data sets, for both aggregating and transforming data
- Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- Intuitive merging and joining data sets
- Flexible reshaping and pivoting of data sets
- Hierarchical labeling of axes
- Robust IO tools for loading data from flat files (CSV and delimited)
- Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting, and lagging.
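Two of the features listed above, automatic alignment and group by, can be sketched as follows. The data is invented for illustration and is not taken from a real data set:

```python
import pandas as pd

# Automatic alignment: values are matched by label, not by position
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])
total = s1 + s2  # labels present in only one Series produce NaN

# Split-apply-combine: total sale per region
sales = pd.DataFrame({
    "Region": ["West", "North", "West", "North"],
    "Sale": [2500, 3096, 1200, 900],
})
per_region = sales.groupby("Region")["Sale"].sum()

print(total)
print(per_region)
```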
“Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.”
How to install Pandas?
Installing Pandas using Anaconda
The easiest way to install pandas is to install it as part of the Anaconda distribution, a cross-platform distribution for data analysis and scientific computing. This is the recommended installation method for most users.
To install the package with conda, run the following command in a terminal or Anaconda Prompt (inside a Jupyter Notebook cell, prefix it with %):
conda install -c anaconda pandas
Installing Pandas using pip
If you have Python and pip already installed on a system, then the installation of Pandas is very easy.
Install it using this command:
pip install pandas
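Whichever method you use, a quick way to check that the installation worked is to import the package and print its version. The version you see depends on your environment:

```python
import pandas as pd

# Prints the installed pandas version if the import succeeded
print(pd.__version__)
```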
Pandas data frame representation
This is how a pandas data frame looks:
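As a rough sketch, a DataFrame is made up of row labels (the index), column labels, and the values in between. The small example below uses made-up data to show each part:

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Tom", "Jack"], "Age": [28, 34]},
    index=["rank1", "rank2"],
)

print(df)             # the tabular view
print(df.index)       # row labels
print(df.columns)     # column labels
print(df.shape)       # (number of rows, number of columns)
print(df.to_numpy())  # the underlying values as a NumPy array
```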
How To Create a Pandas DataFrame?
A pandas DataFrame can be created using the following constructor, where index and columns are optional sequences of row labels and column labels:
pd.DataFrame(data, index=row_labels, columns=column_labels)
A pandas DataFrame can also be created using various inputs like:
- Lists
- dict
- Series
- NumPy ndarrays
- Another DataFrame
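The last three inputs in this list can be sketched as below. The variable names and values are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# From a dict of Series: rows are aligned on the union of the Series indexes
one = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
two = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
from_series = pd.DataFrame({"one": one, "two": two})

# From a 2D NumPy ndarray, with column labels supplied explicitly
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

# From another DataFrame: copy=True gives an independent copy
from_frame = pd.DataFrame(from_array, copy=True)

print(from_series)
print(from_array)
```

Note that row "d" has no value in the Series named one, so pandas fills that cell with NaN.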
Creating an Empty DataFrame.
The most basic DataFrame you can create is an empty one. This is how you do it.
Example:
import pandas as pd
df = pd.DataFrame()
print(df)
Output:
Empty DataFrame
Columns: []
Index: []
Create a DataFrame from Lists.
The DataFrame can be created using a single list or a list of lists.
Example:
import pandas as pd
data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)
Output:
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13
Create a DataFrame from Dict of ndarrays / Lists
All the ndarrays must be of the same length. If the index is passed, then the length of the index should equal the length of the arrays.
If no index is passed, then by default, the index will be a range(n), where n is the array length.
Example:
import pandas as pd
data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
print(df)
Output:
        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42
Create a DataFrame from List of Dicts
A list of dictionaries can be passed as input data to create a DataFrame. The dictionary keys are taken as column names by default.
The following example shows how to create a DataFrame by passing a list of dictionaries.
Example:
import pandas as pd
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)
Output:
   a   b     c
0  1   2   NaN
1  5  10  20.0
Note that NaN (Not a Number) is filled in wherever a value is missing.
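As a small sketch of how such missing values can be inspected and replaced, the example below reuses the same list of dictionaries. Filling with 0 is just one possible choice, assumed here for illustration:

```python
import pandas as pd

data = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
df = pd.DataFrame(data)

print(df.isna())       # True wherever a value is missing
filled = df.fillna(0)  # returns a new DataFrame with NaN replaced by 0
print(filled)
```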
How to select a specific column from the DataFrame
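Consider the following DataFrame:
       Age   Name Sex
rank1   28    Tom   M
rank2   34   Jack   M
rank3   29  Steve   M
rank4   42  Ricky   M
We can access a specific column by indexing the DataFrame with the column name: df['Name']
The result is a Series holding that column, labeled by the same index (use df[['Name']] to get a one-column DataFrame instead). When printed, pandas also shows the Series name and dtype underneath the values:
rank1      Tom
rank2     Jack
rank3    Steve
rank4    Ricky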
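The difference between selecting with a single label and with a list of labels can be sketched like this, rebuilding the DataFrame above so that the example runs on its own:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Age": [28, 34, 29, 42],
        "Name": ["Tom", "Jack", "Steve", "Ricky"],
        "Sex": ["M", "M", "M", "M"],
    },
    index=["rank1", "rank2", "rank3", "rank4"],
)

names = df["Name"]            # single label: returns a Series
names_frame = df[["Name"]]    # list of labels: returns a DataFrame
subset = df[["Name", "Age"]]  # several columns at once, in the order given

print(type(names).__name__, type(names_frame).__name__)
print(subset)
```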
How To Select an Index or Column From a Pandas DataFrame
In order to select rows in a DataFrame, we will use the following indexers:
- iloc[]
- loc[]
Example:
   one  three  two
a  1.0   10.0    1
b  2.0   20.0    2
c  3.0   30.0    3
d  NaN    NaN    4
.iloc[] is used to get rows (or columns) at particular integer positions:
df.iloc[0:2]
   one  three  two
a  1.0   10.0    1
b  2.0   20.0    2
.loc[] is used to get rows (or columns) with particular labels from the index:
df.loc['a']
one       1.0
three    10.0
two       1.0
Name: a, dtype: float64
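A runnable sketch of both indexers is shown below. It rebuilds the example DataFrame, and the extra selections (label slice, single cell, column by position) are additions for illustration:

```python
import pandas as pd

nan = float("nan")
df = pd.DataFrame(
    {
        "one": [1.0, 2.0, 3.0, nan],
        "three": [10.0, 20.0, 30.0, nan],
        "two": [1, 2, 3, 4],
    },
    index=["a", "b", "c", "d"],
)

first_two = df.iloc[0:2]     # positions 0 and 1; the end position is excluded
row_a = df.loc["a"]          # one row by label, returned as a Series
a_to_c = df.loc["a":"c"]     # label slices include both endpoints
cell = df.loc["b", "three"]  # a single value by row label and column label
first_col = df.iloc[:, 0]    # first column by position

print(first_two)
print(row_a)
```

The main thing to remember is that iloc slices exclude the end position, like ordinary Python slices, while loc slices include the end label.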
How To Add an Index, Row or Column to a Pandas DataFrame
Now that you have learned how to select a value from a DataFrame, it’s time to get to the real work and add an index, row, or column to it!
Add New Column to DataFrame
Consider a new DataFrame with sales data from different regions. It starts with rows for the West and North regions; a row for South is added in the next section.
  Region  Company     Product      Month  Sale
0   West   Costco  Dinner_set  September  2500
1  North  Walmart     Grocery       July  3096
Pandas allows you to add a new column, Purchase, to this DataFrame. Note that assign() returns a new DataFrame, so the result is assigned back to df:
purchase = [3000, 4000]
df = df.assign(Purchase=purchase)
And this is the resultant DataFrame:
  Region  Company     Product      Month  Sale  Purchase
0   West   Costco  Dinner_set  September  2500      3000
1  North  Walmart     Grocery       July  3096      4000
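For comparison, here is a sketch of two common ways to add a column, using the same sales data. The Margin column is a hypothetical example and not part of the original data:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["West", "North"],
    "Company": ["Costco", "Walmart"],
    "Product": ["Dinner_set", "Grocery"],
    "Month": ["September", "July"],
    "Sale": [2500, 3096],
})

# assign() returns a new DataFrame and leaves df unchanged
with_purchase = df.assign(Purchase=[3000, 4000])

# Bracket assignment adds the column to df itself
df["Purchase"] = [3000, 4000]

# A column can also be derived from existing columns (hypothetical example)
df["Margin"] = df["Sale"] - df["Purchase"]

print(df)
```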
Add New Row to DataFrame
The following is a record with the values for one region, South, that we want to add to the DataFrame above. The data is a list containing one dictionary, with the column names as keys and the corresponding values. Converting it to a DataFrame and concatenating it with pd.concat() adds it as a new row:
new_row = [{'Region': 'South', 'Company': 'D-mart', 'Product': 'Tables', 'Month': 'December', 'Sale': 1500, 'Purchase': 3500}]
df = pd.concat([df, pd.DataFrame(new_row)], ignore_index=True)
And this is the resultant DataFrame:
  Region  Company     Product      Month  Sale  Purchase
0   West   Costco  Dinner_set  September  2500      3000
1  North  Walmart     Grocery       July  3096      4000
2  South   D-mart      Tables   December  1500      3500
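A self-contained sketch of adding the South row is given below. It shows two approaches that exist in current pandas versions: pd.concat() and assignment to a new label with loc. Which one the original example relied on is not stated, so treat both as illustrations:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["West", "North"],
    "Company": ["Costco", "Walmart"],
    "Product": ["Dinner_set", "Grocery"],
    "Month": ["September", "July"],
    "Sale": [2500, 3096],
    "Purchase": [3000, 4000],
})

new_row = {
    "Region": "South", "Company": "D-mart", "Product": "Tables",
    "Month": "December", "Sale": 1500, "Purchase": 3500,
}

# Option 1: concatenate with a one-row DataFrame and renumber the index
appended = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

# Option 2: assign a list of values (in column order) to a new index label
df.loc[len(df)] = list(new_row.values())

print(appended)
```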
Add New Index to DataFrame
The index of a DataFrame holds a label for each row.
Now we will consider this DataFrame, and try to set a column as an index.
  Region  Company     Product      Month  Sale  Purchase
0   West   Costco  Dinner_set  September  2500      3000
1  North  Walmart     Grocery       July  3096      4000
2  South   D-mart      Tables   December  1500      3500
This is how we do it:
df = df.set_index('Region')
This is how the DataFrame looks after setting "Region" as an index:
        Company     Product      Month  Sale  Purchase
Region
West     Costco  Dinner_set  September  2500      3000
North   Walmart     Grocery       July  3096      4000
South    D-mart      Tables   December  1500      3500
The column 'Region' is now the index of the DataFrame.
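A runnable sketch of setting the index, using a shortened version of the sales data, might look like this. The drop=False variant is an extra shown for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["West", "North", "South"],
    "Company": ["Costco", "Walmart", "D-mart"],
    "Sale": [2500, 3096, 1500],
})

indexed = df.set_index("Region")           # returns a new DataFrame; df is unchanged
west = indexed.loc["West"]                 # rows can now be looked up by region
kept = df.set_index("Region", drop=False)  # keep Region as a regular column too

print(indexed)
print(west["Company"])
```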
Resetting the Index of Your DataFrame
When your index doesn't look entirely the way you want it to, you can opt to reset it. You can easily do this with reset_index(), which moves the current index back into a regular column and restores the default integer index.
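As a minimal sketch, assuming a DataFrame that uses Region as its index:

```python
import pandas as pd

indexed = pd.DataFrame({
    "Region": ["West", "North", "South"],
    "Sale": [2500, 3096, 1500],
}).set_index("Region")

restored = indexed.reset_index()          # Region becomes a column again
dropped = indexed.reset_index(drop=True)  # discard the old index instead

print(restored)
print(dropped)
```

In both cases the new index is the default 0, 1, 2, ... and the original DataFrame is left unchanged.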
How to Delete Rows or Columns From a Pandas Data Frame
Now that you have seen how to select and add indices, rows, and columns to your DataFrame, it's time to see how to delete them.
Consider the following DataFrame:
  Region  Company     Product      Month  Sale  Purchase
0   West   Costco  Dinner_set  September  2500      3000
1  North  Walmart     Grocery       July  3096      4000
2  South   D-mart      Tables   December  1500      3500
Dropping columns with the column name
To get rid of a column from your DataFrame, you can use the drop() method with the columns argument:
df = df.drop(columns='Purchase')
Dropping Rows by index label
To get rid of a row from your DataFrame, pass its index label to the drop() method:
df.drop(0, inplace=True)
We use inplace=True if we want to modify the DataFrame directly. In that case drop() returns None, so the result should not be assigned back to df.
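The different ways of calling drop() can be sketched as follows, rebuilding the sales DataFrame so the example is self-contained:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["West", "North", "South"],
    "Company": ["Costco", "Walmart", "D-mart"],
    "Product": ["Dinner_set", "Grocery", "Tables"],
    "Month": ["September", "July", "December"],
    "Sale": [2500, 3096, 1500],
    "Purchase": [3000, 4000, 3500],
})

no_purchase = df.drop(columns="Purchase")           # drop a column by name
no_first = df.drop(index=0)                         # drop a row by index label
trimmed = df.drop(index=[0, 1], columns=["Month"])  # several rows and columns at once

# With inplace=True, df itself is modified and the call returns None
result = df.drop(0, inplace=True)

print(no_purchase.columns.tolist())
print(df)
```

Without inplace=True, drop() leaves the original DataFrame untouched and returns a modified copy, which is usually the easier behavior to reason about.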
We will discuss the rest of the Pandas functions and other features in the next article.