Data Reading and Data Inspection Using Pandas

In the previous article, we discussed what pandas is, its importance in data science, how to install it, and how to perform basic operations such as adding and deleting indexes, rows, and columns in a DataFrame. Now we will dive deeper into practical uses of pandas: reading data and inspecting it.

As a data scientist or an analyst, you’ll probably come across many file types to import and use in your Python scripts. Some analysts use Microsoft Excel, but it limits what you can do with large data imports. A better option is pandas, a powerful data analysis toolkit that is much more intuitive for a data scientist.

What file formats can pandas use?

Python can handle virtually any data file format, far more than Microsoft Excel can. That’s the strength of Python: it’s open source, and for almost any format there is probably a library that can read it, so you get a vastly more compatible system.

These are the most common data formats you will come across:

  • Comma-separated values (CSV)
  • XLSX
  • JSON
  • XML
  • HTML
  • Images
  • PDF
  • DOCX
  • SQL

How to read and write tabular data?

Now we will learn to read and write data with pandas. We will use the read_csv() function and the .to_csv() method.

A comma-separated values (CSV) file is a plaintext file with a .csv extension that holds tabular data. This is one of the most popular file formats for storing large amounts of data. Each row of the CSV file represents a single table row. The values in the same row are by default separated with commas, but you could change the separator to a semicolon, tab, space, or some other character.
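
As a minimal sketch of changing the separator, the example below writes a small table with semicolons and reads it back. The column names, the values, and the file name example_semicolon.csv are made up for illustration and are not part of the data used in this article.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({"city": ["Pune", "Lima"], "temp_c": [31.5, 19.0]})

# Write into a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_semicolon.csv")

    # sep=";" replaces the default comma; index=False skips the row labels
    df.to_csv(path, sep=";", index=False)

    # Peek at the header line to confirm which separator was written
    with open(path) as f:
        first_line = f.readline().strip()

    # The same separator must be given when reading the file back
    df_back = pd.read_csv(path, sep=";")

print(first_line)
print(df_back)
```

If the sep argument is left out when reading, pandas looks for commas and would load each line as a single column.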

Write a CSV File

You can save your Pandas DataFrame as a CSV file with .to_csv():

df.to_csv('data.csv')

That’s it! You’ve created the file data.csv in your current working directory.

Read a CSV File

Once your data is saved in a CSV file, you’ll likely want to load and use it from time to time. You can do that with the Pandas read_csv() function:

df = pd.read_csv('data.csv', index_col=0)

df

COUNTRY        POP     AREA       CONT     IND_DAY
China      1398.72  9596.96       Asia         NaN
India      1351.16  3287.26       Asia  1947-08-15
US          329.74  9833.52  N.America  1776-07-04
Indonesia   268.07  1910.93       Asia  1945-08-17
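
The DataFrame df is used above without being created, so here is a sketch that rebuilds it from the values in the table and shows what index_col=0 does. Treating COUNTRY as the name of the index is an assumption based on how the table is printed, and the example keeps the CSV text in memory rather than writing data.csv.

```python
import io

import pandas as pd

# Values copied from the table above; using COUNTRY as the index name
# is an assumption based on how the table is printed
df = pd.DataFrame(
    {
        "POP": [1398.72, 1351.16, 329.74, 268.07],
        "AREA": [9596.96, 3287.26, 9833.52, 1910.93],
        "CONT": ["Asia", "Asia", "N.America", "Asia"],
        "IND_DAY": [None, "1947-08-15", "1776-07-04", "1945-08-17"],
    },
    index=pd.Index(["China", "India", "US", "Indonesia"], name="COUNTRY"),
)

# With no path, to_csv() returns the CSV text instead of writing a file
csv_text = df.to_csv()

# Without index_col, the row labels come back as an ordinary column
plain = pd.read_csv(io.StringIO(csv_text))

# index_col=0 turns the first column back into the row index
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())
print(indexed.index.tolist())
```

The missing independence day for China is written as an empty field and is read back as NaN.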

Write an Excel File

You can save your Pandas DataFrame as an Excel file with .to_excel():

df.to_excel('data.xlsx')

Read an Excel File

You can do that with the Pandas read_excel() function:

df = pd.read_excel('data.xlsx', index_col=0)

Write a JSON File

You can save your Pandas DataFrame as a JSON file with .to_json():

df.to_json('data-index.json', orient='index')

Read a JSON File

You can do that with the Pandas read_json() function:

df = pd.read_json('data-index.json', orient='index')
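
To make the role of orient='index' concrete, here is a small round-trip sketch. It uses a two-row subset of the country table and keeps the JSON text in memory instead of writing data-index.json; both choices are for illustration only.

```python
import io

import pandas as pd

# Two rows taken from the country table, for illustration only
df = pd.DataFrame(
    {"POP": [1398.72, 1351.16], "CONT": ["Asia", "Asia"]},
    index=["China", "India"],
)

# orient="index" stores one JSON object per row, keyed by the row label
json_text = df.to_json(orient="index")
print(json_text)

# Reading with the same orient turns the row labels back into the index
restored = pd.read_json(io.StringIO(json_text), orient="index")
print(restored)
```

Reading the same text with a different orient would generally give a different layout, so it is safest to use the same value on both sides.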

Write Files

Series and DataFrame objects have methods that enable writing data and labels to the clipboard or files. They’re named with the pattern .to_<file-type>(), where <file-type> is the type of the target file.

You’ve learned about .to_csv() and .to_excel(), but there are others, including:

  • .to_json()
  • .to_html()
  • .to_sql()
  • .to_pickle()

There are still more file types that you can write to, so this list is not exhaustive.
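
As a rough illustration of the .to_<file-type>() pattern, the sketch below calls three of these methods without a target path, in which case they return the formatted text instead of creating a file. The sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({"name": ["Tom", "James"], "age": [25, 26]})

# Called without a target, these writers return text instead of
# creating a file, which is handy for a quick look at each format
csv_text = df.to_csv(index=False)
json_text = df.to_json(orient="records")
html_text = df.to_html(index=False)

print(csv_text)
print(json_text)
print(html_text)
```

Writers such as .to_sql() and .to_pickle() work differently because they need a database connection or a file to write to.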

Read Files

Pandas functions for reading the contents of files are named using the pattern read_<file-type>(), where <file-type> indicates the type of the file to read. They are called on the pandas module itself, for example pd.read_csv(). You’ve already seen the Pandas read_csv() and read_excel() functions. Here are a few others:

  • read_json()
  • read_html()
  • read_sql()
  • read_pickle()

These functions have a parameter that specifies the target file path. It can be any valid string that represents the path, either on a local machine or in a URL. Other objects are also acceptable depending on the file type.
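
The following sketch shows three kinds of input that read_csv() accepts: a path string, a pathlib.Path object, and an already opened file handle. The file name people.csv and the sample data are assumptions made for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({"name": ["Tom", "James"], "age": [25, 26]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # A pathlib.Path works wherever a path string is accepted
    path = Path(tmp_dir) / "people.csv"
    df.to_csv(path, index=False)

    from_string = pd.read_csv(str(path))  # plain string path
    from_path = pd.read_csv(path)         # Path object

    # An already opened file handle is accepted as well
    with open(path, newline="") as handle:
        from_handle = pd.read_csv(handle)

# All three calls load the same table
print(from_string.equals(from_path) and from_path.equals(from_handle))
```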

How to view and inspect data in a DataFrame?

The head() and tail() methods are useful for checking the contents of a pandas.DataFrame or pandas.Series that has many rows and columns.

We will use the Iris data set for this tutorial. It is available on Kaggle at https://www.kaggle.com/uciml/iris, and the code below loads the same data set through the seaborn library.

import pandas as pd
import seaborn as sns

df = sns.load_dataset("iris")

Get first n rows of DataFrame: head()

The head() method returns the first n rows.

print(df.head(5))

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

Get last n rows of DataFrame: tail()

The tail() method returns the last n rows.

print(df.tail(5))

     sepal_length  sepal_width  petal_length  petal_width    species
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica
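
For a version that runs without downloading anything, here is a short sketch of head() and tail() on a hypothetical ten-row DataFrame. It also shows the default value of n and what a negative n does.

```python
import pandas as pd

# Hypothetical ten-row frame so the row labels are easy to follow
df = pd.DataFrame({"value": range(10, 20)})

first_three = df.head(3)   # rows labelled 0, 1, 2
last_three = df.tail(3)    # rows labelled 7, 8, 9
default_head = df.head()   # n defaults to 5

# A negative n means "all rows except": head(-3) drops the last three
all_but_last_three = df.head(-3)

print(first_three.index.tolist())
print(last_three.index.tolist())
print(len(default_head), len(all_but_last_three))
```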

The pandas attributes .shape, .size, and .ndim return the shape, the number of elements, and the number of dimensions of a DataFrame or Series.

Create a DataFrame

import pandas as pd

# Build each column as a Series, keyed by its column name
d = {
    'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve']),
    'Age': pd.Series([25, 26, 25, 23, 30]),
    'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20]),
}

# Create a DataFrame
df = pd.DataFrame(d)
print(df)

Output:

    Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
2  Ricky   25    3.98
3    Vin   23    2.56
4  Steve   30    3.20

.shape returns a tuple (a, b) that represents the dimensionality of the DataFrame, where a is the number of rows and b is the number of columns.

df.shape

Output:

(5, 3)  # 5 rows and 3 columns

.size Returns the number of elements in the DataFrame.

df.size

Output: 

15  # total number of elements: 5 rows times 3 columns

.ndim Returns the number of dimensions of the object. By definition, DataFrame is a 2D object

df.ndim

Output: 

2  # a DataFrame has two dimensions: rows and columns
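
The three attributes can be checked side by side. This sketch rebuilds the same five-row DataFrame from plain lists and also looks at a single column, which is a Series and therefore one-dimensional.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Tom", "James", "Ricky", "Vin", "Steve"],
        "Age": [25, 26, 25, 23, 30],
        "Rating": [4.23, 3.24, 3.98, 2.56, 3.20],
    }
)

rows, cols = df.shape
print(df.shape, df.size, df.ndim)

# For a DataFrame, size is always rows * columns
print(df.size == rows * cols)

# A single column is a Series: one dimension, one-element shape tuple
ages = df["Age"]
print(ages.shape, ages.size, ages.ndim)
```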

The pandas .info() method prints a concise summary of a DataFrame, including the index dtype, the column dtypes, the number of non-null values in each column, and the memory usage.

Consider the following DataFrame df

   int_col text_col  float_col
0        1    alpha       0.00
1        2     beta       0.25
2        3    gamma       0.50
3        4    delta       0.75
4        5  Epsilon       1.00

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4          
Data columns (total 3 columns):        
float_col    5 non-null float64        
int_col      5 non-null int64          
text_col     5 non-null object         
dtypes: float64(1), int64(1), object(1)
memory usage: 192.0+ bytes             
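
Here is a sketch that builds the DataFrame shown above and calls info() on it. The exact layout of the summary depends on the pandas version, so the output may not match the listing above line for line. The buf argument is used to capture the summary as text.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {
        "int_col": [1, 2, 3, 4, 5],
        "text_col": ["alpha", "beta", "gamma", "delta", "Epsilon"],
        "float_col": [0.0, 0.25, 0.5, 0.75, 1.0],
    }
)

# info() prints to standard output and returns None
df.info()

# Pass a buffer to capture the summary as text instead of printing it
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()

# The per-column dtypes are also available directly
print(df.dtypes)
```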

The pandas .describe() method computes summary statistics for the DataFrame columns: the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum. By default it excludes text columns and summarizes only the numeric ones.

df.describe()

Output:

       float_col   int_col
count   5.000000  5.000000
mean    0.500000  3.000000
std     0.395285  1.581139
min     0.000000  1.000000
25%     0.250000  2.000000
50%     0.500000  3.000000
75%     0.750000  4.000000
max     1.000000  5.000000
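
The interquartile range (IQR) is not a separate row in this output, but it can be derived from the 25% and 75% rows. The sketch below does that and also shows include="all", which adds the text column to the summary.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "int_col": [1, 2, 3, 4, 5],
        "text_col": ["alpha", "beta", "gamma", "delta", "Epsilon"],
        "float_col": [0.0, 0.25, 0.5, 0.75, 1.0],
    }
)

# By default only the numeric columns are summarized
numeric_summary = df.describe()
print(numeric_summary.columns.tolist())

# IQR is the distance between the third and first quartiles
iqr = numeric_summary.loc["75%"] - numeric_summary.loc["25%"]
print(iqr)

# include="all" also reports count, unique, top and freq for text columns
full_summary = df.describe(include="all")
print(full_summary.loc["unique", "text_col"])
```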

The pandas .value_counts() method returns a Series containing the counts of unique values. The result is sorted in descending order, so the first element is the most frequently occurring value. NA values are excluded by default.

Consider a DataFrame

  Student
0   Harry
1    Mike
2  Arther
3   Harry
4  Arther

df['Student'].value_counts()

Output:

Harry     2
Arther    2
Mike      1
Name: Student, dtype: int64
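
As a small sketch of the options mentioned above, the example below counts names in a Series. The extra missing entry (None) is added only to show how dropna behaves and is not part of the original table.

```python
import pandas as pd

# The final None is a hypothetical missing entry, added for illustration
students = pd.Series(
    ["Harry", "Mike", "Arther", "Harry", "Arther", None], name="Student"
)

# Default: missing values are ignored, most frequent values come first
counts = students.value_counts()

# normalize=True returns fractions of the non-missing total
shares = students.value_counts(normalize=True)

# dropna=False keeps the missing value as its own entry
with_missing = students.value_counts(dropna=False)

print(counts)
print(shares)
print(with_missing)
```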

Pandas is a large library, and we will cover its components step by step. In the next article we will discuss data selection, data cleaning, filtering, sorting, group-by, and joining and combining data sets.
