Data wrangling in R programming

Datasets come in many shapes. One common difference is whether a dataset is wide or long, that is, whether it grows by adding columns or by adding rows. A dataset that records each attribute of a subject in its own column is called a wide dataset: as we add more columns, the dataset becomes wider. A dataset that records the data about a subject across multiple rows is called a long dataset, and it is usually more comfortable to manipulate in R.

Figure: wide vs long

The above figure shows the same dataset, religions classified by income, in both wide and long form. Now that we know what long and wide datasets are, we can use tools in R to convert a wide dataset to a long one and a long dataset to a wide one.
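To make the distinction concrete, here is a minimal sketch that builds the same small table in both shapes by hand, using only base R. The religions, income brackets, and counts are made-up illustration values, not the data from the figure.

```r
# Hypothetical values, for illustration only
wide <- data.frame(
  religion = c("Agnostic", "Atheist"),
  low      = c(27, 12),
  high     = c(34, 27)
)

# The same information in long form: one row per religion/income pair
long <- data.frame(
  religion = c("Agnostic", "Atheist", "Agnostic", "Atheist"),
  income   = c("low", "low", "high", "high"),
  freq     = c(27, 12, 34, 27)
)

dim(wide)  # 2 rows, 3 columns
dim(long)  # 4 rows, 3 columns
```

Adding another income bracket would add a column to the wide table but only more rows to the long one.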

Conversion of a wide dataset to a long one:

The gather() function in the 'tidyr' package makes a wide dataset long. The gather() function is based on the concept of keys and values. The key is the name of a variable, and the value is the data stored in that variable.

Figure: key and value

In this dataset, income acts as the key: it categorizes the income brackets of the different religions, and the frequency supplies the values for the income key.

Syntax:
gather(data, key, value, columns)

Parameters:
data: The name of the wide data frame.
key: The name to use for the key column in the long dataset.
value: The name to use for the value column in the long dataset.
columns: The columns of the wide dataset to include in, or (with a leading minus sign) exclude from, the gathering.

How to make wide datasets long?

# Make a wide dataset long with gather()

# Load the tidyverse

library(tidyverse)

# Read the dataset

sample_data1 <- read.csv("C:/Users/Admin/Desktop/pw.csv")

sample_data1

# Gather every column except religion into income/freq pairs

sample_data1_long <- gather(sample_data1, income, freq, -religion)

sample_data1_long
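Because the CSV path above is local to the author's machine, the following is a self-contained sketch of the same gather() call on a small inline data frame. The column names follow the example above; the religions and counts are made-up values.

```r
library(tidyr)

# Hypothetical wide data: one column per income bracket
wide <- data.frame(
  religion = c("Agnostic", "Atheist"),
  low      = c(27, 12),
  high     = c(34, 27)
)

# Gather every column except religion into income/freq pairs
long <- gather(wide, key = "income", value = "freq", -religion)

long
```

The result has one row per religion and income bracket. Recent tidyr releases mark gather() as superseded by pivot_longer(), although gather() still works.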

How to convert a long dataset to a wide dataset?

Sometimes we have to perform the reverse operation of gather(). The spread() function converts a long dataset to a wide one.

Syntax:
spread(data, key, value)

Parameters:
data: The name of the long data frame.
key: The column in the long dataset whose values become the new column names.
value: The column in the long dataset whose values fill the new columns.

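# Load the tidyverse

library(tidyverse)

# Read the long dataset

sample_data1 <- read.csv("C:/Users/Admin/Desktop/mexicanweatherrr.csv")

sample_data1

# Spread the element column into one column per element

sample_data_wide1 <- spread(sample_data1, element, value)

sample_data_wide1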
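As a runnable stand-in for the local CSV, here is a minimal sketch of spread() on an inline data frame. The element and value column names follow the example above; the date column and the readings are assumptions made up for illustration.

```r
library(tidyr)

# Hypothetical long data: one row per date/element pair
long <- data.frame(
  date    = c("2010-01-30", "2010-01-30", "2010-02-02", "2010-02-02"),
  element = c("tmax", "tmin", "tmax", "tmin"),
  value   = c(27.8, 14.5, 27.3, 14.4)
)

# Each distinct value of element becomes its own column
wide <- spread(long, key = element, value = value)

wide
```

The wide result has one row per date, with tmax and tmin as separate columns. Recent tidyr releases mark spread() as superseded by pivot_wider(), although spread() still works.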

Data wrangling with Tidyverse

The Tidyverse is a suite of integrated packages designed to work together to make common data science operations more user-friendly. The packages have functions for data wrangling, tidying, reading, parsing, and visualizing, among others. We will explore the basic syntax of these packages and specific functions for data wrangling with the 'dplyr' package, data tidying with the 'tidyr' package, and data visualization with the 'ggplot2' package.

Figure: the tidyverse website (https://hbctraining.github.io/Intro-to-R/img/tidyverse_website.png)

These packages share a common code style: snake_case formatting for all function names and arguments.

Adding the files to the working directory:

We will bring in a new file, with results from a differential expression analysis, to work with. Download the file to the data folder. It should then appear in the data folder in the RStudio "Files" tab.
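Once a file is in the data folder, it can also be checked and read from the R console. The sketch below uses a temporary folder as a stand-in for the project directory and writes a small file there so that the example runs anywhere. The file name and its columns are hypothetical, not the file used in this article.

```r
# Use a temporary folder as a stand-in for the project directory
project_dir <- tempfile("project_")
dir.create(file.path(project_dir, "data"), recursive = TRUE)

# Hypothetical results file, written here only so the example runs
results_path <- file.path(project_dir, "data", "example_results.csv")
write.csv(
  data.frame(id = c("a", "b"), score = c(0.01, 0.20)),
  results_path,
  row.names = FALSE
)

# Confirm the file is in the data folder, then read it
list.files(file.path(project_dir, "data"))
results <- read.csv(results_path)
```

In a real project, getwd() shows the current working directory and list.files("data") lists the contents of its data folder.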

Tidyverse basics:

The Tidyverse suite of packages introduces sets of data structures, functions, and operators that make working with data more intuitive, but slightly different from the way we do things in base R. There are two important new concepts we will focus on: pipes and tibbles.

Pipes

Stringing together commands in R can be quite daunting, and code with many nested functions can be confusing to read.

The pipe allows the output of one command to be used as the input to the next command, instead of nesting functions.

## A single command

sqrt(85)

## The base R method of running more than one command

round(sqrt(85), digits = 2)

## Running more than one command with piping (requires the tidyverse to be loaded)

sqrt(85) %>% round(digits = 2)
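The benefit of the pipe is easier to see with a longer chain. The sketch below loads magrittr, the package that provides %>% within the tidyverse, and compares a nested call with its piped equivalent. The input vector is made up for illustration.

```r
library(magrittr)  # provides the %>% operator

values <- c(4, 9, 16, 25)

# Nested form: has to be read from the inside out
nested <- round(mean(sqrt(values)), digits = 1)

# Piped form: reads left to right, one step at a time
piped <- values %>%
  sqrt() %>%
  mean() %>%
  round(digits = 1)

identical(nested, piped)  # TRUE
```

Both forms compute the same value. The piped version lists the steps in the order in which they run.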

Tibbles

A core component of the tidyverse is the tibble. Tibbles are a modern rework of the standard data.frame, with some internal improvements that make code more reliable. They are data frames, but they do not follow all of the same rules.
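Two of the rules that differ can be shown in a minimal sketch, assuming the tibble package is installed. The sample and count columns are made-up illustration data.

```r
library(tibble)

df  <- data.frame(sample = c("a", "b"), count = c(10, 20))
tbl <- as_tibble(df)

# Selecting one column with [ ] drops a data.frame to a plain vector...
class(df[, "count"])   # "numeric"

# ...but a tibble always stays a tibble
class(tbl[, "count"])  # "tbl_df" "tbl" "data.frame"

# Partial name matching with $ works on a data.frame, not on a tibble
df$cou                 # 10 20
tbl$cou                # NULL, with a warning
```

The stricter behaviour means that a misspelled or abbreviated column name raises a warning instead of silently matching another column.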

Questions

  1. What is Data wrangling with R?
  2. What is Tidyverse?
