A dataset can be laid out in more than one way. The two common layouts are wide and long, and they differ in whether the data grows by adding columns or by adding rows. A wide dataset stores each measurement of a subject in its own column, so adding more measurements makes the dataset wider. A long dataset stores the measurements of a subject in rows, and this layout is usually more comfortable to manipulate in R.
The above figure depicts the same dataset, religions classified by income, in both wide and long form. Now that we know what long and wide datasets are, we can use tools in R to convert a wide dataset to a long one and a long dataset to a wide one.
Conversion of a wide dataset to a long one:
The gather() function in the 'tidyr' package makes a wide dataset long. The gather() function is based on the concept of keys and values. The key is the name of a variable, and the value is the data stored under that variable.
In the religion and income dataset, income acts as the key because it names the income category for each religion, and the frequency provides the values for that key.
Syntax:
gather(data, key, value, columns)
Parameters:
data: The name of the wide dataset (data frame) to reshape.
key: The name to use for the key column in the long dataset.
value: The name to use for the value column in the long dataset.
columns: The columns of the wide dataset to include in, or exclude from (with a leading minus sign), the gathering.
How to make a wide dataset long?
# Make a wide dataset long with gather()
# Load the tidyverse (includes tidyr)
library(tidyverse)
# Read the wide dataset
sample_data1 <- read.csv("C:/Users/Admin/Desktop/pw.csv")
# Print the wide dataset
sample_data1
# Gather every column except religion into income (key) and freq (value)
sample_data1_long <- gather(sample_data1, income, freq, -religion)
# Print the long dataset
sample_data1_long
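The example above depends on a CSV file on the author's machine. As a minimal sketch that runs without that file, the same call can be tried on a small data frame built in place. The religion names, the column names low and high, and all counts below are made up for illustration and are not taken from pw.csv.

```r
library(tidyr)

# Hypothetical wide data: one row per religion, one column per income bracket
wide <- data.frame(
  religion = c("Agnostic", "Atheist", "Buddhist"),
  low      = c(27, 12, 27),
  high     = c(34, 27, 21)
)

# Gather every column except religion into income (key) / freq (value) pairs
long <- gather(wide, income, freq, -religion)
long

# Naming the columns to include is the other way to select them
long_included <- gather(wide, income, freq, low, high)
long_included
```

Excluding columns with a minus sign tends to be convenient when only a few identifier columns should stay in place, while listing the columns to include is clearer when only a few columns need gathering.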
How to convert a long dataset to a wide dataset?
Sometimes we need the reverse of the gather() operation. The spread() function converts a long dataset to a wide dataset.
Syntax:
spread(data, key, value)
Parameters:
data: The name of the long dataset (data frame) to reshape.
key: The column in the long dataset whose values become the new column names.
value: The column in the long dataset whose values fill the new columns.
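# Load the tidyverse (package names are case-sensitive)
library(tidyverse)
# Read the long dataset
sample_data1 <- read.csv("C:/Users/Admin/Desktop/mexicanweatherrr.csv")
sample_data1
# Spread the element column into one column per element, filled from value
sample_data_wide1 <- spread(sample_data1, element, value)
sample_data_wide1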
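As with gather(), the call above needs a local CSV file. The sketch below shows the same reshaping on a small data frame built in place. The column names day, element and value mirror the call above, but the rows and the element names tmax and tmin are assumptions made for illustration.

```r
library(tidyr)

# Hypothetical long data: one row per (day, element) measurement
long <- data.frame(
  day     = c(1, 1, 2, 2),
  element = c("tmax", "tmin", "tmax", "tmin"),
  value   = c(30.1, 14.5, 29.4, 13.8)
)

# Each distinct value of element becomes a column, filled from value
wide <- spread(long, element, value)
wide

# Gathering the new columns returns to the long layout (row order may differ)
back_to_long <- gather(wide, element, value, tmax, tmin)
back_to_long
```

The round trip illustrates that spread() and gather() are inverse operations on the layout: the same measurements are present in both forms, only arranged differently.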
Data wrangling with Tidyverse
The Tidyverse is a suite of integrated packages designed to work together to make common data science operations more user-friendly. The packages have functions for data wrangling, tidying, reading, parsing, and visualizing, among others. We will explore the basic syntax of these packages, along with specific functions for data wrangling with the 'dplyr' package, data tidying with the 'tidyr' package, and data visualization with the 'ggplot2' package.
These packages share the same code style: snake_case formatting for all function names and arguments.
Adding the files to the working directory:
We will work with a new file that contains results from a differential expression analysis. Download the file into the data folder. It should then appear in the data folder in the RStudio "Files" tab.
Tidyverse basics:
The tidyverse suite of packages introduces a set of data structures, functions, and operators that make working with data more intuitive, but slightly different from the way we do things in base R. There are two important new concepts we will focus on: pipes and tibbles.
Pipes
Stringing together commands in R can be quite daunting, and code with many nested functions can be confusing to read.
The pipe (%>%) passes the output of one command as the input to the next command, so we can avoid nesting functions.
## A single command
sqrt(85)
## Base R method of running more than one command (nested calls)
round(sqrt(85), digits = 2)
## Running more than one command with piping (requires the tidyverse to be loaded)
sqrt(85) %>% round(digits = 2)
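The benefit of the pipe becomes clearer with more than two steps. The sketch below compares a nested call with its piped equivalent. The vector x is a made-up example, and magrittr is loaded directly here only to keep the snippet self-contained (the %>% operator is also available once the tidyverse is loaded).

```r
library(magrittr)  # provides %>%

# Hypothetical values used only for illustration
x <- c(4, 9, 16, 25)

# Nested form: has to be read from the inside out
nested <- round(mean(sqrt(x)), digits = 1)

# Piped form: reads left to right, each result feeds the next call
piped <- x %>% sqrt() %>% mean() %>% round(digits = 1)

# Both forms compute the same thing
identical(nested, piped)
```

In the piped form each step sits in the order in which it runs, which is usually easier to follow and to extend with further steps.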
Tibbles
A core component of the tidyverse is the tibble. Tibbles are a modern rework of the standard data.frame, with some internal improvements that make code more reliable. They are data frames, but they do not follow all of the same rules.
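Two of the rules that differ can be seen in a short sketch. The data frame below, with its gene and count columns, is a made-up example. The behaviours shown (subsetting with [ and partial name matching with $) are two commonly cited differences, not a complete list.

```r
library(tibble)

# Hypothetical data used only for illustration
df  <- data.frame(gene = c("a", "b", "c"), count = c(10, 20, 30))
tbl <- as_tibble(df)

# A tibble is still a data frame
is.data.frame(tbl)

# Selecting one column with [ gives a plain vector for a data.frame ...
class(df[, "count"])
# ... but still a tibble for a tibble
class(tbl[, "count"])

# Partial column names work with $ on a data.frame ...
df$cou
# ... but a tibble returns NULL (and warns) instead of guessing
suppressWarnings(tbl$cou)
```

Keeping the result of [ as a tibble and refusing partial name matches makes the outcome of these operations more predictable, which is the kind of reliability the paragraph above refers to.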
Questions
- What is Data wrangling with R?
- What is Tidyverse?