1. What does the term “Big Data” mean?
Big Data refers to datasets so large and complex that a traditional relational database cannot handle them, so special tools and methods are needed to perform operations on them. Big data processing helps companies understand their business better: we can derive meaningful information from raw, unstructured data, and companies can make better business decisions based on data sets from previous periods.
2. What are the five V’s of Big Data?
They are Volume, Velocity, Variety, Veracity, and Value.
- Volume represents the amount of data, which grows at a high rate (for example, data volumes measured in petabytes).
- Velocity represents the rate at which data is generated and grows.
- Variety refers to the various data types and formats (for example, text, audio, video, etc.).
- Veracity refers to the uncertainty of the available data; a high volume of data often brings incompleteness and inconsistency.
- Value refers to turning data into value, for example by using it to generate revenue.
3. What is Hadoop?
Hadoop is an open-source framework used by professionals to analyze big data and derive information from it. Using simple programming models, it allows distributed processing of large data sets across clusters of computers, and it is designed to scale from a single server to thousands of machines. With Hadoop you can store, process, and analyze complex unstructured data.
4. What is HDFS?
HDFS stands for Hadoop Distributed File System. It is the default storage unit of Hadoop and can store different types of data in a distributed environment. One of the main advantages of HDFS is that it is highly fault-tolerant and designed to be deployed on low-cost hardware.
5. What are the two main components of HDFS architecture?
HDFS has a master-slave architecture and consists of the NameNode and DataNodes. The NameNode is the master node; it holds the metadata about all the HDFS data blocks, manages the file system namespace, and regulates access to files by clients. DataNodes are the slave nodes and are responsible for storing the actual data. Usually, a cluster has a single NameNode and a number of DataNodes. The sketch below shows how a client reads a file through this architecture.
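To make the client interaction concrete, here is a minimal sketch (not part of the original answer) that reads a file from HDFS with the Java FileSystem API; the path is hypothetical:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // The client contacts the NameNode (through the FileSystem API) to resolve
        // the file's block locations, then streams the blocks from DataNodes.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; replace with a file that exists in your cluster.
        Path path = new Path("/data/example/input.txt");
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```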
6. What is YARN in Hadoop?
YARN stands for Yet Another Resource Negotiator. It manages cluster resources and provides an execution environment for processes. YARN has two main components: the ResourceManager, which allocates resources to the NodeManagers based on their needs, and the NodeManager, which executes tasks on every DataNode.
7. What is commodity hardware?
Commodity hardware refers to the minimal, inexpensive hardware resources needed to run the Apache Hadoop framework; any hardware that supports Hadoop's minimum requirements qualifies.
8. What is FSCK?
FSCK (Filesystem Check) is a Hadoop command that produces a summary report on the state of HDFS. It checks for errors but does not correct them. The command can be executed on the whole file system or on a subset of files.
9. What does the JPS command do in Hadoop?
The JPS command is used to check whether the Hadoop daemons, such as the NameNode, DataNode, ResourceManager, and NodeManager, are up and running.
10. What are the common input formats in Hadoop?
Hadoop has three common input formats (a job-setup sketch follows the list):
- Text Input Format is the default input format in Hadoop; each line of a plain text file becomes a record.
- Sequence File Input Format is used to read Hadoop sequence files, binary files that store key-value pairs.
- Key-Value Text Input Format is used for plain text files broken into lines, where each line is split into a key and a value by a separator (a tab by default).
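As a minimal illustration of how an input format is chosen, here is a hypothetical job-setup sketch (the class name FormatDemo and the input path argument are placeholders) using the standard MapReduce API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class FormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-demo");
        job.setJarByClass(FormatDemo.class);

        // TextInputFormat is the default, so it does not need to be set explicitly.
        // To read tab-separated key-value lines instead, choose KeyValueTextInputFormat:
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // For binary sequence files, SequenceFileInputFormat.class would be set instead.

        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... mapper, reducer, and output settings would follow here.
    }
}
```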
11. How to handle missing values in Big Data?
Missing values have to be handled properly because they can lead to erroneous data and incorrect outcomes. Therefore, it is recommended to treat missing values before working on the datasets. If only a small number of values are missing, the affected rows can simply be dropped. If there is a bulk of missing values, the gaps need to be filled. There are different statistical ways to fill in missing values: regression imputation, k-nearest-neighbor (k-NN) imputation, stochastic regression imputation, approximate Bayesian bootstrap, and maximum likelihood estimation. The sketch below illustrates the two simplest options.
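This plain-Java sketch (the data is made up) shows dropping missing values when there are few of them, and filling the gaps otherwise; it uses mean imputation as the fill strategy, which is simpler than the statistical methods listed above and is used here only for illustration:

```java
import java.util.Arrays;

public class MissingValuesDemo {
    public static void main(String[] args) {
        // Made-up column with missing values encoded as NaN.
        double[] ratings = {4.8, Double.NaN, 4.2, 4.5, Double.NaN, 4.6};

        // Option 1: drop the missing values (reasonable when there are few of them).
        double[] dropped = Arrays.stream(ratings)
                .filter(v -> !Double.isNaN(v))
                .toArray();
        System.out.println("Dropped: " + Arrays.toString(dropped));

        // Option 2: fill the gaps, here with the column mean (simple imputation).
        double mean = Arrays.stream(dropped).average().orElse(0.0);
        double[] imputed = Arrays.stream(ratings)
                .map(v -> Double.isNaN(v) ? mean : v)
                .toArray();
        System.out.println("Imputed: " + Arrays.toString(imputed));
    }
}
```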
12. How does big data analysis influence increasing business revenue?
Big data analysis helps businesses differentiate themselves from competitors and increase revenue. Big data analytics can provide businesses with customized recommendations and suggestions and can help companies launch new products based on customer needs and preferences, which leads to more revenue. Revenue sometimes increases significantly, by 5-20%, after implementing big data analytics. Companies such as Walmart, LinkedIn, Facebook, Twitter, and Bank of America have used this approach.
13. What are the steps for a Big Data solution deployment?
Here are the steps to deploy a Big Data solution:
- Data Ingestion – extracting data from various sources, such as a CRM (Salesforce), an Enterprise Resource Planning system (SAP), an RDBMS (MySQL), documents, social media feeds, etc. Ingestion can be done with batch jobs or with real-time streaming; after extraction, the data is typically landed in HDFS.
- Data Storage – storing the extracted data, either in HDFS or in a NoSQL database.
- Data Processing – the final step in deploying a big data solution. One of the following frameworks can be used: Spark, MapReduce, Pig, etc. (a minimal Spark sketch follows this list).
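As a minimal illustration of the processing step (not part of the original answer), here is a Spark sketch in Java; the HDFS path and the eventType column are made up:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ProcessingStep {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("IngestedDataProcessing")
                .getOrCreate();

        // Read the data that the ingestion step landed in HDFS (hypothetical path).
        Dataset<Row> events = spark.read().json("hdfs:///data/ingested/events");

        // A simple processing step: count records per event type (hypothetical column).
        events.groupBy("eventType").count().show();

        spark.stop();
    }
}
```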
14. Which hardware configuration is the best for Hadoop jobs?
Dual-processor or dual-core machines with 4 to 8 GB of RAM and ECC memory work well for Hadoop operations, but the best configuration depends on the project-specific workflows.
15. What happens when two users try to access the same file in the HDFS?
HDFS supports exclusive writes only. When the first user requests the file for writing, the NameNode grants access to that user; the write request of the second user will be rejected.
16. What outlier detection techniques do you know?
Here are some common outlier detection methods (a z-score sketch follows the list):
- Extreme Value Analysis examines the statistical tails of the data distribution; for example, the 'z-score' method.
- Probabilistic and Statistical Models determine the 'unlikely instances' in the data; for example, Gaussian mixture models fitted with 'expectation-maximization'.
- The Linear Model method projects the data into lower dimensions and flags instances that fit the model poorly.
- The Information-Theoretic Models approach treats outliers as the data instances that increase the complexity of the dataset.
- The High-Dimensional Outlier Detection method identifies subspaces in which outliers stand out, based on distance measures in higher dimensions.
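To make the first method concrete, here is a small z-score sketch in plain Java; the sample values and the threshold are illustrative only:

```java
import java.util.Arrays;

public class ZScoreOutliers {
    public static void main(String[] args) {
        // Illustrative sample; the last value is an obvious outlier.
        double[] data = {10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0};

        double mean = Arrays.stream(data).average().orElse(0.0);
        double variance = Arrays.stream(data)
                .map(v -> (v - mean) * (v - mean))
                .average().orElse(0.0);
        double stdDev = Math.sqrt(variance);

        // A threshold of 3 is common for large samples; 2 is used here
        // because this sample is tiny.
        double threshold = 2.0;
        for (double v : data) {
            double z = (v - mean) / stdDev;
            if (Math.abs(z) > threshold) {
                System.out.printf("Outlier: %.1f (z = %.2f)%n", v, z);
            }
        }
    }
}
```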
17. What is Rack Awareness in Hadoop?
Rack awareness is an algorithm the NameNode uses to decide how data blocks and their replicas are placed, based on the rack information of the DataNodes and preferring DataNodes that are closer to each other. By default, all nodes are assumed to belong to the same rack. The rack awareness algorithm improves data reliability and accessibility, cluster performance, and network bandwidth utilization: it prevents data loss in the case of a complete rack failure and keeps bulk data flow in-rack whenever possible.
18. What is a MapReduce job?
A MapReduce job splits the input data set into independent chunks, which are then processed in parallel by the map tasks. The outputs of the maps become the input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.
19. What are the Shuffling and the Sorting processes in a MapReduce job?
Shuffling and sorting are two processes that run between the map and reduce phases. Shuffling is the process of transferring the mapper output to the reducers; it is what provides the input to the reduce tasks.
Before the data reaches the reducer and after it leaves the mapper, the output key-value pairs are automatically sorted by key. This built-in sorting is helpful in programs where sorting is needed at some stage.
20. What is a Partitioner?
The partitioner is an important phase. It controls how the intermediate map output keys are partitioned, typically with a hash function. The partitioning determines which reducer each key-value pair of the map output is sent to. The number of partitions equals the total number of reduce tasks for the MapReduce job.
Hadoop provides HashPartitioner as the default partitioner class. A partitioner implements the method int getPartition(K key, V value, int numReduceTasks), where the return value is the partition number and numReduceTasks is the number of reducers. A sketch of a custom partitioner is shown below.
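As a sketch, here is what a custom partitioner could look like; routing keys by their first letter is purely an illustration, and the class name is made up. If nothing custom is registered (via job.setPartitionerClass), Hadoop falls back to HashPartitioner:

```java
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends keys starting with 'A'..'M' to one partition and the rest to another,
// then maps that choice onto the available reducers. Purely illustrative.
public class FirstLetterPartitioner extends Partitioner<Text, DoubleWritable> {
    @Override
    public int getPartition(Text key, DoubleWritable value, int numReduceTasks) {
        if (numReduceTasks == 0 || key.toString().isEmpty()) {
            return 0; // no reducers configured, or nothing to route on
        }
        char first = Character.toUpperCase(key.toString().charAt(0));
        int bucket = (first <= 'M') ? 0 : 1;
        return bucket % numReduceTasks;
    }
}
```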
21. Show a simple example of how MapReduce works.
Let's look at a simple example that will help us understand how MapReduce functions. In real-world applications this will be much more complex, because the data we deal with in Hadoop and MapReduce is massive.
Let's assume we have five files, each containing key-value pairs of a hotel name and a rating. The hotel name is the key and the rating is the value. For example:
- The Plaza Hotel in New York City, 4.8
- La Mamounia Marrakech, 4.9
- Brown’s Hotel London, 4.2
- The Plaza Hotel New York City, 4.5
- Brown’s Hotel London, 4.8
- The Brando Tahiti, 4.6
- La Mamounia Marrakech, 4.9
Each file may contain data for the same hotel multiple times. Now we need to find the maximum rating for each hotel across the five files. The MapReduce framework divides this process into five map tasks, one per file.
Each mapper processes one of the files and returns intermediate results. For example, let's assume that we got:
- (The Plaza Hotel in New York City, 4.6)(Brown’s Hotel London, 4.8)…
- (The Plaza Hotel in New York City, 4.5)(Brown’s Hotel London, 4.9)…
- (The Plaza Hotel in New York City, 4.6)(Brown’s Hotel London, 4.5)…
- (The Plaza Hotel in New York City, 4.8)(Brown’s Hotel London, 4.3)…
- (The Plaza Hotel in New York City, 4.9)(Brown’s Hotel London, 4.8)…
After that, the results are passed to the reduce task, where the inputs from all files are combined into a single value per hotel. The final result will be:
(The Plaza Hotel in New York City, 4.9)(Brown’s Hotel London, 4.9)…
This kind of parallel computation is extremely efficient on large datasets. A minimal mapper/reducer sketch for this example follows.
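Here is a minimal mapper/reducer sketch for this example; the class names are illustrative, and it assumes each input line holds a hotel name and a rating separated by a comma, as in the sample data above:

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxHotelRating {

    // Map: parse "hotel name, rating" and emit (hotel name, rating).
    public static class RatingMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String record = line.toString();
            int comma = record.lastIndexOf(',');
            if (comma < 0) {
                return; // skip malformed lines
            }
            String hotel = record.substring(0, comma).trim();
            double rating = Double.parseDouble(record.substring(comma + 1).trim());
            context.write(new Text(hotel), new DoubleWritable(rating));
        }
    }

    // Reduce: keep the maximum rating seen for each hotel.
    public static class MaxReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text hotel, Iterable<DoubleWritable> ratings, Context context)
                throws IOException, InterruptedException {
            double max = Double.NEGATIVE_INFINITY;
            for (DoubleWritable r : ratings) {
                max = Math.max(max, r.get());
            }
            context.write(hotel, new DoubleWritable(max));
        }
    }
}
```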