Sometimes we need to combine two large datasets. Writing such a join by hand requires a lot of code, but MapReduce provides this functionality out of the box through its join operation. It works as follows: the two datasets are compared by size, and the smaller one is distributed to every DataNode. The Mapper or Reducer then uses the smaller dataset to perform lookups for matching records. Finally, the matching records from the smaller and larger datasets are merged to form the joined output records.
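To make that mechanism concrete, here is a minimal sketch of the lookup idea in Java. Everything here is hypothetical (the class name, the small.txt file, and the comma-separated format are assumptions for illustration); it simply shows a mapper loading a small dataset distributed to the node and joining the larger dataset against it.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: the driver would ship the smaller dataset to every
// node with job.addCacheFile(new URI("/small.txt#small.txt")), making it
// available in the task's working directory under the name "small.txt".
public class LookupJoinMapper extends Mapper<Object, Text, Text, Text> {
  private final Map<String, String> lookup = new HashMap<>();

  @Override
  protected void setup(Context ctx) throws IOException {
    // Load the smaller dataset (assumed "key,value" lines) into memory.
    try (BufferedReader r = new BufferedReader(new FileReader("small.txt"))) {
      String line;
      while ((line = r.readLine()) != null) {
        String[] parts = line.split(",", 2);
        lookup.put(parts[0], parts[1]);
      }
    }
  }

  @Override
  protected void map(Object key, Text value, Context ctx)
      throws IOException, InterruptedException {
    // Each record of the larger dataset is looked up by its join key;
    // matches are merged into a joined output record.
    String[] parts = value.toString().split(",", 2);
    String match = lookup.get(parts[0]);
    if (match != null) {
      ctx.write(new Text(parts[0]), new Text(parts[1] + "\t" + match));
    }
  }
}
```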
There are two types of joins:
- Map-side Join
- Reduce-side Join
Map-side Join
In the Map-side Join, the join is performed by the mapper, before the map function even consumes the data. This type of join has a prerequisite: each input to the map must be divided into the same number of partitions, with each partition sorted by the join key, so that all records with an equal key fall into the corresponding partition of each input.
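Hadoop ships a CompositeInputFormat that performs exactly this merge over pre-sorted, identically partitioned inputs. Below is a minimal driver sketch; the class name MapSideJoinDriver is hypothetical (the Hadoop classes are real), and it assumes both inputs already satisfy the sorting and partitioning prerequisite.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;
import org.apache.hadoop.mapreduce.lib.join.TupleWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapSideJoinDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // "inner" keeps only keys present in both inputs; the (identity) mapper
    // then receives a Text key plus a TupleWritable with one value per input.
    conf.set(CompositeInputFormat.JOIN_EXPR,
        CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
            new Path(args[0]), new Path(args[1])));
    Job job = Job.getInstance(conf, "map-side join");
    job.setJarByClass(MapSideJoinDriver.class);
    job.setInputFormatClass(CompositeInputFormat.class);
    job.setNumReduceTasks(0);  // the join happens in the input format itself
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(TupleWritable.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```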
Reduce-side Join
In the Reduce-side Join, the join is performed by the reducer, and the datasets are not expected to be partitioned or sorted in any particular way. The map-side processing only emits the join key together with the corresponding tuple from each input record. Because all tuples with the same key are grouped at the same reducer, the reducer can join them to form the output records.
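Here is a minimal sketch of that flow, assuming both inputs are comma-separated with the department ID first (the class names and the NAME/STRENGTH tags are hypothetical; the MapReduceJoin.jar used later may be organized differently).

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceSideJoin {

  // Tags each DeptName record (assumed "deptId,deptName") with its source.
  public static class NameMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split(",", 2);
      ctx.write(new Text(parts[0]), new Text("NAME\t" + parts[1]));
    }
  }

  // Tags each DeptStrength record (assumed "deptId,strength") with its source.
  public static class StrengthMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split(",", 2);
      ctx.write(new Text(parts[0]), new Text("STRENGTH\t" + parts[1]));
    }
  }

  // All tuples for one deptId arrive together; merge the two tagged sides.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String name = "", strength = "";
      for (Text v : values) {
        String[] tagged = v.toString().split("\t", 2);
        if ("NAME".equals(tagged[0])) name = tagged[1];
        else strength = tagged[1];
      }
      ctx.write(key, new Text(name + "\t" + strength));
    }
  }
}
```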
Let's start Hadoop first.
First of all, start the Hadoop cluster using the commands given below.
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
Type jps in the terminal to check that all the daemons are running; on a typical single-node setup you should see NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager listed (plus Jps itself).
We will be working with two small input files, DeptName.txt and DeptStrength.txt.
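For illustration only (these values are hypothetical; the real files ship with the repo below), DeptName.txt could map a department ID to a department name, e.g. `1,Finance`, while DeptStrength.txt could map the same IDs to a headcount, e.g. `1,270`. The department ID is the join key shared by both files.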
Download the GitHub repo from the link given below; we will be using its files.
https://github.com/mrcreamio/Hadoop-tutorials
Move the downloaded folder to the appropriate directory using the command given below.
sudo cp -r /home/ahmed/Desktop/MapReduceJoin /home/supper_user/
Change into that directory.
cd MapReduceJoin/
Now let's copy our input files to HDFS.
hdfs dfs -copyFromLocal DeptStrength.txt DeptName.txt /
Let's check that the files were copied; you should see /DeptName.txt and /DeptStrength.txt in the listing.
hdfs dfs -ls /
Run the program using the command given below.
$HADOOP_HOME/bin/hadoop jar MapReduceJoin.jar /DeptStrength.txt /DeptName.txt /output_mapreducejoin
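Under the hood, a driver class inside the jar wires each input file to its mapper and sets the output path. The sketch below shows roughly how that wiring could look for the reduce-side classes sketched earlier (again hypothetical; the actual driver in the repo may differ).

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JoinDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(JoinDriver.class);
    job.setReducerClass(ReduceSideJoin.JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // args: /DeptStrength.txt /DeptName.txt /output_mapreducejoin
    MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, ReduceSideJoin.StrengthMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]),
        TextInputFormat.class, ReduceSideJoin.NameMapper.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```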
Let's see the output file using the command given below.
hdfs dfs -cat /output_mapreducejoin/part-00000
Here is the output.