Apache hadoop is open source software framework which is used to develop fro data processing applications which will be executed in a distributed computing environment.
Applications which are built using hadoop will run on large data sets distributed across clusters of commodity of the computers. In hadoop data will resides in the distributed in the file system which is called hadoop distributed file system. This process of model is based on the data locality concept.
Hadoop ecosystem and components
This will consist of two projects:
- Hadoop map reduce: Its a computational model also software framework for writing applications which will run on the hadoop. These mapreduce programs are efficient of processing enormous data in the parallel on large cluster of computation nodes
- HDFS: it stands for hadoop distributed file system which takes care of the storage which is part of the hadoop applications. Map reduce applications will consume the data from HDFS. This will create multiple replicas of the data blocks and distributed them on the compute nodes in cluster..
Hadoop is used for mapreduce and distributed file system- HDFS term accepts related projects that fall under umbrella of the distributed computing and large scale data processing.
Hadoop Architecture:
Hadoop is architecture for the data storage and also distributed data processing using map to reduce
- Name node: which represents every files and directory which is been used in the name space
- Data node: this will help us to manage the state of the HDFS node which allows to interacts with the blocks
- Master node: this will allow to conduct the parallel processing of the data using hadoop map reduce.
- Slave node: It is the another part of machines in the hadoop cluster which accepts to store the data to do the complex calculations. Rest all the slave node will come with the task tracker and a data node., this will allow to synchronize the processes with the name node and job tracker.
Features of hadoop:
- Suitable for big data analysis: hadoop clusters are the best suited for analysis of the big data. This will flow to the computing node, less network bandwidth will be consumed. It is data locality concept which will help the hadoop based applications.
- Scalability: these can be easily scaled to any extent by just adding clusters nodes and which allows for the growth of the big data. Scaling does not require modifications.
- Fault tolerance: it has a provision to replicate the input data on other cluster nodes , in the event of a cluster node failure, data processing can still be proceed by using the data stored on another cluster node.
- License Free: All mayl get this licence Apache Hadoop website from the download Hadoop an install and work on it.
- open source: It’s the source code that is available where we can modify, change as per the requirements.
- Meant for big data analytics: It can also handle volume, variety, velocity and value. Hadoop will be a concept of handling the data and also controls it with the help of ecosystem approach.
- Horizontal scalability: We may not be needing to build the large clusters by just keeping the nodes added. As the data keeps on growing we keep adding the nodes.
- Distributors: With the help of distributors we get the bundles and also built in packages we may not need to install each package individually.
Cloudera: It is US company that built by the employees of the facebook, linkedln and also yahoo. It also provides the solution for the Hadoop and enterprise.
Hortonworks: Its products are known as HDP which is not an enterprise it will open source and licence free. It has a free tool called Apache Ambari which is created by Hortonworks clusters.
Hadoop Advantages
- Scalable
This is having scalable storage platform as this is stored and also distributed many large data sets across hundreds of inexpensive servers that will operate in parallel. Traditional relational database system that can be scale to process large amounts of data.
- Cost-effective
Hadoop provides cost effective storage for business storing data sets. The problem with an traditional relational database management systems that is cost prohibitive to scale to such a degree where the process such as massive volumes of data.
Questions
1. What is Hadoop?
2. Explain the architecture of Hadoop?