What is Apache Flume?
Apache Flume is a tool for ingesting data into HDFS. It collects, aggregates, and transports large amounts of streaming data, such as log files and events from origins like network traffic and email messages, to HDFS. Flume is highly reliable and distributed.
Advantages of Apache Flume
Apache Flume offers many benefits that make it a strong choice over the alternatives. The advantages are:
- Flume is fault-tolerant, reliable, and scalable.
- It can store data in centralized stores such as HBase and HDFS.
- Flume scales horizontally.
- If the read rate exceeds the write rate, Flume provides a steady flow of data between the read and write operations.
- Message delivery is reliable with Flume. Flume transactions are channel-based; two transactions (one for the sender, one for the receiver) are maintained for each message.
- It supports a comprehensive set of source and destination types.
- Data from multiple sources can be ingested into Hadoop.
Flume Architecture
A Flume agent has three components:
- Flume Source
The Flume source consumes events generated by external systems such as web servers.
- Flume Channel
The data received by the Flume source is stored in one or more channels. The channel acts as a buffer, holding the data until it is consumed by the Flume sink. Depending on its type, a channel keeps the data in memory or in local storage.
- Flume Sink
The Flume sink then moves the data from the channel to a destination such as HDFS.
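To see how the three components fit together, here is a minimal sketch of an agent configuration using Flume's built-in netcat source and logger sink; the names DemoAgent, NetSrc, MemCh, and LogSink are placeholders chosen for illustration.

# Name the components of this agent
DemoAgent.sources = NetSrc
DemoAgent.channels = MemCh
DemoAgent.sinks = LogSink

# A netcat source that listens for lines of text on localhost:44444
DemoAgent.sources.NetSrc.type = netcat
DemoAgent.sources.NetSrc.bind = localhost
DemoAgent.sources.NetSrc.port = 44444
DemoAgent.sources.NetSrc.channels = MemCh

# An in-memory channel that buffers events between source and sink
DemoAgent.channels.MemCh.type = memory
DemoAgent.channels.MemCh.capacity = 1000

# A logger sink that simply prints events to the agent's log
DemoAgent.sinks.LogSink.type = logger
DemoAgent.sinks.LogSink.channel = MemCh

Every source is wired to one or more channels, and every sink drains exactly one channel; the same pattern appears in the Twitter configuration later in this article.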
Let’s first download Apache Flume from the link given below.
https://www.apache.org/dyn/closer.lua/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz
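If you prefer the command line, the same tarball can also be fetched with wget from the Apache archive (assuming the 1.9.0 release is still hosted at this path):

wget https://archive.apache.org/dist/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz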
Extract the files using the following command
sudo tar -xvf apache-flume-1.9.0-bin.tar.gz
After extraction, a folder named apache-flume-1.9.0-bin will be created.
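To verify that the extracted copy works, you can run the flume-ng script from the new folder (this assumes you extracted the tarball into your current directory):

cd apache-flume-1.9.0-bin
bin/flume-ng version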
Creating a Twitter Application
To get tweets from Twitter, we need to create a Twitter application. Let’s create one.
Go to https://developer.twitter.com/apps and log in with your account. Click on Create an app.
Fill in the required details.
Open the Keys and Access Tokens tab, where you will find a button named Create my access token. Click it to generate the access token.
You need to place the above information in the configuration file.
Create a new file with the name twitter.conf in the conf folder of Flume.
MyTwitAgent.sources = Twitter
MyTwitAgent.channels = MemChannel
MyTwitAgent.sinks = HDFS

MyTwitAgent.sources.Twitter.type = flume.mytwittersource.MyTwitterSourceForFlume
MyTwitAgent.sources.Twitter.channels = MemChannel
MyTwitAgent.sources.Twitter.consumerKey = <Copy consumer key value from Twitter App>
MyTwitAgent.sources.Twitter.consumerSecret = <Copy consumer secret value from Twitter App>
MyTwitAgent.sources.Twitter.accessToken = <Copy access token value from Twitter App>
MyTwitAgent.sources.Twitter.accessTokenSecret = <Copy access token secret value from Twitter App>
MyTwitAgent.sources.Twitter.keywords = mrcreamio

MyTwitAgent.sinks.HDFS.channel = MemChannel
MyTwitAgent.sinks.HDFS.type = hdfs
MyTwitAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/hduser/flume/tweets/
MyTwitAgent.sinks.HDFS.hdfs.fileType = DataStream
MyTwitAgent.sinks.HDFS.hdfs.writeFormat = Text
MyTwitAgent.sinks.HDFS.hdfs.batchSize = 1000
MyTwitAgent.sinks.HDFS.hdfs.rollSize = 0
MyTwitAgent.sinks.HDFS.hdfs.rollCount = 1000

MyTwitAgent.channels.MemChannel.type = memory
MyTwitAgent.channels.MemChannel.capacity = 1000
MyTwitAgent.channels.MemChannel.transactionCapacity = 1000
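A note on the sink settings: hdfs.rollSize = 0 disables size-based file rolling, so the sink rolls to a new file after every 1,000 events (hdfs.rollCount), and events are flushed to HDFS in batches of 1,000 (hdfs.batchSize). Also note that the agent name (MyTwitAgent) must match the -n flag passed to flume-ng later.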
Start the Hadoop cluster using the commands given below.
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
Type jps in the terminal to check that all the daemons are running.
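If the cluster is up, jps lists the running daemons, something along these lines (the process IDs will differ on your machine):

jps
4233 NameNode
4401 DataNode
4621 SecondaryNameNode
4795 ResourceManager
4958 NodeManager
5210 Jps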
Create a directory in HDFS
Create the directory in HDFS using the following command. The path must match the hdfs.path configured in the sink (here /user/hduser/flume/tweets/).
hdfs dfs -mkdir -p /user/hduser/flume/tweets
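You can confirm the directory exists with:

hdfs dfs -ls /user/hduser/flume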
Now start the Flume agent using the following command. The agent name passed with -n must match the name used in twitter.conf (MyTwitAgent).
/home/supper_user/apache-flume-1.9.0-bin/bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n MyTwitAgent
Tweets will now start streaming into HDFS. Given below is a screenshot of the terminal while fetching tweets.
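Once a few files have rolled, you can inspect the collected tweets directly from HDFS. The HDFS sink names files with its default FlumeData prefix, so a wildcard picks them up:

hdfs dfs -ls /user/hduser/flume/tweets
hdfs dfs -cat /user/hduser/flume/tweets/FlumeData.* | head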