What is Apache Flume?
Apache Flume is a tool for ingesting data into HDFS. It collects, aggregates, and transports large amounts of streaming data, such as log files and events from origins like network traffic and email messages, to HDFS. Flume is highly reliable and distributed.
Advantages of Apache Flume
Apache Flume offers many benefits that make it a strong choice over the alternatives. The advantages are:
- Flume is fault-tolerant, reliable, and scalable.
- It can store data in centralized stores such as HBase and HDFS.
- Flume scales horizontally.
- If the read rate exceeds the write rate, Flume provides a steady flow of data between the read and write operations.
- Message delivery is reliable with Flume. Flume transactions are channel-based; two transactions (one for the sender, one for the receiver) are maintained for each message.
- It supports a comprehensive set of source and destination types.
- Data from multiple sources can be ingested into Hadoop.
Flume Architecture
A Flume agent has three components:
- Flume Source
The Flume source consumes events generated by external systems such as web servers.
- Flume Channel
The data received by the Flume source is stored in one or more channels. The channel acts as a buffer, holding the data until it is consumed by the Flume sink. Depending on its type, a channel keeps the data in memory or in local storage.
- Flume Sink
The Flume sink then moves the data from the channel to a destination such as HDFS.
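To see how the three components fit together, here is a minimal sketch of an agent configuration using Flume's built-in netcat source and logger sink; the names DemoAgent, NetSrc, MemCh, and LogSink are placeholders chosen for illustration.

# Name the components of this agent
DemoAgent.sources = NetSrc
DemoAgent.channels = MemCh
DemoAgent.sinks = LogSink

# A netcat source that listens for lines of text on localhost:44444
DemoAgent.sources.NetSrc.type = netcat
DemoAgent.sources.NetSrc.bind = localhost
DemoAgent.sources.NetSrc.port = 44444
DemoAgent.sources.NetSrc.channels = MemCh

# An in-memory channel that buffers events between source and sink
DemoAgent.channels.MemCh.type = memory
DemoAgent.channels.MemCh.capacity = 1000

# A logger sink that simply prints events to the agent's log
DemoAgent.sinks.LogSink.type = logger
DemoAgent.sinks.LogSink.channel = MemCh

Every source is wired to one or more channels, and every sink drains exactly one channel; the same pattern appears in the Twitter configuration later in this article.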
Let’s first download Apache Flume from the link given below.
https://www.apache.org/dyn/closer.lua/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz
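If you prefer the command line, the same tarball can also be fetched with wget from the Apache archive (assuming the 1.9.0 release is still hosted at this path):

wget https://archive.apache.org/dist/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz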
Extract the files using the following command
sudo tar -xvf apache-flume-1.9.0-bin.tar.gz
After extraction, a folder named apache-flume-1.9.0-bin will be created.
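To verify that the extracted copy works, you can run the flume-ng script from the new folder (this assumes you extracted the tarball into your current directory):

cd apache-flume-1.9.0-bin
bin/flume-ng version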
Creating a Twitter Application
To get tweets from Twitter, we need to create a Twitter application. Let’s create one.
Go to https://developer.twitter.com/apps and log in with your account. Click on Create an app.
Fill in the required details.
Open the Keys and Access Tokens tab, where you will find a button named Create my access token. Click it to generate the access token.
You need to place the above information in the configuration file.
Create a new file with the name twitter.conf in the conf folder of Flume.
MyTwitAgent.sources = Twitter
MyTwitAgent.channels = MemChannel
MyTwitAgent.sinks = HDFS

MyTwitAgent.sources.Twitter.type = flume.mytwittersource.MyTwitterSourceForFlume
MyTwitAgent.sources.Twitter.channels = MemChannel
MyTwitAgent.sources.Twitter.consumerKey = <Copy consumer key value from Twitter App>
MyTwitAgent.sources.Twitter.consumerSecret = <Copy consumer secret value from Twitter App>
MyTwitAgent.sources.Twitter.accessToken = <Copy access token value from Twitter App>
MyTwitAgent.sources.Twitter.accessTokenSecret = <Copy access token secret value from Twitter App>
MyTwitAgent.sources.Twitter.keywords = mrcreamio

MyTwitAgent.sinks.HDFS.channel = MemChannel
MyTwitAgent.sinks.HDFS.type = hdfs
MyTwitAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/hduser/flume/tweets/
MyTwitAgent.sinks.HDFS.hdfs.fileType = DataStream
MyTwitAgent.sinks.HDFS.hdfs.writeFormat = Text
MyTwitAgent.sinks.HDFS.hdfs.batchSize = 1000
MyTwitAgent.sinks.HDFS.hdfs.rollSize = 0
MyTwitAgent.sinks.HDFS.hdfs.rollCount = 1000

MyTwitAgent.channels.MemChannel.type = memory
MyTwitAgent.channels.MemChannel.capacity = 1000
MyTwitAgent.channels.MemChannel.transactionCapacity = 1000
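A note on the sink settings: hdfs.rollSize = 0 disables size-based file rolling, so the sink rolls to a new file after every 1,000 events (hdfs.rollCount), and events are flushed to HDFS in batches of 1,000 (hdfs.batchSize). Also note that the agent name (MyTwitAgent) must match the -n flag passed to flume-ng later.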
Start the Hadoop cluster using the commands given below.
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
Type jps in the terminal to check that all the daemons are running.
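If the cluster is up, jps lists the running daemons, something along these lines (the process IDs will differ on your machine):

jps
4233 NameNode
4401 DataNode
4621 SecondaryNameNode
4795 ResourceManager
4958 NodeManager
5210 Jps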
Create a directory in HDFS
Create the directory in HDFS using the following command. The path must match the hdfs.path configured in the sink (here /user/hduser/flume/tweets/).
hdfs dfs -mkdir -p /user/hduser/flume/tweets
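You can confirm the directory exists with:

hdfs dfs -ls /user/hduser/flume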
Now start the Flume agent using the following command. The agent name passed with -n must match the name used in twitter.conf (MyTwitAgent).
/home/supper_user/apache-flume-1.9.0-bin/bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n MyTwitAgent
Tweets will now start streaming into HDFS. Given below is a screenshot of the terminal while fetching tweets.
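Once a few files have rolled, you can inspect the collected tweets directly from HDFS. The HDFS sink names files with its default FlumeData prefix, so a wildcard picks them up:

hdfs dfs -ls /user/hduser/flume/tweets
hdfs dfs -cat /user/hduser/flume/tweets/FlumeData.* | head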