Apache Oozie Tutorial

What is Apache Oozie?

Apache Oozie is a scheduler system used to run and manage Hadoop jobs in a distributed environment. Oozie allows multiple complex jobs to be combined and run in a sequential order to accomplish a bigger task. Within a sequence of tasks, two or more jobs can also be programmed to run in parallel to each other.

Oozie supports three types of jobs.

  • Workflow engine

The job of the workflow engine is to store and run workflows composed of Hadoop jobs, e.g., MapReduce, Pig, and Hive.

  • Coordinator engine

It runs workflow jobs based on predefined schedules and the availability of data (see the sketch after this list).

  • Bundle

The bundle is a higher-level abstraction that batches a set of coordinator jobs; in other words, a bundle is a collection of coordinator jobs.
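
As an illustration, a minimal coordinator definition might look like the sketch below. The application name, dates, and HDFS path are hypothetical, and the schema version may vary between Oozie releases:

<coordinator-app name="daily-coord" frequency="${coord:days(1)}"
                 start="2024-01-01T00:00Z" end="2024-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <!-- Runs the referenced workflow once per day; the path is illustrative -->
  <action>
    <workflow>
      <app-path>${nameNode}/user/${user.name}/my-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>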

Types of Nodes in an Oozie Workflow

  • Start and End nodes

The start and end nodes define the beginning and the end of the workflow. A workflow can also include an optional kill (fail) node that terminates it when an error occurs.

  • Action Nodes

Actual processing tasks are defined in action nodes. When a specific action node finishes, the remote system notifies Oozie and the next node in the workflow is executed. Action nodes can also run HDFS commands.

  • Fork and Join nodes

Fork and join nodes enable parallel execution of tasks in the workflow. A fork node splits one path of execution into several, so two or more nodes can run at the same time; when the workflow has to wait for those tasks to finish, a join node is used (see the sketch after this list).

  • Control flow nodes

Control flow nodes make decisions based on the results of the previous nodes. A decision node behaves like an if-else statement: it evaluates a predicate to true or false and routes the workflow accordingly.
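
A minimal sketch of how fork, join, and decision nodes might appear inside a workflow definition; the node names and the EL predicate (including the 'status' key) are illustrative:

<fork name="parallel-steps">
  <path start="task-a"/>
  <path start="task-b"/>
</fork>

<!-- Both task-a and task-b transition here; the workflow continues
     only after all forked paths have completed. -->
<join name="wait-for-both" to="check-result"/>

<!-- A decision node routes the flow like an if-else statement. -->
<decision name="check-result">
  <switch>
    <case to="next-step">${wf:actionData('task-a')['status'] eq 'ok'}</case>
    <default to="fail"/>
  </switch>
</decision>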

Here is an example of a simple Oozie workflow.

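A minimal sketch of what such a workflow.xml might look like, assuming a single MapReduce action; the node names and the mapper/reducer classes are hypothetical, and the schema version may vary between Oozie releases:

<workflow-app name="simple-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="mr-node"/>

  <!-- A single MapReduce action; the classes below are placeholders. -->
  <action name="mr-node">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>com.example.MyMapper</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>com.example.MyReducer</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <!-- Optional kill (fail) node, reached on error. -->
  <kill name="fail">
    <message>Job failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>

  <end name="end"/>
</workflow-app>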

Packaging and deploying an Oozie workflow application

A workflow application consists of the workflow definition and all the related resources, such as MapReduce JAR files and Pig scripts. Applications must follow a simple directory structure and are deployed to HDFS so that Oozie can access them.

Directory structure:

<name of workflow>/
├── lib/
│   └── hadoop-examples.jar
└── workflow.xml

It is essential to keep workflow.xml (the workflow definition file) in the top-level directory.

The lib directory contains JAR files, including the MapReduce classes. A workflow application conforming to this layout can be built with any build tool.

Now copy the application to HDFS using the command given below:

hadoop fs -put hadoop-examples/target/<name of workflow dir> <name of workflow dir>

Steps for Running an Oozie workflow job

1. First, tell Oozie which Oozie server to use by exporting the OOZIE_URL environment variable:

export OOZIE_URL="http://localhost:11000/oozie"

2. Run the workflow job using the command given below:

oozie job -config <location of properties file> -run
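
The properties file supplies the parameters referenced in the workflow definition. A minimal sketch, assuming default local ports for the NameNode and the YARN ResourceManager (hosts and ports are illustrative; adjust them for your cluster):

# job.properties -- host names and ports below are illustrative
nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
oozie.wf.application.path=${nameNode}/user/${user.name}/<name of workflow>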

3. Get the status of the workflow job:

oozie job -info <job id>

4. The results of a successful workflow run can be seen using the command given below:

hadoop fs -cat <location of result>
