Stop Thinking, Just Do!

Sungsoo Kim's Blog

Simple YARN Application

tagsTags

28 April 2014


Article Source

Simple YARN Application

Getting Started

This guide walks you through the process of creating a Spring Hadoop YARN application.

What you’ll build

You’ll build a simple Hadoop YARN application with Spring Hadoop and Spring Boot.

What you’ll need

  • About 15 minutes

  • A favorite text editor or IDE

  • JDK 1.6 or later

  • Gradle 1.11+ or Maven 3.0+

  • You can also import the code from this guide as well as view the web page directly into Spring Tool Suite (STS) and work your way through it from there.

  • Local single-node instance based on Hadoop 2.2.0 or later. The Apache Hadoop site has some instructions.

How to complete this guide

Like most Spring Getting Started guides, you can start from scratch and complete each step, or you can bypass basic setup steps that are already familiar to you. Either way, you end up with working code.

To start from scratch, move on to Set up the project.

To skip the basics, do the following:

  • Download and unzip the source repository for this guide, or clone it using Git: git clone https://github.com/spring-guides/gs-yarn-basic.git

  • cd into gs-yarn-basic/initial

  • Jump ahead to Create a Yarn Container.

When you’re finished, you can check your results against the code in gs-yarn-basic/complete.

Hadoop YARN Intro

If you have been following the Hadoop community over the past year or two, you’ve probably seen a lot of discussions around YARN and the next version of Hadoop’s MapReduce called MapReduce v2. YARN (Yet Another Resource Negotiator) is a component of the MapReduce project created to overcome some performance issues in Hadoop’s original design. The fundamental idea of MapReduce v2 is to split the functionalities of the JobTracker, Resource Management and Job Scheduling/Monitoring, into separate daemons. The idea is to have a global Resource Manager (RM) and a per-application Application Master (AM). A generic diagram for YARN component dependencies can be found on the Hadoop page describing the YARN architecture.

MapReduce Version 2 is a re-write of the original MapReduce code run as an application on top of YARN. It is also possible to write other types of applications, having nothing to do with MapReduce, and then run them on YARN. However, the YARN APIs are complex and writing a custom YARN based application is difficult. The YARN APIs are low-level infrastructure APIs, not high-level developer APIs.

Spring YARN Intro

The development process for a YARN application, from the moment when a developer starts his or her work to the point when someone actually executes the application on a Hadoop cluster, is a bit more complicated than just creating a few lines of “Hello world!” code.

Let’s see what needs to be considered:

  • What is the project structure for the application code?

  • How is the project built and packaged?

  • How is the packaged application configured?

  • How is the final application executed on YARN?

We believe that Spring YARN and Spring Boot creates a very clear story for how above topics could be handled.

At a high level, Spring YARN provides three different components, YarnClient, YarnAppmaster and YarnContainer which together can be called a Spring YARN Application. We provide default implementations for all components while still giving the end user an option to customize as much as he or she wants.

In a pure Hadoop environment it has always been a cumbersome process to get your own code packaged, deployed and executed on a Hadoop cluster. Should you just put your compiled package in Hadoop’s classpath, or rely on Hadoop’s tools to copy your artifacts into Hadoop during the job submission? What about if your own code depends on some library that isnt already present on Hadoop’s default classpath? Even worse, what about if the dependencies in your code collides with libraries already on Hadoop’s default classpath?

With Spring Boot you can work around all these issues. You either create an executable jar (sometimes called an uber or fat jar) which bundles all dependencies, or a zip package which can be automatically extracted before the code is about to be executed. In the latter case, it’s possible to re-use entries already available on Hadoop’s default classpath.

In this guide we are going to show how these 3 components, YarnClient, YarnAppmaster and YarnContainer are packaged into executable jars using Spring Boot. Internally Spring Boot rely heavy on application auto-configuration and Spring YARN adds its own auto-configuration magic. The application developer can then concentrate on his or her own code and application configuration instead of spending a lot of time trying to understand how all the components should integrate with each other.

Set up the project

First you set up a basic build script. You can use any build system you like when building apps with Spring, but the code you need to work with Gradle and Maven is included here. If you’re not familiar with either, refer to Building Java Projects with Gradle or Building Java Projects with Maven.

We also have additional guides having specific instructions using build systems with Spring YARN. If you’re not familiar with either, refer to Building Spring YARN Projects with Gradle or Building Spring YARN Projects with Maven.

Create the directory structure

In a project directory of your choosing, create the following subdirectory structure:

├── gs-yarn-basic-appmaster
│   └── src
│       └── main
│           ├── resources
│           └── java
│               └── hello
│                   └── appmaster
├── gs-yarn-basic-container
│   └── src
│       └── main
│           ├── resources
│           └── java
│               └── hello
│                   └── container
├── gs-yarn-basic-client
│   └── src
│       └── main
│           ├── resources
│           └── java
│               └── hello
│                   └── client
└── gs-yarn-basic-dist

for example, on *nix systems, with:

mkdir -p gs-yarn-basic-appmaster/src/main/resources
mkdir -p gs-yarn-basic-appmaster/src/main/java/hello/appmaster
mkdir -p gs-yarn-basic-container/src/main/resources
mkdir -p gs-yarn-basic-container/src/main/java/hello/container
mkdir -p gs-yarn-basic-client/src/main/resources
mkdir -p gs-yarn-basic-client/src/main/java/hello/client
mkdir -p gs-yarn-basic-dist

Create the Gradle build files

Below is the initial Gradle build file and the initial Gradle settings file. But you can also use Maven. The pom.xml file is included right here. If you are using Spring Tool Suite (STS), you can import the guide directly.

build.gradle

buildscript {
    repositories {
        maven { url "http://repo.spring.io/libs-milestone" }
    }
    dependencies {
        classpath("org.springframework.boot:spring-boot-gradle-plugin:1.0.0.RELEASE")
    }
}

allprojects {
    apply plugin: 'base'
}

subprojects { subproject ->
    apply plugin: 'java'
    apply plugin: 'eclipse'
    apply plugin: 'idea'
    version =  '0.1.0'
    repositories {
        mavenCentral()
        maven { url "http://repo.spring.io/libs-milestone" }
    }
    dependencies {
        compile("org.springframework.data:spring-yarn-boot:2.0.0.RC2")
    }
    task copyJars(type: Copy) {
        from "$buildDir/libs"
        into "$rootDir/gs-yarn-basic-dist/target/gs-yarn-basic-dist/"
        include "**/*.jar"
    }
    assemble.doLast {copyJars.execute()}
}

project('gs-yarn-basic-client') {
    apply plugin: 'spring-boot'
}

project('gs-yarn-basic-appmaster') {
    apply plugin: 'spring-boot'
}

project('gs-yarn-basic-container') {
    apply plugin: 'spring-boot'
}

project('gs-yarn-basic-dist') {
    dependencies {
        compile project(":gs-yarn-basic-client")
        compile project(":gs-yarn-basic-appmaster")
        compile project(":gs-yarn-basic-container")
        testCompile("org.springframework.data:spring-yarn-boot-test:2.0.0.RC2")
        testCompile("org.hamcrest:hamcrest-core:1.2.1")
        testCompile("org.hamcrest:hamcrest-library:1.2.1")
    }
    test.dependsOn(':gs-yarn-basic-client:assemble')
    test.dependsOn(':gs-yarn-basic-appmaster:assemble')
    test.dependsOn(':gs-yarn-basic-container:assemble')
    clean.doLast {ant.delete(dir: "target")}
    jar.enabled = false
}

task wrapper(type: Wrapper) {
    gradleVersion = '1.11'
}

settings.gradle

rootProject.name = 'gs-yarn-basic'
include 'gs-yarn-basic-client','gs-yarn-basic-appmaster','gs-yarn-basic-container','gs-yarn-basic-dist'

In the above gradle build file we simply create three different jars, each having classes for its specific role. These jars are then repackaged by Spring Boot’s gradle plugin to create an executable jar.

Create a Yarn Container

Here you create ContainerApplication and HelloPojo classes.

gs-yarn-basic-container/src/main/java/hello/container/ContainerApplication.java

package hello.container;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableAutoConfiguration
public class ContainerApplication {

    public static void main(String[] args) {
        SpringApplication.run(ContainerApplication.class, args);
    }

    @Bean
    public HelloPojo helloPojo() {
        return new HelloPojo();
    }

}

In the above ContainerApplication, notice how we added the @Configuration annotation at the class level and the @Bean annotation on the helloPojo() method. We have jumped a little bit ahead of what you most likely expect us to do. We previously mentioned YarnContainer component which is an interface towards what you’d execute in your containers. You could define your custom YarnContainer to implement this interface and wrap all logic inside of that implementation.

However, Spring YARN defaults to a DefaultYarnContainer if none is defined and this default implementation expects to find a specific bean type from a Spring Application Context having the real user facing logic what container is supposed to do.

gs-yarn-basic-container/src/main/java/hello/container/HelloPojo.java

package hello.container;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.hadoop.fs.FsShell;
import org.springframework.yarn.annotation.OnYarnContainerStart;
import org.springframework.yarn.annotation.YarnContainer;

@YarnContainer
public class HelloPojo {

    private static final Log log = LogFactory.getLog(HelloPojo.class);

    @Autowired
    private Configuration configuration;

    @OnYarnContainerStart
    public void publicVoidNoArgsMethod() throws Exception {
        log.info("Hello from HelloPojo");
        log.info("About to list from hdfs root content");

        FsShell shell = new FsShell(configuration);
        for (FileStatus s : shell.ls(false, "/")) {
            log.info(s);
        }
        shell.close();
    }

}

HelloPojo class is a simple POJO in a sense that it doesn’t extend any Spring YARN base classes. What we did in this class:

@YarnContainer is a stereotype annotation, providing a Spring @Component annotation. This is automatically marking a class to be a candidate for having @YarnContainer functionality.

Within this class we can use @OnYarnContainerStart annotation to mark a public method with void return type and no arguments act as an entry point for some application code that needs to be executed on Hadoop.

To demonstrate that we actually have some real functionality in this class, we simply use Spring Hadoop’s @FsShell to list entries from the root of the HDFS file system. We needed to have Hadoop’s Configuration which is prepared for you so that you can just rely on autowiring for access to it.

Create a Yarn Appmaster

Here you create an AppmasterApplication class.

gs-yarn-basic-appmaster/src/main/java/hello/appmaster/AppmasterApplication.java

package hello.appmaster;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;

@EnableAutoConfiguration
public class AppmasterApplication {

    public static void main(String[] args) {
        SpringApplication.run(AppmasterApplication.class, args);
    }

}

The application class for YarnAppmaster looks even simpler than what we just did for ClientApplication. Again the main() method uses Spring Boot’s SpringApplication.run() method to launch an application.

One might argue that if you use this type of dummy class to basically fire up your application, could we not use a generic class for this? Well simple answer is yes, we even have a generic SpringYarnBootApplication class just for this purpose. You’d define that to be your main class for an executable jar and you’d accomplish this during the gradle build.

In real life, however, you most likely need to start adding more custom functionality to your application component and you’d do that by starting to add more beans. To do that you need to define a Spring @Configuration or @ComponentScan. AppmasterApplication would then act as your main starting point to define more custom functionality. Effectively this is exactly what we do with a YarnContainer in section below.

Create a Yarn Client

Here you create a ClientApplication class.

gs-yarn-basic-client/src/main/java/hello/client/ClientApplication.java

package hello.client;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.yarn.client.YarnClient;

@EnableAutoConfiguration
public class ClientApplication {

    public static void main(String[] args) {
        SpringApplication.run(ClientApplication.class, args)
            .getBean(YarnClient.class)
            .submitApplication();
    }

}
  • @EnableAutoConfiguration tells Spring Boot to start adding beans based on classpath setting, other beans, and various property settings.

  • Specific auto-configuration for Spring YARN components takes place in a same way than from a core Spring Boot.

The main() method uses Spring Boot’s SpringApplication.run() method to launch an application. From there we simply request a bean of type YarnClient and execute its submitApplication() method. What happens next depends on application configuration, which we go through later in this guide. Did you notice that there wasn’t a single line of XML?

Create an Application Configuration

Create a new yaml configuration file for all sub-projects.

gs-yarn-basic-container/src/main/resources/application.yml gs-yarn-basic-appmaster/src/main/resources/application.yml gs-yarn-basic-client/src/main/resources/application.yml

spring:
    hadoop:
        fsUri: hdfs://localhost:8020
        resourceManagerHost: localhost
    yarn:
        appName: gs-yarn-basic
        applicationDir: /app/gs-yarn-basic/
        client:
            files:
              - "file:gs-yarn-basic-dist/target/gs-yarn-basic-dist/gs-yarn-basic-container-0.1.0.jar"
              - "file:gs-yarn-basic-dist/target/gs-yarn-basic-dist/gs-yarn-basic-appmaster-0.1.0.jar"
            launchcontext:
                archiveFile: gs-yarn-basic-appmaster-0.1.0.jar
        appmaster:
            containerCount: 1
            launchcontext:
                archiveFile: gs-yarn-basic-container-0.1.0.jar

Pay attention to the yaml file format which expects correct indentation and no tab characters.

Final part for your application is its runtime configuration, which glues all the components together, which then can be executed as a Spring YARN application. This configuration act as source for Spring Boot’s @ConfigurationProperties and contains relevant configuration properties which cannot be auto-discovered or otherwise needs to have an option to be overwritten by an end user.

This way you can define your own defaults for your environment. Because these @ConfigurationProperties are resolved at runtime by Spring Boot, you even have an easy option to overwrite these properties either by using command-line options, environment variables or by providing additional configuration property files.

Build the Application

For gradle simply execute the clean and build tasks.

./gradlew clean build

To skip existing tests if any:

./gradlew clean build -x test

For maven simply execute the clean and package goals.

mvn clean package

To skip existing tests if any:

mvn clean package -DskipTests=true

Below listing shows files after a succesfull gradle build.

gs-yarn-basic-dist/target/gs-yarn-basic-dist/gs-yarn-basic-client-0.1.0.jar
gs-yarn-basic-dist/target/gs-yarn-basic-dist/gs-yarn-basic-appmaster-0.1.0.jar
gs-yarn-basic-dist/target/gs-yarn-basic-dist/gs-yarn-basic-container-0.1.0.jar

Run the Application

Now that you’ve successfully compiled and packaged your application, it’s time to do the fun part and execute it on Hadoop YARN.

To accomplish this, simply run your executable client jar from the projects root dirctory.

$ java -jar gs-yarn-basic-dist/target/gs-yarn-basic-dist/gs-yarn-basic-client-0.1.0.jar

Using the Resource Manager UI you can see status of an application.

Resource Manager UI

To find Hadoop’s application logs, you need to do a simple find within the hadoop clusters configured userlogs directory.

$ find hadoop/logs/userlogs/ | grep std
hadoop/logs/userlogs/application_1395578417086_0001/container_1395578417086_0001_01_000001/Appmaster.stdout
hadoop/logs/userlogs/application_1395578417086_0001/container_1395578417086_0001_01_000001/Appmaster.stderr
hadoop/logs/userlogs/application_1395578417086_0001/container_1395578417086_0001_01_000002/Container.stdout
hadoop/logs/userlogs/application_1395578417086_0001/container_1395578417086_0001_01_000002/Container.stderr

Grep logging output from a HelloPojo class.

$ grep HelloPojo hadoop/logs/userlogs/application_1395578417086_0001/container_1395578417086_0001_01_000002/Container.stdout
[2014-03-23 12:42:05.763] boot - 17064  INFO [main] --- HelloPojo: Hello from HelloPojo
[2014-03-23 12:42:05.763] boot - 17064  INFO [main] --- HelloPojo: About to list from hdfs root content
[2014-03-23 12:42:06.745] boot - 17064  INFO [main] --- HelloPojo: FileStatus{path=hdfs://localhost:8020/; isDirectory=true; modification_time=1395397562421; access_time=0; owner=root;
group=supergroup; permission=rwxr-xr-x; isSymlink=false}
[2014-03-23 12:42:06.746] boot - 17064  INFO [main] --- HelloPojo:
FileStatus{path=hdfs://localhost:8020/app; isDirectory=true;
modification_time=1395501405412; access_time=0; owner=hadoop; group=supergroup; permission=rwxr-xr-x; isSymlink=false}

Summary

Congratulations! You’ve just developed a Spring YARN application!

Get the Code


comments powered by Disqus