Sungsoo Kim's Blog

Apache Tajo™ 0.8.0 Configuration

4 May 2014


Preliminary

catalog-site.xml and tajo-site.xml

Tajo’s configuration is based on Hadoop’s configuration system. Tajo uses two config files:

  • catalog-site.xml - configuration for the catalog server.
  • tajo-site.xml - configuration for the other Tajo modules.

Each config is a name-value pair. To set the config a.b.c to the value 123, add the following element to the appropriate file:

<property>
  <name>a.b.c</name>
  <value>123</value>
</property>

Tajo has a variety of internal configs. If you don’t set a config explicitly, its default value is used. Tajo is designed to need only a few configs in typical cases, so you may not need to change the configuration at all.

By default, there is no tajo-site.xml in the $TAJO_HOME/conf directory. To set configs, first copy $TAJO_HOME/conf/tajo-site.xml.template to tajo-site.xml. Then add the configs to your tajo-site.xml.
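The copy step can be scripted so that an existing tajo-site.xml is never overwritten. The following is a minimal sketch, not part of Tajo: ensure_site_file is a hypothetical helper, and it assumes the template is named tajo-site.xml.template (adjust the name if your distribution differs).

```sh
# Sketch: create tajo-site.xml from the bundled template unless it already exists.
# ensure_site_file is a hypothetical helper; the template file name is an assumption.
ensure_site_file() (
  conf_dir=$1
  template="$conf_dir/tajo-site.xml.template"
  target="$conf_dir/tajo-site.xml"
  if [ -e "$target" ]; then
    # Never overwrite a config file that someone has already edited.
    echo "kept existing $target"
  elif [ -e "$template" ]; then
    cp "$template" "$target" && echo "created $target"
  else
    echo "no template found in $conf_dir" >&2
    exit 1
  fi
)
```

A typical call would be ensure_site_file "$TAJO_HOME/conf", run once before editing the file.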

tajo-env.sh

tajo-env.sh is a shell script. Its main purpose is to set shell environment variables for the TajoMaster and TajoWorker Java programs. You can set a variable as follows:

VARIABLE=value

If the value is a literal string, quote it as follows:

VARIABLE='value'

Cluster Setup

Fully Distributed Mode

A fully distributed mode enables a Tajo instance to run on Hadoop Distributed File System (HDFS). In this mode, a number of Tajo workers run across a number of the physical nodes where HDFS data nodes run.

This section explains how to set up the cluster mode.

Settings

Please add the following configs to the tajo-site.xml file:

<property>
  <name>tajo.rootdir</name>
  <value>hdfs://hostname:port/tajo</value>
</property>
<property>
  <name>tajo.master.umbilical-rpc.address</name>
  <value>hostname:26001</value>
</property>
<property>
  <name>tajo.master.client-rpc.address</name>
  <value>hostname:26002</value>
</property>
<property>
  <name>tajo.catalog.client-rpc.address</name>
  <value>hostname:26005</value>
</property>

Workers

The file conf/workers lists all host names of workers, one per line. By default, this file contains the single entry localhost. You can easily add host names of workers via your favorite text editor.

For example:

$ cat > conf/workers
host1.domain.com
host2.domain.com
....

<ctrl + d>
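If you prefer not to type the list interactively, the same file can be generated from a script. This is a minimal sketch under the assumption that the file only needs one host name per line; write_workers is a hypothetical helper, not a Tajo command.

```sh
# Sketch: write one worker host name per line to a workers file.
# write_workers is a hypothetical helper, not a Tajo command.
write_workers() (
  file=$1
  shift
  # Refuse to write an empty list, which would leave the cluster without workers.
  [ "$#" -gt 0 ] || { echo "no worker hosts given" >&2; exit 1; }
  printf '%s\n' "$@" > "$file"
)
```

For example, write_workers conf/workers host1.domain.com host2.domain.com replaces the file with those two entries.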

Make base directories and set permissions

For more detail on Tajo’s configuration, see the Configuration page. Before launching Tajo, create the Tajo root directory and set its permissions as follows:

$ $HADOOP_HOME/bin/hadoop fs -mkdir       /tajo
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tajo

Launch a Tajo cluster

Then, execute start-tajo.sh:

$ $TAJO_HOME/bin/start-tajo.sh

Note

By default, each worker is given very little resource capacity. To increase the degree of parallelism, please read Worker Configuration.

Note

By default, TajoMaster listens on 127.0.0.1 for clients. To allow remote clients to access TajoMaster, set the tajo.master.client-rpc.address config in tajo-site.xml. To learn how to change the listen port, please refer to Configuration Defaults.

Tajo Master Configuration

Tajo Rootdir

Tajo uses HDFS as its primary storage layer, so each Tajo cluster instance should have one Tajo rootdir. You can specify your Tajo rootdir as follows:

<property>
  <name>tajo.rootdir</name>
  <value>hdfs://namenode_hostname:port/path</value>
</property>

The Tajo rootdir must be a URL of the form scheme://hostname:port/path. The current implementation only supports the hdfs:// and file:// schemes. The default value is file:///tmp/tajo-${user.name}/.
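A quick pre-flight check on the scheme can catch a mistyped rootdir before the cluster is started. The sketch below only tests the URL prefix against the two supported schemes; is_supported_rootdir is a hypothetical helper and does not verify that the path exists or is reachable.

```sh
# Sketch: accept only the two schemes the text lists as supported.
# is_supported_rootdir is a hypothetical helper; it checks the prefix only.
is_supported_rootdir() {
  case $1 in
    hdfs://* | file://*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, is_supported_rootdir "hdfs://namenode_hostname:9000/tajo" succeeds, while a value starting with any other scheme fails. The host name and port here are placeholders.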

TajoMaster Heap Memory Size

The environment variable TAJO_MASTER_HEAPSIZE in conf/tajo-env.sh sets the heap memory size that Tajo Master is allowed to use.

To adjust the heap memory size, set the TAJO_MASTER_HEAPSIZE variable in conf/tajo-env.sh to a suitable size as follows:

TAJO_MASTER_HEAPSIZE=2000

The default size is 1000 (1GB).

Worker Configuration

Worker Heap Memory Size

The environment variable TAJO_WORKER_HEAPSIZE in conf/tajo-env.sh sets the heap memory size that Tajo Worker is allowed to use.

To adjust the heap memory size, set the TAJO_WORKER_HEAPSIZE variable in conf/tajo-env.sh to a suitable size as follows:

TAJO_WORKER_HEAPSIZE=8000

The default size is 1000 (1GB).

Temporary Data Directory

TajoWorker stores temporary data on the local file system because it uses out-of-core algorithms. You can specify one or more directories where this temporary data will be stored.

tajo-site.xml

<property>
  <name>tajo.worker.tmpdir.locations</name>
  <value>/disk1/tmpdir,/disk2/tmpdir,/disk3/tmpdir</value>
</property>
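The directories named in that value may need to exist on every worker before they are used. The following sketch creates each entry of a comma-separated list; make_tmpdirs is a hypothetical helper, and whether Tajo creates missing directories by itself is not stated here, so treat this as a precaution rather than a requirement.

```sh
# Sketch: create every directory in a comma-separated list and print each one.
# make_tmpdirs is a hypothetical helper, not a Tajo command.
make_tmpdirs() (
  locations=$1
  IFS=,      # split the list on commas (the function body is a subshell, so IFS is not leaked)
  set -f     # disable globbing so unusual characters in paths are kept literally
  for dir in $locations; do
    mkdir -p "$dir" || exit 1
    echo "$dir"
  done
)
```

For example, make_tmpdirs "/disk1/tmpdir,/disk2/tmpdir,/disk3/tmpdir" mirrors the value used in the config.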

Maximum number of parallel running tasks for each worker

In Tajo, the number of tasks that can run in parallel is determined by the available resources and the workload of running queries. To configure it, please see the Worker Resources section.

Worker Resources

Each worker can execute multiple tasks simultaneously. In Tajo, users can specify the total size of memory and the number of disks for each worker. Available resources affect how many tasks are executed simultaneously.

In order to specify the resource capacity of each worker, you should add the following configs to tajo-site.xml.

| property name | description | value type | default value |
| --- | --- | --- | --- |
| tajo.worker.resource.cpu-cores | the number of cpu cores | integer | 1 |
| tajo.worker.resource.memory-mb | memory size (MB) | integer | 1024 |
| tajo.worker.resource.disks | the number of disks | integer | 1 |

Note

Currently, QueryMaster requests 512MB memory and 1.0 disk per task for backward compatibility.

Example

Assume that you want to give 5120 MB memory, 6.0 disks, and 24 cores to each worker. The example configuration is as follows:

tajo-site.xml

<property>
  <name>tajo.worker.resource.cpu-cores</name>
  <value>24</value>
</property>
<property>
  <name>tajo.worker.resource.memory-mb</name>
  <value>5120</value>
</property>
<property>
  <name>tajo.worker.resource.disks</name>
  <value>6.0</value>
</property>
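To get a feel for what such a configuration means, you can estimate how many tasks fit on one worker from the per-task request quoted in the note (512 MB memory and 1.0 disk). The sketch below is a rough back-of-the-envelope calculation, not Tajo's scheduler: it assumes a task needs both its memory share and its disk share, ignores cpu-cores, and takes the disk count as a whole number. estimate_task_slots is a hypothetical helper.

```sh
# Sketch: rough upper bound on concurrent tasks per worker.
# Assumption: each task needs 512 MB and 1 disk (from the note), and both limits apply.
# estimate_task_slots is a hypothetical helper; the real scheduler may behave differently.
estimate_task_slots() (
  memory_mb=$1
  disks=$2                        # whole number, e.g. 6 for "6.0"
  by_memory=$((memory_mb / 512))  # tasks that fit in memory
  by_disks=$disks                 # tasks that fit on disks (1 disk each)
  if [ "$by_memory" -lt "$by_disks" ]; then
    echo "$by_memory"
  else
    echo "$by_disks"
  fi
)
```

Under these assumptions, estimate_task_slots 5120 6 prints 6: memory would allow 10 tasks, but the disks allow only 6.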

Dedicated Mode

Tajo provides a dedicated mode that allows each worker in a Tajo cluster to use all available system resources, including CPU cores, memory, and disks. To enable this mode, add the following config to tajo-site.xml:

<property>
  <name>tajo.worker.resource.dedicated</name>
  <value>true</value>
</property>

In addition, you can limit the memory capacity used by a Tajo worker as follows:

| property name | description | value type | default value |
| --- | --- | --- | --- |
| tajo.worker.resource.dedicated-memory-ratio | fraction of the whole memory to be used | float | 0.8 |
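As a worked example of the ratio, the sketch below multiplies a machine's total memory by the ratio and truncates the result to whole megabytes. It assumes the ratio is applied to the total physical memory of the node, which is one reading of "whole memory"; dedicated_memory_mb is a hypothetical helper and the 16384 MB figure is only an illustration.

```sh
# Sketch: memory a dedicated-mode worker would get for a given ratio.
# Assumption: the ratio is applied to the node's total memory, truncated to whole MB.
# dedicated_memory_mb is a hypothetical helper, not a Tajo command.
dedicated_memory_mb() {
  awk -v total="$1" -v ratio="$2" 'BEGIN { printf "%d\n", total * ratio }'
}
```

With the default ratio, dedicated_memory_mb 16384 0.8 prints 13107, so a node with 16 GB of memory would leave roughly 3 GB for everything else.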

Catalog Configuration

If you want to customize the catalog service, copy $TAJO_HOME/conf/catalog-site.xml.template to catalog-site.xml. Then, add the following configs to catalog-site.xml. Note that the default configs are enough to launch a Tajo cluster in most cases.

  • tajo.catalog.master.addr - If you want to launch a Tajo cluster in distributed mode, you must specify this address. For more detailed information, see Configuration Defaults.
  • tajo.catalog.store.class - If you want to change the persistent storage of the catalog server, specify the class name. Its default value is tajo.catalog.store.DerbyStore. In the current version, Tajo provides the following four storage classes:

| Driver Class | Description |
| --- | --- |
| tajo.catalog.store.DerbyStore | this storage class uses Apache Derby. |
| tajo.catalog.store.MySQLStore | this storage class uses MySQL. |
| tajo.catalog.store.MemStore | this is the in-memory storage. It is only used in unit tests to shorten their duration. |
| tajo.catalog.store.HCatalogStore | this storage class uses HiveMetaStore. |

The config examples in this document set this property using the fully qualified form, for example org.apache.tajo.catalog.store.MySQLStore.

MySQLStore Configuration

To use MySQLStore, you need to create a database and a user on MySQL for Tajo.

mysql> create user 'tajo'@'localhost' identified by 'xxxxxx';
Query OK, 0 rows affected (0.00 sec)

mysql> create database tajo;
Query OK, 1 row affected (0.00 sec)

mysql> grant all on tajo.* to 'tajo'@'localhost';
Query OK, 0 rows affected (0.01 sec)

Next, you need to make the MySQL JDBC driver available on the machine that runs TajoMaster. Once you have it, set the TAJO_CLASSPATH variable in conf/tajo-env.sh to its path as follows:

export TAJO_CLASSPATH=/usr/local/mysql/lib/mysql-connector-java-x.x.x.jar

Alternatively, you can just copy the JDBC driver into $TAJO_HOME/lib.

Finally, add the following configs to conf/catalog-site.xml:

<property>
  <name>tajo.catalog.store.class</name>
  <value>org.apache.tajo.catalog.store.MySQLStore</value>
</property>
<property>
  <name>tajo.catalog.jdbc.connection.id</name>
  <value><mysql user name></value>
</property>
<property>
  <name>tajo.catalog.jdbc.connection.password</name>
  <value><mysql user password></value>
</property>
<property>
  <name>tajo.catalog.jdbc.uri</name>
  <value>jdbc:mysql://<mysql host name>:<mysql port>/<database name for tajo>?createDatabaseIfNotExist=true</value>
</property>
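The tajo.catalog.jdbc.uri value is easy to mistype, so it can help to build it from its three parts. The sketch below assembles the URI in the shape shown above; mysql_jdbc_uri is a hypothetical helper, and the host name in the example is a placeholder (3306 is MySQL's standard port).

```sh
# Sketch: assemble the JDBC URI from host, port, and database name.
# mysql_jdbc_uri is a hypothetical helper; it does no validation of its arguments.
mysql_jdbc_uri() {
  printf 'jdbc:mysql://%s:%s/%s?createDatabaseIfNotExist=true\n' "$1" "$2" "$3"
}
```

For example, mysql_jdbc_uri dbhost 3306 tajo prints jdbc:mysql://dbhost:3306/tajo?createDatabaseIfNotExist=true, which can be pasted into the value element.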

HCatalogStore Configuration

Tajo supports HCatalogStore to integrate with Hive. To use HCatalogStore, follow these steps.

First, you must compile source code and get a binary archive as follows:

$ git clone https://git-wip-us.apache.org/repos/asf/tajo.git tajo
$ cd tajo
$ mvn clean package -DskipTests -Pdist -Dtar -Phcatalog-0.1x.0
$ ls tajo-dist/target/tajo-0.8.0-SNAPSHOT.tar.gz

Tajo can be built against Hive 0.11.0 or Hive 0.12.0. If you use Hive 0.11.0, set -Phcatalog-0.11.0; if you use Hive 0.12.0, set -Phcatalog-0.12.0.
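In a build script, the profile flag can be derived from the Hive version so that unsupported versions fail early. This is a minimal sketch covering only the two versions named above; hcatalog_profile is a hypothetical helper.

```sh
# Sketch: map a Hive version to the matching Maven profile flag.
# hcatalog_profile is a hypothetical helper; only 0.11.0 and 0.12.0 are accepted.
hcatalog_profile() {
  case $1 in
    0.11.0 | 0.12.0) echo "-Phcatalog-$1" ;;
    *) echo "unsupported hive version: $1" >&2; return 1 ;;
  esac
}
```

The result can be passed straight to the build, for example mvn clean package -DskipTests -Pdist -Dtar "$(hcatalog_profile 0.12.0)".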

Second, set the HIVE_HOME variable in conf/tajo-env.sh to your Hive home directory as follows:

export HIVE_HOME=/path/to/your/hive/directory

Third, if you need to use JDBC to connect to HiveMetaStore, you have to make the MySQL JDBC driver available on the host that runs TajoMaster. Once you have it, set the HIVE_JDBC_DRIVER_DIR variable in conf/tajo-env.sh to the driver file path as follows:

export HIVE_JDBC_DRIVER_DIR=/path/to/your/mysql_jdbc_driver/mysql-connector-java-x.x.x-bin.jar

Lastly, add the following config to conf/catalog-site.xml:

<property>
  <name>tajo.catalog.store.class</name>
  <value>org.apache.tajo.catalog.store.HCatalogStore</value>
</property>

Configuration Defaults

Tajo Master Configuration Defaults

| Service Name | Config Property Name | Description | Default Address |
| --- | --- | --- | --- |
| Tajo Master Umbilical Rpc | tajo.master.umbilical-rpc.address | | localhost:26001 |
| Tajo Master Client Rpc | tajo.master.client-rpc.address | | localhost:26002 |
| Tajo Master Info Http | tajo.master.info-http.address | | 0.0.0.0:26080 |
| Tajo Catalog Client Rpc | tajo.catalog.client-rpc.address | | localhost:26005 |

Tajo Worker Configuration Defaults

| Service Name | Config Property Name | Description | Default Address |
| --- | --- | --- | --- |
| Tajo Worker Peer Rpc | tajo.worker.peer-rpc.address | | 0.0.0.0:28091 |
| Tajo Worker Client Rpc | tajo.worker.client-rpc.address | | 0.0.0.0:28092 |
| Tajo Worker Info Http | tajo.worker.info-http.address | | 0.0.0.0:28080 |
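Every address in these defaults has the form host:port, so a script that prepares firewall rules or health checks only needs to split on the last colon. The sketch below does that with plain parameter expansion; host_of and port_of are hypothetical helpers, not Tajo commands.

```sh
# Sketch: split a "host:port" address into its two parts.
# host_of and port_of are hypothetical helpers, not Tajo commands.
host_of() { printf '%s\n' "${1%:*}"; }   # everything before the last colon
port_of() { printf '%s\n' "${1##*:}"; }  # everything after the last colon
```

For example, port_of 0.0.0.0:28091 prints 28091 and host_of localhost:26002 prints localhost.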
