Article Source
Data Model
Data Types
Supported | SQL Type Name | Alias | Size (byte) | Description | Range |
---|---|---|---|---|---|
O | boolean | bool | 1 | true/false | |
bit | 1 | 1/0 | |||
varbit | bit varying | ||||
O | smallint | tinyint, int2 | 2 | small-range integer value | -2^15 (-32,768) to 2^15 (32,767) |
O | integer | int, int4 | 4 | integer value | -2^31 (-2,147,483,648) to 2^31 - 1 (2,147,483,647) |
O | bigint | bit varying | 8 | larger range integer value | -2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807) |
O | real | int8 | 4 | variable-precision, inexact, real number value | -3.4028235E+38 to 3.4028235E+38 (6 decimal digits precision) |
O | float[(n)] | float4 | 4 or 8 | variable-precision, inexact, real number value | |
O | double | float8, double precision | 8 | variable-precision, inexact, real number value | 1 .7E–308 to 1.7E+308 (15 decimal digits precision) |
number | decimal | ||||
char[(n)] | character | ||||
varchar[(n)] | character varying | ||||
O | text | text | variable-length unicode text | ||
binary | binary | ||||
varbinary[(n)] | binary varying | ||||
O | blob | bytea | variable-length binary string | ||
date | |||||
time | |||||
timetz | time with time zone | ||||
timestamp | |||||
timestamptz | |||||
O | inet4 | 4 | IPv4 address |
Using real number value (real and double)
The real and double data types are mapped to float and double of java primitives respectively. Java primitives float and double follows the IEEE 754 specification. So, these types are correctly matched to SQL standard data types.
- float[( n )] is mapped to either float or double according to a given length n. If n is specified, it must be bewtween 1 and 53. The default value of n is 53.
- If 1 <- n <- 24, a value is mapped to float (6 decimal digits precision).
- If 25 <- n <- 53, a value is mapped to double (15 decimal digits precision).
- Do not use approximate real number columns in WHERE clause in order to compare some exact matches, especially the - and <> operators. The > or < comparisons work well.
Data Definition Language
CREATE DATABASE
Synopsis
CREATE DATABASE [IF NOT EXISTS] <database_name>
IF NOT EXISTS allows CREATE DATABASE statement to avoid an error which occurs when the database exists.
DROP DATABASE
Synopsis
DROP DATABASE [IF EXISTS]
CREATE TABLE
Synopsis
CREATE TABLE [IF NOT EXISTS] <table_name> [(<column_name> <data_type>, ... )]
[using <storage_type> [with (<key> = <value>, ...)]] [AS <select_statement>]
CREATE EXTERNAL TABLE [IF NOT EXISTS] <table_name> (<column_name> <data_type>, ... )
using <storage_type> [with (<key> = <value>, ...)] LOCATION '<path>'
IF NOT EXISTS
allows CREATE [EXTERNAL] TABLE
statement to avoid an error which occurs when the table does not exist.
Compression
If you want to add an external table that contains compressed data, you should give ‘compression.code’ parameter to CREATE TABLE statement.
create EXTERNAL table lineitem (
L_ORDERKEY bigint,
L_PARTKEY bigint,
...
L_COMMENT text)
USING csv WITH ('csvfile.delimiter'='|','compression.codec'='org.apache.hadoop.io.compress.DeflateCodec')
LOCATION 'hdfs://localhost:9010/tajo/warehouse/lineitem_100_snappy';
compression.codec parameter can have one of the following compression codecs:
- org.apache.hadoop.io.compress.BZip2Codec
- org.apache.hadoop.io.compress.DeflateCodec
- org.apache.hadoop.io.compress.GzipCodec
- org.apache.hadoop.io.compress.SnappyCodec
DROP TABLE
Synopsis
DROP TABLE [IF EXISTS] <table_name> [PURGE]
IF EXISTS
allows DROP DATABASE
statement to avoid an error which occurs when the database does not exist. DROP TABLE
statement removes a table from Tajo catalog, but it does not remove the contents. If PURGE
option is given, DROP TABLE
statement will eliminate the entry in the catalog as well as the contents.
INSERT (OVERWRITE) INTO
INSERT OVERWRITE
statement overwrites a table data of an existing table or a data in a given directory. Tajo’s INSERT OVERWRITE
statement follows INSERT INTO SELECT
statement of SQL. The examples are as follows:
create table t1 (col1 int8, col2 int4, col3 float8);
-- when a target table schema and output schema are equivalent to each other
INSERT OVERWRITE INTO t1 SELECT l_orderkey, l_partkey, l_quantity FROM lineitem;
-- or
INSERT OVERWRITE INTO t1 SELECT * FROM lineitem;
-- when the output schema are smaller than the target table schema
INSERT OVERWRITE INTO t1 SELECT l_orderkey FROM lineitem;
-- when you want to specify certain target columns
INSERT OVERWRITE INTO t1 (col1, col3) SELECT l_orderkey, l_quantity FROM lineitem;
In addition, INSERT OVERWRITE
statement overwrites table data as well as a specific directory.
INSERT OVERWRITE INTO LOCATION '/dir/subdir' SELECT l_orderkey, l_quantity FROM lineitem;
Queries
Overview
Synopsis
SELECT [distinct [all]] * | <expression> [[AS] <alias>] [, ...]
[FROM <table name> [[AS] <table alias name>] [, ...]]
[WHERE <condition>]
[GROUP BY <expression> [, ...]]
[HAVING <condition>]
[ORDER BY <expression> [ASC|DESC] [NULL FIRST|NULL LAST] [, ...]]
From Clause
Where Clause
Groupby Clause
Select list
SQL Expressions
Arithmetic Expressions
Type Casts
A type cast converts a specified-typed data to another-typed data. Tajo has two type cast syntax:
CAST ( expression AS type )
expression::type ## String Expressions ## Function Call function_name ([expression [, expression ... ]] )
Predicates
IN Predicate
IN predicate provides row and array comparison.
Synopsis
column_reference IN (val1, val2, ..., valN)
column_reference NOT IN (val1, val2, ..., valN)
Examples are as follows:
-- this statement filters lists down all the records where col1 value is 1, 2 or 3:
SELECT col1, col2 FROM table1 WHERE col1 IN (1, 2, 3);
-- this statement filters lists down all the records where col1 value is neither 1, 2 nor 3:
SELECT col1, col2 FROM table1 WHERE col1 NOT IN (1, 2, 3);
You can use ‘IN clause’ on text data domain as follows:
SELECT col1, col2 FROM table1 WHERE col2 IN ('tajo', 'hadoop');
SELECT col1, col2 FROM table1 WHERE col2 NOT IN ('tajo', 'hadoop');
String Pattern Matching Predicates
LIKE
LIKE operator returns true or false depending on whether its pattern matches the given string. An underscore (_) in pattern matches any single character. A percent sign (%) matches any sequence of zero or more characters.
Synopsis
string LIKE pattern
string NOT LIKE pattern
ILIKE
ILIKE is the same to LIKE, but it is a case insensitive operator. It is not in the SQL standard. We borrow this operator from PostgreSQL.
Synopsis
string ILIKE pattern
string NOT ILIKE pattern
SIMILAR TO
Synopsis
string SIMILAR TO pattern
string NOT SIMILAR TO pattern
It returns true or false depending on whether its pattern matches the given string. Also like LIKE, SIMILAR TO
uses _
and %
as metacharacters denoting any single character and any string, respectively.
In addition to these metacharacters borrowed from LIKE, ‘SIMILAR TO’ supports more powerful pattern-matching metacharacters borrowed from regular expressions:
metacharacter | description |
---|---|
| | denotes alternation (either of two alternatives). |
* | denotes repetition of the previous item zero or more times. |
+ | denotes repetition of the previous item one or more times. |
? | denotes repetition of the previous item zero or one time. |
{m} | denotes repetition of the previous item exactly m times. |
{m,} | denotes repetition of the previous item m or more times. |
{m,n} | denotes repetition of the previous item at least m and not more than n times. |
[] | A bracket expression specifies a character class, just as in POSIX regular expressions. |
() | Parentheses can be used to group items into a single logical item. |
Note that .` is not used as a metacharacter in SIMILAR TO operator.
Regular expressions
Regular expressions provide a very powerful means for string pattern matching. In the current Tajo, regular expressions are based on Java-style regular expressions instead of POSIX regular expression. The main difference between java-style one and POSIX’s one is character class.
Synopsis
string ~ pattern
string !~ pattern
string ~* pattern
string !~* pattern
operator | Description |
---|---|
~ | It returns true if a given regular expression is matched to string. Otherwise, it returns false. |
!~ | It returns false if a given regular expression is matched to string. Otherwise, it returns true. |
~* | It is the same to ‘~’, but it is case insensitive. |
!~* | It is the same to ‘!~’, but it is case insensitive. |
Here are examples:
'abc' ~ '.*c' true
'abc' ~ 'c' false
'aaabc' ~ '([a-z]){3}bc true
'abc' ~* '.*C' true
'abc' !~* 'B.*' true
Regular expressions operator is not in the SQL standard. We borrow this operator from PostgreSQL.
Synopsis for REGEXP and RLIKE operators
string REGEXP pattern
string NOT REGEXP pattern
string RLIKE pattern
string NOT RLIKE pattern
But, they do not support case-insensitive operators.