Article Source
- Title: Spark 1.5 DataFrame API Highlights: Date/Time/String Handling, Time Intervals, and UDAFs
- Authors: Michael Armbrust, Yin Huai, Davies Liu and Reynold Xin
Spark 1.5 DataFrame API Highlights
Date/Time/String Handling, Time Intervals, and UDAFs
A few days ago, we announced the release of Spark 1.5. This release contains major under-the-hood changes that improve Spark’s performance, usability, and operational stability. Besides these changes, we have been continuously improving the DataFrame API. In this blog post, we’d like to highlight three major improvements to the DataFrame API in Spark 1.5:
- New built-in functions;
- Time intervals; and
- Experimental user-defined aggregation function (UDAF) interface.
New Built-in Functions in Spark 1.5
In Spark 1.5, we have added a comprehensive list of built-in functions to the DataFrame API, complete with optimized code generation for execution. This code generation allows pipelines that call functions to take full advantage of the efficiency changes made as part of Project Tungsten. With these new additions, Spark SQL now supports a wide range of built-in functions for various use cases, including:
Category | Functions |
---|---|
Aggregate Functions | approxCountDistinct, avg, count, countDistinct, first, last, max, mean, min, sum, sumDistinct |
Collection Functions | array_contains, explode, size, sort_array |
Date/time Functions | Date/timestamp conversion: unix_timestamp, from_unixtime, to_date, to_utc_timestamp, from_utc_timestamp; Extracting fields from a date/timestamp value: year, quarter, month, weekofyear, dayofyear, dayofmonth, hour, minute, second; Date/timestamp calculation: datediff, date_add, date_sub, add_months, months_between, last_day, next_day; Misc.: current_date, current_timestamp, trunc, date_format |
Math Functions | abs, acos, asin, atan, atan2, bin, cbrt, ceil, conv, cos, cosh, exp, expm1, factorial, floor, hex, hypot, log, log10, log1p, log2, pmod, pow, rint, round, shiftLeft, shiftRight, shiftRightUnsigned, signum, sin, sinh, sqrt, tan, tanh, toDegrees, toRadians, unhex |
Misc. Functions | array, bitwiseNOT, callUDF, coalesce, crc32, greatest, if, inputFileName, isNaN, isnotnull, isnull, least, lit, md5, monotonicallyIncreasingId, nanvl, negate, not, rand, randn, sha1, sha2, sparkPartitionId, struct, when |
String Functions | ascii, base64, concat, concat_ws, decode, encode, format_number, format_string, get_json_object, initcap, instr, length, levenshtein, locate, lower, lpad, ltrim, printf, regexp_extract, regexp_replace, repeat, reverse, rpad, rtrim, soundex, space, split, substring, substring_index, translate, trim, unbase64, upper |
Window Functions (in addition to Aggregate Functions) | cumeDist, denseRank, lag, lead, ntile, percentRank, rank, rowNumber |
```python
# Create a simple DataFrame
data = [
  (234.5, "row1"),
  (23.45, "row2"),
  (2.345, "row3"),
  (0.2345, "row4")]
df = sqlContext.createDataFrame(data, ["i", "j"])

# Import functions provided by Spark's DataFrame API
from pyspark.sql.functions import *

# Call round function directly
df.select(
  round(df['i'], 1),
  round(df['i'], 0),
  round(df['i'], -1)).show()

+-----------+-----------+------------+
|round(i, 1)|round(i, 0)|round(i, -1)|
+-----------+-----------+------------+
|      234.5|      235.0|       230.0|
|       23.5|       23.0|        20.0|
|        2.3|        2.0|         0.0|
|        0.2|        0.0|         0.0|
+-----------+-----------+------------+
```
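The other built-in functions follow the same pattern. As a quick sketch (the `people` DataFrame and its columns are made up for illustration), a few of the string and date/time functions from the table can be combined in a single select:

```python
from pyspark.sql.functions import concat_ws, upper, to_date, year

# A hypothetical DataFrame with string columns (for illustration only).
people = sqlContext.createDataFrame(
    [("Ada", "Lovelace", "1815-12-10")],
    ["first_name", "last_name", "birth_date"])

# Combine string functions (concat_ws, upper) with date functions (to_date, year).
people.select(
    upper(concat_ws(" ", people.first_name, people.last_name)).alias("full_name"),
    year(to_date(people.birth_date)).alias("birth_year")).show()
```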
Alternatively, all of the added functions are also available from SQL using standard syntax:
```sql
SELECT
  round(i, 1)
FROM
  dataFrame
```
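To run SQL like the statement above against a DataFrame, the DataFrame first has to be exposed as a table. A minimal sketch, reusing the `df` from the round example and registering it under the (arbitrary) name `dataFrame`:

```python
# Register the DataFrame from the round() example as a temporary table;
# the name "dataFrame" matches the table referenced in the SQL above.
df.registerTempTable("dataFrame")

# Run the same query through the SQL interface.
sqlContext.sql("SELECT round(i, 1) FROM dataFrame").show()
```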
Finally, you can even mix and match SQL syntax with DataFrame operations by using the `expr` function. With `expr`, you can construct a DataFrame column expression from a SQL expression string.
```python
df.select(
  expr("round(i, 1) AS rounded1"),
  expr("round(i, 0) AS rounded2"),
  expr("round(i, -1) AS rounded3")).show()
```
Time Interval Literals
In the last section, we introduced several new date and time functions that were added in Spark 1.5 (e.g. `datediff`, `date_add`, `date_sub`), but that is not the only new feature that will help users dealing with date or timestamp values. Another related feature is a new data type, interval, that allows developers to represent fixed periods of time (e.g. 1 day or 2 months) as interval literals. Using interval literals, it is possible to perform subtraction or addition of an arbitrary amount of time from a date or timestamp value. This representation can be useful when you want to add or subtract a time period from a fixed point in time. For example, users can now easily express queries like “Find all transactions that have happened during the past hour”.
An interval literal is constructed using the following syntax:
```sql
INTERVAL value unit
```
Breaking the above expression down, all time intervals start with the `INTERVAL` keyword. Next, the value and unit together specify the time difference. Available units are `YEAR`, `MONTH`, `DAY`, `HOUR`, `MINUTE`, `SECOND`, `MILLISECOND`, and `MICROSECOND`. For example, the following interval literal represents 3 years.
```sql
INTERVAL 3 YEAR
```
In addition to specifying an interval literal with a single unit, users can also combine different units. For example, the following interval literal represents a 3-year and 3-hour time difference.
```sql
INTERVAL 3 YEAR 3 HOUR
```
In the DataFrame API, the `expr` function can be used to create a `Column` representing an interval. The following Python code is an example of using an interval literal to select records where `start_time` and `end_time` are in the same day and differ by no more than an hour.
```python
# Import functions.
from pyspark.sql.functions import *

# Create a simple DataFrame.
data = [
  ("2015-01-01 23:59:59", "2015-01-02 00:01:02", 1),
  ("2015-01-02 23:00:00", "2015-01-02 23:59:59", 2),
  ("2015-01-02 22:59:58", "2015-01-02 23:59:59", 3)]
df = sqlContext.createDataFrame(data, ["start_time", "end_time", "id"])
df = df.select(
  df.start_time.cast("timestamp").alias("start_time"),
  df.end_time.cast("timestamp").alias("end_time"),
  df.id)

# Get all records that have a start_time and end_time in the
# same day, and the difference between the end_time and start_time
# is less or equal to 1 hour.
condition = \
  (to_date(df.start_time) == to_date(df.end_time)) & \
  (df.start_time + expr("INTERVAL 1 HOUR") >= df.end_time)

df.filter(condition).show()

+---------------------+---------------------+---+
|           start_time|             end_time| id|
+---------------------+---------------------+---+
|2015-01-02 23:00:00.0|2015-01-02 23:59:59.0|  2|
+---------------------+---------------------+---+
```
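The same approach covers the query mentioned at the start of this section, finding all transactions that have happened during the past hour. A minimal sketch, assuming a hypothetical `transactions` table with a timestamp column `ts`:

```python
from pyspark.sql.functions import expr

# Hypothetical table with a timestamp column "ts".
transactions = sqlContext.table("transactions")

# Keep only the rows whose timestamp falls within the last hour.
recent = transactions.filter(expr("ts >= current_timestamp() - INTERVAL 1 HOUR"))
recent.show()
```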
User-defined Aggregate Function Interface
For power users, Spark 1.5 introduces an experimental API for user-defined aggregate functions (UDAFs). These UDAFs can be used to compute custom calculations over groups of input data (in contrast, UDFs compute a value by looking at a single input row), such as calculating the geometric mean or the product of values for every group.
A UDAF maintains an aggregation buffer to store intermediate results for every group of input data. It updates this buffer for every input row. Once it has processed all input rows, it generates a result value based on values of the aggregation buffer.
A UDAF inherits from the base class `UserDefinedAggregateFunction` and implements the following eight methods:
- `inputSchema`: returns a `StructType`; every field of this `StructType` represents an input argument of this UDAF.
- `bufferSchema`: returns a `StructType`; every field of this `StructType` represents a field of this UDAF’s intermediate results.
- `dataType`: returns a `DataType` representing the data type of this UDAF’s returned value.
- `deterministic`: returns a boolean indicating whether this UDAF always generates the same result for a given set of input values.
- `initialize`: initializes the values of an aggregation buffer, represented by a `MutableAggregationBuffer`.
- `update`: updates an aggregation buffer, represented by a `MutableAggregationBuffer`, for an input `Row`.
- `merge`: merges two aggregation buffers and stores the result in a `MutableAggregationBuffer`.
- `evaluate`: generates the final result value of this UDAF based on the values stored in an aggregation buffer, represented by a `Row`.
Below is an example UDAF implemented in Scala that calculates the geometric mean of a given set of double values. The geometric mean can be used as an indicator of the typical value of an input set of numbers by using the product of their values (as opposed to the standard built-in mean, which is based on the sum of the input values). For simplicity, null-handling logic is not shown in the following code.
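Concretely, for n values x1, x2, …, xn the geometric mean is (x1 · x2 · ⋯ · xn)^(1/n): the UDAF keeps a running count and a running product in its aggregation buffer and takes the n-th root in `evaluate`.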
```scala
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

class GeometricMean extends UserDefinedAggregateFunction {
  def inputSchema: org.apache.spark.sql.types.StructType =
    StructType(StructField("value", DoubleType) :: Nil)

  def bufferSchema: StructType = StructType(
    StructField("count", LongType) ::
    StructField("product", DoubleType) :: Nil
  )

  def dataType: DataType = DoubleType

  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 1.0
  }

  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getAs[Long](0) + 1
    buffer(1) = buffer.getAs[Double](1) * input.getAs[Double](0)
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getAs[Long](0) + buffer2.getAs[Long](0)
    buffer1(1) = buffer1.getAs[Double](1) * buffer2.getAs[Double](1)
  }

  def evaluate(buffer: Row): Any = {
    math.pow(buffer.getDouble(1), 1.toDouble / buffer.getLong(0))
  }
}
```
A UDAF can be used in two ways. First, an instance of a UDAF can be used immediately as a function. Second, users can register a UDAF with Spark SQL’s function registry and call it by the assigned name. Example code is shown below.
```scala
import org.apache.spark.sql.functions._

// Create a simple DataFrame with a single column called "id"
// containing numbers 1 to 10.
val df = sqlContext.range(1, 11)

// Create an instance of UDAF GeometricMean.
val gm = new GeometricMean

// Show the geometric mean of values of column "id".
df.groupBy().agg(gm(col("id")).as("GeometricMean")).show()

// Register the UDAF and call it "gm".
sqlContext.udf.register("gm", gm)

// Invoke the UDAF by its assigned name.
df.groupBy().agg(expr("gm(id) as GeometricMean")).show()
```
Summary
In this blog post, we introduced three major additions to the DataFrame API: a set of built-in functions, time interval literals, and an experimental user-defined aggregate function (UDAF) interface. With the new built-in functions, it is easier to manipulate string data and date/timestamp data, and to apply math operations. If your existing programs use user-defined functions that do the same work as these built-in functions, we strongly recommend migrating your code to the new built-in functions to take full advantage of the efficiency changes made as part of Project Tungsten. Combining date/time functions and interval literals makes it much easier to work with date/timestamp data and to calculate date/timestamp values for various use cases. With the user-defined aggregate function interface, users can apply custom aggregations over groups of input data in the DataFrame API.
To try out these new features, download Spark 1.5 or sign up for a free Databricks trial.