Setting Up SparkR for RStudio
Overview
SparkR originated at UC Berkeley's AMPLab, with additional contributions from Alteryx, Intel, Databricks, and others.
What is SparkR?
- New R language API for Spark and SparkSQL
- Exposes existing Spark functionality in an R-friendly syntax via the DataFrame API
- Has its own shell, but can also be loaded like a standard R package and used from RStudio.
History of DataFrames
- SparkR began as an R package that ported Spark’s core functionality (RDDs) to the R language.
- The next logical step was to add SparkSQL and SchemaRDDs.
- An initial implementation brought SQLContext and SchemaRDDs into SparkR; SchemaRDDs were later renamed DataFrames.
Why DataFrames?
- Uses the distributed, parallel capabilities of RDDs, but imposes a schema on the data
- More structure == easier access and manipulation
- A natural extension of existing R conventions, since the data frame is already R's standard tabular structure
- Super awesome distributed, in-memory collections
- Schemas == metadata, structure, declarative instead of imperative (see printSchema() in the examples below)
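The snippet below initializes SparkR from RStudio against a local Spark 1.4.1 installation (the Homebrew path shown here; adjust SPARK_HOME for your own install), then creates and queries a DataFrame: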
# Point R at the local Spark installation and put SparkR on the library path
Sys.setenv(SPARK_HOME = "/usr/local/Cellar/apache-spark/1.4.1/libexec")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

# Start a local Spark context with 8 threads, then a SQL context on top of it
sc <- sparkR.init(master = "local[8]")
sqlContext <- sparkRSQL.init(sc)

# Inspect R's built-in Old Faithful dataset, then convert it to a Spark DataFrame
faithful
df <- createDataFrame(sqlContext, faithful)

# Select one column
head(select(df, df$eruptions))

# Filter out rows with a waiting time below 50
head(filter(df, df$waiting < 50))
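The same DataFrame also supports schema inspection, grouped aggregation, and raw SQL. A minimal sketch continuing the session above (the temp-table name faithful_tbl is arbitrary, and the calls assume the SparkR 1.4 API):

# Print the schema imposed on the data (cf. "Schemas == metadata" above)
printSchema(df)

# Group by waiting time and count observations per group
head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))

# Register the DataFrame as a temporary table and query it with SQL
registerTempTable(df, "faithful_tbl")
long_waits <- sql(sqlContext, "SELECT * FROM faithful_tbl WHERE waiting > 80")
head(long_waits)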
DataFrames in SparkR
- Multiple components:
  - A set of native S4 classes and methods that live inside a standard R package
  - A SparkR backend that passes data structures and method calls to the JVM
  - A set of “helper” methods written in Scala
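The R side of this architecture is visible from the session above: a SparkR DataFrame is an S4 object whose slots hold the handle to its JVM-side counterpart. A quick sketch, assuming the slot layout of SparkR 1.4 (slot names may differ in other versions):

# df was created with createDataFrame() above
isS4(df)       # TRUE -- DataFrame is an S4 class defined by the SparkR package
class(df)      # "DataFrame"
slotNames(df)  # in SparkR 1.4: "env" and "sdf"; sdf is the reference to the JVM-side object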