
SparkR Setup for RStudio

8 September 2015

Overview

SparkR originated in UC Berkeley's AMPLab, with additional contributions from Alteryx, Intel, Databricks, and others.

What is SparkR?

  • A new R language API for Spark and SparkSQL
  • Exposes existing Spark functionality in an R-friendly syntax via the DataFrame API
  • Has its own shell, but can also be loaded like a standard R package and used from RStudio, as in the setup snippet below

History of DataFrames

  • SparkR began as an R package that ported Spark’s core functionality (RDDs) to the R language.
  • The next logical step was to add SparkSQL and SchemaRDDs.
  • An initial implementation of SQLContext and SchemaRDDs is now working in SparkR.

Why DataFrames?

  • Uses the distributed, parallel capabilities offered by RDDs, but imposes a schema on the data
  • More structure == easier access and manipulation
  • A natural extension of existing R conventions, since data frames are already the standard in R
  • Super awesome distributed, in-memory collections
  • Schemas == metadata; structure that is declarative instead of imperative (see the schema sketch after the setup code below)

To try all of this from RStudio, point R at a local Spark installation, then start a SparkContext and a SQLContext on top of it. The SPARK_HOME path below is for Spark 1.4.1 installed via Homebrew; adjust it to your setup.

# Tell R where Spark lives and put SparkR's bundled R package on the library path
Sys.setenv(SPARK_HOME="/usr/local/Cellar/apache-spark/1.4.1/libexec")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

# Start a local SparkContext with 8 threads, then a SQLContext for DataFrames
sc <- sparkR.init(master="local[8]")
sqlContext <- sparkRSQL.init(sc)

# Peek at the built-in Old Faithful data.frame, then convert it into a
# distributed SparkR DataFrame
head(faithful)
df <- createDataFrame(sqlContext, faithful)

# Select one column
head(select(df, df$eruptions))

# Filter out rows with waiting times under 50 minutes
head(filter(df, df$waiting < 50))
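
Since the bullets above stress that schemas are metadata and that SparkSQL is exposed through the same API, here is a short follow-on sketch. It assumes the df and sqlContext created above; the table name faithful_tbl is only an illustration.

# Schemas are metadata: print the structure Spark inferred from the R data.frame
printSchema(df)

# Register the DataFrame as a temporary table, then query it with SparkSQL
registerTempTable(df, "faithful_tbl")
long_waits <- sql(sqlContext, "SELECT eruptions FROM faithful_tbl WHERE waiting > 80")
head(long_waits)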

DataFrames in SparkR

  • Multiple components:
    • A set of native S4 classes and methods that live inside a standard R package
    • A SparkR backend that passes data structures and method calls to the JVM
    • A set of “helper” methods written in Scala
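
To make the first point concrete, here is a small sketch that again assumes the df from the setup snippet above; everything below runs in an ordinary R session.

# On the R side, a SparkR DataFrame is a plain S4 object; the actual data
# lives in the JVM object it refers to
class(df)              # "DataFrame", an S4 class defined by the SparkR package
slotNames(df)          # its slots hold a reference to the JVM-side object
showMethods("select")  # S4 generic whose methods dispatch on SparkR classes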
