SparkR Setting for RStudio
SparkR originated at UC Berkeley AMPLAB, with additional contributions from Alteryx, Intel, Databricks, and others.
What is SparkR?
- New R language API for Spark and SparkSQL
- Exposes existing Spark functionality in an R-friendly syntax via the DataFrame API
- Has its own shell, but can also be imported like a standard R package and used with Rstudio.
History of DataFrames
- SparkR began as an R package that ported Spark’s core functionality (RDDs) to the R language.
- The next logical step was to add SparkSQL and SchemaRDDs.
- Initial implementation of SQLContext and SchemaRDDs working in SparkR
- Uses the distributed, parallel capabilities offered by RDDs, but imposes a schema on the data
- More structure == Easier access and manipulation
- Natural extension of existing R conventions since DataFrames are already the standard
- Super awesome distributed, in-memory collections
- Schemas == metadata, structure, declarative instead of imperative
Sys.setenv(SPARK_HOME="/usr/local/Cellar/apache-spark/1.4.1/libexec") .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) library(SparkR) sc <- sparkR.init(master="local") sqlContext <- sparkRSQL.init(sc) faithful df <- createDataFrame(sqlContext, faithful) # Select one column head(select(df, df$eruptions)) # Filter out rows head(filter(df, df$waiting < 50))
DataFrames in SparkR
- Multiple Components:
- A set of native S4 classes and methods that live inside a standard R package
- A SparkR backend that passes data structures and method calls to the JVM
- A set of “helper” methods written in Scala