Article Source
- Title: sampleclean
Programming With SampleClean
The SampleClean project is hosted on GitHub.
Latest Release: SampleClean-0.1 (Jar) (GZip Tar) (Zip)
Requirements: JDK 1.6+, Scala 2.10.x, Spark 1.0-1.2
We provide a set of Scala libraries for Entity Resolution, Crowd Sourcing, and Approximate Query Processing.
Entity Resolution: The problem of linking multiple database
representations of the same real world “entity”. SampleClean provides a
library and programming API for constructing distributed entity
resolution pipelines.
Crowd Sourcing: Entity resolution tasks can be hard to automate, and
crowdsourcing is often the preferred way to get reliable results. SampleClean
provides a library of crowd sourcing tools that also learn adaptively
through Active Learning. To use crowd sourcing, you must first
run the AMPCrowd server.
Approximate Query Processing: We often want to know aggregate
statistics of the database (SUM, COUNT, AVG), and to answer these
queries with high accuracy it often suffices to clean a small sample of
the data. SampleClean provides the primitives to sample the data and
extrapolate query results from the sample.
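As a rough illustration of the sample-and-extrapolate idea described above, the following sketch uses plain Scala collections rather than the SampleClean API. The object and method names (SampleExtrapolation, estimateSum, and so on) are hypothetical, and the sketch assumes uniform random sampling at a known ratio.

```scala
import scala.util.Random

// Hypothetical helper, not part of SampleClean: estimates aggregates
// over a full table from a uniform random sample of its rows.
object SampleExtrapolation {
  // Keep each row independently with probability `ratio`.
  def sample[T](rows: Seq[T], ratio: Double, rng: Random): Seq[T] =
    rows.filter(_ => rng.nextDouble() < ratio)

  // COUNT over the full table: scale the sample size up by 1 / ratio.
  def estimateCount(sampleSize: Int, ratio: Double): Double =
    sampleSize / ratio

  // SUM over the full table: scale the sample sum up by 1 / ratio.
  def estimateSum(values: Seq[Double], ratio: Double): Double =
    values.sum / ratio

  // AVG needs no scaling: the sample mean estimates the table mean.
  // An empty sample yields NaN.
  def estimateAvg(values: Seq[Double]): Double =
    values.sum / values.size
}
```

For example, a sample of 50 rows drawn at ratio 0.25 gives an estimated COUNT of 200. Only the sampled rows need to be cleaned before computing these estimates, which is where the cost saving comes from.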
Programming With SampleClean
You can download the SampleClean jar to include with any Spark program, or you can clone our GitHub repository to check out the source code. We have provided a programming guide to help you get started.
Quick Start
We will walk through a basic tutorial on how to get SampleClean running using Spark Shell either locally or on a cluster.
1. Java Development Kit 7+ (Download)
2. Scala 2.10.x (Download)
Spark and SampleClean Local Installation
1. First create a new directory
mkdir sampleclean
2. Download Spark 1.2.x to this directory (Download)
3. Untar Spark
tar xvzf spark-1.2.2.tgz
4. Build Spark
cd spark-1.2.2
sbt/sbt -Phive assembly/assembly
5. Download SampleClean to the Spark directory
6. To avoid permission issues on a local deployment, configure Hive with our default config. Download the config to the Spark directory (Download)
7. Put the config in the Spark configuration folder
mv hive-site.xml.default conf/hive-site.xml
Testing Your Installation
8. Download the example dataset to the Spark folder (Download)
9. Open the Spark shell
./bin/spark-shell --jars sampleclean-v0.1.jar
10. Import SampleClean
import sampleclean.api.SampleCleanContext
11. Create a new SampleCleanContext and HiveContext
val scc = new SampleCleanContext(sc)
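12. Load Example Dataset
// Assumed statement: the CREATE TABLE wrapper and the comma delimiter are not in the original text
scc.hql("""CREATE TABLE IF NOT EXISTS
restaurant(id String,
entity String,
name String,
category String,
city String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','""")
scc.hql("LOAD DATA LOCAL INPATH 'restaurant.csv' OVERWRITE INTO TABLE restaurant")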
- Create a working set
- Count the number of distinct restaurants
scc.hql("select count(distinct name) from restaurant").collect().foreach(println)
- Do Entity Resolution
import sampleclean.clean.deduplication.EntityResolution
val algorithm = EntityResolution.longAttributeCanonicalize(scc,"restaurant_working","name",0.7)
- Count the number of distinct restaurants
scc.hql("select count(distinct name) from restaurant").collect().foreach(println)
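The entity resolution step above passes 0.7 as its last argument, which appears to be a similarity threshold. To make that idea concrete, here is a minimal sketch of threshold-based canonicalization in plain Scala. It assumes word-level Jaccard similarity and a greedy assignment; the names (CanonicalizeSketch, jaccard, canonicalize) are hypothetical, and this is not SampleClean's implementation.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch, not the SampleClean algorithm.
object CanonicalizeSketch {
  // Split a string into a set of lower-case word tokens.
  def tokens(s: String): Set[String] =
    s.toLowerCase.split("\\s+").filter(_.nonEmpty).toSet

  // Jaccard similarity: size of the intersection over size of the union.
  def jaccard(a: String, b: String): Double = {
    val ta = tokens(a)
    val tb = tokens(b)
    if (ta.isEmpty && tb.isEmpty) 1.0
    else (ta intersect tb).size.toDouble / (ta union tb).size
  }

  // Greedy canonicalization: each value maps to the first canonical value
  // seen so far whose similarity meets the threshold; otherwise the value
  // becomes a new canonical value itself.
  def canonicalize(values: Seq[String], threshold: Double): Map[String, String] = {
    val canon = ArrayBuffer.empty[String]
    values.toList.distinct.map { v =>
      val rep = canon.find(c => jaccard(c, v) >= threshold).getOrElse {
        canon += v
        v
      }
      v -> rep
    }.toMap
  }
}
```

With the names "Cafe Roma", "cafe roma", "Roma Cafe Restaurant", and "Blue Bottle Coffee", a threshold of 0.7 merges only the first two (the third shares two of three tokens, a similarity of about 0.67), while a threshold of 0.6 merges the first three. Counting the distinct canonical values is the analogue of the count(distinct name) query run after entity resolution.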
Using the Crowd
19. Configure crowd tasks (if you installed AMPCrowd earlier):
import sampleclean.crowd._
val crowdConfig = CrowdConfiguration(crowdName="internal",
val taskParams = CrowdTaskConfiguration(votesPerPoint=1, maxPointsPerTask=10)
- Add a crowd matching step to the entity resolution algorithm
val crowdMatcher = EntityResolution.createCrowdMatcher(scc, "name", "restaurant_working")
val crowdAlgorithm = EntityResolution.longAttributeCanonicalize(scc,"restaurant_working","name",0.6)
- Run the crowd-driven entity resolution (creating crowd tasks)
- Do some crowd tasks (navigate your browser to
- Persist the new results
- Count the number of distinct restaurants
scc.hql("select count(distinct name) from restaurant").collect().foreach(println)
- Exit
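The crowd task parameters configured in this section (votesPerPoint=1, maxPointsPerTask=10) can be read as "how many workers vote on each candidate pair" and "how many pairs go into one task". That reading is an assumption based on the parameter names. The sketch below shows the two ideas in plain Scala; CrowdTaskSketch, batch, and majorityMatch are hypothetical names and do not come from SampleClean or AMPCrowd.

```scala
// Hypothetical sketch of task batching and vote aggregation.
object CrowdTaskSketch {
  // A candidate pair of records that workers are asked to compare.
  case class Pair(left: String, right: String)

  // Group candidate pairs into tasks of at most maxPointsPerTask pairs.
  def batch(pairs: Seq[Pair], maxPointsPerTask: Int): List[Seq[Pair]] =
    pairs.grouped(maxPointsPerTask).toList

  // Treat a pair as a match when a strict majority of votes say so.
  // With a single vote per pair, that vote decides the outcome.
  def majorityMatch(votes: Seq[Boolean]): Boolean =
    votes.count(identity) * 2 > votes.size
}
```

Under these assumptions, 25 candidate pairs with maxPointsPerTask=10 produce three tasks of 10, 10, and 5 pairs. Raising the number of votes per pair trades more crowd work for more reliable labels.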
Cluster Installation
You can also use SampleClean on a Spark cluster using our provided scripts. Note that you must have valid AWS credentials to start your cluster. The scripts configure everything the cluster requires. See sampleclean-async/deploy/README to learn about deploying EC2 clusters for SampleClean. After starting the cluster, you can log in remotely and use SampleClean with Spark Submit or the Spark shell (similar to the local usage mode). Remember to load your datasets into HDFS using ephemeral or persistent storage before running your application.