Stop Thinking, Just Do!

Sungsoo Kim's Blog

MapReduce and Hadoop Algorithms in Academic Papers

23 March 2014


Summary

Motivation

Learn from the academic literature how the MapReduce parallel model and its Hadoop implementation are used to solve algorithmic problems.

Which areas do the papers cover?

Ads & E-commerce

Improving ad relevance in sponsored search
Predicting the Click-Through Rate for Rare/New Ads
Learning Influence Probabilities in Social Networks
Mining advertiser-specific user behavior using adfactors
Extracting user profiles from large scale data
Large-Scale Behavioral Targeting (2009)
Search Advertising using Web Relevance Feedback (2008)
Predicting Ads’ ClickThrough Rate with Decision Rules (2008)
*A stochastic learning-to-rank algorithm and its application to contextual advertising (2011)
*Parallelizing large-scale data processing applications with data skew: a case study in product-offer matching (2011)
*Learning website hierarchies for keyword enrichment in contextual advertising (2011)

Astronomy

*Algorithms for Large-Scale Astronomical Problems (2011)

Social Networks

*Social Content Matching in MapReduce (2011)
*Parallel Knowledge Community Detection Algorithm Research Based on MapReduce (2011)
*Large-Scale Community Detection on YouTube for Topic Discovery and Exploration (2011)

Bioinformatics/Medical Informatics

A novel approach to multiple sequence alignment using hadoop data grids
MapReduce-Based Pattern Finding Algorithm Applied in Motif Detection for Prescription Compatibility Network (2009)
MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees
*HBase, MapReduce, and Integrated Data Visualization for Processing Clinical Signal Data (2011)
*Accelerating statistical image reconstruction algorithms for fan-beam x-ray CT using cloud computing (2011)

Machine Translation

Training Phrase-Based Machine Translation Models on the Cloud: Open Source Machine Translation Toolkit Chaski
Grammar based statistical MT on Hadoop (2009)
Large Language Models in Machine Translation (2008)
*Fast, Easy and Cheap: Construction of Statistical Machine Translation Models with Mapreduce

Spatial Data Processing

Experiences on Processing Spatial Data with MapReduce
*Scalable spatio-temporal knowledge harvesting (2011)

Information Extraction and Text Processing

Statistical Sentence Chunking Using Map Reduce
Data-intensive text processing with MapReduce
Web-Scale Distributional Similarity and Entity Set Expansion (2009)
The infinite HMM for unsupervised PoS tagging (2009)
*Batch Text Similarity Search with MapReduce (2011)
*An Empirical Study of Massively Parallel Bayesian Networks Learning for Sentiment Extraction from Unstructured Text (2011)
*EntityTagger: automatically tagging entities with descriptive phrases (2011)

Artificial Intelligence/Machine Learning/Data Mining

LogMaster: Mining Event Correlations in Logs of Large Scale Cluster Systems
Stateful Bulk Processing for Incremental Analytics
Mining dependency in distributed systems through unstructured logs analysis
Beyond online aggregation: parallel and incremental data mining with online mapreduce
Learning based opportunistic admission control algorithm for mapreduce as a service
OWL reasoning with WebPIE: calculating the closure of 100 billion triples
Scaling ECGA model building via data-intensive computing
SPARQL basic graph pattern processing with iterative mapreduce
Residual Splash for Optimally Parallelizing Belief Propagation
Stochastic gradient boosted distributed decision trees
Distributed Algorithms for Topic Models
When Huge is Routine: Scaling Genetic Algorithms and Estimation of Distribution Algorithms via Data-Intensive Computing
Cloud Computing Boosts Business Intelligence of Telecommunication Industry
Parallel K-Means Clustering Based on MapReduce
Large-scale multimedia semantic concept modeling using robust subspace bagging and MapReduce
Parallel algorithms for mining large-scale rich-media data
Scaling Simple and Compact Genetic Algorithms using MapReduce
Scalable Distributed Reasoning using Mapreduce
Scaling Up Classifiers to Cloud Computers (2008)
*Preliminary Results on Using Matching Algorithms in Map-Reduce Applications (2011)
*Improving the Effectiveness of Statistical Feature Selection Algorithms Using Bag of Synsets and its Parallelization (2011)
*Tri-training and MapReduce-based massive data learning (2011)
*Parallel evolutionary approach of compaction problem using mapreduce (2011)
*COMET: A Recipe for Learning and Using Large Ensembles on Massive Data (2011)
*Parallelized K-Means clustering algorithm for self aware mobile ad-hoc networks (2011)

    For an example of parallel machine learning with Hadoop/MapReduce, check out our previous blog post.

Search Query Analysis

Parallelizing Random Walk with Restart for large-scale query recommendation
BBM: Bayesian Browsing Model from Petabyte-scale Data (2009)
AIDE: Ad-hoc Intents Detection Engine over Query Logs (2009)

Search and Information Retrieval

Automatically Incorporating New Sources in Keyword Search-Based Data Integration
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
Learning URL patterns for webpage de-duplication
Information Seeking with Social Signals: Anatomy of a Social Tag-based EXploratory Search Browser
MIREX: Mapreduce Information Retrieval Experiments
Efficient Clustering of Web Derived Data Sets
The PageRank algorithm and application on searching of academic papers
A Parallel Algorithm for Finding Related Pages in the Web by Using Segmented Link Structures
On Single-Pass Indexing with MapReduce (2009)
A Data Parallel Algorithm for XML DOM Parsing (2009)
Semantic Sitemaps: Efficient and Flexible Access to Datasets on the Semantic Web (2008)
*Scalable knowledge harvesting with high precision and high recall (2011)
*MapReduce indexing strategies: Studying scalability and efficiency (2011)
*Ranking on large-scale graphs with rich metadata (2011)
*Distributed Index for Near Duplicate Detection (2011)
*SPRINT: ranking search results by paths (2011)
*Bagging Gradient-Boosted Trees for High Precision, Low Variance Ranking Models (2011)
*Sparse hidden-dynamics conditional random fields for user intent understanding (2011)

    For more about MapReduce in information retrieval, check out our presentation Mapreduce in Search.

Spam & Malware Detection

Characterizing Botnets from Email Spam Records (2008)

Image and Video Processing

Font rendering on a GPU-based raster image processor
MapReduce Optimization Using Regulated Dynamic Prioritization (2009)

Networking

Reducible Complexity in DNS

Simulation

Map-Reduce Meets Wider Varieties of Applications (2008)

  • Simulation of earthquakes (geology)

Statistics

User-based collaborative filtering recommendation algorithms on hadoop
Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce (2009)
Fast Parallel Outlier Detection for Categorical Datasets using Mapreduce (2009)
MapReduce Optimization Using Regulated Dynamic Prioritization (2009)

Numerical Mathematics

Distributed non-negative matrix factorization for dyadic data analysis on mapreduce
A mapreduce algorithm for SC
Multi-GPU Volume Rendering using MapReduce
Mapreduce for Integer Factorization
*Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent (2011)

Sets & Graphs

Towards scalable RDF graph analytics on MapReduce
Efficient Parallel Set-Similarity Joins using Mapreduce
Max-cover algorithm in map-reduce
Distributed Algorithm for Computing Formal Concepts Using Map-Reduce Framework
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
Graph Twiddling in a MapReduce World
DOULION: Counting Triangles in Massive Graphs with a Coin (2009)
Fast counting of triangles in real-world networks: proofs, algorithms and observations (2008)
*Filtering: A Method for Solving Graph Problems in MapReduce (2011)
*Colorful Triangle Counting and a MapReduce Implementation (2011)
*Mining Large Graphs: Algorithms, Inference, and Discoveries (2011)
*On labeled paths (2011)
*HADI: Mining radii of large graphs (2011)
*Towards Efficient Subgraph Search in Cloud Computing Environment (2011)

Author organizations and companies

Companies: China Mobile, eBay, Google, Hewlett Packard and Intel, Microsoft, Wikipedia, Yahoo and Yandex.
Government Institutions and Universities: US National Security Agency (NSA), Carnegie Mellon University, TU Dresden, University of Pennsylvania, University of Central Florida, National University of Ireland, University of Missouri, University of Arizona, University of Glasgow, Berkeley University and National Tsing Hua University, University of California, Poznan University, Florida International University, Zhejiang University, Texas A&M University, University of California at Irvine, University of Illinois, Chinese Academy of Sciences, Vrije Universiteit, Engenharia University, State University of New York, Palacky University, University of Texas at Dallas

Mapreduce & Hadoop Algorithms in Academic Papers (5th update – Nov 2011)

The change from the prior postings is that this posting only includes _new_ papers (2011):

Artificial Intelligence/Machine Learning/Data Mining

Bioinformatics/Medical Informatics

Image and Video Processing

Statistics and Numerical Mathematics

Search and Information Retrieval

Sets & Graphs

Simulation

Social Networks

