Summary
 Article Source: Mapreduce & Hadoop Algorithms in Academic Papers (4th update – May 2011)
 Author: Amund Tveit (Atbrox cofounder)
Motivation
Learn from academic literature how the mapreduce parallel model and the hadoop implementation are used to solve algorithmic problems.
Which areas do the papers cover?
Ads & Ecommerce
Improving ad relevance in sponsored search
Predicting the Click-Through Rate for Rare/New Ads
Learning Influence Probabilities in Social Networks
Mining advertiser-specific user behavior using adfactors
Extracting user profiles from large scale data
Large-Scale Behavioral Targeting (2009)
Search Advertising using Web Relevance Feedback (2008)
Predicting Ads’ Click-Through Rate with Decision Rules (2008)
*A stochastic learning-to-rank algorithm and its application to contextual advertising (2011)
*Parallelizing large-scale data processing applications with data skew: a case study in product-offer matching (2011)
*Learning website hierarchies for keyword enrichment in contextual advertising (2011)
Astronomy
*Algorithms for Large-Scale Astronomical Problems (2011)
Social Networks
*Social Content Matching in MapReduce (2011)
*Parallel Knowledge Community Detection Algorithm Research Based on MapReduce (2011)
*Large-Scale Community Detection on YouTube for Topic Discovery and Exploration (2011)
Bioinformatics/Medical Informatics
A novel approach to multiple sequence alignment using hadoop data grids
MapReduce-Based Pattern Finding Algorithm Applied in Motif Detection for Prescription Compatibility Network (2009)
MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees
*HBase, MapReduce, and Integrated Data Visualization for Processing Clinical Signal Data (2011)
*Accelerating statistical image reconstruction algorithms for fan-beam x-ray CT using cloud computing (2011)
Machine Translation
Training Phrase-Based Machine Translation Models on the Cloud: Open Source Machine Translation Toolkit Chaski
Grammar based statistical MT on Hadoop (2009)
Large Language Models in Machine Translation (2008)
*Fast, Easy and Cheap: Construction of Statistical Machine Translation Models with Mapreduce
Spatial Data Processing
Experiences on Processing Spatial Data with MapReduce
*Scalable spatio-temporal knowledge harvesting (2011)
Information Extraction and Text Processing
Statistical Sentence Chunking Using Map Reduce
Data-intensive text processing with MapReduce
Web-Scale Distributional Similarity and Entity Set Expansion (2009)
The infinite HMM for unsupervised PoS tagging (2009)
*Batch Text Similarity Search with MapReduce (2011)
*An Empirical Study of Massively Parallel Bayesian Networks Learning for Sentiment Extraction from Unstructured Text (2011)
*EntityTagger: automatically tagging entities with descriptive phrases (2011)
Artificial Intelligence/Machine Learning/Data Mining
LogMaster: Mining Event Correlations in Logs of Large Scale Cluster Systems
Stateful Bulk Processing for Incremental Analytics
Mining dependency in distributed systems through unstructured logs analysis
Beyond online aggregation: parallel and incremental data mining with online mapreduce
Learning based opportunistic admission control algorithm for mapreduce as a service
OWL reasoning with WebPIE: calculating the closure of 100 billion triples
Scaling ECGA model building via data-intensive computing
SPARQL basic graph pattern processing with iterative mapreduce
Residual Splash for Optimally Parallelizing Belief Propagation
Stochastic gradient boosted distributed decision trees
Distributed Algorithms for Topic Models
When Huge is Routine: Scaling Genetic Algorithms and Estimation of Distribution Algorithms via Data-Intensive Computing
Cloud Computing Boosts Business Intelligence of Telecommunication Industry
Parallel K-Means Clustering Based on MapReduce
Large-scale multimedia semantic concept modeling using robust subspace bagging and MapReduce
Parallel algorithms for mining large-scale rich-media data
Scaling Simple and Compact Genetic Algorithms using MapReduce
Scalable Distributed Reasoning using Mapreduce
Scaling Up Classifiers to Cloud Computers (2008)
*Preliminary Results on Using Matching Algorithms in MapReduce Applications (2011)
*Improving the Effectiveness of Statistical Feature Selection Algorithms Using Bag of Synsets and its Parallelization (2011)
*Tri-training and MapReduce-based massive data learning (2011)
*Parallel evolutionary approach of compaction problem using mapreduce (2011)
*COMET: A Recipe for Learning and Using Large Ensembles on Massive Data (2011)
*Parallelized K-Means clustering algorithm for self aware mobile ad-hoc networks (2011)

For an example of Parallel Machine Learning with Hadoop/Mapreduce, check out our previous blog post.
Search Query Analysis
Parallelizing Random Walk with Restart for large-scale query recommendation
BBM: Bayesian Browsing Model from Petabyte-scale Data (2009)
AIDE: Ad-hoc Intents Detection Engine over Query Logs (2009)
Information Retrieval (Search)
Automatically Incorporating New Sources in Keyword Search-Based Data Integration
N-gram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
Learning URL patterns for webpage deduplication
Information Seeking with Social Signals: Anatomy of a Social Tag-based Exploratory Search Browser
MIREX: Mapreduce Information Retrieval Experiments
Efficient Clustering of Web Derived Data Sets
The PageRank algorithm and application on searching of academic papers
A Parallel Algorithm for Finding Related Pages in the Web by Using Segmented Link Structures
On Single-Pass Indexing with MapReduce (2009)
A Data Parallel Algorithm for XML DOM Parsing (2009)
Semantic Sitemaps: Efficient and Flexible Access to Datasets on the Semantic Web (2008)
*Scalable knowledge harvesting with high precision and high recall (2011)
*MapReduce indexing strategies: Studying scalability and efficiency (2011)
*Ranking on large-scale graphs with rich metadata (2011)
*Distributed Index for Near Duplicate Detection (2011)
*SPRINT: ranking search results by paths (2011)
*Bagging Gradient-Boosted Trees for High Precision, Low Variance Ranking Models (2011)
*Sparse hidden-dynamics conditional random fields for user intent understanding (2011)

For more about mapreduce in information retrieval, check out our presentation Mapreduce in Search.
Spam & Malware Detection
Characterizing Botnets from Email Spam Records (2008)
 Clustering of emails into spam campaigns
 Finding the probability that two spam messages were sent from the same machine
 Estimating the likelihood of botnets based on common senders in spam campaigns
The Ghost In The Browser: Analysis of Web-based Malware (2007)
Image and Video Processing
Font rendering on a GPU-based raster image processor
MapReduce Optimization Using Regulated Dynamic Prioritization (2009)
 Video Stream Re-Rendering
MapReduce Meets Wider Varieties of Applications (2008)
 Location detection in images
*Counting triangles and the curse of the last reducer (2011)
*Adapting Skyline Computation to the MapReduce Framework: Algorithms and Experiments (2011)
Networking
Reducible Complexity in DNS
Simulation
MapReduce Meets Wider Varieties of Applications (2008)
 Simulation of earthquakes (geology)
Statistics
User-based collaborative filtering recommendation algorithms on hadoop
Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce (2009)
Fast Parallel Outlier Detection for Categorical Datasets using Mapreduce (2009)
MapReduce Optimization Using Regulated Dynamic Prioritization (2009)
 Digg.com story recommendations
Calculating the Jaccard Similarity Coefficient with Map Reduce for Entity Pairs in Wikipedia (2008)
 Measuring Wikipedia Editor similarity
MapReduce Meets Wider Varieties of Applications (2008)
 Netflix video recommendation
Large-scale Parallel Collaborative Filtering for the Netflix Prize (2008)
Numerical Mathematics
Distributed nonnegative matrix factorization for dyadic data analysis on mapreduce
A mapreduce algorithm for SC
Multi-GPU Volume Rendering using MapReduce
Mapreduce for Integer Factorization
*Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent (2011)
Sets & Graphs
Towards scalable RDF graph analytics on MapReduce
Efficient Parallel Set-Similarity Joins using Mapreduce
Max-cover algorithm in mapreduce
Distributed Algorithm for Computing Formal Concepts Using MapReduce Framework
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
Graph Twiddling in a MapReduce World
DOULION: Counting Triangles in Massive Graphs with a Coin (2009)
Fast counting of triangles in real-world networks: proofs, algorithms and observations (2008)
*Filtering: A Method for Solving Graph Problems in MapReduce (2011)
*Colorful Triangle Counting and a MapReduce Implementation (2011)
*Mining Large Graphs: Algorithms, Inference, and Discoveries (2011)
*On labeled paths (2011)
*HADI: Mining radii of large graphs (2011)
*Towards Efficient Subgraph Search in Cloud Computing Environment (2011)
Author organizations and companies
Companies: China Mobile, eBay, Google, Hewlett Packard and Intel, Microsoft, Wikipedia, Yahoo and Yandex.
Government Institutions and Universities: US National Security Agency (NSA), Carnegie Mellon University, TU Dresden, University of Pennsylvania, University of Central Florida, National University of Ireland, University of Missouri, University of Arizona, University of Glasgow, Berkeley University and National Tsing Hua University, University of California, Poznan University, Florida International University, Zhejiang University, Texas A&M University, University of California at Irvine, University of Illinois, Chinese Academy of Sciences, Vrije Universiteit, Engenharia University, State University of New York, Palacky University, University of Texas at Dallas
Mapreduce & Hadoop Algorithms in Academic Papers (5th update – Nov 2011)
The change from the prior postings is that this posting only includes new papers (2011):
Artificial Intelligence/Machine Learning/Data Mining

NIMBLE: a toolkit for the implementation of parallel data mining and machine learning algorithms on mapreduce
Distributed Evolutionary Algorithm Using the MapReduce Paradigm–A Case Study for Data Compaction Problem
On Using Pattern Matching Algorithms in MapReduce Applications
Using Variational Inference and MapReduce to Scale Topic Modeling
A MapReduce-based distributed SVM algorithm for automatic image annotation
Scalable and Parallel Boosting with MapReduce
Master-Slave Parallel Genetic Algorithm Based on MapReduce Using Cloud Computing
Fast clustering using MapReduce
K-Means Clustering with Bagging and MapReduce
In-situ MapReduce for Log Processing
Clustering Very Large Multidimensional Datasets with MapReduce
Large Scale Fuzzy pD* Reasoning Using MapReduce
MapReduce network enabled algorithms for classification based on association rules
PARABLE: A PArallel RAndom-partition Based HierarchicaL ClustEring Algorithm for the MapReduce Framework
A MapReduce based parallel SVM for large scale spam filtering
Clustering Systems with Kolmogorov Complexity and MapReduce
Bioinformatics/Medical Informatics

Rapid parallel genome indexing with MapReduce
CloudAligner: A fast and full-featured MapReduce based tool for sequence mapping
Nephele: genotyping via complete composition vectors and MapReduce
Genome Analysis with MapReduce
Parallel Metagenomic Sequence Clustering via Sketching and Maximal Quasi-clique Enumeration on Mapreduce Clouds
Hadoop-GIS: A High Performance Query System for Analytical Medical Imaging with MapReduce
Image and Video Processing

Multi-layer graph-based semi-supervised learning for large-scale image datasets using mapreduce
Skyline web service selection with MapReduce
HIPI: A Hadoop Image Processing Interface for Image-based MapReduce Tasks
An Approach for Processing Large and Non-uniform Media Objects on MapReduce-Based Clusters
Building Wavelet Histograms on Large Data in MapReduce
Statistics and Numerical Mathematics

Solving Linear Programs in MapReduce
Gaussian Deconvolution and MapReduce Approach for ChIP-seq Analysis
Design and implementation of parallel statistical algorithm based on Hadoop’s MapReduce model
A MapReduce framework for on-road mobile fossil fuel combustion CO2 emission estimation
Search and Information Retrieval

Fast personalized PageRank on MapReduce
MapReduce for Experimental Search
Full-text indexing for optimizing selection operations in large-scale data analytics
Sets & Graphs

MapReduce in MPI for Large-scale Graph Algorithms
Design Distributed Digraph Algorithms using MapReduce
An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce
Processing theta-joins using MapReduce
Clause-Iteration with MapReduce to Scalably Query Data Graphs in the SHARD Graph-Store
Mining Tera-Scale Graphs with MapReduce: Theory, Engineering and Discoveries
Filtering: a method for solving graph problems in MapReduce
Colorful Triangle Counting and a MapReduce Implementation
A parallel computing model for large-graph mining with MapReduce
Simulation

Molecular Dynamics Simulation Based on Hadoop Mapreduce
TH‐E‐BRC‐04: Monte‐Carlo Simulation in a Cloud Computing Environment with MapReduce
Distributed simulation of P systems by means of mapreduce: first steps with hadoop and P-Lingua
Social Networks
Implementation of a Large-Scalable Social Data Analysis System Based on MapReduce
Spatial Data Processing

SDPPF—A MapReduce based parallel processing framework for spatial data
MRGIR: Open geographical information retrieval using MapReduce
Scalable Local Regression for Spatial Analytics
Research on Parallel DBSCAN Algorithm Design Based on MapReduce

P2LSA and P2LSA+: two paralleled probabilistic latent semantic analysis algorithms based on the mapreduce model
Processing Wikipedia Dumps: A Case-Study comparing the XGrid and MapReduce Approaches
MapReduce for HITS Algorithm with Application to Chinese Word Networks
Implementing MapReduce over language and literature data over the UK National Grid Service
Representing n-gram language models for compact storage and fast retrieval