Article Source

Title: DataSets

Datasets for Deep Learning

Symbolic Music Datasets

Piano-midi.de: classical piano pieces (http://www.piano-midi.de/)
Nottingham: over 1000 folk tunes (http://abc.sourceforge.net/NMD/)
MuseData: electronic library of classical music scores (http://musedata.stanford.edu/)
JSB Chorales: set of four-part harmonized chorales (http://www.jsbchorales.net/index.shtml)

Natural Images

MNIST: handwritten digits (http://yann.lecun.com/exdb/mnist/)
NIST: similar to MNIST, but larger
Perturbed NIST: a dataset developed in Yoshua’s class (NIST with tons of deformations)
CIFAR10 / CIFAR100: 32×32 natural image dataset with 10/100 categories (http://www.cs.utoronto.ca/~kriz/cifar.html)
Caltech 101: pictures of objects belonging to 101 categories (http://www.vision.caltech.edu/Image_Datasets/Caltech101/)
Caltech 256: pictures of objects belonging to 256 categories (http://www.vision.caltech.edu/Image_Datasets/Caltech256/)
Caltech Silhouettes: 28×28 binary images containing silhouettes of objects from the Caltech 101 dataset
STL-10: an image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms; inspired by the CIFAR-10 dataset, with some modifications (http://www.stanford.edu/~acoates//stl10/)
SVHN: The Street View House Numbers Dataset (http://ufldl.stanford.edu/housenumbers/)
NORB: binocular images of toy figurines under various illumination and pose (http://www.cs.nyu.edu/~ylclab/data/norb-v1.0/)
ImageNet: image database organized according to the WordNet hierarchy (http://www.image-net.org/)
Pascal VOC: various object recognition challenges (http://pascallin.ecs.soton.ac.uk/challenges/VOC/)
LabelMe: a large dataset of annotated images (http://labelme.csail.mit.edu/Release3.0/browserTools/php/dataset.php)
COIL 20: different objects imaged at every angle in a 360-degree rotation (http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php)
COIL 100: different objects imaged at every angle in a 360-degree rotation (http://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php)

Artificial Datasets

Arcade Universe: an artificial dataset generator whose images contain arcade game sprites such as Tetris pentomino/tetromino objects. The generator is based on O. Breleux’s bugland dataset generator.
A collection of datasets inspired by the ideas from BabyAISchool:
- BabyAIShapesDatasets: distinguishing between 3 simple shapes
- BabyAIImageAndQuestionDatasets: a question-image-answer dataset
Datasets generated for the purpose of an empirical evaluation of deep architectures (DeepVsShallowComparisonICML2007):
- MnistVariations: introducing controlled variations in MNIST
- RectanglesData: discriminating between wide and tall rectangles
- ConvexNonConvex: discriminating between convex and nonconvex shapes
- BackgroundCorrelation: controlling the degree of correlation in noisy MNIST backgrounds
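
To make the RectanglesData task concrete, here is a minimal sketch of a generator for "wide versus tall" rectangle images. It is not the original generator: the 28×28 image size, the outline-only drawing, the uniform sampling of sizes and positions, and the function name make_rectangle_image are all assumptions chosen for illustration.

```python
import random


def make_rectangle_image(size=28, rng=random):
    """Return (image, label) for one toy example.

    image is a size x size grid of 0/1 values holding a rectangle outline;
    label is 1 when the rectangle is wider than tall, otherwise 0.
    All sampling choices here are illustrative assumptions.
    """
    while True:
        width = rng.randint(2, size)
        height = rng.randint(2, size)
        if width != height:  # skip squares so the label is unambiguous
            break
    left = rng.randint(0, size - width)
    top = rng.randint(0, size - height)

    image = [[0] * size for _ in range(size)]
    for row in range(top, top + height):
        for col in range(left, left + width):
            on_border = (row in (top, top + height - 1)
                         or col in (left, left + width - 1))
            if on_border:
                image[row][col] = 1

    label = 1 if width > height else 0
    return image, label


# Draw a handful of examples with a fixed seed for repeatability.
rng = random.Random(0)
dataset = [make_rectangle_image(rng=rng) for _ in range(5)]
```

Each example is a pair of a nested list and an integer label, so it can be flattened into a 784-dimensional vector in the same way as an MNIST digit.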

Faces

Labelled Faces in the Wild: 13,000 images of faces collected from the web, labelled with the name of the person pictured (http://vis-www.cs.umass.edu/lfw/)
Toronto Face Dataset
Olivetti: a few images of several different people (http://www.cs.nyu.edu/~roweis/data.html)
Multi-Pie: The CMU Multi-PIE Face Database (http://www.multipie.org/)
Face-in-Action (http://www.flintbox.com/public/project/5486/)
JACFEE: Japanese and Caucasian Facial Expressions of Emotion (http://www.humintell.com/jacfee/)
FERET: The Facial Recognition Technology Database (http://www.itl.nist.gov/iad/humanid/feret/feret_master.html)
mmifacedb: MMI Facial Expression Database (http://www.mmifacedb.com/)
IndianFaceDatabase (http://vis-www.cs.umass.edu/~vidit/IndianFaceDatabase/)
The Yale Face Database (http://vision.ucsd.edu/content/yale-face-database) and The Yale Face Database B (http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html)

Text

20 newsgroups: classification task, mapping word occurrences to newsgroup ID (http://qwone.com/~jason/20Newsgroups/)
Reuters (RCV*) Corpora: text/topic prediction (http://about.reuters.com/researchandstandards/corpus/)
Penn Treebank: used for next word prediction or next character prediction (http://www.cis.upenn.edu/~treebank/)
Broadcast News: large text dataset, classically used for next word prediction (http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S44)
Wikipedia Dataset
Multidomain sentiment analysis dataset: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
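
The 20 newsgroups entry above describes mapping word occurrences to a newsgroup ID. As a rough illustration of that input representation, the sketch below turns raw text into word-count vectors using only the standard library. The tokenizer, the helper names (tokenize, build_vocabulary, to_count_vector), and the two toy documents are assumptions made for this example; they are not taken from the corpus or from any particular preprocessing pipeline.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(documents):
    """Assign each distinct token an index in order of first appearance."""
    vocab = {}
    for doc in documents:
        for token in tokenize(doc):
            vocab.setdefault(token, len(vocab))
    return vocab


def to_count_vector(text, vocab):
    """Return a list where position i holds the count of vocabulary word i.

    Tokens that are not in the vocabulary are ignored.
    """
    vector = [0] * len(vocab)
    for token, count in Counter(tokenize(text)).items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector


# Toy documents with hypothetical labels, for illustration only.
train_docs = [
    ("The engine stalls when the car is cold", "rec.autos"),
    ("The probe reached orbit around the moon", "sci.space"),
]
vocab = build_vocabulary(text for text, _ in train_docs)
features = [(to_count_vector(text, vocab), label) for text, label in train_docs]
```

A classifier would then be trained on the (count vector, label) pairs; real pipelines usually add steps such as stop-word removal or vocabulary truncation, which are omitted here.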

Speech

TIMIT Speech Corpus: phoneme classification (http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1)
Aurora: TIMIT with noise and additional information

Recommendation Systems

MovieLens: Two datasets available from http://www.grouplens.org. The first dataset has 100,000 ratings for 1682 movies by 943 users, subdivided into five disjoint subsets. The second dataset has about 1 million ratings for 3900 movies by 6040 users.
Jester: This dataset contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users.
Netflix Prize: Netflix released an anonymised version of their movie rating dataset; it consists of 100 million ratings from 480,000 users, each of whom rated between 1 and all of the 17,770 movies.
Book-Crossing dataset: This dataset is from the Book-Crossing community, and contains 278,858 users providing 1,149,780 ratings about 271,379 books.
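
The counts quoted above make it easy to see how sparse each rating matrix is. The sketch below computes the fraction of user-item pairs that carry a rating, using only the figures given in this list (several of which are rounded there, so the results are approximate). The function name rating_density is an assumption for illustration.

```python
def rating_density(num_ratings, num_users, num_items):
    """Fraction of the user x item matrix that holds an observed rating."""
    return num_ratings / (num_users * num_items)


# (ratings, users, items) as quoted in the list above; some are rounded.
datasets = {
    "MovieLens 100K": (100_000, 943, 1682),
    "MovieLens 1M": (1_000_000, 6040, 3900),
    "Jester": (4_100_000, 73_421, 100),
    "Netflix Prize": (100_000_000, 480_000, 17_770),
    "Book-Crossing": (1_149_780, 278_858, 271_379),
}

for name, (ratings, users, items) in datasets.items():
    density = rating_density(ratings, users, items)
    print(f"{name}: {density:.4%} of entries observed")
```

By these figures Jester is by far the densest, with more than half of all user-joke pairs rated, while Book-Crossing is the sparsest by several orders of magnitude.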

Misc

“Musk” dataset
CMU Motion Capture Database: (http://mocap.cs.cmu.edu/)
Brodatz dataset: texture modeling (http://www.ux.uis.no/~tranden/brodatz.html)
Million Song dataset: http://labrosa.ee.columbia.edu/millionsong/
Merck Molecular Activity Challenge - http://www.kaggle.com/c/MerckActivity/data

21 January 2017
