Title: Open Data for Deep Learning

Recent Additions

Symbolic Music Datasets

Natural-Image Datasets

MNIST: handwritten digits
CIFAR-10 / CIFAR-100: 32×32 natural-image datasets with 10 and 100 categories, respectively
Caltech 101: pictures of objects belonging to 101 categories
Caltech 256: pictures of objects belonging to 256 categories
STL-10: an image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. Similar to CIFAR-10, with some modifications.
The Street View House Numbers (SVHN) Dataset
NORB: binocular images of toy figurines under various illumination and pose
ImageNet: image database organized according to the WordNet hierarchy
Pascal VOC: various object recognition challenges
LabelMe: a large dataset of annotated images
COIL-20: different objects imaged at every angle in a 360-degree rotation
COIL-100: different objects imaged at every angle in a 360-degree rotation

Artificial Datasets

Arcade Universe: an artificial dataset generator whose images contain arcade game sprites such as Tetris pentomino/tetromino objects. The generator is based on O. Breleux’s bugland dataset generator.
A collection of datasets inspired by the ideas from BabyAISchool:
BabyAIShapesDatasets: distinguishing between 3 simple shapes
BabyAIImageAndQuestionDatasets: a question-image-answer dataset
Datasets generated for the purpose of an empirical evaluation of deep architectures (DeepVsShallowComparisonICML2007):
MnistVariations: introducing controlled variations in MNIST
RectanglesData: discriminating between wide and tall rectangles
ConvexNonConvex: discriminating between convex and nonconvex shapes
BackgroundCorrelation: controlling the degree of correlation in noisy MNIST backgrounds

Facial Datasets

Labeled Faces in the Wild: 13,000 images of faces collected from the web, each labeled with the name of the person pictured
Olivetti: a few images of several different people
Multi-Pie: The CMU Multi-PIE Face Database
Face-in-Action
JACFEE: Japanese and Caucasian Facial Expressions of Emotion
FERET: The Facial Recognition Technology Database
mmifacedb: MMI Facial Expression Database
IndianFaceDatabase
The Yale Face Database and The Yale Face Database B

Text Datasets

Speech Datasets

TIMIT Speech Corpus: phoneme classification

Recommendation-System Datasets

MovieLens: two datasets. The first has 100,000 ratings of 1,682 movies by 943 users, subdivided into five disjoint subsets. The second has about 1 million ratings of 3,900 movies by 6,040 users.
Jester: 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users.
Netflix Prize: an anonymized version of Netflix's movie rating dataset. It consists of 100 million ratings from 480,000 users, each of whom rated between 1 and all 17,770 of the movies.
Book-Crossing dataset: collected from the Book-Crossing community. Contains 1,149,780 ratings of 271,379 books from 278,858 users.
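
The rating counts quoted above give a rough sense of how sparse each user-item matrix is. The following is a minimal sketch in Python that computes the density (ratings divided by users times items) from the figures in this list only. The class name, the function name, and the labels "MovieLens 100K" and "MovieLens 1M" are illustrative choices rather than names taken from the datasets, and the MovieLens 1M and Jester rating counts are the rounded figures quoted above, so the results are approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatingDataset:
    """Summary counts for a rating dataset (hypothetical helper type)."""
    name: str
    ratings: int
    users: int
    items: int

    @property
    def density(self) -> float:
        # Fraction of all possible (user, item) pairs that carry a rating.
        return self.ratings / (self.users * self.items)


# Counts copied from the descriptions in this list; rounded where the text rounds.
DATASETS = [
    RatingDataset("MovieLens 100K", 100_000, 943, 1_682),
    RatingDataset("MovieLens 1M", 1_000_000, 6_040, 3_900),
    RatingDataset("Jester", 4_100_000, 73_421, 100),
    RatingDataset("Netflix Prize", 100_000_000, 480_000, 17_770),
    RatingDataset("Book-Crossing", 1_149_780, 278_858, 271_379),
]


def density_report(datasets):
    """Return one formatted line per dataset, e.g. 'Name: 1.2345%'."""
    return [f"{d.name}: {d.density:.4%}" for d in datasets]


if __name__ == "__main__":
    for line in density_report(DATASETS):
        print(line)
```

By this measure Jester is comparatively dense, since more than half of all possible user-joke pairs are rated, while Book-Crossing is by far the sparsest of the five.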

Miscellaneous Datasets

Thanks to deeplearning.net for many of these links and dataset descriptions. Suggestions of other open datasets to include for the Deeplearning4j community are welcome!

19 January 2017
