20+ Machine Learning Datasets & Project Ideas

By Shivashish Thakur

To build a good model, you need a large amount of data. But finding the right dataset for your machine learning or data science project can be challenging. Many organizations, researchers, and individuals have shared their work, and we can use their datasets to build our projects.

So in this article, we discuss 20+ machine learning and data science datasets and project ideas that you can use to practice and upgrade your skills.


1. Enron Email Dataset

The Enron dataset is popular in natural language processing. It has more than 500K emails from over 150 users, most of whom were senior management at Enron. The size of the data is around 432 MB.

Data Link: Enron email dataset

Project Idea: Using k-means clustering, you can group the emails by similarity and treat unusual clusters as a starting point for detecting fraudulent activity. K-means clustering is an unsupervised machine learning algorithm. It separates the observations into k clusters based on similar patterns in the data.
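
To make the clustering step concrete, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm). The two features per user are hypothetical and are not extracted from the Enron data; a real project would first turn the emails into numeric features and would likely use a library implementation instead.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Group points into k clusters with plain k-means (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the observations
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return centroids, clusters

# Hypothetical features per user: (emails sent per day, share sent after hours)
points = [(1.0, 0.1), (1.2, 0.2), (0.8, 0.15), (9.0, 0.9), (9.5, 0.8), (8.7, 0.95)]
centroids, clusters = kmeans(points, k=2)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, len(cluster))
```

K-means only groups similar observations; deciding whether a small or distant cluster actually indicates fraud still requires manual inspection.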


2. Chatbot Intents Dataset

The dataset for a chatbot is a JSON file that has different tags like goodbye, greetings, pharmacy_search, hospital_search, etc. Every tag has a list of patterns that a user might type, and the chatbot responds according to the matching pattern. The dataset is perfect for understanding how chatbot data works.

Data Link: Intents JSON Dataset

Project Idea: You can build a chatbot, or understand how a chatbot works, by tweaking and expanding the data with your own observations. To build a chatbot of your own, you need a good knowledge of natural language processing concepts.
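
As a rough illustration of how tags and patterns fit together, the sketch below matches a user message against a tiny intents structure by counting shared words. The tag names come from the description above; the "patterns" and "responses" keys, the example sentences, and the word-overlap matching are assumptions made for illustration, not the layout or method of the linked dataset.

```python
import json

# Hypothetical miniature intents file; key names and sentences are assumed.
INTENTS_JSON = """
{"intents": [
  {"tag": "greetings", "patterns": ["hi", "hello there"], "responses": ["Hello!"]},
  {"tag": "goodbye", "patterns": ["bye", "see you later"], "responses": ["Goodbye!"]},
  {"tag": "pharmacy_search", "patterns": ["find a pharmacy near me"],
   "responses": ["Searching for pharmacies..."]}
]}
"""

def predict_tag(message, intents):
    """Return the tag whose patterns share the most words with the message."""
    words = set(message.lower().split())
    best_tag, best_overlap = None, 0
    for intent in intents:
        for pattern in intent["patterns"]:
            overlap = len(words & set(pattern.lower().split()))
            if overlap > best_overlap:
                best_tag, best_overlap = intent["tag"], overlap
    return best_tag  # None when no pattern shares a word with the message

intents = json.loads(INTENTS_JSON)["intents"]
print(predict_tag("hello", intents))
print(predict_tag("where is a pharmacy", intents))
```

A real chatbot would usually replace the word-overlap score with a trained classifier, but the data flow (message in, tag out, response chosen from that tag) stays the same.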

Source Code: Chatbot Project in Python


3. Flickr 30k Dataset

The Flickr 30k dataset has over 30,000 images, and each image is labeled with several captions. This dataset is used to build image caption generators. It is an upgraded version of Flickr 8k and can be used to build more accurate models.

Data Link: Flickr image dataset

Project Idea: You can build a CNN model, which is well suited to analysing images and extracting their features, and use it to generate an English sentence that describes the image. This sentence is called a caption.


4. Parkinson Dataset

Parkinson’s disease is a nervous system disorder that affects movement. The Parkinson dataset contains biomedical measurements: 195 records of people with 23 different attributes. This data is used to differentiate healthy people from people with Parkinson’s disease.

Data Link: Parkinson dataset

Project Idea: You can build a model that differentiates healthy people from people with Parkinson’s disease. A useful algorithm for this purpose is XGBoost, which stands for extreme gradient boosting and is based on decision trees.

Source Code: ML Project on Detecting Parkinson’s Disease


5. Iris Dataset

The Iris dataset is a beginner-friendly dataset that has information about flower petal and sepal sizes. It has 3 classes with 50 instances in each class, so it contains only 150 rows with 4 columns.

Data Link: Iris dataset

Project Idea: Classification is the task of separating items into their corresponding class. You can implement a machine learning classification or regression model on the dataset.
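
One simple way to try classification on this kind of data is k-nearest neighbours, sketched below in pure Python. The algorithm choice is an assumption for illustration (the article does not prescribe one), and the six training rows are hand-typed illustrative measurements in the order sepal length, sepal width, petal length, petal width, not a substitute for loading the full 150-row dataset.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k closest training rows."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Illustrative rows: (sepal length, sepal width, petal length, petal width), class
train = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]

print(knn_predict(train, (5.0, 3.4, 1.5, 0.2)))
print(knn_predict(train, (6.5, 3.0, 5.8, 2.2)))
```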


6. ImageNet dataset

ImageNet is a large image database that is organized according to the WordNet hierarchy. It has over 100,000 phrases and an average of 1000 images per phrase. The size exceeds 150 GB. It is suitable for image recognition, face recognition, object detection, etc. It also hosts a challenging competition named ILSVRC in which people build increasingly accurate models.

Data Link: Imagenet Dataset

Project Idea: Implement image classification on this huge database and recognize objects. Convolutional neural networks (CNNs) are well suited to this project and help you get accurate results.


7. Mall Customers Dataset

The Mall customers dataset holds details about people visiting a mall. The dataset includes age, customer ID, gender, annual income, and spending score. You can gain insights from the data and divide the customers into different groups based on their behavior.

Dataset Link: mall customers dataset

Project Idea: Segment the customers based on their gender, age, and interests. This is useful in customized marketing. Customer segmentation is the practice of dividing customers into groups of similar individuals.

Source Code: Customer segmentation with Machine learning.


8. Google Trends Data Portal

Google trends data can be used to examine and analyze the data visually. You can also download the dataset into CSV files with a simple click. We can find out what’s trending and what people are searching for.

Data Link: Google trends datasets


9. The Boston Housing Dataset

This is a popular dataset used in pattern recognition. It contains information about the different houses in Boston based on crime rate, tax, number of rooms, etc. It has 506 rows and 14 different variables in columns. You can use this dataset to predict house prices.

Data Link: Boston dataset

Project Idea: Predict the price of a new house using linear regression. Linear regression is used to predict the output for unseen inputs when the data has a roughly linear relationship between input and output variables.
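
For a single input variable, the least-squares line can be computed by hand. The sketch below fits y = slope * x + intercept; the room counts and prices are made-up toy values, not rows from the Boston data, and a real project would use several of the 14 variables at once.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up toy values (not from the dataset): number of rooms vs. price
rooms = [4, 5, 6, 7, 8]
prices = [15.0, 20.0, 25.0, 30.0, 35.0]

slope, intercept = fit_line(rooms, prices)
predicted = slope * 6.5 + intercept  # price estimate for a 6.5-room house
print(slope, intercept, predicted)
```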


10. Uber Pickups Dataset

The dataset has information about 4.5 million Uber pickups in New York City from April 2014 to September 2014 and 14 million more from January 2015 to June 2015. Users can perform data analysis and gather insights from the data.

Data Link: Uber pickups dataset

Project Idea: Analyze the customer ride data and visualize it to find insights that can help improve the business. Data analysis and visualization are an important part of data science. Analysis gathers insights from the data, and visualization lets you take in that information quickly.


11. Recommender Systems Dataset

This is a portal to a collection of rich datasets that were used in lab research projects at UCSD. It contains various datasets from popular websites, such as Goodreads book reviews, Amazon product reviews, bartending data, and data from social media, that are used in building recommender systems.

Data Link: Recommender systems dataset

Project Idea: Build a product recommendation system like Amazon’s. A recommendation system can suggest products, movies, etc. based on your interests and the things you have liked and used earlier.

Source Code: Movie Recommendation System Project


12. UCI Spambase Dataset

Classifying emails as spam or non-spam is a very common and useful task. The dataset contains 4601 emails and 57 attributes describing each email. You can build models to filter out the spam.

Data Link: UCI spambase dataset

Project Idea: You can build a model that can identify your emails as spam or non-spam.


13. GTSRB (German traffic sign recognition benchmark) Dataset

The GTSRB dataset contains around 50,000 images of traffic signs belonging to 43 different classes and contains information on the bounding box of each sign. The dataset is used for multiclass classification.

Data Link: GTSRB dataset

Artificial Intelligence Project Idea: Build a model using a deep learning framework that classifies traffic signs and also recognizes the bounding box of signs. The traffic sign classification is also useful in autonomous vehicles for identifying signs and then taking appropriate actions.

Source Code: Traffic Signs Recognition Python Project


14. Cityscapes Dataset

This is an open-source dataset for computer vision projects. It contains high-quality pixel-level annotations of video sequences recorded in the streets of 50 different cities. The dataset is useful for semantic segmentation and for training deep neural networks to understand urban scenes.

Data Link: Cityscapes dataset

Project Idea: Perform image segmentation and detect different objects in road video. Image segmentation is the process of digitally partitioning an image into categories like cars, buses, people, trees, roads, etc.


15. Kinetics Dataset

There are three different datasets for Kinetics: Kinetics 400, Kinetics 600, and Kinetics 700. This is a large-scale dataset that contains URL links to around 650,000 high-quality video clips.

Data Link: Kinetics dataset

Project Idea: Build a human action recognition model that detects what action a person is performing. An action is recognized from a series of observations, such as consecutive video frames.


16. IMDB-Wiki dataset

The IMDB-Wiki dataset is one of the largest open-source datasets of face images labeled with gender and age. The images are collected from IMDB and Wikipedia. It has more than 500,000 labeled images.

Data Link: IMDB wiki dataset

Project Idea: Make a model that will detect faces and predict their gender and age. You can group ages into ranges such as 0-10, 10-20, 20-30, 30-40, etc.


17. Color Detection Dataset

The dataset contains a CSV file that has 865 color names with their corresponding RGB (red, green, and blue) values of the color. It also has the hexadecimal value of the color.

Data Link: Color Detection Dataset

Project Idea: The color dataset can be used to make a color detection app with an interface for picking a color from an image; the app then displays the name of that color.
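
The core of such an app is a nearest-color lookup. The sketch below picks the named color with the smallest squared distance in RGB space. The three-row CSV and its column names are hypothetical stand-ins for the real file, and squared RGB distance is only one possible similarity measure.

```python
import csv
import io

# Hypothetical three-row stand-in for the colors CSV; column names are assumed.
COLORS_CSV = """name,hex,r,g,b
Red,#ff0000,255,0,0
Green,#00ff00,0,255,0
Blue,#0000ff,0,0,255
"""

def load_colors(text):
    """Parse CSV text into a list of (name, (r, g, b)) pairs."""
    rows = csv.DictReader(io.StringIO(text))
    return [(row["name"], (int(row["r"]), int(row["g"]), int(row["b"]))) for row in rows]

def closest_color_name(rgb, colors):
    """Return the name whose RGB value is closest to the given pixel."""
    best = min(colors, key=lambda c: sum((a - b) ** 2 for a, b in zip(rgb, c[1])))
    return best[0]

colors = load_colors(COLORS_CSV)
print(closest_color_name((250, 10, 20), colors))
print(closest_color_name((10, 20, 200), colors))
```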

Source Code: Color Detection Python Project


18. Urban Sound 8K dataset

The urban sound dataset contains 8732 urban sounds from 10 classes like an air conditioner, dog bark, drilling, siren, street music, etc. The dataset is popular for urban sound classification problems.

Data Link: Urban Sound 8K dataset

Project Idea: We can build a sound classification system to detect the type of urban sound playing in the background. This will help you get started with audio data and understand how to work with unstructured data.


19. Librispeech Dataset

This dataset contains a large amount of English speech derived from the LibriVox project. It has 1000 hours of read English speech in various accents. It is used for speech recognition projects.

Data Link: Librispeech dataset

Project Idea: Build a speech recognition model to detect what is being said and convert it into text. The objective of speech recognition is to automatically identify what is being said in the audio.


20. Breast Histopathology Images Dataset

This dataset contains 277,524 images of size 50×50 extracted from 162 whole mount slide images of breast cancer specimens scanned at 40x. There are 198,738 IDC-negative images and 78,786 IDC-positive images.

Data Link: Breast histopathology dataset

Project Idea: Build a model that can classify breast cancer. You can build the image classification model with convolutional neural networks.

Source Code: Breast Cancer Classification Python Project


21. Youtube 8M Dataset

The YouTube 8M dataset is a large-scale labeled video dataset that has 6.1 million YouTube video IDs, 350,000 hours of video, 2.6 billion audio/visual features, 3862 classes, and an average of 3 labels per video. It is used for video classification.

Data Link: Youtube 8M

Project Idea: Use the dataset to build a video classification model that describes what a video is about. The model takes a series of inputs from the video, such as frame-level features, and predicts the category the video belongs to.



In this article, we saw more than 20 machine learning datasets that you can use to practice machine learning or data science. Creating a dataset on your own is expensive, so we can use other people’s datasets to get our work done. But we should read each dataset’s documentation carefully, because some datasets are free to use, while others require you to credit the owner as they specify.


Bio: Shivashish Thakur is an analyst and technical content writer. He is a technology enthusiast who loves to write about the latest cutting-edge technologies that are transforming the world. He is also a sports fan who loves to play and watch football.


Source: 20+ Machine Learning Datasets & Project Ideas


The Best 25 Datasets for Natural Language Processing

Natural language processing is a massive field of research. With so many areas to explore, it can sometimes be difficult to know where to begin – let alone start searching for data.

With this in mind, we’ve combed the web to create the ultimate collection of free online datasets for NLP. Although it’s impossible to cover every field of interest, we’ve done our best to compile datasets for a broad range of NLP research areas, from sentiment analysis to audio and voice recognition projects. Use it as a starting point for your experiments, or check out our specialized collections of datasets if you already have a project in mind.


Datasets for Sentiment Analysis

Where can I download datasets for sentiment analysis?

Machine learning models for sentiment analysis need to be trained with large, specialized datasets. The following list should hint at some of the ways that you can improve your sentiment analysis algorithm.

Multidomain Sentiment Analysis Dataset: This is a slightly older dataset that features a variety of product reviews taken from Amazon.

IMDB Reviews: Featuring 25,000 movie reviews for training and another 25,000 for testing, this relatively small dataset was compiled primarily for binary sentiment classification use cases.

Stanford Sentiment Treebank: Also built from movie reviews, Stanford’s dataset was designed to train a model to identify sentiment in longer phrases. It contains over 10,000 snippets taken from Rotten Tomatoes.

Sentiment140: This popular dataset contains 1,600,000 tweets formatted with 6 fields: polarity, ID, tweet date, query, user, and the text. Emoticons have been pre-removed.

Twitter US Airline Sentiment: Scraped in February 2015, these tweets about US airlines are classified as positive, negative, or neutral. Negative tweets have also been categorized by reason for complaint.
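
Many of these datasets ship as delimited text, so a first step is usually mapping columns to named fields. The sketch below does this for the six Sentiment140 fields listed above. The two sample rows are made up, and the quoting style and the 0/4 polarity codes are assumptions that should be checked against the actual file.

```python
import csv
import io
from collections import Counter

# Field order as listed in the Sentiment140 entry above
FIELDS = ["polarity", "id", "date", "query", "user", "text"]

# Made-up sample rows, not real tweets; quoting and polarity codes are assumed.
SAMPLE = '''"0","1","Mon Apr 06 2009","NO_QUERY","user_a","missed my flight again"
"4","2","Mon Apr 06 2009","NO_QUERY","user_b","what a lovely morning"
'''

def read_tweets(text):
    """Parse header-less CSV text into one dict per tweet."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS)
    return list(reader)

tweets = read_tweets(SAMPLE)
polarity_counts = Counter(tweet["polarity"] for tweet in tweets)
print(polarity_counts)
print(tweets[1]["text"])
```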


Datasets for Text

Where can I download text datasets for natural language processing?

Natural language processing is a massive field of research, but the following list includes a broad range of datasets for different natural language processing tasks, such as voice recognition and chatbots.

20 Newsgroups: This collection of approximately 20,000 documents covers 20 different newsgroups, from baseball to religion.

Reuters News Dataset: The documents in this dataset appeared on Reuters in 1987. They have since been assembled and indexed for use in machine learning.

The WikiQA Corpus: This corpus is a publicly-available collection of question and answer pairs. It was originally assembled for use in research on open-domain question answering.

UCI’s Spambase: Originally created by a team at Hewlett-Packard, this large spam email dataset is useful for developing personalized spam filters.

Yelp Reviews: This open dataset released by Yelp contains more than 5 million reviews.

WordNet: Compiled by researchers at Princeton University, WordNet is essentially a large lexical database of English ‘synsets’, or groups of synonyms that each describe a different, distinct concept.


Audio Speech Datasets for Natural Language Processing

Where can I download audio datasets for natural language processing? 

Audio speech datasets are useful for training natural language processing applications such as virtual assistants, in-car navigation, and any other sound-activated systems.

2000 HUB5 English: This dataset contains transcripts derived from 40 telephone conversations in English. The corresponding speech files are also available through this page.

LibriSpeech: This corpus contains roughly 1,000 hours of English speech, comprised of audiobooks read by multiple speakers. The data is organized by chapters of each book.

Spoken Wikipedia Corpora: Containing hundreds of hours of audio, this corpus is composed of spoken articles from Wikipedia in English, German, and Dutch. Due to the nature of the project, it also contains a diverse set of readers and topics.

Free Spoken Digit Dataset: This is a collection of 1,500 recordings of spoken digits in English.

TIMIT: This data is designed for research in acoustic-phonetic studies and the development of automatic speech recognition systems. It contains recordings of 630 speakers of American English reading ten ‘phonetically rich’ sentences.


Datasets for Natural Language Processing (General)

Where can I download open datasets for natural language processing? 

Still can’t find what you need? Here are a few more datasets for natural language processing tasks.

Enron Dataset: Containing roughly 500,000 messages from the senior management of Enron, this dataset was made as a resource for those looking to improve or understand current email tools.

Amazon Reviews: This dataset contains around 35 million reviews from Amazon spanning a period of 18 years. It includes product and user information, ratings, and the plaintext review.

Google Books Ngrams: A Google Books corpus of n-grams, or ‘fixed size tuples of items’, can be found at this link. The ‘n’ in ‘n-grams’ specifies the number of words or characters in that specific tuple.

Blogger Corpus: This collection of 681,288 blog posts contains over 140 million words. Each blog included here contains at least 200 occurrences of common English words.

Wikipedia Links Data: Containing approximately 13 million documents, this dataset by Google consists of web pages that contain at least one hyperlink pointing to English Wikipedia. Each Wikipedia page is treated as an entity, while the anchor text of the link represents a mention of that entity.

Gutenberg eBooks List: This annotated list of ebooks from Project Gutenberg contains basic information about each eBook, organized by year.

Hansards Text Chunks of Canadian Parliament: This corpus contains 1.3 million pairs of aligned text chunks from the records of the 36th Canadian Parliament.

Jeopardy: The archive linked here contains more than 200,000 questions and answers from the quiz show Jeopardy. Each data point also contains a range of other information, including the category of the question, show number, and air date.

SMS Spam Collection in English: This dataset consists of 5,574 English SMS messages that have been tagged as either legitimate or spam. 425 of the texts are spam messages that were manually extracted from the Grumbletext website.
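
The n-gram definition given in the Google Books Ngrams entry above (fixed-size tuples of words or characters) can be sketched in a few lines. The function name and the sample sentence are illustrative only.

```python
def ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = "the quick brown fox".split()
print(ngrams(words, 2))   # word bigrams
print(ngrams("data", 3))  # character trigrams
```

The same function covers both cases because a list of words and a string of characters are both sequences; a sequence shorter than n simply yields an empty list.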


Still can’t find what you need? Lionbridge AI creates and annotates customized datasets for a wide variety of NLP projects, including everything from chatbot variations to entity annotation. With over 20 years of experience managing a crowd of more than 500,000 linguistic specialists, Lionbridge AI is well placed to provide your model with a solid foundation.

Source: The Best 25 Datasets for Natural Language Processing | Lionbridge AI


MoVi: A Large Multipurpose Motion and Video Dataset: Model and Code

Human movements are both an area of intense study and the basis of many applications such as character animation. For many applications, it is crucial to identify movements from videos or analyze datasets of movements. Here we introduce a new human Motion and Video dataset MoVi, which we make available publicly. It contains 60 female and 30 male actors performing a collection of 20 predefined everyday actions and sports movements, and one self-chosen movement. In five capture rounds, the same actors and movements were recorded using different hardware systems, including an optical motion capture system, video cameras, and inertial measurement units (IMU). For some of the capture rounds, the actors were recorded wearing natural clothing; for the other rounds, they wore minimal clothing. In total, our dataset contains 9 hours of motion capture data, 17 hours of video data from 4 different points of view (including one hand-held camera), and 6.6 hours of IMU data. In this paper, we describe how the dataset was collected and post-processed, and we present state-of-the-art estimates of skeletal motions and full-body shape deformations associated with skeletal motion. We discuss examples of potential studies this dataset could enable.

Source: MoVi: A Large Multipurpose Motion and Video Dataset: Model and Code