The only way to learn data science, data analysis, machine learning, or artificial intelligence is by practicing and doing projects. There is no alternative to that. But most of the time when I did a project for my portfolio or practiced a new concept, I had to spend a good amount of time finding a suitable dataset. I decided to write this article to share some of the datasets I found very useful and interesting. That way you will at least have some datasets on hand to practice with.
Census Dataset
If you want a taste of exploring a very big dataset, work with this one.
It is great for practicing Exploratory Data Analysis, Statistical Analysis and Modeling, and Data Visualization.
Download this dataset from here.
Airbnb Dataset
I received this dataset as a part of an interview a while ago.
I was asked to do an Exploratory Data Analysis and develop a Machine Learning Model using this dataset.
This dataset has a lot of text data and numerical data, so you can use it to practice many different types of projects.
You will see several datasets at this link, but I was asked to download the listings.csv file for my interview.
Cars Dataset
This is a reasonably sized dataset that can be used to practice Regression Models and Exploratory Data Analysis.
This dataset contains these columns: YEAR, Make, Model, Size, (kW), Unnamed: 5, TYPE, CITY (kWh/100 km), HWY (kWh/100 km), COMB (kWh/100 km), CITY (Le/100 km), HWY (Le/100 km), COMB (Le/100 km), (g/km), RATING, (km), TIME (h).
Here is the link for this dataset
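Headers such as "(kW)" and "Unnamed: 5" are awkward to type in code, so a common first step is to normalize them. Here is a minimal sketch using only the standard library. The helper name clean_column_name and the snake_case convention are my own choices for illustration, not something that ships with the dataset.

```python
import re

# Column names exactly as listed in the article.
RAW_COLUMNS = [
    "YEAR", "Make", "Model", "Size", "(kW)", "Unnamed: 5", "TYPE",
    "CITY (kWh/100 km)", "HWY (kWh/100 km)", "COMB (kWh/100 km)",
    "CITY (Le/100 km)", "HWY (Le/100 km)", "COMB (Le/100 km)",
    "(g/km)", "RATING", "(km)", "TIME (h)",
]


def clean_column_name(name: str) -> str:
    """Lowercase a header and collapse each run of non-alphanumerics into '_'.

    Hypothetical helper for illustration; leading and trailing underscores
    are stripped so "(kW)" becomes "kw".
    """
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")


cleaned = [clean_column_name(column) for column in RAW_COLUMNS]

# Show the old and new name side by side.
for old, new in zip(RAW_COLUMNS, cleaned):
    print(f"{old!r} -> {new!r}")
```

If you load the file with Pandas, you could apply the same function to the column index. A name like "unnamed_5" still tells you nothing, so check the values in that column before deciding what to call it.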
Heart Disease Dataset
I found this dataset on Kaggle. Since then I have used it in many different articles to demonstrate a concept.
These are two examples:
A complete Guide to Confidence Interval and Examples in Python
Logistic Regression From Scratch Using a Real Dataset
You will also find some Exploratory Data Analysis examples and details about the dataset there. Check out this dataset. I am sure you will use it a lot.
Download this dataset from this link.
NHANES Dataset
An amazing dataset for learners. The column names of this dataset may be hard to understand at first.
But once you get used to them, you can use this one dataset to practice Data Analysis, Visualization, Statistical Modeling, and Machine Learning models (both classification and regression).
People Wiki Dataset
It contains Wikipedia profiles of some famous people.
The dataset contains three columns: URI, name (name of the person), and text (it includes the Wikipedia profile).
A simple but very useful dataset for Natural Language Processing.
Please check out this article to see an example of what you can do with this dataset:
Natural Language Processing in Python With a Project
Here is the link to this dataset
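One exercise that fits a name-plus-text layout is finding the profile most similar to a given one. The sketch below builds TF-IDF vectors and compares them with cosine similarity, using only the standard library. The three rows are made up and only mimic the URI, name, and text columns. This is one possible approach, not necessarily what the linked article does.

```python
import math
import re
from collections import Counter

# Made-up rows in the URI / name / text layout; not real data.
people = [
    {"URI": "person-a", "name": "Person A",
     "text": "a singer and songwriter who recorded pop albums"},
    {"URI": "person-b", "name": "Person B",
     "text": "a pop singer known for several hit albums"},
    {"URI": "person-c", "name": "Person C",
     "text": "a physicist who studied quantum mechanics"},
]


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_vectors(docs):
    """Return one {word: tf * idf} dict per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Number of documents each word appears in.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = tf_idf_vectors([person["text"] for person in people])


def most_similar(index):
    """Name of the profile closest to people[index], excluding itself."""
    others = [j for j in range(len(people)) if j != index]
    best = max(others, key=lambda j: cosine(vectors[index], vectors[j]))
    return people[best]["name"]


print(most_similar(0))
```

With the real dataset you would read the three columns from the file instead of the hard-coded list, and a library vectorizer would be much faster on thousands of profiles.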
Amazon Product Review Dataset
This dataset contains millions of reviews of products sold on Amazon.
It has three columns: name of the product, review, and rating. The data is close to what you would work with in the real world, which makes it very good for Natural Language Processing.
I have a sentiment analysis project and an article where I used this dataset. Please check it out here:
Sentiment Analysis in Python with Amazon Product Review Data
Download this dataset from this link.
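For sentiment analysis you need a label, and the rating column can supply one. Here is a minimal sketch, assuming ratings run from 1 to 5 stars and that 3 stars is treated as neutral and dropped. Both are assumptions on my part, and the rows and dictionary keys below are made up.

```python
from collections import Counter

# Made-up rows mimicking the product name / review / rating columns.
reviews = [
    {"name": "Product A", "review": "Works great, would buy again", "rating": 5},
    {"name": "Product A", "review": "Broke after two days", "rating": 1},
    {"name": "Product B", "review": "It is okay", "rating": 3},
    {"name": "Product B", "review": "Pretty good value", "rating": 4},
    {"name": "Product C", "review": "Not what I expected", "rating": 2},
]


def rating_to_sentiment(rating):
    """Map a star rating to a label; 3 stars returns None (assumed neutral)."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return None


# Keep only the reviews that received a label.
labeled = []
for row in reviews:
    label = rating_to_sentiment(row["rating"])
    if label is not None:
        labeled.append((row["review"], label))

label_counts = Counter(label for _, label in labeled)
print(label_counts)
```

Checking label_counts early is worth the effort, because review data is often heavily skewed toward positive ratings and that affects how you evaluate a classifier.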
Movie Dataset
This is another dataset that is good for Machine Learning and Natural Language Processing.
This one contains the following columns: index, budget, genres, homepage, id, keywords, original_language, original_title, overview, popularity, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title, vote_average, vote_count, cast, crew, director.
I used this dataset for this project:
Build A Recommendation System Using Simple Codes in Python
Housing Price Dataset
This is one of the most common datasets for developing Regression Models, though you can certainly use it for other purposes as well.
It is mostly used to predict housing prices based on the information in the other columns.
This dataset contains these columns: id, date, price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, sqft_above, sqft_basement, yr_built, yr_renovated, zip code, lat, long, sqft_living15, sqft_lot15.
Here is the link.
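To make the regression idea concrete, here is a minimal sketch that fits price against sqft_living with ordinary least squares on a single feature. The numbers are made up and lie exactly on a line so the result is easy to check by hand. fit_line is a hypothetical helper, not part of any library.

```python
# Made-up (sqft_living, price) pairs that follow price = 200 * sqft + 50000.
sqft_living = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0]
price = [250000.0, 350000.0, 450000.0, 550000.0, 650000.0]


def fit_line(xs, ys):
    """Least-squares slope and intercept for one feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(sqft_living, price)

# Predict the price of a hypothetical 1800 sqft house.
predicted = slope * 1800.0 + intercept
print(slope, intercept, predicted)
```

Real prices will not fall on a straight line, so on the actual data you would add more of the listed columns as features and measure the error on rows held out from training.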
Mushrooms Dataset
I found this dataset in the Applied Data Science With Python Specialization on Coursera.
I used it for Classification problems, but it can be used for other purposes as well.
It contains these columns: class, cap-shape, cap-surface, cap-color, bruises, odor, gill-attachment, gill-spacing, gill-size, gill-color, stalk-shape, stalk-root, stalk-surface-above-ring, stalk-surface-below-ring, stalk-color-above-ring, stalk-color-below-ring, veil-type, veil-color, ring-number, ring-type, spore-print-color, population, habitat.
Here is the link to this dataset
Olympic Dataset
This dataset has information on the Olympic results. Each row contains the data of a country.
This dataset will give you a taste of data cleaning to start with.
I learned Python libraries like NumPy and Pandas using this dataset.
Download this dataset from here
Titanic Dataset
Another very popular dataset. I have used it a lot myself, and I have seen many experienced people use it to present a concept.
This dataset contains these columns: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked.
This dataset is good for Exploratory Data Analysis, Machine Learning Models (especially Classification Models), Statistical Analysis, and Data Visualization practice.
Here is the link to this dataset
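A typical first step in Exploratory Data Analysis here is comparing survival rates between groups. Below is a minimal sketch of that group-and-average step using only the standard library. The passengers are made up and carry just the Sex and Survived columns, so the rates it prints say nothing about the real data.

```python
from collections import defaultdict

# Made-up passengers using two of the listed column names.
passengers = [
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 0},
    {"Sex": "male", "Survived": 0},
    {"Sex": "male", "Survived": 1},
    {"Sex": "male", "Survived": 0},
    {"Sex": "male", "Survived": 0},
]


def survival_rate_by(rows, key):
    """Share of survivors within each distinct value of `key`."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        survivors[row[key]] += row["Survived"]
    return {group: survivors[group] / totals[group] for group in totals}


rates = survival_rate_by(passengers, "Sex")
for group, rate in rates.items():
    print(f"{group}: {rate:.2f}")
```

The same function works for any categorical column, so once the real file is loaded you can pass "Embarked" or the passenger class column as the key.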
Iris Dataset
Another widely used dataset in data science courses.
This one is especially good for learning Classification Models.
It contains these columns: SepalLength, SepalWidth, PetalLength, PetalWidth, Name.
Fraud Dataset
I found this dataset in the Applied Data Science With Python Specialization on Coursera.
We used it for Classification Models.
A credit card fraud detection project looks good in a portfolio.
Download this dataset here
Canada Immigration Dataset
This dataset provides information about how many immigrants came from which country by year.
A great dataset for practicing Exploratory Data Analysis and Data Visualization.
I used this dataset in this article:
Start Using Matplotlib Today With This Tutorial
Facebook Stock Data
It provides Facebook stock performance per day.
The columns in this dataset are Date, Open, High, Low, Close, Adj Close, Volume.
This one can be very useful for Time Series Analysis, Visualization, and other time-series problems.
I used this dataset in this article:
Pandas date_range Function Details
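Two quantities people often compute first from daily stock data are the day-over-day return and a moving average of the closing price. Here is a minimal sketch using only the standard library. The dates and prices are made up and only mimic the Date and Close columns.

```python
# Made-up rows mimicking the Date and Close columns; not real prices.
rows = [
    {"Date": "2020-01-02", "Close": 100.0},
    {"Date": "2020-01-03", "Close": 102.0},
    {"Date": "2020-01-06", "Close": 101.0},
    {"Date": "2020-01-07", "Close": 105.0},
    {"Date": "2020-01-08", "Close": 104.0},
]


def daily_returns(values):
    """Fractional change from each value to the next one."""
    return [
        (today - yesterday) / yesterday
        for yesterday, today in zip(values, values[1:])
    ]


def moving_average(values, window):
    """Trailing mean over `window` values; None until the window is full."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)
        else:
            result.append(sum(values[i + 1 - window:i + 1]) / window)
    return result


closes = [row["Close"] for row in rows]
returns = daily_returns(closes)
ma3 = moving_average(closes, window=3)

for row, average in zip(rows, ma3):
    print(row["Date"], row["Close"], average)
```

In Pandas the same ideas are available as pct_change and rolling(...).mean() on a Series, which is what you would likely use on the full file.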
Digits Dataset
This dataset contains the pixel values for digits.
This is a commonly used dataset for Multiclass Classification problems.
I got this dataset from Professor Andrew Ng’s Machine Learning course on Coursera.
Download this dataset from this link.
BBC Text Dataset
Another wonderful dataset for Natural Language Processing.
This dataset contains different types of news from the BBC archives. It’s a big text dataset.
It is popular for Multiclass Classification problems.
Despite its size, it has only two columns: text and category.
Here is the link for this dataset
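For a multiclass problem it helps to keep every category represented in both the training and test sets. Here is a minimal sketch of a stratified split using only the standard library. The rows are placeholders, and the three category names and the stratified_split helper are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

# Placeholder rows in the text / category layout; category names are assumed.
articles = [
    {"text": f"placeholder article {i}", "category": category}
    for category in ("sport", "business", "tech")
    for i in range(10)
]


def stratified_split(rows, label_key, test_fraction, seed=0):
    """Split rows so each label contributes the same fraction to the test set."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)  # seeded so the split is reproducible
    train, test = [], []
    for group in by_label.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


train_rows, test_rows = stratified_split(articles, "category", test_fraction=0.2)

print(Counter(row["category"] for row in train_rows))
print(Counter(row["category"] for row in test_rows))
```

Fixing the seed keeps the split identical between runs, which makes it easier to compare models fairly.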
Cats vs Dogs
Very commonly used to practice Image Classification.
This dataset contains images of cats and dogs.
It is good for computer vision problems.
Malignant vs Benign
Another useful dataset for Computer Vision problems.
This dataset contains images of skin lesions in two classes: malignant and benign.
Good for Image Classification problems.
Download this dataset from here
Natural Images Dataset
This dataset contains images of airplanes, cars, cats, dogs, flowers, fruit, motorbike, and person.
You can use it to get more practice with Multiclass Classification.
Here is the link to the dataset
Conclusion
These are all the datasets I wanted to share today. You should find enough datasets, and some project ideas as well, on this page to practice the necessary skills and build a portfolio.