## All the Datasets You Need to Practice Data Science Skills and Make a Great Portfolio

The only way to learn data science, data analysis, machine learning, or artificial intelligence is by practicing and doing projects. There is no alternative. But most of the time, when I did a project for my portfolio or practiced a new concept, I had to spend a good amount of time finding a suitable dataset. I decided to write this article to share some of the datasets I found very useful and interesting. That way, you will at least have some datasets on hand to practice with.

## Census Dataset

If you want to get a taste of how to explore a very big dataset, work with this one.

This one is great for Exploratory Data Analysis, Statistical Analysis & Modeling, and Data Visualization practice.

## Airbnb Dataset

I received this dataset as a part of an interview a while ago.

I was asked to do an Exploratory Data Analysis and develop a Machine Learning Model using this dataset.

This dataset has a lot of text data and numerical data, so you can use it to practice many different types of projects.

## Cars Dataset

This is a reasonably sized dataset that can be used to practice some Regression Models and Exploratory Data Analysis.

This dataset contains these columns: YEAR, Make, Model, Size, (kW), Unnamed: 5, TYPE, CITY (kWh/100 km), HWY (kWh/100 km), COMB (kWh/100 km), CITY (Le/100 km), HWY (Le/100 km), COMB (Le/100 km), (g/km), RATING, (km), TIME (h).

Here is the link for this dataset
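A header like `Unnamed: 5` in the column list is the name pandas assigns when a header cell is blank, so some tidying is a sensible first step. Here is a minimal sketch of that cleanup. The rows are invented for illustration, and the new names (`motor_kw`, `range_km`, `recharge_time_h`) are assumptions about what the unlabeled columns hold, not documented names.

```python
import io
import pandas as pd

# Toy rows invented for illustration; only the header layout mirrors the column list above.
csv_text = """YEAR,Make,Model,Size,(kW),,TYPE,(km),TIME (h)
2020,MakeA,ModelX,COMPACT,100,A1,B,250,8
2021,MakeB,ModelY,MID-SIZE,150,A1,B,400,10
"""
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())  # the blank sixth header shows up as "Unnamed: 5"

# Look at a mystery column before deciding what to call it.
print(raw["Unnamed: 5"].unique())

# Hypothetical new names: the meanings are assumptions, so check them against the data.
renames = {
    "(kW)": "motor_kw",
    "Unnamed: 5": "unnamed_5",
    "(km)": "range_km",
    "TIME (h)": "recharge_time_h",
}
cars = raw.rename(columns=renames)

# Lowercase the remaining headers for consistency.
cars.columns = [c.lower() for c in cars.columns]
print(cars.columns.tolist())
```

With the real file, you would replace the `io.StringIO(csv_text)` argument with the path to the downloaded CSV.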

## Heart Disease Dataset

I found this dataset on Kaggle. Since then, I have used it in many different articles to demonstrate concepts.

These are two examples:

A complete Guide to Confidence Interval and Examples in Python

Logistic Regression From Scratch Using a Real Dataset

You will find some Exploratory Data Analysis examples and details about the dataset in those articles as well. Check out this dataset. I am sure you will use it a lot.

## NHANES Dataset

An amazing dataset for learners. The column names of this dataset may look cryptic at first.

But once you get used to them, you can use this one dataset to practice Data Analysis, Visualization, Statistical Modeling, and Machine Learning models (both classification and regression).

## People Wiki Dataset

It contains Wikipedia profiles of some famous people.

The dataset contains three columns: URI, name (name of the person), and text (it includes the Wikipedia profile).

A simple but very useful dataset for Natural Language Processing.
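A common first step with a text column like this is counting words per profile. The sketch below assumes the three columns described above; the rows are placeholders invented for illustration, and `word_counts` is a hypothetical helper, not part of any library.

```python
import re
from collections import Counter

import pandas as pd

# Placeholder rows invented for illustration; the real file has the same three columns.
people = pd.DataFrame({
    "URI": ["<http://example.org/person_a>", "<http://example.org/person_b>"],
    "name": ["Person A", "Person B"],
    "text": [
        "person a is a singer and songwriter who released a debut album",
        "person b is a football player who played as a striker",
    ],
})

def word_counts(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

# One Counter per profile, stored next to the text it came from.
people["word_count"] = [word_counts(t) for t in people["text"]]

# Most frequent words in the first profile.
print(people.loc[0, "word_count"].most_common(3))
```

Raw counts are dominated by words like "a" and "is", which is why a real project would usually go on to remove stop words or use TF-IDF weights.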

Please check out this article to see an example of what you can do with this dataset:

Natural Language Processing in Python With a Project

Here is the link to this dataset

## Amazon Product Review Dataset

This dataset contains millions of reviews of Amazon products.

It has three columns: the name of the product, the review, and the rating. This dataset is close to real-world data and very good for Natural Language Processing.
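For sentiment analysis, the rating column can be turned into a label. Here is a minimal sketch of one common convention: treat 4 and 5 stars as positive, 1 and 2 stars as negative, and drop the neutral 3-star reviews. The column names (`name`, `review`, `rating`), the thresholds, and the rows are all assumptions for illustration.

```python
import pandas as pd

# Toy reviews invented for illustration; the column names are assumed.
reviews = pd.DataFrame({
    "name": ["Product A", "Product A", "Product B", "Product C"],
    "review": ["Works great", "Stopped working after a week", "It is okay", None],
    "rating": [5, 1, 3, 4],
})

# Drop rows with no review text, then drop neutral 3-star reviews.
labeled = reviews.dropna(subset=["review"])
labeled = labeled[labeled["rating"] != 3].copy()

# Assumed labeling rule: 4-5 stars -> positive (1), 1-2 stars -> negative (0).
labeled["sentiment"] = (labeled["rating"] >= 4).astype(int)
print(labeled[["review", "rating", "sentiment"]])
```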

I have a sentiment analysis project and an article where I used this dataset. Please check it out here:

## Movie Dataset

This is another dataset that is good for Machine Learning and Natural Language Processing.

This one contains the following columns: index, budget, genres, homepage, id, keywords, original_language, original_title, overview, popularity, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title, vote_average, vote_count, cast, crew, director.

I used this dataset for this project:

## Housing Price Dataset

This is one of the most common datasets for developing Regression Models, though you can certainly use it for other purposes as well.

It is mostly used to predict housing prices based on the information in the other columns.

This dataset contains these columns: id, date, price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, sqft_above, sqft_basement, yr_built, yr_renovated, zip code, lat, long, sqft_living15, sqft_lot15.
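As a starting point for the regression practice described above, here is a minimal sketch that fits a straight line of `price` against `sqft_living` with NumPy. The five rows are invented for illustration, and a single-feature line is only a baseline, not a recommended final model.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; only the column names come from the dataset.
houses = pd.DataFrame({
    "sqft_living": [1000, 1500, 2000, 2500, 3000],
    "price": [200000, 290000, 410000, 500000, 600000],
})

# Least-squares fit of price = slope * sqft_living + intercept.
slope, intercept = np.polyfit(
    houses["sqft_living"].to_numpy(), houses["price"].to_numpy(), deg=1
)
print(f"price ~ {slope:.1f} * sqft_living + {intercept:.0f}")

# Predict the price of a hypothetical 1,800 sqft house.
predicted = slope * 1800 + intercept
print(f"predicted price: {predicted:.0f}")
```

From here, a natural next step is adding more columns (bedrooms, grade, and so on) and comparing the error against this baseline.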

## Mushrooms Dataset

I found this dataset in the Applied Data Science With Python Specialization on Coursera.

I used it for Classification problems. It can be used for other purposes as well.

It contains these columns: class, cap-shape, cap-surface, cap-color, bruises, odor, gill-attachment, gill-spacing, gill-size, gill-color, stalk-shape, stalk-root, stalk-surface-above-ring, stalk-surface-below-ring, stalk-color-above-ring, stalk-color-below-ring, veil-type, veil-color, ring-number, ring-type, spore-print-color, population, habitat.

Here is the link to this dataset
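Every column in this dataset is categorical, so it needs encoding before most classifiers can use it. Here is a minimal sketch using one-hot encoding with `pd.get_dummies`. The rows are invented, and the single-letter codes (`p` for poisonous, `e` for edible) are an assumption about how the file stores its values.

```python
import pandas as pd

# Toy rows invented for illustration, using three of the columns listed above.
mushrooms = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "cap-shape": ["x", "x", "b", "x"],
    "odor": ["p", "a", "l", "p"],
})

# Target: 1 for poisonous, 0 for edible (assuming "p"/"e" codes).
y = (mushrooms["class"] == "p").astype(int)

# One-hot encode every feature column: one new column per category value.
X = pd.get_dummies(mushrooms.drop(columns="class"))
print(X.columns.tolist())
print(X.shape)
```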

## Olympic Dataset

This dataset has information on the Olympic results. Each row contains the data of a country.

This dataset will give you a taste of data cleaning to start with.

I learned Python libraries like NumPy and pandas using this dataset.

## Titanic Dataset

Another very popular dataset. I have used it a lot myself, and I have seen many experienced people use it to present a concept.

This dataset contains these columns: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked.

This dataset is good for Exploratory Data Analysis, Machine Learning Models (especially Classification Models), Statistical Analysis, and Data Visualization practice.

Here is the link to this dataset
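A typical first pass at Exploratory Data Analysis on this dataset might look like the sketch below: check for missing values, compare survival rates across groups, and fill in missing ages. The six rows are invented for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy rows invented for illustration; only the column names come from the dataset.
titanic = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Pclass": [3, 1, 3, 3, 1, 2],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0, 27.0, None],
})

# Missing values per column: a usual first check.
print(titanic.isna().sum())

# Survival rate by sex: the mean of a 0/1 column is a proportion.
rate_by_sex = titanic.groupby("Sex")["Survived"].mean()
print(rate_by_sex)

# Fill missing ages with the median age (one simple option among several).
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
```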

## Iris Dataset

Another widely used dataset in data science courses.

This one is especially good for learning Classification Models.

It contains these columns: SepalLength, SepalWidth, PetalLength, PetalWidth, Name.

## Fraud Dataset

I found this dataset in the Applied Data Science With Python Specialization on Coursera.

We used it for Classification Models.

A credit card fraud detection project looks good in a portfolio.

## Immigration Dataset

This dataset provides information about how many immigrants came from which country by year.

A great dataset to practice Exploratory Data Analysis and Data Visualization.

Start Using Matplotlib Today With This Tutorial

## Facebook Stock Dataset

This dataset provides Facebook stock performance per day.

The columns in this dataset are Date, Open, High, Low, Close, Adj Close, Volume.

This one can be very useful in Time Series Analysis and Visualization or Time Series Related problems.
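To get started with time series work on columns like these, here is a minimal sketch that parses the dates into an index, then computes daily returns and a moving average of the closing price. The four rows of prices are invented for illustration, and the 3-day window is an arbitrary choice.

```python
import io
import pandas as pd

# Toy prices invented for illustration; only the header row mirrors the dataset.
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2020-01-02,100.0,102.0,99.0,101.0,101.0,1000
2020-01-03,101.0,103.0,100.0,102.0,102.0,1100
2020-01-06,102.0,106.0,101.0,105.0,105.0,1200
2020-01-07,105.0,105.5,103.0,104.0,104.0,900
"""

# Parse the Date column as datetimes and use it as the index.
stock = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")

# Daily return: fractional change of the close from the previous row.
stock["Return"] = stock["Close"].pct_change()

# 3-day moving average of the close (the first two rows have no full window).
stock["MA3"] = stock["Close"].rolling(window=3).mean()

print(stock[["Close", "Return", "MA3"]])
```

With a datetime index in place, pandas features such as `resample` and date-based slicing become available for the same frame.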

## Digits Dataset

This dataset contains the pixel values for digits.

This is a commonly used dataset for Multiclass Classification problems.

I got this dataset from Professor Andrew Ng’s Machine Learning course on Coursera.
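If you want to try multiclass classification on digit pixels right away, here is a minimal sketch. Note the assumption: it uses the small 8x8 digits data bundled with scikit-learn (`load_digits`) as a stand-in, which is not the file from the course, and logistic regression is just one reasonable baseline.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: scikit-learn's bundled 8x8 digit images, not the course file.
digits = load_digits()
X = digits.data / 16.0  # pixel intensities in this copy range from 0 to 16
y = digits.target       # labels 0 through 9

# Hold out a quarter of the images, keeping class proportions the same.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Logistic regression handles the ten classes directly.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```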

## BBC Text Dataset

Another wonderful dataset for Natural Language Processing.

This dataset contains information on different types of news from BBC archives. It’s a big text dataset.

It is commonly used for Multiclass Classification problems.

The dataset is big but it has only two columns: text and category.

Here is the link for this dataset

## Cats vs Dogs

Very commonly used to practice Image Classification.

This dataset contains images of cats and dogs.

It is good for computer vision problems.

## Malignant vs Benign

Another useful dataset for Computer Vision problems.

This dataset contains images of skin lesions in two classes: malignant and benign.

Good for Image Classification problems.