Here are some data sets for practicing your machine learning and deep learning skills.


This is a small data set to get you started. Use it to apply the skills you gained from working through the caret vignette.

Overview of the data set: Percentage of body fat, age, weight, height, and ten body circumference measurements for 200 men. Body fat, a measure of health, is estimated through an underwater weighing technique. Your goal is to predict percent body fat using some or all of the other variables. These data have been modified from a data set of actual human measurements.

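# Load the packages that provide as_tibble() and ggplot()
library(tibble)
library(ggplot2)

# Read in the data (supply the path to the training CSV)
temp <- read.csv("")

bodyfat_train <- as_tibble(temp)

# Plot body fat against age with a fitted linear trend line
ggplot(data = bodyfat_train, mapping = aes(x = Age, y = Bodyfat)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red")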
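As a starting point for the prediction goal, here is a minimal baseline sketch: a linear regression scored by RMSE on held-out rows, using base R only. Because the file path is not given here, the sketch runs on simulated stand-in data. The `Weight` column, the simulated relationship, and the 25% hold-out fraction are assumptions for illustration; only `Age` and `Bodyfat` are column names taken from the plot code.

```r
# Baseline sketch: linear regression with a hold-out RMSE, base R only.
# NOTE: the data below are simulated stand-ins, NOT the course data set.
set.seed(2020)

n <- 200
sim <- data.frame(
  Age    = round(runif(n, min = 20, max = 80)),
  Weight = rnorm(n, mean = 180, sd = 25)  # hypothetical column name
)
# Hypothetical relationship, chosen only so the example has some signal
sim$Bodyfat <- 5 + 0.1 * sim$Age + 0.05 * sim$Weight + rnorm(n, sd = 4)

# Hold out 25% of the rows for validation (an assumed split fraction)
holdout_idx <- sample(seq_len(n), size = round(0.25 * n))
train_df <- sim[-holdout_idx, ]
valid_df <- sim[holdout_idx, ]

# Fit on the training rows, using all other columns as predictors
fit <- lm(Bodyfat ~ ., data = train_df)

# Root mean squared error on the held-out rows
pred <- predict(fit, newdata = valid_df)
rmse <- sqrt(mean((valid_df$Bodyfat - pred)^2))
rmse
```

With the real data, `sim` would be replaced by `bodyfat_train`. The same split-fit-score pattern is what caret's `train()` automates with resampling, so this number is a useful baseline to beat.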

The body fat data come from our ISEC2020 Introduction to Kaggle website, which has more information about the data, a test data set, and more. See the class handout on Kaggle for more information about our class ISEC2020 Kaggle competitions.

Kaggle datasets

There are many, many data sets available on Kaggle. See the data link in Kaggle.

Vanderbilt dataset repository

Vanderbilt dataset repository: Also see the bottom of their webpage for links to other dataset repositories.

Climate change dataset

If you want to try out your skills on a large data set, work with these data. They are part of the American Statistical Association (ASA) Section on Statistics and the Environment (ENVR) 2020 Data Challenge. If you are a student, you may wish to participate in the competition. See more here.

ENVR 2020 Data Challenge Data Set. Warning: this is a large data set. Read this handout for more information about handling the large files.

From the data challenge website: The data sets for the ENVR Data Challenge 2020 are generously provided by Jupiter Intelligence and are focused around multiple model-based outputs of maximum temperature, minimum temperature, and precipitation.