ML Resources

01 Jan 2017

Easily the best way to improve at machine learning is to practice. Practice helps in two ways:

1. You get familiar with implementing common algorithms. Whether you use machine learning libraries or roll your own, becoming comfortable with these algorithms is critical.
2. By working with a variety of data sets, you gain an intuition for how data is structured, so you know when to apply which algorithm.

So, the most effective way to gain machine learning experience is to implement a variety of algorithms on data sets that behave differently.

I've consolidated the machine learning resources I am familiar with, both from school and outside, into the list below. I've marked each data set with the model(s) I used on it, but please experiment with any algorithm that interests you!

Regression

Predict food truck profit from city population. Download: population_profit.csv. Dataset from Andrew Ng's Coursera course. Models I used: linear regression.
Predict house prices from area and number of bedrooms. Download: area_bedrooms_price.csv. Dataset from Andrew Ng's Coursera course. Models I used: linear regression.
Predict car MPG from 7 variables. Download: auto_data.csv. Dataset from UCI repository. Models I used: linear regression.
Predict E. coli bacterial growth rate from many factors. For this, use only the gene expressions as attributes. Download: ecoli_data.zip. Dataset from Ilias Tagkopoulos at UC Davis. Models I used: linear regression with regularization.
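As a starting point for the single-variable problems above, here is a minimal sketch of ordinary least squares in plain Python. The column layout of population_profit.csv is not described here, so the example fits synthetic stand-in data instead; `fit_line` and the generated numbers are my own illustration, not code from the course.

```python
import random

def fit_line(xs, ys):
    """Closed-form least squares for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is covariance(x, y) divided by variance(x).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic stand-in data (an assumption): y = 1.2 * x - 4 plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(5, 25) for _ in range(200)]
ys = [1.2 * x - 4.0 + rng.gauss(0, 0.5) for x in xs]

slope, intercept = fit_line(xs, ys)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

To try it on a real file, read the two columns into `xs` and `ys` (for example with the standard `csv` module) and call `fit_line` the same way. For the multi-variable data sets, the same idea extends to the normal equations or gradient descent.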

Classification

Note: Any supervised classification data set can also be used to practice unsupervised clustering by ignoring the class labels.

Predict college admission. Download: score1_score2_admit.csv. Dataset from Andrew Ng's Coursera course. Models I used: logistic regression.
Predict microchip acceptance. Download: score1_score2_accept.csv. Dataset from Andrew Ng's Coursera course. Models I used: logistic regression.
Classify handwritten digits (0-9). Download: MNIST_classification.zip. Dataset from Andrew Ng's Coursera course. Models I used: logistic regression, fully-connected neural network with 1 hidden layer.
Classify yeast protein localization site based on 8 features. Download: yeast_data.csv. Dataset from UCI repository. Models I used: fully-connected neural network with various numbers of layers.
Classify E. coli bacterial characteristics. Classify the strain type, medium type, environmental type and gene perturbation. Download: ecoli_data.zip. Dataset from Ilias Tagkopoulos at UC Davis. Models I used: SVM.
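For the binary problems in this list, a roll-your-own logistic regression is a good first exercise. The sketch below trains one with batch gradient descent in plain Python. The data is synthetic (two features, label 1 when their sum is positive), and the names `train_logistic` and `predict`, the learning rate, and the epoch count are illustrative assumptions rather than the settings I used on the files above.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(rows, labels, lr=0.5, epochs=300):
    """Batch gradient descent on the mean log-loss; returns [bias, w1, w2, ...]."""
    n, d = len(rows), len(rows[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for x, y in zip(rows, labels):
            # Prediction error for this example: sigmoid(w . x) - y.
            err = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) - y
            grad[0] += err
            for j in range(d):
                grad[j + 1] += err * x[j]
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w

def predict(w, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) >= 0.5 else 0

# Synthetic stand-in data (an assumption): label is 1 when the two features sum above 0.
rng = random.Random(1)
rows = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
labels = [1 if a + b > 0 else 0 for a, b in rows]

w = train_logistic(rows, labels)
accuracy = sum(predict(w, x) == y for x, y in zip(rows, labels)) / len(rows)
print(f"training accuracy: {accuracy:.2f}")
```

This toy data is linearly separable, so a plain linear boundary is enough. Real data sets may need feature scaling, extra features, or regularization before a model like this does well.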

Unsupervised Learning

Simple synthetic datasets to practice k-means clustering. Download: kmeans_simple.zip. Dataset from Thomas Strohmer at UC Davis. Models I used: k-means clustering with BIC and AIC.
Iris species clustering: try to cluster irises into species based on 4 features. Download: iris.csv. Dataset from UCI repository. Models I used: k-means clustering.
"Crescents" synthetic dataset, which k-means clustering cannot separate correctly. Download: crescents.csv. Dataset from Thomas Strohmer at UC Davis. Models I used: diffusion maps.
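If you want to roll your own k-means before trying these files, the sketch below is a bare-bones version of Lloyd's algorithm in plain Python. It runs on two synthetic Gaussian blobs that I generate here as a stand-in (not the contents of kmeans_simple.zip), and it leaves out the BIC/AIC model selection mentioned above. The function name `kmeans` and its parameters are illustrative.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm: alternate an assignment step and a centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct data points
    assignment = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

# Synthetic stand-in data (an assumption): two blobs near (0, 0) and (10, 10).
rng = random.Random(42)
blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
blob_b = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(50)]
points = blob_a + blob_b

centroids, assignment = kmeans(points, k=2)
print(sorted(centroids))
```

Because every point goes to its nearest centroid, the clusters k-means finds are always separated by straight boundaries. That works for blobs like these, and it is the reason interleaved shapes such as the crescents call for a different method.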

Dimensionality Reduction

Reduce the dimensionality of 32x32 grayscale face images. Download: yale_faces.zip. Dataset originally from Yale, collected by Thomas Strohmer at UC Davis.
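No model is listed for this data set. One common choice is principal component analysis (PCA), and that choice is my assumption for the sketch below. It finds only the leading principal direction, using power iteration on the covariance matrix in plain Python, and runs on synthetic 2-D points rather than the face images. The names `top_component` and `project` are illustrative.

```python
import math
import random

def top_component(rows, iters=100):
    """Leading principal direction via power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(r, means)] for r in rows]
    # Start from a uniform unit vector; this assumes it is not orthogonal
    # to the leading direction and that the data is not constant.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        # Compute C v = X^T (X v) / n without forming the d x d matrix C.
        proj = [sum(a * b for a, b in zip(r, v)) for r in centered]
        w = [sum(p * r[j] for p, r in zip(proj, centered)) / n for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v

def project(rows, means, v):
    """Reduce each row to one number: its coordinate along direction v."""
    return [sum((x - m) * c for x, m, c in zip(r, means, v)) for r in rows]

# Synthetic stand-in data (an assumption): points spread along (0.6, 0.8) with small noise.
rng = random.Random(2)
rows = []
for _ in range(300):
    t = rng.gauss(0, 5)
    rows.append([0.6 * t + rng.gauss(0, 0.1), 0.8 * t + rng.gauss(0, 0.1)])

means, v = top_component(rows)
scores = project(rows, means, v)
print([round(c, 2) for c in v])
```

For the face images, each 32x32 image would be flattened into a 1024-element row first. Keeping more than one component needs either deflation (remove the found direction and repeat) or a library eigendecomposition or SVD routine.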

Donald Pinckney
