SmartCore 0.1.0

Supervised Learning

Most machine learning problems fall into one of two categories: supervised and unsupervised. This page describes the supervised learning algorithms implemented in SmartCore.

In supervised learning we build a model which can be written in the very general form \[Y = f(X) + \epsilon\] where \(X = (X_1,X_2,\dots,X_p)\) are observations that consist of \(p\) different predictors, \(Y\) is the associated target value and \(\epsilon\) is a random error term, which is independent of \(X\) and has zero mean. We fit an unknown function \(f\) to our data to predict the response for future observations or to better understand the relationship between the response and the predictors.

Supervised learning is by far the most common machine learning method that is used in practice. Supervised learning problems can be grouped into regression and classification:

  • Classification: the output variable is qualitative (a category), such as “cancerous” or “benign”, “round” or “square”.
  • Regression: the output variable is quantitative, such as “height” or “dollars”.

All algorithms in SmartCore support both qualitative and quantitative response variables and follow the same naming convention for function names. To fit an algorithm to your data, use the fit method, which usually comes with at least two mandatory parameters: x for your predictors and y for target values. All optional parameters are hidden behind Default::default(). To make a prediction, use the predict method, which takes new observations as x and predicts estimated class labels or target values.

K Nearest Neighbors

K-nearest neighbors (KNN) is one of the simplest and best-known non-parametric classification and regression methods. KNN does not require training: the algorithm simply stores the entire dataset and then uses it to make predictions.

More formally, given a positive integer \(K\) and a test observation \(x_0\), the KNN classifier first identifies the \(K\) points in the training data that are closest to \(x_0\), represented by \(N_0\), and then estimates the conditional probability for class \(j\) as the fraction of points in \(N_0\) whose response values equal \(j\):

\[ Pr(Y=j \vert X=x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i=j) \]

The KNN regressor is closely related to the KNN classifier. It estimates the target value using the average of all the responses in \(N_0\), i.e.

\[ \hat{y} = \frac{1}{K} \sum_{i \in N_0} y_i \]
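
To make these two estimators concrete, here is a minimal, self-contained sketch in plain Rust (illustrative only, not the SmartCore API) that computes both quantities from the responses of the \(K\) nearest neighbors; the example labels and target values are made up.

// Illustrative only; plain Rust, not the SmartCore API. Given the responses of
// the K nearest neighbors N_0, compute the class-j probability estimate and the
// regression estimate from the formulas above.
fn knn_class_probability(neighbor_labels: &[usize], j: usize) -> f64 {
    // Pr(Y = j | X = x_0) = (1/K) * sum over N_0 of I(y_i = j)
    let k = neighbor_labels.len() as f64;
    neighbor_labels.iter().filter(|&&y| y == j).count() as f64 / k
}

fn knn_regression_estimate(neighbor_targets: &[f64]) -> f64 {
    // y_hat = (1/K) * sum over N_0 of y_i
    neighbor_targets.iter().sum::<f64>() / neighbor_targets.len() as f64
}

fn main() {
    // Suppose K = 5 and the nearest neighbors have these responses.
    let labels = [1, 0, 1, 1, 0];
    let targets = [22.0, 19.5, 24.0, 21.0, 20.5];
    println!("Pr(Y=1 | x_0) = {}", knn_class_probability(&labels, 1)); // 0.6
    println!("y_hat = {}", knn_regression_estimate(&targets));         // 21.4
}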

The choice of \(K\) is very important. A good value of \(K\) can be found by tuning the algorithm on a holdout dataset. It is a good idea to try many different values of \(K\) (e.g. values from 1 to 21) and see which one gives the best test error rate.

To determine which of the \(K\) instances in the training dataset are most similar to a new input, a distance metric is used. For real-valued input variables, the most popular distance metric is Euclidean distance. You can choose the best distance metric based on the properties of your data. If you are unsure, experiment with different distance metrics and different values of \(K\) together and see which mix results in the most accurate models.

Nearest Neighbors Classification

To fit a KNN classifier to your data, use KNNClassifier. Let’s fit it to the Breast Cancer dataset:

use smartcore::dataset::*;
// DenseMatrix wrapper around Vec
use smartcore::linalg::naive::dense_matrix::DenseMatrix;
// Imports for KNN classifier
use smartcore::math::distance::Distances;
use smartcore::neighbors::knn_classifier::*;
// Model performance
use smartcore::metrics::roc_auc_score;
use smartcore::model_selection::train_test_split;
// Load dataset
let cancer_data = breast_cancer::load_dataset();
// Transform dataset into a NxM matrix
let x = DenseMatrix::from_array(
    cancer_data.num_samples,
    cancer_data.num_features,
    &cancer_data.data,
);
// These are our target class labels
let y = cancer_data.target;
// Split dataset into training/test (80%/20%)
let (x_train, x_test, y_train, y_test) = train_test_split(&x, &y, 0.2);
// KNN classifier
let y_hat_knn = KNNClassifier::fit(
    &x_train,
    &y_train,
    Distances::euclidian(),
    Default::default(),
).and_then(|knn| knn.predict(&x_test)).unwrap();
// Calculate test error
println!("AUC: {}", roc_auc_score(&y_test, &y_hat_knn));

The default value of \(K\) is 3. If you want to change this and other parameters, replace Default::default() with an instance of KNNClassifierParameters.
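
As a rough sketch of what that might look like (the exact shape of KNNClassifierParameters in this version is an assumption here, so verify against the API documentation):

// Assumption (not verified against the 0.1.0 API): KNNClassifierParameters is a
// plain struct with a public `k` field that implements Default, as described above.
let params = KNNClassifierParameters {
    k: 5,
    ..Default::default()
};
let knn = KNNClassifier::fit(&x_train, &y_train, Distances::euclidian(), params).unwrap();
let y_hat = knn.predict(&x_test).unwrap();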

Nearest Neighbors Regression

The KNN regressor, implemented in KNNRegressor, is very similar to the KNN classifier; the only difference is that the returned value is a real number instead of a class label. To fit KNNRegressor to the Boston Housing dataset:

use smartcore::dataset::*;
// DenseMatrix wrapper around Vec
use smartcore::linalg::naive::dense_matrix::DenseMatrix;
// KNN
use smartcore::math::distance::Distances;
use smartcore::neighbors::knn_regressor::KNNRegressor;
// Model performance
use smartcore::metrics::mean_squared_error;
use smartcore::model_selection::train_test_split;
// Load dataset
let boston_data = boston::load_dataset();
// Transform dataset into a NxM matrix
let x = DenseMatrix::from_array(
    boston_data.num_samples,
    boston_data.num_features,
    &boston_data.data,
);
// These are our target values
let y = boston_data.target;
// Split dataset into training/test (80%/20%)
let (x_train, x_test, y_train, y_test) = train_test_split(&x, &y, 0.2);
// KNN regressor
let y_hat_knn = KNNRegressor::fit(
    &x_train,
    &y_train,
    Distances::euclidian(),
    Default::default(),
).and_then(|knn| knn.predict(&x_test)).unwrap();
// Calculate test error
println!("MSE: {}", mean_squared_error(&y_test, &y_hat_knn));

As with the KNN classifier, you can change the value of \(K\) and other parameters by passing an instance of KNNRegressorParameters to the fit function.

Nearest Neighbor Algorithms

The computational complexity of KNN increases with the size of the training dataset. This is because every time a prediction is made the algorithm has to search through all stored samples to find the \(K\) nearest neighbors. An efficient implementation of KNN requires a special data structure, like a Cover Tree, to speed up the look-up of nearest neighbors during prediction.

The Cover Tree is the default algorithm for both the KNN regressor and the KNN classifier. Change the value of the algorithm field of KNNRegressorParameters or KNNClassifierParameters if you want to switch to the brute force search method.
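
As a rough sketch only (the module path and enum names below are assumptions rather than verified parts of this version's API, so check the API documentation), switching the classifier to brute force search might look like this:

// Assumption: an enum (here called KNNAlgorithmName) selects between the cover
// tree and linear (brute force) search, and is exposed through the `algorithm` field.
use smartcore::algorithm::neighbour::KNNAlgorithmName;

let params = KNNClassifierParameters {
    algorithm: KNNAlgorithmName::LinearSearch,
    ..Default::default()
};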

Brute Force

The brute force nearest neighbor search is the simplest algorithm: it calculates the distance from the query point to every other point in the dataset while maintaining a list of the \(K\) nearest items in a binary heap. This algorithm does not build any search data structure ahead of time and results in \(O(n)\) search time, where \(n\) is the number of samples. The brute force search algorithm is implemented in LinearKNNSearch.
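
Here is a minimal, self-contained illustration of this idea in plain Rust (it is not SmartCore's LinearKNNSearch, just a sketch of keeping the \(K\) closest points in a binary heap while scanning the data once):

// Illustrative, self-contained sketch: scan all samples once, keeping the K closest
// seen so far in a max-heap keyed by distance, so each of the n distance
// computations costs only O(log K) bookkeeping.
use std::cmp::Ordering;
use std::collections::BinaryHeap;

fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

fn k_nearest(data: &[Vec<f64>], query: &[f64], k: usize) -> Vec<usize> {
    // Heap entries: (distance, sample index); the largest distance sits on top.
    let mut heap: BinaryHeap<(OrderedF64, usize)> = BinaryHeap::new();
    for (i, sample) in data.iter().enumerate() {
        let d = OrderedF64(euclidean(sample, query));
        if heap.len() < k {
            heap.push((d, i));
        } else if d < heap.peek().unwrap().0 {
            heap.pop(); // drop the current farthest neighbor
            heap.push((d, i));
        }
    }
    heap.into_iter().map(|(_, i)| i).collect()
}

// f64 does not implement Ord, so wrap it for use inside the heap.
#[derive(PartialEq, PartialOrd)]
struct OrderedF64(f64);
impl Eq for OrderedF64 {}
impl Ord for OrderedF64 {
    fn cmp(&self, other: &Self) -> Ordering {
        self.partial_cmp(other).unwrap_or(Ordering::Equal)
    }
}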

Cover Tree

Although the brute force algorithm is a very simple approach, it outperforms many space-partitioning approaches, like the k-d tree, in higher-dimensional spaces. However, the brute force approach quickly becomes infeasible as the dataset grows in size. To address the inefficiencies of brute force search, other data structures are used that reduce the required number of distance calculations by efficiently encoding aggregate distance information for the samples.

A Cover Tree is a tree data structure used for partitioning metric spaces to speed up nearest neighbor operations. Cover trees are fast in practice and have great theoretical properties:

  • Construction: \(O(c^6 n \log n)\)
  • Query: \(O(c^{12} \log n)\)
  • Space: \(O(n)\)

where \(n\) is the number of samples in the dataset and \(c\) denotes the expansion constant.

Distance Metrics

The choice of distance metric for the KNN algorithm largely depends on the properties of your data. If you don’t know which distance to use, go with the Euclidean distance function, or choose the metric that gives you the best performance on a hold-out test set. There are many other distance measures that can be used with KNN in SmartCore:

Distance Metric | Parameters | Description
Euclidean | — | \(\sqrt{\sum_{i=1}^n (x_i-y_i)^2}\)
Hamming | — | the number of places where \(x\) and \(y\) differ
Mahalanobis | \(S\), covariance matrix of the dataset | \(\sqrt{(x - y)^T S^{-1} (x - y)}\)
Manhattan | — | \(\sum_{i=1}^n \lvert x_i - y_i \rvert\)
Minkowski | \(p\), distance order | \(\left(\sum_{i=1}^n \lvert x_i - y_i \rvert^p\right)^{1/p}\)
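
To illustrate the formulas in the table, here is a small plain-Rust sketch (not the SmartCore API) of the Minkowski distance; \(p = 1\) recovers Manhattan distance and \(p = 2\) recovers Euclidean distance.

// Plain-Rust illustration of the Minkowski distance from the table above.
fn minkowski(x: &[f64], y: &[f64], p: f64) -> f64 {
    x.iter()
        .zip(y)
        .map(|(a, b)| (a - b).abs().powf(p))
        .sum::<f64>()
        .powf(1.0 / p)
}

fn main() {
    let (a, b) = ([0.0, 0.0], [3.0, 4.0]);
    println!("Manhattan: {}", minkowski(&a, &b, 1.0)); // 7
    println!("Euclidean: {}", minkowski(&a, &b, 2.0)); // 5
}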

Linear Models

Linear regression is one of the most well known and well understood algorithms in statistics and machine learning.

Figure 1. Simple linear regression.

The model describes a dependent variable \(y\) (also called the response) as a function of one or more independent, or explanatory, variables \(X_i\). The general equation for a linear model is: \[y = \beta_0 + \sum_{i=1}^n \beta_iX_i + \epsilon\]

where the target value \(y\) is a linear combination of the features \(X_i\).

Linear Regression

Use the fit method of LinearRegression to fit an Ordinary Least Squares model to your data.

use smartcore::dataset::*;
// DenseMatrix wrapper around Vec
use smartcore::linalg::naive::dense_matrix::DenseMatrix;
// Linear Regression
use smartcore::linear::linear_regression::LinearRegression;
// Model performance
use smartcore::metrics::mean_squared_error;
use smartcore::model_selection::train_test_split;
// Load dataset
let boston_data = boston::load_dataset();
// Transform dataset into a NxM matrix
let x = DenseMatrix::from_array(
    boston_data.num_samples,
    boston_data.num_features,
    &boston_data.data,
);
// These are our target values
let y = boston_data.target;
// Split dataset into training/test (80%/20%)
let (x_train, x_test, y_train, y_test) = train_test_split(&x, &y, 0.2);
// Linear Regression
let y_hat_lr = LinearRegression::fit(&x_train, &y_train, Default::default())
    .and_then(|lr| lr.predict(&x_test)).unwrap();
// Calculate test error
println!("MSE: {}", mean_squared_error(&y_test, &y_hat_lr));

By default, SmartCore uses SVD decomposition to find the estimates of \(\beta_i\) that minimize the sum of squared residuals. While SVD decomposition provides the most stable solution, you might decide to go with QR decomposition since this approach is more computationally efficient. For comparison, the runtime complexity of SVD decomposition is \(O(mn^2 + n^3)\) vs \(O(mn^2 + n^3/3)\) for QR decomposition, where \(n\) and \(m\) are the dimensions of the input matrix \(X\). Use the solver attribute of LinearRegressionParameters to choose between decomposition methods.
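
As a hedged sketch (the parameter and enum names below are assumptions based on the description above, so verify them against the API documentation), selecting QR decomposition might look like this:

// Assumption: LinearRegressionParameters has a `solver` field and an enum
// (here called LinearRegressionSolverName) with SVD and QR variants.
use smartcore::linear::linear_regression::{
    LinearRegression, LinearRegressionParameters, LinearRegressionSolverName,
};

let lr = LinearRegression::fit(
    &x_train,
    &y_train,
    LinearRegressionParameters {
        solver: LinearRegressionSolverName::QR,
        ..Default::default()
    },
).unwrap();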

Logistic Regression

Logistic regression uses a linear model to represent the relationship between dependent and explanatory variables. Unlike linear regression, the output in logistic regression is modeled as a binary value (0 or 1) rather than a numeric value. To squash the output between 0 and 1, the sigmoid function is used.
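
For illustration, here is the sigmoid function itself as a small plain-Rust sketch (not part of the SmartCore API); the linear combination \(\beta_0 + \sum_i \beta_i X_i\) is passed through it to obtain a probability, which can then be thresholded at 0.5.

// The sigmoid (logistic) function maps any real number onto the (0, 1) interval.
fn sigmoid(z: f64) -> f64 {
    1.0 / (1.0 + (-z).exp())
}

fn main() {
    println!("{:.3}", sigmoid(-2.0)); // ~0.119
    println!("{:.3}", sigmoid(0.0));  // 0.500
    println!("{:.3}", sigmoid(2.0));  // ~0.881
}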

In SmartCore, Logistic Regression is represented by the LogisticRegression struct, which has the methods fit and predict.

use smartcore::dataset::*;
// DenseMatrix wrapper around Vec
use smartcore::linalg::naive::dense_matrix::DenseMatrix;
// Logistic Regression
use smartcore::linear::logistic_regression::LogisticRegression;
// Model performance
use smartcore::metrics::roc_auc_score;
use smartcore::model_selection::train_test_split;
// Load dataset
let cancer_data = breast_cancer::load_dataset();
// Transform dataset into a NxM matrix
let x = DenseMatrix::from_array(
    cancer_data.num_samples,
    cancer_data.num_features,
    &cancer_data.data,
);
// These are our target class labels
let y = cancer_data.target;
// Split dataset into training/test (80%/20%)
let (x_train, x_test, y_train, y_test) = train_test_split(&x, &y, 0.2);
// Logistic Regression
let y_hat_lr = LogisticRegression::fit(&x_train, &y_train)
    .and_then(|lr| lr.predict(&x_test)).unwrap();
// Calculate test error
println!("AUC: {}", roc_auc_score(&y_test, &y_hat_lr));

SmartCore uses the Limited-memory BFGS (L-BFGS) routine to find the optimal combination of \(\beta_i\) parameters.

Decision Trees

Classification and Regression Trees (CART) and its modern variant Random Forest are among the most powerful algorithms available in machine learning.

CART models the relationship between the dependent variable and the explanatory variables as a binary tree. Each node of the tree represents a decision that is made based on the outcome of a single attribute. The leaf nodes of the tree represent an outcome. To make a prediction we take the mean of the training observations belonging to the leaf node for regression and the mode of the observations for classification.

Given a dataset with just three explanatory variables and a qualitative dependent variable, the tree might look like the example below.

Figure 2. An example of a Decision Tree where the target is a class.
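
As a rough, self-contained sketch of this idea in Rust (illustrative only, not SmartCore's internal representation), such a tree could be represented and evaluated like this:

// Illustrative sketch only: a binary tree where each internal node splits on one
// attribute and each leaf stores the predicted class (the mode of the training
// samples that reached it).
enum Node {
    Split { feature: usize, threshold: f64, left: Box<Node>, right: Box<Node> },
    Leaf { class: usize },
}

fn predict(node: &Node, x: &[f64]) -> usize {
    match node {
        Node::Leaf { class } => *class,
        Node::Split { feature, threshold, left, right } => {
            if x[*feature] <= *threshold {
                predict(left, x)
            } else {
                predict(right, x)
            }
        }
    }
}

fn main() {
    // If feature 0 <= 2.5 predict class 0, otherwise class 1.
    let tree = Node::Split {
        feature: 0,
        threshold: 2.5,
        left: Box::new(Node::Leaf { class: 0 }),
        right: Box::new(Node::Leaf { class: 1 }),
    };
    println!("{}", predict(&tree, &[3.0, 1.0, 7.0])); // 1
}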

CART models are simple and useful for interpretation. However, they typically are not competitive with the best supervised learning approaches, like Logistic and Linear Regression, especially when the response can be well approximated by a linear model. Tree-based methods are also non-robust, which means that a small change in the data can cause a large change in the final estimated tree. That’s why it is a common practice to combine the predictions from multiple trees in an ensemble to estimate predicted values.

In SmartCore, both classification and regression trees can be found in the tree module. Use DecisionTreeClassifier to fit a classification tree and DecisionTreeRegressor for regression.

To fit DecisionTreeClassifier to the Breast Cancer dataset:

use smartcore::dataset::*;
// DenseMatrix wrapper around Vec
use smartcore::linalg::naive::dense_matrix::DenseMatrix;
// Tree
use smartcore::tree::decision_tree_classifier::DecisionTreeClassifier;
// Model performance
use smartcore::metrics::roc_auc_score;
use smartcore::model_selection::train_test_split;
// Load dataset
let cancer_data = breast_cancer::load_dataset();
// Transform dataset into a NxM matrix
let x = DenseMatrix::from_array(
    cancer_data.num_samples,
    cancer_data.num_features,
    &cancer_data.data,
);
// These are our target class labels
let y = cancer_data.target;
// Split dataset into training/test (80%/20%)
let (x_train, x_test, y_train, y_test) = train_test_split(&x, &y, 0.2);
// Decision Tree
let y_hat_tree = DecisionTreeClassifier::fit(&x_train, &y_train, Default::default())
    .and_then(|tree| tree.predict(&x_test)).unwrap();
// Calculate test error
println!("AUC: {}", roc_auc_score(&y_test, &y_hat_tree));

Here we have used default parameter values, but in practice you will almost always use k-fold cross-validation or a hold-out validation dataset to fine-tune your parameter values.

Ensemble methods

In ensemble learning we combine predictions from multiple base models to reduce the variance of predictions and decrease the generalization error. The base models are assumed to be independent of each other. Bagging is one of the most straightforward ways to reduce correlation between the base models in the ensemble. It works by taking repeated samples from the same training data set. As a result we generate K different training data sets (bootstraps) that overlap but are not the same. We then train our base model on each bootstrapped training set and average the predictions for regression, or use a majority voting scheme for classification.
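
Here is a minimal sketch of the bagging idea in plain Rust (illustrative only, not the SmartCore implementation); the fit_and_predict closure and the tiny linear congruential generator below are placeholders for a real base learner and random number generator.

// Illustrative sketch of bagging: draw K bootstrap samples with replacement,
// fit one base model per bootstrap, and average the predictions for regression.
fn bootstrap_indices(n: usize, seed: &mut u64) -> Vec<usize> {
    (0..n)
        .map(|_| {
            // Simple linear congruential generator, good enough for illustration.
            *seed = seed.wrapping_mul(6364136223846793005).wrapping_add(1);
            ((*seed >> 33) as usize) % n
        })
        .collect()
}

fn bagged_prediction<F>(fit_and_predict: F, n_samples: usize, n_models: usize) -> f64
where
    F: Fn(&[usize]) -> f64, // trains one base model on a bootstrap and predicts for a fixed x_0
{
    let mut seed = 42u64;
    let mut sum = 0.0;
    for _ in 0..n_models {
        let indices = bootstrap_indices(n_samples, &mut seed);
        sum += fit_and_predict(&indices);
    }
    sum / n_models as f64 // average the base-model predictions (regression)
}

fn main() {
    // Toy base learner: "predict" the mean index of the bootstrap sample.
    let toy = |idx: &[usize]| idx.iter().sum::<usize>() as f64 / idx.len() as f64;
    println!("bagged prediction: {}", bagged_prediction(toy, 100, 10));
}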

Random Forest

Random forest is an extension of bagging that also randomly selects a subset of features when training each tree. This improvement decorrelates the trees and hence decreases the prediction error even more. Random forests have proven effective on a wide range of different predictive modeling problems.

Let’s fit a Random Forest regressor to the Boston Housing dataset:

use smartcore::dataset::*;
// DenseMatrix wrapper around Vec
use smartcore::linalg::naive::dense_matrix::DenseMatrix;
// Random Forest
use smartcore::ensemble::random_forest_regressor::RandomForestRegressor;
// Model performance
use smartcore::metrics::mean_squared_error;
use smartcore::model_selection::train_test_split;
// Load dataset
let boston_data = boston::load_dataset();
// Transform dataset into a NxM matrix
let x = DenseMatrix::from_array(
    boston_data.num_samples,
    boston_data.num_features,
    &boston_data.data,
);
// These are our target values
let y = boston_data.target;
// Split dataset into training/test (80%/20%)
let (x_train, x_test, y_train, y_test) = train_test_split(&x, &y, 0.2);
// Random Forest
let y_hat_rf = RandomForestRegressor::fit(&x_train, &y_train, Default::default())
    .and_then(|rf| rf.predict(&x_test)).unwrap();
// Calculate test error
println!("MSE: {}", mean_squared_error(&y_test, &y_hat_rf));

You should get a lower mean squared error here compared to the other methods in this manual. This is because by default Random Forest fits 100 independent trees to different bootstrapped training sets and calculates the target value by averaging the predictions from these trees.

The Random Forest classifier works in a similar manner. The only difference is that your prediction targets should be nominal or ordinal values (class labels).

References