
Top 5 Machine Learning Algorithms Explained

Exploring the most popular data science methods and their applications.


Machine Learning Algorithms

There are so many new data science algorithms and methods coming out every year that it can be overwhelming to learn all of them. Luckily, Kaggle’s State of Machine Learning and Data Science 2020 report, published in December 2020, is filled with statistics about Data Scientists from around the world, including age breakdowns, educational background, programming skills, and salary information.

This article will focus on the most popular machine learning (ML) algorithms, explaining each method and the idea behind them while providing examples of their applications along with other helpful articles detailing the code involved. Let’s dive in!

1. Linear & Logistic Regression

Linear Regression

Linear regression is the first algorithm that every data science enthusiast will come across. Although it’s a simple algorithm that you’ll see even in introductory statistics classes, its interpretability and ease of use make it incredibly practical, which explains why it’s the most common algorithm used by Data Scientists.

For simple linear regression, it is all about modeling the linear relationship between two variables by fitting a linear equation to your data. This equation is y=a+bx, where:

  • x is the explanatory/independent variable
  • y is the dependent variable
  • a is the intercept (y when x=0)
  • b is your slope

Below is an example of fitting a linear equation through data points on a graph.

Source: Wikimedia

One of the main uses of linear regression is to quantify how strongly your variables are related. This strength of correlation is measured by the correlation coefficient, which ranges from -1 to 1, where:

  • -1 to 0 → negatively correlated (as x increases, y decreases, and vice versa)
  • 0 → no correlation
  • 0 to 1 → positively correlated (as x increases, y increases)

We can also use linear regression for prediction. Once we know the relationship between x and y, we can predict y for any value of x.

Example:

Let’s say I have data on money spent on advertising (x) and sales (y).

By plotting my data and fitting a simple linear regression, I can learn two things:

  1. Does increasing advertising really increase my sales?
  2. If this year’s sales target is 10 million dollars, how much should I spend on advertising to reach that target?
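
To make this concrete, here is a minimal sketch using scikit-learn. The advertising and sales figures are made up purely for illustration, not real data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: advertising spend (millions) vs. sales (millions)
ad_spend = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # x, shape (n_samples, 1)
sales = np.array([2.5, 4.8, 7.1, 9.4, 11.6])              # y

model = LinearRegression()
model.fit(ad_spend, sales)

# 1. Does increasing advertising increase sales? Check the slope (b).
print("slope b:", model.coef_[0], "intercept a:", model.intercept_)

# 2. Predict sales for a new level of advertising spend.
print("predicted sales at 6M ad spend:", model.predict([[6.0]])[0])
```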

This is what simple linear regression can do. With calculus and linear algebra, we can increase the number of explanatory variables beyond one, which gives us multiple linear regression. There is also multivariate linear regression, where we can predict multiple correlated dependent variables rather than just one.

Now that you know a bit about linear regression, let’s move on to logistic regression.

Logistic Regression

Logistic regression is similar to linear regression, but it is used for classification. Also, instead of your dependent variable y being a continuous value like “10.1 million” or “7 feet,” it’s a binary value (for example, 0 or 1).

Logistic regression is a type of generalized linear model (GLM): it takes the same kind of linear combination of explanatory variables as ordinary linear regression and then uses a link function to convert that output into the binary response we want.

So how does logistic regression output these binary values? It does so using the sigmoid function, σ(x) = 1 / (1 + e^(-x)), also commonly known as a “squashing function” because it compresses any real-valued input into the range (0, 1).

Example

Given data on the number of hours spent studying, can we determine the probability of passing an exam?

Probability sample (source: Wikimedia)

Using the example from the above image, the sigmoid function takes data on hours spent studying and compresses it into probabilities between 0 and 1, where values near 0 mean not passing the test and values near 1 mean passing.
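
Here is a minimal sketch of that idea with scikit-learn. The study hours and pass/fail labels below are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied vs. pass (1) / fail (0)
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5], [5.0]])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(hours, passed)

# predict_proba returns [P(fail), P(pass)] for each input
print("P(pass | 2 hours):", clf.predict_proba([[2.0]])[0, 1])
print("P(pass | 4 hours):", clf.predict_proba([[4.0]])[0, 1])
```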

Summary

  • Linear regression and logistic regression are both supervised machine learning algorithms
  • Linear regression models the linear relationship between two or more variables and is used for regression (predicting continuous values)
  • Logistic regression is similar to linear regression but is used for classification, typically binary
  • Logistic regression is a type of generalized linear model, a generalization of the ordinary linear regression using link functions


2. Introducing Decision Trees & Random Forest

Decision trees are flowchart-like models of decisions and their consequences: a way to display an algorithm built only from conditional statements like “if,” “then,” and “else.”

Decision tree models are a supervised learning algorithm where the goal is to predict the value of a target variable based on several input variables. Because of their interpretability and simplicity, decision trees are among the most popular machine learning algorithms.

CART models

There are two categories of tree models, both under the umbrella term Classification and Regression Tree (CART).

  1. Classification trees: Target variables take a discrete set of values
  2. Regression trees: Target variables take continuous values

Metrics

Decision trees follow a top-down structure, and different algorithms use different metrics to measure the “best” way to split decision trees.

  • Regression trees → variance reduction
  • Classification trees → Gini Impurity, information gain and Chi-Square

You can read more about these models here.
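
As a quick illustration of one of these metrics, here is a small sketch of Gini impurity computed from the class labels at a node (the labels are made up):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

print(gini_impurity([1, 1, 1, 1]))  # 0.0   -> perfectly pure node
print(gini_impurity([1, 1, 0, 0]))  # 0.5   -> maximally impure for two classes
print(gini_impurity([1, 1, 1, 0]))  # 0.375 -> somewhere in between
```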

Decision Tree Example

Here is an example of regression trees in action:

Source: Wikimedia

This decision tree shows us the chance of survival for the Titanic passengers. Here are some helpful terms for decision trees and what they would be in this example:

  • Root node → gender
  • Decision nodes → age and sibsp (number of spouses or siblings)
  • Terminal/leaf node → died/survived (where the figures are the probability of survival and the percentage of observations)

Looking at this decision tree, we can conclude that the chances of survival are good if you were either:

  • Female
  • A male aged 9.5 or younger (age ≤ 9.5) with fewer than 3 siblings or spouses (sibsp < 3)
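
A tree like this can be trained in a few lines with scikit-learn. The sketch below is only illustrative: it assumes a pandas DataFrame named titanic with sex, age, sibsp, and survived columns, which you would have to load yourself:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumed DataFrame `titanic` with columns: sex, age, sibsp, survived
X = pd.DataFrame({
    "is_male": (titanic["sex"] == "male").astype(int),
    "age": titanic["age"].fillna(titanic["age"].median()),
    "sibsp": titanic["sibsp"],
})
y = titanic["survived"]

# Limit the depth so the tree stays small and interpretable, like the figure above
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```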

And we cannot have any meaningful understanding of decision trees without also getting into ensemble methods.

Ensemble methods

Instead of using a single decision tree, we can increase predictive performance by combining several decision trees together using ensemble methods.

By using an ensemble model, we group several weak learners together to form a strong learner. You can also think of this as a group of people making better conclusions than one individual. Two common ensemble methods are:

1. Boosting

  • A sequential process of training a series of weak classifiers, where each model attempts to correct the errors (“misclassifications”) of the previous model by giving the misclassified examples higher weights (“importance”). The final classifier is a weighted combination of all the weak classifiers.
  • The primary goal of boosting is to reduce bias (or “underfitting”)

A couple examples of boosting models are AdaBoost and XGBoost.
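
For instance, scikit-learn’s AdaBoostClassifier implements this idea with shallow decision trees (“stumps”) as the weak learners. A minimal sketch on a toy dataset, just to show the API:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Toy classification data, used only to illustrate the API
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 weak learners trained sequentially, each focusing on the previous model's mistakes
boost = AdaBoostClassifier(n_estimators=100, random_state=42)
boost.fit(X_train, y_train)
print("test accuracy:", boost.score(X_test, y_test))
```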

2. Bootstrap aggregated (or “bagging”)

  • Bagging involves building multiple decision trees by repeatedly resampling the training data with replacement (a process called bootstrapping), then reaching a final prediction by averaging the individual predictions or taking a majority vote
  • Bagging is primarily used to reduce variance

One example of bagging is the random forest model, a topic we will dive into next.

Random Forest Model


As mentioned previously, a random forest is a type of bagging algorithm. In bagging, we draw random samples from the training data with replacement when building trees. For random forest, however, we have the additional step of drawing random subsets of features to train the individual trees, i.e. when splitting the nodes.

With this random feature selection, each tree is more independent of the others than in basic bagging, which ultimately can reduce variance. Compared to simple decision trees, random forest models have the following advantages:

  1. Higher predictive performance, which means better bias-variance tradeoffs
  2. Faster than bagging because each tree is learning from only a subset of features instead of all of them

With this information, you can understand why it’s called a random forest: “random” comes from the random bootstrap samples and random feature subsets, and “forest” comes from building many trees.
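
Here is a minimal scikit-learn sketch that makes both sources of randomness explicit: bootstrap samples of the rows and a random subset of features at each split. The built-in breast cancer dataset is used only to illustrate the API:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(
    n_estimators=200,     # number of trees in the forest
    max_features="sqrt",  # random subset of features considered at each split
    bootstrap=True,       # each tree sees a bootstrap sample of the rows
    random_state=42,
)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```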

Example

At first, it might seem like a good idea to just use random forest models. After all, they have better performance. But for any model, it’s important to understand the trade-off between interpretability and prediction accuracy.

If you’re building a model to make predictions and it’s important to know which features matter more than others, then you should opt for decision trees, which offer that kind of interpretability. Their flowchart-like structure is simpler to interpret and understand.

In contrast, random forest models made up of multiple decision trees are difficult, though not impossible, to interpret. They also require longer training times and more computational resources. But if interpretability isn’t an issue and you’re working with a large dataset, random forest models might be the solution you need.

Summary

  • Decision trees are supervised learning algorithms that are used for classification and regression
  • CART is the umbrella term for classification and regression trees
  • Metrics are ways to measure the “best” way to split the nodes of decision trees
  • Ensemble methods are ways to combine several decision trees to increase predictive performance
  • Bagging is where you bootstrap random samples of your training data to build many trees, which reduces variance (overfitting)
  • Boosting is where each subsequent classifier puts more weight on the previous classifier’s errors so that the next iteration can correct them, which reduces bias (underfitting)
  • Random forest models are useful when you have a large dataset and interpretability isn’t a priority
  • Decision trees are useful when you want interpretable, easy and fast-to-train models


3. Gradient Boosting Machines


Gradient boosting machines (or GBMs) are a supervised ML technique used for both regression and classification. A GBM is an ensemble method that uses the concept of boosting: aggregating an ensemble of weak individual models to build a stronger final model. The weak learners can be any type of model, but decision trees are the most common, in which case the models are called “gradient boosted trees.”

What’s unique about gradient boosting is that it identifies the errors of the weak models and incrementally builds the final ensemble by optimizing a loss function with gradient descent. This use of gradient descent within boosting is how the name gradient boosting came about.

For further context, here’s a brief explanation of loss function and gradient descent:

Loss function

  • What are loss functions? Also known as cost functions, they are a measure that tells you how good a model’s coefficients are at fitting the data — or rather, how good your model is at making predictions.
  • Say you’re trying to predict house prices. If your prediction deviates a lot from the actual data, then your loss function outputs a very large number. If it’s close to the actual value, the loss function will give you a small number. Thus, this loss function depends on the error between the actual and estimated house prices.

Gradient descent

  • Gradient descent is an optimization algorithm that minimizes a function by iteratively moving in the direction of steepest descent, which is defined by the negative gradient
  • One way to visualize gradient descent is to picture yourself on top of a mountain, wanting to reach the lowest point. With this algorithm, you find the direction in which the mountain slopes downward (the negative gradient) and take steps in that direction (the descent).
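
To tie the two ideas together, here is a tiny sketch that uses gradient descent to minimize a squared-error loss for a one-parameter model y ≈ w * x. The house-price-style numbers are made up for illustration:

```python
import numpy as np

# Hypothetical data: house size vs. price, roughly price = 3 * size
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 5.9, 9.2, 11.8])

w = 0.0              # start with a bad guess for the slope
learning_rate = 0.01

for step in range(1000):
    predictions = w * x
    loss = np.mean((predictions - y) ** 2)          # squared-error loss function
    gradient = np.mean(2 * (predictions - y) * x)   # d(loss)/dw
    w -= learning_rate * gradient                   # step in the negative gradient direction

print("learned slope w:", w, "final loss:", loss)
```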

Examples

GBMs are extremely popular across many domains. XGBoost, an optimized GBM implementation, is one of the leading methods for winning online data science competitions. Other popular GBM frameworks include LightGBM and CatBoost.

Similar to random forest, GBM has the disadvantages of low interpretability and high computational demand.
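
Here is a minimal sketch with scikit-learn’s built-in gradient boosting regressor on a toy dataset; XGBoost, LightGBM, and CatBoost expose very similar fit/predict interfaces:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

gbm = GradientBoostingRegressor(
    n_estimators=300,    # number of shallow trees added one after another
    learning_rate=0.05,  # how strongly each new tree corrects the previous ones
    max_depth=3,         # keep the individual trees weak/shallow
    random_state=42,
)
gbm.fit(X_train, y_train)
print("test R^2:", gbm.score(X_test, y_test))
```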

Summary

  • GBMs are a supervised ML method that is used for both regression and classification
  • GBMs build an ensemble of shallow, weak trees, where each successive tree learns from and improves on the previous one using gradient descent. Combining them yields a final model with high predictive power.
  • Loss function is a way to measure how well your model makes predictions
  • Gradient descent is an optimization algorithm that minimizes the cost function by moving toward the steepest descent defined by the negative gradient


4. Convolutional Neural Networks

Convolutional Neural Networks (abbreviated CNN or ConvNet) are a type of neural network that has become the de facto machine learning algorithm for computer vision today.

Below, I’ll summarize how artificial neural networks work, along with an explanation of how CNNs extend regular neural networks for image recognition and classification.

Artificial Neural Networks (ANNs)

Neural networks (NNs) are architectures inspired by our biological brains, although the inner workings are not exactly the same.

Our brains are a connected network of cells called neurons. Each neuron receives input from other neurons and sends output to other neurons. Our brains learn by forming and destroying connections between neurons and by altering the strength of existing connections. Neural networks work similarly: they take numerical data as input and pass it through several layers of nodes, where each layer combines and transforms the values it receives before passing them on. Then, at the final layer, they produce an output.

Here is a helpful illustration of ANNs:


I mentioned earlier that our brain alters the strength of existing connections between our neurons, so how does an abstract and non-biological neural network alter its connections?

Backpropagation

A neural network learns through a process called backpropagation, or “backprop” for short. As you train a neural network, you feed it inputs along with the expected outputs. Backprop looks at the error between the actual output produced and the expected output, then alters the weights between the individual neurons to reduce that error.

This process happens iteratively, and after many iterations, the neural network eventually converges on weights that effectively model the pattern/relationship between the inputs and outputs.

Example

To give a quick example, you might train a neural network to recognize dogs by feeding it thousands of dog images. Computers don’t see images the way we do, so each image is converted into a numerical (vector) representation and passed through the input layer. Then, the hidden layers perform computations on that representation, and the output layer gives us a score that can be interpreted as “yes, this is a dog” or “no, this isn’t a dog.” If the NN gets it right, the weights that produced that answer are reinforced, and if it’s wrong, they are weakened.
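
The reinforcing and weakening of weights described above is exactly what one training step with backprop does. Here is a minimal PyTorch sketch with a tiny made-up network and random data, just to show the mechanics:

```python
import torch
import torch.nn as nn

# Tiny made-up network and random data, only to illustrate one backprop step
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(32, 10)                     # a batch of 32 examples
targets = torch.randint(0, 2, (32, 1)).float()   # the expected outputs (labels)

outputs = model(inputs)            # forward pass: actual output produced
loss = loss_fn(outputs, targets)   # error between actual and expected output
optimizer.zero_grad()
loss.backward()                    # backpropagation: compute gradients of the error
optimizer.step()                   # adjust the weights to reduce that error
```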

What about CNN?

Whether a neural network is, for example, convolutional or recurrent depends on the type of hidden layers used.

For CNN, the hidden layers are:

  • Convolutional layers
  • Pooling layers
  • ReLU layers
  • Fully connected layers

Simply put, these layers take an image, split it into smaller chunks, scan over each chunk to look for patterns, and pass the results to the next layer for similar operations. For example, the first layer might detect edges in the picture, later layers might combine those edges into recognizable objects, and pooling layers reduce the dimensions along the way.

It turns out that these layers make a CNN very effective when applied to image processing and even video processing tasks, which is why it is the established algorithm in the computer vision world.
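
As a rough sketch, a minimal CNN for 28x28 grayscale images (MNIST-sized inputs, chosen only for illustration) could look like this in PyTorch, using the layer types listed above:

```python
import torch
import torch.nn as nn

# Minimal CNN for 28x28 grayscale images
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # convolutional layer: scan for local patterns
    nn.ReLU(),                                    # ReLU layer: keep only positive activations
    nn.MaxPool2d(2),                              # pooling layer: shrink 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper features built from earlier ones
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                    # fully connected layer: 10 class scores
)

dummy_batch = torch.randn(8, 1, 28, 28)  # 8 fake grayscale images
print(cnn(dummy_batch).shape)            # torch.Size([8, 10])
```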

Summary

  • Neural networks are simplified models of our brains that allow computers to learn by themselves
  • The layers of an ANN are input, hidden and output
  • Neural networks learn by reinforcing or weakening the weights between nodes through backpropagation
  • CNNs differ from regular ANNs in the unique architecture of their hidden layers. They are commonly used in image classification and recognition, including facial recognition on smartphones and vision in self-driving cars.

Further Reading & Code Examples

  • PyTorch has an amazing tutorial for training a classifier using CNN
  • If you want to implement it in Keras, check out this article by DataCamp
  • Or if you prefer TensorFlow, check out their official tutorial

5. Bayesian Methods

Bayesian approaches are the application of Bayesian statistics, a branch of statistics that relies on Bayes’ rule.

If you forgot what Bayes’ rule looks like, here is the equation:

P(A|B) = P(B|A) * P(A) / P(B)

Using an example, let’s ask the question:

What is the probability that it will rain on a given cloudy day?

  • P(Rain) → prior
  • P(Cloud|Rain) → likelihood
  • P(Cloud) → evidence
  • P(Rain|Cloud) → posterior (what we want to compute)

The probability is thus:

P(Rain|Cloud) = P(Cloud|Rain) * P(Rain) / P(Cloud)

Essentially, what Bayes’ rule tells us is the probability of an event, based on prior knowledge of conditions that might be related to that event.

Another form of Bayes’ rule would thus be:

P(A|B) = likelihood * prior/evidence
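
Plugging some assumed numbers into the rain example makes this concrete. The probabilities below are purely illustrative, not measured data:

```python
# Assumed, purely illustrative probabilities
p_rain = 0.10               # prior: P(Rain)
p_cloud_given_rain = 0.80   # likelihood: P(Cloud | Rain)
p_cloud = 0.40              # evidence: P(Cloud)

# Bayes' rule: posterior = likelihood * prior / evidence
p_rain_given_cloud = p_cloud_given_rain * p_rain / p_cloud
print("P(Rain | Cloud) =", p_rain_given_cloud)  # 0.2
```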

Why use Bayesian approaches?

Bayesian approaches are all about starting with a belief (or prior) and getting data to update our belief. The outcome is called the posterior. As we get more data, this posterior becomes our new prior, and the cycle repeats. This is the process of Bayesian inference.
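
A classic way to see this prior-to-posterior cycle is a Beta-Binomial model for estimating a coin’s bias. The sketch below uses made-up observations: a Beta prior is updated with observed heads and tails, and each posterior becomes the prior for the next batch of data:

```python
from scipy import stats

# Start with a weak prior belief that the coin is roughly fair: Beta(2, 2)
alpha, beta = 2.0, 2.0

# Made-up observations arriving in batches: (heads, tails)
batches = [(7, 3), (6, 4), (9, 1)]

for heads, tails in batches:
    # Bayesian updating: the old posterior becomes the new prior
    alpha += heads
    beta += tails
    posterior_mean = alpha / (alpha + beta)
    low, high = stats.beta.interval(0.95, alpha, beta)  # 95% credible interval
    print(f"P(heads) estimate: {posterior_mean:.3f}, 95% credible interval: ({low:.3f}, {high:.3f})")
```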

Bayesian statistics provides a way to update our knowledge as more data becomes available. This makes it incredibly useful in today’s world, where data is abundant and real-time predictions are increasingly necessary.

A few advantages of Bayesian ML are:

  • Ability to incorporate prior knowledge and beliefs into models during training
  • Incremental improvement of models with Bayesian updating
  • Flexible feature modeling with Bayesian hierarchical modeling
  • Ability to quantify uncertainty of estimated model parameters and predictions

One disadvantage of Bayesian ML is that it can be computationally expensive to approximate, which explains why Bayesian inference isn’t yet mainstream in deep learning. But the good news is that we’re slowly moving in that direction.

Summary

  • Bayesian approaches rely on Bayes’ rule.
  • Bayesian ML allows us to encode our prior beliefs about what models should look like, independent of what the data tells us


And those are the top ML models/methods! I hope the information was helpful.

If you’re looking for a new machine learning challenge, check out bitgrit’s AI competitions.