by Matt Kirk
Matt Kirk is a data architect, software engineer, and entrepreneur based out of Seattle, WA. For years, he struggled to piece together his quantitative finance background with his passion for building software. Then he discovered his affinity for solving problems with data. Now, he helps multimillion-dollar companies with their data projects. From diamond recommendation engines to marketing automation tools, he loves educating engineering teams about methods to start their big data projects. To learn more about how you can get started with your big data project (beyond reading this book), check out matthewkirk.com for tips.
Stock prices.
Temperature.
Web app interactions.
Right now, I can pull data on any of these three things down to the millisecond. And that’s pretty amazing.
The possibilities are endless, but so what?
What exactly are you supposed to do with it? Having data doesn’t mean that you’ll end up with insights that solve real problems. Instead, data can become a huge distraction and maintenance cost to any project.
In this article, I’ll walk through how to find insight from data using Machine Learning in Python. This article will focus on the three classes of machine learning algorithms and how they apply to different problems, and it will also give you something to try out in your console.
By the end of this article, you won’t have a Ph.D. in machine learning – that would take seven years – but you will at least have enough information to get started learning about this fascinating subject. Before we discuss algorithms, we will discuss what machine learning aims to solve: transforming data into insight.
Machine learning transforms data into insight
Our goal is to achieve insight. To take that a bit further, let’s look at something called the Data, Information, and Insight pyramid. This structure serves as a hierarchy of needs when it comes to knowledge.
At the very bottom, you have data, which is the cornerstone of all of our knowledge. Data could be stock prices throughout the day or temperature measurements. On their own, these raw values aren’t interesting, and looking through them is sometimes an impossible task. You can’t read through a billion records of temperature data.
In the middle is information. Information is where you take some data and aggregate it together into something more useful. For instance, this could be a max, a min, or an average of data points. If you’ve ever tuned into CNBC, you have most likely seen stock price highs and lows or the average return for that day. This aggregation is information.
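As a minimal sketch of that step from data to information, the snippet below aggregates a handful of temperature readings into a min, a max, and an average. The readings are made-up values used only for illustration.

```python
from statistics import mean

# Hypothetical temperature readings in degrees Celsius (made-up values).
readings = [2.5, 3.1, 1.8, 4.0, 2.6]

# Aggregating the raw data points turns data into information.
summary = {
    "min": min(readings),
    "max": max(readings),
    "mean": mean(readings),
}
print(summary)
```

Three numbers are far easier to reason about than the raw readings, which is the whole point of the middle layer of the pyramid.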
Lastly, at the top and most useful, is insight. Insight is subjective. For instance, if the weather is cold for an extended period, then most likely it is winter. As humans, we are naturally gifted at pattern matching and, therefore, at determining insight.
How do we find insights: deduction vs. induction?
There are lots of ways of going about finding insights. The most obvious way of finding insight is to explicitly code it. For example, we could match terms or select specific instances in the data.
But there’s a limit to that: humans can only find so much. Also, while we’re good at matching patterns, we can have difficulty with statistical reasoning. We are almost always biased.
Explicit programming is a deductive approach to finding insight. A lot of artificial intelligence algorithms will spend computing power “planning” or “searching” based on heuristics. However, deductive reasoning is difficult to apply towards data since we sometimes don’t know what to look for.
Another option is to use data as the center of attention. If we were able to take data and derive some commonalities in it, then we could build our software off of that. Using data to influence programming is inductive. Induction is at the center of machine learning.
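To make the contrast concrete, here is a small sketch using the winter example from earlier. The observations, the hand-picked cutoff of 5 degrees, and both function names are hypothetical and exist only for this illustration. The deductive version hard-codes the rule, while the inductive version derives the cutoff from labeled data.

```python
# Hypothetical labeled observations: (temperature in Celsius, is_winter).
observations = [(-3, True), (1, True), (4, True), (12, False), (18, False), (25, False)]

def deductive_is_winter(temp_c):
    """Deductive: a human writes the rule and picks the cutoff by hand."""
    return temp_c < 5

def learn_threshold(data):
    """Inductive: pick the cutoff that misclassifies the fewest examples."""
    temps = sorted(t for t, _ in data)
    # Candidate cutoffs are the midpoints between neighbouring temperatures.
    candidates = [(a + b) / 2 for a, b in zip(temps, temps[1:])]

    def errors(cutoff):
        return sum((t < cutoff) != label for t, label in data)

    return min(candidates, key=errors)

threshold = learn_threshold(observations)
print(threshold)
```

If the observations change, the learned cutoff changes with them, whereas the hand-written rule stays wherever its author left it.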
Inductive reasoning: Machine Learning
Machine Learning is a collection of algorithms that learn from data without being explicitly programmed.
This point is important: Machine Learning is a group of algorithms that don’t need a human to spell out the rules; instead, they can just be set to run on specific data.
Inside of machine learning, though, there are different classes of learning. The classes are supervised, unsupervised, and reinforcement learning.
To explain the differences between each of the learning classes, I like to use the following table. It outlines what the goal of each class is conceptually:
Class  Function  Goal
Supervised  f(x) = y  Map inputs to outputs
Unsupervised  f(x) = x  Map input onto itself
Reinforcement  max R  Maximize long term reward
Supervised Learning: Map Inputs to Outputs
This class of machine learning algorithms is the most popular. Most people want to take data inputs and outputs and build some model from them, whether it’s for finding the optimal portfolio, spam filtering, churn analysis, or anything else.
In this section, I’ll explain what the intuitive goal of supervised learning problems is, what algorithms exist, and what sort of Python libraries are available for use as well as their pros and cons.
The Goal
Supervised learning is probably the most intuitive of the learning problems. It asks the following question:
Based on previous data, can we build a model that predicts new data?
Supervised Learning can come in two different styles: regression or classification.
Regression is where you want to determine a number given some data. For instance, let’s say we want to find the predicted return of a stock investment. Regressions base their predictions on factors that we have available. We could then take that information, feed it through a model and then receive a number.
Classification is when we want a particular class label as the answer. A lot of classification problems are binary classification (true/false) where you are just looking for a yes or no answer. Spam classification is a pretty classic example where you want to determine whether a given email is spammy or not.
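One way to check whether a model really predicts new data is to hold some rows back during training and score the model only on those. The sketch below does this with scikit-learn’s `train_test_split` on the Iris dataset; the 30% split and the `random_state` values are arbitrary choices made for this illustration.

```python
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Hold back 30% of the rows so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print("train accuracy: %.3f, test accuracy: %.3f" % (train_accuracy, test_accuracy))
```

The accuracy on the held-out rows is the more honest number, since a model can score very well on rows it has already seen.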
The Algorithms
The traditional algorithms for supervised learning are:
 Decision Trees
 K-Nearest Neighbors
 Linear Regression
 Logistic Regression
 Naive Bayes
 Neural Networks
Here’s an overview of each of them.
Decision Trees
Decision trees are one of the most intuitive algorithms: they create a tree that branches on conditions. Trees are similar to how if/else works inside of most programming languages (like Python). An example I used in my book Thoughtful Machine Learning with Python is using attributes of mushrooms to classify whether they were poisonous or not. Decision trees are available in scikit-learn or Turi.
```python
from sklearn import datasets
from sklearn import tree

iris = datasets.load_iris()

# Fit a decision tree on the full dataset, then predict the same rows.
clf = tree.DecisionTreeClassifier()
y_pred = clf.fit(iris.data, iris.target).predict(iris.data)

print("Number of mislabeled points out of a total %d points : %d"
      % (iris.data.shape[0], (iris.target != y_pred).sum()))
# Number of mislabeled points out of a total 150 points : 0
```
K-Nearest Neighbors
K-Nearest Neighbors is a simple algorithm that looks at a query point and determines a class label or numeric value based on the “k” nearest points (or neighbors). It is used in estimating the value of real estate properties and can be quite good as a simple classifier or regressor.
```python
from sklearn import datasets
from sklearn import neighbors

iris = datasets.load_iris()

# The default classifier votes among the 5 nearest neighbors.
knn = neighbors.KNeighborsClassifier()
y_pred = knn.fit(iris.data, iris.target).predict(iris.data)

print("Number of mislabeled points out of a total %d points : %d"
      % (iris.data.shape[0], (iris.target != y_pred).sum()))
# Number of mislabeled points out of a total 150 points : 5
```
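To show what “looking at the k nearest points” means without a library, here is a minimal from-scratch sketch. The function name `knn_predict`, the points, and the labels are all made up for illustration, and the code uses plain Euclidean distance with a majority vote.

```python
import math
from collections import Counter

def knn_predict(points, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest points."""
    # Sort the training points by Euclidean distance to the query.
    by_distance = sorted(
        zip(points, labels), key=lambda pair: math.dist(pair[0], query)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D points with made-up labels.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

prediction = knn_predict(points, labels, query=(1.1, 1.0))
print(prediction)
```

There is no training step at all here: the “model” is just the stored points, which is why the algorithm is so simple and why prediction gets slower as the dataset grows.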
Linear Regression
Linear regression is a class of algorithms that attempts to fit a line to some data points. Linear regressions work well when the relationship in the data is roughly linear and have applications in finance, as well as mapping movie reviews to specific factors like actors or directors.
```python
import math

from sklearn import datasets
from sklearn import linear_model
from sklearn import metrics

iris = datasets.load_iris()

# Treat the class labels (0, 1, 2) as numbers and fit a line to them.
lm = linear_model.LinearRegression()
y_pred = lm.fit(iris.data, iris.target).predict(iris.data)

print("Root Mean Squared Error : %f"
      % math.sqrt(metrics.mean_squared_error(iris.target, y_pred)))
# Root Mean Squared Error : 0.215372
```
Logistic Regression
Logistic regression is a take on regression that attaches a probability to a classification by using something called a sigmoid function (an S-shaped curve that maps any number to a value between 0 and 1). Logistic regression is used a lot because of how fast it is. Google uses this all the time to train models for simple classifications.
```python
from sklearn import datasets
from sklearn import linear_model

iris = datasets.load_iris()

# A generous iteration cap helps the solver converge without warnings.
lr = linear_model.LogisticRegression(max_iter=1000)
y_pred = lr.fit(iris.data, iris.target).predict(iris.data)

print("Number of mislabeled points out of a total %d points : %d"
      % (iris.data.shape[0], (iris.target != y_pred).sum()))
# Number of mislabeled points out of a total 150 points : 6
# (the count may differ with your scikit-learn version and solver settings)
```
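The sigmoid function mentioned above is short enough to write by hand. The sketch below shows it turning a weighted sum of features into a probability; the weights and bias are made-up numbers chosen for illustration, not values learned from any dataset.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and bias, made up for illustration only.
weights = [0.8, -0.4]
bias = -0.2

def predict_probability(features):
    """Probability of the positive class for one row of features."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

print(sigmoid(0))
print(predict_probability([2.0, 1.0]))
```

A score of zero maps to a probability of one half, and training a logistic regression amounts to finding the weights that push the scores of the two classes to opposite sides of zero.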
Naive Bayesian Classifier
Naive Bayes classifiers are among the most famous algorithms. A Naive Bayes classifier takes a bunch of features and uses their probabilities to determine which class is most likely. The “naive” part comes from the assumption that each feature is independent of the others. So, for instance, with spam classification the word “prince” might show up more often in spammy emails, which makes an email containing it more likely to be spam.
```python
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()

# GaussianNB models each feature as a normal distribution per class.
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)

print("Number of mislabeled points out of a total %d points : %d"
      % (iris.data.shape[0], (iris.target != y_pred).sum()))
# Number of mislabeled points out of a total 150 points : 6
```
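The spam example can also be sketched by hand with word counts. Everything below is hypothetical: the six training emails are invented, and the code uses add-one (Laplace) smoothing, a common choice that the article itself does not prescribe.

```python
import math
from collections import Counter

# Hypothetical, hand-made training emails.
spam = ["prince needs money", "win money now", "prince offers prize"]
ham = ["meeting notes attached", "lunch at noon", "project notes and money report"]

def word_counts(docs):
    return Counter(word for doc in docs for word in doc.split())

spam_counts, ham_counts = word_counts(spam), word_counts(ham)
vocabulary = set(spam_counts) | set(ham_counts)

def log_likelihood(message, counts):
    """Sum of per-word log probabilities with add-one (Laplace) smoothing."""
    total = sum(counts.values())
    return sum(
        math.log((counts[word] + 1) / (total + len(vocabulary)))
        for word in message.split()
    )

def is_spam(message):
    # Class priors are equal here because both classes have three emails.
    return log_likelihood(message, spam_counts) > log_likelihood(message, ham_counts)

print(is_spam("prince sends money"))
print(is_spam("notes from lunch"))
```

Each word contributes its own term to the sum independently of the others, which is exactly the “naive” assumption at work.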
Neural Networks
Neural Networks are the area of supervised learning with the most momentum right now, thanks to the popularity of Deep Learning (which is just bigger and more exciting neural networks). Neural nets are a class of algorithms that connect nodes that each perform a simple calculation. Together they can do things like language classification, image detection, and much more.
```python
from sklearn import datasets
from sklearn import neural_network

iris = datasets.load_iris()

# One hidden layer of 20 nodes; the trailing comma makes (20,) a tuple.
mlp = neural_network.MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000)
y_pred = mlp.fit(iris.data, iris.target).predict(iris.data)

print("Number of mislabeled points out of a total %d points : %d"
      % (iris.data.shape[0], (iris.target != y_pred).sum()))
# Number of mislabeled points out of a total 150 points : 4
# (the exact count varies between runs because the weights start out random)
```
The Packages
For supervised learning problems, there are a lot of good Python packages out there, like Turi (GraphLab), scikit-learn, Theano, and TensorFlow.
Algorithm  Packages Available
Decision Trees  scikit-learn, Turi
K-Nearest Neighbors  scikit-learn, Turi
Linear Regression  scikit-learn, Turi
Logistic Regression  scikit-learn, Turi
Naive Bayes  scikit-learn, Turi
Neural Networks  TensorFlow, scikit-learn, Turi, Theano
Support Vector Machines  LIBSVM, scikit-learn, Turi
Unsupervised Learning: Map Data onto itself
The Goal
Unsupervised learning is a little peculiar in that it’s trying to take inputs and build a model that reproduces those same inputs. This might not make a lot of sense on first reading, but the idea is to build a representation of what exists instead of trying to build some prediction mechanism.
The Algorithms
Unsupervised Learning methods fall into a few subcategories: clustering, dimensionality reduction, and deep learning.
Clustering takes a set of data and tries to build a cluster mapping for each data point. The idea is to take some vector x, which might have many different elements, and attach a label to it. Labels can then be used in visualization, either by coloring or by splitting up the data.
Dimensionality reduction is used as a way of overcoming the curse of dimensionality, which is a problem in many machine learning algorithms. The idea here is to take dimensions, which can number in the thousands, and represent the same data with fewer dimensions. As you can imagine, the simpler the data is, the more stable the results will be.
It might surprise you to see deep learning here. Deep learning has become quite popular lately, but what it is effectively doing is detecting features without any human intervention. Some deep learning algorithms have this amazing ability to take an image, identify features in it, and feed those into another model (like a supervised model).
Some popular algorithms in this category are:
Algorithm  Subclass
K-Means Clustering  Clustering
EM Clustering  Clustering
Principal Component Analysis  Dimensionality Reduction
Independent Component Analysis  Dimensionality Reduction
Autoencoders / Convolutional Neural Nets  Deep Learning
Again, I could spend an entire article on each one of these individually, but instead I want you to get a good idea of what they are. In general, there are two classes: clustering and dimension transformations.
Clustering
Clustering algorithms aim to take a dataset that is unlabeled and put it into labeled categories. So, for instance, you could take a dataset like Iris and put it into three categories. These categories might not have any information attached to them (or a name), but the points are categorized.
K-Means clustering is one of the simplest clustering algorithms. The idea is simply to state that you want K clusters and to find K centroids, assigning each point to its nearest centroid. In practice, this will take a dataset like the Iris dataset and put it into K clusters.
```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, registers 3D on older Matplotlib
from sklearn import cluster
from sklearn import datasets

iris = datasets.load_iris()

# Ask for three clusters; random_state makes the run repeatable.
k_means = cluster.KMeans(n_clusters=3, n_init=10, random_state=5)
k_means.fit(iris.data)
labels = k_means.labels_

# Plot three of the four features in 3D, colored by cluster label.
fig = plt.figure(1, figsize=(4, 3))
ax = fig.add_subplot(projection="3d")
ax.view_init(elev=48, azim=134)
ax.scatter(iris.data[:, 3], iris.data[:, 0], iris.data[:, 2], c=labels.astype(float))

# Hide the tick labels to keep the small figure readable.
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_zticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
plt.show()
```
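To see what finding K centroids involves, here is a minimal from-scratch sketch of the usual assign-then-update loop (often called Lloyd’s algorithm). The function name, the points, and the hand-picked starting centroids are all hypothetical; libraries such as scikit-learn choose starting centroids more carefully.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and moving centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(point, centroids[i])
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its points.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
# Starting centroids are picked by hand for this illustration.
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

On data this cleanly separated the loop settles after the first pass; on real data the result can depend on where the centroids start, which is why library implementations usually try several starting points.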
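EM Clustering is an extension of K-Means that looks for clusters that aren’t circular in shape, such as stretched or tilted ellipses. Non-circular clusters can be quite useful for datasets whose groups are not evenly spread in every direction. EM clustering also has the added benefit of giving each point a probability of belonging to each cluster instead of only a hard label. Note that the scikit-learn version used below still needs the number of components up front.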
```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, registers 3D on older Matplotlib
from sklearn import mixture
from sklearn import datasets

iris = datasets.load_iris()

# Fit three Gaussian components, each with its own full covariance matrix.
em = mixture.GaussianMixture(n_components=3, covariance_type='full', random_state=5)
em.fit(iris.data)
labels = em.predict(iris.data)

# Plot three of the four features in 3D, colored by cluster label.
fig = plt.figure(1, figsize=(4, 3))
ax = fig.add_subplot(projection="3d")
ax.view_init(elev=48, azim=134)
ax.scatter(iris.data[:, 3], iris.data[:, 0], iris.data[:, 2], c=labels.astype(float))

# Hide the tick labels to keep the small figure readable.
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_zticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
plt.show()
```
Dimension Transformations
Dimension transformations aim to take a dataset with n dimensions and transform it into a new dataset with either fewer or more dimensions. Famous transformations are Principal Component Analysis, Independent Component Analysis, and Deep Learning methods like Autoencoders or Convolutional Neural Nets.
Principal Component Analysis is a dimension reduction algorithm that takes numerical data and factors it into a small set of directions (components) along which the data varies the most. The idea is to take a linear component space and determine the most important vectors in it. An excellent visualization is something called Eigenfaces, which takes human faces and turns them into sort of average-looking faces.
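As a short sketch of that idea, the snippet below uses scikit-learn’s `PCA` to squeeze the four Iris measurements down to two components and reports how much of the variance each component keeps. Choosing two components is an arbitrary decision made for this illustration.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()

# Keep the two directions along which the data varies the most.
pca = PCA(n_components=2)
reduced = pca.fit_transform(iris.data)

print(reduced.shape)
print(pca.explained_variance_ratio_)
```

The explained variance ratio is a handy check on how much information was lost: if the kept components account for nearly all of the variance, the dropped dimensions were contributing very little.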
Independent Component Analysis (ICA), unlike Principal Component Analysis (PCA), aims to pull out independent components like “nose” or “ears.” So instead of averaging out all the features, it would detect very specific features. ICA can be useful for reducing noise in data.
Lastly, we have the frontier, which is Deep Learning. Autoencoders and Convolutional Neural Nets aim to take something like an image and derive new features from it. Autoencoders try to find a compact version of the original information, whereas convolutional neural networks aim to create new features from the data.
The Packages
In general, there are quite a few tools at our disposal for unsupervised learning, through scikit-learn, Turi, or TensorFlow.
Algorithm  Package available in
K-Means Clustering  scikit-learn, Turi
EM Clustering  scikit-learn, Turi
Principal Component Analysis  scikit-learn
Independent Component Analysis  scikit-learn
Autoencoders  TensorFlow
Convolutional Neural Nets  TensorFlow
Reinforcement Learning: Win the game
Lastly, my favorite category is reinforcement learning. This category is about winning over time instead of at a particular moment in time. Think of this class of algorithms as not worrying about losing the battle if it wins the war.
The Goal
Reinforcement learning is entirely different from supervised and unsupervised learning. Instead of trying to map a function to some data at a particular point in time, reinforcement learning algorithms work to maximize some long term reward.
So, for instance, think of a game like Chess. You want to win. Reinforcement Learning attempts to take into consideration the actions you can take as well as the state you are currently in to determine the best move or policy.
The Algorithms
Reinforcement Learning is a bit newer than supervised and unsupervised learning and, as such, doesn’t have nearly as many algorithms, but there are still some highly useful ones like these:
 Q-Learning
 TD-Lambda
 Multi-armed bandits
Each of these algorithms is intriguing in its own right and could easily constitute an article apiece. I highly recommend you read Sutton’s book on reinforcement learning.
Q-Learning
Q-Learning, which is closely related to Value Iteration and Policy Iteration, attempts to solve something called the Bellman equation. In the 1950s, Bellman was studying optimal control theory and wanted to maximize some long term discounted reward. He came up with a recursive equation, now called the Bellman equation. Q-Learning takes that a bit further to solve for a particular value Q (which stands for quality). The higher the Q value of an action in a given state, the better that action is expected to be for playing whatever game it is.
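A minimal tabular sketch can make the Q value less abstract. The environment below is a hypothetical toy: five states in a row, a reward of 1 for reaching the rightmost one, and two actions (left and right). The hyperparameters, the `step` function, and the episode count are all made up for this illustration.

```python
import random

N_STATES = 5          # states 0..4 in a row; state 4 is the goal
ACTIONS = (-1, +1)    # move left or move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # hypothetical hyperparameters

def step(state, action):
    """Hypothetical toy environment: reward 1 only on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

rng = random.Random(0)
q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action index]

for _ in range(500):
    state = 0
    while state != N_STATES - 1:
        if rng.random() < EPSILON:
            a = rng.randrange(2)                      # explore
        else:
            best = max(q[state])                      # exploit, ties broken at random
            a = rng.choice([i for i in (0, 1) if q[state][i] == best])
        next_state, reward = step(state, ACTIONS[a])
        # Move Q towards the reward plus the discounted best next value.
        target = reward + GAMMA * max(q[next_state])
        q[state][a] += ALPHA * (target - q[state][a])
        state = next_state

policy = ["left" if row[0] > row[1] else "right" for row in q[:-1]]
print(policy)
```

The update line is the Bellman idea in miniature: each Q value is nudged towards the immediate reward plus the discounted value of the best action available in the next state.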
TD Lambda
TD Lambda, a form of Temporal Difference learning, can be explained using weather. While you could try to predict the weather based on years of weather data, the more practical solution is to look at the last week’s worth of data. TD Lambda takes this further by weighting data points so that newer ones count more. So today’s weather is mostly influenced by what the weather was yesterday.
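The weighting towards newer data shows up in code as an eligibility trace that decays at every step. The sketch below is the textbook accumulating-trace form of TD Lambda, run on a hypothetical five-state random walk rather than on weather data; the step size, the lambda value, and the episode count are arbitrary choices for illustration.

```python
import random

N = 5                                   # non-terminal states 1..5; 0 and 6 are terminal
ALPHA, GAMMA, LAMBDA = 0.02, 1.0, 0.8   # hypothetical settings
rng = random.Random(0)
values = [0.0] * (N + 2)                # value estimate per state; terminals stay at 0

for _ in range(5000):
    traces = [0.0] * (N + 2)            # eligibility trace per state
    state = 3                           # every episode starts in the middle
    while state not in (0, N + 1):
        next_state = state + rng.choice((-1, 1))
        reward = 1.0 if next_state == N + 1 else 0.0
        # One-step TD error: how wrong was the current estimate?
        td_error = reward + GAMMA * values[next_state] - values[state]
        traces[state] += 1.0
        for s in range(1, N + 1):
            # Recently visited states receive more of the correction.
            values[s] += ALPHA * td_error * traces[s]
            traces[s] *= GAMMA * LAMBDA
        state = next_state

print([round(v, 2) for v in values[1:N + 1]])
```

Each estimate approximates the chance of finishing at the right-hand end from that state, so the values should rise from left to right. A lambda of 0 would update only the most recent state, while a lambda close to 1 spreads each correction far back along the path.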
Multi-armed Bandits
Finally, multi-armed bandits, or n-armed bandits, are highly useful for things like A/B tests. Imagine you are running an A/B test and want to determine the winner. Naively, we all think about splitting the traffic 50/50 between A and B. But that, unfortunately, doesn’t take into consideration that perhaps A is better than B. What a multi-armed bandit does is split the traffic using the information it has collected. It does this by trading off exploitation against exploration and eventually comes to a good enough answer.
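One simple way to make that trade-off is an epsilon-greedy strategy, sketched below on simulated traffic. The conversion rates, the function name, and the 10% exploration rate are all hypothetical, and epsilon-greedy is only one of several bandit strategies.

```python
import random

def epsilon_greedy(true_rates, rounds=10000, epsilon=0.1, seed=0):
    """Send traffic to variants, favouring whichever looks best so far."""
    rng = random.Random(seed)
    pulls = [0] * len(true_rates)
    wins = [0] * len(true_rates)
    for _ in range(rounds):
        if rng.random() < epsilon or 0 in pulls:
            # Explore: pick a variant at random.
            arm = rng.randrange(len(true_rates))
        else:
            # Exploit: pick the variant with the best observed rate.
            arm = max(range(len(true_rates)), key=lambda i: wins[i] / pulls[i])
        pulls[arm] += 1
        # Simulated visitor: converts with the variant's hidden true rate.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
    return pulls, wins

# Hypothetical conversion rates for variants A and B.
pulls, wins = epsilon_greedy([0.05, 0.15])
print(pulls)
```

Unlike a fixed 50/50 split, most of the traffic ends up on the better-performing variant while the test is still running, at the cost of a slower, noisier estimate for the weaker one.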
Together these algorithms are the new frontier, with things like AlphaGo and Deep Learning taking off. There aren’t a lot of open source packages for reinforcement learning yet, but I hope that will change soon.
The Packages
While you can’t find a lot of off-the-shelf packages for programming reinforcement learning algorithms, there is the OpenAI Gym, which serves as a way to test out different reinforcement learning algorithms. I do recommend checking out the great work by Shangtong Zhang, who converted Sutton’s original book into Python examples (https://github.com/ShangtongZhang/reinforcementlearninganintroduction).
Algorithm  Package
Q-Learning  No packages
TD-Lambda  No packages
Multi-armed bandits  slots
How it all relates
Together, all these algorithms make up the space of Machine Learning. We’ve talked about Supervised Learning, Unsupervised Learning, and Reinforcement Learning, as well as how they relate to Python and the packages usable for them.
Machine Learning is a fascinating subject that has all kinds of applications: from classifying spam to playing Chess. As you’ve seen in this article, it is well suited for deriving insight out of data. The added benefit is that, through the use of Python, it is quite simple to implement these algorithms by using packages like scikit-learn or TensorFlow.
If you want to read some books on machine learning, I recommend checking out these: Thoughtful Machine Learning with Python, Machine Learning by Peter Flach, and Python Machine Learning. Of course, you can take the Coursera course from Andrew Ng as well.
I hope you have followed along with the examples and have found something useful out of them. I highly recommend you check out one of the books on machine learning that is out there for a more in depth approach.
If you found this article interesting, check out my email list that talks about starting a successful machine learning project https://matthewkirk.com/?ml or follow me on Twitter (https://www.twitter.com/mjkirk). I’d love to hear from you.