MIT OCW - Introduction to Machine Learning - Week 1: Basics - Introduction

General information

The main focus of machine learning is making decisions or predictions based on data.

A human still has to frame the problem:

  • Acquire and organize data.

  • Design a space of possible solutions.

  • Select a learning algorithm and its parameters.

  • Apply the algorithm to the data.

  • Validate the resulting solution to decide whether it’s good enough to use.

In general, we need to solve these two problems:

  • Estimation: When our data are noisy reflections of some underlying quantity, we have to aggregate them and make estimates or predictions about that quantity. For example, what if the same treatment ends up with different results on different trials? And how can we predict how well an estimate may compare to future results?

  • Generalization: How can we predict the results of a situation or experiment that we have never encountered before in our data set?

Each problem can be described using six characteristics, three of which characterize the problem and three of which characterize the solution:

  1. Problem class: It’s about the nature of the training data and what kinds of queries will be made.

  2. Assumptions: What do we know about the source of the data or the form of the solution?

  3. Evaluation criteria: What is the goal of the prediction or estimation system? How will the answers to individual queries be evaluated? How will the overall performance of the system be measured?

  4. Model type: Will an intermediate model be made? What aspects of the data will be modeled? How will the model be used to make predictions?

  5. Model class: What particular parametric class of models will be used? What criterion will we use to pick a particular model from the model class?

  6. Algorithm: What computational process will be used to fit the model to the data and/or to make predictions?

Problem class

There are many different problem classes in machine learning. They vary according to what kind of data is provided and what kind of conclusions are to be drawn from it.

In this course provided by MIT OpenCourseWare, I have the chance to focus on classification and regression (two examples of supervised learning), and will touch on reinforcement learning and sequence learning.

Supervised learning

The basic concept: Imagine we have the following dataset:

\[D_n = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(n)}, y^{(n)})\}\]

where $x^{(i)}$ is an input and $y^{(i)}$ is the corresponding output.

The idea of supervised learning is to find some mapping or relationship between $x$ and $y$ so that in the future, when we are given an $x$, we can return a $y$ as a prediction.

For example, the $x$ might be an image, and the $y$ might be whether it shows a face, a hand, or some other body part.

Classification: As another example, let’s take $x$ to be a vector in a $d$-dimensional space:

\[x^{(i)} \in \mathbb{R}^d\]

Let’s consider the case where $y$ is drawn from a discrete set of values:

\[y^{(i)} \in \{-1, 1\}\]

This is called a classification problem, and it is a kind of supervised learning. Moreover, since there are only two discrete values here, it is called a binary classification problem.

For example, if we are given an image and have to answer whether it is a face or not, the answer will be one of only two cases: yes or no.

More broadly, if we are given a set of songs, we first have to convert each one into a vector in a $d$-dimensional space. This conversion is called a feature representation:

\[\varphi(x): \text{feature representation} \in \mathbb{R}^d.\]
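As a minimal sketch (the song attributes and scaling are invented purely for illustration), a feature representation might map a song’s raw attributes to a fixed-length numeric vector:

```python
import numpy as np

def phi(song: dict) -> np.ndarray:
    """Hypothetical feature representation: map a song's raw
    attributes to a vector in R^3."""
    return np.array([
        song["duration_sec"] / 600.0,          # scaled duration
        song["tempo_bpm"] / 200.0,             # scaled tempo
        1.0 if song["is_acoustic"] else 0.0,   # binary flag
    ])

x = phi({"duration_sec": 240, "tempo_bpm": 120, "is_acoustic": True})
print(x)  # [0.4 0.6 1. ]
```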

Regression: It is like classification, except that the output $y^{(i)}$ is real-valued rather than drawn from a discrete set.

Hypothesis: A rule that lets us take an $x$ as input and return a $y$ as a result.

Loss function: How can we evaluate if a hypothesis or a prediction is better than another one? This is the basic idea of a loss function. It can be represented as:

\[L(g, a)\]

where $g$ (guess) and $a$ (actual) come from the set of possible $y$ values. For example, in the binary classification problem above, $g \in \{-1, +1\}$ and $a \in \{-1, +1\}$.

The loss function can be summed up by a simple question: how sad are we when we predict $g$ and $a$ is the correct answer?

Training set error: Our goal is to optimize the hypothesis so as to reduce the loss on new data:

\[\mathcal{E}(h) = \frac{1}{n'} \cdot \sum_{i = n + 1}^{n + n'} \mathcal{L}(h(x^{(i)}), y^{(i)})\]

This value, the test error, is our true optimization goal. But it is computed on a new dataset that we do not have yet, so we cannot operate on it directly.

What we can do instead is optimize over the available dataset:

\[\mathcal{E}_n(h) = \frac{1}{n} \cdot \sum_{i = 1}^{n} \mathcal{L}(h(x^{(i)}), y^{(i)})\]

This is called the training set error.
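As a minimal sketch (the hypothesis and dataset here are made up), computing the training set error is just averaging the loss over the training pairs:

```python
import numpy as np

def training_error(h, loss, X, Y):
    """Average loss of hypothesis h over the training pairs (x, y)."""
    return np.mean([loss(h(x), y) for x, y in zip(X, Y)])

# Hypothetical linear hypothesis evaluated under squared loss.
h = lambda x: 2.0 * x + 1.0
squared_loss = lambda g, a: (g - a) ** 2

X = np.array([0.0, 1.0, 2.0])
Y = np.array([1.1, 2.9, 5.2])
print(training_error(h, squared_loss, X, Y))
```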

Unsupervised learning

Unsupervised learning doesn’t involve learning a function from inputs to outputs based on a set of input-output pairs. Instead, one is given a data set and generally expected to find some patterns or structure inherent in it.

Density estimation:

Given samples:

\[x^{(1)}, \dots, x^{(n)} \in \mathbb{R}^D\]

Each sample $x^{(i)}$ is a $D$-dimensional vector:

\[x^{(i)} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_D^{(i)} \end{bmatrix} \in \mathbb{R}^D\]

These samples are drawn i.i.d. from some distribution $\Pr(X)$.

The goal is to predict the probability $\Pr(x^{(n+1)})$ of a new element drawn from the same distribution.
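As a minimal sketch (assuming, purely for illustration, that the underlying distribution is a one-dimensional Gaussian), we can estimate its parameters from the samples and evaluate the density of a new point:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=1000)  # x^(1), ..., x^(n)

# Fit a Gaussian by estimating its mean and standard deviation.
mu, sigma = samples.mean(), samples.std()

def density(x):
    """Estimated probability density of a new sample x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

print(density(5.0))   # high density near the estimated mean
print(density(15.0))  # low density far from it
```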

Clustering:

Given samples:

\[x^{(1)}, \dots, x^{(n)} \in \mathbb{R}^D\]

The goal of clustering is to find a partitioning of the samples that groups together samples that are similar.

What “similar” means here depends on how we define the similarity between samples, and on exactly what criterion is used to evaluate a partitioning.
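One common choice of criterion is the k-means objective, which groups samples so that each is close to its cluster’s mean. A minimal numpy sketch (k-means is my example choice, not something the lecture fixes):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: alternate assigning samples to their nearest
    center and moving each center to the mean of its samples."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
```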

Dimensionality reduction:

Given samples:

\[x^{(1)}, \dots, x^{(n)} \in \mathbb{R}^D\]

The problem is to re-represent them as points in a d-dimensional space where $d < D$.

The goal is to simplify the dataset and make it more intuitive. Moreover, dimensionality reduction removes less important dimensions, which makes the model more efficient.
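A standard way to do this is principal component analysis (PCA); here is a minimal numpy sketch (PCA is my example choice here, not mandated by the lecture):

```python
import numpy as np

def pca_project(X, d):
    """Project D-dimensional samples (rows of X) onto their
    top-d principal directions."""
    Xc = X - X.mean(axis=0)                        # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T                           # coordinates in R^d

X = np.random.randn(100, 10)  # n = 100 samples in R^10 (D = 10)
Z = pca_project(X, d=2)       # re-represented in R^2
print(Z.shape)                # (100, 2)
```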

Reinforcement learning

The goal is still to learn a mapping from input values $x$ to output values $y$, but without a direct supervision signal to specify which output values $y$ are best for a particular input.

There is no training set specified in advance. Instead, the learning process works like a trial-and-error game:

  • The agent (learner) observes the current state $x^{(0)}$.

  • It selects an action $y^{(0)}$.

  • It receives a reward $r^{(0)}$, which depends on $x^{(0)}$ and possibly $y^{(0)}$.

  • The process continues probabilistically to a new state $x^{(1)}$, with a distribution that depends only on $x^{(0)}$ and $y^{(0)}$.

  • The agent then observes the new state $x^{(1)}$.

  • $\dots$

The goal is to find a policy $\pi$ mapping $x$ to $y$. But what is a policy $\pi$?

Put simply, a policy is the strategy the agent (learner) uses to decide its next action in each state. It is represented by the following function:

\[\pi(x) = y\]

This means: if you’re in the current state $x$, which $y$ (an action) should you choose?

And what does the policy $\pi$ try to do? It tries to choose chains of actions that maximize the reward $r$ over time.
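As a tiny illustration (the states, actions, rewards, and transitions here are all invented), a policy for a small discrete problem can be a simple lookup table from states to actions:

```python
# Hypothetical 3-state problem: the policy pi maps each state x to an action y.
policy = {"start": "go_right", "middle": "go_right", "goal": "stay"}

# Invented dynamics and rewards, just to show the interaction loop.
next_state = {("start", "go_right"): "middle", ("middle", "go_right"): "goal"}
reward = {"goal": 1.0}

state, total = "start", 0.0
for _ in range(3):
    action = policy[state]                        # y = pi(x)
    total += reward.get(state, 0.0)               # reward depends on the state
    state = next_state.get((state, action), state)
print(total)  # cumulative reward collected under this policy
```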

Assumptions

The kinds of assumptions that we can make about the source of the data or the solution include:

  • The data are independent and identically distributed (i.i.d.). This means the samples do not depend on one another, and all come from the same distribution.

  • The data are generated by a Markov chain. This means the data are serially dependent, but each state depends only on the previous one (see the sketch after this list):

\[P(x_t | x_{t - 1}, x_{t - 2}, \dots) = P(x_t | x_{t - 1})\]

  • The process generating the data might be adversarial.

  • The “true” model that is generating the data can be perfectly described by one of some particular set of hypotheses.
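As a minimal sketch of the Markov assumption (the two states and transition probabilities are invented), here is how one might sample a sequence in which each state depends only on the previous one:

```python
import numpy as np

# Hypothetical two-state Markov chain: row i of P is the distribution
# over the next state given that the current state is i.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

rng = np.random.default_rng(0)
state, chain = 0, [0]
for _ in range(10):
    state = rng.choice(2, p=P[state])  # depends only on the current state
    chain.append(state)
print(chain)
```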

The goal of making assumptions is to reduce the size or the complexity of the problem.

For example, when we assume the data are i.i.d., we are narrowing the range of possible models. With an appropriately narrow range, we can work with the problem more easily, for example by needing less data to train.

Evaluation criteria

But what do we base our judgment on when evaluating whether a hypothesis, assumption, or prediction is good? This is the basic idea behind evaluation criteria.

We touched on this in the supervised learning section: the quality of predictions from a learned model is often expressed in terms of a loss function $\mathcal{L}(g, a)$.

Now we will discuss some common types of loss functions.

0-1 Loss: This type of loss function applies to predictions drawn from finite domains. The restriction to finite domains matters: if the actual values came from a continuous distribution, the probability of the event $g = a$ would be $0$, making this loss useless.

\[\mathcal{L}(g, a) = \begin{cases} 0 & \text{if } g = a \\ 1 & \text{otherwise} \end{cases}\]

Squared loss:

\[\mathcal{L}(g, a) = (g - a)^2\]

Linear loss:

\[\mathcal{L}(g, a) = |g - a|\]

Asymmetric loss: This type of loss function handles situations where the two kinds of errors are not equally bad. Consider trying to predict whether someone is having a heart attack: it might be much worse to predict “no” when the actual answer is “yes” than the other way around.

\[\mathcal{L}(g, a) = \begin{cases} 1 & \text{if } g = 1 \text{ and } a = 0 \\ 10 & \text{if } g = 0 \text{ and } a = 1 \\ 0 & \text{otherwise} \end{cases}\]
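Here is a minimal sketch of the four loss functions above in Python (the penalties 1 and 10 simply mirror the heart-attack example):

```python
def zero_one_loss(g, a):
    return 0 if g == a else 1

def squared_loss(g, a):
    return (g - a) ** 2

def linear_loss(g, a):
    return abs(g - a)

def asymmetric_loss(g, a):
    """A false alarm (g = 1, a = 0) costs 1; a miss (g = 0, a = 1) costs 10."""
    if g == 1 and a == 0:
        return 1
    if g == 0 and a == 1:
        return 10
    return 0

print(asymmetric_loss(0, 1))  # 10: missing the heart attack is the costly error
```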

Model type

Recall that the goal of a machine-learning system is typically to estimate or generalize, based on data provided. Below, we examine the role of model-making in machine learning.

No model:

In some simple cases, in response to queries, we can generate predictions directly from the training data, without the construction of any intermediate model.

For example, in regression or classification, we might generate an answer to a new query by averaging answers to recent queries, as in the nearest neighbor method.
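A minimal 1-nearest-neighbor sketch in numpy (the data are invented): the prediction comes straight from the training set, with no fitted model in between:

```python
import numpy as np

def nearest_neighbor_predict(X_train, Y_train, x_query):
    """Return the label of the training point closest to the query."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    return Y_train[np.argmin(distances)]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
Y_train = np.array([-1, -1, 1])
print(nearest_neighbor_predict(X_train, Y_train, np.array([4.0, 4.5])))  # 1
```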

Prediction rule:

When a model is needed, there are two steps in making a prediction:

  • “Fit” a model to the training data.

  • Use the model directly to make predictions.

In the prediction rule setting of regression or classification, the model will be some hypothesis or prediction rule $y = h(x; \theta)$.

Here $\theta$ denotes the parameters, which may be a single value or a vector containing a set of values.

For example:

\[h(x; \theta) = \theta_0 + \theta_1 \cdot x\]

\[\theta = (\theta_0, \theta_1)\]

As another example, consider the function below:

\[h(x; p) = x^p\]

Then the actual, fully specified function is determined by the value of $p$.

We need to fit the model, which means finding the value of $\theta$ so that the prediction rule $h(x;\theta)$ best fits the training data.

Once we have solved this problem, given a new $x^{(n+1)}$ we make the prediction $h(x^{(n+1)}; \theta)$.

The fitting process is often represented as an optimization problem: Find a value $\theta$ that minimizes the loss function.

There are two cases in the fitting process:

  • If we knew the actual distribution of our data, $\Pr(X, Y)$, the ideal approach would be to predict the value of $y$ that minimizes the expected loss, also known as the test error.

  • But in reality we rarely have that underlying distribution, or even an estimate of it, so instead we minimize the training error: we find the prediction rule $h$ that minimizes the average loss on our training dataset. That is, we find the value of $\theta$ that minimizes (a fitting sketch follows the formula):

\[\mathcal{E}_n(\theta) = \frac{1}{n} \cdot \sum_{i = 1}^{n} \mathcal{L}(h(x^{(i)}; \theta), y^{(i)})\]
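As a minimal sketch of this optimization (using the linear rule $h(x; \theta) = \theta_0 + \theta_1 \cdot x$ and squared loss on invented data), numpy’s least-squares solver finds the minimizing $\theta$ in closed form:

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([1.2, 2.8, 5.1, 7.3])

# Design matrix with a column of ones so theta_0 acts as the intercept.
A = np.column_stack([np.ones_like(X), X])

# Least squares minimizes the average squared loss over the training set.
theta, *_ = np.linalg.lstsq(A, Y, rcond=None)
theta_0, theta_1 = theta

h = lambda x: theta_0 + theta_1 * x
print(h(4.0))  # prediction for a new x^(n+1)
```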

Overfitting: We should not try to fit the data too strongly, or we may end up with a hypothesis that does not generalize well when presented with a new value of $x$. This is called overfitting: the model works well on the training data but badly on new data.

The reason is that when we fit the data too strongly, the model can learn the noise in the data, which leads to bad predictions on new data (bad generalization).

An analogy: a student facing an upcoming exam tries to memorize all the answers without any understanding. In the exam, they can fail if the questions contain even small changes (“học vẹt”, or rote learning, in Vietnamese).
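A minimal sketch of the effect (the data are invented; a degree-9 polynomial stands in for “fitting too strongly”):

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(scale=0.2, size=10)  # a linear trend plus noise

# A degree-1 fit captures the trend; a degree-9 fit chases the noise.
simple = Polynomial.fit(x, y, deg=1)
overfit = Polynomial.fit(x, y, deg=9)

x_new = 1.1  # a point just outside the training range
print(simple(x_new))   # close to the true value 2 * 1.1 = 2.2
print(overfit(x_new))  # typically far off: poor generalization
```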

Model class and parameter fitting

A model class $M$ is a set of possible models, typically parameterized by a vector of parameters $\theta$.

For example, with a linear regression problem, we might try to find a linear function:

\[h(x; \theta, \theta_0) = \theta^T x + \theta_0\]

Here the parameters are the weight vector $\theta$ and the offset $\theta_0$.

In other words, model class is the general form of the model, such as: linear regression, logistic regression, neural network, decision tree, …

A model class is specified by two things: the model structure (one of the forms listed above) and the number of parameters.

Model selection: This is the process of trying out and evaluating several models, then choosing the one most appropriate for the problem at hand.

Parametric vs Non-parametric:

  • Parametric model: Has a fixed number of parameters.

  • Non-parametric model: The number of parameters is not fixed and can grow with the dataset.

Algorithm

Once we have chosen the model class and the criterion (the loss function) for judging models within it, finding the best model becomes an algorithmic problem.

Sometimes we can use software that was designed, generically, to perform optimization. In many other cases, we use algorithms that are specialized for machine-learning problems, or for particular hypotheses classes.

This post is licensed under CC BY 4.0 by the author.
