Notes about perceptrons and sigmoid neurons
Perceptron
A Perceptron behaves like a classifier. Suppose we have an input $x = (x_1, \dots, x_n)$ which we want to classify as either belonging to a category (labeled as 1) or to another category (labeled as 0). Its inner formula is given by

$$z = w \cdot x + b \tag{1}$$

Given an input $x$, for example a two-dimensional point $x = (x_1, x_2)$, the Perceptron calculates a score: when $w \cdot x + b > 0$ the input belongs to the category labeled as 1, when $w \cdot x + b < 0$ the input belongs to the category labeled as 0.

Visually, when the score is positive the input $x$ lies on the positive side of the hyperplane $w \cdot x + b = 0$; when negative it lies on the other side. The bigger the score, the further the input is from the line.

A common way to define a Perceptron is to use the step function on top of the previous score

$$\hat{y} = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

where $\hat{y}$ represents the predicted category, to be compared with the expected category $y$. Unfortunately this function is not continuous, so we cannot use any known minimization algorithm such as Gradient Descent to find the best parameters $w$ and $b$.
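As a minimal sketch of formulas (1) and (2), assuming numpy and illustrative names such as perceptron_predict (none of which come from the notes above):

```python
import numpy as np

def perceptron_predict(x, w, b):
    """Return 1 when the score w·x + b is positive, 0 otherwise."""
    score = np.dot(w, x) + b       # formula (1): the raw score
    return 1 if score > 0 else 0   # formula (2): the step function on top of the score
```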
Perceptron Learning Rule
If we consider an input vector $x = (1, x_1, \dots, x_n)$, with a constant 1 prepended, and the weights $w = (b, w_1, \dots, w_n)$, we can still write the Perceptron formula like

$$\hat{y} = \begin{cases} 1 & \text{if } w \cdot x > 0 \\ 0 & \text{otherwise} \end{cases}$$

which describes a plane divided by a line through the origin, with $w$ the normal vector to it.

If an input $x$ is incorrectly classified we can rotate the line, forcing the normal vector $w$ to move in the same direction as the vector $x$, by adding a small vector $\eta\, x$:

$$w \leftarrow w + \eta\, x$$

If we apply this step rule several times, at some point the half-plane will include the input $x$ and it will be classified correctly. We can also rotate in the opposite direction by subtracting the same quantity.

One important thing to notice is that the more a point is misclassified, the longer it takes to realign the weight vector, which means a Perceptron is very slow to learn.
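A possible sketch of this learning rule, assuming a small learning rate eta and an input x with a leading constant 1 for the bias (the names are illustrative):

```python
import numpy as np

def perceptron_update(x, y, w, eta=0.1):
    """One application of the Perceptron learning rule."""
    y_hat = 1 if np.dot(w, x) > 0 else 0
    if y_hat != y:
        # move w towards x when y = 1 (add), away from x when y = 0 (subtract)
        w = w + eta * (y - y_hat) * x
    return w
```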
This rule is very important because it is very close to what happens when we apply Gradient Descent to a Sigmoid Neuron to minimize the Cross-Entropy cost function described later. We can prove that if $y$ is the expected value and $\hat{y}$ is the value predicted by the Sigmoid Neuron for the input $x$, in order to better classify the input we need to add to the weight vector $w$ the quantity

$$\eta\,(y - \hat{y})\,x$$

Please note that $y$ is either 0 or 1 while $\hat{y}$ can assume any value between 0 and 1. In this case the correction is proportional to the misclassification error, the difference between the expected value and the predicted value, so a Sigmoid Neuron together with a Cross-Entropy cost function is faster to learn.
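A hedged sketch of this update for a Sigmoid Neuron, again with an illustrative eta and the bias folded into w:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_neuron_update(x, y, w, eta=0.1):
    """One gradient step; x includes a leading 1 for the bias."""
    y_hat = sigmoid(np.dot(w, x))      # predicted value in (0, 1)
    return w + eta * (y - y_hat) * x   # correction proportional to the error y - y_hat
```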
Sigmoid Neuron
We notice that the original formula gives a very high positive score the more likely it is for the input $x$ to be classified in category 1, and a very negative score the more likely it is to be classified in category 0. Using this idea we can apply a function on top of the score to convert all possible scores from the domain $(-\infty, +\infty)$ into the domain $(0, 1)$. Such a formula is the sigmoid function

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

When the score is very negative this function approaches 0 and when it is very positive it approaches 1. Using this idea we can convert the formula (2) into a continuous function which returns a classification probability instead

$$\hat{y} = \sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}} \tag{3}$$

If we have an input $x$ which we know belongs to category 1, the probability for this input to be correctly classified will be

$$\hat{y}$$

If we have an input $x$ which we know belongs to category 0, the probability of being correctly classified becomes

$$1 - \hat{y}$$
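As a tiny illustration (the helper name prob_correct is made up), the probability of a correct classification can be read off the prediction like this:

```python
def prob_correct(y_hat, y):
    """Probability that a prediction y_hat in (0, 1) is correct for the true label y."""
    return y_hat if y == 1 else 1.0 - y_hat
```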
Classification of multiple inputs
Assume now we have a series of inputs and expected categories $(x^{(1)}, y_1), \dots, (x^{(n)}, y_n)$. We want the Perceptron to properly classify all the inputs, i.e. we want to find the weights $w$ and bias $b$ which maximize the probability for each input to be correctly classified. We create a new probability function by multiplying the probabilities $p_i$ of correct classification of each input

$$P = \prod_{i=1}^{n} p_i$$

There is a numerical problem with this approach: we multiply a lot of small numbers, all less than 1, which is not a good idea with floating point numbers. A better approach is to use the logarithm and convert the above multiplication into the summation

$$\log P = \sum_{i=1}^{n} \log p_i$$
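A quick illustration of the numerical problem, with made-up values (1000 samples, each correctly classified with probability 0.1):

```python
import numpy as np

p = np.full(1000, 0.1)
print(np.prod(p))         # 0.0 -- the product underflows in double precision
print(np.sum(np.log(p)))  # about -2302.6 -- the log-sum stays well behaved
```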
Cross-Entropy cost function
Let’s define $p_i$, the probability that the input $x^{(i)}$ is correctly classified, as follows. When the expected value for the input $x^{(i)}$ is $y_i = 1$ we can write

$$p_i = \hat{y}_i$$

When the expected value for the input $x^{(i)}$ is $y_i = 0$ we can write

$$p_i = 1 - \hat{y}_i$$

Since $y_i$ is either 0 or 1, both cases can be written in one formula as $p_i = \hat{y}_i^{\,y_i}\,(1 - \hat{y}_i)^{1 - y_i}$, and the above summation becomes

$$\log P = \sum_{i=1}^{n}\Bigl[\, y_i \log \hat{y}_i + (1 - y_i) \log\bigl(1 - \hat{y}_i\bigr) \Bigr]$$

To convert the problem into a minimization problem we can add a minus sign in front of the formula and we obtain what is called the Cross-Entropy cost function

$$C(w, b) = -\sum_{i=1}^{n}\Bigl[\, y_i \log \hat{y}_i + (1 - y_i) \log\bigl(1 - \hat{y}_i\bigr) \Bigr]$$

which is the negative logarithm of the probability for each input $x^{(i)}$ to be correctly classified.
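A short sketch of this cost function, assuming numpy arrays y and y_hat of the same length (names are illustrative):

```python
import numpy as np

def cross_entropy(y, y_hat):
    """Cross-Entropy cost for labels y in {0, 1} and predictions y_hat in (0, 1)."""
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```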
Softmax Activation Function and Softmax Function
Softmax Activation Function and Softmax Function are two different functions. The Softmax Function represents a smooth version of the $\max$ function and it is defined as follows

$$\mathrm{softmax}(x_1, \dots, x_n) = \log\bigl(e^{x_1} + \dots + e^{x_n}\bigr)$$

where $x = (x_1, \dots, x_n)$ is the input vector. For example, the discontinuous function $\max(0, x)$ can be approximated by the softplus function $\log\bigl(1 + e^{x}\bigr)$. The derivative of the softplus function is the logistic function

$$\frac{d}{dx}\,\log\bigl(1 + e^{x}\bigr) = \frac{e^{x}}{1 + e^{x}} = \frac{1}{1 + e^{-x}}$$

The Softmax Activation Function is a generalization of the logistic function. It is used to convert a vector of scores in $(-\infty, +\infty)$ into a vector of probabilities in $(0, 1)$. It is defined as follows

$$\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad j = 1, \dots, K$$

It is normally used in a multi-class classification problem, at the output layer of a neural network, to convert the output scores into output probabilities, like the logistic function does for a single output.
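A sketch of both functions, assuming numpy; the max subtraction in softmax_activation is a common numerical-stability trick, not something from the derivation above:

```python
import numpy as np

def softmax_function(x):
    """Smooth approximation of max(x_1, ..., x_n)."""
    return np.log(np.sum(np.exp(x)))

def softmax_activation(z):
    """Turn a vector of scores into a vector of probabilities summing to 1."""
    e = np.exp(z - np.max(z))   # subtracting the max avoids overflow
    return e / np.sum(e)
```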
Derivative of the logistic function
Theorem
The derivative of the logistic function

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

is

$$\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$$

Proof
Let $\sigma$ be the logistic function defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

We calculate the derivative as follows

$$\sigma'(x) = \frac{e^{-x}}{\bigl(1 + e^{-x}\bigr)^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \frac{1}{1 + e^{-x}} \cdot \left(1 - \frac{1}{1 + e^{-x}}\right) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$$
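As a sanity check of the theorem (purely illustrative, with an arbitrary point x = 0.7), a central finite difference should match the analytic derivative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central finite difference
analytic = sigmoid(x) * (1 - sigmoid(x))
assert abs(numeric - analytic) < 1e-8
```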
Training-Loss function
The more training data we have, the bigger the Cross-Entropy Cost Function becomes. Dividing the Cross-Entropy Cost Function by the number of training samples we obtain the Training-Loss Function, which is a better measure of the classification error.
Theorem
Assume we have a series of input vectors $x^{(i)} = \bigl(x^{(i)}_1, \dots, x^{(i)}_m\bigr)$ with $i = 1, \dots, n$. Suppose $y_i \in \{0, 1\}$ is the expected label for the input $x^{(i)}$, $w = (w_1, \dots, w_m)$ the weights vector and $\hat{y}_i$ the prediction value given by

$$\hat{y}_i = \sigma\bigl(w \cdot x^{(i)} + b\bigr) = \frac{1}{1 + e^{-(w \cdot x^{(i)} + b)}}$$

The gradient of the training-loss function (the average of the cross-entropy cost function)

$$L(w, b) = -\frac{1}{n} \sum_{i=1}^{n}\Bigl[\, y_i \log \hat{y}_i + (1 - y_i) \log\bigl(1 - \hat{y}_i\bigr) \Bigr]$$

is given by

$$\frac{\partial L}{\partial w_j} = \frac{1}{n} \sum_{i=1}^{n} \bigl(\hat{y}_i - y_i\bigr)\, x^{(i)}_j, \qquad \frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} \bigl(\hat{y}_i - y_i\bigr)$$
Proof
The gradient of $L$ is given by the partial derivatives $\frac{\partial L}{\partial w_j}$ and $\frac{\partial L}{\partial b}$. Using Gradient Descent we can use them to change $w$ and $b$ to minimize the error function $L$.

Let’s call $L_i$ the $i$-th element of the summation above

$$L_i = -\Bigl[\, y_i \log \hat{y}_i + (1 - y_i) \log\bigl(1 - \hat{y}_i\bigr) \Bigr]$$

We first calculate the derivative of $L_i$ with respect to $\hat{y}_i$

$$\frac{\partial L_i}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i} + \frac{1 - y_i}{1 - \hat{y}_i} = \frac{\hat{y}_i - y_i}{\hat{y}_i\,\bigl(1 - \hat{y}_i\bigr)}$$

Then we calculate the derivatives of the prediction $\hat{y}_i$ with respect to $w_j$ and $b$, using the derivative of the logistic function proved above

$$\frac{\partial \hat{y}_i}{\partial w_j} = \hat{y}_i\,\bigl(1 - \hat{y}_i\bigr)\, x^{(i)}_j, \qquad \frac{\partial \hat{y}_i}{\partial b} = \hat{y}_i\,\bigl(1 - \hat{y}_i\bigr)$$

Combining the two using the chain rule we obtain

$$\frac{\partial L_i}{\partial w_j} = \bigl(\hat{y}_i - y_i\bigr)\, x^{(i)}_j, \qquad \frac{\partial L_i}{\partial b} = \hat{y}_i - y_i$$

From this it follows that

$$\frac{\partial L}{\partial w_j} = \frac{1}{n} \sum_{i=1}^{n} \bigl(\hat{y}_i - y_i\bigr)\, x^{(i)}_j, \qquad \frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} \bigl(\hat{y}_i - y_i\bigr)$$
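A sketch of the resulting gradient computed over a whole batch, assuming numpy and illustrative names (X holds one input per row):

```python
import numpy as np

def training_loss_gradient(X, y, w, b):
    """Gradient of the training-loss for a sigmoid neuron.
    X: (n, m) inputs, y: (n,) labels in {0, 1}, w: (m,) weights, b: scalar bias."""
    n = X.shape[0]
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predictions for all samples
    error = y_hat - y
    grad_w = X.T @ error / n                     # (1/n) * sum_i (y_hat_i - y_i) * x_i
    grad_b = error.mean()
    return grad_w, grad_b
```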
Validation Accuracy and Test Accuracy
Let’s define $P$ as the matrix where each row represents the classification probabilities predicted for a given sample; as an example, think of a validation set with 7 classes and 5 samples, so $P$ has 5 rows and 7 columns. $Y$ is the matrix where each row represents the correct classification as a one-hot label. $c$ is the vector indicating with 1 a correct prediction and with 0 a wrong prediction: its $i$-th entry is 1 when the largest value in the $i$-th row of $P$ is in the same column as the 1 in the $i$-th row of $Y$.

The validation accuracy, measured on the given validation samples, is the average of all the values of $c$; in our example it is the mean of the 5 entries of $c$.

The validation accuracy is 100% only when all predictions made on the given validation samples are correct. The test accuracy follows the same idea but it is measured on all the test samples instead.
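A possible sketch of this computation, assuming numpy and the matrices P and Y defined above (the function name is illustrative):

```python
import numpy as np

def validation_accuracy(P, Y):
    """P: (samples, classes) predicted probabilities, Y: one-hot correct labels."""
    c = (np.argmax(P, axis=1) == np.argmax(Y, axis=1)).astype(int)
    return c.mean()   # fraction of correct predictions
```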