# Notes about perceptrons and sigmoid neurons

## Perceptron

A Perceptron behaves like a binary classifier. Suppose we have an input which we want to classify as belonging either to one category (labeled as 1) or to another category (labeled as 0).
Given an input $x \in \mathbb{R}^m$, for example a vector of pixel intensities, the Perceptron calculates a score

$$z = w \cdot x + b$$

where $w$ is a vector of weights and $b$ is a bias. The score is positive when the input belongs to the category labeled as 1 and negative when the input belongs to the category labeled as 0.
Visually, when the score is positive the input lies on the positive side of the hyperplane $w \cdot x + b = 0$; when negative it lies on the other side. The bigger the absolute value of the score, the further the input is from the separating hyperplane.
A common way to define a Perceptron is to apply the Heaviside step function $H$ on top of the previous score

$$\hat{y} = H(w \cdot x + b) = \begin{cases} 1 & \text{if } w \cdot x + b \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

where $\hat{y}$ represents the predicted category. Unfortunately this function is not continuous, so we cannot use any known minimization algorithm such as Gradient Descent to find the best parameters $w$ and $b$.
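As a concrete illustration, the score and step-function classification above can be sketched in Python. The weights, bias, and inputs here are made-up values, not taken from the notes:

```python
import numpy as np

def perceptron_predict(x, w, b):
    """Classify x as 1 when the score w.x + b is non-negative, else 0."""
    score = np.dot(w, x) + b
    return 1 if score >= 0 else 0

# Hypothetical 2-dimensional example: the separating line is x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0
print(perceptron_predict(np.array([2.0, 2.0]), w, b))  # score 3 >= 0 -> 1
print(perceptron_predict(np.array([0.0, 0.0]), w, b))  # score -1 < 0 -> 0
```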

## Perceptron Learning Rule

If we absorb the bias into the weights, considering the augmented input vector $x = (1, x_1, \dots, x_m)$ and the weights $w = (b, w_1, \dots, w_m)$, we can still write the Perceptron formula as $\hat{y} = H(w \cdot x)$, which describes a plane divided into two half-planes by the line $w \cdot x = 0$ through the origin, with $w$ the normal vector to it.
If an input $x$ is incorrectly classified as 0 instead of 1, we can rotate the line, forcing the normal vector $w$ to move towards the direction of the vector $x$ by adding a small vector:

$$w \leftarrow w + \eta \, x$$

where $\eta$ is a small positive learning rate. If we apply this step rule several times, at some point the half-plane will include the input and it will be classified correctly. We can also rotate $w$ in the opposite direction, subtracting the same quantity, when an input is incorrectly classified as 1 instead of 0.
One important thing to notice is that the more a point is mis-classified, the more iterations it takes to realign the weight vector, which means a Perceptron can be very slow to learn.
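The learning rule above can be sketched as a small training loop. The data set here is a hypothetical, linearly separable toy example, and the combined update `eta * (y - pred) * x` covers both the "add" and "subtract" cases at once:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=100):
    """Perceptron learning rule with the bias absorbed into the weights."""
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])  # augmented inputs (1, x1, ..., xm)
    w = np.zeros(Xa.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xa, y):
            pred = 1 if np.dot(w, xi) >= 0 else 0
            # rotate w towards xi (prediction too low) or away from it (too high)
            w += eta * (yi - pred) * xi
    return w

# Hypothetical linearly separable data: class 1 when x1 + x2 > 1
X = np.array([[0.0, 0.0], [0.2, 0.3], [1.0, 1.0], [0.8, 0.9]])
y = np.array([0, 0, 1, 1])
w = train_perceptron(X, y)
preds = (np.hstack([np.ones((4, 1)), X]) @ w >= 0).astype(int)
print(preds)  # all four points classified correctly
```

For linearly separable data like this, the Perceptron convergence theorem guarantees the loop stops making mistakes after finitely many updates.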
This rule is very important because it is very close to what happens when we apply Gradient Descent to a Sigmoid neuron to minimize the Cross-Entropy cost function described later. We can prove that if $y$ is the expected value and $\hat{y}$ is the value predicted by the Sigmoid Neuron for the input $x$, then in order to better classify the input we need to add to the weight vector the quantity

$$\Delta w = \eta \, (y - \hat{y}) \, x$$

Please note that $y$ is either 0 or 1 while $\hat{y}$ can assume any value between 0 and 1.
In this case the size of the update is proportional to the mis-classification, i.e. the difference between the expected value and the predicted value, so a Sigmoid Neuron together with a Cross-Entropy cost function is faster to learn.

## Sigmoid Neuron

We notice that the original score formula gives a higher positive score the more likely it is for the input to belong to category 1, and a more negative score the more likely it is to belong to category 0. Using this idea we can apply a function on top of the score to convert all possible scores from the domain $(-\infty, +\infty)$ into the domain $(0, 1)$. Such a formula is the sigmoid function

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

When the score is very negative this function approaches 0 and when very positive it approaches 1. Using this idea we can convert the step-function formula above into a continuous function which returns a classification probability instead:

$$\hat{y} = \sigma(w \cdot x + b)$$

If we have an input which we know belongs to category 1, the probability for this input to be correctly classified will be

$$P = \hat{y}$$

If we have an input which we know belongs to category 0, the probability of being correctly classified becomes

$$P = 1 - \hat{y}$$
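A minimal sketch of these two probabilities, using made-up weights (a point deep on the positive side of the hyperplane gets a probability close to 1 for label 1, and symmetrically for label 0):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def correct_class_probability(x, y, w, b):
    """Probability that x is correctly classified, given its true label y (0 or 1)."""
    y_hat = sigmoid(np.dot(w, x) + b)  # P(category 1 | x)
    return y_hat if y == 1 else 1.0 - y_hat

# Hypothetical weights: the further x is on the positive side, the closer to 1
w, b = np.array([1.0, 1.0]), -1.0
print(correct_class_probability(np.array([3.0, 3.0]), 1, w, b))    # close to 1
print(correct_class_probability(np.array([-3.0, -3.0]), 0, w, b))  # close to 1
```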

## Classification of multiple inputs

Assume now we have a series of inputs $x_1, \dots, x_n$ and expected categories $y_1, \dots, y_n$. We want the Perceptron to properly classify all the inputs, i.e. we want to find the weights $w$ and bias $b$ which maximize the probability for each input to be correctly classified. We create a new probability function by multiplying the probability functions for each input:

$$P = \prod_{i=1}^{n} P_i$$

There is a numerical problem with this approach: we multiply a lot of small numbers, all less than 1, which is not a good idea with floating point numbers. A better approach is to take the logarithm and convert the above multiplication into the summation

$$\log P = \sum_{i=1}^{n} \log P_i$$
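The floating-point problem can be demonstrated directly: with a few thousand made-up per-sample probabilities, the direct product underflows to zero while the sum of logarithms stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.1, 0.9, size=2000)  # 2000 per-sample probabilities, all < 1

# The direct product underflows to 0.0 in double precision
print(np.prod(p))          # 0.0

# Summing logarithms keeps a usable value
print(np.sum(np.log(p)))   # a large but finite negative number
```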

## Cross-Entropy cost function

Let's define $P_i$ as follows

$$P_i = \hat{y}_i^{\,y_i} \, (1 - \hat{y}_i)^{1 - y_i}$$

When the expected value for the input $x_i$ is $y_i = 1$ we can write $P_i = \hat{y}_i$. When the expected value for the input $x_i$ is $y_i = 0$ we can write $P_i = 1 - \hat{y}_i$.
The above summation becomes

$$\log P = \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$

To convert the problem into a minimization problem we can add a minus sign in front of the formula, and we obtain what is called the Cross-Entropy cost function

$$C = - \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$

which is the negative logarithm of the probability for each input to be correctly classified.
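The cost function above can be sketched in a few lines; the labels and predictions below are made-up values, chosen to show that the cost is small when predictions agree with labels and large when they disagree:

```python
import numpy as np

def cross_entropy(y, y_hat):
    """Cross-Entropy cost: negative log-probability of classifying every input correctly."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical labels and predictions
y = [1, 0, 1]
print(cross_entropy(y, [0.9, 0.2, 0.8]))  # small cost: predictions agree with labels
print(cross_entropy(y, [0.1, 0.9, 0.2]))  # large cost: predictions disagree
```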

## Softmax Activation Function and Softmax Function

The Softmax Activation Function and the Softmax Function are two different functions. The Softmax Function represents a smoothed version of the $\max$ function and it is defined as follows

$$\operatorname{softmax}(x_1, \dots, x_n) = \log \sum_{i=1}^{n} e^{x_i} \approx \max(x_1, \dots, x_n)$$

where $(x_1, \dots, x_n)$ is the input vector. For example, the discontinuous function $\max(0, x)$ can be approximated by the softplus function $\log(1 + e^{x})$. The derivative of the softplus function is the logistic function

$$\frac{d}{dx} \log(1 + e^{x}) = \frac{e^{x}}{1 + e^{x}} = \frac{1}{1 + e^{-x}}$$

The Softmax Activation Function is a generalization of the logistic function. It is used to convert a vector of scores in $\mathbb{R}^n$ into a vector of probabilities in $(0, 1)^n$ that sum to 1. It is defined as follows

$$\operatorname{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{n} e^{z_k}} \qquad j = 1, \dots, n$$

It is normally used in a multi-class classification problem, at the output layer of a neural network, to convert output scores into output probabilities, like the logistic function does in the binary case.
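Both functions can be sketched together; the score vector below is made up, and the max is subtracted before exponentiating, a standard trick to avoid overflow that does not change the result:

```python
import numpy as np

def softplus(x):
    """Smooth approximation of max(0, x); its derivative is the logistic function."""
    return np.log1p(np.exp(x))

def softmax_activation(z):
    """Convert a score vector into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

print(softplus(np.array([-10.0, 10.0])))  # close to max(0, x): about (0, 10)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax_activation(scores)
print(probs)  # the highest score gets the highest probability; the values sum to 1
```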

## Derivative of the logistic function

### Theorem

The derivative of the logistic function is

$$\sigma'(z) = \sigma(z) \left( 1 - \sigma(z) \right)$$

### Proof

Let $\sigma$ be the logistic function defined as

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

We calculate the derivative as follows

$$\sigma'(z) = \frac{d}{dz} \left( 1 + e^{-z} \right)^{-1} = \frac{e^{-z}}{\left( 1 + e^{-z} \right)^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z) \left( 1 - \sigma(z) \right)$$

where in the last step we used $\frac{e^{-z}}{1 + e^{-z}} = 1 - \frac{1}{1 + e^{-z}} = 1 - \sigma(z)$.
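The identity can also be checked numerically, comparing a central finite difference against the closed form at a few sample points:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = 1e-6  # step for the central finite difference
for z in [-2.0, 0.0, 3.0]:
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    closed = sigmoid(z) * (1 - sigmoid(z))
    print(z, numeric, closed)  # the two values agree to several decimal places
```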

## Training-Loss function

The more training data we have, the bigger the Cross-Entropy Cost Function becomes. Dividing the Cross-Entropy Cost Function by the number of training samples we obtain the Training-Loss Function, which is a better measure of the classification error.

### Theorem

Assume we have a series of input vectors $x_1, \dots, x_n$ with $x_i \in \mathbb{R}^m$. Suppose $y_i$ is the expected label for the input $x_i$, $w$ is the weights vector and $\hat{y}_i$ is the prediction value given by

$$\hat{y}_i = \sigma(w \cdot x_i + b)$$

The gradient of the training-loss function (the average of the cross-entropy cost function)

$$L(w, b) = - \frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$

is given by

$$\frac{\partial L}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \, x_i \qquad \frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$$

### Proof

The gradient of $L$ is given by the partial derivatives $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$.
Using Gradient Descent we can use them to change $w$ and $b$ to minimize the error function:

$$w \leftarrow w - \eta \frac{\partial L}{\partial w} \qquad b \leftarrow b - \eta \frac{\partial L}{\partial b}$$

Let's call $L_i$ the $i$-th element of the summation above

$$L_i = - \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$

We first calculate the derivative of $L_i$ with respect to $\hat{y}_i$

$$\frac{\partial L_i}{\partial \hat{y}_i} = - \frac{y_i}{\hat{y}_i} + \frac{1 - y_i}{1 - \hat{y}_i} = \frac{\hat{y}_i - y_i}{\hat{y}_i (1 - \hat{y}_i)}$$

Then we calculate the derivatives of the prediction $\hat{y}_i = \sigma(w \cdot x_i + b)$ with respect to $w$ and $b$, using the derivative of the logistic function proved above:

$$\frac{\partial \hat{y}_i}{\partial w} = \hat{y}_i (1 - \hat{y}_i) \, x_i \qquad \frac{\partial \hat{y}_i}{\partial b} = \hat{y}_i (1 - \hat{y}_i)$$

Combining the two using the chain rule we obtain

$$\frac{\partial L_i}{\partial w} = (\hat{y}_i - y_i) \, x_i \qquad \frac{\partial L_i}{\partial b} = \hat{y}_i - y_i$$

From this it follows that

$$\frac{\partial L}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \, x_i \qquad \frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$$
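Putting the theorem to work, a minimal batch Gradient Descent loop for a single Sigmoid Neuron looks like this. The data set, learning rate, and epoch count are made-up choices for a toy separable problem:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sigmoid_neuron(X, y, eta=0.5, epochs=2000):
    """Batch gradient descent on the training-loss (average cross-entropy)."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        y_hat = sigmoid(X @ w + b)
        error = y_hat - y                # the terms (y_hat_i - y_i)
        w -= eta * (X.T @ error) / n     # dL/dw = mean of (y_hat_i - y_i) x_i
        b -= eta * error.mean()          # dL/db = mean of (y_hat_i - y_i)
    return w, b

# Hypothetical separable data: class 1 when x1 + x2 > 1
X = np.array([[0.0, 0.0], [0.2, 0.3], [1.0, 1.0], [0.8, 0.9]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = train_sigmoid_neuron(X, y)
print((sigmoid(X @ w + b) >= 0.5).astype(int))  # matches the labels y
```

Since the loss is convex in $w$ and $b$, plain gradient descent is enough here; no mini-batching or momentum is needed for a toy example of this size.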

## Validation Accuracy and Test Accuracy

Let's define $P$ as the matrix where each row represents the classification probabilities for a given sample, and $Y$ as the matrix where each row represents the correct classification as a one-hot label. For example, we could have 7 classes and 5 samples, so both matrices would be $5 \times 7$. Let $c$ be the vector indicating with 1 a correct prediction (the index of the largest probability in a row of $P$ matches the index of the 1 in the corresponding row of $Y$) and with 0 a wrong prediction.
The validation accuracy, measured on the given validation samples, is the average of all the values of $c$:

$$\text{accuracy} = \frac{1}{n} \sum_{i=1}^{n} c_i$$

The validation accuracy is 100% only when all predictions made on the given validation samples are correct. The test accuracy follows the same idea, but it is measured on all test samples instead.
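A sketch of the computation with illustrative matrices of 5 samples and 7 classes (the numbers are randomly generated, not from the notes; the labels are first built to match every prediction, then one label is deliberately broken):

```python
import numpy as np

def accuracy(P, Y):
    """Fraction of rows where the most probable class matches the one-hot label."""
    c = (P.argmax(axis=1) == Y.argmax(axis=1)).astype(int)
    return c.mean()

rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(7), size=5)  # 5 rows of 7 probabilities, each row sums to 1
Y = np.eye(7)[P.argmax(axis=1)]        # one-hot labels matching every prediction
print(accuracy(P, Y))                  # 1.0: all predictions correct

Y_wrong = Y.copy()
Y_wrong[0] = np.roll(Y_wrong[0], 1)    # break the first label
print(accuracy(P, Y_wrong))            # 0.8: four out of five correct
```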