# Notes about perceptrons and sigmoid neurons

## Perceptron

A Perceptron behaves like a binary classifier. Suppose we have an input which we want to classify as belonging either to one category (labeled as 1) or to another category (labeled as 0).
Given an input $x \in \mathbb{R}^m$, for example a vector of pixel intensities, the Perceptron calculates a score

$$z = w \cdot x + b$$

where $w$ is a vector of weights and $b$ is a bias. The score is positive when the input belongs to the category labeled as 1 and negative when the input belongs to the category labeled as 0.
Visually, when the score is positive the input lies on the positive side of the hyperplane $w \cdot x + b = 0$; when negative it lies on the other side. The bigger the absolute value of the score, the further the input is from the separating hyperplane.
A common way to define a Perceptron is to apply the Heaviside step function $H$ on top of the previous score

$$\hat{y} = H(w \cdot x + b) = \begin{cases} 1 & \text{if } w \cdot x + b \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

where $\hat{y}$ represents the predicted category. Unfortunately this function is not continuous, so we cannot use any known minimization algorithm such as Gradient Descent to find the best parameters $w$ and $b$.
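As a concrete illustration, the score and step-function classification above can be sketched in Python. The weights, bias, and inputs here are made-up values, not taken from the notes:

```python
import numpy as np

def perceptron_predict(x, w, b):
    """Classify x as 1 when the score w.x + b is non-negative, else 0."""
    score = np.dot(w, x) + b
    return 1 if score >= 0 else 0

# Hypothetical 2-dimensional example: the separating line is x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0
print(perceptron_predict(np.array([2.0, 2.0]), w, b))  # score 3 >= 0 -> 1
print(perceptron_predict(np.array([0.0, 0.0]), w, b))  # score -1 < 0 -> 0
```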

## Perceptron Learning Rule

If we absorb the bias into the weights, considering the augmented input vector $x = (1, x_1, \dots, x_m)$ and the weights $w = (b, w_1, \dots, w_m)$, we can still write the Perceptron formula as $\hat{y} = H(w \cdot x)$, which describes a plane divided into two half-planes by the line $w \cdot x = 0$ through the origin, with $w$ the normal vector to it.
If an input $x$ is incorrectly classified as 0 instead of 1, we can rotate the line, forcing the normal vector $w$ to move towards the direction of the vector $x$ by adding a small vector:

$$w \leftarrow w + \eta \, x$$

where $\eta$ is a small positive learning rate. If we apply this step rule several times, at some point the half-plane will include the input and it will be classified correctly. We can also rotate $w$ in the opposite direction, subtracting the same quantity, when an input is incorrectly classified as 1 instead of 0.
One important thing to notice is that the more a point is mis-classified, the more iterations it takes to realign the weight vector, which means a Perceptron can be very slow to learn.
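The learning rule above can be sketched as a small training loop. The data set here is a hypothetical, linearly separable toy example, and the combined update `eta * (y - pred) * x` covers both the "add" and "subtract" cases at once:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=100):
    """Perceptron learning rule with the bias absorbed into the weights."""
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])  # augmented inputs (1, x1, ..., xm)
    w = np.zeros(Xa.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xa, y):
            pred = 1 if np.dot(w, xi) >= 0 else 0
            # rotate w towards xi (prediction too low) or away from it (too high)
            w += eta * (yi - pred) * xi
    return w

# Hypothetical linearly separable data: class 1 when x1 + x2 > 1
X = np.array([[0.0, 0.0], [0.2, 0.3], [1.0, 1.0], [0.8, 0.9]])
y = np.array([0, 0, 1, 1])
w = train_perceptron(X, y)
preds = (np.hstack([np.ones((4, 1)), X]) @ w >= 0).astype(int)
print(preds)  # all four points classified correctly
```

For linearly separable data like this, the Perceptron convergence theorem guarantees the loop stops making mistakes after finitely many updates.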
This rule is very important because it is very close to what happens when we apply Gradient Descent to a Sigmoid neuron to minimize the Cross-Entropy cost function described later. We can prove that if $y$ is the expected value and $\hat{y}$ is the value predicted by the Sigmoid Neuron for the input $x$, then in order to better classify the input we need to add to the weight vector the quantity

$$\Delta w = \eta \, (y - \hat{y}) \, x$$

Please note that $y$ is either 0 or 1 while $\hat{y}$ can assume any value between 0 and 1.
In this case the size of the update is proportional to the mis-classification, i.e. the difference between the expected value and the predicted value, so a Sigmoid Neuron together with a Cross-Entropy cost function is faster to learn.

## Sigmoid Neuron

We notice that the original score formula gives a higher positive score the more likely it is for the input to belong to category 1, and a more negative score the more likely it is to belong to category 0. Using this idea we can apply a function on top of the score to convert all possible scores from the domain $(-\infty, +\infty)$ into the domain $(0, 1)$. Such a formula is the sigmoid function

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

When the score is very negative this function approaches 0 and when very positive it approaches 1. Using this idea we can convert the step-function formula above into a continuous function which returns a classification probability instead:

$$\hat{y} = \sigma(w \cdot x + b)$$

If we have an input which we know belongs to category 1, the probability for this input to be correctly classified will be

$$P = \hat{y}$$

If we have an input which we know belongs to category 0, the probability of being correctly classified becomes

$$P = 1 - \hat{y}$$
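A minimal sketch of these two probabilities, using made-up weights (a point deep on the positive side of the hyperplane gets a probability close to 1 for label 1, and symmetrically for label 0):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def correct_class_probability(x, y, w, b):
    """Probability that x is correctly classified, given its true label y (0 or 1)."""
    y_hat = sigmoid(np.dot(w, x) + b)  # P(category 1 | x)
    return y_hat if y == 1 else 1.0 - y_hat

# Hypothetical weights: the further x is on the positive side, the closer to 1
w, b = np.array([1.0, 1.0]), -1.0
print(correct_class_probability(np.array([3.0, 3.0]), 1, w, b))    # close to 1
print(correct_class_probability(np.array([-3.0, -3.0]), 0, w, b))  # close to 1
```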

## Classification of multiple inputs

Assume now we have a series of inputs $x_1, \dots, x_n$ and expected categories $y_1, \dots, y_n$. We want the Perceptron to properly classify all the inputs, i.e. we want to find the weights $w$ and bias $b$ which maximize the probability for each input to be correctly classified. We create a new probability function by multiplying the probability functions for each input:

$$P = \prod_{i=1}^{n} P_i$$

There is a numerical problem with this approach: we multiply a lot of small numbers, all less than 1, which is not a good idea with floating point numbers. A better approach is to take the logarithm and convert the above multiplication into the summation

$$\log P = \sum_{i=1}^{n} \log P_i$$
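The floating-point problem can be demonstrated directly: with a few thousand made-up per-sample probabilities, the direct product underflows to zero while the sum of logarithms stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.1, 0.9, size=2000)  # 2000 per-sample probabilities, all < 1

# The direct product underflows to 0.0 in double precision
print(np.prod(p))          # 0.0

# Summing logarithms keeps a usable value
print(np.sum(np.log(p)))   # a large but finite negative number
```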

## Cross-Entropy cost function

Let's define $P_i$ as follows

$$P_i = \hat{y}_i^{\,y_i} \, (1 - \hat{y}_i)^{1 - y_i}$$

When the expected value for the input $x_i$ is $y_i = 1$ we can write $P_i = \hat{y}_i$. When the expected value for the input $x_i$ is $y_i = 0$ we can write $P_i = 1 - \hat{y}_i$.
The above summation becomes

$$\log P = \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$

To convert the problem into a minimization problem we can add a minus sign in front of the formula, and we obtain what is called the Cross-Entropy cost function

$$C = - \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$

which is the negative logarithm of the probability for each input to be correctly classified.
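The cost function above can be sketched in a few lines; the labels and predictions below are made-up values, chosen to show that the cost is small when predictions agree with labels and large when they disagree:

```python
import numpy as np

def cross_entropy(y, y_hat):
    """Cross-Entropy cost: negative log-probability of classifying every input correctly."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical labels and predictions
y = [1, 0, 1]
print(cross_entropy(y, [0.9, 0.2, 0.8]))  # small cost: predictions agree with labels
print(cross_entropy(y, [0.1, 0.9, 0.2]))  # large cost: predictions disagree
```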

## Softmax Activation Function and Softmax Function

The Softmax Activation Function and the Softmax Function are two different functions. The Softmax Function represents a smoothed version of the $\max$ function and it is defined as follows

$$\operatorname{softmax}(x_1, \dots, x_n) = \log \sum_{i=1}^{n} e^{x_i} \approx \max(x_1, \dots, x_n)$$

where $(x_1, \dots, x_n)$ is the input vector. For example, the discontinuous function $\max(0, x)$ can be approximated by the softplus function $\log(1 + e^{x})$. The derivative of the softplus function is the logistic function

$$\frac{d}{dx} \log(1 + e^{x}) = \frac{e^{x}}{1 + e^{x}} = \frac{1}{1 + e^{-x}}$$

The Softmax Activation Function is a generalization of the logistic function. It is used to convert a vector of scores in $\mathbb{R}^n$ into a vector of probabilities in $(0, 1)^n$ that sum to 1. It is defined as follows

$$\operatorname{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{n} e^{z_k}} \qquad j = 1, \dots, n$$

It is normally used in a multi-class classification problem, at the output layer of a neural network, to convert output scores into output probabilities, like the logistic function does in the binary case.
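Both functions can be sketched together; the score vector below is made up, and the max is subtracted before exponentiating, a standard trick to avoid overflow that does not change the result:

```python
import numpy as np

def softplus(x):
    """Smooth approximation of max(0, x); its derivative is the logistic function."""
    return np.log1p(np.exp(x))

def softmax_activation(z):
    """Convert a score vector into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

print(softplus(np.array([-10.0, 10.0])))  # close to max(0, x): about (0, 10)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax_activation(scores)
print(probs)  # the highest score gets the highest probability; the values sum to 1
```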

## Derivative of the logistic function

### Theorem

The derivative of the logistic function is

$$\sigma'(z) = \sigma(z) \left( 1 - \sigma(z) \right)$$

### Proof

Let $\sigma$ be the logistic function defined as

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

We calculate the derivative as follows

$$\sigma'(z) = \frac{d}{dz} \left( 1 + e^{-z} \right)^{-1} = \frac{e^{-z}}{\left( 1 + e^{-z} \right)^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z) \left( 1 - \sigma(z) \right)$$

where in the last step we used $\frac{e^{-z}}{1 + e^{-z}} = 1 - \frac{1}{1 + e^{-z}} = 1 - \sigma(z)$.
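The identity can also be checked numerically, comparing a central finite difference against the closed form at a few sample points:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = 1e-6  # step for the central finite difference
for z in [-2.0, 0.0, 3.0]:
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    closed = sigmoid(z) * (1 - sigmoid(z))
    print(z, numeric, closed)  # the two values agree to several decimal places
```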

## Training-Loss function

The more training data we have, the bigger the Cross-Entropy Cost Function becomes. Dividing the Cross-Entropy Cost Function by the number of training samples we obtain the Training-Loss Function, which is a better measure of the classification error.

### Theorem

Assume we have a series of input vectors $x_1, \dots, x_n$ with $x_i \in \mathbb{R}^m$. Suppose $y_i$ is the expected label for the input $x_i$, $w$ is the weights vector and $\hat{y}_i$ is the prediction value given by

$$\hat{y}_i = \sigma(w \cdot x_i + b)$$

The gradient of the training-loss function (the average of the cross-entropy cost function)

$$L(w, b) = - \frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$

is given by

$$\frac{\partial L}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \, x_i \qquad \frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$$

### Proof

The gradient of $L$ is given by the partial derivatives $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$.
Using Gradient Descent we can use them to change $w$ and $b$ to minimize the error function:

$$w \leftarrow w - \eta \frac{\partial L}{\partial w} \qquad b \leftarrow b - \eta \frac{\partial L}{\partial b}$$

Let's call $L_i$ the $i$-th element of the summation above

$$L_i = - \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$

We first calculate the derivative of $L_i$ with respect to $\hat{y}_i$

$$\frac{\partial L_i}{\partial \hat{y}_i} = - \frac{y_i}{\hat{y}_i} + \frac{1 - y_i}{1 - \hat{y}_i} = \frac{\hat{y}_i - y_i}{\hat{y}_i (1 - \hat{y}_i)}$$

Then we calculate the derivatives of the prediction $\hat{y}_i = \sigma(w \cdot x_i + b)$ with respect to $w$ and $b$, using the derivative of the logistic function proved above:

$$\frac{\partial \hat{y}_i}{\partial w} = \hat{y}_i (1 - \hat{y}_i) \, x_i \qquad \frac{\partial \hat{y}_i}{\partial b} = \hat{y}_i (1 - \hat{y}_i)$$

Combining the two using the chain rule we obtain

$$\frac{\partial L_i}{\partial w} = (\hat{y}_i - y_i) \, x_i \qquad \frac{\partial L_i}{\partial b} = \hat{y}_i - y_i$$

From this it follows that

$$\frac{\partial L}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \, x_i \qquad \frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$$
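Putting the theorem to work, a minimal batch Gradient Descent loop for a single Sigmoid Neuron looks like this. The data set, learning rate, and epoch count are made-up choices for a toy separable problem:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sigmoid_neuron(X, y, eta=0.5, epochs=2000):
    """Batch gradient descent on the training-loss (average cross-entropy)."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        y_hat = sigmoid(X @ w + b)
        error = y_hat - y                # the terms (y_hat_i - y_i)
        w -= eta * (X.T @ error) / n     # dL/dw = mean of (y_hat_i - y_i) x_i
        b -= eta * error.mean()          # dL/db = mean of (y_hat_i - y_i)
    return w, b

# Hypothetical separable data: class 1 when x1 + x2 > 1
X = np.array([[0.0, 0.0], [0.2, 0.3], [1.0, 1.0], [0.8, 0.9]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = train_sigmoid_neuron(X, y)
print((sigmoid(X @ w + b) >= 0.5).astype(int))  # matches the labels y
```

Since the loss is convex in $w$ and $b$, plain gradient descent is enough here; no mini-batching or momentum is needed for a toy example of this size.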

## Validation Accuracy and Test Accuracy

Let's define $P$ as the matrix where each row represents the classification probabilities for a given sample, and $Y$ as the matrix where each row represents the correct classification as a one-hot label. For example, we could have 7 classes and 5 samples, so both matrices would be $5 \times 7$. Let $c$ be the vector indicating with 1 a correct prediction (the index of the largest probability in a row of $P$ matches the index of the 1 in the corresponding row of $Y$) and with 0 a wrong prediction.
The validation accuracy, measured on the given validation samples, is the average of all the values of $c$:

$$\text{accuracy} = \frac{1}{n} \sum_{i=1}^{n} c_i$$

The validation accuracy is 100% only when all predictions made on the given validation samples are correct. The test accuracy follows the same idea, but it is measured on all test samples instead.
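A sketch of the computation with illustrative matrices of 5 samples and 7 classes (the numbers are randomly generated, not from the notes; the labels are first built to match every prediction, then one label is deliberately broken):

```python
import numpy as np

def accuracy(P, Y):
    """Fraction of rows where the most probable class matches the one-hot label."""
    c = (P.argmax(axis=1) == Y.argmax(axis=1)).astype(int)
    return c.mean()

rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(7), size=5)  # 5 rows of 7 probabilities, each row sums to 1
Y = np.eye(7)[P.argmax(axis=1)]        # one-hot labels matching every prediction
print(accuracy(P, Y))                  # 1.0: all predictions correct

Y_wrong = Y.copy()
Y_wrong[0] = np.roll(Y_wrong[0], 1)    # break the first label
print(accuracy(P, Y_wrong))            # 0.8: four out of five correct
```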