
PyTorch: Additional Concepts in Building NN

by JK from Korea 2023. 5. 15.

<PyTorch: Additional Concepts in Building NN> 

 

Date: 2023.05.09               

 

* The PyTorch series will mainly touch on the problems I faced. For the actual code, check out my GitHub repository.

 

 

[What are “Logits” in Machine Learning?]

It’s been a while, but nonetheless, job not finished.

 

Over the course of studying neural networks, I had not encountered the term “logits” at all.

What are Logits

 

Logits are the unnormalized predictions (outputs) of a model. They can still be used to rank classes, but without a normalization step such as softmax it is difficult to interpret the raw values as probabilities. Here is a great explanation of logits.

 

https://datascience.stackexchange.com/questions/31041/what-does-logits-in-machine-learning-mean
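As a quick illustration (a minimal sketch of my own, not code from the repository; the tensor values are made up), here is how raw logits compare with the probabilities you get after normalizing them with softmax in PyTorch:

import torch

# Raw, unnormalized model outputs ("logits") for three classes.
logits = torch.tensor([2.0, -1.0, 0.5])

# Softmax normalizes them into a probability distribution.
probs = torch.softmax(logits, dim=0)

print(probs)        # roughly tensor([0.786, 0.039, 0.175])
print(probs.sum())  # tensor(1.) -- probabilities sum to 1

The logits themselves can be ranked, but only the softmax output can be read directly as class probabilities.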

 


 

[Logit Flow Chart]

 [Non-linearity]

I must admit this is a topic that has been frequently covered, but I overlooked its importance.

 

Non-linearity is quite intuitive: it literally means functions that are not linear. Some common examples we have seen so far include, but are not limited to, ReLU, Sigmoid, and tanh (i.e., the activation functions).

 

From layer to layer, NNs are a composition of linear functions. However, the true power lies in the nodes that apply an activation function, typically ReLU. Not all relationships in the input data can be explained linearly, which is why we apply non-linearity through activation functions to uncover the hidden structure of the input data.
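To make this concrete, here is a small hypothetical sketch (not the model from my repository) contrasting a stack of purely linear layers with one that inserts ReLU between them:

import torch
from torch import nn

# Two Linear layers with no activation collapse into a single linear map.
linear_only = nn.Sequential(
    nn.Linear(4, 8),
    nn.Linear(8, 2),   # still equivalent to one Linear(4, 2)
)

# Inserting ReLU between the layers makes the network nonlinear,
# letting it model relationships a single linear map cannot.
with_relu = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 2),
)

x = torch.randn(1, 4)
print(linear_only(x), with_relu(x))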

 

This discussion leads to an interesting post I found.

Why not use non-linearity before final Softmax layer? 

 

The reason we typically do not use a non-linearity on the output layer of a CNN is that the softmax function already applies a non-linear transformation to its inputs. Softmax is a normalized exponential function that maps the outputs of the previous layer to a probability distribution over the output classes. This means each softmax output is constrained to lie between 0 and 1, and the outputs sum to 1.

 

Adding another non-linearity before the softmax function could distort this probability distribution and make it harder to interpret. For example, if we were to apply a ReLU activation right before the softmax, every negative logit would be clamped to zero, throwing away the information that distinguishes the unlikely classes from one another and flattening their probabilities.

 

Furthermore, using a non-linearity before the softmax function could make the model more complex and harder to train. This is because the softmax function already provides a non-linear decision boundary between the output classes, and adding another non-linearity could make the decision boundary overly complex and harder to learn.

 

In summary, the ReLU non-linearity (now used almost exclusively) would in this case simply throw away information without adding any benefit.
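In PyTorch this usually shows up as the pattern sketched below (a toy example of my own, not the post's actual model): the last layer returns raw logits, and nn.CrossEntropyLoss applies log-softmax internally, so no extra activation is placed before the softmax step.

import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),            # non-linearity inside the network is fine
    nn.Linear(32, 3),     # final layer outputs raw logits -- no ReLU here
)

criterion = nn.CrossEntropyLoss()   # expects logits; applies log-softmax itself

x = torch.randn(5, 10)               # dummy batch of 5 samples
targets = torch.randint(0, 3, (5,))  # dummy class labels
loss = criterion(model(x), targets)
print(loss)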

 

[Mathematical Definition of Non-linearity]

Now that we have an intuitive idea of what a nonlinear function is, let’s define it mathematically.

 

A function is considered nonlinear if it does not satisfy the property of superposition, the defining characteristic of linear functions. In mathematical terms, a function f(x) is linear if it satisfies the following property:

 

f(a * x₁ + b * x₂) = a * f(x₁) + b * f(x₂)

 

where a and b are constants, and x₁ and x₂ are variables. A function is nonlinear if this equality fails for some choice of inputs: passing a weighted sum of inputs through the function does not give the same result as passing each input through separately, scaling, and summing. For example, quadratic, exponential, logarithmic, trigonometric, and sigmoid functions are nonlinear because they do not satisfy this superposition property.

 

On the other hand, a linear function such as y = mx, where m is a constant, satisfies the property of superposition, and any linear combination of such functions is also linear. (Strictly speaking, y = mx + b with b ≠ 0 is an affine function rather than a linear one, since the constant offset breaks superposition.)
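A quick numerical check makes the distinction obvious. In the toy snippet below (values chosen purely for illustration), a simple scaling function passes the superposition test while ReLU fails it:

import torch

a, b = 1.0, 1.0
x1, x2 = torch.tensor(1.0), torch.tensor(-1.0)

def scale(x):              # f(x) = 4x is linear
    return 4 * x

# Linear: f(a*x1 + b*x2) equals a*f(x1) + b*f(x2)
print(scale(a * x1 + b * x2))                    # tensor(0.)
print(a * scale(x1) + b * scale(x2))             # tensor(0.)

# Nonlinear: the same check fails for ReLU
print(torch.relu(a * x1 + b * x2))               # tensor(0.)
print(a * torch.relu(x1) + b * torch.relu(x2))   # tensor(1.)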

 

[Binary Cross Entropy Loss Function]

Binary cross entropy loss is a type of loss function that is commonly used in neural network training for binary classification tasks. It measures the difference between the predicted probability distribution and the true probability distribution of the binary classes.

Binary cross entropy loss can be expressed as follows:

 

L(y, ŷ) = -[y * log(ŷ) + (1-y) * log(1-ŷ)]

 

where:

- y is the true label of the input data (either 0 or 1)

- ŷ is the probability predicted by the neural network for the input data (a value between 0 and 1)

 

The formula consists of two terms, one active when the true label is 1 (y=1) and the other when the true label is 0 (y=0). When y=1, only the first term y * log(ŷ) contributes, and it grows large when the predicted probability ŷ is far from 1. When y=0, only the second term (1-y) * log(1-ŷ) contributes, penalizing predictions far from 0.

 

The overall binary cross entropy loss is then calculated by averaging (or summing) the individual losses over all training examples.
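To tie the formula back to PyTorch, the sketch below (made-up labels and predictions) computes the loss by hand and compares it with the built-in nn.BCELoss, which averages over the batch by default:

import torch
from torch import nn

y     = torch.tensor([1.0, 0.0, 1.0])   # true labels
y_hat = torch.tensor([0.9, 0.2, 0.6])   # predicted probabilities

# Manual binary cross entropy, averaged over the three examples.
manual = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).mean()

builtin = nn.BCELoss()(y_hat, y)

print(manual, builtin)   # both about 0.28

If the model outputs raw logits rather than probabilities, nn.BCEWithLogitsLoss applies the sigmoid internally and is numerically more stable.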
