Relationship between logits & sigmoid

Single-label vs Multi-label vs Multi-class

Single-label refers to the case where a prediction is expected to have a single label, e.g. either a cat or a dog or a tiger. Note that this can be binary, where there are only two choices, or multi-class, where there are more than two. Single-label contrasts with multi-label, where a prediction can carry multiple labels. For example, this blog post might require multiple tags, or a photo might require multiple people to be identified.
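
As a concrete illustration, single-label targets are usually encoded one-hot, while multi-label targets are multi-hot. A minimal sketch, with made-up classes and tags:

```python
import numpy as np

# Hypothetical single-label problem with three classes: (cat, dog, tiger).
# Exactly one entry is 1 (one-hot) because the classes are mutually exclusive.
single_label_target = np.array([0, 1, 0])   # this sample is "dog"

# Hypothetical multi-label problem with three tags: (python, statistics, pytorch).
# Any number of entries can be 1 (multi-hot) because the tags are not exclusive.
multi_label_target = np.array([1, 0, 1])    # this post is tagged "python" and "pytorch"

print(single_label_target.sum())  # always exactly 1 for single-label
print(multi_label_target.sum())   # can be 0, 1, 2, ... for multi-label
```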

In the final layer of a neural network, the raw outputs, or logits, need to be normalized into probabilities. We focus here on the single-label case (which covers both binary and multi-class), where the normalized probabilities sum to 1 because the labels are mutually exclusive and collectively exhaustive. For example, a cat/dog/tiger classifier might assign an image probabilities of 0.7, 0.2, and 0.1: every class receives some probability and the total is 1.

Single-label variables

Some definitions:

  • Logits: The raw, unnormalized outputs of the final layer of a neural network, typically denoted $z$. In binary classification this can be a single value per sample; in multi-class classification there are multiple $z_i$, one for each class. For generality, implementations often use two output dimensions even for binary classification.
  • Sigmoid $\sigma(z)$: this maps a real number $z \in (-\infty, \infty)$, the range of a logit, to the interval $[0,1]$. This allows the sigmoid output to be interpreted as a probability.
    $\sigma(z) = \frac{e^z}{1+e^z}:= p$
  • Log-odds: In turn, this allows $z$, or logits, to be interpreted as log-odds because:
    $z = \sigma^{-1}(p)=\log\frac{p}{1-p}$
  • Multi-class: The logits have multiple dimensions, one for each of the $k$ classes: $z_i,\ i \in \{1, 2, \dots, k\}$.
  • Softmax: The multi-class analog of the sigmoid is the softmax (a numeric sketch follows this list).
    $\sigma(z_i)=\frac{e^{z_i}}{\sum_j e^{z_j}} := p_i$

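A minimal numeric sketch of the definitions above, using plain NumPy (the logit values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # Map a logit z in (-inf, inf) to a probability p in (0, 1).
    return np.exp(z) / (1 + np.exp(z))

def log_odds(p):
    # Inverse of the sigmoid: z = log(p / (1 - p)).
    return np.log(p / (1 - p))

def softmax(z):
    # Map a vector of k logits to k probabilities that sum to 1.
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

z = 0.5                                 # a made-up binary logit
p = sigmoid(z)
print(p, log_odds(p))                   # log_odds(p) recovers z (up to floating point)

z_multi = np.array([2.0, 1.0, 0.1])     # made-up logits for k = 3 classes
p_multi = softmax(z_multi)
print(p_multi, p_multi.sum())           # the probabilities sum to 1
```
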
One confusing point (for me) was that I expected the softmax to reduce to the sigmoid when $k=2$. The subtlety is that the sigmoid operates on a one-dimensional logit, with the alternative class implicitly treated as the complement, while the softmax in the binary case requires a two-dimensional logit, one entry per class. In other words, you can't design a neural network for binary prediction and simply decide at the end whether to use torch's softmax or sigmoid function: the model architecture, i.e. whether the logit output has one or two dimensions, drives the choice. There is no limiting case in which you can practically swap one for the other.
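
A small PyTorch sketch of this point (the logit values are made up; the aim is only to show how the output dimensionality forces the choice of sigmoid vs. softmax):

```python
import torch

# Hypothetical binary model with a ONE-dimensional logit output, shape (batch, 1).
one_dim_logits = torch.tensor([[0.5], [-1.2]])
print(torch.sigmoid(one_dim_logits))           # P(positive class) per sample
print(torch.softmax(one_dim_logits, dim=1))    # always 1.0: softmax over a single value is useless

# Hypothetical binary model with a TWO-dimensional logit output, shape (batch, 2).
two_dim_logits = torch.tensor([[0.2, 0.7], [1.5, -0.3]])
print(torch.softmax(two_dim_logits, dim=1))    # each row sums to 1
print(torch.sigmoid(two_dim_logits).sum(dim=1))  # elementwise sigmoid: rows do not sum to 1
```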

This will be more apparent when we look at loss functions in the next post.