
If you are familiar with logits and probability distributions, but curious about log probs, this notebook is for you.
The source notebook can be found here: https://github.com/yaoshiang/logprobs/blob/main/Logprobs.ipynb
Many LLM APIs now expose the log-probs of the tokens they generate. These give a high-resolution estimate of the model's confidence.
Most importantly, you can use these probabilities to estimate a model's confidence in its predictions and to judge how well calibrated it is.
For example, the official OpenAI cookbook on working with log-probs shows how they let you use an LLM as a calibrated classifier:
https://cookbook.openai.com/examples/using_logprobs
Before we get into log-probs, let's cover some basics.
Why do we use sigmoid and softmax as activation functions for classifiers?
Because they turn the outputs of a neural network, which live in the range (-inf, inf), into the range (0, 1). And we can interpret numbers in the range (0, 1) as percentages or probabilities.
The outputs of a neural network before applying sigmoid or softmax are called logits.
Let's make sure we can believe the statement that sigmoid turns logits, or numbers in the range (-inf, inf), into numbers in the range (0, 1). Let's pick some corner case numbers, run sigmoid, and see if the output is indeed in the range (0,1).
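Here's a minimal sketch of that check (assuming PyTorch; the specific corner-case logits are my own picks):

```python
import torch

# A few corner-case logits, from fairly negative to fairly positive.
logits = torch.tensor([-5.0, -1.0, 0.0, 1.0, 5.0])

probs = torch.sigmoid(logits)
print(probs)
# tensor([0.0067, 0.2689, 0.5000, 0.7311, 0.9933])
# Every output lands strictly inside (0, 1).
```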
Looks good! The logits do indeed get squished into the range (0, 1). This means that for the numbers we get out of a neural network, we can run the sigmoid function and interpret the results as a 1% chance, 27% chance, 50% chance, 73% chance, and 99% chance of being positive.
Softmax takes a set of logits and, like sigmoid, squishes them into the range (0, 1). It has another important attribute: the values will sum to 1.0, aka 100%. Let's test that empirically.
Suppose the output of a model classifying dogs, cats, and birds is 2.0, 1.0, and -1.0. Those clearly don't sum to one. And one of the values is negative! This is definitely not a probability distribution.
But after running softmax, we do get a probability distribution: each value is between zero and one, and the sum of 71%, 26%, and 4% is indeed 100%.
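A sketch of that check (assuming PyTorch):

```python
import torch

# Logits from a hypothetical dog/cat/bird classifier.
logits = torch.tensor([2.0, 1.0, -1.0])

probs = torch.softmax(logits, dim=0)
print(probs)        # tensor([0.7054, 0.2595, 0.0351]) -- roughly 71%, 26%, 4%
print(probs.sum())  # sums to 1.0 (up to float rounding)
```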
Now let's see what happens if we shift these logits by a constant amount. Let's add 10 to all the logits.
Interesting! The output is the same! The fancy way of saying this is that the softmax function is invariant to constant shifts in the input logits.
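A sketch of the shift experiment (assuming PyTorch):

```python
import torch

logits = torch.tensor([2.0, 1.0, -1.0])
shifted = logits + 10.0  # add the same constant to every logit

print(torch.softmax(logits, dim=0))   # tensor([0.7054, 0.2595, 0.0351])
print(torch.softmax(shifted, dim=0))  # tensor([0.7054, 0.2595, 0.0351]) -- identical
```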
However, just because softmax is mathematically invariant to constant shifts doesn't mean we can go crazy when actually doing math on silicon. A 32-bit floating point number has only about 7 significant digits (ask ChatGPT about mantissa and exponent if you wanna learn more).
So if we add a big enough number, like 100,000,000, the computer will round away the 2, 1, and -1 and just see 100,000,000. And running softmax on a vector of three identical numbers results in a useless uniform distribution: 33.3%, 33.3%, 33.3%.
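A sketch of that failure mode (assuming PyTorch and float32 tensors):

```python
import torch

logits = torch.tensor([2.0, 1.0, -1.0])
huge = logits + 1e8  # float32 only keeps about 7 significant digits

print(huge)
# tensor([1.0000e+08, 1.0000e+08, 1.0000e+08]) -- the 2, 1, and -1 were rounded away

print(torch.softmax(huge, dim=0))
# tensor([0.3333, 0.3333, 0.3333]) -- a useless uniform distribution
```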
The problem actually gets worse than that. When training a neural network, we run an algorithm called back propagation, which requires that we calculate gradients of a function. Gradients are first-order derivatives, the slope of a line. If the line is really flat, your gradients will round down to zero and your model won't train. If the line is really vertical, your gradients will blow up toward infinity.
Unfortunately, the softmax function uses exponentials, and the loss function of a model uses logs. Both of these functions get very vertical and very flat at the extremes. This means that for the shifted logits, even though they should give the same numbers after running softmax, numerical instability will instead give you NaN or inf.
Let's run a simple backprop calculation in PyTorch to prove this.
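A minimal sketch of that experiment, using a deliberately naive, hand-rolled softmax and cross-entropy (the exact logits are my guess at the notebook's values):

```python
import torch

def naive_cross_entropy(logits, true_index):
    # Deliberately unstable: exponentiate the raw logits directly.
    probs = torch.exp(logits) / torch.exp(logits).sum()
    return -torch.log(probs[true_index])

small = torch.tensor([2.0, 1.0, -1.0], requires_grad=True)
big = torch.tensor([102.0, 101.0, 99.0], requires_grad=True)

naive_cross_entropy(small, 0).backward()
print(small.grad)  # tensor([-0.2946,  0.2595,  0.0351]) -- finite, sensible gradients

naive_cross_entropy(big, 0).backward()
print(big.grad)    # tensor([nan, nan, nan]) -- exp(102.) overflows float32 to inf
```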
Notice that the gradients from the big logits (102 and 101) are NaN, or Not A Number. This means the values overflowed, which would ruin any neural network training.
We are going to have to dig into the Greek letters to figure out how to stabilize this math.
Softmax is defined as:

𝑝_𝑖 = 𝑒^(𝑧_𝑖) / Σ_𝑗 𝑒^(𝑧_𝑗)

Where 𝑧_𝑖 is the logit for class 𝑖 and the sum in the denominator runs over all 𝑘 classes.
Notice all those exponents... running 𝑒^101 is what caused our previous code to overflow into NaNs.
Cross-entropy is the loss function for a probability distribution, based on Maximum Likelihood Estimation, the theory that underpins neural network training. Cross-entropy is defined as:

𝐿 = −Σ_𝑖 𝑦_𝑖 log(𝑝_𝑖)

Where 𝑦_𝑖 is really just a switch that turns on the specific probability 𝑝_𝑖 of the true label (e.g. dog), and 𝑝_𝑖 is the output of the softmax function.
Combining the two equations allows us to calculate cross-entropy directly on the logits:

𝐿 = −Σ_𝑖 𝑦_𝑖 log( 𝑒^(𝑧_𝑖) / Σ_𝑗 𝑒^(𝑧_𝑗) )
We know that log(𝑎/𝑏) = log(𝑎) − log(𝑏), and log(𝑒^𝑥) = 𝑥, so we can rewrite the term inside the sum as:

log(𝑝_𝑖) = 𝑧_𝑖 − log( Σ_𝑗 𝑒^(𝑧_𝑗) )
Let's look deeper at the key part of that equation: it is just a constant shift of all the logits 𝑧_1, 𝑧_2, 𝑧_3, ..., 𝑧_𝑘, since the subtracted term log(Σ_𝑗 𝑒^(𝑧_𝑗)) is the same for every class. In fact, that equation is the log-softmax of the logits.
We have freed the logits 𝑧_1, 𝑧_2, 𝑧_3, ..., 𝑧_𝑘 from that nasty log-exp combo, but we haven't solved the calculation of the right hand side of the equation. We still see the log of a sum of exponentials of a series of numbers. Are we still stuck taking the exponent of big numbers?
Luckily, we have the log-sum-exp trick to stabilize that part of the equation.
Anytime you are taking the log of a sum of exponentials, know that you can factor out some of the big numbers (e.g. 𝑒^101) using the log-sum-exp technique.
First, we subtract a big number 𝑀 from every logit inside the exponent, and compensate by factoring out 𝑒^𝑀:

Σ_𝑗 𝑒^(𝑧_𝑗) = 𝑒^𝑀 · Σ_𝑗 𝑒^(𝑧_𝑗 − 𝑀)

Then, the log of a product of two values can be rewritten as the sum of two logs:

log( Σ_𝑗 𝑒^(𝑧_𝑗) ) = log(𝑒^𝑀) + log( Σ_𝑗 𝑒^(𝑧_𝑗 − 𝑀) )

And we know the log of the exponential of a value is just the value:

log( Σ_𝑗 𝑒^(𝑧_𝑗) ) = 𝑀 + log( Σ_𝑗 𝑒^(𝑧_𝑗 − 𝑀) )

So we have effectively reduced the scale of the values inside the exponentials, while adding just a constant 𝑀 to the log-sum-exp of the reduced logits. Voila! Numerical stability.
In practice, 𝑀 is chosen to be the maximum of the logits, which guarantees no exponential ever overflows. And because the shifted logits 𝑧_𝑗 − 𝑀 are now small, we can safely compute their log-sum-exp directly, without any further tricks. Subtract 𝑀 and that second value from every logit, and you have the exact log-probs.
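A sketch of that recipe in code (assuming PyTorch; torch.logsumexp and torch.log_softmax implement the same trick internally):

```python
import torch

logits = torch.tensor([102.0, 101.0, 99.0])

# Naive log-sum-exp overflows: exp(102.) is inf in float32.
print(torch.log(torch.exp(logits).sum()))  # tensor(inf)

# Log-sum-exp trick: shift by M = max(logits) before exponentiating.
M = logits.max()
lse = M + torch.log(torch.exp(logits - M).sum())
print(lse)                             # ~102.349
print(torch.logsumexp(logits, dim=0))  # same value, built in

# Subtracting that one constant from every logit gives the log-probs.
print(logits - lse)                    # matches torch.log_softmax(logits, dim=0)
```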
The log-probs output by APIs like OpenAI's are really just the logits shifted by a constant amount so that they have some nicer properties than raw logits.
Since probabilities are always between zero and one, a log-prob will always be less than zero. A high-confidence prediction would have a probability of, say, 0.999, yielding a log-prob of about -1e-3. A low-confidence prediction would have a log-prob that looks more like -0.5 (a probability of about 61%).
If you want to calculate the actual probability of a specific token, just take the exponent of the log-prob. No need to worry about the other 30,521 possible tokens the model could have output, or to calculate any sum: a single log-prob for a single token is all you need to know the probability of that token.
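For example (the log-prob here is a made-up illustrative value, not real API output):

```python
import math

token_logprob = -0.0317  # a single token's log-prob, as an API might return it
print(math.exp(token_logprob))  # ~0.969: the model gave this token about a 97% probability
```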
Log-probs can be fed into a loss function almost directly, with no extra logs, exponents, cross-entropy, or softmax needed. The loss function for this in PyTorch is NLLLoss:
https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html
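A minimal usage sketch (my own example values, using the functional form of the same loss):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, -1.0]])  # a batch of one example, three classes
target = torch.tensor([0])                 # the index of the true class

log_probs = F.log_softmax(logits, dim=-1)  # numerically stable log-probs
loss = F.nll_loss(log_probs, target)       # just picks out -log_probs[0, target]

print(loss)                             # ~0.349
print(F.cross_entropy(logits, target))  # same number: cross-entropy = log_softmax + NLL
```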
Let's try to prove some of the math above. First, set up some logits and calculate the log-softmax. We see that indeed, all the values are negative:
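A sketch of that cell (assuming PyTorch and reusing the earlier logits):

```python
import torch

logits = torch.tensor([2.0, 1.0, -1.0])
log_probs = torch.log_softmax(logits, dim=0)

print(log_probs)
# tensor([-0.3490, -1.3490, -3.3490]) -- every log-prob is negative
```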
Now let's see if the logits and the log-probs output the same probability distribution. Yup, that checks out too!
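A sketch of that check (same logits as above):

```python
import torch

logits = torch.tensor([2.0, 1.0, -1.0])
log_probs = torch.log_softmax(logits, dim=0)

# Log-probs are just a constant shift of the logits,
# so softmax of either gives the identical distribution.
print(torch.softmax(logits, dim=0))     # tensor([0.7054, 0.2595, 0.0351])
print(torch.softmax(log_probs, dim=0))  # tensor([0.7054, 0.2595, 0.0351])
```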
Finally, let's prove that we can calculate the probabilities simply by running the exp function, not the softmax, when we have the log_probs. Once again, things look good.
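A sketch of that check (same logits again):

```python
import torch

logits = torch.tensor([2.0, 1.0, -1.0])
log_probs = torch.log_softmax(logits, dim=0)

# exp() alone recovers the probabilities -- no softmax, no normalizing sum.
print(torch.exp(log_probs))          # tensor([0.7054, 0.2595, 0.0351])
print(torch.softmax(logits, dim=0))  # identical
```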
Lastly, let's make sure that this is indeed numerically stable. Can we get good gradients with big numbers? Recall that we previously ran a naive approach to calculating this gradient and got NaN for gradients.
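A sketch of that check (assuming PyTorch; same big logits as the earlier NaN example):

```python
import torch
import torch.nn.functional as F

big = torch.tensor([[102.0, 101.0, 99.0]], requires_grad=True)
target = torch.tensor([0])

# log_softmax + nll_loss apply the log-sum-exp trick under the hood.
loss = F.nll_loss(F.log_softmax(big, dim=-1), target)
loss.backward()

print(big.grad)
# tensor([[-0.2946,  0.2595,  0.0351]]) -- finite gradients, no NaNs
```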
Yup! We got good gradients despite large logits!
If you're an SAP customer, you may be accessing OpenAI models through SAP Generative AI Hub.
According to the documentation, when you access OpenAI models there, you are actually accessing Azure OpenAI models, and indeed, the SAP documentation points you to the Azure OpenAI docs.
And Azure OpenAI, like the direct OpenAI API, gives you access to the log-probs output by an LLM like GPT-4o. With those log-probs, you can build calibrated classifiers and maybe even flag chatbot answers that the model itself isn't confident about.
If you've been programming in TensorFlow, you might be a little worried right now: you've been activating your network's final layer with softmax and then using the categorical_crossentropy loss function, just like you were taught in that basic ML course. Were you risking numerical instability?
Don't worry: TF/Keras detects this combination, ignores the softmax you applied, and operates directly on the underlying logits.
When it operates on the logits, it uses the log_softmax function, applying the constant shift to the logits via the numerically stable log-sum-exp implementation.
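If you'd rather not rely on that behind-the-scenes handling, you can make the stable path explicit by handing the loss raw logits with from_logits=True. A small sketch (my own example values):

```python
import tensorflow as tf

y_true = tf.constant([[1.0, 0.0, 0.0]])
big_logits = tf.constant([[102.0, 101.0, 99.0]])

# The loss applies a stable log-softmax internally, so big logits are fine.
loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
print(loss_fn(y_true, big_logits).numpy())  # ~0.349, no overflow or NaN
```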