Distinguishing uncertainties helps us better understand neural networks.
Neural networks are riddled with uncertainty. After feeding an input through a neural network to retrieve an output, we cannot be sure that the output we got is a correct description of reality. Moreover, the uncertainty of a neural network needs to be separated into two different kinds:
1. Aleatoric Uncertainty: Inherent to the data itself; does not shrink with more data.
2. Epistemic Uncertainty: Due to limited data; shrinks as we gather more data.
Separating uncertainty into these two independent components gives us a better understanding of how and what neural nets actually learn. Moreover, dealing with the two types of uncertainty requires vastly different techniques, as we will see in a bit.
But first things first. Let’s start slow and make the problem more concrete with an example.
Let’s say we are building medical software to predict patients’ risks of heart attacks. For the prediction, we use a neural net that takes as input a patient’s data consisting of age, height, and weight. As output, the network produces a percentage, e.g. 2 %, meaning the patient has a 2 % chance of having a heart attack within the next 10 years.
Needless to say, we assume we did everything by the book. We had a high-quality dataset to train on, split the data into training, validation, and test sets, and designed and evaluated multiple architectures. As a result, we ended up with a neural net that we believe predicts heart attack risks as well as possible.
Now Peter, a patient, comes along. We feed Peter’s data into our neural network and it spits out a 40 % heart attack risk! That’s a very high risk, and Peter would understandably want to know how certain our prediction is.
The uncertainty of our prediction could have two sources:
1. People with the same data as Peter (age, height, weight) might have very different risks of heart attacks. What our network outputs is just the mean of all these potential risks. More realistic than a single mean value would be a probability distribution over possible risks. The more spread out this distribution is (the higher its variance), the higher the uncertainty of our prediction. This is what’s called aleatoric uncertainty.
2. The second potential source of uncertainty is the following: maybe Peter’s data is quite special, and during training we encountered nothing similar, or only very few data points like his. The input is essentially unfamiliar to our neural net, so it has no clue whatsoever and just outputs 40 % because it has to give some output. This kind of uncertainty is very different from aleatoric uncertainty and is called epistemic uncertainty.
Knowing these two kinds of uncertainty and the difference between them doesn’t yet help us advise Peter. The neural network spits out a 40 % risk, take it or leave it. We have no way of figuring out whether the neural net is dead certain or has no clue at all.
So, how can we build a neural network that also tells us how certain it is? Because aleatoric and epistemic uncertainties are so different, we need to attack them with different techniques.
Dealing With Aleatoric Uncertainty
Remember, aleatoric uncertainty is the uncertainty inherent to the data. No matter how much training data we gather, there will always be people with the same age, height, and weight but different heart attack risks. So, instead of a single prediction per input, we change the neural net to output a probability distribution.
How can we do that? First, we choose a type of distribution, for example the normal distribution N(μ,σ²). The normal distribution has two parameters: the mean μ and the variance σ².
Now, instead of our neural net producing a single heart attack risk percentage, we change it to output a value for the mean μ and a value for the variance σ².
Then, the loss function is adapted so that the trained network’s outputs μ and σ² maximize the likelihood of observing the training data, which is the same as minimizing the negative log-likelihood. Essentially, our neural network was only predicting the mean μ, but now it additionally predicts the variance σ², the aleatoric uncertainty, from the data.
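To make this concrete, here is a minimal sketch of such a loss in plain TensorFlow. The function name gaussian_nll and the convention that the network emits two columns (the mean and a raw variance value) are our own illustrative choices, not a fixed API:

```python
import tensorflow as tf

def gaussian_nll(y_true, y_pred):
    """Negative log-likelihood of y_true under N(mu, sigma^2).

    Assumes the network outputs two columns: the mean mu and a raw
    value squashed through softplus to obtain a positive variance.
    """
    mu = y_pred[:, 0:1]
    # softplus keeps the variance positive; the small constant avoids log(0)
    sigma_sq = tf.math.softplus(y_pred[:, 1:2]) + 1e-6
    # Per-sample negative log-likelihood, up to an additive constant
    nll = 0.5 * tf.math.log(sigma_sq) + 0.5 * tf.square(y_true - mu) / sigma_sq
    return tf.reduce_mean(nll)
```

During training, the network can lower this loss on hard-to-predict patients by increasing σ² instead of forcing the mean closer to every label, which is exactly how the variance output comes to encode the aleatoric uncertainty.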
Fun fact: minimizing the MSE loss (mean squared error loss) is the same as maximizing the likelihood under the distribution N(μ,1), meaning the variance σ² is not learned but fixed to 1.
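To see why, write out the negative log-likelihood of a single observation y under N(μ,1):

$$-\log \mathcal{N}(y \mid \mu, 1) = \frac{1}{2}(y - \mu)^2 + \frac{1}{2}\log(2\pi)$$

The second term is a constant, so summing over the training set and minimizing this quantity is exactly the same as minimizing the squared error between y and μ.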
If you want to get your hands dirty and try these concepts out for yourself, TensorFlow has a pretty awesome extension, TensorFlow Probability. You can just choose your distribution and the library takes care of everything else.
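For instance, a regression model that learns both μ and σ² can be sketched roughly as follows, following the pattern from the TensorFlow Probability regression examples. The layer sizes and the input dimension of 3 (age, height, weight) are our own illustrative choices:

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# The last Dense layer emits two numbers per input; the DistributionLambda
# layer interprets them as the mean and (raw) scale of a normal distribution.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(3,)),
    tf.keras.layers.Dense(2),
    tfp.layers.DistributionLambda(
        lambda t: tfd.Normal(loc=t[..., 0:1],
                             scale=1e-3 + tf.math.softplus(t[..., 1:2]))),
])

# The loss is the negative log-likelihood of the observed risks
# under the distribution the model predicts.
model.compile(optimizer="adam", loss=lambda y, dist: -dist.log_prob(y))
```

After training, model(x) returns a distribution object, so model(x).mean() gives the risk prediction and model(x).stddev() its aleatoric uncertainty.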
Back to our uncertainties. We made our neural network tell us the heart attack risk percentage together with the uncertainty within the data it has seen. But what if there isn’t enough data? What if, coming back to our example, Peter’s data is special and the training data contains only very few data points similar to his? Then our network will just give us some random risk percentage and some random variance, but basically it will have no idea. This, intuitively, explains why aleatoric and epistemic uncertainties are independent and why we have to tackle epistemic uncertainty separately.
Dealing With Epistemic Uncertainty
Epistemic uncertainty is uncertainty due to incomplete information: we haven’t seen all the data. In most real-world scenarios, we will never have all the data about the problem at hand, so some epistemic uncertainty will always remain. Still, epistemic uncertainty shrinks with more data.
Let’s pause for a moment and think about how we could model the current epistemic uncertainty of our neural net. What are we actually uncertain about? What would change if we were given more data?
The answer is right in front of us: our network’s weights. With more data, our weights would change. We are uncertain about our weights. So how about we model our current epistemic uncertainty by replacing the fixed numbers we use as weights with proper probability distributions?
This is exactly what Bayesian neural networks (BNNs) are. BNNs view weights not as numbers but as probability distributions.
They update these weight distributions with more data using Bayesian inference (that’s where the name comes from). BNNs are more expensive to train, but in addition to the network’s prediction we get a number telling us how epistemically uncertain the network is.
If you want to give Bayesian neural networks a go, TensorFlow Probability supports them as well.
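A rough sketch of what this could look like: TFP’s DenseFlipout layers keep a learned distribution over each weight and draw a fresh sample on every forward pass, so the spread over repeated predictions serves as a Monte Carlo estimate of the epistemic uncertainty. The helper predict_with_uncertainty, the layer sizes, and the sample count of 100 are our own illustrative choices, and a real model would of course need to be trained first:

```python
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

# Each DenseFlipout layer maintains a probability distribution over its
# weights and samples fresh weights on every forward pass.
bnn = tf.keras.Sequential([
    tfp.layers.DenseFlipout(16, activation="relu"),
    tfp.layers.DenseFlipout(1),
])

def predict_with_uncertainty(model, x, n_samples=100):
    """Monte Carlo estimate of the prediction and its epistemic spread."""
    preds = np.stack([model(x).numpy() for _ in range(n_samples)])
    # Different weight samples yield different predictions; their standard
    # deviation reflects how uncertain the network is about its weights.
    return preds.mean(axis=0), preds.std(axis=0)
```

For Peter, a wide spread across these samples would tell us that the 40 % output comes with high epistemic uncertainty, i.e. the network has barely seen data like his.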
Conclusion
Outputs of neural nets are always loaded with uncertainty. This uncertainty can be due to variance in the data (aleatoric uncertainty) or due to not having seen all the data (epistemic uncertainty). Both types of uncertainty can be addressed and quantified with their own techniques.
Medical predictions, like Peter’s heart attack risk in our example, are only one area where it is essential to know how uncertain a neural net actually is. In safety-critical environments, where human lives are often at stake, probabilistic deep learning techniques can make neural networks safer and more reliable.
This article was written by guest author Marcel Moosbrugger from TU Munich. Marcel also appeared on the ACIT Science Podcast, in an episode on the origins of computer science. Feel free to reach out to him with questions and comments on LinkedIn.