Confidence Scores of Neural Networks
Henry Jones and Daniel Lewinsohn
May 2023
1 Introduction
Image-processing neural networks for semantic segmentation and classification tasks continue to
improve in accuracy and efficiency, and these tools hold great promise in areas where classification
is otherwise expensive and the consequences of errors are high. As such, well-calibrated estimates of
a model's confidence in its classification of a particular sample are crucial for applications such as
medical diagnosis, where low-confidence classifications could be reviewed by expert practitioners
while 'easier' diagnoses are accelerated.
Confidence estimates are mainly used as criteria within a model's decision-making process, where
the estimate should reflect the expected sample accuracy, or as a threshold for the model to rescind
its classification. Despite increasing generalization accuracy (accuracy on test classifications), many
modern neural networks have been shown to report higher confidences than their accuracy warrants,
which suggests the need for improved model calibration [1, 2].
ImageNet [3] is a commonly used data set for training and benchmarking machine learning algorithms
for image recognition. It contains millions of images belonging to 1000 different classes.
Recent years have seen growing success of Deep Neural Networks (DNNs) in classifying these
images; top performers include ResNet [4], ShuffleNet [5], and DenseNet [6]. Impressively, these
DNNs are able to accurately assign unseen images to one of the 1000 classes present in ImageNet,
serving as stepping stones toward more general computer vision.
To explore the calibration and confidence estimates of these popular image recognition models,
we generated class predictions and confidence values from each model on a set of over 40,000 images
from ImageNet. We then explored the calibration of these models by visualizing their confidence
versus accuracy for each prediction and calculating the Expected Calibration Error (ECE) [7]. We
proceeded to estimate the sampling distribution of the ECE for each model via non-parametric
bootstrap sampling, allowing us to better understand the potential variability of each model's
calibration and to perform hypothesis tests comparing the models.
We found that DenseNet and ResNet show superior calibration to ShuffleNet. However, all three
models show overconfidence in the majority of predictions. We also demonstrate the successful
application of bootstrap sampling and hypothesis testing to confirm these findings.
2 Estimates of Confidence and Calibration
Following the work of [1], we will introduce several key concepts involved with model calibration
and confidence estimates.
Definition 1 (Well-Calibrated Confidence Estimate). A confidence estimate is considered well
calibrated if the estimate is, by some criterion, 'close' to the probability of the particular input
being correctly classified.
In practice the exact probability of correct classification is unknown, so estimates of this probability
are often computed by taking the sample average accuracy over all data points of the same class
with the same set of features. For smaller sample sizes, grouping data with similar features can be
a sufficient approximation.
Particularly in settings where a confidence estimate is used by an expert or by the model itself in
a decision-making process, the interpretability of the confidence estimate as a probability is crucial.
For instance, Jiang et al., as discussed in [1], describe a procedure in which confidence estimates
from ICU mortality models are used to determine the continuation or discontinuation of various
therapies. In such a scenario, interpreting the confidence estimate as a probability is far more
conducive to decision making than relying on more abstract quantities.
2.1 Framework
Central to much of the analysis later in this paper are reliability diagrams, which visualize the
calibration of a model by displaying the relationship between confidence and accuracy. Such
diagrams rely on the principles we detail below.
Consider a classification problem where the training and testing data are realizations of the
same distribution. Given $K$ different classes, we assume the data are of the form $(x, y)$, where
$y \in \{1, \ldots, K\}$ denotes the class of $x$. We can then define the model $h$ by $h(x) = (\hat{y}, \hat{p})$,
where $\hat{y}$ is the model's prediction of the class label of the particular $x$ and $\hat{p}$ is the confidence value
associated with that prediction. In theory, a perfectly calibrated model's confidence values
would match the expected sample accuracy.
In practice, however, it is highly unlikely that there are a sufficient number of essentially identical
data points with which one could approximate the expected sample accuracy, so some grouping
is required. If we group model predictions into $M$ bins of equal width based on their confidence
values, we can define the accuracy of a particular bin $B_m$.
Definition 2 (Accuracy of a Bin).
$$\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}(\hat{y}_i = y_i).$$
Here $\mathbf{1}$ is the indicator function, which returns 1 if $\hat{y}_i = y_i$ and 0 otherwise. We can also define
the average confidence of $B_m$.
Definition 3 (Confidence of a Bin).
$$\mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i.$$
Having defined these statistics, the reliability diagram is created by plotting a bar of height
$\mathrm{acc}(B_m)$ for each bin $B_m$, $m \in \{1, \ldots, M\}$. A bin $B_m$ is considered perfectly calibrated if
$\mathrm{acc}(B_m) = \mathrm{conf}(B_m)$. Reliability diagrams give information about the degree of calibration
within each bin, but they do not provide insight into the miscalibration across all predictions, as
the number of samples per bin is not considered.
For this reason, the Expected Calibration Error (ECE) is an informative summary statistic: it
captures calibration across all predictions by weighting the difference between accuracy and
confidence in each bin by the number of samples in that bin.
Definition 4 (Expected Calibration Error (ECE)).
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|,$$
where $n$ is the total number of predictions.
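To make these definitions concrete, the following Python sketch (an illustration of Definitions 2-4, not the code used for this paper) computes per-bin accuracy, per-bin confidence, and the ECE from arrays of true labels, predicted labels, and confidence values; the choice of ten bins is arbitrary here.

    import numpy as np

    def expected_calibration_error(y_true, y_pred, confidences, n_bins=10):
        """Compute the ECE by binning predictions on confidence (Definitions 2-4)."""
        y_true = np.asarray(y_true)
        y_pred = np.asarray(y_pred)
        confidences = np.asarray(confidences)
        n = len(confidences)

        # Equal-width confidence bins on [0, 1].
        bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for m in range(n_bins):
            # Assign each prediction to a bin by its confidence value.
            in_bin = (confidences > bin_edges[m]) & (confidences <= bin_edges[m + 1])
            if m == 0:
                # Include predictions with confidence exactly 0 in the first bin.
                in_bin = in_bin | (confidences == bin_edges[0])
            if not np.any(in_bin):
                continue
            acc_bin = np.mean(y_pred[in_bin] == y_true[in_bin])   # acc(B_m)
            conf_bin = np.mean(confidences[in_bin])               # conf(B_m)
            ece += (np.sum(in_bin) / n) * abs(acc_bin - conf_bin) # weighted gap
        return ece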
Having defined the relevant statistics, we turn to our analysis.
3 Methods
Treating the ECE as our estimator of calibration, we sought to employ various techniques from
this class to better understand the estimator for three popular pre-trained image classification
neural networks: ShuffleNet V2, DenseNet, and ResNet. After generating reliability diagrams
for each model, we used non-parametric bootstrap to construct the sampling distribution and
confidence intervals for the each model’s ECE estimator. With the confidence interval established,
we then examined hypothesis tests to compare observed ECE estimators to an ideal, perfectly
calibrated model. The code used to generate the data and results in this paper is available at
https://github.com/lewinsohndp/nn-confidence.
3.1 Generating Predictions for Each Model
We generated predictions and confidence scores for three models (ShuffleNet V2, DenseNet 121, and
ResNet 18) on a set of 41,101 images from ImageNet. These images were pre-labeled with one of 1000
classes and were not included in any model's training set. To avoid the high computational
cost of training any of these models, we used the pre-trained versions available from PyTorch.
Each model contains 1000 output nodes corresponding to the 1000 classes in ImageNet. After the
application of a SoftMax layer, each model’s output for a single image can be treated as a 1000-
dimensional vector of probabilities for each class. Top-1 and top-5 error rates were calculated for
each model from these outputs. For further analysis, we chose the class with the highest probability
as the model’s prediction and recorded both the probability/confidence score and the model’s class
prediction.
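As a rough sketch of this prediction pipeline (not the exact script in our repository), the snippet below loads one pre-trained torchvision model, applies the standard ImageNet preprocessing, and records the top class and its softmax probability for a single image; the exact `weights` argument and preprocessing details may differ depending on the torchvision version.

    import torch
    import torchvision
    from torchvision import transforms
    from PIL import Image

    # Standard ImageNet preprocessing (the usual ImageNet mean/std values).
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    # Load a pre-trained model; the weights argument depends on the torchvision version.
    model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
    model.eval()

    def predict(image_path):
        """Return (predicted class index, confidence) for one image file."""
        img = Image.open(image_path).convert("RGB")
        x = preprocess(img).unsqueeze(0)               # add a batch dimension
        with torch.no_grad():
            logits = model(x)                          # shape (1, 1000)
            probs = torch.softmax(logits, dim=1)[0]    # softmax over the 1000 classes
        conf, pred = torch.max(probs, dim=0)
        return pred.item(), conf.item()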
3.2 Construction of Reliability Diagrams and Calculation of ECE
We used the reliability-diagrams GitHub repository [8] to create reliability diagrams and calculate
the ECE for each model. We provided vectors containing class predictions, probabilities/confidence
values, and true class labels to the relevant functions.
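The repository's own functions handle the plotting; purely for illustration (and not as that repository's API), a minimal reliability diagram of the kind shown in Figures 1-3 could also be drawn directly with matplotlib, assuming the same prediction vectors described above:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_reliability_diagram(y_true, y_pred, confidences, n_bins=10):
        """Plot per-bin accuracy against confidence, with the ideal diagonal."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        confidences = np.asarray(confidences)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        centers = (edges[:-1] + edges[1:]) / 2

        accs = np.zeros(n_bins)
        for m in range(n_bins):
            in_bin = (confidences > edges[m]) & (confidences <= edges[m + 1])
            if np.any(in_bin):
                accs[m] = np.mean(y_pred[in_bin] == y_true[in_bin])  # acc(B_m)

        fig, ax = plt.subplots()
        ax.bar(centers, accs, width=1.0 / n_bins, edgecolor="black", label="Accuracy")
        ax.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Perfect calibration")
        ax.set_xlabel("Confidence")
        ax.set_ylabel("Accuracy")
        ax.legend()
        return fig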
3.3 Non-parametric Bootstrap
We used non-parametric bootstrap to generate a sampling distribution for the ECE estimator of
each model. For each model's prediction set (consisting of class predictions, probabilities, and true
labels) we did the following. We first drew a random sample of size 41,101 (the size of the prediction
set for each model) with replacement from the set of predictions. We then calculated the ECE of
this new sample. We repeated this process 4,000 times to yield a bootstrap sampling distribution
for the ECE of a given model.
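A minimal sketch of this resampling loop, assuming the expected_calibration_error helper sketched in Section 2 and arrays holding one model's predictions:

    import numpy as np

    def bootstrap_ece(y_true, y_pred, confidences, n_boot=4000, seed=0):
        """Non-parametric bootstrap sampling distribution of the ECE."""
        y_true = np.asarray(y_true)
        y_pred = np.asarray(y_pred)
        confidences = np.asarray(confidences)
        rng = np.random.default_rng(seed)
        n = len(y_true)
        eces = np.empty(n_boot)
        for b in range(n_boot):
            # Resample prediction indices with replacement (same sample size n).
            idx = rng.integers(0, n, size=n)
            eces[b] = expected_calibration_error(y_true[idx], y_pred[idx], confidences[idx])
        return eces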
3.4 Bootstrap Confidence Intervals
To construct the 95% confidence intervals from the sampling distribution of the ECE for each
model, we simply calculate the empirical 2.5% and 97.5% quantiles from the generated sampling
distributions.
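A minimal sketch of this step, assuming an array of bootstrap ECE values such as the output of the bootstrap_ece sketch above:

    import numpy as np

    def bootstrap_ci(eces, level=0.95):
        """Percentile bootstrap confidence interval from the empirical quantiles."""
        alpha = 1.0 - level
        lower, upper = np.percentile(eces, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        return lower, upper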
3.5 Hypothesis Tests
Let $X_i$ be the random variable representing a particular model's ECE statistic. By the in-class
result on large-sample bootstrap approximations of an estimator, we assume that the $X_i$'s are
distributed approximately normally with unknown mean $\mu$ and unknown variance $\sigma^2$; that is,
$X_i \sim N(\mu, \sigma^2)$.
Then consider the hypothesis test
$$H_0: \mu = \mu_0 \qquad \text{versus} \qquad H_1: \mu \neq \mu_0,$$
where we compare the ECE estimator with a particular value $\mu_0$ ($\mu_0 = 0$ in the case of comparison
with a perfectly calibrated model). We then define our test statistic $T(\mathbf{X})$ as
$$T(\mathbf{X}) = \frac{\bar{X}_n - \mu_0}{\tilde{\sigma}/\sqrt{n}},$$
where $n$ is the sample size and $\tilde{\sigma}$ is the square root of the unbiased sample variance. Consider the
power function $\Pi(\mu)$ for some threshold value $c > 0$:
\begin{align*}
\Pi(\mu) &= \Pr(|T| \geq c \mid \mu) = 1 - \Pr(|T| < c) = 1 - \Pr\!\left( -c < \frac{\bar{X}_n - \mu_0}{\tilde{\sigma}/\sqrt{n}} < c \right) \\
&= 1 - \Pr\!\left( -\frac{c\tilde{\sigma}}{\sqrt{n}} + \mu_0 < \bar{X}_n < \frac{c\tilde{\sigma}}{\sqrt{n}} + \mu_0 \right) \\
&= 1 - \Pr\!\left( -\frac{c\tilde{\sigma}}{\sqrt{n}} + \mu_0 - \mu < \bar{X}_n - \mu < \frac{c\tilde{\sigma}}{\sqrt{n}} + \mu_0 - \mu \right) \\
&= 1 - \Pr\!\left( -c + \frac{\sqrt{n}(\mu_0 - \mu)}{\tilde{\sigma}} < \frac{\sqrt{n}(\bar{X}_n - \mu)}{\tilde{\sigma}} < c + \frac{\sqrt{n}(\mu_0 - \mu)}{\tilde{\sigma}} \right).
\end{align*}
We have now bounded the random variable $\sqrt{n}(\bar{X}_n - \mu)/\tilde{\sigma}$, which we know to be t-distributed with
$n - 1$ degrees of freedom, so we can rewrite the power function in terms of the cdf of the
t-distribution, $T_{\mathrm{cdf}}(x, n - 1)$:
$$\Pi(\mu) = 1 - \left[ T_{\mathrm{cdf}}\!\left( c + \frac{\sqrt{n}(\mu_0 - \mu)}{\tilde{\sigma}},\; n - 1 \right) - T_{\mathrm{cdf}}\!\left( -c + \frac{\sqrt{n}(\mu_0 - \mu)}{\tilde{\sigma}},\; n - 1 \right) \right].$$
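For illustration, this power function can be evaluated numerically with SciPy's t distribution; the sketch below is not part of our analysis code and assumes values for the sample standard deviation, the sample size, and the threshold are supplied.

    import numpy as np
    from scipy import stats

    def power(mu, mu0, sigma_tilde, n, c):
        """Power of the two-sided test at mean mu, per the formula above."""
        shift = np.sqrt(n) * (mu0 - mu) / sigma_tilde
        return 1.0 - (stats.t.cdf(c + shift, df=n - 1) - stats.t.cdf(-c + shift, df=n - 1))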
From the definition of the p-value for such a test $\delta_c$, which rejects $H_0$ if $|T| \geq c$, we can
compute the p-value as follows:
$$\sup_{\mu \in \Omega_0} \Pi(\mu \mid \delta_c) = \Pi(\mu_0 \mid \delta_c) = \Pr(|T| \geq c \mid \mu_0).$$
Thus for an observed test statistic $T = t$, we calculate the p-value as
$$\Pr(|T| \geq |t| \mid \mu_0) = 1 - \left[ T_{\mathrm{cdf}}(|t|, n - 1) - T_{\mathrm{cdf}}(-|t|, n - 1) \right] = 2\left( 1 - T_{\mathrm{cdf}}(|t|, n - 1) \right).$$
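A minimal sketch of this calculation using SciPy's t distribution, assuming the bootstrap ECE values for one model are passed in (scipy.stats.ttest_1samp gives the same result):

    import numpy as np
    from scipy import stats

    def one_sample_t_test(x, mu0=0.0):
        """Two-sided one-sample t test of H0: mu = mu0; returns (t statistic, p-value)."""
        x = np.asarray(x)
        n = len(x)
        t_stat = (np.mean(x) - mu0) / (np.std(x, ddof=1) / np.sqrt(n))
        p_value = 2.0 * (1.0 - stats.t.cdf(abs(t_stat), df=n - 1))
        return t_stat, p_value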
Additionally, the comparison between the distributions of the ECE estimators for the three models
we compare in this paper can be understood with the following hypothesis test:
$$H_0: \mu_1 - \mu_2 = 0 \qquad \text{versus} \qquad H_1: \mu_1 - \mu_2 \neq 0.$$
Based on our observations of the three models' ECE estimator distributions seen in Figure 4, and
the computed ratios of sample variances, we assumed that, despite the distributions having unknown
means and variances, all variances $\sigma^2$ are equal; this avoids the complications of Welch's t-test and
simplifies our calculations.
We proceed by defining our two samples $\mathbf{X}$ and $\mathbf{Y}$, of sizes $m$ and $n$ respectively, with the
quantities
$$S_X^2 = \sum_{i=1}^{m} \left( X_i - \bar{X}_m \right)^2 \qquad \text{and} \qquad S_Y^2 = \sum_{i=1}^{n} \left( Y_i - \bar{Y}_n \right)^2.$$
If we define the test statistic $T(\mathbf{X}, \mathbf{Y})$ as
$$T(\mathbf{X}, \mathbf{Y}) = \frac{\sqrt{m + n - 2}\,\left( \bar{X}_m - \bar{Y}_n \right)}{\sqrt{\frac{1}{m} + \frac{1}{n}}\,\sqrt{S_X^2 + S_Y^2}},$$
the distribution of the random variable $T$ is the t distribution with $m + n - 2$ degrees of freedom
when $H_0$ is true.
The proof of this assertion is given in the proof of Theorem 9.6.1 in the course text, where it is
shown that for $\mu_1 = \mu_2$, $T(\mathbf{X}, \mathbf{Y})$ can be written as
$$T(\mathbf{X}, \mathbf{Y}) = \frac{Z}{\sqrt{\dfrac{W}{m + n - 2}}},$$
where $Z$ follows the standard normal distribution and $W$ has the $\chi^2$ distribution with $m + n - 2$
degrees of freedom. This is seen by letting
$$Z = \frac{\bar{X}_m - \bar{Y}_n}{\sigma \sqrt{\frac{1}{m} + \frac{1}{n}}} \qquad \text{and} \qquad W = \frac{S_X^2 + S_Y^2}{\sigma^2}.$$
To compute the p-value of this test, we first note that the two-sided t test of size $\alpha_0$ rejects $H_0$
if $|T(\mathbf{X}, \mathbf{Y})| \geq c$, where $c = T_{\mathrm{cdf}}^{-1}\!\left( 1 - \frac{\alpha_0}{2},\; m + n - 2 \right)$. Then for an observed value $t$ of the test statistic,
the p-value is the size of the test that rejects $H_0$ when $|T(\mathbf{X}, \mathbf{Y})| \geq |t|$. Since $T(\mathbf{X}, \mathbf{Y})$
has the t distribution with $m + n - 2$ degrees of freedom when $H_0$ is true, the size of that test is
the probability that a t-distributed random variable with $m + n - 2$ degrees of freedom falls
outside $\pm|t|$. Using the t distribution cdf, this probability is
$$T_{\mathrm{cdf}}(-|t|,\; m + n - 2) + 1 - T_{\mathrm{cdf}}(|t|,\; m + n - 2) = 2\left[ 1 - T_{\mathrm{cdf}}(|t|,\; m + n - 2) \right],$$
where we have used the symmetry of the t distribution. This is the desired p-value. We note this
result could have been found by a method similar to the one used for the prior, simpler hypothesis
test via the power function; however, those power-function computations would be more intensive.
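A minimal sketch of this pooled two-sample test; SciPy's scipy.stats.ttest_ind with equal_var=True implements the same test and serves as a cross-check:

    import numpy as np
    from scipy import stats

    def pooled_two_sample_t_test(x, y):
        """Two-sided two-sample t test assuming equal variances; returns (t, p-value)."""
        x, y = np.asarray(x), np.asarray(y)
        m, n = len(x), len(y)
        s2_x = np.sum((x - x.mean()) ** 2)   # S_X^2 as defined above
        s2_y = np.sum((y - y.mean()) ** 2)   # S_Y^2 as defined above
        t_stat = (np.sqrt(m + n - 2) * (x.mean() - y.mean())) / (
            np.sqrt(1.0 / m + 1.0 / n) * np.sqrt(s2_x + s2_y)
        )
        p_value = 2.0 * (1.0 - stats.t.cdf(abs(t_stat), df=m + n - 2))
        return t_stat, p_value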
4 Results
4.1 ResNet
The ResNet model obtained a top-1 error of 30.1% and top-5 error of 10.8%. These results are
comparable to the errors reported on the PyTorch website for this model.
Figure 1 shows the reliability diagram for the ResNet model. Interestingly, for predictions of
lower confidence, we see underconfidence from the model: the sample accuracies are higher than the
confidence values. However, once the model's confidence exceeds roughly 0.4, the model becomes
overconfident, with sample accuracy falling below the model's confidence. We also observe from the
confidence histogram that the majority of the ResNet model's predictions are made with high
confidence, suggesting that in general ResNet is overconfident in its predictions. Overall, ResNet is
fairly well calibrated, with an ECE of 2.65 (scaled by a factor of 100).
4.2 DenseNet
The DenseNet model obtained a top-1 error of 25.5% and top-5 error of 7.8%. Again, these error
rates are comparable to those reported on the PyTorch website.
Figure 2 shows the reliability diagram for the DenseNet model. We notice results similar to
ResNet, with the model transitioning from underconfident to overconfident, though DenseNet makes
this transition at a slightly higher confidence of about 0.45. The confidence histogram again shows
that the majority of predictions are made with high confidence. In general, DenseNet also
demonstrates good calibration, with an ECE of 2.51.
4.3 ShuffleNet
The ShuffleNet model obtained a top-1 error of 30.5% and top-5 error of 11.6%. The error rates
are comparable to those reported on the PyTorch website for this model.
Figure 1: Reliability diagram for the ResNet model (resnet18, ECE = 2.65). The top panel plots
accuracy against confidence, with a dashed line indicating an ideal, perfectly calibrated model and
bars showing the model's actual calibration; the bottom panel is a histogram of the number of
predictions at each confidence level.
Figure 2: Reliability diagram for the DenseNet model (densenet121, ECE = 2.51). The top panel
plots accuracy against confidence, with a dashed line indicating an ideal, perfectly calibrated model
and bars showing the model's actual calibration; the bottom panel is a histogram of the number of
predictions at each confidence level.
Figure 3: Reliability diagram for the ShuffleNet model (shufflenet_v2_x1_0, ECE = 7.28). The top
panel plots accuracy against confidence, with a dashed line indicating an ideal, perfectly calibrated
model and bars showing the model's actual calibration; the bottom panel is a histogram of the
number of predictions at each confidence level.
Figure 4: Histogram showing the counts of ECE values from non-parametric bootstrap for each
model. Vertical lines indicate 95% confidence intervals for the ECE.
Figure 3 shows the reliability diagram for the ShuffleNet model. ShuffleNet demonstrates
consistent overconfidence, with this trend worsening as the model's confidence increases. Again, we
observe that the majority of the model's predictions are made with high confidence. ShuffleNet
exhibits the worst calibration of any model, with an ECE of 7.28.
4.4 Sampling Distributions of the ECEs
Figure 4 shows all three sampling distributions of the ECE from each model. These were generated
from non-parametric bootstrap as described previously. The mean of each distribution corresponds
well with the observed ECE, with DenseNet having the lowest, followed by ResNet, and lastly
ShuffleNet. We also observe that these distributions appear normally distributed, allowing us to
apply our hypothesis tests. It is also clear that the ECE sampling distributions for DenseNet and
ResNet are highly overlapping, while the ECE sampling distribution for ShuffleNet is centered much higher
than the others.
4.5 Confidence Intervals for the ECE
Figure 4 also shows the 95% confidence intervals for the ECE from each model. For ResNet, the
confidence interval is 0.0239 to 0.0302. For DenseNet, the confidence interval is 0.0232 to 0.0291.
Finally, for ShuffleNet, the confidence interval is 0.0699 to 0.0763. All of the intervals are fairly
similar in size. Additionally, the high overlap between the DenseNet and ResNet ECE sampling
distributions is confirmed by the overlapping confidence intervals.
4.6 Comparing ECEs to the Ideal ECE
For a perfectly calibrated model the ECE would be zero. To compare our observed ECE for each
model to this ideal, we employed the first hypothesis test described above. We let $\mu_0 = 0$ and
reject the null hypothesis at a significance level of 0.01. We assume each sampling distribution of
ECEs is normally distributed.
For all three models we reject the null hypothesis that the ECE is equal to 0. In all three cases,
the p-value is reported as 0 by Python, i.e., it is below floating-point precision. We can therefore
say with very high confidence that none of these image classification models is perfectly calibrated.
4.7 Comparing Model ECEs
We were also interested in comparing ECEs between models to discover whether any model was
better calibrated than another. We used the two-sample hypothesis test described above to test the
difference between the means of each pair of models' sampling distributions. We assume the
variances of the sampling distributions to be equal, as the computed variances are very similar
(2.65e-10, 2.13e-10, and 2.57e-10). Between ShuffleNet and DenseNet we calculated a test statistic
of 1359.6 and a p-value of 0. Between ShuffleNet and ResNet we calculated a test statistic of 1271.8
and a p-value of 0. Finally, between ResNet and DenseNet we calculated a test statistic of -30.1 and
a p-value of zero. Using SciPy's ttest_ind function we found the same test statistics, but a p-value
of 7.24e-189 between ResNet and DenseNet. Overall, this hypothesis testing procedure shows a
significant difference between the means of the models' ECE sampling distributions. Given that the
mean ECE for DenseNet is the lowest, we can confidently say that DenseNet is the best calibrated
model we tested, followed by ResNet and then ShuffleNet.
5 Discussion
Prior to our implementation, we introduced and explored the relevant statistics involved in
image-classification neural network calibration and laid out a theoretical framework for the results
of our implementation. This framework involved several key assumptions and brief derivations of
the test statistics and formulae for the p-values of the hypothesis tests which we planned to use to
compare the distributions of the ECE test statistic.
We successfully explored the calibration of state-of-the-art image classification neural net-
works through their confidence values. We first generated predictions from three models, ResNet,
DenseNet, and ShuffleNet, on over 40,000 images from the ImageNet data set. We then constructed
reliability diagrams and calculated the ECE for each model to understand the model’s overall cal-
ibration. We next used non-parametric bootstrap to construct sampling distributions for the ECE
of each model. With these sampling distributions, we were able to calculate confidence intervals for
the ECE of each model. Next, we used hypothesis tests to both compare each model’s ECE to the
ideal, and to each other.
Our main findings were that none of our tested models were perfectly calibrated on the ImageNet
data set. We found DenseNet to have the best calibration, ResNet to have the second best, and
ShuffleNet to be far behind. Additionally, each model is miscalibrated in the sense of being
overconfident in the majority of its predictions. We were able to confirm these findings via the
sampling distribution of the ECE, confidence intervals, and hypothesis tests.
This project was limited in its breadth as we only studied three different models on one data
set. As a result, we are unable to make general conclusions about the calibration of specific neural
network architectures on general data sets. We are also limited in that we could not test calibration
techniques, since we did not train any of the tested models ourselves.
Going forward, it would be exciting to explore the calibration of more models on more data sets.
This would allow us to make more general conclusions about neural network calibration.
Additionally, it would be very interesting to test which techniques during training result in excellent
model calibration. There are a variety of interesting results on the ways in which particular loss
functions or regularization techniques impact model calibration in [1]. Finally, it
would be exciting to investigate the calibration of neural networks in different fields such as natural
language processing and audio.
Although limited, we can conclude that DenseNet and ResNet have superior calibration to
ShuffleNet on the ImageNet data set. We can also confirm that DNNs remain limited in their ability
to provide well-calibrated confidence scores for their predictions. This is important to consider as
DNNs are introduced to new fields where misclassification can be both costly and dangerous. Going
forward, researchers training and creating these models should consider calibration a key component
of both a model's efficacy and its safety.
References
[1] Nikita Vemuri. Scoring confidence in neural networks. 2020.
[2] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High
confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 427–436, 2015.
[3] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng
Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-
Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer
Vision (IJCV), 115(3):211–252, 2015.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 770–778, 2016.
[5] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines
for efficient cnn architecture design. In Proceedings of the European conference on computer
vision (ECCV), pages 116–131, 2018.
[6] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely con-
nected convolutional networks. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 4700–4708, 2017.
[7] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated
probabilities using bayesian binning. In Proceedings of the AAAI conference on artificial intel-
ligence, volume 29, 2015.
[8] hollance. reliability-diagrams: Reliability diagrams visualize whether a classifier model needs
calibration. GitHub repository.