Confidence Scores of Neural Networks
Henry Jones and Daniel Lewinsohn
May 2023
1 Introduction
Image-processing neural networks for semantic segmentation and classification tasks continue to
improve in accuracy and efficiency, and these tools hold great promise in areas where classification
is otherwise expensive and the consequences of errors are high. As such, well-calibrated estimates of
a model's confidence in its classification of a particular sample are crucial for applications such as
medical diagnosis, where low-confidence classifications could be reviewed by expert practitioners
while 'easier' diagnoses are accelerated.
Confidence estimates are mainly used as criteria within a model's decision-making process, where
the estimate should reflect the expected sample accuracy, or as a threshold for the model to rescind
its classification. Despite increasing generalization accuracy (accuracy on test classifications), many
modern neural networks have been shown to report higher confidences than their accuracy warrants,
which suggests the need for improved model calibration [1, 2].
ImageNet [3] is a commonly used data set for training and benchmarking machine learning algorithms
for image recognition. It contains millions of images belonging to 1000 different classes.
Recent years have seen growing success of Deep Neural Networks (DNNs) in classifying these
images; top performers include ResNet [4], ShuffleNet [5], and DenseNet [6]. Impressively, these
DNNs are able to accurately assign unseen images to one of the 1000 classes present in ImageNet,
serving as stepping stones toward more general computer vision.
To explore the calibration and confidence estimates of these popular image recognition models,
we generated class predictions and confidence values from each model on a set of over 40,000 images
from ImageNet. We then explored the calibration of these models by visualizing their confidence
versus accuracy for each prediction and calculating the Expected Calibration Error (ECE) [7]. We
proceeded to estimate the sampling distribution of the ECE for each model via non-parametric
bootstrap sampling, allowing us to better understand the potential variability of each model's
calibration and to perform hypothesis tests comparing the models.
We found that DenseNet and ResNet show superior calibration to ShuffleNet. However, all three
models show overconfidence in the majority of predictions. We also demonstrate the successful
application of bootstrap sampling and hypothesis testing to confirm these findings.
2 Estimates of Confidence and Calibration
Following the work of [1], we will introduce several key concepts involved with model calibration
and confidence estimates.
Definition 1 (Well-Calibrated Confidence Estimate). A confidence estimate is considered well
calibrated if the estimate is, by some criterion, 'close' to the probability of the particular input
being correctly classified.
In practice the exact probability of correct classification is unknown, so estimates of this probability
are often computed by taking the sample average accuracy over all data points of the same class
with the same set of features. For smaller sample sizes, grouping data with similar features can be
a sufficient approximation.
Particularly in settings where a confidence estimate is used by an expert or by the model itself in
a decision-making process, the interpretability of the confidence estimate as a probability is crucial.
For instance, Jiang et al., as discussed in [1], describe a procedure in which confidence estimates
from ICU mortality models are used to determine the continuation or discontinuation of various
therapies. In such a scenario, interpreting the confidence estimate as a probability is far more
conducive to decision making than relying on more abstract quantities.
2.1 Framework
Central to much of the analysis later in this paper are reliability diagrams, which visualize the
calibration of a model by displaying the relationship between confidence and accuracy. Such
diagrams rely on the principles we detail below.
Consider a classification problem where the training and testing data are realizations of the
same distribution. Given $K$ different classes, we assume the data are of the form $(x, y)$, where
$y \in \{1, \ldots, K\}$ denotes the class of $x$. We can then define the model $h$ by $h(x) = (\hat{y}, \hat{p})$,
where $\hat{y}$ is the model's prediction of the class label of the particular $x$ and $\hat{p}$ is the confidence value
associated with that prediction. In theory, a perfectly calibrated model's confidence values
would match the expected sample accuracy.
In practice, however, it is highly unlikely that there are a sufficient number of essentially identical
data points with which one could approximate the expected sample accuracy, so some grouping
is required. If we group model predictions into $M$ bins of equal width based on their confidence
values, we can define the accuracy of a particular bin $B_m$.
Definition 2 (Accuracy of a Bin).
$$\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}(\hat{y}_i = y_i).$$
Here $\mathbf{1}$ is the indicator function, which returns 1 if $\hat{y}_i = y_i$ and 0 otherwise. We can also define
the average confidence of $B_m$.
Definition 3 (Confidence of a Bin).
$$\mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i.$$
Having defined these statistics, the reliability diagram is created by plotting a bar of height
$\mathrm{acc}(B_m)$ for each bin $B_m$, $m \in \{1, \ldots, M\}$. A bin $B_m$ is considered perfectly calibrated if
$\mathrm{acc}(B_m) = \mathrm{conf}(B_m)$. Reliability diagrams give information about the degree of calibration
within each bin, but they do not provide insight into the miscalibration across all predictions, as
the number of samples per bin is not considered.
For this reason, the Expected Calibration Error (ECE) is an informative summary statistic: it
captures calibration across all predictions by weighting the difference between accuracy and
confidence in each bin by the number of samples in that bin.
Definition 4 (Expected Calibration Error (ECE)).
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|,$$
where $n$ is the total number of predictions.
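To make these definitions concrete, the following Python sketch (an illustration of Definitions 2-4, not the code used for this paper) computes per-bin accuracy, per-bin confidence, and the ECE from arrays of true labels, predicted labels, and confidence values; the choice of ten bins is arbitrary here.

    import numpy as np

    def expected_calibration_error(y_true, y_pred, confidences, n_bins=10):
        """Compute the ECE by binning predictions on confidence (Definitions 2-4)."""
        y_true = np.asarray(y_true)
        y_pred = np.asarray(y_pred)
        confidences = np.asarray(confidences)
        n = len(confidences)

        # Equal-width confidence bins on [0, 1].
        bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for m in range(n_bins):
            # Assign each prediction to a bin by its confidence value.
            in_bin = (confidences > bin_edges[m]) & (confidences <= bin_edges[m + 1])
            if m == 0:
                # Include predictions with confidence exactly 0 in the first bin.
                in_bin = in_bin | (confidences == bin_edges[0])
            if not np.any(in_bin):
                continue
            acc_bin = np.mean(y_pred[in_bin] == y_true[in_bin])   # acc(B_m)
            conf_bin = np.mean(confidences[in_bin])               # conf(B_m)
            ece += (np.sum(in_bin) / n) * abs(acc_bin - conf_bin) # weighted gap
        return ece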
Having defined the relevant statistics, we turn to our analysis.
3 Methods
Treating the ECE as our estimator of calibration, we sought to employ various techniques from
this class to better understand the estimator for three popular pre-trained image classification
neural networks: ShuffleNet V2, DenseNet, and ResNet. After generating reliability diagrams
for each model, we used non-parametric bootstrap to construct the sampling distribution and
confidence intervals for the each model’s ECE estimator. With the confidence interval established,
we then examined hypothesis tests to compare observed ECE estimators to an ideal, perfectly
calibrated model. The code used to generate the data and results in this paper is available at
https://github.com/lewinsohndp/nn-confidence.
3.1 Generating Predictions for Each Model
We generated predictions and confidence scores for three models (ShuffleNet V2, DenseNet 121, and
ResNet 18) on a set of 41,101 images from ImageNet. These images were pre-labeled with one of 1000
classes and were not included in any model's training set. To avoid the high computational
cost of training any of these models, we used the pre-trained versions available from PyTorch.
Each model contains 1000 output nodes corresponding to the 1000 classes in ImageNet. After the
application of a SoftMax layer, each model’s output for a single image can be treated as a 1000-
dimensional vector of probabilities for each class. Top-1 and top-5 error rates were calculated for
each model from these outputs. For further analysis, we chose the class with the highest probability
as the model’s prediction and recorded both the probability/confidence score and the model’s class
prediction.
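As a rough sketch of this prediction pipeline (not the exact script in our repository), the snippet below loads one pre-trained torchvision model, applies the standard ImageNet preprocessing, and records the top class and its softmax probability for a single image; the exact `weights` argument and preprocessing details may differ depending on the torchvision version.

    import torch
    import torchvision
    from torchvision import transforms
    from PIL import Image

    # Standard ImageNet preprocessing (the usual ImageNet mean/std values).
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    # Load a pre-trained model; the weights argument depends on the torchvision version.
    model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
    model.eval()

    def predict(image_path):
        """Return (predicted class index, confidence) for one image file."""
        img = Image.open(image_path).convert("RGB")
        x = preprocess(img).unsqueeze(0)               # add a batch dimension
        with torch.no_grad():
            logits = model(x)                          # shape (1, 1000)
            probs = torch.softmax(logits, dim=1)[0]    # softmax over the 1000 classes
        conf, pred = torch.max(probs, dim=0)
        return pred.item(), conf.item()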
3.2 Construction of Reliability Diagrams and Calculation of ECE
We used the reliability-diagrams GitHub repository [8] to create reliability diagrams and calculate
the ECE for each model. We provided vectors containing class predictions, probabilities/confidence
values, and true class labels to the relevant functions.
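The repository's own functions handle the plotting; purely for illustration (and not as that repository's API), a minimal reliability diagram of the kind shown in Figures 1-3 could also be drawn directly with matplotlib, assuming the same prediction vectors described above:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_reliability_diagram(y_true, y_pred, confidences, n_bins=10):
        """Plot per-bin accuracy against confidence, with the ideal diagonal."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        confidences = np.asarray(confidences)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        centers = (edges[:-1] + edges[1:]) / 2

        accs = np.zeros(n_bins)
        for m in range(n_bins):
            in_bin = (confidences > edges[m]) & (confidences <= edges[m + 1])
            if np.any(in_bin):
                accs[m] = np.mean(y_pred[in_bin] == y_true[in_bin])  # acc(B_m)

        fig, ax = plt.subplots()
        ax.bar(centers, accs, width=1.0 / n_bins, edgecolor="black", label="Accuracy")
        ax.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Perfect calibration")
        ax.set_xlabel("Confidence")
        ax.set_ylabel("Accuracy")
        ax.legend()
        return fig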
3.3 Non-parametric Bootstrap
We used non-parametric bootstrap to generate a sampling distribution for the ECE estimator of
each model. For each model's prediction set (consisting of class predictions, probabilities, and true
labels) we did the following. We first drew a random sample of size 41,101 (the size of the prediction
set for each model) with replacement from the set of predictions. We then calculated the ECE of
this new sample. We repeated this process 4,000 times to yield a bootstrap sampling distribution
for the ECE of a given model.
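A minimal sketch of this resampling loop, assuming the expected_calibration_error helper sketched in Section 2 and arrays holding one model's predictions:

    import numpy as np

    def bootstrap_ece(y_true, y_pred, confidences, n_boot=4000, seed=0):
        """Non-parametric bootstrap sampling distribution of the ECE."""
        y_true = np.asarray(y_true)
        y_pred = np.asarray(y_pred)
        confidences = np.asarray(confidences)
        rng = np.random.default_rng(seed)
        n = len(y_true)
        eces = np.empty(n_boot)
        for b in range(n_boot):
            # Resample prediction indices with replacement (same sample size n).
            idx = rng.integers(0, n, size=n)
            eces[b] = expected_calibration_error(y_true[idx], y_pred[idx], confidences[idx])
        return eces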
3.4 Bootstrap Confidence Intervals
To construct the 95% confidence intervals from the sampling distribution of the ECE for each
model, we simply calculate the empirical 2.5% and 97.5% quantiles from the generated sampling
distributions.
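A minimal sketch of this step, assuming an array of bootstrap ECE values such as the output of the bootstrap_ece sketch above:

    import numpy as np

    def bootstrap_ci(eces, level=0.95):
        """Percentile bootstrap confidence interval from the empirical quantiles."""
        alpha = 1.0 - level
        lower, upper = np.percentile(eces, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        return lower, upper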
3.5 Hypothesis Tests
Let $X_i$ be the random variable representing a particular model's ECE statistic. By the in-class
result on large-sample bootstrap approximations of an estimator, we assume that the $X_i$'s are
distributed approximately normally with unknown mean $\mu$ and unknown variance $\sigma^2$; that is,
$X_i \sim N(\mu, \sigma^2)$.
Then consider the hypothesis test
$$H_0: \mu = \mu_0 \qquad \text{versus} \qquad H_1: \mu \neq \mu_0,$$
where we compare the ECE estimator with a particular value $\mu_0$ ($\mu_0 = 0$ in the case of comparison
with a perfectly calibrated model). We then define our test statistic $T(\mathbf{X})$ as
$$T(\mathbf{X}) = \frac{\bar{X}_n - \mu_0}{\tilde{\sigma}/\sqrt{n}},$$
where $n$ is the sample size and $\tilde{\sigma}$ is the square root of the unbiased sample variance. Consider the
power function $\Pi(\mu)$ for some threshold value $c > 0$:
\begin{align*}
\Pi(\mu) &= \Pr(|T| \geq c \mid \mu) = 1 - \Pr(|T| < c) = 1 - \Pr\!\left( -c < \frac{\bar{X}_n - \mu_0}{\tilde{\sigma}/\sqrt{n}} < c \right) \\
&= 1 - \Pr\!\left( -\frac{c\tilde{\sigma}}{\sqrt{n}} + \mu_0 < \bar{X}_n < \frac{c\tilde{\sigma}}{\sqrt{n}} + \mu_0 \right) \\
&= 1 - \Pr\!\left( -\frac{c\tilde{\sigma}}{\sqrt{n}} + \mu_0 - \mu < \bar{X}_n - \mu < \frac{c\tilde{\sigma}}{\sqrt{n}} + \mu_0 - \mu \right) \\
&= 1 - \Pr\!\left( -c + \frac{\sqrt{n}(\mu_0 - \mu)}{\tilde{\sigma}} < \frac{\sqrt{n}(\bar{X}_n - \mu)}{\tilde{\sigma}} < c + \frac{\sqrt{n}(\mu_0 - \mu)}{\tilde{\sigma}} \right).
\end{align*}
We have now bounded the random variable $\sqrt{n}(\bar{X}_n - \mu)/\tilde{\sigma}$, which we know to be t-distributed with
$n - 1$ degrees of freedom, so we can rewrite the power function in terms of the cdf of the
t-distribution, $T_{\mathrm{cdf}}(x, n - 1)$:
$$\Pi(\mu) = 1 - \left[ T_{\mathrm{cdf}}\!\left( c + \frac{\sqrt{n}(\mu_0 - \mu)}{\tilde{\sigma}},\; n - 1 \right) - T_{\mathrm{cdf}}\!\left( -c + \frac{\sqrt{n}(\mu_0 - \mu)}{\tilde{\sigma}},\; n - 1 \right) \right].$$
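For illustration, this power function can be evaluated numerically with SciPy's t distribution; the sketch below is not part of our analysis code and assumes values for the sample standard deviation, the sample size, and the threshold are supplied.

    import numpy as np
    from scipy import stats

    def power(mu, mu0, sigma_tilde, n, c):
        """Power of the two-sided test at mean mu, per the formula above."""
        shift = np.sqrt(n) * (mu0 - mu) / sigma_tilde
        return 1.0 - (stats.t.cdf(c + shift, df=n - 1) - stats.t.cdf(-c + shift, df=n - 1))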
From the definition of the p-value for such a test $\delta_c$, which rejects $H_0$ if $|T| \geq c$, we can
compute the p-value as follows:
$$\sup_{\mu \in \Omega_0} \Pi(\mu \mid \delta_c) = \Pi(\mu_0 \mid \delta_c) = \Pr(|T| \geq c \mid \mu_0).$$
Thus for an observed test statistic $T = t$, we calculate the p-value as
$$\Pr(|T| \geq |t| \mid \mu_0) = 1 - \left[ T_{\mathrm{cdf}}(|t|, n - 1) - T_{\mathrm{cdf}}(-|t|, n - 1) \right] = 2\left( 1 - T_{\mathrm{cdf}}(|t|, n - 1) \right).$$
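A minimal sketch of this calculation using SciPy's t distribution, assuming the bootstrap ECE values for one model are passed in (scipy.stats.ttest_1samp gives the same result):

    import numpy as np
    from scipy import stats

    def one_sample_t_test(x, mu0=0.0):
        """Two-sided one-sample t test of H0: mu = mu0; returns (t statistic, p-value)."""
        x = np.asarray(x)
        n = len(x)
        t_stat = (np.mean(x) - mu0) / (np.std(x, ddof=1) / np.sqrt(n))
        p_value = 2.0 * (1.0 - stats.t.cdf(abs(t_stat), df=n - 1))
        return t_stat, p_value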
Additionally, the comparison between the distributions of the ECE estimators for the three models
we compare in this paper can be understood with the following hypothesis test:
$$H_0: \mu_1 - \mu_2 = 0 \qquad \text{versus} \qquad H_1: \mu_1 - \mu_2 \neq 0.$$
Based on our observations of the three models' ECE estimator distributions seen in Figure 4, and
the computed ratios of sample variances, we assumed that, despite the distributions having unknown
means and variances, all variances $\sigma^2$ are equal; this avoids the complications of Welch's t-test and
simplifies our calculations.
We proceed by defining our two samples $\mathbf{X}$ and $\mathbf{Y}$, of sizes $m$ and $n$ respectively, with the
quantities
$$S_X^2 = \sum_{i=1}^{m} \left( X_i - \bar{X}_m \right)^2 \qquad \text{and} \qquad S_Y^2 = \sum_{i=1}^{n} \left( Y_i - \bar{Y}_n \right)^2.$$
If we define the test statistic $T(\mathbf{X}, \mathbf{Y})$ as
$$T(\mathbf{X}, \mathbf{Y}) = \frac{\sqrt{m + n - 2}\,\left( \bar{X}_m - \bar{Y}_n \right)}{\sqrt{\frac{1}{m} + \frac{1}{n}}\,\sqrt{S_X^2 + S_Y^2}},$$
the distribution of the random variable $T$ is the t distribution with $m + n - 2$ degrees of freedom
when $H_0$ is true.
The proof of this assertion is given in the proof of Theorem 9.6.1 in the course text, where it is
shown that for $\mu_1 = \mu_2$, $T(\mathbf{X}, \mathbf{Y})$ can be written as
$$T(\mathbf{X}, \mathbf{Y}) = \frac{Z}{\sqrt{\dfrac{W}{m + n - 2}}},$$
where $Z$ follows the standard normal distribution and $W$ has the $\chi^2$ distribution with $m + n - 2$
degrees of freedom. This is seen by letting
$$Z = \frac{\bar{X}_m - \bar{Y}_n}{\sigma \sqrt{\frac{1}{m} + \frac{1}{n}}} \qquad \text{and} \qquad W = \frac{S_X^2 + S_Y^2}{\sigma^2}.$$
To compute the p-value of this test, we first note that the two-sided t test of size $\alpha_0$ rejects $H_0$
if $|T(\mathbf{X}, \mathbf{Y})| \geq c$, where $c = T_{\mathrm{cdf}}^{-1}\!\left( 1 - \frac{\alpha_0}{2},\; m + n - 2 \right)$. Then for an observed value $t$ of the test statistic,
the p-value is the size of the test that rejects $H_0$ when $|T(\mathbf{X}, \mathbf{Y})| \geq |t|$. Since $T(\mathbf{X}, \mathbf{Y})$
has the t distribution with $m + n - 2$ degrees of freedom when $H_0$ is true, the size of that test is
the probability that a t-distributed random variable with $m + n - 2$ degrees of freedom falls
outside $\pm|t|$. Using the t distribution cdf, this probability is
$$T_{\mathrm{cdf}}(-|t|,\; m + n - 2) + 1 - T_{\mathrm{cdf}}(|t|,\; m + n - 2) = 2\left[ 1 - T_{\mathrm{cdf}}(|t|,\; m + n - 2) \right],$$
where we have used the symmetry of the t distribution. This is the desired p-value. We note this
result could have been found by a method similar to the one used for the prior, simpler hypothesis
test via the power function; however, those power-function computations would be more intensive.
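A minimal sketch of this pooled two-sample test; SciPy's scipy.stats.ttest_ind with equal_var=True implements the same test and serves as a cross-check:

    import numpy as np
    from scipy import stats

    def pooled_two_sample_t_test(x, y):
        """Two-sided two-sample t test assuming equal variances; returns (t, p-value)."""
        x, y = np.asarray(x), np.asarray(y)
        m, n = len(x), len(y)
        s2_x = np.sum((x - x.mean()) ** 2)   # S_X^2 as defined above
        s2_y = np.sum((y - y.mean()) ** 2)   # S_Y^2 as defined above
        t_stat = (np.sqrt(m + n - 2) * (x.mean() - y.mean())) / (
            np.sqrt(1.0 / m + 1.0 / n) * np.sqrt(s2_x + s2_y)
        )
        p_value = 2.0 * (1.0 - stats.t.cdf(abs(t_stat), df=m + n - 2))
        return t_stat, p_value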
4 Results
4.1 ResNet
The ResNet model obtained a top-1 error of 30.1% and top-5 error of 10.8%. These results are
comparable to the errors reported on the PyTorch website for this model.
Figure 1 shows the reliability diagram for the ResNet model. Interestingly, for predictions of
lower confidence, we see underconfidence from the model: the sample accuracies are higher than the
confidence values. However, once the model's confidence exceeds roughly 0.4, the model becomes
overconfident, with sample accuracy falling below the model's confidence. We also observe from the
confidence histogram that the majority of the ResNet model's predictions are made with high
confidence, suggesting that in general ResNet is overconfident in its predictions. Overall, ResNet is
fairly well calibrated, with an ECE of 2.65 (scaled by a factor of 100).
4.2 DenseNet
The DenseNet model obtained a top-1 error of 25.5% and top-5 error of 7.8%. Again, these error
rates are comparable to those reported on the PyTorch website.
Figure 2 shows the reliability diagram for the DenseNet model. We notice results similar to
ResNet, with the model transitioning from underconfident to overconfident, though DenseNet makes
this transition at a slightly higher confidence of about 0.45. The confidence histogram again shows
that the majority of predictions are made with high confidence. In general, DenseNet also
demonstrates good calibration, with an ECE of 2.51.
4.3 ShuffleNet
The ShuffleNet model obtained a top-1 error of 30.5% and top-5 error of 11.6%. The error rates
are comparable to those reported on the PyTorch website for this model.
Figure 1: Reliability diagram for the ResNet model (resnet18, ECE = 2.65). The top panel plots
accuracy against confidence, with a dashed line indicating an ideal, perfectly calibrated model and
bars showing the model's actual calibration; the bottom panel is a histogram of the number of
predictions at each confidence level.
Figure 2: Reliability diagram for the DenseNet model (densenet121, ECE = 2.51). The top panel
plots accuracy against confidence, with a dashed line indicating an ideal, perfectly calibrated model
and bars showing the model's actual calibration; the bottom panel is a histogram of the number of
predictions at each confidence level.
Figure 3: Reliability diagram for the ShuffleNet model (shufflenet_v2_x1_0, ECE = 7.28). The top
panel plots accuracy against confidence, with a dashed line indicating an ideal, perfectly calibrated
model and bars showing the model's actual calibration; the bottom panel is a histogram of the
number of predictions at each confidence level.
Figure 4: Histogram showing the counts of ECE values from non-parametric bootstrap for each
model. Vertical lines indicate 95% confidence intervals for the ECE.
Figure 3 shows the reliability diagram for the ShuffleNet model. ShuffleNet demonstrates
consistent overconfidence, with this trend worsening as the model's confidence increases. Again, we
observe that the majority of the model's predictions are made with high confidence. ShuffleNet
exhibits the worst calibration of any model, with an ECE of 7.28.
4.4 Sampling Distributions of the ECEs
Figure 4 shows all three sampling distributions of the ECE from each model. These were generated
from non-parametric bootstrap as described previously. The mean of each distribution corresponds
well with the observed ECE, with DenseNet having the lowest, followed by ResNet, and lastly
ShuffleNet. We also observe that these distributions appear normally distributed, allowing us to
apply our hypothesis tests. It is also clear that the ECE sampling distributions for DenseNet and
ResNet are highly overlapping, while the ECE sampling distribution for ShuffleNet is centered much higher
than the others.
4.5 Confidence Intervals for the ECE
Figure 4 also shows the 95% confidence intervals for the ECE from each model. For ResNet, the
confidence interval is 0.0239 to 0.0302. For DenseNet, the confidence interval is 0.0232 to 0.0291.
Finally, for ShuffleNet, the confidence interval is 0.0699 to 0.0763. All of the intervals are fairly
similar in size. Additionally, the high overlap between the DenseNet and ResNet ECE sampling
distributions is confirmed by the overlapping confidence intervals.
4.6 Comparing ECEs to the Ideal ECE
For a perfectly calibrated model the ECE would be zero. To compare our observed ECE for each
model to this ideal, we employed the first hypothesis test described above. We let $\mu_0 = 0$ and
reject the null hypothesis at a significance level of 0.01. We assume each sampling distribution of
ECEs is normally distributed.
For all three models we reject the null hypothesis that the ECE is equal to 0. In all three cases,
the p-value is reported as 0 by Python, i.e., it is below floating-point precision. We can therefore
say with very high confidence that none of these image classification models is perfectly calibrated.
4.7 Comparing Model ECEs
We were also interested in comparing ECEs between models to discover whether any model was
better calibrated than another. We used the two-sample hypothesis test described above to test the
difference between the means of each pair of models' sampling distributions. We assume the
variances of the sampling distributions to be equal, as the computed variances are very similar
(2.65e-10, 2.13e-10, and 2.57e-10). Between ShuffleNet and DenseNet we calculated a test statistic
of 1359.6 and a p-value of 0. Between ShuffleNet and ResNet we calculated a test statistic of 1271.8
and a p-value of 0. Finally, between ResNet and DenseNet we calculated a test statistic of -30.1 and
a p-value of zero. Using SciPy's ttest_ind function we found the same test statistics, but a p-value
of 7.24e-189 between ResNet and DenseNet. Overall, this hypothesis testing procedure shows a
significant difference between the means of the models' ECE sampling distributions. Given that the
mean ECE for DenseNet is the lowest, we can confidently say that DenseNet is the best calibrated
model we tested, followed by ResNet and then ShuffleNet.
5 Discussion
Prior to our implementation, we introduced and explored the relevant statistics involved in
image-classification neural network calibration and laid out a theoretical framework for the results
of our implementation. This framework involved several key assumptions and brief derivations of
the test statistics and formulae for the p-values of the hypothesis tests which we planned to use to
compare the distributions of the ECE test statistic.
We successfully explored the calibration of state-of-the-art image classification neural net-
works through their confidence values. We first generated predictions from three models, ResNet,
DenseNet, and ShuffleNet, on over 40,000 images from the ImageNet data set. We then constructed
reliability diagrams and calculated the ECE for each model to understand the model’s overall cal-
ibration. We next used non-parametric bootstrap to construct sampling distributions for the ECE
of each model. With these sampling distributions, we were able to calculate confidence intervals for
the ECE of each model. Next, we used hypothesis tests to both compare each model’s ECE to the
ideal, and to each other.
Our main findings were that none of our tested models were perfectly calibrated on the ImageNet
data set. We found DenseNet to have the best calibration, ResNet to have the second best, and
ShuffleNet to be far behind. Additionally, each model is miscalibrated in the sense of being
overconfident in the majority of its predictions. We were able to confirm these findings via the
sampling distribution of the ECE, confidence intervals, and hypothesis tests.
This project was limited in its breadth as we only studied three different models on one data
set. As a result, we are unable to make general conclusions about the calibration of specific neural
network architectures on general data sets. We are also limited in that we could not test calibration
techniques, since we did not train any of the tested models ourselves.
Going forward, it would be exciting to explore the calibration of more models on more data sets.
This would allow us to make more general conclusions about neural network calibration.
Additionally, it would be very interesting to test which techniques during training result in excellent
model calibration. There are a variety of interesting results on the ways in which particular loss
functions or regularization techniques impact model calibration in [1]. Finally, it
would be exciting to investigate the calibration of neural networks in different fields such as natural
language processing and audio.
Although limited, we can conclude that DenseNet and ResNet have superior calibration to
ShuffleNet on the ImageNet data set. We can also confirm that DNNs remain limited in their ability
to provide well-calibrated confidence scores for their predictions. This is important to consider as
DNNs are introduced to new fields where misclassification can be both costly and dangerous. Going
forward, researchers training and creating these models should consider calibration a key component
of both a model's efficacy and its safety.
References
[1] Nikita Vemuri. Scoring confidence in neural networks. 2020.
[2] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High
confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 427–436, 2015.
[3] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng
Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-
Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer
Vision (IJCV), 115(3):211–252, 2015.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 770–778, 2016.
[5] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines
for efficient cnn architecture design. In Proceedings of the European conference on computer
vision (ECCV), pages 116–131, 2018.
[6] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely con-
nected convolutional networks. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 4700–4708, 2017.
[7] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated
probabilities using bayesian binning. In Proceedings of the AAAI conference on artificial intel-
ligence, volume 29, 2015.
[8] hollance. reliability-diagrams: Reliability diagrams visualize whether a classifier model needs
calibration. GitHub repository.