Middlebury Stereo Evaluation v.3
In version 3 of the Middlebury stereo evaluation we have made
important changes and added many new features, integrating lessons
learned from our previous evaluations and good ideas from other
evaluations, in particular KITTI, PASCAL, and Sintel.
- The benchmark now consists of a larger number of datasets (2 x 15
image pairs), providing a varied set of challenges, while still
allowing all results to be shown in a single table.
- The datasets have high resolution; most are taken from the new 2014
datasets, with a resolution of 5-6 MP.
- Maximum disparities range from 200 to 800 pixels at full resolution.
Disparities are stored in floating-point PFM format, with infinity
representing unknown values. Disparities are evaluated only for the
left view. A conservative estimate of the number of disparity levels
(nd) is provided for each dataset and may be utilized by the stereo
method.
- A few older datasets are included (full-resolution versions of Teddy,
Art, and Computer) to provide a comparison with existing results.
- Most datasets have imperfect rectification (see our GCPR
2014 paper for details), with vertical y-disparities of up to
several pixels. For some of the datasets with the largest
rectification errors, we also include versions with "perfect"
rectification (suffix 'P') to allow evaluating the effect of
rectification errors.
- We also include versions with changed exposure (suffix 'E') and
changed lighting (suffix 'L') to allow evaluating the effect of
these changes.
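
To make the disparity format described above concrete, here is a minimal Python sketch of reading a single-channel PFM file and separating known from unknown pixels. It assumes NumPy and the common PFM layout (a `Pf` magic line, a `width height` line, a scale value whose sign gives the byte order, then float32 rows stored bottom to top). The helper names `read_pfm` and `valid_mask` are hypothetical and are not taken from the SDK.

```python
import io
import numpy as np

def read_pfm(stream):
    """Read a single-channel PFM image from a binary stream (hypothetical helper)."""
    magic = stream.readline().strip()
    if magic != b"Pf":
        raise ValueError("expected a single-channel PFM file (magic 'Pf')")
    width, height = map(int, stream.readline().split())
    scale = float(stream.readline().strip())
    # A negative scale value means the samples are little-endian.
    dtype = "<f4" if scale < 0 else ">f4"
    data = np.frombuffer(stream.read(width * height * 4), dtype=dtype)
    # PFM stores rows bottom to top, so flip to the usual top-down layout.
    return np.flipud(data.reshape(height, width)).astype(np.float32)

def valid_mask(disp):
    """True where the disparity is known; unknown values are stored as infinity."""
    return np.isfinite(disp)

# Usage: build a tiny 2x2 PFM in memory (bottom row first) and read it back.
raw = np.array([[3.0, np.inf], [1.0, 2.0]], dtype="<f4")
disp = read_pfm(io.BytesIO(b"Pf\n2 2\n-1.0\n" + raw.tobytes()))
print(disp)
print(valid_mask(disp))
```

With a file on disk, the same function can be called as `read_pfm(open(path, "rb"))`, where `path` is whatever location the ground-truth file was saved to.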
Training and test sets
- The datasets are split into test and training sets with 15 image
pairs each. Both sets have public tables listing the results of
all submitted methods. We provide ground-truth disparities only for
the training set.
- For the training set, there is a fully automatic online evaluation
mechanism, similar to the old (v.2) stereo evaluation. This allows
evaluating results on the training set and previewing the results in
comparison to all other submitted results in an online table.
In addition, we also provide the evaluation code in our SDK (see
below) to aid algorithm development and parameter tuning.
- In order to publish results in the public tables, results for BOTH
training and test sets must be uploaded, generated with identical
parameter settings across all 30 images. Multiple submissions are not
allowed. Researchers thus have only one shot at evaluating their
results on the test data and will not be able to see the test results
until they have been published. Both sets of results (training and
test) will appear in the public tables. A publication request
requires our approval, which might take a few days.
- All published results on the training set will become public and can
be downloaded by clicking on the method name in the training table.
- We provide the datasets in 3 resolutions (F-full, H-half, and
Q-quarter). Researchers can submit their results at any resolution,
but evaluation is always at full resolution (we upsample submitted H
and Q results). The same resolution must be used for all 30 image
pairs.
- Our initial table contains results in multiple resolutions for
several methods to illustrate that higher resolution does not always
yield better performance. For each submitted method, however, we only
allow one entry in the table. Researchers can use the training set to
determine at which resolution their method works best.
- We collect and display the following information for each method:
  - A brief name or acronym for the table (no more than 10 characters)
  - Reference (authors, title, publication / submission venue, date)
  - The URL of the paper or a project page
  - Resolution (F, H, Q)
  - Whether color information was utilized
  - A summary of the runtime environment (software and hardware)
  - A brief description of the method, so that other researchers can get
    an idea of how the method works without having to read the paper
  - Specific parameter settings used, for reproducibility
- During the submission process we also collect information about the
runtime of the method on each dataset, which can be displayed in the
table.
- We support anonymous submission for double-blind review processes.
In this case the following information does not need to be provided:
authors, description, parameter settings, URL. Once a publication
decision has been made, the authors should provide updated
information via email, or (if the paper was rejected) request deletion
of the results from the table.
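
As a rough illustration of the resolution rule in this section (H and Q results are upsampled and evaluated at full resolution), the sketch below upsamples a disparity map with nearest-neighbor replication and rescales its values. The interpolation scheme, the edge handling, and the name `upsample_to_full` are assumptions made for illustration; the upsampling actually used by the evaluation is not specified here and may differ.

```python
import numpy as np

# Upsampling factor per submitted resolution (F-full, H-half, Q-quarter).
SCALE = {"F": 1, "H": 2, "Q": 4}

def upsample_to_full(disp, resolution, full_shape):
    """Nearest-neighbor upsampling of a disparity map to full resolution."""
    s = SCALE[resolution]
    # Replicate each pixel s times along both axes.
    up = np.repeat(np.repeat(disp, s, axis=0), s, axis=1)
    # Disparities are measured in pixels, so they scale with the image width.
    up = up * np.float32(s)
    # Pad by edge replication, then crop, in case the full-resolution size
    # is not an exact multiple of the submitted size.
    h, w = full_shape
    pad_h = max(h - up.shape[0], 0)
    pad_w = max(w - up.shape[1], 0)
    up = np.pad(up, ((0, pad_h), (0, pad_w)), mode="edge")
    return up[:h, :w]

# Usage: a 1x2 quarter-resolution map upsampled to a 5x7 full-resolution grid.
quarter = np.array([[1.0, 2.0]], dtype=np.float32)
print(upsample_to_full(quarter, "Q", (5, 7)))
```

Nearest-neighbor replication is chosen here because it leaves infinite (unknown) values isolated instead of smearing them into neighboring pixels, as linear interpolation would.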
Dense vs. sparse disparity maps
- We provide two tables for each of training and test sets, displaying
both dense results (every pixel has a disparity estimate) and sparse
results (for methods that leave some pixels unlabeled, typically in
half-occluded or textureless regions).
- We support submission of dense, sparse, or both sets of results. If
only dense results are provided, they are also used for the sparse
table. If only sparse results are uploaded, we create dense results
using simple scanline-based hole filling.
- In the sparse table, the percentage of invalid pixels can be displayed.
- For each table we provide 10 disparity accuracy measures, including
bad pixel percentages for thresholds 0.5, 1.0, 2.0, and 4.0; average
absolute and RMS disparity errors; and error quantiles A50 (median
error), A90, A95, and A99. As mentioned, all metrics are computed at
full resolution. In addition, we provide runtimes for each method,
as well as runtimes normalized by number of pixels and number of
disparity levels.
- The default metric used to determine the overall ranking of methods
is bad2.0, the percentage of bad pixels with disparity error > 2.0
pixels. This corresponds to an error threshold of 0.5 pixels at
quarter resolution.
- The three old datasets (Teddy, Art, Computer) only have integer GT.
Submitted disparities are rounded to integers on these datasets for a
fair comparison.
- We clip the submitted disparity values to the known disparity range
[0, nd] for each dataset in fairness to those methods that do not
utilize the given disparity range.
- We currently evaluate all results using two masks, "nonocc"
(non-occluded pixels visible in both views), and "all" (all pixels).
Unlike in the old evaluation (v.2), where "all" was the default mask,
we are reverting to "nonocc" as the default, since the new
datasets are significantly more challenging and have larger
occluded regions.
- We might add additional masks in the future, for instance focusing on
difficult regions with fine detail and/or lack of texture.
- We no longer use the average rank to produce the overall ranking, since
it had several problems (most notably it made it difficult to assess
the magnitude of performance differences between methods). Instead,
we now use a weighted average of the selected metric. The weights
serve to compensate for the varying difficulty of the different
datasets, and will be adjusted as the state of the art advances. They
are visualized with green bars above each dataset name.
- Initially, we set the weights to down-weight the 5 most difficult
datasets in each group, based on the current best-performing methods.
- We plan to adjust these weights yearly based on how the state of the
art evolves, for instance, by increasing the weights of difficult
datasets and decreasing the weights of easy datasets that are
virtually solved by most methods.
- The overall goal is to extend the useful lifespan of the evaluation by
keeping its focus on unsolved problems. To this end we will periodically
change the weights, and possibly also the masks and the default metric.
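
The evaluation steps described in the list above can be sketched in a few lines of Python: fill holes along scanlines, clip to the range [0, nd], compute a bad-pixel percentage under a mask, and combine per-dataset scores with a weighted average. This is only an illustrative sketch, not the SDK's evaluation code. The hole-filling rule (copy the nearest valid value to the left), all function names, and the example weights are assumptions.

```python
import numpy as np

def fill_holes_scanline(disp):
    """Fill non-finite pixels in each row with the nearest valid value to the
    left; leading holes take the first valid value in the row (assumed rule)."""
    out = disp.copy()
    for row in out:
        valid = np.isfinite(row)
        if not valid.any():
            continue  # nothing to propagate in an entirely empty row
        idx = np.where(valid, np.arange(row.size), -1)
        idx = np.maximum.accumulate(idx)   # index of the last valid pixel so far
        idx[idx < 0] = np.argmax(valid)    # leading holes: first valid pixel
        row[:] = row[idx]
    return out

def bad_percent(disp, gt, mask, nd, thresh=2.0):
    """Percentage of evaluated pixels with absolute error above thresh.
    Expects a dense disparity map (no infinite values)."""
    disp = np.clip(disp, 0, nd)            # clip to the known range [0, nd]
    evaluated = mask & np.isfinite(gt)     # ground truth marks unknowns with inf
    err = np.abs(disp[evaluated] - gt[evaluated])
    return 100.0 * np.mean(err > thresh)

def weighted_average(scores, weights):
    """Weighted average of per-dataset scores."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * scores) / np.sum(weights))

# Usage on a single made-up scanline; the weights below are invented.
gt = np.array([[10.0, 20.0, np.inf, 40.0]])
sparse = np.array([[np.inf, 21.0, 30.0, 50.0]])
dense = fill_holes_scanline(sparse)
mask = np.ones_like(gt, dtype=bool)        # stands in for the "all" mask
print(bad_percent(dense, gt, mask, nd=45))
print(weighted_average([10.0, 30.0], [1.0, 0.5]))
```

Unlike an average of ranks, the weighted average keeps the magnitude of the metric, so the size of the gap between two methods remains visible in the final score.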
History of results
- We provide links to periodic snapshots of the table to create a
record of past results. In particular, we will create a final
snapshot of the table before we change the computation of the overall
ranking.
Visualization and interactive features
- We display color-coded disparity maps and error maps for each result
when the mouse is moved over a table cell. The error maps depend on the
selected metric. Clicking the cell switches to the ground truth.
- The images can be displayed in 3 resolutions to accommodate
different screen sizes and network speeds.
- To prevent reverse engineering of hidden data, we do not provide
high-resolution images of the test set results, and the ground-truth
images for the test set are always rendered at lowest resolution. In
contrast, all submitted results for the training set can be downloaded
in full resolution by clicking the method name, and the training set
ground truth is available on the submit page.
- The table can be sorted by column and by row by clicking on the
- The table allows selection of specific table cells (by clicking in
the cell) and of whole rows (via the checkbox). Selections persist
during sorting and also when switching between metrics and between
sparse and dense. This aids the interactive comparison of a subset of
methods.
- We also support plotting of the selected methods. Moving the mouse over
the data points in the plot displays their values and highlights the
corresponding table entries.
SDK and cvkit
- On the submit page we provide our SDK, containing code and scripts for
running algorithms, evaluating their results, and creating a zip
archive of results for upload to the online evaluation mechanism. We
also provide cvkit, code for visualization and 3D rendering of PFM
files.