Middlebury Stereo Evaluation - Version 3

New Features

5/22/2015

In version 3 of the Middlebury stereo evaluation we have made important changes and added many new features, integrating lessons learned from our previous evaluations and good ideas from other evaluations, in particular KITTI, PASCAL, and Sintel.

Datasets

The benchmark now consists of a larger number of datasets (2 x 15 image pairs), providing a varied set of challenges, while still allowing all results to be shown in a single table.
The datasets have high resolution - most are taken from the new 2014 datasets, with a resolution of 5-6 MP.
Maximum disparities range from 200 to 800 pixels at full resolution. Disparities are stored in floating-point PFM format, with infinity representing unknown values. Disparities are evaluated only for the left view. A conservative estimate of the number of disparity levels (nd) is provided for each dataset and may be utilized by the stereo method.
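As an illustration of the disparity file format described above, here is a minimal sketch of a reader for single-channel PFM files that uses only the Python standard library. The function names read_pfm and valid_fraction are hypothetical and not part of the SDK. The sketch assumes the common PFM layout: a "Pf" line, a "width height" line, a scale line whose sign gives the byte order, and then rows of 32-bit floats stored bottom to top.

```python
import io
import math
import struct


def read_pfm(stream):
    """Read a single-channel PFM image and return (width, height, rows).

    rows[0] is the top image row. Unknown disparities stay as float('inf').
    """
    magic = stream.readline().decode("ascii").strip()
    if magic != "Pf":  # "Pf" = one channel, "PF" = three channels
        raise ValueError("expected a single-channel PFM file")
    width, height = map(int, stream.readline().decode("ascii").split())
    scale = float(stream.readline().decode("ascii").strip())
    endian = "<" if scale < 0 else ">"  # negative scale means little-endian
    count = width * height
    data = struct.unpack(endian + str(count) + "f", stream.read(4 * count))
    # PFM stores rows bottom to top, so reverse them to get top-to-bottom order.
    rows = [list(data[r * width:(r + 1) * width]) for r in range(height)]
    rows.reverse()
    return width, height, rows


def valid_fraction(rows):
    """Fraction of pixels with a known (finite) disparity."""
    values = [v for row in rows for v in row]
    return sum(math.isfinite(v) for v in values) / len(values)


# Usage with a tiny in-memory 2 x 2 file (one pixel is unknown).
payload = b"Pf\n2 2\n-1.0\n" + struct.pack("<4f", 1.0, 2.0, float("inf"), 4.0)
width, height, rows = read_pfm(io.BytesIO(payload))
```

For a file on disk, the same function can be called on a handle opened in binary mode, for example open(path, "rb").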
A few older datasets are included (full-resolution versions of Teddy, Art, and Computer) to provide a comparison with existing results.
Most datasets have imperfect rectification (see our GCPR 2014 paper for details), with vertical y-disparities of up to several pixels. For some of the datasets with the largest rectification errors, we also include versions with "perfect" rectification (suffix 'P') to allow evaluating the effect of rectification errors.
We also include versions with changed exposure (suffix 'E') and changed lighting (suffix 'L') to allow evaluating the effect of radiometric changes.

Training and test sets

The datasets are split into test and training sets with 15 image pairs each. Both sets have public tables listing the results of all submitted methods. We provide ground-truth disparities only for the training set.
For the training set, there is a fully automatic online evaluation mechanism, similar to the old (v.2) stereo evaluation. This allows evaluating results on the training set and previewing the results in comparison to all other submitted results in an online table. In addition, we also provide the evaluation code in our SDK (see below) to aid algorithm development and parameter tuning.
In order to publish results in the public tables, results for BOTH training and test sets must be uploaded, generated with identical parameter settings across all 30 image pairs. Multiple submissions are not allowed. Researchers thus have only one shot at evaluating their results on the test data and will not be able to see the test results until they have been published. Both sets of results (training and test) will appear in the public tables. A publication request requires our approval, which might take a few days.
All published results on the training set will become public and can be downloaded by clicking on the method name in the training table.

Resolution

We provide the datasets in 3 resolutions (F-full, H-half, and Q-quarter). Researchers can submit their results at any resolution, but evaluation is always at full resolution (we upsample submitted H and Q results). The same resolution must be used for all 30 image pairs.
Our initial table contains results in multiple resolutions for several methods to illustrate that higher resolution does not always yield better performance. For each submitted method, however, we only allow one entry in the table. Researchers can use the training set to determine at which resolution their method works best.
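Since evaluation is always at full resolution, a half- or quarter-resolution disparity map has to be enlarged, and its values have to be multiplied by the same factor because disparities are measured in pixels. The sketch below shows this with nearest-neighbor replication. The interpolation actually used by the evaluation is not stated here, so treat the replication scheme and the name upsample_disparities as assumptions.

```python
def upsample_disparities(rows, factor):
    """Nearest-neighbor upsampling of a disparity map by an integer factor.

    Each pixel is replicated factor x factor times and its disparity is
    multiplied by factor (2 for half resolution, 4 for quarter resolution).
    Infinite (unknown) values stay infinite.
    """
    out = []
    for row in rows:
        wide = [d * factor for d in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


# A 1 x 2 half-resolution map becomes a 2 x 4 full-resolution map.
full = upsample_disparities([[1.0, 2.5]], 2)
```

This sketch also assumes that the full-resolution image size is an exact multiple of the submitted size.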

Metadata

We collect and display the following information for each method:
- A brief name or acronym for the table (no more than 10 characters)
- Reference (authors, title, publication / submission venue, date)
- The URL of the paper or a project page
- Resolution (F, H, Q)
- Whether color information was utilized
- A summary of runtime environment (software and hardware)
- A brief description of the method, so that other researchers can get an idea of how the method works without having to read the paper
- Specific parameter settings used for reproducibility
During the submission process we also collect information about the runtime of the method on each dataset, which can be displayed in the table.
We support anonymous submission for double-blind review processes. In this case the following information does not need to be provided: authors, description, parameter settings, URL. Once a publication decision has been made, the authors should provide updated information via email, or (if the paper was rejected) request deletion of the results from the table.

Dense vs. sparse disparity maps

We provide two tables for each of training and test sets, displaying both dense results (every pixel has a disparity estimate) and sparse results (for methods that leave some pixels unlabeled, typically in half-occluded or textureless regions).
We support submission of dense, sparse, or both sets of results. If only dense results are provided, they are also used for the sparse table. If only sparse results are uploaded, we create dense results using simple scanline-based hole filling.
In the sparse table, the percentage of invalid pixels can be displayed.
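To make the idea of scanline-based hole filling concrete, here is one possible sketch. The exact rule used by the evaluation is not specified above. This version assumes that each invalid pixel takes the smaller of the nearest valid disparities to its left and right on the same row, which favors the background surface. The names fill_scanline and fill_holes are hypothetical.

```python
import math


def fill_scanline(row):
    """Fill invalid (non-finite) entries of one scanline.

    Assumption: a hole takes the smaller of its nearest valid neighbors on
    the left and right; a hole at the border copies its only valid neighbor.
    """
    n = len(row)
    left = [math.inf] * n   # nearest valid value at or to the left
    right = [math.inf] * n  # nearest valid value at or to the right
    last = math.inf
    for i in range(n):
        if math.isfinite(row[i]):
            last = row[i]
        left[i] = last
    last = math.inf
    for i in range(n - 1, -1, -1):
        if math.isfinite(row[i]):
            last = row[i]
        right[i] = last
    # A missing side is +inf, so min() automatically picks the other side.
    return [min(l, r) for l, r in zip(left, right)]


def fill_holes(rows):
    """Apply the scanline fill independently to every row."""
    return [fill_scanline(row) for row in rows]
```

A row with no valid pixel at all stays invalid in this sketch, since there is nothing on the scanline to copy from.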

Evaluation metrics

For each table we provide 10 disparity accuracy measures, including bad pixel percentages for thresholds 0.5, 1.0, 2.0, and 4.0; average absolute and RMS disparity errors; and error quantiles A50 (median error), A90, A95, and A99. As mentioned, all metrics are computed at full resolution. In addition, we provide runtimes for each method, as well as runtimes normalized by number of pixels and number of disparity hypotheses.
The default metric used to determine the overall ranking of methods is bad2.0, the percentage of bad pixels with disparity error > 2.0 pixels. This corresponds to an error threshold of 0.5 pixels at quarter resolution.
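The following sketch shows how several of these measures could be computed from flat lists of full-resolution disparities. It is not the SDK's evaluation code. The function name error_metrics is hypothetical, the estimate is assumed to be dense, and the median is taken as the upper middle element, whereas the official quantile definition may differ. Because a full-resolution disparity is four times the quarter-resolution one, the default threshold of 2.0 pixels equals 2.0 / 4 = 0.5 pixels at quarter resolution.

```python
import math


def error_metrics(estimate, truth, threshold=2.0):
    """Return (bad %, average absolute error, RMS error, median error).

    estimate and truth are flat lists of disparities at full resolution.
    Pixels whose ground truth is unknown (infinite) are skipped.
    """
    errors = sorted(
        abs(e - t) for e, t in zip(estimate, truth) if math.isfinite(t)
    )
    n = len(errors)
    bad = 100.0 * sum(err > threshold for err in errors) / n
    avg = sum(errors) / n
    rms = math.sqrt(sum(err * err for err in errors) / n)
    median = errors[n // 2]  # assumption: upper median, no interpolation
    return bad, avg, rms, median


# Four pixels, the last one without ground truth.
bad, avg, rms, median = error_metrics(
    [10.0, 12.5, 20.0, 7.0], [10.0, 10.0, 21.0, math.inf]
)
```

Passing threshold=0.5, 1.0, or 4.0 gives the other bad-pixel percentages listed above.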
The three old datasets (Teddy, Art, Computer) only have integer ground truth (GT). Submitted disparities are rounded to integers on these datasets for a fair evaluation.
We clip the submitted disparity values to the known disparity range [0, nd] for each dataset in fairness to those methods that do not utilize the given disparity range.
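The rounding and clipping steps described above can be sketched as follows. The name prepare_disparity is hypothetical. Rounding half up and leaving invalid (infinite) pixels untouched are assumptions of this sketch, not documented behavior.

```python
import math


def prepare_disparity(d, nd, integer_gt=False):
    """Clip a submitted disparity to [0, nd], rounding first if GT is integer.

    Assumptions: ties round up, and non-finite values (pixels a sparse
    method left unlabeled) are returned unchanged.
    """
    if not math.isfinite(d):
        return d
    if integer_gt:
        d = float(math.floor(d + 0.5))
    return min(max(d, 0.0), float(nd))


# nd = 200 is an illustrative value, not taken from a specific dataset.
clipped = [prepare_disparity(d, 200) for d in (-3.2, 57.25, 250.7)]
```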

Masks

We currently evaluate all results using two masks, "nonocc" (non-occluded pixels visible in both views) and "all" (all pixels). Unlike in the old evaluation (v.2), where "all" was the default mask, we are reverting to "nonocc" as the default, since the new datasets are significantly more challenging and have larger half-occluded regions.
We might add additional masks in the future, for instance focusing on difficult regions with fine detail and/or lack of texture.

Overall ranking

We no longer use the average rank to produce the overall ranking, since it had several problems (most notably it made it difficult to assess the magnitude of performance differences between methods). Instead, we now use a weighted average of the selected metric. The weights serve to compensate for the varying difficulty of the different datasets, and will be adjusted as the state of the art advances. They are visualized with green bars above each dataset name.
Initially, we set the weights to down-weight the 5 most difficult datasets in each group, based on the current best-performing methods.
We plan to adjust these weights yearly based on how the state of the art evolves, for instance, by increasing the weights of difficult datasets and decreasing the weights of easy datasets that are virtually solved by most methods.
The overall goal is to keep the evaluation useful for longer by focusing it on unsolved problems: we will periodically change the weights, and possibly also the masks and the default metric.
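As a small worked example of ranking by a weighted average, the sketch below combines per-dataset scores of one metric into a single number. The dataset names, scores, and weights are made up for illustration, the function name weighted_score is hypothetical, and dividing by the sum of the weights is an assumption about the normalization.

```python
def weighted_score(scores, weights):
    """Weighted average of one metric over datasets.

    scores and weights map dataset name -> value. For error metrics such
    as bad2.0, a lower result is better.
    """
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


# Hypothetical values: dataset "C" is hard and therefore down-weighted.
scores = {"A": 10.0, "B": 20.0, "C": 40.0}
weights = {"A": 1.0, "B": 1.0, "C": 0.5}
overall = weighted_score(scores, weights)  # (10 + 20 + 20) / 2.5
```

Unlike an average of ranks, this number changes in proportion to how much a method's error actually differs from another's.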

History of results

We provide links to periodic snapshots of the table to create a record of past results. In particular, we will create a final snapshot of the table before we change the computation of the overall default ranking.

Visualization and interactive features

We display color-coded disparity maps and error maps for each result if the mouse is moved over a table cell. The error maps depend on the selected metric. Clicking the cell switches to the ground-truth disparity map.
The images can be displayed in 3 resolutions to accommodate different screen sizes and network speeds.
To prevent reverse engineering of hidden data, we do not provide high-resolution images of the test set results, and the ground-truth images for the test set are always rendered at lowest resolution. In contrast, all submitted results for the training set can be downloaded in full resolution by clicking the method name, and the training set ground truth is available on the submit page.
The table can be sorted by column and by row by clicking on the arrows.
The table allows selection of specific table cells (by clicking in the cell) and of whole rows (via the checkbox). Selections persist during sorting and also when switching between metrics and between sparse and dense. This aids the interactive comparison of a subset of results.
We also support plotting of the selected methods. Moving the mouse over the data points in the plot displays their values and highlights the corresponding cells.

SDK and cvkit

On the submit page we provide our SDK containing code and scripts for running algorithms, evaluating their results, and creating a zip archive of results for upload to the online evaluation mechanism. We also provide cvkit, code for visualization and 3D rendering of PFM disparity maps.