Report on IEEE Computer Society Workshop on Empirical Evaluation of Computer Vision Algorithms
This workshop took place on June 21-22, 1998, just prior to the CVPR'98 conference, at the University of California at Santa Barbara. The program consisted of two invited talks, twelve paper presentations, and one panel discussion. Each paper session had a chair who was given time at the beginning of the session to present their own comments. This helped to spur comment and discussion and allowed a greater number of people to participate in the program.
The first day of the workshop began with an invited talk by Henry Baird, now at Xerox PARC. His presentation was titled The impact of standard databases, benchmarks, and generative models on document image analysis research. Dr. Baird gave a wonderful talk about how databases and benchmarks have influenced the development of the document recognition field. NIST, the University of Washington (under ARPA support) and the University of Nevada Las Vegas (under DOE support) have created publicly available databases, although the reported cost of the UNLV database seemed prohibitive for most academic research. There were many questions from the audience and all seemed to agree that this was an informative and useful presentation.
The theme of the first session of papers was Performance evaluation of algorithms, and the session was chaired by Pat Flynn, now at the Ohio State University. The first paper was titled A benchmark for graphics recognition systems, by Atul Chhabra (Bell Atlantic Network Systems) and Ihsin Phillips (Seattle University). This paper was about a benchmark specifically for evaluating graphics recognition systems for engineering drawings, based on straight lines, circles, circular arcs, and text blocks. The second paper was titled Performance evaluation of clustering algorithms for scalable image retrieval, by Mohamed Abdel-Mottaleb, Santhana Krishnamachari, and Nicholas Mankovich (Philips Research). The point of this work was to compare clustering-based retrieval with non-clustering-based retrieval. The third paper was titled Analysis of PCA-based face recognition algorithms, by Hyeonjoon Moon (SUNY Buffalo) and Jonathon Phillips (NIST). This paper analyzed various factors in the performance of face recognition based on principal component analysis. An interesting result was that performance actually improved for images subjected to lossy JPEG compression.
The second session of papers was titled Evaluation of Edge Detectors, and was chaired by Peter Meer (Rutgers). The first paper was titled Analytical and empirical performance evaluation of subpixel line and edge detection, by Carsten Steger (Technische Universität München). This paper was motivated by industrial inspection applications which require highly accurate subpixel line extraction. The second paper was titled Objective evaluation of edge detectors using a formally defined framework, by Sean Dougherty and Kevin Bowyer (University of South Florida). This paper dealt with comparing edge detection algorithms by a count of true positive and false positive edge pixels summarized in an ROC curve. The third paper was titled Evaluation of edge detection algorithms using a structure from motion task, by Min Shin, Dmitry Goldgof and Kevin Bowyer (University of South Florida). This paper dealt with comparing edge detection algorithms by the accuracy of shape and motion recovered by an edge-based SFM algorithm.
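As a rough illustration of the pixel-counting idea behind an ROC comparison of edge detectors, the sketch below sweeps a threshold over an edge-strength map and counts true positive and false positive edge pixels against a ground-truth map. This is not the framework from the Dougherty and Bowyer paper; the function name `roc_points`, the toy 3x3 arrays, and the thresholds are all assumptions made up for the example.

```python
def roc_points(strength, truth, thresholds):
    """Return one (false_positive_rate, true_positive_rate) pair per threshold.

    strength: 2-D list of edge-strength values (hypothetical detector output).
    truth:    2-D list of 0/1 flags marking ground-truth edge pixels.
    """
    positives = sum(1 for row in truth for t in row if t)
    negatives = sum(1 for row in truth for t in row if not t)
    if positives == 0 or negatives == 0:
        raise ValueError("ground truth needs both edge and non-edge pixels")

    points = []
    for thr in thresholds:
        tp = fp = 0
        for s_row, t_row in zip(strength, truth):
            for s, t in zip(s_row, t_row):
                if s >= thr:          # pixel is reported as an edge
                    if t:
                        tp += 1       # reported and truly an edge
                    else:
                        fp += 1       # reported but not an edge
        points.append((fp / negatives, tp / positives))
    return points


# Toy data (assumed): a vertical edge in the first column.
strength = [
    [0.9, 0.2, 0.1],
    [0.8, 0.6, 0.1],
    [0.7, 0.3, 0.4],
]
truth = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
]
points = roc_points(strength, truth, [0.75, 0.5, 0.25])
for fpr, tpr in points:
    print(f"FPR={fpr:.3f}  TPR={tpr:.3f}")
```

Each threshold yields one point on the curve; a detector whose curve lies closer to the upper-left corner reports more true edge pixels for the same number of false ones.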
The third session of papers was titled Motion, and was chaired by Sandor Der (Army Research Laboratory). The first paper was titled Shape of motion and the perception of human gaits, by Jeffrey Boyd (UCSD) and James Little (UBC). This paper dealt with human evaluation of human gait sequences, in order to determine what a vision system might be expected to do. The second paper was titled Performance assessment by resampling: rigid motion estimators, by Bogdan Matei, Peter Meer and David Tyler (Rutgers). This work uses statistical techniques to make use of limited amounts of real data yet still assess the confidence in the parameter estimate.
The second day began with an invited talk by J. Michael Fitzpatrick of Vanderbilt University. Fitzpatrick's talk was titled A blinded evaluation and comparison of image registration methods. Professor Fitzpatrick gave an excellent talk about the problem of registering images from different modalities, and his own experience in directing a project in which researchers at a number of institutions compared their algorithms on a standard data set.
The next session of papers was titled Aerial Image Analysis and ATR, and was chaired by Adam Hoover (UCSD). The first paper was titled Empirical evaluation of laser radar recognition algorithm using synthetic and real data, by Sandor Der (ARL) and Qinfen Zheng (University of Maryland). This presentation actually addressed two pieces of work. One was the use of synthetic data in evaluating the design of a laser radar system prior to it actually being built. The second was an implementation and evaluation of different FLIR ATR algorithms. The second paper was titled Empirical evaluation of automatically extracted road axes, by Christian Wiedemann, Christian Heipke and Helmut Mayer (Technische Universität München). This presentation discussed an evaluation of three algorithms for extracting roads in aerial images, using real images with manually specified ground truth and performance metrics related to completeness, correctness, quality and redundancy.
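The four road-extraction measures named above can be illustrated with a small length-based sketch. The formulas here are assumptions chosen for illustration, in the spirit of commonly used definitions (matched length over total length); the definitions in the Wiedemann, Heipke and Mayer paper may differ in detail. The function name `road_metrics` and the example lengths are hypothetical.

```python
def road_metrics(matched_ref, total_ref, matched_ext, total_ext):
    """Length-based evaluation of an extracted road network (assumed definitions).

    matched_ref: length of reference (ground-truth) road matched by the extraction
    total_ref:   total length of the reference road network
    matched_ext: length of extracted road that matches the reference
    total_ext:   total length of the extracted road network
    """
    if min(total_ref, total_ext, matched_ext) <= 0:
        raise ValueError("lengths must be positive")

    unmatched_ref = total_ref - matched_ref
    return {
        # Fraction of the reference network that was found.
        "completeness": matched_ref / total_ref,
        # Fraction of the extraction that is actually road.
        "correctness": matched_ext / total_ext,
        # Single score penalizing both missed and spurious road.
        "quality": matched_ext / (total_ext + unmatched_ref),
        # Fraction of matched extraction that overlaps other matched extraction.
        "redundancy": (matched_ext - matched_ref) / matched_ext,
    }


# Example lengths in arbitrary units (made up for illustration).
metrics = road_metrics(matched_ref=80.0, total_ref=100.0,
                       matched_ext=90.0, total_ext=100.0)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting completeness and correctness together guards against an algorithm that scores well on one by sacrificing the other, which is why a combined quality figure is often quoted alongside them.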
The theme of the last session of papers was Modeling Image Formation, and the session was chaired by W. Philip Kegelmeyer (Sandia National Laboratories). The first paper was titled Fingerprint image enhancement: algorithm and performance evaluation, by Lin Hong, Yifei Wan and Anil Jain (Michigan State University). This paper dealt with the enhancement of live-scan fingerprint images, and the evaluation of how enhancement can improve performance in fingerprint identification. The second paper was titled Sensor errors and the uncertainties in stereo reconstruction, by Gerda Kamberova and Ruzena Bajcsy (University of Pennsylvania). This paper dealt with how the errors in the cameras affect performance of shape reconstruction from stereo.
The last session of the workshop was a panel discussion titled Editors' Expectations for Empirical Evaluation in Journal Papers. The panelists included Jim Duncan (Medical Image Analysis), Avi Kak (Computer Vision and Image Understanding), Rangachar Kasturi (Pattern Analysis and Machine Intelligence), Gerard Medioni (Image and Vision Computing) and Mohan Trivedi (Machine Vision and Applications). This session naturally sparked a great deal of interest and discussion. Kasturi described the procedures that PAMI uses specifically for empirical-evaluation style survey papers, and stated that PAMI will publish short papers describing archive databases. Duncan said a few words about the nature of MedIA, since it is a relatively new journal that may not yet be familiar to most people in the community. They publish both hard copy and CD-ROM. He suggested that a new method with one or two example results might have a hard time getting accepted, unless rated very highly by all reviewers. MedIA does not have separate special guidelines for evaluation papers. Kak pointed out that there is some variance among area editors (due to the area, the editor, or both), with acceptance ratios varying between 1/10 and 1/2. A common reason for rejecting a paper is lack of sufficient experimental results. He suggested that performance evaluation should be goal-directed, rather than general purpose. Medioni suggested that five or more years ago there was not much point in publishing evaluation papers, as nothing worked well enough to make it worthwhile. In contrast, today we may be too insistent on comparative work. He suggested that the important point is for authors to be more completely honest about the errors of their algorithm. Trivedi stated that approximately 25% of MV&A papers have a strong evaluation component, and 50% of rejections are due to inadequate evaluation. The general question and answer session for the panel generated some lively audience interaction.
Avi Kak suggested that we could be lulled into false confidence by the use of large databases, as they are never varied along all the relevant dimensions. Pat Flynn suggested that it is as important to vary the imaging modality as it is to vary the algorithms studied. There was a question about whether medical image analysis work should be published in an image analysis journal, a more medical/clinical journal, or both. It is not always possible to split the work neatly in such a way as to support a paper for each audience. Jim Duncan suggested that it was often necessary to have the work published primarily in the medical/clinical journals, in order to keep the interest and involvement of the medical collaborators. There was also discussion about whether and how journals might maintain relevant databases for performance evaluation.
The workshop ended with thanks to all of the participants, and many people expressing a desire to see another such workshop in the future. For information on a possible follow-up workshop next year, contact Jonathon Phillips (firstname.lastname@example.org).
The papers from the workshop are published in a book by IEEE Computer Society Press, Empirical Evaluation Techniques in Computer Vision, edited by K.W. Bowyer and P.J. Phillips, 255+ pages, ISBN 0-8186-8401-1. IEEE CS order number BP08401.
Kevin Bowyer <email@example.com>, 3-Aug-1998