The aim of the task is to detect, localize and track speakers in audio-visual sequences. The data are a selection of scenarios from the “interaction” part of the RAVEL data set [1]. See the data set web site [2] for more information on which sequences to use.

Evaluation metric

To evaluate the results, the Euclidean distance matrix between the detected speakers and the ground-truth speakers should be computed. Each ground-truth speaker should be associated with at most one detected speaker. The assignment procedure is as follows. For each detected speaker, its closest ground-truth speaker is found. If the distance between them is not below a threshold τloc, the detected speaker is marked as a false positive; otherwise it is assigned to that ground-truth speaker. Then, for each ground-truth speaker, the number of detected speakers assigned to it is checked. If there are none, the ground-truth speaker is marked as a missed detection. Otherwise, the closest detected speaker becomes the true positive and the remaining ones become false positives. Recall, precision and accuracy values should be shown in tables (and occasionally in figures as well) for different values of τloc in the range 1 cm to 50 cm.
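
The assignment procedure above can be sketched in Python as follows. This is only an illustrative sketch, not an official evaluation script: the function names match_speakers and precision_recall are hypothetical, positions are assumed to be given in metres as rows of coordinates, the counts are computed per frame, and returning 0 when a denominator is zero is a convention chosen here. Accuracy is not computed because the text above does not define it for this setting.

```python
import numpy as np


def match_speakers(detected, ground_truth, tau_loc):
    """Count (true positives, false positives, missed detections) for one frame.

    detected, ground_truth: sequences of position vectors (assumed in metres).
    tau_loc: localization threshold, in the same unit as the positions.
    """
    det = np.asarray(detected, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    n_det, n_gt = len(det), len(gt)

    # Degenerate frames: every detection is a false positive,
    # every ground-truth speaker is a missed detection.
    if n_det == 0 or n_gt == 0:
        return 0, n_det, n_gt

    # Euclidean distance matrix, shape (n_det, n_gt).
    dist = np.linalg.norm(det[:, None, :] - gt[None, :, :], axis=2)

    # Step 1: each detection looks at its closest ground-truth speaker.
    nearest_gt = dist.argmin(axis=1)
    nearest_dist = dist[np.arange(n_det), nearest_gt]
    within = nearest_dist < tau_loc      # not closer than tau_loc -> false positive
    fp = int(np.sum(~within))

    # Step 2: each ground-truth speaker keeps at most one detection.
    tp = 0
    missed = 0
    for j in range(n_gt):
        n_assigned = int(np.sum(within & (nearest_gt == j)))
        if n_assigned == 0:
            missed += 1
        else:
            tp += 1                      # the closest assigned detection
            fp += n_assigned - 1         # the remaining ones
    return tp, fp, missed


def precision_recall(tp, fp, missed):
    """Precision and recall; 0.0 when the denominator is zero (assumed convention)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + missed) if (tp + missed) else 0.0
    return precision, recall


# Made-up example: two detections near the first speaker, one far from both.
detected = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0), (2.0, 2.0, 2.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
for tau_loc in (0.01, 0.10, 0.50):       # thresholds in metres (1 cm to 50 cm)
    counts = match_speakers(detected, ground_truth, tau_loc)
    print(tau_loc, counts, precision_recall(*counts))
```

Only the number of surplus detections per ground-truth speaker matters for the counts, so the sketch does not need to identify which assigned detection is the closest one. To produce the tables, the per-frame counts would be summed over all frames of a sequence before computing precision and recall for each τloc.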


Please see [2].


[1] X. Alameda-Pineda, J. Sanchez-Riera, V. Franc, J. Wienke, J. Cech, K. Kulkarni, A. Deleforge, and R. P. Horaud. RAVEL: An Annotated Corpus for Training Robots with Audiovisual Abilities. In Journal on Multimodal User Interfaces, 2012.
[2] The RAVEL data set.