INTRODUCTION
A standing conversational group (also known as an F-formation) occurs when two or more people sustain a social interaction, such as chatting at a cocktail party. Detecting such interactions in images or videos is of fundamental importance in many contexts, including surveillance, social signal processing, social robotics, and activity classification. This paper presents an approach to this problem that models the socio-psychological concept of an F-formation. Essentially, an F-formation imposes constraints on how subjects must be mutually located and oriented. We develop a game-theoretic framework that embeds these constraints and is supported by statistical modeling of the uncertainty associated with the position and orientation of people. Specifically, we propose two novel ways of handling the uncertainty in the position and orientation of each individual's head relative to the remaining members of the group, accounting for tracking errors and the true body orientation. First, we introduce a novel representation of the affinity between candidate pairs, expressed as the distance between distributions over the most plausible oriented region of attention. Second, we integrate temporal information over multiple frames while taking into account the social context established in previous frames; we do this in a principled way, using recent notions from multi-payoff evolutionary game theory. Experiments on several benchmark datasets consistently show the superiority of the proposed approach over the state of the art, and its robustness under severe noise conditions.
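To make the idea of a distribution-based affinity combined with a game-theoretic grouping step more concrete, the following is a minimal single-frame sketch, not the implementation used in this paper. Everything specific in it is an assumption made for illustration: the sampled cone used as a stand-in for the oriented region of attention, its parameters (`reach`, `ang_std`), the Jensen-Shannon divergence as the distance between distributions, the exponential kernel with width `sigma`, and the use of plain single-payoff replicator dynamics. The multi-frame, multi-payoff integration is omitted entirely.

```python
import numpy as np

def attention_histogram(pos, theta, rng, n=2000, reach=1.5, ang_std=0.3,
                        lo=-2.0, hi=12.0, bins=28):
    """Histogram of points sampled in a noisy cone in front of a person.
    The cone shape and all parameters are illustrative assumptions."""
    r = rng.uniform(0.0, reach, n)           # distance from the head
    a = rng.normal(theta, ang_std, n)        # noisy head orientation
    xs = pos[0] + r * np.cos(a)
    ys = pos[1] + r * np.sin(a)
    h, _, _ = np.histogram2d(xs, ys, bins=bins, range=[[lo, hi], [lo, hi]])
    return (h / h.sum()).ravel()             # normalise to a distribution

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two histograms."""
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def affinity_matrix(people, sigma=0.2, seed=0):
    """Pairwise affinity exp(-JS / sigma) with a zero diagonal (assumed kernel)."""
    rng = np.random.default_rng(seed)
    hists = [attention_histogram(pos, theta, rng) for pos, theta in people]
    n = len(people)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            A[i, j] = A[j, i] = np.exp(-js_divergence(hists[i], hists[j]) / sigma)
    return A

def replicator_dynamics(A, iters=500, tol=1e-12):
    """Discrete-time replicator dynamics started from the simplex barycenter."""
    x = np.full(len(A), 1.0 / len(A))
    for _ in range(iters):
        Ax = A @ x
        denom = x @ Ax
        if denom <= 0:
            break
        x_new = x * Ax / denom
        converged = np.abs(x_new - x).sum() < tol
        x = x_new
        if converged:
            break
    return x

def extract_group(x, eps=1e-4):
    """Indices whose weight survives the dynamics (the support of x)."""
    return [i for i, v in enumerate(x) if v > eps]

# Hypothetical scene: two people facing each other, one person far away.
people = [((0.0, 0.0), 0.0), ((1.5, 0.0), np.pi), ((8.0, 8.0), np.pi / 2)]
A = affinity_matrix(people)
x = replicator_dynamics(A)
group = extract_group(x)
print(group)
```

In this toy scene the two people facing each other have strongly overlapping attention regions, so their mutual affinity dominates and the dynamics should concentrate the weight on them while the distant person's weight decays. Extracting further groups would require removing the detected members and repeating the procedure, which is a common strategy but again an assumption here.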