Action and activity recognition are prominent, active areas of computer vision. The task typically consists of classifying a fully observed action or activity. More recently, the community has also started to investigate a few variants, extending the paradigm to the early recognition of partially observed actions or even to the prediction of unseen, future actions.
As a different paradigm, we posit that not only human actions but also the overarching goal embedded in an action sequence can be automatically inferred: we term this intention prediction.
Predicting human intentions is considerably more challenging than predicting actions, since the same class of actions (e.g., hand-raising) can be performed with very different intentions (a hand-shake versus a high-five). Therefore, the prediction must rely on the kinematics, capitalizing on the few available motion patterns that allow one to anticipate the intention behind the displayed action.
Another important aspect of the activity recognition problem is that current techniques mainly exploit scene context to support classification (e.g., the objects present in the scene and prior knowledge of the actions they afford). Although this information can help, it may be insufficient to solve the task; worse, the context may not always be available or easily recognizable, and it can even be misleading when the scene is noisy or cluttered.
For all these reasons, we aim to predict human intentions by exclusively leveraging the kinematics, in a context-free manner. To demonstrate the validity of this approach, we designed a set of experiments in which 17 subjects were asked to grasp a bottle in order to either 1) pour some water into a glass, 2) pass the bottle to a co-experimenter, 3) drink from it, or 4) place the bottle into a box. We acquired a new dataset composed of a) the 3D trajectories of 20 motion capture (VICON) markers fitted on the hand of each participant and b) optical video sequences lasting about one second, recorded from an occlusive camera view in which only the arm and the bottle are visible. The goal is to classify the intention associated with the observed grasping-a-bottle movement, i.e., to predict the agent's intention.
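For concreteness, the sketch below shows one plausible way a single trial of such a dataset could be represented in memory; the array shapes, frame rates, class names, and the `Trial` container are illustrative assumptions, not the released data format.

```python
import numpy as np
from dataclasses import dataclass

# The four intentions to be discriminated from the grasping act alone.
INTENTIONS = ("pouring", "passing", "drinking", "placing")

@dataclass
class Trial:
    """One grasping-a-bottle trial (hypothetical container, for illustration)."""
    markers: np.ndarray   # (T, 20, 3): 3D positions of the 20 VICON markers over T frames
    video: np.ndarray     # (F, H, W, 3): RGB frames from the occlusive camera view
    intention: str        # one of INTENTIONS

# Example with synthetic values: ~1 s of motion capture at 100 Hz and video at 30 fps.
trial = Trial(
    markers=np.zeros((100, 20, 3)),
    video=np.zeros((30, 120, 160, 3), dtype=np.uint8),
    intention="drinking",
)
assert trial.intention in INTENTIONS
```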
In a broad experimental evaluation, we establish a baseline analysis of engineered and data-driven 3D, 2D, and multi-modal (3D+2D) techniques, and we also evaluate existing action prediction pipelines on our dataset: in all cases, we register performance that significantly exceeds the random chance level, certifying that the challenge is tractable.
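As a rough illustration of what a kinematics-only baseline of this kind might look like (not the specific pipelines benchmarked here), the following sketch resamples each marker trajectory to a fixed length, flattens it into a feature vector, and trains a linear SVM; with four balanced classes, the 25% chance level is the reference to beat. All data in the sketch are synthetic placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

RNG = np.random.default_rng(0)
N_CLASSES, N_TRIALS, N_MARKERS = 4, 80, 20

def resample(markers: np.ndarray, length: int = 50) -> np.ndarray:
    """Linearly resample a (T, 20, 3) trajectory to a fixed number of frames."""
    t_old = np.linspace(0.0, 1.0, markers.shape[0])
    t_new = np.linspace(0.0, 1.0, length)
    flat = markers.reshape(markers.shape[0], -1)          # (T, 60)
    return np.stack(
        [np.interp(t_new, t_old, flat[:, j]) for j in range(flat.shape[1])], axis=1
    )

# Synthetic placeholder trials of varying duration; real trials would come from the dataset.
X, y = [], []
for label in range(N_CLASSES):
    for _ in range(N_TRIALS // N_CLASSES):
        T = RNG.integers(80, 120)
        traj = RNG.normal(loc=label * 0.1, scale=1.0, size=(T, N_MARKERS, 3))
        X.append(resample(traj).ravel())                  # fixed-length feature vector
        y.append(label)
X, y = np.asarray(X), np.asarray(y)

# A kinematics-only baseline only needs to beat the chance level of 4 balanced classes.
scores = cross_val_score(LinearSVC(max_iter=10000), X, y, cv=5)
print(f"accuracy: {scores.mean():.2f} (chance level: {1 / N_CLASSES:.2f})")
```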