Automatic textual description of interactions between two objects in surveillance videos
Abstract
The purpose of our work is to automatically generate textual descriptions of surveillance video scenes, following a schema compatible with police incident reports. Our proposed approach is based on a generic and flexible context-free ontology. The general schema is of the form [actuator] [action] [over/with] [actuated object] [+ descriptors: distance, speed, etc.]. We focus on scenes containing exactly two objects. Through a series of processing steps, we generate a formatted textual description. We identify whether an interaction exists between the two objects, including remote interactions that do not involve physical contact, and we flag the cases in which aggression takes place. We use supervised deep learning to classify scenes into interaction and no-interaction classes, and then into subclasses. The descriptors chosen to represent the subclasses are key elements in surveillance systems, helping generate live alerts and facilitating offline investigation.
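To make the schema concrete, the following is a minimal sketch in Python of how one generated description could be represented and rendered as formatted text. The field names, example values, and the rendering logic are illustrative assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class InteractionDescription:
    """One formatted description following the schema
    [actuator] [action] [over/with] [actuated object] [+ descriptors]."""
    actuator: str                 # object initiating the action, e.g. "person_1"
    action: str                   # detected action, e.g. "fights"
    preposition: str              # "over" or "with", depending on the action
    actuated_object: str          # object acted upon, e.g. "person_2"
    descriptors: dict = field(default_factory=dict)  # e.g. {"distance": "close"}
    aggressive: bool = False      # flagged when aggression is detected

    def to_text(self) -> str:
        # Render the schema as one report-style line.
        text = f"{self.actuator} {self.action} {self.preposition} {self.actuated_object}"
        if self.descriptors:
            extras = ", ".join(f"{k}: {v}" for k, v in self.descriptors.items())
            text += f" [{extras}]"
        if self.aggressive:
            text += " [aggressive]"
        return text

# Hypothetical usage:
d = InteractionDescription("person_1", "fights", "with", "person_2",
                           {"distance": "close", "speed": "high"},
                           aggressive=True)
print(d.to_text())
# person_1 fights with person_2 [distance: close, speed: high] [aggressive]
```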