Topic Title: Real-time Multi-Person Atomic Action/Activity Detection

 

Technical Area: Computer Vision, Video Analysis

 

Background

Real-time detection and recognition of multi-person atomic actions has become an urgent need in fine-grained surveillance video analysis, such as remote home monitoring and criminal behavior detection. Industrial systems traditionally adopt rule-based algorithms that detect a limited number of predefined human actions, such as "cross border/line" and "hover". Such algorithms do not extend to the atomic actions common in offline retail scenarios, such as "pick product", "pay", and anomalous behaviors like "shoplift", which are more general in shop and security monitoring.

 

In general, a real-time multi-person atomic action/activity detection algorithm detects person instances in surveillance videos and classifies them into different behaviors. A good atomic action detector can first be used for anti-theft, and further to analyze consumer behavior, which ultimately serves online-offline information integration, i.e., the recommendation system in "new retail". We therefore believe that offline customers' atomic action data is worth collecting: it reveals customers' shopping habits and contains valuable information for sales promotions in physical stores.

 

Human action/activity recognition has been well studied in the past decades. Before the era of deep convolutional neural networks, handcrafted features such as improved dense trajectories were used to classify human activities at the video level. Later, methods based on two-stream and 3D convolutional neural networks, e.g., C3D, I3D, and P3D, greatly boosted the classification accuracy of video-level activities, providing a good basis for classification at the fine-grained level, i.e., atomic actions of human instances. For example, in 2017 the research community released a well-annotated dataset named "Atomic Visual Actions" (AVA), which annotates the humans in each video with bounding boxes and action labels. However, unlike the promising results obtained in video-level activity classification, atomic action classification still performs poorly even with state-of-the-art 3D networks.

 

Although the problem is challenging, atomic action detection applies to broader application scenarios than general video-level action classification in surveillance systems. We therefore want to invite leading researchers to focus on the topic of multi-person atomic action detection, and to make this technique workable, useful, and efficient for analyzing customers' behavior in offline physical stores.

 

Target

From the perspective of action representation, our first-stage target is a temporal representation that can be computed efficiently for labelling human atomic actions via 3D convolutions over the RGB and optical-flow modalities. In particular, current methods for computing optical flow (e.g., TV-L1) are too computationally expensive and should be replaced with a more efficient alternative. In the second stage, we plan to push action detection to a finer-grained level, i.e., to develop a temporal representation of atomic actions from accurately estimated pose sequences.
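One cheap candidate replacement for TV-L1 flow is RGB frame differencing, which approximates the motion cue at a fraction of the cost. The sketch below is purely illustrative (toy grayscale frames, no real decoder), not a prescription for the final method:

```python
# Sketch: RGB frame differencing as a cheap motion cue, one candidate
# replacement for expensive TV-L1 optical flow. Frame shapes and values
# below are illustrative assumptions.

def rgb_diff(clip):
    """clip: list of T frames, each a 2-D list (H x W) of intensities.
    Returns T-1 difference maps highlighting moving pixels."""
    diffs = []
    for prev, curr in zip(clip, clip[1:]):
        diffs.append([[c - p for p, c in zip(prow, crow)]
                      for prow, crow in zip(prev, curr)])
    return diffs

# Usage: a 3-frame, 2x2 "clip" in which one pixel brightens over time.
clip = [
    [[0, 0], [0, 0]],
    [[10, 0], [0, 0]],
    [[30, 0], [0, 0]],
]
motion = rgb_diff(clip)
# motion[0][0][0] == 10, motion[1][0][0] == 20: the moving pixel stands out
# while static pixels stay at 0.
```

Unlike true optical flow, differencing gives no displacement vectors, only a motion-salience map, so the trade-off between cost and discriminative power is exactly what the first-stage study would have to quantify.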

From the perspective of semantic relationships, we aim to recognize the motion interactions of "person-product", "person-person", and "person-place", e.g., "X picks up a cup", "X is stealing from Y", "X is hovering around in the mall", etc. In general, an intelligent surveillance system is expected to understand both a person's atomic actions and their interactions.
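As a baseline for such interaction judgments, geometric overlap between detected boxes is a natural first cue: a "person-product" interaction like "pick product" plausibly requires the person and product boxes to overlap. The rule, the IoU threshold, and the box coordinates below are all illustrative assumptions:

```python
# Hypothetical geometric cue for a "person-product" interaction: flag a
# candidate "pick product" event when the person box and product box
# overlap. The 0.1 IoU threshold is an illustrative assumption.

def iou(a, b):
    """Boxes as (x1, y1, x2, y2). Returns intersection-over-union."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def interacts(person_box, product_box, thresh=0.1):
    return iou(person_box, product_box) >= thresh

person = (0, 0, 10, 20)
near_item = (5, 5, 14, 15)    # overlaps the person box
far_item = (30, 30, 40, 40)   # elsewhere in the frame
print(interacts(person, near_item))  # True
print(interacts(person, far_item))   # False
```

A real system would combine such spatial cues with the learned temporal representation, since overlap alone cannot distinguish "pick" from "return to shelf".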

 

Related Research Topics

Recently, many works have focused on the real-time detection and recognition of multi-person atomic actions, mostly built from multiple components. However, challenging problems remain, and the related studies lie in the following areas:

 

1. Temporal sequential model

Detecting and recognizing actions or activities in video requires processing video clips or sequential frames. Traditional methods usually analyze each video frame individually, which ignores the temporal relationship between consecutive frames. In recent years, many 3D convolutional deep models, such as C3D, I3D, and P3D, have been proposed to extract temporal features from frame sequences. It is therefore possible to explore more effective methods by modelling temporal sequential features.
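The core idea these models share is convolving along the time axis so each output mixes several consecutive frames. A minimal sketch of the temporal half of a factorized (P3D-style) convolution, with toy feature sizes and kernel weights chosen for illustration:

```python
# Minimal sketch of the temporal part of a factorized 3-D convolution:
# a 1-D convolution along the time axis of per-frame feature vectors.
# Feature dimensions and kernel weights are toy assumptions.

def temporal_conv(features, kernel):
    """features: list of T per-frame feature vectors (lists of floats).
    kernel: list of k temporal weights. Returns T-k+1 output vectors,
    each mixing k consecutive frames."""
    k = len(kernel)
    out = []
    for t in range(len(features) - k + 1):
        fused = [0.0] * len(features[0])
        for i, w in enumerate(kernel):
            for d, v in enumerate(features[t + i]):
                fused[d] += w * v
        out.append(fused)
    return out

# A 4-frame sequence of 2-D features; a 3-tap kernel fuses every
# 3 consecutive frames, capturing short-term motion context.
frames = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
fused = temporal_conv(frames, [0.25, 0.5, 0.25])
# fused == [[2.0, 0.0], [3.0, 0.0]]
```

Per-frame analysis corresponds to a kernel of length 1, which is exactly the temporal blindness the 3D models are designed to remove.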

 

2. Pose estimation

One of the major difficulties in human action/activity recognition is the large variation of human poses and the occlusion of limbs. Pose estimation is a promising route for action/activity recognition, since different actions exhibit different pose appearances. Human pose estimation is a very active topic in computer vision, where OpenPose and Mask R-CNN are two popular frameworks. However, in real-life scenarios pose estimation still cannot be handled well, precisely because of pose variation and limb occlusion.
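Once keypoints are estimated, pose-based action features can be as simple as joint angles computed over the keypoint sequence. A hedged sketch, where the keypoint names and coordinates are illustrative assumptions rather than any framework's actual output format:

```python
import math

# Sketch: turning estimated pose keypoints (e.g., OpenPose-style 2-D
# joints) into a simple action feature, the elbow angle. Keypoint names
# and coordinates are illustrative assumptions.

def joint_angle(a, b, c):
    """Angle at joint b (degrees) formed by points a-b-c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    return math.degrees(math.acos(dot / (n1 * n2)))

# A straight arm gives 180 degrees; lifting the wrist bends the elbow.
shoulder, elbow, wrist = (0, 0), (1, 0), (2, 0)
straight = joint_angle(shoulder, elbow, wrist)        # 180.0
bent = joint_angle(shoulder, elbow, (1, 1))           # 90.0
```

Tracking such angles across frames yields the pose-sequence representation targeted in the second stage; the hard part, as noted above, is that occluded limbs produce missing or wrong keypoints that corrupt these features.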

 

3. Multi-modality information

Existing works show that optical flow is more discriminative and robust than RGB information in characterizing action features, because optical flow preserves the temporal-spatial information of actions/activities. Therefore, to achieve better performance, a two- or three-stream framework can be adopted, combining RGB, optical flow, and possibly acoustic information. This raises the challenge of how to fuse the multi-modality information.
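The simplest fusion strategy is late fusion: each stream produces per-class scores, which are combined by a weighted average. The class names, scores, and weights below are toy assumptions used only to show the mechanism:

```python
# Sketch of late fusion for a multi-stream model: per-class scores from
# RGB and optical-flow streams are combined by a weighted average.
# Weights, scores, and class names are toy assumptions.

def late_fuse(stream_scores, weights):
    """stream_scores: one per-class score list per stream.
    weights: one weight per stream, summing to 1."""
    n_classes = len(stream_scores[0])
    return [sum(w * s[c] for w, s in zip(weights, stream_scores))
            for c in range(n_classes)]

# Classes: ["pick product", "pay", "other"]; the flow stream is weighted
# more heavily than RGB, reflecting its higher discriminative power.
rgb = [0.2, 0.5, 0.3]
flow = [0.6, 0.3, 0.1]
fused = late_fuse([rgb, flow], weights=[0.4, 0.6])
# fused[0] = 0.4*0.2 + 0.6*0.6 ≈ 0.44, so "pick product" now wins,
# even though the RGB stream alone preferred "pay".
```

Early or mid-level fusion (concatenating features inside the network) is the main alternative, and choosing between these schemes, and learning the stream weights, is part of the fusion challenge.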


4. Efficiency issues for edge devices

In many scenarios, due to bandwidth limits on video transmission, the action/activity analysis algorithm must run on edge devices located near the cameras. However, edge devices normally have limited computational resources, so algorithm complexity constrains the achievable efficiency. To run in real time, deep model optimizations such as compression and quantization may be adopted, and engineering speed-up tools on GPU can also be considered.
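Quantization, one of the optimizations mentioned above, can be illustrated with a minimal post-training scheme: float weights are mapped to int8 levels and back, shrinking storage by roughly 4x at the cost of a bounded rounding error. The weight values below are arbitrary toy numbers:

```python
# Sketch of symmetric post-training linear quantization: float weights
# are mapped to int8 levels [-127, 127] and dequantized back. The
# weight values are toy assumptions.

def quantize(weights):
    """Returns int8-range levels and the scale used."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(levels, scale):
    return [v * scale for v in levels]

weights = [0.52, -1.27, 0.003, 0.9]
levels, scale = quantize(weights)       # e.g. [52, -127, 0, 90]
restored = dequantize(levels, scale)
err = max(abs(w - r) for w, r in zip(weights, restored))
# err never exceeds half a quantization step (scale / 2).
```

Real deployments use per-channel scales and quantization-aware fine-tuning to recover accuracy, but the storage/precision trade-off is the same as in this sketch.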