Action Recognition Using Close-Up of Maximum Activation and ETRI-Activity3D LivingLab Dataset

Sensors ◽  
2021 ◽  
Vol 21 (20) ◽  
pp. 6774
Author(s):  
Doyoung Kim ◽  
Inwoong Lee ◽  
Dohyung Kim ◽  
Sanghoon Lee

The development of action recognition models has shown great performance on various video datasets. Nevertheless, because existing datasets lack rich data on target actions, they are insufficient for the action recognition applications required by industry. To satisfy this requirement, datasets composed of target actions with high availability have been created, but because such video data are generated in a specific environment, it is difficult for them to capture the varied characteristics of actual environments. In this paper, we introduce the new ETRI-Activity3D-LivingLab dataset, which provides action sequences recorded in actual environments and helps to handle the network generalization issue caused by dataset shift. When an action recognition model is trained on the ETRI-Activity3D and KIST SynADL datasets and evaluated on the ETRI-Activity3D-LivingLab dataset, its performance can be severely degraded because the datasets were captured in different environments. To reduce this dataset shift between the training and testing datasets, we propose a close-up of maximum activation, which magnifies the most activated part of a video input in detail. In addition, we present various experimental results and analyses that show the dataset shift and demonstrate the effectiveness of the proposed method.
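The abstract does not give the authors' implementation; the following is a minimal numpy sketch of the close-up idea as described, assuming a per-frame activation map is already available from the recognition network. The crop size and nearest-neighbor upscaling are illustrative assumptions.

```python
import numpy as np

def close_up_of_max_activation(frame, activation, crop=64):
    """Crop a window of `frame` around the peak of `activation` and
    magnify it back to the full frame size (nearest-neighbor).

    frame:      (H, W, 3) video frame, with crop <= min(H, W)
    activation: (h, w) activation map from the recognition network
    """
    H, W, _ = frame.shape
    h, w = activation.shape
    # Peak activation location, mapped from map to frame coordinates.
    iy, ix = np.unravel_index(np.argmax(activation), activation.shape)
    cy, cx = int(iy * H / h), int(ix * W / w)
    # Clamp the crop window so it stays inside the frame.
    y0 = min(max(cy - crop // 2, 0), H - crop)
    x0 = min(max(cx - crop // 2, 0), W - crop)
    patch = frame[y0:y0 + crop, x0:x0 + crop]
    # Nearest-neighbor upscale of the patch back to (H, W).
    ys = np.arange(H) * crop // H
    xs = np.arange(W) * crop // W
    return patch[np.ix_(ys, xs)]
```

The magnified crop would then be fed to the recognition network in place of (or alongside) the full frame, so the classifier sees the most informative region in detail regardless of the capture environment.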

Author(s):  
Mohsen Tabejamaat ◽  
Hoda Mohammadzade

Recent years have seen an increasing trend in developing 3D action recognition methods. However, despite the advances, existing models still suffer from some major drawbacks, including the lack of any provision for recognizing action sequences with missing frames. This significantly hampers the applicability of these methods to online scenarios, where only an initial part of each sequence is available. In this paper, we introduce a novel sequence-to-sequence representation-based algorithm in which a query sample is characterized using a collaborative frame representation of all the training sequences. This way, an optimal classifier is tailored to the existing frames of each query sample, making the model robust to the effect of missing frames in sequences (e.g., in online scenarios). Moreover, due to the collaborative nature of the representation, it implicitly handles the problem of varying styles during the course of activities. Experimental results on three publicly available databases, UTKinect, TST fall, and UTD-MHAD, show 95.48%, 90.91%, and 91.67% accuracy, respectively, when using the first 75% of each query sequence, and 84.42%, 60.98%, and 87.27% accuracy for the first 50%.
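The exact formulation is not given in the abstract; below is a minimal sketch of one standard collaborative (ridge-regularized) representation classifier applied frame-by-frame, which illustrates why only the observed frames of a query need to participate. The feature shapes and the regularization weight are assumptions.

```python
import numpy as np

def crc_classify(query_frames, train_frames, train_labels, lam=0.01):
    """Collaborative-representation classification of an action from
    whatever frames of the query are available.

    query_frames: (m, d) features of the observed query frames only
    train_frames: (N, d) frame features pooled from all training sequences
    train_labels: (N,)   class label of the sequence each frame came from
    """
    D = train_frames.T                              # d x N dictionary
    # Ridge solution operator shared by every query frame.
    P = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T)
    classes = np.unique(train_labels)
    residuals = np.zeros(len(classes))
    for y in query_frames:                          # missing frames simply absent
        alpha = P @ y                               # collaborative code over all classes
        for k, c in enumerate(classes):
            mask = train_labels == c
            residuals[k] += np.linalg.norm(y - D[:, mask] @ alpha[mask])
    return classes[np.argmin(residuals)]            # smallest total residual wins
```

Because each frame is coded independently against the pooled training frames, dropping frames from the query only removes terms from the residual sum rather than breaking the model.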


2020 ◽  
Vol 2020 (4) ◽  
pp. 116-1-116-7
Author(s):  
Raphael Antonius Frick ◽  
Sascha Zmudzinski ◽  
Martin Steinebach

In recent years, the number of forged videos circulating on the Internet has increased immensely. Software and services to create such forgeries have become more and more accessible to the public. In this regard, the risk of malicious use of forged videos has risen. This work proposes an approach, based on the ghost effect known from image forensics, for detecting forgeries in videos that replace faces in video sequences or change the facial expression. The experimental results show that the proposed approach is able to identify forgery in high-quality encoded video content.
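For background, this is a minimal per-frame sketch of the classic JPEG-ghost computation from image forensics that the abstract refers to; the quality range and the squared-error map are illustrative assumptions, not the paper's exact procedure. A pasted face region that was once compressed at quality q tends to show a dip in the error map at that quality.

```python
import io
import numpy as np
from PIL import Image

def jpeg_ghost_maps(frame, qualities=range(50, 100, 5)):
    """Recompress `frame` (a PIL RGB image) at each candidate quality and
    return per-pixel squared-difference maps to the original.
    Regions with a different compression history stand out as 'ghosts'.
    """
    original = np.asarray(frame, dtype=np.float32)
    maps = {}
    for q in qualities:
        buf = io.BytesIO()
        frame.save(buf, format="JPEG", quality=q)
        recompressed = np.asarray(Image.open(buf), dtype=np.float32)
        maps[q] = ((original - recompressed) ** 2).mean(axis=-1)
    return maps

# frame = Image.open("suspect_frame.png").convert("RGB")
# ghost = jpeg_ghost_maps(frame)   # inspect maps for localized dips
```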


2021 ◽  
Vol 11 (11) ◽  
pp. 4940
Author(s):  
Jinsoo Kim ◽  
Jeongho Cho

Research on video data faces the difficulty of extracting not only spatial but also temporal features, and human action recognition (HAR) is a representative field that applies convolutional neural networks (CNNs) to video data. Action recognition performance has improved, but owing to model complexity, some limitations to real-time operation persist. Therefore, a lightweight CNN-based single-stream HAR model that can operate in real time is proposed. The proposed model extracts spatial feature maps by applying a CNN to the images that compose the video and uses the frame change rate of sequential images as temporal information. The spatial feature maps are weighted-averaged by frame change rate, transformed into spatiotemporal features, and input into a multilayer perceptron, which has relatively lower complexity than other HAR models; thus, our method has high utility in a single embedded system connected to CCTV. Evaluation of action recognition accuracy and data processing speed on the challenging action recognition benchmark UCF-101 showed higher accuracy than an HAR model using long short-term memory given a small number of video frames, and the fast data processing speed confirmed the possibility of real-time operation. In addition, the performance of the proposed weighted-mean-based HAR model was verified by testing it on a Jetson Nano to confirm its usability in low-cost GPU-based embedded systems.
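A minimal sketch of the weighted-mean fusion step as the abstract describes it, assuming per-frame CNN features have already been extracted; the definition of the change rate (mean absolute difference between consecutive frames) is an assumption.

```python
import numpy as np

def weighted_spatiotemporal_feature(frames, feature_maps, eps=1e-8):
    """Fuse per-frame CNN features into one spatiotemporal feature by
    weighting each frame with its change rate.

    frames:       (T, H, W, 3) float video frames (T >= 2)
    feature_maps: (T, d)       per-frame CNN features, flattened
    """
    # Frame change rate: mean absolute difference to the previous frame.
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2, 3))
    change = np.concatenate([[diffs[0]], diffs])    # reuse first diff for t = 0
    weights = change / (change.sum() + eps)
    # Weighted average over time -> single vector for the MLP classifier.
    return (weights[:, None] * feature_maps).sum(axis=0)
```

Frames with more motion thus contribute more to the pooled feature, which is what lets a plain MLP stand in for an explicit temporal model.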


Electronics ◽  
2021 ◽  
Vol 10 (3) ◽  
pp. 325
Author(s):  
Zhihao Wu ◽  
Baopeng Zhang ◽  
Tianchen Zhou ◽  
Yan Li ◽  
Jianping Fan

In this paper, we developed a practical approach for the automatic detection of discrimination actions in social images. First, an image set is established in which various discrimination actions and relations are manually labeled. To the best of our knowledge, this is the first work to create a dataset for discrimination action recognition and relationship identification. Second, a practical approach is developed to achieve automatic detection and identification of discrimination actions and relationships in social images. Third, the task of relationship identification is seamlessly integrated with the task of discrimination action recognition into a single network called the Co-operative Visual Translation Embedding++ network (CVTransE++). We also compared our proposed method with numerous state-of-the-art methods, and our experimental results demonstrate that it significantly outperforms them.
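The abstract does not detail CVTransE++ itself; for orientation, here is a minimal sketch of the translation-embedding idea that visual translation embedding networks build on (a relation is a vector translation between entity embeddings). All names and the candidate-relation loop are illustrative.

```python
import numpy as np

def translation_score(subj, rel, obj):
    """TransE-style plausibility: subj + rel should land near obj,
    so a smaller distance means a more plausible triple."""
    return np.linalg.norm(subj + rel - obj)

def best_relation(person_a, person_b, relations):
    """Pick the relation embedding that best 'translates' one person's
    visual embedding onto the other's. `relations`: {name: vector}."""
    return min(relations,
               key=lambda r: translation_score(person_a, relations[r], person_b))
```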


Author(s):  
Kimiaki Shirahama ◽  
Kuniaki Uehara

This paper examines video retrieval based on the Query-By-Example (QBE) approach, where shots relevant to a query are retrieved from large-scale video data based on their similarity to example shots. This involves two crucial problems: the first is that similarity in features does not necessarily imply similarity in semantic content; the second is the expensive computational cost of comparing a huge number of shots to the example shots. The authors have developed a method that can filter out a large number of shots irrelevant to a query, based on a video ontology, that is, a knowledge base about concepts displayed in a shot. The method utilizes various concept relationships (e.g., generalization/specialization, sibling, part-of, and co-occurrence) defined in the video ontology. In addition, although the video ontology assumes that shots are accurately annotated with concepts, accurate annotation is difficult due to the diversity of forms and appearances of the concepts. Dempster-Shafer theory is therefore used to account for the uncertainty in determining the relevance of a shot from its inaccurate annotation. Experimental results on TRECVID 2009 video data validate the effectiveness of the method.
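Dempster-Shafer combination itself is standard; the sketch below implements Dempster's rule of combination, which is how two uncertain pieces of evidence about a shot's annotation could be fused. The focal sets and masses in the usage comment are illustrative, not taken from the paper.

```python
def dempster_combine(m1, m2):
    """Dempster's rule for two mass functions over frozenset focal elements.
    m1, m2: dict mapping frozenset -> mass (each sums to 1)."""
    combined, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb           # mass assigned to the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: evidence cannot be combined")
    return {s: m / (1.0 - conflict) for s, m in combined.items()}

# Two detectors' beliefs that a shot shows a "car":
# m1 = {frozenset({"car"}): 0.6, frozenset({"car", "no-car"}): 0.4}
# m2 = {frozenset({"car"}): 0.5, frozenset({"no-car"}): 0.2,
#       frozenset({"car", "no-car"}): 0.3}
# fused = dempster_combine(m1, m2)
```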


Robotics ◽  
2019 ◽  
Vol 8 (3) ◽  
pp. 58
Author(s):  
Yusuke Adachi ◽  
Masahide Ito ◽  
Tadashi Naruse

This paper addresses a strategy learning problem in the RoboCupSoccer Small Size League (SSL). We propose a novel method based on action sequences to cluster an opponent's strategies online. Our proposed method is composed of the following three steps: (1) extracting typical actions from geometric data to form action sequences, (2) calculating the dissimilarity of the sequences, and (3) clustering the sequences using that dissimilarity. This method reduces the amount of data used in the clustering process; handling action sequences instead of geometric data as the dataset makes it easier to search for actions. As a result, the proposed clustering method is feasible online and is also applicable to countering an opponent's strategy. The effectiveness of the proposed method was validated by experimental results.
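The abstract does not name the dissimilarity measure or clustering rule; the sketch below uses Levenshtein distance over action symbols and a greedy threshold rule as stand-ins, chosen because both can run online as sequences arrive.

```python
def edit_distance(a, b):
    """Levenshtein distance between two action sequences (lists of symbols)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (x != y))) # substitution
        prev = cur
    return prev[-1]

def online_cluster(sequences, threshold=3):
    """Greedy online clustering: join the first cluster whose representative
    is within `threshold` edits, otherwise start a new cluster."""
    clusters = []                                   # list of lists of sequences
    for seq in sequences:
        for cluster in clusters:
            if edit_distance(seq, cluster[0]) <= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters

# plays = [["pass", "shoot"], ["pass", "dribble", "shoot"], ["clear"]]
# print(online_cluster(plays, threshold=1))
```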


2020 ◽  
Vol 34 (07) ◽  
pp. 12233-12240
Author(s):  
Wenjing Wang ◽  
Jizheng Xu ◽  
Li Zhang ◽  
Yue Wang ◽  
Jiaying Liu

Recently, neural style transfer has drawn much attention and significant progress has been made, especially in image style transfer. However, flexible and consistent style transfer for videos remains a challenging problem. Existing training strategies, which either use a significant amount of video data with optical flows or introduce single-frame regularizers, have limited performance on real videos. In this paper, we propose a novel interpretation of temporal consistency, based on which we analyze the drawbacks of existing training strategies and then derive a new compound regularization. Experimental results show that the proposed regularization better balances spatial and temporal performance, which supports our modeling. Combining it with the new cost formula, we design a zero-shot video style transfer framework. Moreover, for better feature migration, we introduce a new module to dynamically adjust inter-channel distributions. Quantitative and qualitative results demonstrate the superiority of our method over other state-of-the-art style transfer methods. Our project is publicly available at: https://daooshee.github.io/CompoundVST/.
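The compound regularization itself is not spelled out in the abstract; as a crude illustration of what a temporal-consistency term measures, the sketch below penalizes stylized-output changes that exceed the corresponding input changes between a pair of frames. This is a generic stand-in, not the paper's formulation.

```python
import numpy as np

def temporal_consistency_penalty(stylized_t, stylized_prev, input_t, input_prev):
    """Penalize output flicker beyond the motion present in the input.
    All arguments: (H, W, 3) float arrays for consecutive frames."""
    out_change = np.abs(stylized_t - stylized_prev)
    in_change = np.abs(input_t - input_prev)
    # Only changes not explained by input motion count as inconsistency.
    return np.maximum(out_change - in_change, 0.0).mean()
```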


2013 ◽  
Vol 631-632 ◽  
pp. 1303-1308
Author(s):  
He Jin Yuan

A novel human action recognition algorithm based on key postures is proposed in this paper. In the method, the mesh features of each image in a human action sequence are first calculated; the key postures are then generated from the mesh features through the k-medoids clustering algorithm, and each motion sequence is thus represented as a vector of key postures, where each component of the vector is the number of occurrences of the corresponding posture in the action. For recognition, the observed action is first converted into a key posture vector; the correlation coefficients with the training samples are then calculated, and the action that best matches the observed sequence is chosen as the final category. Experiments on the Weizmann dataset demonstrate that our method is effective for human action recognition, with an average recognition accuracy exceeding 90%.
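A minimal sketch of the bag-of-key-postures representation and the correlation-based matching described above, assuming the key postures (cluster medoids) have already been computed by k-medoids; feature shapes are assumptions.

```python
import numpy as np

def key_posture_histogram(mesh_features, medoids):
    """Count how often each key posture is the nearest medoid to a frame.
    mesh_features: (T, d) per-frame mesh features; medoids: (K, d)."""
    dists = np.linalg.norm(mesh_features[:, None, :] - medoids[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)                  # key posture index per frame
    return np.bincount(nearest, minlength=len(medoids)).astype(float)

def classify_by_correlation(query_hist, train_hists, train_labels):
    """Nearest neighbor under the correlation coefficient, as in the paper."""
    corrs = [np.corrcoef(query_hist, h)[0, 1] for h in train_hists]
    return train_labels[int(np.argmax(corrs))]
```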


2005 ◽  
Vol 05 (01) ◽  
pp. 111-133 ◽  
Author(s):  
HONGMEI LIU ◽  
JIWU HUANG ◽  
YUN Q. SHI

In this paper, we propose a blind video data-hiding algorithm in the DWT (discrete wavelet transform) domain. It embeds multiple information bits into uncompressed video sequences. The major features of this algorithm are as follows. (1) A novel embedding strategy in the DWT domain: different from existing DWT-based schemes, which have explicitly excluded the LL subband coefficients from data embedding, we embed data in the LL subband for better invisibility and robustness. The underlying idea comes from our qualitative and quantitative analysis of the magnitude distribution of DWT coefficients over commonly used images. The experimental results confirm the superiority of the proposed embedding strategy. (2) To combat temporal attacks, which destroy the synchronization of hidden data that is necessary for data retrieval, we develop an effective temporal synchronization technique. Compared with the sliding correlation proposed in existing algorithms, our synchronization technique is more advanced. (3) We adopt a new 3D interleaving technique to combat bursts of errors, while reducing random error probabilities in data detection by exploiting ECC (error-correcting coding). The detection error rate with 3D interleaving is much lower than without it when the frame loss rate is below 50%. (4) We take a carefully designed measure in bit embedding to guarantee the invisibility of the information. In experiments, we embedded a string of 402 bytes (excluding the redundant bits associated with ECC) in 96 frames of a CIF-format sequence. The experimental results demonstrate that the embedded information bits are perceptually transparent whether the frames in the sequence are viewed as still images or played continuously. The hidden information is robust to manipulations such as MPEG-2 compression, scaling, additive random noise, and frame loss.
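The paper's exact embedding rule is not given in the abstract; the sketch below shows LL-subband embedding per feature (1) using PyWavelets, with quantization index modulation (QIM) as a generic stand-in. The Haar wavelet, the step size delta, and one bit per coefficient are assumptions.

```python
import numpy as np
import pywt

def embed_bits_ll(frame, bits, delta=8.0):
    """Hide bits in the LL subband of a one-level Haar DWT via QIM.
    frame: (H, W) float grayscale frame; bits: iterable of 0/1,
    with len(bits) <= number of LL coefficients."""
    LL, (LH, HL, HH) = pywt.dwt2(frame, "haar")
    flat = LL.ravel()                               # view into LL
    for i, b in enumerate(bits):                    # one coefficient per bit
        q = np.round(flat[i] / delta) * delta       # nearest quantizer point
        flat[i] = q + (delta / 4 if b else -delta / 4)
    return pywt.idwt2((LL, (LH, HL, HH)), "haar")

def extract_bits_ll(frame, n_bits, delta=8.0):
    """Blind extraction: the residue of each LL coefficient mod delta
    falls near delta/4 for a 1 and 3*delta/4 for a 0."""
    LL, _ = pywt.dwt2(frame, "haar")
    flat = LL.ravel()
    return [1 if (flat[i] % delta) < delta / 2 else 0 for i in range(n_bits)]
```

Note this round trip assumes the frame stays in floating point; quantizing back to 8-bit pixels or compressing the video would perturb the coefficients, which is exactly what the ECC and 3D interleaving of feature (3) are there to absorb.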

