Temporal alignment: Recently Published Documents


TOTAL DOCUMENTS: 134 (five years: 39)
H-INDEX: 12 (five years: 2)

2021, pp. 1-19. Author(s): Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Image Captioning is the task of translating an input image into a textual description. As such, it connects Vision and Language in a generative fashion, with applications ranging from multi-modal search engines to assistive technologies for visually impaired people. Although recent years have witnessed an increase in the accuracy of such models, this has come with increasing complexity and new challenges in interpretability and visualization. In this work, we focus on Transformer-based image captioning models and provide qualitative and quantitative tools to increase interpretability and to assess the grounding and temporal alignment capabilities of such models. First, we employ attribution methods to visualize what the model concentrates on in the input image at each step of the generation. Further, we propose metrics to evaluate the temporal alignment between model predictions and attribution scores, which allows us to measure the grounding capabilities of the model and to spot hallucination flaws. Experiments are conducted on three different Transformer-based architectures, employing both traditional and Vision Transformer-based visual features.
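
The alignment metrics themselves are only named in this abstract; the sketch below illustrates the underlying idea with NumPy, checking how much of a decoding step's attribution mass falls on the image region a generated word should be grounded in. The `attributions` array and region annotations are hypothetical, not the authors' data format.

```python
# A minimal sketch (not the authors' exact metric) of relating generation
# steps to attribution scores. attributions[t, r] is assumed to hold the
# attribution of image region r at decoding step t; the region annotations
# per generated token are likewise hypothetical.
import numpy as np

def grounding_score(attributions, step, regions):
    """Fraction of step-`step` attribution mass falling on `regions`."""
    total = attributions[step].sum()
    if total <= 0:
        return 0.0
    return attributions[step, regions].sum() / total

# Toy example: 3 decoding steps over 4 image regions.
attr = np.array([[0.7, 0.1, 0.1, 0.1],       # step 0 focuses on region 0
                 [0.1, 0.6, 0.2, 0.1],       # step 1 focuses on region 1
                 [0.25, 0.25, 0.25, 0.25]])  # step 2 is diffuse
print(grounding_score(attr, 0, [0]))  # ~0.7 -> well grounded
print(grounding_score(attr, 2, [3]))  # 0.25 -> possible hallucination
```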


2021, pp. 1-19. Author(s): Alexandra N. Scurry, Daniela M. Lemus, Fang Jiang

Reliable duration perception is an integral aspect of daily life that impacts everyday perception, motor coordination, and the subjective passage of time. Scalar Expectancy Theory (SET) is a common model that explains how an internal pacemaker, gated by an external stimulus-driven switch, accumulates pulses during sensory events and compares these accumulated pulses to a reference memory duration for subsequent duration estimation. Second-order mechanisms, such as multisensory integration (MSI) and attention, can influence this model and affect duration perception. For instance, diverting attention away from temporal features could delay the switch closure or temporarily open the accumulator, altering pulse accumulation and distorting duration perception. In crossmodal duration perception, auditory signals of unequal duration can induce perceptual compression and expansion of the durations of visual stimuli, presumably via auditory influence on the visual clock. The current project investigated the role of temporal (stimulus alignment) and nontemporal (stimulus complexity) features in crossmodal, specifically auditory over visual, duration perception. While temporal alignment had a larger impact on the strength of crossmodal duration percepts than stimulus complexity, both features demonstrated auditory dominance in the processing of visual duration.
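
For readers unfamiliar with SET, the following is a minimal sketch of the pacemaker-accumulator mechanism described above. The pacemaker rate and switch latency are hypothetical illustration values, not parameters from this study.

```python
# Minimal pacemaker-accumulator sketch of SET: Poisson pulses accumulate
# while the attention-gated switch is closed; delaying switch closure
# (e.g., by diverting attention) shortens the perceived duration.
import numpy as np

rng = np.random.default_rng(0)

def perceived_duration(true_ms, pacemaker_hz=50.0, switch_delay_ms=30.0):
    """Accumulate Poisson pacemaker pulses while the switch is closed."""
    open_time = max(true_ms - switch_delay_ms, 0.0)  # attention delays closure
    pulses = rng.poisson(pacemaker_hz * open_time / 1000.0)
    return 1000.0 * pulses / pacemaker_hz  # convert pulse count back to ms

# A larger switch delay (diverted attention) compresses perceived duration.
print(perceived_duration(500, switch_delay_ms=30))   # near 470 ms
print(perceived_duration(500, switch_delay_ms=120))  # near 380 ms
```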


Author(s): Zhongyi Zhou, Anran Xu, Koji Yatani

The beauty of synchronized dancing lies in the synchronization of body movements among multiple dancers. While dancers use camera recordings in their practice, standard video interfaces do not efficiently support the task of identifying segments where dancers are not well synchronized, and thus fail to close the tight loop of an iterative practice process (i.e., capturing a practice, reviewing the video, and practicing again). We present SyncUp, a system that provides multiple interactive visualizations to support the practice of synchronized dancing and liberates users from the manual inspection of recorded practice videos. By analyzing videos uploaded by users, SyncUp quantifies two aspects of synchronization in dancing: pose similarity among multiple dancers and the temporal alignment of their movements. The system then highlights which body parts and which portions of the dance routine require further practice to achieve better synchronization. The results of our system evaluations show that our pose similarity estimates and temporal alignment predictions correlate well with human ratings. Participants in our qualitative user evaluation highlighted the benefits and potential uses of SyncUp, confirming that it would enable quick iterative practice.
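
The abstract does not specify how SyncUp computes its two measures; a plausible sketch, assuming 2D pose keypoints per frame, uses cosine similarity for pose agreement and dynamic time warping (DTW) for temporal alignment.

```python
# Illustrative sketch only (SyncUp's actual formulation is not given in
# the abstract): pose similarity as cosine similarity of centered 2D
# keypoints, and temporal alignment as the mean frame offset along a DTW
# path between two dancers' pose sequences.
import numpy as np

def pose_similarity(a, b):
    """Cosine similarity of two centered, flattened (K, 2) keypoint sets."""
    a = (a - a.mean(axis=0)).ravel()
    b = (b - b.mean(axis=0)).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def dtw_offset(seq_a, seq_b):
    """DTW over per-frame pose distance; returns the mean frame offset."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 1.0 - pose_similarity(seq_a[i - 1], seq_b[j - 1])
            cost[i, j] = d + min(cost[i-1, j], cost[i, j-1], cost[i-1, j-1])
    # Backtrack the warping path and average the i - j frame offsets.
    offsets, i, j = [], n, m
    while i > 0 and j > 0:
        offsets.append(i - j)
        step = np.argmin([cost[i-1, j-1], cost[i-1, j], cost[i, j-1]])
        i, j = (i-1, j-1) if step == 0 else (i-1, j) if step == 1 else (i, j-1)
    return float(np.mean(offsets))  # > 0: dancer A lags behind dancer B
```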


2021, Vol 141, pp. 107175. Author(s): Qiuquan Yan, Yiming Li, Jun Zhang, Xin Zheng, Dan Wu, ...

Author(s): Songyang Zhang, Jiale Zhou, Xuming He

Few-shot video classification aims to learn new video categories with only a few labeled examples, alleviating the burden of costly annotation in real-world applications. However, it is particularly challenging to learn a class-invariant spatial-temporal representation in such a setting. To address this, we propose a novel matching-based few-shot learning strategy for video sequences. Our main idea is to introduce an implicit temporal alignment for a video pair, capable of estimating the similarity between the two videos in an accurate and robust manner. Moreover, we design an effective context encoding module to incorporate spatial and feature-channel context, resulting in better modeling of intra-class variations. To train our model, we develop a multi-task loss for learning video matching, leading to video features with better generalization. Extensive experiments on two challenging benchmarks show that our method outperforms prior art by a sizable margin on Something-Something-V2 and achieves competitive results on Kinetics.
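
The abstract describes the implicit temporal alignment only at a high level; one common instantiation, sketched below under that assumption, scores a video pair by softly matching each query frame to the support frames rather than committing to a hard frame-to-frame assignment.

```python
# Hedged sketch of soft (implicit) temporal alignment between two videos:
# each query frame distributes its match over all support frames via a
# softmax, so the score is robust to temporal shifts. Feature shapes and
# the temperature tau are illustrative assumptions.
import numpy as np

def soft_aligned_similarity(query, support, tau=0.1):
    """query, support: (T, D) L2-normalized frame features."""
    sim = query @ support.T                     # (Tq, Ts) cosine similarities
    w = np.exp(sim / tau)
    w /= w.sum(axis=1, keepdims=True)           # soft alignment per query frame
    return float((w * sim).sum(axis=1).mean())  # expected matched similarity

rng = np.random.default_rng(1)
q = rng.normal(size=(8, 16)); q /= np.linalg.norm(q, axis=1, keepdims=True)
s = np.roll(q, 2, axis=0)                       # same clip, shifted in time
print(soft_aligned_similarity(q, s))            # stays high despite the shift
```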


Sensors, 2021, Vol 21 (14), pp. 4777. Author(s): Jan Christian Brønd, Natascha Holbæk Pedersen, Kristian Traberg Larsen, Anders Grøntved

Combining accelerometry from multiple independent activity monitors worn by the same subject has gained widespread interest for the assessment of physical activity behavior. However, differences in the real-time clock accuracy of the monitors introduce a substantial, and commonly overlooked, temporal misalignment in long-duration recordings. In this study, a novel method requiring no human interaction is described for the temporal alignment of triaxial acceleration measured with two independent activity monitors, and its performance is evaluated against manually identified misalignment. The method was evaluated on free-living recordings with both combined wrist/hip (n = 9) and thigh/hip (n = 30) wear locations, and descriptive data on the initial offset and the accumulated day-7 drift were calculated in a large-scale population-based study (n = 2513). The results of the Bland–Altman analysis show good agreement between the proposed algorithm and the reference, suggesting that the described method is valid for reducing temporal misalignment and thus the measurement error in aggregated data. Applying the algorithm to the n = 2513 recordings from devices worn for 7 days suggests a widespread and substantial issue with drift over time when each subject wears two independent activity monitors.
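
The abstract does not detail the alignment algorithm itself; a plausible sketch estimates the clock offset between two monitors by cross-correlating their acceleration vector magnitudes. The sampling rate and synthetic signals below are assumptions for illustration, not the study's data.

```python
# Hedged sketch: estimate the clock offset between two accelerometers by
# cross-correlating their (mean-removed) acceleration magnitudes. Shared
# movement bouts dominate the correlation peak.
import numpy as np

def estimate_offset(mag_a, mag_b, fs_hz):
    """Shift (s) to apply to mag_b via np.roll so it lines up with mag_a."""
    a = mag_a - mag_a.mean()
    b = mag_b - mag_b.mean()
    xcorr = np.correlate(a, b, mode="full")
    lag = np.argmax(xcorr) - (len(b) - 1)  # in samples
    return lag / fs_hz

fs = 30.0                                   # hypothetical 30 Hz monitors
t = np.arange(0, 60, 1 / fs)
common = np.sin(2 * np.pi * 1.5 * t) * (t % 10 < 5)   # shared movement bouts
mag_a = common + 0.05 * np.random.default_rng(2).normal(size=t.size)
mag_b = np.roll(common, int(2.4 * fs))      # monitor b's clock runs 2.4 s late
print(estimate_offset(mag_a, mag_b, fs))    # ~ -2.4: roll b earlier to realign
```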


2021, Vol 6 (3), pp. 4297-4304. Author(s): Feng Lu, Baifan Chen, Xiang-Dong Zhou, Dezhen Song

Author(s): Congqi Cao, Yajuan Li, Qinyi Lv, Peng Wang, Yanning Zhang
