scholarly journals Accurate and Efficient Video De-Fencing Using Convolutional Neural Networks and Temporal Information

Author(s):  
Chen Du ◽  
Byeongkeun Kang ◽  
Zheng Xu ◽  
Ji Dai ◽  
Truong Nguyen
Author(s):  
Jorge F. Lazo ◽  
Aldo Marzullo ◽  
Sara Moccia ◽  
Michele Catellani ◽  
Benoit Rosa ◽  
...  

Abstract Purpose Ureteroscopy is an efficient endoscopic minimally invasive technique for the diagnosis and treatment of upper tract urothelial carcinoma. During ureteroscopy, the automatic segmentation of the hollow lumen is of primary importance, since it indicates the path that the endoscope should follow. In order to obtain an accurate segmentation of the hollow lumen, this paper presents an automatic method based on convolutional neural networks (CNNs). Methods The proposed method is based on an ensemble of 4 parallel CNNs to simultaneously process single and multi-frame information. Of these, two architectures are taken as core-models, namely U-Net based in residual blocks ($$m_1$$ m 1 ) and Mask-RCNN ($$m_2$$ m 2 ), which are fed with single still-frames I(t). The other two models ($$M_1$$ M 1 , $$M_2$$ M 2 ) are modifications of the former ones consisting on the addition of a stage which makes use of 3D convolutions to process temporal information. $$M_1$$ M 1 , $$M_2$$ M 2 are fed with triplets of frames ($$I(t-1)$$ I ( t - 1 ) , I(t), $$I(t+1)$$ I ( t + 1 ) ) to produce the segmentation for I(t). Results The proposed method was evaluated using a custom dataset of 11 videos (2673 frames) which were collected and manually annotated from 6 patients. We obtain a Dice similarity coefficient of 0.80, outperforming previous state-of-the-art methods. Conclusion The obtained results show that spatial-temporal information can be effectively exploited by the ensemble model to improve hollow lumen segmentation in ureteroscopic images. The method is effective also in the presence of poor visibility, occasional bleeding, or specular reflections.


2019 ◽  
Vol 16 (1) ◽  
pp. 172988141882509 ◽  
Author(s):  
Hanbo Wu ◽  
Xin Ma ◽  
Yibin Li

Temporal information plays a significant role in video-based human action recognition. How to effectively extract the spatial–temporal characteristics of actions in videos has always been a challenging problem. Most existing methods acquire spatial and temporal cues in videos individually. In this article, we propose a new effective representation for depth video sequences, called hierarchical dynamic depth projected difference images that can aggregate the action spatial and temporal information simultaneously at different temporal scales. We firstly project depth video sequences onto three orthogonal Cartesian views to capture the 3D shape and motion information of human actions. Hierarchical dynamic depth projected difference images are constructed with the rank pooling in each projected view to hierarchically encode the spatial–temporal motion dynamics in depth videos. Convolutional neural networks can automatically learn discriminative features from images and have been extended to video classification because of their superior performance. To verify the effectiveness of hierarchical dynamic depth projected difference images representation, we construct a hierarchical dynamic depth projected difference images–based action recognition framework where hierarchical dynamic depth projected difference images in three views are fed into three identical pretrained convolutional neural networks independently for finely retuning. We design three classification schemes in the framework and different schemes utilize different convolutional neural network layers to compare their effects on action recognition. Three views are combined to describe the actions more comprehensively in each classification scheme. The proposed framework is evaluated on three challenging public human action data sets. Experiments indicate that our method has better performance and can provide discriminative spatial–temporal information for human action recognition in depth videos.


2017 ◽  
Vol 37 (4-5) ◽  
pp. 452-471 ◽  
Author(s):  
Connor Schenck ◽  
Dieter Fox

Liquids are an important part of many common manipulation tasks in human environments. If we wish to have robots that can accomplish these types of tasks, they must be able to interact with liquids in an intelligent manner. In this paper, we investigate ways for robots to perceive and reason about liquids. That is, a robot asks the questions What in the visual data stream is liquid? and How can I use that to infer all the potential places where liquid might be? We collected two data sets to evaluate these questions, one using a realistic liquid simulator and another using our robot. We used fully convolutional neural networks to learn to detect and track liquids across pouring sequences. Our results show that these networks are able to perceive and reason about liquids, and that integrating temporal information is important to performing such tasks well.


Sign in / Sign up

Export Citation Format

Share Document