Image Descriptions
Recently Published Documents


TOTAL DOCUMENTS: 101 (FIVE YEARS: 31)
H-INDEX: 14 (FIVE YEARS: 3)

2021 · Vol 4
Author(s): Nikolai Ilinykh, Simon Dobnik

Neural networks have proven very successful at automatically capturing the composition of language and other structures across a range of multi-modal tasks. An important question, therefore, is how neural networks learn and organise such structures. Numerous studies have examined the knowledge captured by language models (LSTMs, transformers) and vision architectures (CNNs, vision transformers) on their respective uni-modal tasks. However, very few have explored what structures are acquired by multi-modal transformers, in which linguistic and visual features are combined. It is critical to understand the representations learned by each modality, their interplay, and the task's effect on these representations in large-scale architectures. In this paper, we take a multi-modal transformer trained for image captioning and examine the structure of the self-attention patterns extracted from its visual stream. Our results indicate that information about the relations between objects in the visual stream is hierarchical, ranging from a local to a global, object-level understanding of the image. In particular, while visual representations in the first layers encode relations between semantically similar object detections, often constituting neighbouring objects, deeper layers spread their attention across more distant objects and learn global relations between them. We also show that globally attended objects in deeper layers can be linked to entities mentioned in image descriptions, indicating a critical finding: the indirect effect of language on visual representations. In addition, we highlight how object-based input representations affect the structure of the learned visual knowledge and guide the model towards more accurate image descriptions. A parallel question we investigate is whether insights from cognitive science echo the structure of the representations that the current neural architecture learns.
The proposed analysis of the inner workings of multi-modal transformers can be used to better understand and improve pre-training of large-scale multi-modal architectures, multi-modal information fusion, and probing of attention weights. More generally, we contribute to explainable multi-modal natural language processing and to the currently shallow understanding of how input representations and the structure of the multi-modal transformer affect visual representations.
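The attention patterns probed in this kind of analysis come from standard scaled dot-product self-attention over object-region features. A minimal NumPy sketch (function names, dimensions, and random features are illustrative, not the paper's actual model) of how a per-layer attention matrix over detected objects is computed and could then be inspected:

```python
import numpy as np

def self_attention_weights(features, w_q, w_k):
    """Single-head self-attention pattern over object-region features.

    features: (n_objects, d) array, one row per detected object.
    Returns an (n_objects, n_objects) matrix whose row i says how much
    object i attends to every object -- the kind of pattern examined
    layer by layer when probing a multi-modal transformer.
    """
    q = features @ w_q                             # queries
    k = features @ w_k                             # keys
    scores = q @ k.T / np.sqrt(q.shape[-1])        # scaled dot-product
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax

rng = np.random.default_rng(0)
n_objects, d = 5, 16
feats = rng.normal(size=(n_objects, d))            # toy "object detections"
attn = self_attention_weights(feats,
                              rng.normal(size=(d, d)),
                              rng.normal(size=(d, d)))
# Each row of `attn` is a probability distribution over the detected
# objects; comparing such matrices across layers reveals whether
# attention stays on neighbouring objects or spreads globally.
```

In practice one would extract these matrices from a trained model (many transformer libraries expose per-layer attention weights directly) rather than compute them from random projections as in this toy sketch.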


2021
Author(s): Emory James Edwards, Kyle Lewis Polster, Isabel Tuason, Emily Blank, Michael Gilbert, ...

2021
Author(s): Abigale Stangl, Nitin Verma, Kenneth R. Fleischmann, Meredith Ringel Morris, Danna Gurari

2021
Author(s): Alessandra Helena Jandrey, Duncan Dubugras Alcoba Ruiz, Milene Selbach Silveira

2021 · Vol ahead-of-print (ahead-of-print)
Author(s): Carola Strandberg, Maria Ek Styvén

Purpose: This paper aims to explore how place identity can be expressed in residents' place image descriptions, addressing differences and similarities in place identity expressions between residents' descriptions of the image of their place and the image of the place as described to others.

Design/methodology/approach: In-depth interviews were conducted with residents of a Swedish city. Place image descriptions were analyzed through thematic analysis.

Findings: Different types of identity perspectives manifest in the place image descriptions of residents. Respondents' associations reflect place, person and social group identity perspectives, including their own perspective as residents, but also as visitors, or a combination of both. Priming is needed when gathering place image perceptions, to establish which underlying identity perspective is expressed.

Research limitations/implications: This study offers a Nordic perspective on the organic communication of place image. The scope and qualitative nature of the study limit its generalizability but also suggest rich ground for future cross-cultural studies on the topic.

Practical implications: Results point to the importance of accurately formulating questions to capture stakeholders' place image. Insights are offered into how stakeholders communicate Nordic place image perceptions when engaging in communication about a place, and into the effects of identity on organic place brand communication.

Originality/value: To the best of the authors' knowledge, this study is among the first to explore how key stakeholders' lenses for interpreting a place brand are activated in the communication of place image, and how this influences their descriptions of the place.


2021 · Vol 146 · pp. 70-76
Author(s): Emre Boran, Aykut Erdem, Nazli Ikizler-Cinbis, Erkut Erdem, Pranava Madhyastha, ...

Author(s): Tasmia Tasmia, Md Sultan Al Nahian, Brent Harrison

In this work, we propose a deep neural architecture with an attention mechanism that combines region-based image features, the natural-language question, and semantic knowledge extracted from the regions of an image to produce open-ended answers in a visual question answering (VQA) task. Combining region-based visual features with region-based textual information about the image helps a model answer questions more accurately, potentially with less training data. We evaluate the proposed architecture on a VQA task against a strong baseline and show that it achieves excellent results on this task.
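The core operation such a question-guided attention mechanism performs can be sketched in a few lines: score each image region against the question encoding, softmax the scores into an attention distribution, and take the weighted sum of region features as the visual summary passed to the answer classifier. A minimal NumPy sketch (the function name, dimensions, and random inputs are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

def attend_regions(question_vec, region_feats):
    """Attend over region-based image features, using the question as the query.

    question_vec: (d,) encoding of the natural-language question.
    region_feats: (n_regions, d) features for detected image regions.
    Returns the attention distribution over regions and the attended
    visual summary that would feed an answer classifier.
    """
    d = len(question_vec)
    scores = region_feats @ question_vec / np.sqrt(d)  # scaled relevance scores
    scores -= scores.max()                             # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax over regions
    context = weights @ region_feats                   # weighted sum of regions
    return weights, context

rng = np.random.default_rng(1)
d, n_regions = 8, 4
q_vec = rng.normal(size=d)                 # toy question encoding
regions = rng.normal(size=(n_regions, d))  # toy region features
weights, context = attend_regions(q_vec, regions)
# `weights` shows which regions the question selects;
# `context` is the attended visual representation.
```

A full model would additionally fuse the region-based semantic knowledge (e.g. textual labels for each region) into `region_feats` before attending, and decode `context` into an open-ended answer.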

