Multilabel Image Annotation Based on Double-Layer PLSA Model

2014 ◽  
Vol 2014 ◽  
pp. 1-9 ◽  
Author(s):  
Jing Zhang ◽  
Da Li ◽  
Weiwei Hu ◽  
Zhihua Chen ◽  
Yubo Yuan

Due to the semantic gap between visual features and semantic concepts, automatic image annotation remains a difficult problem in computer vision. In this paper we propose a new multilabel image annotation method based on a double-layer probabilistic latent semantic analysis (PLSA) model. The double-layer PLSA model is constructed to bridge the low-level visual features and the high-level semantic concepts of images for effective image understanding. The low-level features of each image are represented as visual words with the bag-of-words model; latent semantic topics are then obtained by the first-layer PLSA from the visual and texture aspects, respectively. The second-layer PLSA fuses the visual and texture latent semantic topics into a top-layer latent semantic topic. Through the double-layer PLSA, the relationships between visual features and semantic concepts are established, so the labels of new images can be predicted from their low-level features. Experimental results demonstrate that the proposed double-layer PLSA annotation model achieves promising labeling performance and outperforms previous methods on the standard Corel dataset.
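The building block of this model is PLSA fitted by expectation-maximization over image/visual-word co-occurrence counts. The sketch below is a minimal single-layer PLSA in Python (the function name and parameters are illustrative, not taken from the paper); the double-layer model would stack a second such decomposition on top of the per-aspect topic distributions.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit a basic PLSA model with EM on a document-word count matrix.

    counts: (n_images, n_visual_words) array of visual-word counts per image.
    Returns P(topic|image) and P(word|topic).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = rng.random((n_docs, n_topics))           # P(z|d)
    p_w_z = rng.random((n_topics, n_words))          # P(w|z)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w) for every (image, word, topic) triple
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]     # (docs, topics, words)
        p_z_dw = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts
        expected = counts[:, None, :] * p_z_dw            # (docs, topics, words)
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z
```

On an (n_images, n_visual_words) count matrix, `p_z_d` gives each image's topic mixture; a second-layer PLSA in the spirit of the paper would treat the per-aspect topic mixtures as new observations to be fused into top-layer topics.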

2014 ◽  
Vol 14 (03) ◽  
pp. 1450012
Author(s):  
Yongmei Liu ◽  
Tanakrit Wongwitit ◽  
Linsen Yu

Automatic image annotation is an important and challenging task for image analysis and understanding, for example in content-based image retrieval (CBIR). The relationship between keywords and visual features is complicated by the semantic gap. We present an approach to automatic image annotation based on scene analysis: under the constraint of scene semantics, the correlation between keywords and visual features becomes simpler and clearer. Our model has two stages. The first is a training stage that groups the training images into semantic scenes using the extracted semantic features, and into visual scenes constructed from the pairwise distances between the visual features of training images, computed with the Earth mover's distance (EMD). Each pair of semantic and visual scenes is then combined, and a Gaussian mixture model (GMM) is fitted for every scene. The second stage annotates the test images with keywords. Using the visual features provided by Duygulu, experimental results show that our model outperforms the probabilistic latent semantic analysis and GMM (PLSA&GMM) model on the Corel5K database.
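As a rough illustration of the two training ingredients, the sketch below computes pairwise EMD between per-image visual-word histograms (a 1-D simplification of the signature-based EMD used in the paper; SciPy's wasserstein_distance is an assumption, not the authors' implementation) and fits one GMM per scene group with scikit-learn.

```python
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.mixture import GaussianMixture

def pairwise_emd(histograms):
    """1-D Earth mover's distance between visual-word histograms.

    histograms: (n_images, n_bins) nonnegative array; each row is treated
    as a distribution over bin indices (a simplification of full EMD over
    feature signatures).
    """
    n, n_bins = histograms.shape
    bins = np.arange(n_bins)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = wasserstein_distance(bins, bins, histograms[i], histograms[j])
            dist[i, j] = dist[j, i] = d
    return dist

def fit_scene_models(features, scene_labels, n_components=4):
    """Fit one Gaussian mixture per combined (semantic, visual) scene group."""
    models = {}
    for scene in np.unique(scene_labels):
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(features[scene_labels == scene])
        models[scene] = gmm
    return models
```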


2017 ◽  
Vol 2017 ◽  
pp. 1-11
Author(s):  
Miao Zang ◽  
Huimin Xu ◽  
Yongmei Zhang

Automatic image annotation remains a challenging task due to the semantic gap between visual features and semantic concepts. To reduce this gap, this paper puts forward a kernel-based multiview joint sparse coding (KMVJSC) framework for image annotation. In KMVJSC, different visual features as well as label information are treated as distinct views and mapped to an implicit kernel space, in which the originally nonlinearly separable data become linearly separable. All the views are then integrated into a multiview joint sparse coding framework that adaptively finds a set of optimal sparse representations and discriminative dictionaries, effectively exploiting the complementary information of the different views. An optimization algorithm is presented by extending the K-singular value decomposition (K-SVD) and accelerated proximal gradient (APG) algorithms to the kernel multiview setting. In addition, a label propagation scheme using sparse reconstruction and a weighted greedy label transfer algorithm is proposed. Comparative experiments on three datasets demonstrate the competitiveness of the proposed approach against related methods.
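The final annotation step, label transfer via sparse reconstruction, can be sketched roughly as below. This toy version reconstructs a test feature from training features with plain orthogonal matching pursuit in the original feature space rather than with the kernelized multiview dictionaries learned by KMVJSC; the function and parameter names are illustrative.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def transfer_labels(test_feat, train_feats, train_labels, n_atoms=10, n_keywords=5):
    """Annotate one test image by sparse reconstruction over training images.

    test_feat: (dim,) feature vector of the test image.
    train_feats: (n_train, dim) training features used as dictionary atoms.
    train_labels: (n_train, n_labels) binary label matrix.
    The sparse reconstruction weights act as label-transfer weights.
    """
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_atoms)
    omp.fit(train_feats.T, test_feat)        # columns = training images as atoms
    weights = np.abs(omp.coef_)              # (n_train,) sparse weights
    scores = weights @ train_labels          # accumulate weighted label votes
    top = np.argsort(scores)[::-1][:n_keywords]
    return top, scores
```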


AusArt ◽  
2016 ◽  
Vol 4 (1) ◽  
pp. 19-28
Author(s):  
Pilar Rosado Rodrigo ◽  
Eva Figueras Ferrer ◽  
Ferran Reverter Comes

From pixel to visual resonances: Images with voices

This research addresses the problem of detecting latent aspects in large collections of images of abstract artworks, attending only to their visual content, without any textual annotation. We programmed an image-description algorithm used in computer vision whose approach is to place a regular grid of interest points over the image and, around each node, select a region of pixels for which a descriptor is computed from the grey-level gradients found there; dense features computed over this regular grid with overlapping patches represent the images. By analysing the distances between the descriptors of the whole image collection, we group them according to their similarity, and each resulting group determines what we call a "visual word". This model is known as the bag-of-words representation. Given the frequency with which each visual word occurs in each image, we apply pLSA (probabilistic latent semantic analysis), a statistical model that classifies the images by their formal patterns fully automatically. The resulting tool is useful both for producing and for analysing works of art.

Keywords: artificial vision; bag-of-words model; CBIR (content-based image retrieval); pLSA (probabilistic latent semantic analysis); visual word
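A minimal sketch of the dense-grid bag-of-words pipeline described above, assuming SIFT-like gradient descriptors via OpenCV and a k-means vocabulary from scikit-learn (common choices, not necessarily the authors' exact tools); the resulting histograms could then feed a pLSA implementation such as the one sketched earlier.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def dense_descriptors(gray, step=16, size=16):
    """SIFT descriptors computed at a regular grid of keypoints.

    gray: uint8 grayscale image. Grid step and patch size are illustrative.
    """
    h, w = gray.shape
    kps = [cv2.KeyPoint(float(x), float(y), float(size))
           for y in range(step, h - step, step)
           for x in range(step, w - step, step)]
    sift = cv2.SIFT_create()
    _, desc = sift.compute(gray, kps)
    return desc                                   # (n_patches, 128)

def bow_histogram(desc, vocabulary: KMeans):
    """Quantize descriptors against a k-means vocabulary of 'visual words'."""
    words = vocabulary.predict(desc.astype(np.float32))
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / hist.sum()
```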


2015 ◽  
Vol 24 (05) ◽  
pp. 1540021 ◽  
Author(s):  
Konstantinos Pliakos ◽  
Constantine Kotropoulos

Interest in image annotation and recommendation has increased with the ever-rising amount of data uploaded to the web, yet despite the many efforts undertaken so far, accuracy and efficiency remain open problems. Here, a complete image annotation and tourism recommender system is proposed. It is based on probabilistic latent semantic analysis (PLSA) and hypergraph ranking, exploiting the visual attributes of the images and the semantic information found in image tags and geo-tags. In particular, semantic image annotation resorts to PLSA, exploiting the textual information in image tags, and is further complemented by visual annotation based on visual image-content classification. Tourist destinations strongly related to a query image are recommended using hypergraph ranking enhanced with group sparsity constraints. Experiments were conducted on a large image dataset of Greek sites collected from Flickr, and the results demonstrate the merits of the proposed model: semantic image annotation by means of PLSA achieves an average precision of 92% at 10% recall, the accuracy of content-based image classification is 82.6%, and an average precision of 92% is measured at 1% recall for tourism recommendation.
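The recommendation step relies on ranking over a hypergraph whose vertices are images or destinations and whose hyperedges group items sharing tags, geo-tags, or visual clusters. The sketch below implements the standard Zhou-style hypergraph random-walk ranking; the group-sparsity enhancement described in the paper is not reproduced here, and the incidence matrix H is assumed to be given.

```python
import numpy as np

def hypergraph_rank(H, y, edge_weights=None, alpha=0.9, n_iter=100):
    """Rank vertices of a hypergraph by propagating a query/seed vector.

    H: (n_vertices, n_edges) binary incidence matrix.
    y: (n_vertices,) query vector (e.g., 1 at the query image, 0 elsewhere).
    """
    n_v, n_e = H.shape
    w = np.ones(n_e) if edge_weights is None else np.asarray(edge_weights, float)
    d_v = H @ w                                   # vertex degrees
    d_e = H.sum(axis=0)                           # hyperedge degrees
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(d_v + 1e-12))
    # Normalized hypergraph random-walk operator
    theta = Dv_inv_sqrt @ H @ np.diag(w / (d_e + 1e-12)) @ H.T @ Dv_inv_sqrt
    f = y.astype(float).copy()
    for _ in range(n_iter):
        f = alpha * theta @ f + (1 - alpha) * y   # iterative propagation
    return f                                      # higher score = more related
```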


2013 ◽  
Vol 2013 ◽  
pp. 1-11 ◽  
Author(s):  
Fengcai Qiao ◽  
Cheng Wang ◽  
Xin Zhang ◽  
Hui Wang

Near-duplicate image retrieval is a classical research problem in computer vision with many applications, such as image annotation and content-based image retrieval. On the web, near-duplication is especially prevalent in queries for celebrities and historical figures, which are of particular interest to end users. Existing methods such as bag-of-visual-words (BoVW) solve this problem mainly by exploiting purely visual features. To overcome this limitation, this paper proposes a novel text-based, data-driven reranking framework that utilizes textual features and is combined with state-of-the-art BoVW schemes. Under this framework, the input of the retrieval procedure remains a single query image. To verify the proposed approach, a dataset of 2 million images of 1089 different celebrities, together with their accompanying texts, is constructed, and the categories of near-duplication observed in it are comprehensively analyzed. Experimental results on this dataset show that the proposed framework achieves higher mean average precision (mAP), with an improvement of 21% on average over approaches based only on visual features, while not notably prolonging the retrieval time.
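Since the query is an image only, the textual signal has to come from the retrieved candidates themselves. A minimal pseudo-relevance-feedback sketch of such data-driven reranking is given below, assuming TF-IDF text features from scikit-learn; the weighting scheme and parameters are illustrative rather than the paper's exact formulation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rerank(candidate_texts, visual_scores, top_k=10, beta=0.5):
    """Rerank BoVW candidates by textual agreement with the top visual results.

    candidate_texts: accompanying text of each retrieved candidate.
    visual_scores: BoVW similarity of each candidate to the query image.
    The texts of the top-k visually ranked candidates form a textual centroid,
    and each candidate is rescored by its cosine similarity to that centroid.
    """
    tfidf = TfidfVectorizer(stop_words="english")
    mat = tfidf.fit_transform(candidate_texts)
    visual = np.asarray(visual_scores, dtype=float)
    top = np.argsort(visual)[::-1][:top_k]
    centroid = np.asarray(mat[top].mean(axis=0))          # pseudo-relevant text
    text_scores = cosine_similarity(mat, centroid).ravel()
    visual = visual / (visual.max() + 1e-12)              # normalize to [0, 1]
    combined = beta * visual + (1 - beta) * text_scores
    return np.argsort(combined)[::-1]                     # new ranking order
```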


2021 ◽  
Author(s):  
Rui Zhang

This thesis focuses on the combination of information at different levels of a statistical pattern classification framework for image annotation and retrieval. Previous work in these fields has established that low-level visual features, such as color and texture, and high-level features, such as textual description and context, are distinct yet complementary in their distributions and in their discriminative power for machine-based recognition and retrieval tasks. Effective feature combination has therefore become a promising way to further bridge the semantic gap. Motivated by this, the combination of the visual and context modalities, and the combination of different features within the visual domain, are tackled with two statistical pattern classification approaches, since features within the visual modality and features across modalities exhibit different degrees of heterogeneity and should be treated differently. For the cross-modality combination, a Bayesian framework is proposed to integrate visual content and context, and is applied to various image annotation and retrieval frameworks. For the combination of low-level features in the visual domain, a novel method combines texture and color features via a mixture model of their joint distribution. The proposed frameworks are evaluated on several datasets, including the COREL database for image retrieval and the MSRC, LabelMe, PASCAL VOC2009, and a self-collected animal image database for image annotation. Under various evaluation criteria, the first framework proves more effective than methods based purely on low-level features or on high-level context. The second framework not only outperforms other feature combination methods but also discovers visual clusters using texture and color simultaneously. In addition, a demo search engine based on the Bayesian framework has been implemented and is available online.
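The cross-modality part of the thesis rests on a Bayesian combination of visual and context evidence. A minimal sketch of such a fusion rule, assuming conditional independence of the two modalities given the label (the thesis's actual models are richer than this), is:

```python
import numpy as np

def bayesian_fusion(log_p_visual, log_p_context, log_prior):
    """Combine visual and context evidence for each candidate label.

    Assumes P(label | visual, context) is proportional to
    P(visual | label) * P(context | label) * P(label),
    i.e. the two modalities are conditionally independent given the label.
    All inputs are (n_labels,) arrays of log-probabilities.
    """
    log_post = log_p_visual + log_p_context + log_prior
    log_post -= np.logaddexp.reduce(log_post)     # normalize in log space
    return np.exp(log_post)                       # posterior over labels
```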

