Learning Text-image Joint Embedding for Efficient Cross-modal Retrieval with Deep Feature Engineering

Zhongwei Xie; Ling Liu; Yanzhao Wu; Luo Zhong; Lin Li

doi:10.1145/3490519

ACM Transactions on Information Systems ◽

10.1145/3490519 ◽

2022 ◽

Vol 40 (4) ◽

pp. 1-27

Author(s):

Zhongwei Xie ◽

Ling Liu ◽

Yanzhao Wu ◽

Luo Zhong ◽

Lin Li

Keyword(s):

Feature Engineering ◽

Two Phase ◽

Deep Feature ◽

Latent Space ◽

Joint Embedding ◽

Semantic Alignment ◽

Triplet Loss ◽

Efficient Learning ◽

Context Features ◽

Key Terms

This article introduces a two-phase deep feature engineering framework for efficient learning of semantics enhanced joint embedding (SEJE), which clearly separates deep feature engineering in data preprocessing from training the text-image joint embedding model. We use the Recipe1M dataset for the technical description and empirical validation. In preprocessing, we perform deep feature engineering by combining deep features with semantic context features derived from the raw text-image input data. We use an LSTM to identify key terms; deep NLP models from the BERT family, TextRank, or TF-IDF to produce ranking scores for the key terms; and Word2vec to generate the vector representation of each key term. We use Wide ResNet50 and Word2vec to extract and encode the image category semantics of food images, which helps the semantic alignment of the learned recipe and image embeddings in the joint latent space. In joint embedding learning, we perform deep feature engineering by optimizing the batch-hard triplet loss function with soft margin and double negative sampling, also taking into account the category-based alignment loss and the discriminator-based alignment loss. Extensive experiments demonstrate that our SEJE approach with deep feature engineering significantly outperforms the state-of-the-art approaches.
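
The batch-hard triplet loss with soft margin mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, only the text-to-image direction is shown, Euclidean distance is an assumed choice, and the double negative sampling and the two alignment losses are left out. It assumes that `text_emb[i]` and `image_emb[i]` form a matching pair and that every other image in the batch is a negative.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def softplus(x):
    """Numerically stable ln(1 + exp(x)), the soft-margin replacement for a hinge."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def batch_hard_soft_margin_loss(text_emb, image_emb):
    """Average soft-margin loss over a batch of at least two matching pairs.

    For each text anchor, the positive is its paired image and the
    "batch-hard" negative is the closest non-matching image in the batch.
    """
    n = len(text_emb)
    total = 0.0
    for i in range(n):
        d_pos = euclidean(text_emb[i], image_emb[i])
        d_neg = min(
            euclidean(text_emb[i], image_emb[j]) for j in range(n) if j != i
        )
        total += softplus(d_pos - d_neg)
    return total / n
```

With the soft margin, a triplet that already satisfies the ordering still contributes a small, smoothly decaying loss instead of being clipped to zero.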

Image annotation and retrieval based on efficient learning of contextual latent space

2009 IEEE International Conference on Multimedia and Expo ◽

10.1109/icme.2009.5202630 ◽

2009 ◽

Cited By ~ 9

Author(s):

Tatsuya Harada ◽

Hideki Nakayama ◽

Yasuo Kuniyoshi

Keyword(s):

Image Annotation ◽

Latent Space ◽

Efficient Learning

Deep feature learning and latent space encoding for crop phenology analysis

Expert Systems with Applications ◽

10.1016/j.eswa.2021.115929 ◽

2022 ◽

Vol 187 ◽

pp. 115929

Author(s):

Arun Pattathal V ◽

Arnon Karnieli

Keyword(s):

Feature Learning ◽

Crop Phenology ◽

Deep Feature ◽

Latent Space ◽

Deep Feature Learning ◽

Space Encoding

Person Re-ID by Fusion of Video Silhouettes and Wearable Signals for Home Monitoring Applications

Sensors ◽

10.3390/s20092576 ◽

2020 ◽

Vol 20 (9) ◽

pp. 2576

Author(s):

Alessandro Masullo ◽

Tilo Burghardt ◽

Dima Damen ◽

Toby Perrett ◽

Majid Mirmehdi

Keyword(s):

Euclidean Distance ◽

Home Monitoring ◽

Video Data ◽

Free Living ◽

Specific Patient ◽

Video Clips ◽

Latent Space ◽

Monitoring Applications ◽

Short Video ◽

Triplet Loss

The use of visual sensors for monitoring people in their living environments is critical for obtaining more accurate health measurements, but their use is undermined by the issue of privacy. Silhouettes generated from RGB video can alleviate the privacy issue to a considerable degree. However, silhouettes make it rather complex to discriminate between different subjects, preventing a subject-tailored analysis of the data within a free-living, multi-occupancy home. This limitation can be overcome with a strategic fusion of sensors that involves wearable accelerometer devices, which can be used in conjunction with the silhouette video data to match video clips to a specific patient being monitored. The proposed method simultaneously solves the problem of Person ReID using silhouettes and enables home monitoring systems to employ sensor fusion techniques for data analysis. We develop a multimodal deep-learning detection framework that maps short video clips and accelerations into a latent space where the Euclidean distance can be measured to match video and acceleration streams. We train our method on the SPHERE Calorie Dataset, for which we show an average area under the ROC curve of 76.3% and an assignment accuracy of 77.4%. In addition, we propose a novel triplet loss for which we demonstrate improved performance and convergence speed.
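
The matching step described in this abstract, comparing video and acceleration embeddings by Euclidean distance in a shared latent space, can be sketched as a nearest-neighbour assignment. The function name, the dictionary layout, and the identifiers below are hypothetical, and the embeddings are assumed to have already been produced by the two encoders; the paper's actual assignment procedure may differ.

```python
import math


def match_clips_to_wearables(clip_embeddings, wearable_embeddings):
    """Assign each video clip to the wearable whose embedding is closest.

    Both arguments map an identifier to a latent vector of the same length.
    Returns a dict from clip identifier to wearable identifier.
    """
    assignments = {}
    for clip_id, clip_vec in clip_embeddings.items():
        assignments[clip_id] = min(
            wearable_embeddings,
            key=lambda w: math.dist(clip_vec, wearable_embeddings[w]),
        )
    return assignments
```

For example, a clip embedded at `[0.1, 0.0]` is assigned to a wearable embedded at `[0.0, 0.0]` rather than one at `[1.0, 1.0]`.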

Towards Integration of Domain Knowledge-Guided Feature Engineering and Deep Feature Learning in Surface Electromyography-Based Hand Movement Recognition

Computational Intelligence and Neuroscience ◽

10.1155/2021/4454648 ◽

2021 ◽

Vol 2021 ◽

pp. 1-13

Author(s):

Wentao Wei ◽

Xuhui Hu ◽

Hua Liu ◽

Ming Zhou ◽

Yan Song

Keyword(s):

Surface Electromyography ◽

Domain Knowledge ◽

Hand Movement ◽

Feature Learning ◽

Feature Engineering ◽

Time Frequency ◽

Movement Recognition ◽

Deep Feature ◽

Deep Feature Learning ◽

Semg Signals

As a machine-learning-driven decision-making problem, surface electromyography (sEMG)-based hand movement recognition is one of the key issues in robust control of noninvasive neural interfaces such as myoelectric prostheses and rehabilitation robots. Despite the recent success of end-to-end deep feature learning in sEMG-based hand movement recognition, the performance of today’s systems is still limited by the noisy, random, and nonstationary nature of sEMG signals, and researchers have proposed a number of methods that improve sEMG-based hand movement recognition via feature engineering. Aiming to achieve higher recognition accuracy while enabling a trade-off between performance and computational complexity, this study proposes a progressive fusion network (PFNet) framework, which improves sEMG-based hand movement recognition by integrating domain knowledge-guided feature engineering with deep feature learning. In particular, it learns high-level feature representations from raw sEMG signals and engineered time-frequency domain features via a feature learning network and a domain knowledge network, respectively, and then employs a 3-stage progressive fusion strategy to fuse the two networks and obtain the final decisions. Extensive experiments were conducted on five sEMG datasets to evaluate the proposed PFNet, and the results show that it achieves average hand movement recognition accuracies of 87.8%, 85.4%, 68.3%, 71.7%, and 90.3% on the five datasets, respectively, outperforming the state of the art.

Self-adaptive Re-weighted Adversarial Domain Adaptation

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/440 ◽

2020 ◽

Author(s):

Shanshan Wang ◽

Lei Zhang

Keyword(s):

Negative Transfer ◽

Domain Adaptation ◽

Conditional Entropy ◽

The Arts ◽

Proposed Model ◽

Class Level ◽

Semantic Alignment ◽

Triplet Loss ◽

Domain Alignment ◽

Self Adaptive

Existing adversarial domain adaptation methods mainly consider the marginal distribution, which may lead to either under-transfer or negative transfer. To address this problem, we present a self-adaptive re-weighted adversarial domain adaptation approach, which tries to enhance domain alignment from the perspective of conditional distribution. In order to promote positive transfer and combat negative transfer, we reduce the weight of the adversarial loss for aligned features while increasing the adversarial force for poorly aligned features, as measured by the conditional entropy. Additionally, triplet loss leveraging source samples and pseudo-labeled target samples is employed on the confusing domain. Such a metric loss ensures that the distance between intra-class sample pairs is smaller than that between inter-class pairs, achieving class-level alignment. In this way, highly accurate pseudo-labeled target samples and semantic alignment can be captured simultaneously in the co-training process. Our method achieves a low joint error of the ideal source and target hypothesis, so the expected target error can be upper bounded following Ben-David’s theorem. Empirical evidence demonstrates that the proposed model outperforms the state of the art on standard domain adaptation datasets.
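
The re-weighting idea in this abstract can be illustrated with a small sketch. The abstract does not give the weighting formula, so the form used here, a weight of 1 plus the normalized entropy of the classifier's class prediction, is purely an assumption chosen so that confident (well-aligned) samples get a lower weight and uncertain (poorly aligned) samples get a higher one. All names are hypothetical.

```python
import math


def normalized_entropy(probs):
    """Entropy of a class-probability vector, scaled to the range [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def reweighted_adversarial_loss(class_probs, adv_losses):
    """Mean adversarial loss with per-sample entropy-based weights.

    ASSUMPTION: weight = 1 + normalized entropy. This is an illustrative
    choice, not the formula from the paper.
    """
    weights = [1.0 + normalized_entropy(p) for p in class_probs]
    weighted = [w * loss for w, loss in zip(weights, adv_losses)]
    return sum(weighted) / len(weighted)
```

Under this assumed weighting, a sample predicted as `[1.0, 0.0]` keeps weight 1, while a maximally uncertain sample predicted as `[0.5, 0.5]` receives weight 2.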

Triplet Enhanced AutoEncoder: Model-free Discriminative Network Embedding

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/745 ◽

2019 ◽

Cited By ~ 3

Author(s):

Yao Yang ◽

Haoran Chen ◽

Junming Shao

Keyword(s):

Metric Learning ◽

Expressive Power ◽

Network Embedding ◽

Model Free ◽

The Neural Network ◽

Latent Space ◽

Label Information ◽

Triplet Loss ◽

Low Dimensional ◽

Public Datasets

Deep autoencoders are widely used in dimensionality reduction because of the expressive power of neural networks. They are therefore naturally suited to embedding tasks, which essentially compress high-dimensional information into a low-dimensional latent space. In terms of network representation, autoencoder-based methods such as SDNE and DNGR have achieved results comparable with the state of the art. However, none of them leverages label information, so the resulting embeddings lack discriminative power. In this paper, we present Triplet Enhanced AutoEncoder (TEA), a new deep network embedding approach from the perspective of metric learning. Equipped with the triplet-loss constraint, the proposed approach not only captures the topological structure but also preserves the discriminative information. Moreover, unlike existing discriminative embedding techniques, TEA is independent of any specific classifier; we call this the model-free property. Extensive empirical results on three public datasets (i.e., Cora, Citeseer and BlogCatalog) show that TEA is stable and achieves state-of-the-art performance compared with both supervised and unsupervised network embedding approaches on various percentages of labeled data. The source code can be obtained from https://github.com/yybeta/TEA.

Global Optimal Structured Embedding Learning for Remote Sensing Image Retrieval

Sensors ◽

10.3390/s20010291 ◽

2020 ◽

Vol 20 (1) ◽

pp. 291 ◽

Cited By ~ 3

Author(s):

Pingping Liu ◽

Guixia Gou ◽

Xue Shan ◽

Dan Tao ◽

Qiuzhan Zhou

Keyword(s):

Remote Sensing ◽

Image Retrieval ◽

Metric Learning ◽

Remote Sensing Image ◽

Deep Feature ◽

Deep Embedding ◽

Similarity Structure ◽

Global Optimal ◽

Triplet Loss ◽

Mining Scheme

A rich line of work focuses on designing elegant loss functions under the deep metric learning (DML) paradigm to learn a discriminative embedding space for remote sensing image retrieval (RSIR). Essentially, such an embedding space can efficiently distinguish deep feature descriptors. So far, most existing losses used in RSIR are based on triplets, which suffer from local optimization, slow convergence, and insufficient use of the similarity structure in a mini-batch. In this paper, we present a novel DML method, named global optimal structured loss, to deal with the limitations of triplet loss. Specifically, we use a softmax function rather than a hinge function in our loss to realize global optimization. In addition, we present a novel optimal structured loss, which globally learns an efficient deep embedding space with mined informative sample pairs, keeping the positive pairs within a given limit and pushing the negative ones far away from a given boundary. We have conducted extensive experiments on four public remote sensing datasets, and the results show that the proposed global optimal structured loss with the pairs mining scheme achieves state-of-the-art performance compared with the baselines.
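
The contrast this abstract draws between a hinge function and a softmax function can be illustrated generically. The sketch below is not the global optimal structured loss proposed in the paper; it compares a standard hinge triplet loss on the hardest negative with an assumed log-sum-exp form that takes every negative in the mini-batch into account. The function names and the default margin are hypothetical.

```python
import math


def hinge_triplet(d_pos, d_negs, margin=0.2):
    """Hinge triplet loss using only the hardest (closest) negative."""
    return max(0.0, margin + d_pos - min(d_negs))


def softmax_style_loss(d_pos, d_negs):
    """Smooth alternative: log(1 + sum_j exp(d_pos - d_neg_j)).

    ASSUMPTION: a generic log-sum-exp loss, used only to show how every
    negative in the batch contributes, not the paper's formulation.
    """
    return math.log1p(sum(math.exp(d_pos - d) for d in d_negs))
```

With `d_pos = 0.5` and negatives at `[1.0, 2.0]`, the hinge loss is already zero and gives no further training signal, whereas the smooth loss stays positive and still decreases when the second negative moves farther away.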

Bibliometric-enhanced information retrieval: a novel deep feature engineering approach for algorithm searching from full-text publications

Scientometrics ◽

10.1007/s11192-019-03025-y ◽

2019 ◽

Vol 119 (1) ◽

pp. 257-277 ◽

Cited By ~ 8

Author(s):

Iqra Safder ◽

Saeed-Ul Hassan

Keyword(s):

Information Retrieval ◽

Full Text ◽

Feature Engineering ◽

Engineering Approach ◽

Deep Feature

Efficient Deep Feature Calibration for Cross-Modal Joint Embedding Learning

10.1145/3462244.3479892 ◽

2021 ◽

Author(s):

Zhongwei Xie ◽

Ling Liu ◽

Lin Li ◽

Luo Zhong

Keyword(s):

Deep Feature ◽

Joint Embedding

Deep Feature Engineering for Noise Robust Spoofing Detection

IEEE/ACM Transactions on Audio Speech and Language Processing ◽

10.1109/taslp.2017.2732162 ◽

2017 ◽

Vol 25 (10) ◽

pp. 1942-1955 ◽

Cited By ~ 10

Author(s):

Yanmin Qian ◽

Nanxin Chen ◽

Heinrich Dinkel ◽

Zhizheng Wu

Keyword(s):

Feature Engineering ◽

Deep Feature ◽

Spoofing Detection ◽

Noise Robust
