Evaluating vector space models with canonical correlation analysis

Abstract: Vector space models are used in language processing applications for calculating semantic similarities of words or documents. The vector spaces are generated with feature extraction methods for text data. However, evaluation of the feature extraction methods may be difficult. Indirect evaluation in an application is often time-consuming and the results may not generalize to other applications, whereas direct evaluations that measure the amount of captured semantic information usually require human evaluators or annotated data sets. We propose a novel direct evaluation method based on canonical correlation analysis (CCA), the classical method for finding linear relationships between two data sets. In our setting, the two sets are parallel text documents in two languages. A good feature extraction method should provide representations that reflect the semantic content of the documents. Assuming that the underlying semantic content is independent of the language, we can study which feature extraction methods capture the content best by measuring the dependence between the representations of a document and its translation. In the case of CCA, the applied measure of dependence is correlation. The evaluation method is based on unsupervised learning; it is language- and domain-independent, and it requires no resources other than a parallel corpus. In this paper, we demonstrate the evaluation method on a sentence-aligned parallel corpus. The method is validated by showing that the results obtained with bag-of-words representations are intuitive and agree well with previous findings. Moreover, we compare the proposed evaluation method with indirect evaluation methods in simple sentence matching tasks and with a quantitative manual evaluation of word translations. The results of the proposed method correlate well with the results of the indirect and manual evaluations.
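As a rough illustration of the idea described in the abstract above, the following sketch computes the canonical correlations between two feature matrices whose rows are aligned (for example, a sentence and its translation). The function name `canonical_correlations`, the small ridge term `reg`, and the synthetic data are assumptions made for illustration only; they are not taken from the paper, whose actual evaluation pipeline may differ.

```python
import numpy as np


def canonical_correlations(X, Y, reg=1e-8):
    """Canonical correlations between views X (n x p) and Y (n x q).

    Rows of X and Y are assumed to be aligned samples. `reg` is a small
    ridge term (an assumption of this sketch) that keeps the covariance
    matrices positive definite. Values are returned in descending order.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Center each view.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Sample covariance and cross-covariance matrices.
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    # Whiten with Cholesky factors: Cxx = Lx Lx^T, Cyy = Ly Ly^T.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)

    # T = Lx^{-1} Cxy Ly^{-T}; its singular values are the canonical correlations.
    T = np.linalg.solve(Lx, Cxy)
    T = np.linalg.solve(Ly, T.T).T
    s = np.linalg.svd(T, compute_uv=False)
    return np.clip(s, 0.0, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical setup: a shared latent "content" observed through two
    # different noisy linear maps, standing in for two languages.
    Z = rng.normal(size=(500, 4))
    X = Z @ rng.normal(size=(4, 6)) + 0.1 * rng.normal(size=(500, 6))
    Y = Z @ rng.normal(size=(4, 5)) + 0.1 * rng.normal(size=(500, 5))
    print(canonical_correlations(X, Y))
```

Under this reading, a feature extraction method whose representations yield higher canonical correlations across the two languages would be judged to capture more of the shared content. How the correlations are aggregated into a single score is left open here.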


Canonical Correlation Analysis for Multiview Semisupervised Feature Extraction

Artificial Intelligence and Soft Computing - Lecture Notes in Computer Science, 2010, pp. 430-436
DOI: 10.1007/978-3-642-13208-7_54
Cited By ~ 9
Author(s): Olcay Kursun, Ethem Alpaydin
Keyword(s): Feature Extraction, Correlation Analysis, Canonical Correlation Analysis, Canonical Correlation


BSmCCA: A block sparse multiple-set canonical correlation analysis algorithm for multi-subject fMRI data sets

2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017
DOI: 10.1109/icassp.2017.7953373
Cited By ~ 3
Author(s): Abd-Krim Seghouane, Asif Iqbal, Nandakishor Desai
Keyword(s): Correlation Analysis, Canonical Correlation Analysis, Canonical Correlation, fMRI Data, Data Sets, Analysis Algorithm


Feature extraction based on canonical correlation analysis for appearance parameter estimation

2001
DOI: 10.1117/12.420919
Author(s): Michael Reiter, Thomas Melzer
Keyword(s): Feature Extraction, Parameter Estimation, Correlation Analysis, Canonical Correlation Analysis, Canonical Correlation


A unified multiset canonical correlation analysis framework based on graph embedding for multiple feature extraction

Neurocomputing, 2015, Vol 148, pp. 397-408
DOI: 10.1016/j.neucom.2014.06.015
Cited By ~ 30
Author(s): XiaoBo Shen, QuanSen Sun, YunHao Yuan
Keyword(s): Feature Extraction, Correlation Analysis, Canonical Correlation Analysis, Canonical Correlation, Graph Embedding, Analysis Framework, Multiple Feature


Constrained ERM Learning of Canonical Correlation Analysis: A Least Squares Perspective

Neural Computation, 2017, Vol 29 (10), pp. 2825-2859
DOI: 10.1162/neco_a_00996
Cited By ~ 1
Author(s): Jia Cai, Hongwei Sun
Keyword(s): Correlation Analysis, Least Squares, Canonical Correlation Analysis, Canonical Correlation, Data Sets, Risk Minimization, Real World Data, Empirical Risk, Statistical Consistency, Kaczmarz Method

Canonical correlation analysis (CCA) is a useful tool for detecting the latent relationship between two sets of multivariate variables. In theoretical analyses of CCA, a regularization technique is used to investigate consistency. This letter addresses the consistency property of CCA from a least squares view. We construct a constrained empirical risk minimization framework for CCA and apply a two-stage randomized Kaczmarz method to solve it. In the first stage, we remove the noise, and in the second stage, we compute the canonical weight vectors. The consistency of the approach is analyzed rigorously, and the statistical consistency analysis is extended to the kernel version of CCA. Moreover, experiments on both synthetic and real-world data sets demonstrate the effectiveness and efficiency of the proposed algorithms.
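For readers unfamiliar with the Kaczmarz method named in the abstract above, the sketch below shows the basic randomized Kaczmarz iteration for a consistent linear system Ax = b, with rows sampled in proportion to their squared norms. It is not the two-stage algorithm of the paper: the noise-removal stage and the CCA constraints are omitted, and the function name `randomized_kaczmarz` and its parameters are assumptions for illustration.

```python
import numpy as np


def randomized_kaczmarz(A, b, n_iter=5000, seed=0):
    """Approximate a solution of a consistent system A x = b.

    Each step projects the current iterate onto the hyperplane defined
    by one randomly chosen row, sampled with probability proportional
    to that row's squared Euclidean norm. Rows are assumed nonzero.
    """
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)

    # Squared norm of every row, used for sampling and for the projection.
    row_norms_sq = np.einsum("ij,ij->i", A, A)
    probs = row_norms_sq / row_norms_sq.sum()

    x = np.zeros(A.shape[1])
    rows = rng.choice(A.shape[0], size=n_iter, p=probs)
    for i in rows:
        a = A[i]
        # Orthogonal projection of x onto {z : a . z = b[i]}.
        x += (b[i] - a @ x) / row_norms_sq[i] * a
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.normal(size=(200, 5))
    x_true = rng.normal(size=5)
    x_hat = randomized_kaczmarz(A, A @ x_true)
    print(np.linalg.norm(x_hat - x_true))
```

Each update touches a single row of A, which is why Kaczmarz-type methods are attractive for large least squares problems. For noisy, inconsistent systems the plain iteration only reaches a neighborhood of the least squares solution, which is presumably the motivation for the separate noise-removal stage mentioned in the abstract.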
