Computing Reviews

Cross-modality feature learning via convolutional autoencoder
Liu X., Wang M., Zha Z., Hong R. ACM Transactions on Multimedia Computing, Communications, and Applications 15(1s): 1-20, 2019. Type: Article
Date Reviewed: 12/10/20

This paper contributes to a hot research area that is the focus of many scientists, developers, and large corporations. The reason for the interest is that many important systems, for instance, those for social media or data collection, produce large-scale multimedia datasets. Analysis based on so-called “handcrafted features” becomes unsuitable for many non-numeric data types, such as text or pictures.

For many non-numeric data types, interesting features can be learned from the data itself. Different kinds of cross-modal feature learning are used in the analysis of heterogeneous datasets and data streams. Deep learning methods, among others, have been developed both for autoencoding a single data type (aiming at feature learning) and for the coordinated analysis of the resulting component features of heterogeneous data.

For this purpose, the authors develop a sophisticated convolutional neural network (CNN), called a multimodal convolutional autoencoder (MUCAE), extending some existing architectures. They evaluate the method by learning representative features from two modalities: pictures represented by image pixels, and text represented by characters. To exploit the correlation between the hidden representations of the two modalities, the unified framework integrates an autoencoder with a dedicated objective function: the system jointly minimizes the representation learning error of each modality and the correlation divergence between the modalities.
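As a rough illustration (my own sketch, not the authors' exact formulation), such a joint objective can be written as

\[
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{rec}}^{(\mathrm{img})} \;+\; \mathcal{L}_{\mathrm{rec}}^{(\mathrm{txt})} \;+\; \lambda\, D\bigl(h^{(\mathrm{img})}, h^{(\mathrm{txt})}\bigr),
\]

where the first two terms are the reconstruction errors of the image and text autoencoders, \(h^{(\mathrm{img})}\) and \(h^{(\mathrm{txt})}\) are their hidden representations, \(D\) measures the divergence (lack of correlation) between the two representations, and \(\lambda\) balances reconstruction quality against cross-modal agreement.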

The authors define the problem and describe the solution at an abstract level, presenting the mathematical reasoning and the 11 layers of their CNN. There is no reference to the implementation environment; one can only presume that some of the powerful and popular tools and packages were used.

Some related work on multimodal, supervised, and unsupervised deep feature learning is enumerated. The paper contains precise figures on the performance of the implementation on two datasets: MIRFlickr and a subset of NUS-WIDE. These results are compared to those of five earlier systems developed over the past decade; according to them, MUCAE outperforms the others by two to ten percent for joint character-picture data analysis. The main parameters used in the algorithm are discussed. How the method behaves as the size of the input dataset changes is not investigated.

The paper’s conclusion, which makes the essence of the abstract a bit more concrete, summarizes the approach, the method, the experiments, and the results. Neither intended (or further) developments of the method nor future research directions are discussed.

I recommend the paper only for active specialists in the area.

Reviewer: K. Balogh. Review #: CR147134 (2104-0085)
