2023-5-11 AI Insights: Decoding Videos from Visual Cortex, Six-modality joint embedding space, and more!
Learnable Latent Embeddings for Joint Behavioral and Neural Analysis
This research presents a new method for dimensionality reduction in time series data, particularly for the joint embedding of behavioral and neural data.
Problem Statement
The primary biological motivation of this paper is to find the neural origins of behavior. The authors note that existing tools for representation learning are either linear or fail to yield consistent embeddings across animals or repeated runs. Complex behavior such as 3D forelimb reaching can be reduced to only 8-12 dimensions, and while PCA (principal component analysis) is often used to increase interpretability, this comes at the cost of performance. Nonlinear methods like UMAP and t-SNE perform well but do not explicitly use time information and are not as directly interpretable as PCA.
In figure (d) above, the authors plot the first three latents produced by each method; in figure (e), correlation matrices show the R^2 obtained after fitting a linear model between the behavior-aligned embeddings of two animals, one serving as the target and the other as the source. These results show that CEBRA-Behavior performs best.
Related Works
The authors discuss label-guided VAEs (variational autoencoders) as a potential solution for improving interpretability by using behavioral variables. However, VAEs make restrictive assumptions about the underlying statistics of the data and do not guarantee consistent neural embeddings across animals.
Key Contributions
The paper introduces CEBRA, a new self-supervised learning method for obtaining interpretable, Consistent EmBeddings of high-dimensional Recordings using Auxiliary variables. CEBRA's main features include:
Contrastive learning: A nonlinear ICA (independent component analysis) instance based on contrastive learning.
The ability to use both discrete and continuous variables to shape the distribution of positive and negative pairs.
A nonlinear encoder (CNN) trained with a novel contrastive objective to form a low-dimensional embedding space.
InfoNCE minimization: an encoder network f maps neural activity to an embedding space of a chosen dimension and is trained by minimizing the InfoNCE loss, which pulls positive pairs together and pushes negative pairs apart (see the sketch after this list).
Task-invariant features: CEBRA groups information into task-relevant and task-irrelevant variables for use in various settings to infer latent embeddings without explicitly modeling the data generating process.
Hypothesis- and discovery-driven science: CEBRA supports both modes of analysis and is invariant to constant distribution shifts that appear between recording days or sessions.
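To make the positive/negative-pair shaping concrete, here is a minimal PyTorch sketch of a label-guided InfoNCE objective in the spirit of the method described above. It is not CEBRA's actual implementation: the encoder architecture, the temperature, and the rule that picks positives by behavior-label proximity are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# neural:   (N, n_channels) neural activity per time bin
# behavior: (N, n_beh) continuous auxiliary variables (e.g., position, velocity)

class Encoder(nn.Module):
    """Nonlinear encoder f mapping neural activity to a low-dimensional embedding."""
    def __init__(self, n_channels: int, dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_channels, 128), nn.GELU(),
            nn.Linear(128, 128), nn.GELU(),
            nn.Linear(128, dim),
        )

    def forward(self, x):
        # Normalize so that dot products are cosine similarities on the hypersphere.
        return F.normalize(self.net(x), dim=-1)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Pull each anchor toward its positive and away from the negatives."""
    pos = (anchor * positive).sum(-1, keepdim=True) / temperature   # (B, 1)
    neg = anchor @ negatives.T / temperature                        # (B, K), here K = B
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(len(anchor), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)                          # positive sits at index 0

def sample_pairs(neural, behavior, batch=256, eps=0.05):
    """Positives are time bins whose behavior labels lie close to the anchor's;
    negatives are random time bins. This is how the auxiliary variable shapes pairs."""
    idx = torch.randint(len(neural), (batch,))
    dists = torch.cdist(behavior[idx], behavior)                    # (B, N) label distances
    pos_idx = (dists < eps).float().multinomial(1).squeeze(1)       # one nearby bin per anchor
    neg_idx = torch.randint(len(neural), (batch,))
    return neural[idx], neural[pos_idx], neural[neg_idx]
```

A training step encodes the anchor, positive, and negative samples with the same encoder and minimizes info_nce(f(x), f(x_pos), f(x_neg)); switching from continuous to discrete labels only changes how sample_pairs defines closeness.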
One notable achievement of CEBRA is its ability to capture consistent embeddings across recording modalities. It was tested with movie features as "behavior" labels: high-level visual features were extracted from the movie frame by frame using DINO, a self-supervised ViT model (a rough sketch of this step follows below). Neuropixels or calcium-imaging data were then used to generate latent spaces from varying numbers of V1 neurons. The results showed that CEBRA-Time captured the ten repeats of the movie, indicating a highly consistent latent space independent of the recording method.
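For readers who want to reproduce the "movie features as behavior" idea, below is a rough sketch of extracting frame-level DINO features. The torch.hub entry point is the one published in the DINO repository, but the preprocessing pipeline and the way the features are fed downstream are assumptions on my part, not the paper's exact recipe.

```python
import torch
import torchvision.transforms as T

# Self-supervised DINO ViT-S/16 backbone from the public repo's torch.hub entry point.
model = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
model.eval()

# Standard ImageNet-style preprocessing (assumed; check the repo for the exact transform).
preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(frames):
    """frames: list of PIL images, one per movie frame -> (n_frames, 384) feature matrix."""
    batch = torch.stack([preprocess(f) for f in frames])
    return model(batch)  # the forward pass returns one [CLS] embedding per frame

# The resulting per-frame feature vectors serve as the continuous auxiliary
# ("behavior") variable that is paired with the neural recordings.
```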
Personal Thoughts and Future Works
I believe the significance of this paper lies in its innovative approach to analyzing time-series data with a novel dimensionality-reduction method. CEBRA's flexibility in working with both discrete and continuous variables, as well as its task-invariant features, makes it a promising tool for various applications, including those beyond the realm of biological research. For instance, applying CEBRA to quantitative finance to analyze time-series data paired with trading behavior could lead to interesting new discoveries.
ImageBind: One Embedding Space to Bind Them All
Problem Statement
Collecting tuples of corresponding paired data across all modalities at once is infeasible, which limits the resulting embeddings; for instance, video-audio embeddings cannot be used directly for image-text tasks. The major challenge in learning a true joint embedding is therefore the absence of large quantities of multimodal data in which all modalities are present together.
Key Contributions
ImageBind learns a single shared representation space by leveraging (Image, X) pairs, where images are the common link between all of the other modalities X. Jointly aligning these modalities leads to emergent understanding and outperforms specialist models. The modalities aligned include web image-text, depth-sensor data, web videos, thermal data, and egocentric videos. The architecture consists of a modality-specific encoder followed by a modality-specific linear projection head that produces a fixed-size d-dimensional embedding, which is normalized and used in an InfoNCE loss; a rough sketch of this alignment step is shown below.
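Below is a minimal sketch of what such an (image, X) alignment step could look like, assuming audio as the paired modality X; the encoders, embedding size, and temperature are placeholders, and this is not the actual ImageBind code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Modality-specific linear projection into the shared d-dimensional space."""
    def __init__(self, in_dim: int, d: int = 1024):
        super().__init__()
        self.proj = nn.Linear(in_dim, d, bias=False)

    def forward(self, features):
        return F.normalize(self.proj(features), dim=-1)   # normalized before the loss

def symmetric_info_nce(z_img, z_x, temperature=0.07):
    """InfoNCE over a batch of (image, X) pairs: the matching pair is the positive,
    every other pairing within the batch acts as a negative, in both directions."""
    logits = z_img @ z_x.T / temperature                   # (B, B) cosine similarities
    targets = torch.arange(len(z_img), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# One training step in shapes only (image_encoder / audio_encoder are placeholders):
# img_emb = img_head(image_encoder(images))   # (B, d)
# x_emb   = x_head(audio_encoder(audio))      # (B, d)
# loss    = symmetric_info_nce(img_emb, x_emb)
```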
One of the most interesting aspects of ImageBind is its multimodal embedding-space arithmetic, which allows modalities to be fused as inputs for emergent compositionality (a minimal sketch follows). Without retraining, existing models that use CLIP embeddings can be swapped to use ImageBind embeddings from other modalities.
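Here is a minimal sketch of that embedding arithmetic, assuming all embeddings are already L2-normalized vectors in the shared space; the mixing weight and the retrieval gallery are purely illustrative.

```python
import torch
import torch.nn.functional as F

def combine(emb_a, emb_b, alpha=0.5):
    """Embedding-space arithmetic: mix two normalized modality embeddings
    (e.g., an image and a sound) and renormalize back onto the unit sphere."""
    return F.normalize(alpha * emb_a + (1.0 - alpha) * emb_b, dim=-1)

def retrieve(query, gallery, k=5):
    """Nearest-neighbour retrieval by cosine similarity against a gallery of
    embeddings from any modality that shares the same space."""
    sims = query @ gallery.T          # (n_queries, n_gallery)
    return sims.topk(k, dim=-1).indices
```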
Future Works
There are several areas for future research with ImageBind. The image alignment loss could be enriched with other alignment data, such as pairing audio with IMU. Since the embeddings are trained without a specific downstream task, they lag behind the performance of specialist models. Additionally, the joint embeddings are limited to the concepts present in the training datasets; the thermal data covers only outdoor street scenes, and the depth data covers only indoor scenes.
Personal Thoughts
This paper is significant for early career researchers in AI and deep learning because it introduces a new approach to multimodal data representation that leverages the image as a common link between modalities. It demonstrates the potential for emergent understanding and compositional reasoning in a shared embedding space. Moreover, the paper opens the door for further research in enhancing and expanding the capabilities of multimodal embedding spaces, offering exciting opportunities for future investigation.
Multimodal Understanding Through Correlation Maximization and Minimization
Problem Statement
The authors want to understand the intrinsic nature of multimodal data beyond any particular downstream task. Their goal is to learn common and individual structures of multimodal data in an unsupervised manner, which can potentially lead to better models for various tasks.
Related Works
Canonical Correlation Analysis (CCA) is a technique used to find relationships between two sets of variables (commonalities). The authors build upon this concept and use a nonlinear variant to handle more complicated data modalities, such as images and audio. They reformulate CCA to help models better avoid local maxima during optimization.
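For readers unfamiliar with CCA, the toy example below uses scikit-learn's linear CCA on synthetic two-view data; the paper's deep, reformulated variant replaces these linear maps with neural networks, but the objective, maximally correlated projections of the two views, is the same in spirit.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Two views of the same 500 samples that share a 2-dimensional latent structure.
rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 2))
X = shared @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(500, 10))
Y = shared @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(500, 8))

cca = CCA(n_components=2)
X_c, Y_c = cca.fit_transform(X, Y)    # maximally correlated linear projections
corrs = [np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1] for i in range(2)]
print(corrs)                          # close to 1 for the shared components
```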
Key Contributions
The authors propose MUC2M, a method that works with large pretrained networks to learn and visualize:
Latent representation for common structure: using a reformulated deep canonical correlation analysis (deep CCA), the authors find pairs of common latent representations of the original modalities that are maximally correlated (a compressed sketch appears after this list). Orthogonality constraints ensure the representational efficiency of the learned common latent space and avoid trivial solutions.
Latent representation for individual structure: to ensure that these representations capture information distinct from the common latents, the authors constrain the sample correlation between the individual and common representations to be zero for each view. They also use contractive regularization, which penalizes the encoder's sensitivity to small input perturbations, to prevent erroneous learning.
They also create a score function for common and individual structures, enabling the computation of gradient maps that reveal which parts of the input contributed most to the learned common structure.
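The snippet below compresses both ideas, correlation maximization between the two views' common latents and a gradient map that attributes a commonality score back to the input, into a small PyTorch sketch. The encoders, the score definition, and the omission of the orthogonality and contractive terms are simplifications of my own, not the paper's formulation.

```python
import torch
import torch.nn as nn

def batch_correlation(za, zb, eps=1e-8):
    """Approximate per-dimension Pearson correlation over the batch, averaged across dims."""
    za = (za - za.mean(0)) / (za.std(0) + eps)
    zb = (zb - zb.mean(0)) / (zb.std(0) + eps)
    return (za * zb).mean(0).mean()

# Placeholder modality-specific encoders producing the "common" latents.
enc_a = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 8))
enc_b = nn.Sequential(nn.Linear(48, 32), nn.ReLU(), nn.Linear(32, 8))

# Correlation maximization: minimize the negative correlation of the two common latents.
xa, xb = torch.randn(256, 64), torch.randn(256, 48)
loss = -batch_correlation(enc_a(xa), enc_b(xb))
loss.backward()   # optimizer step omitted

# Gradient map: attribute a simple commonality score for one paired sample back to its inputs.
xa_one = torch.randn(1, 64, requires_grad=True)
xb_one = torch.randn(1, 48)
score = (enc_a(xa_one) * enc_b(xb_one)).sum()
grad_map = torch.autograd.grad(score, xa_one)[0]   # which input coordinates mattered most
```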
What I find particularly useful is that this method can also be used to separate common and individual features when both modality inputs are images. I can foresee this being used in applications such as targeted image editing.
Future Works
The model is currently trained in two stages (common then individual) as it is not incentivized to learn both types of representations simultaneously. The authors suggest that future works should explore regularizing the representational capacities of both individual and common representations and weight the regularizations in a data-driven manner, thus eliminating the need for manual tuning when applying the model to new datasets.
Personal Thoughts
This paper is significant for early career researchers in AI and DL as it provides insights into understanding multimodal data's intrinsic nature. The MUC2M method has the potential to improve multimodal understanding in unsupervised settings and could lead to better-performing models across various tasks. The focus on visualizing and understanding the common and individual structures in multimodal data can be particularly valuable for researchers exploring novel approaches to representation learning and model interpretability.