2023-05-15 AI Insights: Multimodal Mixture of Experts & Datasets and No More Tokenization!
Google Research does its magic again! And Meta follows up with a fundamental contribution of its own.
Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
The Problem At Hand
In multimodal perception, inputs from different modalities come with different structures, shapes, and objectives, which usually demands intricate architectural design and hyperparameter tuning. The problem is compounded at scale: training large models in a distributed setting typically requires static input signatures and loss objectives, which makes handling all this heterogeneity inefficient.
Typically, this is addressed with mixed batching: a single batch layout is constructed that accommodates every possible input, and inapplicable modalities are padded and masked out. This is computationally wasteful, and as more tasks are added, the per-task batch size naturally shrinks, creating further difficulties.
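To make that overhead concrete, here is a toy back-of-the-envelope sketch (my own illustration, not from the paper) in which every task is forced into one static input signature; the token budgets in MAX_TOKENS are made-up numbers:

```python
# Hypothetical illustration of "mixed batching": every task is forced into the
# same static input signature, so unused modalities are padded and masked.
MAX_TOKENS = {"video": 4096, "image": 1024, "audio": 512, "text": 128}

def mixed_batch_example(task_modalities):
    """Build one padded example; returns (total tokens, useful tokens)."""
    total, useful = 0, 0
    for modality, length in MAX_TOKENS.items():
        total += length                      # always allocated and processed
        if modality in task_modalities:
            useful += length                 # only these carry real signal
    return total, useful

# A text-only task still pays for the video, image, and audio padding.
total, useful = mixed_batch_example({"text"})
print(f"useful fraction: {useful / total:.1%}")  # ~2% of the compute is useful
```

Even this crude accounting shows how a unimodal task ends up paying for modalities it never uses.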
A Novel Approach: Alternating Gradient Descent
Akbari and his team propose a solution: Alternating Gradient Descent (AGD). With AGD, each gradient descent step optimizes a different loss objective on a different combination of input modalities, using only the model weights that objective involves. This lets the model train at a fraction of the computational cost and memory required by comparable large-scale perception models, a significant step forward for the field.
Unlike mixed batching, AGD does not construct every possible input, nor does it pad and mask the inapplicable ones. Instead, each task is handled on its own terms, with the computation graph dynamically recompiled as needed. The result is a training process that is more memory-efficient and wastes far less computation.
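For intuition, here is a minimal sketch of what an AGD-style loop might look like (a simplification in PyTorch, with made-up task names and stand-in losses, not the authors' implementation):

```python
# Sketch of an AGD-style loop: each step samples one task and optimizes only
# that task's objective -- no padding or masking of unused modalities.
import random
import torch
import torch.nn as nn

shared_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)
heads = nn.ModuleDict({
    "image_cls": nn.Linear(256, 1000),        # hypothetical classification head
    "text_contrastive": nn.Linear(256, 256),  # hypothetical contrastive head
})
params = list(shared_encoder.parameters()) + list(heads.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4)

TASKS = ["image_cls", "text_contrastive"]

def sample_batch(task):
    # Placeholder: in practice this pulls tokens from the task's own dataset,
    # with its own sequence length and modalities.
    seq_len = 196 if task == "image_cls" else 64
    return torch.randn(8, seq_len, 256)

for step in range(100):
    task = random.choice(TASKS)               # alternate objectives across steps
    tokens = sample_batch(task)
    features = shared_encoder(tokens).mean(dim=1)
    outputs = heads[task](features)
    # Each task brings its own loss; both here are stand-ins.
    if task == "image_cls":
        loss = nn.functional.cross_entropy(outputs, torch.randint(0, 1000, (8,)))
    else:
        loss = (1 - nn.functional.cosine_similarity(outputs, torch.randn(8, 256))).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key property is that every step only builds the inputs and loss for the task it sampled, so adding another task just adds another branch rather than inflating every batch.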
Integrating Multimodal Perception
By implementing AGD, the researchers integrate numerous heterogeneous unimodal and multimodal tasks into a single training pipeline with one shared model. The approach can leverage as many existing datasets as possible, train on any combination of tasks or loss functions, and does not slow down as new datasets, tasks, or loss functions are added.
The Model Architecture
The proposed architecture consists of an embedder, a Mixture-of-Experts (MoE) encoder, and multiple heads. At each optimization step, a dataset with its own modalities, resolutions, and objectives is randomly sampled and fed into the model. The embedder projects each modality into a shared token representation space; the MoE encoder processes those tokens, with expert-choice routing allowing any token to be routed to any expert regardless of modality; and the heads apply projections to the encoded features to compute the respective objective losses.
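To illustrate the routing side, here is a stripped-down sketch of expert-choice routing (my own toy version with arbitrary dimensions, not the IMP code), where experts pick their top-scoring tokens rather than tokens picking experts:

```python
# Minimal expert-choice routing sketch (illustrative only). Experts choose
# tokens, so the load per expert is fixed and tokens from any modality can
# land in any expert.
import torch
import torch.nn as nn

num_tokens, d_model, num_experts, capacity = 32, 64, 4, 8

tokens = torch.randn(num_tokens, d_model)          # mixed-modality token batch
router = nn.Linear(d_model, num_experts)           # routing scores per expert
experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))

scores = router(tokens).softmax(dim=-1)            # (tokens, experts)
output = torch.zeros_like(tokens)

for e, expert in enumerate(experts):
    # Each expert selects its own top-`capacity` tokens by routing score.
    weights, chosen = scores[:, e].topk(capacity)
    output[chosen] += weights.unsqueeze(-1) * expert(tokens[chosen])
```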
The Significance
The findings from this research offer a new perspective on multimodal perception training. The study demonstrates that combining diverse tasks across multiple modalities can yield better convergence than single-task training. Moreover, the proposed method provides a design that allows seamless integration of any number of tasks and datasets without significant memory overhead.
Future Work
Despite the success of the approach, the authors acknowledge room for further exploration. For instance, the model excels at zero-shot video understanding but performs less well on image and audio understanding. The question of how best to combine objectives during training also remains open.
MegaByte: Predicting Million-Byte Sequences with Multiscale Transformers
The Problem Space
Autoregressive transformers have traditionally struggled with long sequences such as high-resolution images, podcasts, code, or books, mainly because of the quadratic cost of self-attention and the cost of running large feedforward (FF) networks at every position. Tokenization adds its own problems: it complicates preprocessing, multimodal modeling, and domain adaptation, and it hides useful structure from the model, which means that state-of-the-art models aren't truly end-to-end.
Past Approaches
Efforts to build more efficient encoder models, such as Vision Transformers (ViT), often rely on patchifying operations. However, these methods can't be applied directly to decoder-only models without leaking future bytes within the same patch. Efficient decoder models, on the other hand, usually rely on chunking sequences into smaller blocks, linear alternatives to attention, or sparse approximations of attention. These approaches, while effective at reducing cost, still don't amount to a solution for lossless byte-level modeling of long sequences.
Enter MEGABYTE
MEGABYTE takes a unique approach to sequence modeling that circumvents these challenges. Its architecture comprises a Patch Embedder, a Global Module, and a Local Module. The Patch Embedder encodes a patch by losslessly concatenating the embeddings of each byte. The Global Module is a large autoregressive transformer that inputs and outputs patch representations, and the Local Module is a small autoregressive model that predicts bytes within a patch.
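To make the decomposition concrete, here is a shape-level sketch of the three components (with invented dimensions, and with the causal masking and offsetting details simplified away; this is not the released model):

```python
# Shape-level sketch of the MEGABYTE-style decomposition. Causal masking and
# the byte/patch offsetting needed for proper autoregression are omitted.
import torch
import torch.nn as nn

V, P, D_G, D_L = 256, 8, 512, 128      # byte vocab, patch size, global/local dims
T = 64                                  # sequence length in bytes (T % P == 0)

byte_embed = nn.Embedding(V, D_L)
patch_proj = nn.Linear(P * D_L, D_G)    # patch embedder: concat byte embeddings
global_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_G, nhead=8, batch_first=True), num_layers=2)
local_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_L, nhead=4, batch_first=True), num_layers=2)
to_patch_ctx = nn.Linear(D_G, P * D_L)  # broadcast global output back to bytes
lm_head = nn.Linear(D_L, V)

bytes_in = torch.randint(0, V, (1, T))                  # (batch, bytes)
e = byte_embed(bytes_in)                                # (1, T, D_L)
patches = patch_proj(e.reshape(1, T // P, P * D_L))     # (1, T/P, D_G)
g = global_model(patches)                               # patch-level context
ctx = to_patch_ctx(g).reshape(1, T, D_L)                # per-byte context
h = local_model((e + ctx).reshape(T // P, P, D_L))      # local model per patch
logits = lm_head(h).reshape(1, T, V)                    # next-byte predictions
```

The large global model only ever sees T/P patch vectors, while the small local model only ever sees P bytes at a time.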
One of the most exciting aspects of MEGABYTE is its efficiency: self-attention becomes sub-quadratic, feedforward layers are applied per patch rather than per position, and decoding admits more parallelism, which together make the model significantly cheaper to train and run. MEGABYTE can also process raw audio directly instead of relying on spectrograms, a significant breakthrough in the field.
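To see where "sub-quadratic" comes from, here is some back-of-the-envelope arithmetic (my own numbers, with a hypothetical patch size): full self-attention over T bytes costs on the order of T² pairwise scores, while a patch-based split costs roughly (T/P)² at the patch level plus (T/P)·P² inside the patches.

```python
# Back-of-the-envelope attention-cost comparison (illustrative arithmetic only).
T = 1_000_000          # sequence length in bytes
P = 1_000              # patch size (hypothetical choice)

full_attention = T ** 2                        # standard decoder over raw bytes
patch_based = (T // P) ** 2 + (T // P) * P ** 2  # global over patches + local per patch

print(f"full attention : {full_attention:.2e} pairwise scores")  # 1.00e+12
print(f"patch-based    : {patch_based:.2e} pairwise scores")     # ~1.00e+09
print(f"ratio          : {full_attention / patch_based:.0f}x")   # ~999x
```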
Benchmarking MEGABYTE
In testing, MEGABYTE has shown impressive performance. It competes effectively with subword models on long context language modeling and achieves state-of-the-art density estimation on ImageNet. These results suggest that MEGABYTE could have a profound impact on our ability to work with long sequence data.
Looking Ahead
The authors have identified potential future work in scaling MEGABYTE to larger models and datasets. This suggests that there's still room for further improvement and adaptation, which is a thrilling prospect.
Personal Thoughts
The work presented in this paper represents a significant step forward in our ability to process and understand long sequence data. MEGABYTE not only promises to make sequence modeling more efficient but also opens the door to a more nuanced understanding of multimodal data. It's a potent reminder of the remarkable progress we're witnessing in the field of AI and Deep Learning.