How do we improve data curation at scale? Today's models are only as good as the data they're trained on, as suggested by Mistral 7B, which closely follows the architecture of Llama 7B but is trained on proprietary data and outperforms it. In this literature review, I cover how recent algorithms in Data-Centric AI, especially data pruning, can cut costs and boost performance by selecting only high-quality training data.
Data Pruning at Scale