Data Pruning at Scale

Jan 5, 2024

How do we improve data curation at scale? Today's models are only as good as the data they're trained on: Mistral 7B has largely the same architecture as Llama 7B but is trained on proprietary data. In this literature review, I cover how recent algorithms in Data-Centric AI, especially data pruning, allow for cost savings and performance boosts by selecting only high-quality training data.



Haoli’s AI Insights
