Large language models (LLMs) are trained on vast amounts of human-written data, but data providers often remain uncredited. In response, data valuation (or data attribution), which quantifies the contribution or value of each data point to the model output, has been discussed as a potential solution. Nevertheless, applying existing data valuation methods to recent LLMs and their vast training datasets has been largely limited by prohibitive compute and memory costs. In this work, we focus on influence functions, a popular gradient-based data valuation method, and significantly improve their scalability with an efficient gradient projection strategy called LoGra that leverages the gradient structure in backpropagation. We then provide a theoretical motivation for gradient projection approaches to influence functions to promote trust in the data valuation process. Lastly, we lower the barrier to implementing data valuation systems by introducing LogIX, a software package that can transform existing training code into data valuation code with minimal effort. In our data valuation experiments, LoGra achieves competitive accuracy against more expensive baselines while showing up to a 6,500x improvement in throughput and a 5x reduction in GPU memory usage when applied to Llama3-8B-Instruct and a 1B-token dataset.