Large language models (LLMs) are trained on vast amounts of human-written data, but data providers often remain uncredited. In response to this issue, data valuation (or data attribution), which quantifies the contribution or value of each training example to the model's output, has been discussed as a potential solution. Nevertheless, applying existing data valuation methods to recent LLMs and their vast training datasets has been largely limited by prohibitive compute and memory costs. In this work, we focus on influence functions, a popular gradient-based data valuation method, and significantly improve its scalability with an efficient gradient projection strategy, called LoGra, that leverages the gradient structure in backpropagation. We then provide a theoretical motivation for gradient projection approaches to influence functions, to promote trust in the data valuation process. Lastly, we lower the barrier to implementing data valuation systems by introducing LogIX, a software package that can transform existing training code into data valuation code with minimal effort. In our data valuation experiments, LoGra achieves accuracy competitive with far more expensive baselines while delivering up to 6,500x higher throughput and 5x lower GPU memory usage when applied to Llama3-8B-Instruct and a 1B-token dataset.
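To make the mechanism concrete, a minimal sketch follows, using standard influence-function notation; the projector symbols $P$, $P_a$, and $P_e$ are illustrative conventions and not necessarily the paper's exact notation. The influence of a training example $z$ on a query $z_q$ is

\[
\mathcal{I}(z, z_q) \;=\; \nabla_\theta \ell(z_q)^{\top} H^{-1} \nabla_\theta \ell(z),
\]

where $H$ is a (damped, typically Gauss-Newton approximated) Hessian of the training loss. Gradient projection replaces each $d$-dimensional gradient with a $k$-dimensional sketch ($k \ll d$) via a projection matrix $P \in \mathbb{R}^{k \times d}$:

\[
\tilde{g} = P g, \qquad \tilde{H} = P H P^{\top}, \qquad
\tilde{\mathcal{I}}(z, z_q) = \tilde{g}_q^{\top} \tilde{H}^{-1} \tilde{g}.
\]

The structure LoGra exploits: for a linear layer, backpropagation already factors the per-example gradient into an outer product (a sum of outer products over sequence positions), $g = \mathrm{vec}(e\,a^{\top})$, with input activation $a$ and output-side error $e$. Choosing $P = P_a \otimes P_e$ then gives $P g = \mathrm{vec}\!\big((P_e e)(P_a a)^{\top}\big)$, so the projection can be applied to the small factors $a$ and $e$ during backpropagation instead of to the full gradient.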