TY - GEN
T1 - Interrogating LLM design under copyright law
AU - Tian-Zheng Wei, Johnny
AU - Wang, Maggie
AU - Godbole, Ameya
AU - Choi, Jonathan
AU - Jia, Robin
N1 - Publisher Copyright:
© 2025 Copyright held by the owner/author(s).
PY - 2025/6/23
Y1 - 2025/6/23
N2 - The current discourse on large language models (LLMs) and copyright largely takes a "behavioral" perspective, focusing on model outputs and evaluating whether they are substantially similar to training data. However, substantial similarity is difficult to define algorithmically and a narrow focus on model outputs is insufficient to address all copyright risks. In this interdisciplinary work, we take a complementary "structural" perspective and shift our focus to how LLMs are trained. We operationalize a notion of "fair learning" by measuring whether any training decision substantially affected the model's memorization. As a case study, we deconstruct Pythia, an open-source LLM, and demonstrate the use of causal and correlational analyses to make factual determinations about Pythia's training decisions. By proposing a legal standard for fair learning and connecting memorization analyses to this standard, we identify how judges may advance the goals of copyright law through adjudication. Finally, we discuss how a fair learning standard might evolve to enhance its clarity by becoming more rule-like and incorporating external technical guidelines.
AB - The current discourse on large language models (LLMs) and copyright largely takes a "behavioral" perspective, focusing on model outputs and evaluating whether they are substantially similar to training data. However, substantial similarity is difficult to define algorithmically and a narrow focus on model outputs is insufficient to address all copyright risks. In this interdisciplinary work, we take a complementary "structural" perspective and shift our focus to how LLMs are trained. We operationalize a notion of "fair learning" by measuring whether any training decision substantially affected the model's memorization. As a case study, we deconstruct Pythia, an open-source LLM, and demonstrate the use of causal and correlational analyses to make factual determinations about Pythia's training decisions. By proposing a legal standard for fair learning and connecting memorization analyses to this standard, we identify how judges may advance the goals of copyright law through adjudication. Finally, we discuss how a fair learning standard might evolve to enhance its clarity by becoming more rule-like and incorporating external technical guidelines.
KW - Copyright
KW - LLMs
KW - regression
UR - https://www.scopus.com/pages/publications/105010816786
U2 - 10.1145/3715275.3732193
DO - 10.1145/3715275.3732193
M3 - Conference contribution
AN - SCOPUS:105010816786
T3 - ACM FAccT 2025 - Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency
SP - 3030
EP - 3045
BT - ACM FAccT 2025 - Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency
PB - Association for Computing Machinery, Inc
T2 - 8th Annual ACM Conference on Fairness, Accountability, and Transparency, FAccT 2025
Y2 - 23 June 2025 through 26 June 2025
ER -