I was curious where the boundary lies between similarity and copyright infringement, specifically in the context of using large language models for programming. https://writing.kemitchell.com/2025/01/16/Provisional-Guidance-LLM-Code is just what I needed: a high-level explanation that there is no such rule (yet), and what can be done in its absence.
Kyle's prose is rich and always takes me a while to read and digest, so if you're in a hurry, here are my takeaways:
there is no specific threshold: no number of characters, tokens, or lines one has to generate before it becomes an infringement
there's a continuum from autocompletion through generation to authorship. If one is auto-completing a simple line of code, it's probably fine. If one generates the same boilerplate that half the projects in the world contain, it's probably fine too, but make sure it's really boilerplate and nothing original. If one asks for a complete implementation of some algorithm, the risks are much higher
one should document everything that's done by an LLM, to be used later as evidence of non-infringement. The LLM's output should be stored as separate commits that include the prompt, and human edits should go in their own commits, to clearly delineate what was generated and what was authored (see the sketch below)
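To make that last point concrete, here's a minimal sketch of the "separate commits" idea in Python. The file paths, prompt text, model name, and commit-message format are all my assumptions, not Kyle's; the only point is that the prompt and the verbatim generated output land together in one commit, before any human edits.

```python
import subprocess

def commit_llm_output(paths: list[str], prompt: str, model: str = "unknown-model") -> None:
    """Stage files produced by an LLM and commit them with the prompt recorded.

    Human edits should then go into later, separate commits.
    """
    # Stage only the generated files.
    subprocess.run(["git", "add", *paths], check=True)
    # Record the prompt and model in the commit message as provenance.
    message = (
        f"Add LLM-generated code ({model})\n\n"
        f"Prompt:\n{prompt}\n\n"
        "Generated output committed verbatim; human edits follow in later commits."
    )
    subprocess.run(["git", "commit", "-m", message], check=True)

# Hypothetical usage:
# commit_llm_output(["src/parser.py"], "Write a CSV parser that handles quoted fields")
```

A shell alias or a pre-commit hook would work just as well; what matters is that the record is made at the moment of generation, not reconstructed later.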
Of course, there are still many more questions to be answered about LLMs: potential infringements during training, the efficiency of training and inference compared to typing the code yourself, as well as more philosophical questions about where this takes programming as an activity.