Why are AI people so monumentally *bad* at copyright?
I'm looking for ethical/copyright-safe training datasets. Common Corpus sells itself as that... but then I go read the paper and they include CC BY-SA scientific papers and GPL stuff from GitHub, and then in models trained on that dataset they proudly state:
> Only trained on open data under a permissible [sic?] license [...] By design, all Pleias model are unable to output copyrighted content.
Um, no?? CC BY-SA is not public domain, it's a copyright license with attribution and share-alike conditions. You can't train on CC BY-SA content and then claim your model is any more copyright-safe than whatever Google and Meta are releasing. It just means the only people whose copyright you're violating are the ones who released their content under open licenses.


