Last week, the Common Crawl Foundation and Jeff Jarvis hosted an important conversation: AI & The Right to Learn on an Open Internet. However, this conversation has a much broader scope -- who (or what) has the right to read anything, and what can they do with this information?
tl;dr: Everyone and everything has the right to read anything and everything. Including computers. Everyone and everything has the right to learn. However, the right to communicate what you've read or learned -- including synthesizing new ideas or content with or without attribution or compensation -- is in question.
The current hand-wringing concerns the intellectual property rights attached to information on the Internet -- how should content producers be compensated for their work and what attributions must be made? This is an old question (well, at least a few decades old). With generative AI and Large Language Models, the question becomes murkier -- how should attribution (and compensation) be given for "new" work that has been synthesized from previous work or content (crawled from the Internet)? The conference last Tuesday (April 30, 2024) focused mostly on this.
But this question is much larger than content that is crawled on the Internet. For all the books that have ever been published, for any music or movie that has been created, or any content of any kind in general, the question of who should be compensated for what work under what kind of derivation is unanswered.
Kudos to the Common Crawl Foundation and Jeff Jarvis for addressing this broader question in addition to the specific one about information on the Internet. In particular, with Generative AI and Large Language Models, there are no obvious answers on how to handle content (including text, audio, and video) generated, learned, aggregated, and synthesized by machines (computers).
As pointed out, our current legal system is not up to speed on handling this question. The music industry has grappled with this for years, with no satisfactory solution. Ed Sheeran, Led Zeppelin, Robin Thicke, and even Taylor Swift are near the center of this controversy over who owns the rights to music and derivative works. The labor disputes resulting in strikes by the Writers Guild of America and SAG-AFTRA show what is at stake for Hollywood (though this was not directly addressed by this conference).
"Fair use" doesn't seem to be an adequate "tool" for the current use of content by Artificial Intelligence (or for "derivative" content generation, AI or otherwise). Copyright, trademarks, patents, and trade secrets provide some guidance but seem inadequate to address current and emerging content, which is largely driven by AI but also by technology in general. Ultimately, the courts need to get involved, but it will be a long slog before it all gets sorted out. Perhaps the music industry provides some hints -- there are detailed rules and payment agreements on how copyright holders of music shall be compensated. However, this is far from settled, as disputes are still working their way through the courts. Further, there is no guidance on how music created by Gen AI should be dealt with.
Ultimately, the issues are financial. It was posited that we should get the ethical and moral issues sorted out first, and the monetary solutions will follow. Not surprisingly, this did not get much traction.
Right to Be Removed
One side topic of particular interest to me was a discussion about the "right to be removed." Once content has been crawled and indexed into an archive from the open Internet, what are the rights of a publisher to have it removed? At one level, once the information is "out there," can it ever really be "removed"? The Common Crawl data has been downloaded by many. Versions and backups have been squirreled away. How is the data to be removed from all these copies?
Common Crawl is making efforts to remove content from an archive to satisfy practical requests, say, removing links to child pornography. However, this obviously does not affect the copies that have already been distributed. And, even within the Common Crawl repository, I think these "removals" are just an update or branch from a given crawl. Think of the Common Crawl as a git repo -- all the versions are still there, even if there is an update to a branch.
Right to Read and Learn; Freedom of Speech; Intellectual Property Limits
As Americans, we have a basic right to read anything. Nothing should stop us from learning. Further, the First Amendment gives us broad rights to speak about nearly anything. There are some restrictions on what we do with what we synthesize from what we read and what we learn -- that's what Intellectual Property law is all about. The new question is whether or not machines have the same rights and whether they are also somehow restricted. This is the issue discussed at this conference. With the onslaught of Generative AI and Large Language Models, the machines are spewing arguably new ideas synthesized from old. What should prevent the machines from reading, learning, and synthesizing?
Return to Xanadu
I've long been fascinated by the promise of Ted Nelson's Xanadu. While wildly impractical and unimplementable, especially when it was envisioned in the 1960s, the 17 principles are largely becoming true. In particular, in the context of attributing source material for content synthesized by Gen AI systems, it seems like "transclusions" and links as envisioned by Xanadu are relevant. The World Wide Web is often referred to as the simplified and practical implementation of Xanadu, even though Ted Nelson himself rejects this. I'd say Generative AI, transformers, and Large Language Models take us one step closer to Xanadu. And an answer to the hand-wringing over how content creators (especially journalists) can be compensated for their work may be found in the links/transclusions of Xanadu.
Hat tip to Rich Skrenta (Executive Director of Common Crawl) for bringing "Computer Lib/Dream Machines" (basically Ted Nelson and Xanadu) to the discussion.
Props and Acknowledgements
Thank you, Rich Skrenta and Jeff Jarvis, for envisioning and hosting this event. It is an important topic. But everyone on the Common Crawl team was instrumental in making it happen. Joy Jing did a great job coordinating. Gil Elbaz is the founder of Common Crawl. Amazon AWS hosts the Common Crawl data (for free, I think). Mike Masnick was particularly insightful. Kearney, who hosted and sponsored much of the event, was particularly (and surprisingly) engaged in the issues of this conference. It was great meeting the team from Tola Capital, sponsors of the event.