This article was written by James North and is made available under a Creative Commons Attribution-NonCommercial 4.0 International License. Share it, but don't exploit it for commercial purposes.
Read the legal code for the full terms.
Given all the excitement around large language models recently, we've seen a dramatic increase in organizations crawling the web for data to train their machine learning models. In May 2023, The Internet Archive was pummeled by a company using AWS to scrape its OCR files and suffered a temporary outage. While The Internet Archive wants to serve these organizations' near-bottomless demand for training data (it just wishes the crawlers were more polite about it), some of us don't.
Some of us would prefer to block machine learning crawlers altogether. Right now, the only W3C-compliant way to do that is to request that our sites not be indexed at all, via the noindex robots directive (shown below). Most of us still want our sites indexed by search engines, though, so that's not a real option. The specification needs to change to accommodate this new situation.
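For reference, that all-or-nothing opt-out is a standard robots meta tag placed in a page's head, which asks every compliant crawler to keep the page out of its index entirely:

    <meta name="robots" content="noindex">

Search engines honor this, but it removes you from search results along with everything else; there is no standard value that targets machine learning crawlers alone.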
Adding the noml value is a simple but powerful extension to the existing specification, and it would give webmasters more control over how their work is used. Governments haven't yet decided how training data interacts with copyright law, but that doesn't mean we can't do anything about it now.
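If adopted, opting a page out of ML training while keeping it in search indexes could look like this. (A rough sketch, not the proposal's official syntax: it assumes noml works as an ordinary robots meta value, combinable with existing ones, and that an equivalent X-Robots-Tag HTTP header would cover non-HTML resources such as the Archive's OCR files.)

    <!-- In the page's head: stay indexed, opt out of ML training -->
    <meta name="robots" content="index, noml">

    X-Robots-Tag: noml

X-Robots-Tag is the existing header-based counterpart to the robots meta tag, so it would be the natural home for the same value on files that can't carry markup.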
The NoML Proposal is a very sane suggestion, so I signed it and am officially stating my support for it in this post.
You can read more about the proposal here.