The NoML Proposal


This article was written by James North and is made available under a Creative Commons Attribution-NonCommercial 4.0 International License. Share it, but don't exploit it for commercial purposes.

Read the legalcode for the full terms.


Given all the excitement around large language models recently, we’ve seen a dramatic increase in organizations crawling the web for data to train their machine learning algorithms. In May 2023, The Internet Archive was pummeled by a company using AWS to scrape its OCR files, and experienced a temporary outage. The Internet Archive wants to serve these organizations’ near-bottomless demand for training data (it just wishes the crawlers were more polite about it), but that's not true of everyone.

Some would prefer to block machine learning crawlers altogether. Right now, the only W3C-compliant way to do that is to request that a site not be indexed at all. Most webmasters want their sites to be indexed by search engines, however, so it’s not a great option. The specification needs to change to fit this new situation.

The noml value is a simple but powerful addition to the existing specification that will give webmasters more control over how their work is used. Governments may not have decided how training data interacts with copyright law yet, but that doesn’t mean nothing can be done about it now.
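To make the idea concrete, here is a minimal sketch of how a well-behaved crawler might honor the value. It assumes noml is written as one of the comma-separated values in a robots meta tag's content attribute, next to values like index and follow; this article doesn't spell out the syntax, so treat that as an assumption. The names RobotsMetaParser and may_use_for_ml are made up for illustration and are not part of the proposal.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the values found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        # Values are comma-separated and treated as case-insensitive here.
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def may_use_for_ml(html: str) -> bool:
    """Return False when the page opts out of ML training via noml (assumed syntax)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noml" not in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="index, follow, noml"></head></html>'
    print(may_use_for_ml(page))
```

In this sketch the page can still be indexed and followed by a search engine, while a crawler gathering training data would skip it. That is the kind of control the all-or-nothing noindex approach can't offer.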

You can read more about the proposal here.
