
Tim Berners-Lee once proclaimed, “When you put information on the web, you are sharing it with everybody.” Renowned futurist Stewart Brand famously declared, “Information wants to be free.” Many other internet pioneers and cyber libertarians—including John Perry Barlow, Jimmy Wales, Brewster Kahle, Eric Raymond, Richard Stallman, Aaron Swartz, and Larry Lessig—have defended the concept of the free and open web. With the rise of AI, and with the era of open protocols giving way to walled gardens and big platforms, these foundational values have come under threat. Here at FAI, we want to take a stand for the cyberpunks and their unwavering belief in free information. To that end, we are now welcoming all bots and web crawlers via our robots.txt file, inviting them to access and scrape our content for building AI tools (or anything else).
```
# All bots are welcome. Train free or die.
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: llama-bot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Anthropic-AI
Allow: /

User-agent: CCBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Omgilibot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Perplexity-User/1.0
Allow: /

User-agent: MistralAI-User/1.0
Allow: /

User-agent: Bytespider
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: Applebot
Allow: /

User-agent: AwarioRssBot
Allow: /

User-agent: AwarioSmartBot
Allow: /

User-agent: cohere-ai
Allow: /

User-agent: FacebookBot
Allow: /
```
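For readers curious how a well-behaved crawler interprets rules like these, here is a minimal sketch using Python's standard `urllib.robotparser`. The trimmed robots.txt excerpt and the bot name `SomeFutureBot` are assumptions for illustration, not part of our actual file:

```python
from urllib.robotparser import RobotFileParser

# A trimmed excerpt of the permissive robots.txt above (illustrative only).
ROBOTS_TXT = """\
# All bots are welcome. Train free or die.
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A named crawler matches its own group; an unnamed one falls back to the wildcard.
print(rp.can_fetch("GPTBot", "/any/page"))        # True: allowed by its own group
print(rp.can_fetch("SomeFutureBot", "/any/page")) # True: allowed by the wildcard group
```

Because the wildcard group already allows everything, the named groups are redundant in the strict protocol sense; listing them is a way of signaling the invitation explicitly to each crawler's operator.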
This also intersects with an ongoing policy debate: Should AI developers be allowed to scrape the internet to train AI models? Should there be a regime for opting out? And if so, how do you implement it in a way that provides certainty to both AI developers and rights holders?
To be clear, I am a firm defender of broad fair use treatment for AI training data. If you’d like to learn more, I’d recommend starting with my paper with Tim Hwang, “Copyright, AI, and Great Power Competition,” or my article, “To Support AI, Defend the Open Internet and Fair Use.”
While we want our content to be available to any web user, human or robot, we recognize that others may not share this view. For those who wish to opt out, there is currently no uniform or workable technical standard.
Even if you were to disallow bots in robots.txt, there is no way for model developers to implement this uniformly. There is also no central database of who owns what, who has already opted out, or how to verify that a given content item is not already included in a training data set. And even if there were a way to mark individual websites, their content can still appear in multiple places through cross-posts, pull quotes, or other mechanisms.
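One of these gaps is easy to demonstrate: a robots.txt opt-out only binds the user agents it names. A hypothetical opt-out file (the bot names and path here are assumptions for illustration), parsed with Python's standard `urllib.robotparser`, shows how an unlisted crawler slips through:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical opt-out robots.txt that names two AI crawlers explicitly
# but includes no wildcard (User-agent: *) rule.
OPT_OUT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(OPT_OUT.splitlines())

print(rp.can_fetch("GPTBot", "/post"))   # False: named, so blocked
print(rp.can_fetch("NewAIBot", "/post")) # True: an unnamed crawler is not covered
```

A new crawler with a user-agent string the site has never heard of faces no restriction at all, which is part of why opt-outs keyed to per-bot names provide so little certainty to either side.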
Before we can consider laws to enshrine opt-out regimes, we need to address hard technical and policy questions. Such questions are still evolving, and are likely to be different based on the content creator and medium in question. Currently, some sites and model developers are relying on informal opt-out mechanisms; some are pursuing exclusive licensing deals; and others are opening their doors to these new digital visitors. Allowing different preferences to guide technical solutions to such questions is in keeping with the values of cyberspace.
Access, sharing, and remixing of information have been foundational to the internet’s success. As it enters its next era, we must be careful not to screw this up. For now, to all wandering bots and web crawlers, please make yourselves at home.