The Data Crunch: Accelerating American AI through Government Data Access

March 23, 2026

Introduction

Data is a critical component of progress in artificial intelligence. It remains foundational for the pretraining of frontier models that push the state of the art in general AI capabilities, as well as for the post-training of models to achieve proficiency in specific tasks.

However, in recent years AI policymaking has focused not on data but on computing power. This emphasis has been driven by the extreme scarcity of the high-end chips necessary for AI training and inference, and by the geopolitical dimensions of the semiconductor supply chain: many of the most advanced chip fabrication facilities are located in Taiwan. That prioritization has been appropriate. Without reliable access to computing power, no progress in AI can occur.

But the world has changed substantially in the years since the global AI competition began. The essential elements of policy around export controls and high-performance computing are becoming settled. Furthermore, alternative computing platforms such as Google’s Tensor Processing Unit, along with public and private investment to reshore U.S. semiconductor production, mean that the chip scarcity that shaped the previous five years of AI competition may not characterize the next five.

The future opportunities for the U.S. to materially change the contours of AI competition may lie in policy approaches that focus on securing critical, high-quality datasets. As a past FAI paper argued, this may include the prioritization of training data access as a specific target of trade policy, with agencies such as the Office of the U.S. Trade Representative working to secure valuable data in deals with other countries.

Pressure to acquire data may be compounded by two major factors. First, many of the leading frontier AI labs have by now consumed much of the easily accessible, openly available data that supports the advancement of their technology. Many of the most valuable remaining datasets are proprietary, poorly structured, or otherwise hard to access, increasing the cost and time required to leverage them.

Second, rightsholders have aggressively used copyright and intellectual property law to raise the risks of using openly available data on a fair use basis. Litigation has created significant uncertainty around the use of data without formal licensing and has produced some of the largest settlements ever seen in the copyright space. This, too, creates friction that slows U.S. AI progress.

This paper argues that one major accelerant may be an asset the federal government is already sitting on: its own data. The government should work with private partners to increase access to this data and make it AI-ready, bolstering the data commons and supporting American AI development. The paper argues for the value of launching a U.S. Data Accelerator, specifies what types of data would be most valuable, identifies existing laws and regulations that could underpin such a project, and addresses some potential criticisms of the idea.
