
This piece originally appeared in Lawfare.
It is no secret that dominance over cutting-edge technologies will play a major role in the geopolitical competition between the U.S. and China. Technological leadership will help define not just each nation's economic health but its military and soft power as well. Artificial intelligence (AI) has emerged as one pivotal area in this competition, with both nations working to accelerate their capabilities and to secure the inputs needed for further development. China's national and provincial governments are taking steps to create the infrastructure and regulatory regime to empower AI development and diffusion.

The United States is currently ahead in this race: its companies are making the most dramatic breakthroughs in the technology and are implementing AI at scale throughout the economy. But this lead is fragile. Leading U.S. AI labs face an existential threat: copyright lawsuits. In these suits, rights holders argue that training models on copyrighted material scraped from the web without their express consent violates copyright law. Because of the vast amount of data included in such training sets, the potential copyright penalties would bankrupt many AI developers.

Though little discussed in debates over geopolitical competition, it may ultimately be domestic battles over copyright that determine whether the U.S. emerges as the definitive leader in the technological race with China or falls behind. To remedy this issue and ensure the U.S. stays ahead, Congress, or ideally the courts, should take the bold and important step of affirming the legality of using publicly available data for training AI models in the United States.
China’s Data Directives
China is well-poised to catch up to the U.S. in AI and in some respects may already be ahead. The production function for AI includes three inputs: talent, compute, and data. While all three face bottlenecks, data scarcity is increasingly relevant, and access to high-quality, machine-readable datasets is one factor of AI production the Chinese government is working to support. According to AI research organization Epoch AI, if current trends in data usage continue, model training will exhaust the entire stock of human-generated public text sometime between 2026 and 2032; the size of datasets used to train frontier large language models (LLMs) is doubling approximately every seven months. Large amounts of training data are a prerequisite for pre-training, a necessary early step in AI model development, as well as for fine-tuning, a later process that requires more targeted or specialized datasets to impart specific knowledge or modalities.
The Chinese government has adopted a two-pronged approach to ensure that domestic model developers, particularly those partnering with state-supported industries, research and development (R&D) efforts, and academic institutions, have the data necessary not just to compete with U.S. labs but to best them.
Continue reading in Lawfare.