
We appreciate the opportunity to help inform the implementation of the Genesis Mission at the Department of Energy. To execute this mission effectively, the Department should establish an American scientific data consortium that draws on the breadth of authorities available to the Department. We note that fine-tuned reasoning and language-based models offer the best foundation for scientific models and that diverse types of scientific data should be accessed via API. Domains that are ripe for the development of scientific and self-improving models include structural biology, materials science, and quantum science. The highest-priority deployments of these models should be improving materials research for energy and compute, accelerating the improvement of scientific instruments, and building the quantum infrastructure that will drive the next era of compute.
1. Mobilizing the National Labs
Establish an American Scientific Data Consortium to:
- Create Domestic Industry Tiers: Establish privileged partnership tiers with American companies first, particularly those with secure supply chains and U.S.-based operations
- Leverage Existing Infrastructure: Utilize the National Laboratory complex's unique computational assets (such as the Summit, Frontier, and Aurora supercomputers) as exclusive American AI training grounds, providing U.S. companies preferential access to these national resources
Data Curation Framework
- American Data Standards: Develop U.S.-led standards for scientific data formatting that become the global benchmark, ensuring American companies have first-mover advantages
- Federated Architecture: Implement a federated data system where DOE maintains sovereignty over its data while contributing to a unified, searchable ‘catalog’ accessible to U.S. entities (a minimal sketch of this pattern follows this list)
- Priority Access System: American companies receive accelerated access to curated datasets, with graduated access levels based on domestic content requirements and U.S. workforce commitments
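To make the federated pattern concrete, the sketch below shows a minimal metadata catalog in which each laboratory retains custody of its data and publishes only searchable descriptive records pointing back to its own access endpoint. All class, field, and endpoint names (e.g., CatalogRecord, FederatedCatalog) are illustrative assumptions, not an existing DOE system.

```python
# Minimal sketch of a federated metadata catalog: each lab keeps its data
# in place and contributes only descriptive records to a shared, searchable
# index. Class and field names are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class CatalogRecord:
    dataset_id: str          # globally unique identifier
    custodian: str           # lab that retains the underlying data
    title: str
    keywords: list[str]
    access_tier: str         # e.g., "us-industry", "allied", "open"
    endpoint: str            # API endpoint at the custodian lab


class FederatedCatalog:
    """Unified index over records contributed by member labs."""

    def __init__(self) -> None:
        self._records: list[CatalogRecord] = []

    def contribute(self, record: CatalogRecord) -> None:
        # The custodian publishes metadata only; raw data never leaves the lab.
        self._records.append(record)

    def search(self, keyword: str, tier: str) -> list[CatalogRecord]:
        # Search returns pointers back to the custodian's own endpoint.
        return [
            r for r in self._records
            if keyword.lower() in (k.lower() for k in r.keywords)
            and r.access_tier == tier
        ]


if __name__ == "__main__":
    catalog = FederatedCatalog()
    catalog.contribute(CatalogRecord(
        dataset_id="ornl-batteries-0001",
        custodian="ORNL",
        title="Solid-state electrolyte conductivity measurements",
        keywords=["battery", "electrolyte", "materials"],
        access_tier="us-industry",
        endpoint="https://data.example.gov/ornl/batteries",
    ))
    for hit in catalog.search("battery", tier="us-industry"):
        print(hit.dataset_id, "->", hit.endpoint)
```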
Privacy-Preserving Approaches for Competitive Advantage
- Secure Multi-Party Computation and Homomorphic Encryption Protocols: Invest in secure multi-party computation, secure enclaves, and homomorphic encryption capabilities that allow AI training on encrypted and proprietary data, ensuring that sensitive commercial, nuclear, energy grid, and defense-related datasets can strengthen American AI models without security or intellectual property risks
- Synthetic Scientific Data Generation: Develop American-owned synthetic data generation capabilities that create training datasets mimicking real scientific data patterns without exposing actual sensitive information (see the sketch following this list)
- Trusted Execution Environments: Deploy hardware-based security on American-manufactured chips to create isolated computing environments for sensitive data processing
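As a minimal illustration of the synthetic-data approach, the sketch below assumes that only aggregate statistics (per-feature means and a covariance matrix) may be shared while raw records may not, and samples synthetic records from those statistics. The Gaussian model and all variable names are simplifying assumptions, not a proposed method.

```python
# Minimal sketch: release synthetic records that reproduce coarse statistics
# (mean and covariance) of a sensitive dataset without exposing raw rows.
# The Gaussian model and all names here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a sensitive dataset: 500 records with 4 numeric features.
sensitive = rng.normal(loc=[1.0, 5.0, -2.0, 0.5],
                       scale=[0.2, 1.0, 0.5, 0.1],
                       size=(500, 4))

# Fit only aggregate statistics; the raw rows are never released.
mean = sensitive.mean(axis=0)
cov = np.cov(sensitive, rowvar=False)

# Sample synthetic records from the fitted distribution.
synthetic = rng.multivariate_normal(mean, cov, size=500)

print("real mean     :", np.round(mean, 3))
print("synthetic mean:", np.round(synthetic.mean(axis=0), 3))
```

In practice, formal protections such as differential-privacy noise on the released statistics and richer generative models would be layered on top, but the division of labor is the same: statistics and synthetic samples go out, raw data stays in.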
Implementation Recommendations
- Secure American Supply Chains: Successful implementation of the Genesis Mission will require participants to support a robust, resilient U.S.-based supply chain
- Controlled Export of the American AI Stack: Place export controls on access to AI models trained with DOE data, ensuring American companies maintain competitive advantages
- Workforce Development: Incentivize consortium members to invest in training American workers in AI and data science, building domestic capacity
- Performance Metrics: Establish clear metrics prioritizing American job creation, U.S. patent generation, and domestic manufacturing capabilities
- Milestone-Based Funding: Tie tiered government support to these performance metrics
Implement "American Compute First" IP Framework Using Other Transactions Authority (OTA)
- Graduated IP Rights: Base government rights on public investment percentage
- Domestic Manufacturing Covenants: Require that a majority of manufacturing operations be conducted in the U.S. before international deployment
- IP Pooling Arrangements: Create patent pools where government and partners cross-license with preferential terms for U.S. entities
Override Standard FAR/DFARS Limitations through OTA by:
- Eliminating unlimited rights triggers for segregated private development
- Creating "background IP safe harbors" for pre-existing private technology
- Establishing expedited IP dispute resolution through binding arbitration
Deploy "Fortress America Compute" Protections
- Technology Control Agreements: Mandatory screening of all foreign nationals and entities accessing consortium resources
- Allied Access Tiers: Create a shared, permission-based AUKUS+ framework for sharing compute with Five Eyes partners after U.S. deployment, tied to the domestic manufacturing covenant
- Clawback Provisions: Automatic reversion of rights if partners violate technology security requirements
2. Structuring the Public-Private Consortium
DOE should utilize its OTA to bypass rigid "one-size-fits-all" data rights that mandate broad government-purpose licenses. A barrier to innovation for small and nontraditional businesses is the potential disclosure of proprietary hardware and software architectures through standard DOE data rights. By leveraging OTA flexibilities, DOE can establish negotiated IP protections that clearly distinguish "background IP" from "subject inventions," protecting a firm’s pre-existing technology while still allowing the government to benefit from resulting AI models. This approach incentivizes nontraditional contractors to contribute leading-edge technology without forfeiting the commercial benefits necessary to reach next-generation compute milestones.
a. Fine-tuning and API use with general-purpose AI models
DOE should focus on partnering with commercial frontier labs or focused research organizations (FROs) to iteratively develop custom-built models for its activities. While today's reasoning models are capable of performing a suite of tasks, their training sets lack the specific data, context, and domain understanding the DOE needs to advance frontier scientific research and development. This is particularly relevant for niche scientific data such as DNA sequences, protein structures, specialized image data, experimental results, and proprietary information. As discussed below, the use of APIs to query relevant data can ensure custom-built models can be iteratively improved and refreshed while limiting the need for full-scale training runs.
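As one hypothetical illustration of what such a partnership could feed into fine-tuning, the sketch below packages domain records into prompt/completion pairs in JSONL form, a common input format for supervised fine-tuning; the record content, field names, and file name are placeholders, not DOE data or a vendor-specific format.

```python
# Minimal sketch: package domain-specific scientific records as supervised
# fine-tuning examples (JSONL of prompt/completion pairs). Record contents
# and field names are hypothetical placeholders.
import json

records = [
    {
        "question": "What is the reported band gap of the candidate perovskite?",
        "context": "Sample A2: measured optical band gap 1.58 eV at 300 K.",
        "answer": "Approximately 1.58 eV at room temperature.",
    },
]

with open("finetune_examples.jsonl", "w") as f:
    for r in records:
        example = {
            "prompt": f"Context: {r['context']}\nQuestion: {r['question']}\nAnswer:",
            "completion": " " + r["answer"],
        }
        f.write(json.dumps(example) + "\n")
```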
b. Combining general-purpose AI models with scientific and engineering data
To preserve flexibility, reduce costs, and ensure robustness in data and reasoning quality, DOE should pilot a system whereby most models trained on domain-specific scientific data are queried through APIs by general-purpose language and reasoning models. Within one year of implementation, DOE should report on the effectiveness of this approach and evaluate subsequent investments, given the possibility of architectural advances in AI that may advantage combining general-purpose AI models with models trained on scientific data. Exceptions include models working on frontier mathematics or other symbolic fields, where integrated reasoning capabilities are essential. A minimal sketch of this tool-based pattern follows the list below.
- Separating the development of science-specific models and general-reasoning models allows DOE to build and adjust individual models independently. Various scientific models and reasoning models advance on different timelines and through different research communities. A tool-based architecture enables changes to specific components without retraining an integrated system.
- Scientific models often contain domain-specific inductive biases that transformer-based language models cannot easily accommodate within current architectures. Tool use lets reasoning models leverage these purpose-built scientific models without attempting to merge fundamentally different architectural approaches.
- Tool use is also currently the most robust method for extending language model capabilities into specialized domains. Frontier reasoning models already demonstrate strong performance in formulating queries, interpreting outputs, and synthesizing results across multiple tool calls. This capability is available today and improves with each model generation.
- Joint training approaches carry risks of capability degradation and require substantial compute investment with uncertain outcomes. Tool-based integration delivers incremental value with lower risk.
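A minimal sketch of this tool-based pattern is shown below: a general-purpose reasoning model routes a domain question to a specialized scientific model exposed as a tool and synthesizes the result. Both models are stubbed, and all names, signatures, and the keyword-based routing are illustrative assumptions rather than existing DOE or vendor interfaces.

```python
# Minimal sketch of tool-based integration: a general-purpose reasoning model
# delegates domain questions to a specialized scientific model exposed as a
# tool. Both models are stubbed; names and signatures are illustrative.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    description: str
    call: Callable[[str], str]


def protein_structure_tool(query: str) -> str:
    # Stand-in for a call to a domain model served behind an API
    # (e.g., an HTTP request to a structure-prediction service).
    return f"[structure model] predicted fold summary for: {query}"


def reasoning_model(prompt: str, tools: dict[str, Tool]) -> str:
    # Stand-in for a frontier reasoning model. A real model would decide
    # which tool to invoke and how to phrase the sub-query; here we route
    # naively on a keyword to keep the sketch self-contained.
    if "protein" in prompt.lower():
        tool_output = tools["protein_structure"].call(prompt)
        return f"Synthesis: combining tool output with prior context -> {tool_output}"
    return "Synthesis: answered directly without tool use."


if __name__ == "__main__":
    tools = {
        "protein_structure": Tool(
            name="protein_structure",
            description="Predicts structural properties for a protein sequence.",
            call=protein_structure_tool,
        ),
    }
    print(reasoning_model("Estimate the stability of protein variant X-17.", tools))
```

Because the scientific model sits behind a stable interface, it can be retrained or replaced without touching the reasoning model, which is precisely the flexibility the pilot described above is meant to preserve.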
c. Science and engineering priorities for self-improving AI
The development of self-improving models will be served by advances in several domains. The single most important science and engineering priority for developing self-improving scientific AI models is constructing physical AI systems, specifically autonomous labs in which a model can hypothesize, plan, execute, and evaluate experiments in the physical world, assessing its own performance. Providing models with multi-modal inputs and outputs, including visual, audio, text, and direct instrument data, will afford the model vast amounts of data for its improvement. With current advances in model development, the need for data cleaning or labeling is now minimal.
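A minimal sketch of the hypothesize-plan-execute-evaluate loop is shown below, with both the proposing model and the instrument stubbed out; every name and the toy optimization objective are illustrative assumptions, not a reference to an existing autonomous-lab platform.

```python
# Minimal sketch of a closed-loop autonomous lab: propose an experiment,
# execute it (stubbed instrument), score the result, and feed the outcome
# back into the proposer. All components are illustrative stubs.
import random

random.seed(0)


def propose_experiment(history: list[tuple[float, float]]) -> float:
    # Stand-in for a model proposing the next condition (e.g., a synthesis
    # temperature). Explore randomly at first, then refine near the best.
    if len(history) < 3:
        return random.uniform(300.0, 900.0)
    best_setting, _ = max(history, key=lambda h: h[1])
    return best_setting + random.uniform(-25.0, 25.0)


def run_experiment(setting: float) -> float:
    # Stand-in for a physical instrument; outcomes peak near 650 with noise.
    return -abs(setting - 650.0) + random.gauss(0.0, 5.0)


def autonomous_loop(n_rounds: int = 20) -> tuple[float, float]:
    history: list[tuple[float, float]] = []
    for _ in range(n_rounds):
        setting = propose_experiment(history)      # hypothesize / plan
        outcome = run_experiment(setting)          # execute in the world
        history.append((setting, outcome))         # evaluate and record
    return max(history, key=lambda h: h[1])


if __name__ == "__main__":
    best_setting, best_outcome = autonomous_loop()
    print(f"best setting ~{best_setting:.1f}, score {best_outcome:.1f}")
```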
Second, there is a general need for data from null and negative results. Estimates suggest that approximately 80% of null research results are never published, leading to a systematic bias in training data. Incentives to find and publish null results will be a crucial step in correcting the model's picture of the world. This will ensure that self-improving AI models can improve based on accurate models of the success and failure of scientific experiments and that they attach appropriate probabilities to the success of findings or experiments, especially when AI models are executing their own experiments.
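The effect of withholding null results can be made concrete with a short calculation; the figures below simply apply the roughly 80% non-publication estimate cited above to a hypothetical set of experiments and are not empirical data.

```python
# Illustration of publication bias: if ~80% of null results go unpublished,
# a model trained on the literature sees a far higher apparent success rate
# than experiments actually achieve. All numbers are hypothetical.
true_successes = 200         # experiments that worked (assume all published)
true_nulls = 800             # experiments that did not work
null_publication_rate = 0.2  # ~80% of null results never published

published_nulls = true_nulls * null_publication_rate
observed_rate = true_successes / (true_successes + published_nulls)
actual_rate = true_successes / (true_successes + true_nulls)

print(f"apparent success rate in training data: {observed_rate:.0%}")  # ~56%
print(f"actual success rate in the lab:         {actual_rate:.0%}")    # 20%
```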
Further areas that support the development of self-improving models are genomics, structural biology, and materials science, where a wealth of existing data can be utilized to develop rapidly improving and self-improving models. Advances in quantum and materials sciences will accelerate the creation of the new computational architectures and energy systems necessary for building future generations of frontier models.
Finally, models should be trained on multi-modal scientific data, such as high-resolution spectroscopic and imaging data, and on hierarchical data structures spanning the atomic/molecular level (quantum chemistry calculations), the mesoscale (materials microstructure, quantum device architectures), and the system level (power grid data, manufacturing processes). Additional valuable data streams include time-series data from quantum-coherence monitoring and real-time error-syndrome extraction, as well as hybrid experimental-simulation data combining physical quantum experiments with classical HPC simulations. In quantum sensing, we see potential in approaches to classical and quantum sensor data fusion.
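As one way to make this hierarchy concrete, the sketch below groups the three scales into a single record type; the classes, fields, and units are illustrative assumptions, not a proposed schema.

```python
# Minimal sketch of a multi-scale scientific record spanning the levels named
# above (atomic/molecular, mesoscale, system-level). Names are illustrative.
from dataclasses import dataclass, field


@dataclass
class AtomicScaleData:
    method: str                    # e.g., "DFT" or "coupled cluster"
    total_energy_ev: float
    geometry_xyz: list[tuple[str, float, float, float]]  # element, x, y, z


@dataclass
class MesoscaleData:
    description: str               # e.g., "grain microstructure map"
    image_uri: str                 # pointer to imaging or microstructure data


@dataclass
class SystemLevelData:
    description: str               # e.g., "grid frequency trace"
    time_series: list[float]
    sampling_hz: float


@dataclass
class MultiScaleRecord:
    material_id: str
    atomic: AtomicScaleData
    mesoscale: MesoscaleData
    system: SystemLevelData
    provenance: dict[str, str] = field(default_factory=dict)
```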
d. Use cases and evaluations of self-improving models
Self-improving models trained on and for science and engineering problems present opportunities for tremendous advances in scientific discovery and engineering capability, and for furthering the health, well-being, prosperity, and security of the people of the United States and the world.
Biology and Biotechnology: Through AlphaFold and other tools, proteins have been the subject of substantial AI research, but self-improving models will be poised to handle dynamic protein modeling and the modeling of protein interactions, unlocking biological discoveries that further synthetic biology and enable rapid developments in biotechnology. These AI models should also be used for the discovery and assessment of effective surrogate endpoints to improve drug discovery.
Materials Science: Self-improving models should be deployed to accelerate the discovery and synthesis of new materials and the characterization of their properties. Particular priority should be given to materials with implications for improving the affordability of energy (such as battery materials or advanced reactor components), for infrastructure improvements, for transportation and space travel, and for the defense and national security apparatus.
Scientific Instrumentation Improvement: Many scientific processes are bottlenecked by the quality of the instruments that provide the data underlying experiments. Improving scientific instruments would accelerate both scientific discovery and scientific AI model improvement. Further, providing AI-native scientific instruments will also accelerate the development of scientific AI models.
Quantum-Specific Deployments: AI systems that evolve to optimize quantum error correction codes are essential for achieving quantum advantage before strategic competitors do. Self-improving models that dynamically optimize workload distribution between quantum and classical processors would maximize the return on existing quantum investments.
3. Deployment of AI models for accelerating innovation
Amid the complexity of engineering self-improving AI models, the DOE should remember the value of interoperability and portability within the American science cloud. In seeking to improve collaboration across industry sectors, the consortium should consider:
- Utilizing a hybrid edge-cloud architecture to lower latency while retaining the accuracy and power of cloud-based models (a minimal routing sketch follows this list).
- Mitigating egress issues by bypassing the public internet where possible and by locally caching large datasets near their most frequent users.
- Leveraging tools such as cybersecurity mesh architectures, identity fabrics, and identity meshes to improve identity and access management across distributed networks within the consortium.
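A minimal sketch of the hybrid edge-cloud routing idea follows: latency-sensitive queries are answered by a small model co-located with the instrument or data cache, and the remainder are escalated to a cloud-hosted model. Both models are stubbed, and the latency threshold and all names are illustrative assumptions.

```python
# Minimal sketch of hybrid edge-cloud routing: answer latency-sensitive or
# simple queries with a small local model, escalate the rest to the cloud.
# Both models are stubbed; thresholds and names are illustrative.
from dataclasses import dataclass


@dataclass
class Query:
    text: str
    max_latency_ms: int


def edge_model(q: Query) -> str:
    # Small model co-located with the instrument or dataset cache.
    return f"[edge] quick answer to: {q.text}"


def cloud_model(q: Query) -> str:
    # Large cloud-hosted model; higher accuracy, higher latency.
    return f"[cloud] detailed answer to: {q.text}"


def route(q: Query, cloud_round_trip_ms: int = 400) -> str:
    # Route to the edge when the latency budget rules out a cloud round trip.
    if q.max_latency_ms < cloud_round_trip_ms:
        return edge_model(q)
    return cloud_model(q)


if __name__ == "__main__":
    print(route(Query("flag anomalous sensor reading", max_latency_ms=50)))
    print(route(Query("summarize this week's beamline runs", max_latency_ms=5000)))
```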
In addition to following the Department of Commerce’s best practices for opening data to generative AI, DOE should explore standards and tools to minimize friction created by differences in syntax and semantics. To that end, the consortium should consider:
- Adopting standards that encourage semantic and syntactic uniformity to bridge differences in data structure. This eliminates mistranslations and reduces computational complexity, making data more AI-ready and making the scientific discoveries of an eventual AI model more auditable by humans.
- Assessing and deploying standards for ontology, such as OWL, and metadata schemas, such as Dublin Core or DataCite (an illustrative metadata sketch follows this list).
- Evaluating approaches that support behavioral interoperability of applications within the American science cloud such as containerization, infrastructure as code, GitOps operators that reduce configuration drift, and orchestration tools such as Kubernetes.
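To make the metadata recommendation concrete, the sketch below expresses a dataset description using a handful of DataCite-style fields; it mirrors the general shape of such metadata rather than a validated DataCite record, and all values are placeholders.

```python
# Minimal sketch of a dataset description using DataCite-style fields
# (identifier, creators, title, publisher, publication year, resource type).
# This mirrors the general shape of such metadata, not a validated record.
import json

record = {
    "identifier": {"identifier": "10.xxxx/placeholder", "identifierType": "DOI"},
    "creators": [{"name": "Example National Laboratory"}],
    "titles": [{"title": "Electrolyte conductivity measurements, batch 7"}],
    "publisher": "U.S. Department of Energy (example)",
    "publicationYear": "2025",
    "resourceType": {"resourceTypeGeneral": "Dataset"},
    "subjects": [{"subject": "materials science"}, {"subject": "batteries"}],
}

print(json.dumps(record, indent=2))
```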