The age of artificial intelligence is upon us, or so we’re told. We hear daily pronouncements of its transformative power, fueled by staggering public and private investment. Yet, for most Americans, daily life hasn’t markedly changed. We envisioned revolutions in health care, education, and climate solutions; instead, we’ve largely received slightly more sophisticated chatbots and faster ways to generate memes.
This growing gap between AI’s potential and its current impact partly stems from a fundamental, yet often overlooked, bottleneck: data.
In the age of AI, as recognized by Scale AI CEO Alexandr Wang, OpenAI CEO Sam Altman, and many others, data is destiny. The sophisticated Large Language Models (LLMs) that promise to reshape our world are voracious consumers of information, and vast quantities of high-quality data are essential fuel. Too little data, and an LLM remains a niche tool, inapplicable outside narrow confines. Data riddled with errors, biases, or gaps yields models that are unreliable, unfair, or even dangerous. Much of the disappointment surrounding AI stems from this reality—we haven't supplied the engines we're trying to build with fuel of sufficient quantity and quality, especially the engines aimed at serving the public good.
If we want AI to truly transform education, accelerate medical breakthroughs, optimize our transportation networks, and help us tackle climate issues—and much more—we need a radical overhaul of how we collect, store, and share data. Our current legal frameworks, designed for a pre-AI era, are actively hindering progress where it matters most. That reality was made clear in a recent pre-publication version of a Copyright Office report on generative AI training that, if finalized and adhered to by courts or formalized through legislation, would throttle AI progress.
The Constitution mandates that copyright laws “promote the progress of science,” as set forth under Article I, Section 8, Clause 8. Under the Copyright Office’s draft report, however, AI labs would largely be foreclosed from training their models on copyrighted materials. Larger labs would face a setback, but they’d likely manage to maintain some degree of progress: Giants like OpenAI initially scraped vast swaths of the internet to train their models—a practice now facing legal challenges and likely infeasible for newcomers—and tech behemoths like Google and Meta possess unparalleled proprietary datasets. They also have the deep pockets necessary to license data from publications and other sources. Smaller startups and academic labs focused on public interest applications, on the other hand, would find themselves starved of critical data.
We need clarity and courage in pursuing a new data access paradigm. Congress should, at the very least, create an explicit, narrow fair use exception for the use of copyrighted data solely for training AI models aimed at addressing clearly defined public policy challenges—disease research, educational tools for underserved communities, climate modeling. This isn’t about gutting copyright; it’s about rebalancing it to serve its constitutional purpose. When the goal is societal benefit, not commercial duplication, the scales of fair use should tip towards enabling innovation.
Beyond copyright law, the second, perhaps more complex, barrier to data access lies at the intersection of privacy law and data accessibility. Here, we face a perverse paradox. On one hand, a shadowy ecosystem of private data brokers exploits our patchwork of state and federal privacy laws, vacuuming up intensely personal data about nearly every aspect of our lives with minimal transparency or meaningful consent. This data enriches private companies but remains largely inaccessible for public use.
On the other hand, public and quasi-public institutions—government agencies, school districts, hospitals, transit authorities—sit on treasure troves of data directly relevant to solving pressing societal problems. Yet this data is often locked away in fiercely guarded silos, stored in bespoke formats, and rarely shared, even between agencies within the same city or state. Legitimate privacy concerns, coupled with bureaucratic inertia and a lack of standardized frameworks, create data fiefdoms precisely where collaboration and openness are most needed. The result? Low-quality, fragmented public datasets alongside incredibly rich, privately held ones.
This is untenable. We cannot hope for private or public entities to build AI that improves public health if health data remains balkanized—whether held exclusively by private firms or siloed as a result of outdated laws. We cannot optimize public transit if mobility data isn’t shared and analyzed comprehensively. We need to fundamentally rethink privacy not as an absolute barrier, but as a framework that empowers individuals while enabling progress.
Imagine a system where citizens can easily consent to share specific, anonymized datasets for the explicit purpose of contributing to AI research on designated public challenges—a national cancer data trust, a regional climate impact modeling initiative, a statewide educational outcome project. This would require new legal and technical infrastructure for secure, privacy-preserving data sharing, moving beyond the current all-or-nothing approach dictated by confusing terms of service and exploited by brokers. It would also require mandates and incentives for public bodies to standardize and share non-personal or appropriately anonymized data relevant to their missions.
The path forward requires bold legal reforms. We must adapt copyright law to recognize the unique data needs of public interest AI and redesign our privacy and data governance frameworks to break down silos and facilitate consensual, purposeful data sharing for societal benefit—while perhaps simultaneously cracking down on the opaque practices of private data brokers.
Without this wholesale revision of our approach to data, AI’s potential to genuinely improve lives will remain frustratingly out of reach. We’ll continue to get better meme generators while the truly transformative applications—the ones that can educate our children, heal our sick, and protect our planet—languish for lack of fuel. Data is destiny, and it’s time our laws shaped a better one.