This article has also been published on The Outrigger Group blog, here.
If data is the new oil, then dataset licensors are the new energy companies, sitting atop a resource that's increasingly essential, uniquely valuable, and perilously hard to control once extracted. But unlike physical oil fields, digital datasets can be copied, reverse-engineered, or embedded into machine learning models in ways that make true containment a technical and legal challenge. In the era of AI, licensing your dataset isn't just about granting access; it's about preventing exfiltration, even after the license has ended.
As AI developers race to fine-tune large language models and multimodal systems on ever-larger collections of text, image, video, and audio data, the licensing stakes have never been higher. What happens when a licensee, acting in bad faith or simply out of carelessness, walks away with your data or with a model that's functionally indistinguishable from it? A core premise of licensing data, rather than selling it outright, is that you get it back when the license ends, but how is that even possible?
I. Local vs. Cloud Licensing Models
There are two basic architectural approaches to dataset licensing, in the AI space and elsewhere: (1) delivering a local copy of the dataset to the licensee for onsite use and (2) retaining full control through cloud-only access models.
A. Onsite Access: Letting the Data Leave the Building
This model, which gives the licensee a local copy of the dataset, is efficient for performance but risky from the licensor's data security perspective. Even with encryption, audit logs, and deletion certifications, it's difficult to guarantee that no data was duplicated or memorized by trained models. The more systems the data touches, the harder it is to clean up post-termination.
B. Cloud-Only Access: You Keep the Oil in the Ground
Here, the dataset never leaves the licensor's infrastructure. The licensee accesses the data through a secure environment: a sandboxed cloud instance, a controlled application programming interface (API), or a confidential computing setup. This makes access revocable, leakage less likely, and deletion a non-issue.
Sidebar: Licensing Datasets? Watch for These 5 Red Flags
Even sophisticated AI companies can stumble when it comes to responsible dataset use. If you’re licensing out valuable training data, keep an eye out for these warning signs:
- Vague Use Restrictions
- No Model Retention Clause
- Missing Post-Termination Obligations
- Silence on Derivative Risks
- Pushback on Cloud-Only Access
Your license isn't just a formality; it's your firewall. Build it accordingly.
II. Why Cloud Access Is Ideal but Rarely the End of the Story
From the perspective of the dataset owner, cloud-only access is the gold standard for security and control. If the dataset never leaves your environment, then access is revocable, logs are centralized, and you avoid the downstream risk of data leakage, unauthorized retention, or circumvention. In the best-case scenario, the licensee logs in to a secure cloud environment, runs approved training jobs, and walks away with only the permitted outputs. No deletion audits, no forensic headaches, no trust fall.
But that's not how most large-scale AI licensing deals actually work.
In practice, major LLM developers (including OpenAI, Meta, and others) generally require that data be delivered as a local copy. That copy is then integrated into their proprietary training pipelines, which often span thousands of GPUs across internal supercomputing clusters. These companies optimize for performance, flexibility, and scale, and thus are unlikely to restructure their infrastructure just to accommodate a single dataset, no matter how sensitive.
That's not to say it's impossible. In rare cases, licensors with truly unique or regulated datasets (e.g., proprietary legal corpora, medical archives, or subscription-only financial data) or those with substantial bargaining power (e.g., government entities or major research institutions) may successfully negotiate cloud-only access terms. These terms might include:
- Remote training via secure virtual machine (VM) environments
- API-based access with metering or query caps (a minimal sketch follows this list)
- Use of confidential computing enclaves[1]
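To make the API-based option above more concrete, here is a minimal, illustrative sketch in plain Python of how a licensor-controlled gateway might meter queries, keep a centralized access log, and cut off a licensee at termination. The class and method names (MeteredDatasetGateway, query, revoke) are hypothetical and not drawn from any real product or from the deals described in this article; a production system would add authentication, persistent storage, and rate limiting behind HTTPS endpoints.

```python
# Minimal sketch (illustrative only): a licensor-side gateway that enforces a
# per-licensee query cap, logs every access centrally, and supports immediate
# revocation. For brevity, caps are not reset over time and nothing persists.
from datetime import datetime, timezone


class MeteredDatasetGateway:
    """Serves licensed records while metering and logging every query."""

    def __init__(self, dataset: dict[str, str], query_cap: int):
        self._dataset = dataset          # the data never leaves this process
        self._query_cap = query_cap      # contractual cap per licensee
        self._counts: dict[str, int] = {}            # licensee_id -> queries used
        self._log: list[tuple[str, str, str]] = []   # (utc_timestamp, licensee, record)
        self._revoked: set[str] = set()

    def revoke(self, licensee_id: str) -> None:
        """Cut off access immediately, e.g., on license termination."""
        self._revoked.add(licensee_id)

    def query(self, licensee_id: str, record_id: str) -> str:
        """Return one record if the licensee is in good standing and under cap."""
        if licensee_id in self._revoked:
            raise PermissionError(f"access revoked for {licensee_id}")
        if self._counts.get(licensee_id, 0) >= self._query_cap:
            raise PermissionError(f"{licensee_id} exceeded cap of {self._query_cap}")
        self._counts[licensee_id] = self._counts.get(licensee_id, 0) + 1
        self._log.append((datetime.now(timezone.utc).isoformat(), licensee_id, record_id))
        return self._dataset[record_id]


# Usage: two queries allowed under a cap of 2, then access is revoked.
gateway = MeteredDatasetGateway({"doc-001": "licensed text..."}, query_cap=2)
print(gateway.query("acme-ai", "doc-001"))
print(gateway.query("acme-ai", "doc-001"))
gateway.revoke("acme-ai")
# gateway.query("acme-ai", "doc-001")  # would now raise PermissionError
```

Even in this toy form, the sketch shows why licensors favor the model: the cap, the access log, and the revocation switch all remain on the licensor's side of the infrastructure.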
But unless the licensor has substantial market leverage, IP exclusivity, or legal constraints, these deals are the exception, not the rule. For everyone else (the startup with a valuable scraped dataset, the academic lab licensing proprietary annotations, the boutique data vendor), the reality is stark. If you want to license your dataset to a major LLM provider, you're probably going to have to send them a copy. And you will need to focus your energy on mitigating the security risks that come with it.
© 2025 Ko IP & AI Law PLLC
Come back next week for Part 2 of this article, which will discuss security architecture and legal contracting strategies for mitigating the risk of misappropriation of your licensed datasets. Verification is the better part of trust, and it cannot be achieved unless those technical and contractual safeguards play well together.
[1] Confidential computing enclaves, also known as Trusted Execution Environments (TEEs), are hardware-based security features that allow sensitive data to be processed in an isolated environment, protected even from system administrators and host operating systems. These enclaves encrypt data in memory during computation, offering an additional layer of protection beyond encryption at rest or in transit. See Sam Lugani & Nelly Porter, Expanded Confidential Computing portfolio and introducing Confidential Accelerators for AI Workloads, Google Cloud Blog (Apr. 10, 2024), https://cloud.google.com/blog/products/identity-security/expanding-confidential-computing-for-ai-workloads-next24.