This article has also been published on The Outrigger Group blog, here.
If data is the new oil, then dataset licensors are the new energy companies, sitting atop a resource that's increasingly essential, uniquely valuable, and perilously hard to control once extracted. But unlike physical oil fields, digital datasets can be copied, reverse-engineered, or embedded into machine learning models in ways that make true containment a technical and legal challenge. In the era of AI, licensing your dataset isn't just about granting access; it's about preventing exfiltration, even after the license has ended.
As AI developers race to fine-tune large language models and multimodal systems on ever-larger collections of text, image, video, and audio data, the licensing stakes have never been higher. What happens when a licensee, acting in bad faith or simply out of carelessness, walks away with your data or with a model that's functionally indistinguishable from it? A core premise of licensing data, rather than selling it outright, is that you get it back when the license ends, but how is that even possible?
I. Local vs. Cloud Licensing Models
There are two basic architectural approaches to dataset licensing, in the AI space and elsewhere: (1) delivering a local copy of the dataset to the licensee for onsite use and (2) retaining full control through cloud-only access models.
A. Onsite Access: Letting the Data Leave the Building
This model, which gives the licensee a local copy of the dataset, is efficient for performance but risky from the licensor's data security perspective. Even with encryption, audit logs, and deletion certifications, it's difficult to guarantee that no data was duplicated or memorized by trained models. The more systems the data touches, the harder it is to clean up post-termination.
B. Cloud-Only Access: You Keep the Oil in the Ground
Here, the dataset never leaves the licensor's infrastructure. The licensee accesses the data through a secure environment: a sandboxed cloud instance, a controlled application programming interface (API), or a confidential computing setup. This makes access revocable, leakage less likely, and deletion a non-issue.
Sidebar: Licensing Datasets? Watch for These 5 Red Flags
Even sophisticated AI companies can stumble when it comes to responsible dataset use. If you’re licensing out valuable training data, keep an eye out for these warning signs:
- Vague Use Restrictions
- No Model Retention Clause
- Missing Post-Termination Obligations
- Silence on Derivative Risks
- Pushback on Cloud-Only Access
Your license isn't just a formality; it's your firewall. Build it accordingly.
II. Why Cloud Access Is Ideal but Rarely the End of the Story
From the perspective of the dataset owner, cloud-only access is the gold standard for security and control. If the dataset never leaves your environment, then access is revocable, logs are centralized, and you avoid the downstream risk of data leakage, unauthorized retention, or circumvention. In the best-case scenario, the licensee logs in to a secure cloud environment, runs approved training jobs, and walks away with only the permitted outputs. No deletion audits, no forensic headaches, no trust fall.
But that's not how most large-scale AI licensing deals actually work.
In practice, major LLM developers (including OpenAI, Meta, and others) generally require that data be delivered as a local copy. That copy is then integrated into their proprietary training pipelines, which often span thousands of GPUs across internal supercomputing clusters. These companies optimize for performance, flexibility, and scale, and thus are unlikely to restructure their infrastructure just to accommodate a single dataset, no matter how sensitive.
That's not to say it's impossible. In rare cases, licensors with truly unique or regulated datasets (e.g., proprietary legal corpora, medical archives, or subscription-only financial data) or those with substantial bargaining power (e.g., government entities or major research institutions) may successfully negotiate cloud-only access terms. These terms might include:
- Remote training via secure virtual machine (VM) environments
- API-based access with metering or query caps (a minimal sketch follows this list)
- Use of confidential computing enclaves[1]
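To make the API-based option above more concrete, here is a minimal, illustrative sketch in plain Python of how a licensor-controlled gateway might meter queries, keep a centralized access log, and cut off a licensee at termination. The class and method names (MeteredDatasetGateway, query, revoke) are hypothetical and not drawn from any real product or from the deals described in this article; a production system would add authentication, persistent storage, and rate limiting behind HTTPS endpoints.

```python
# Minimal sketch (illustrative only): a licensor-side gateway that enforces a
# per-licensee query cap, logs every access centrally, and supports immediate
# revocation. For brevity, caps are not reset over time and nothing persists.
from datetime import datetime, timezone


class MeteredDatasetGateway:
    """Serves licensed records while metering and logging every query."""

    def __init__(self, dataset: dict[str, str], query_cap: int):
        self._dataset = dataset          # the data never leaves this process
        self._query_cap = query_cap      # contractual cap per licensee
        self._counts: dict[str, int] = {}            # licensee_id -> queries used
        self._log: list[tuple[str, str, str]] = []   # (utc_timestamp, licensee, record)
        self._revoked: set[str] = set()

    def revoke(self, licensee_id: str) -> None:
        """Cut off access immediately, e.g., on license termination."""
        self._revoked.add(licensee_id)

    def query(self, licensee_id: str, record_id: str) -> str:
        """Return one record if the licensee is in good standing and under cap."""
        if licensee_id in self._revoked:
            raise PermissionError(f"access revoked for {licensee_id}")
        if self._counts.get(licensee_id, 0) >= self._query_cap:
            raise PermissionError(f"{licensee_id} exceeded cap of {self._query_cap}")
        self._counts[licensee_id] = self._counts.get(licensee_id, 0) + 1
        self._log.append((datetime.now(timezone.utc).isoformat(), licensee_id, record_id))
        return self._dataset[record_id]


# Usage: two queries allowed under a cap of 2, then access is revoked.
gateway = MeteredDatasetGateway({"doc-001": "licensed text..."}, query_cap=2)
print(gateway.query("acme-ai", "doc-001"))
print(gateway.query("acme-ai", "doc-001"))
gateway.revoke("acme-ai")
# gateway.query("acme-ai", "doc-001")  # would now raise PermissionError
```

Even in this toy form, the sketch shows why licensors favor the model: the cap, the access log, and the revocation switch all remain on the licensor's side of the infrastructure.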
But unless the licensor has substantial market leverage, IP exclusivity, or legal constraints, these deals are the exception, not the rule. For everyone else (the startup with a valuable scraped dataset, the academic lab licensing proprietary annotations, the boutique data vendor), the reality is stark. If you want to license your dataset to a major LLM provider, you're probably going to have to send them a copy. And you will need to focus your energy on mitigating the security risks that come with it.
© 2025 Ko IP & AI Law PLLC
Come back next week for Part 2 of this article, which will discuss security architecture and legal contracting strategies for mitigating the risk of misappropriation of your licensed datasets. Verification is the better part of trust, and it cannot be achieved unless those technical and contractual safeguards play well together.
[1] Confidential computing enclaves, also known as Trusted Execution Environments (TEEs), are hardware-based security features that allow sensitive data to be processed in an isolated environment, protected even from system administrators and host operating systems. These enclaves encrypt data in memory during computation, offering an additional layer of protection beyond encryption at rest or in transit. See Sam Lugani & Nelly Porter, Expanded Confidential Computing portfolio and introducing Confidential Accelerators for AI Workloads, Google Cloud Blog (Apr. 10, 2024), https://cloud.google.com/blog/products/identity-security/expanding-confidential-computing-for-ai-workloads-next24.