
Data Is the New Oil: How to License Datasets for AI Without Losing Control (Part 1 of 2)

This article has also been published on The Outrigger Group blog.

If data is the new oil, then dataset licensors are the new energy companies, sitting atop a resource that's increasingly essential, uniquely valuable, and perilously hard to control once extracted. But unlike physical oil fields, digital datasets can be copied, reverse-engineered, or embedded into machine learning models in ways that make true containment a technical and legal challenge. In the era of AI, licensing your dataset isn't just about granting access; it's about preventing exfiltration, even after the license has ended.

As AI developers race to fine-tune large language models and multimodal systems on ever-larger collections of text, image, video, and audio data, the licensing stakes have never been higher. What happens when a licensee, acting in bad faith or simply out of carelessness, walks away with your data or a model that's functionally indistinguishable from it? A primary reason for licensing data, rather than selling it outright, is the ability to take it back when the license ends. But with digital copies and trained models in play, how is that even possible?

I. Local vs. Cloud Licensing Models

There are two basic architectural approaches to dataset licensing, in AI and elsewhere: (1) providing local, onsite access to licensees and (2) retaining full control through cloud-only access models.

A. Onsite Access: Letting the Data Leave the Building

This model, which gives the licensee a local copy of the dataset, is efficient for performance but risky from the licensor's data security perspective. Even with encryption, audit logs, and deletion certifications, it's difficult to guarantee that no data was duplicated or memorized by trained models. The more systems the data touches, the harder it is to clean up post-termination.
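To make the audit problem concrete, here is a minimal Python sketch of a delivery-time fingerprint; the directory name and file layout are hypothetical. The licensor hashes every file it ships, so a post-termination deletion certification can at least be checked against a known manifest.

    import hashlib
    import json
    from pathlib import Path

    def build_manifest(dataset_dir: str) -> dict:
        # Compute a SHA-256 fingerprint for every file in the delivered
        # dataset. The licensor keeps this manifest; at termination, the
        # licensee's deletion certification can be checked against it.
        root = Path(dataset_dir)
        manifest = {}
        for path in sorted(root.rglob("*")):
            if path.is_file():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                manifest[str(path.relative_to(root))] = digest
        return manifest

    if __name__ == "__main__":
        # "dataset_delivery" is a stand-in for the real export directory.
        manifest = build_manifest("dataset_delivery")
        Path("manifest.json").write_text(json.dumps(manifest, indent=2))
        print(f"Fingerprinted {len(manifest)} files.")

The limits are the point: a manifest proves what was delivered, not what was retained, and it says nothing about what a trained model may have memorized.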

B. Cloud-Only Access: You Keep the Oil in the Ground

Here, the dataset never leaves the licensor's infrastructure. The licensee accesses the data through a secure environment: a sandboxed cloud instance, a controlled Application Programming Interface (API), or a confidential computing setup. This makes access revocable, leakage less likely, and deletion a non-issue.
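As a rough illustration of the controlled-API pattern, here is a Python sketch using Flask; the tokens, route, and records are all hypothetical. The raw data stays server-side, every request is checked against a revocable token, and every query lands in a centralized audit log.

    from flask import Flask, abort, jsonify, request

    app = Flask(__name__)

    # Tokens the licensor can revoke at any time; revocation cuts off
    # access immediately, with no local copies to claw back.
    ACTIVE_TOKENS = {"licensee-abc-token"}

    # Stand-in for the dataset; in practice this would query internal
    # storage the licensee can never reach directly.
    RECORDS = {1: {"text": "example record"}, 2: {"text": "another record"}}

    @app.route("/records/<int:record_id>")
    def get_record(record_id):
        token = request.headers.get("Authorization", "")
        if token not in ACTIVE_TOKENS:
            abort(403)  # unknown or revoked token
        app.logger.info("token=%s record=%s", token, record_id)  # audit trail
        record = RECORDS.get(record_id)
        if record is None:
            abort(404)
        return jsonify(record)

    if __name__ == "__main__":
        app.run(port=5000)

Revoking a licensee is a one-line change to ACTIVE_TOKENS; compare that to hunting down local copies after termination.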




II. Why Cloud Access Is Ideal, But Rarely the End of the Story

From the perspective of the dataset owner, cloud-only access is the gold standard for security and control. If the dataset never leaves your environment, then access is revocable, logs are centralized, and you avoid the downstream risk of data leakage, unauthorized retention, or circumvention. In the best-case scenario, the licensee logs in to a secure cloud environment, runs approved training jobs, and walks away with only the permitted outputs. No deletion audits, no forensic headaches, no trust fall.

But that's not how most large-scale AI licensing deals actually work.

In practice, major LLM developers, including OpenAI, Meta, and others, generally require that data be delivered as a local copy. That copy is then integrated into their proprietary training pipelines, which often span thousands of GPUs across internal supercomputing clusters. These companies optimize for performance, flexibility, and scale, and thus are unlikely to restructure their infrastructure just to accommodate a single dataset, no matter how sensitive.

That's not to say it's impossible. In rare cases, licensors with truly unique or regulated datasets (e.g., proprietary legal corpora, medical archives, or subscription-only financial data) or those with substantial bargaining power (e.g., government entities or major research institutions) may successfully negotiate cloud-only access terms. These terms might include:

  • Remote training via secure virtual machine (VM) environments
  • API-based access with metering or query caps (a sketch of the metering logic follows this list)
  • Use of confidential computing enclaves¹
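
For the metering bullet above, here is a hedged sketch of how a per-token daily query cap might be enforced server-side; the cap and window are illustrative numbers a license might specify, not industry standards.

    import time
    from collections import defaultdict

    class QueryMeter:
        # Enforces a hard per-token query cap inside a rolling 24-hour
        # window, so bulk extraction through the API stays bounded even
        # if a credential leaks.
        def __init__(self, cap: int = 10_000, window_seconds: int = 86_400):
            self.cap = cap
            self.window_seconds = window_seconds
            self.counts = defaultdict(int)
            self.window_start = time.time()

        def allow(self, token: str) -> bool:
            if time.time() - self.window_start >= self.window_seconds:
                self.counts.clear()  # new window: reset every counter
                self.window_start = time.time()
            if self.counts[token] >= self.cap:
                return False  # cap reached: deny the query
            self.counts[token] += 1
            return True

    meter = QueryMeter(cap=3)  # tiny cap, just for demonstration
    print([meter.allow("licensee-abc-token") for _ in range(4)])
    # prints [True, True, True, False]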

But unless the licensor has substantial market leverage, IP exclusivity, or legal constraints, these deals are the exception, not the rule. For everyone else (the startup with a valuable scraped dataset, the academic lab licensing proprietary annotations, the boutique data vendor), the reality is stark. If you want to license your dataset to a major LLM provider, you're probably going to have to send them a copy. And you will need to focus your energy on mitigating the security risks that come with it.

© 2025 Ko IP & AI Law PLLC



  1. Confidential computing enclaves, also known as Trusted Execution Environments (TEEs), are hardware-based security features that allow sensitive data to be processed in an isolated environment, protected even from system administrators and host operating systems. These enclaves encrypt data in memory during computation, offering an additional layer of protection beyond encryption at rest or in transit. See Sam Lugani & Nelly Porter, Expanded Confidential Computing portfolio and introducing Confidential Accelerators for AI Workloads, Google Cloud Blog (Apr. 10, 2024), available at https://cloud.google.com/blog/products/identity-security/expanding-confidential-computing-for-ai-workloads-next24.
