Ko IP & AI Law PLLC

Arizona patent lawyer focused on intellectual property & artificial intelligence law. Own your ideas, implement your AI, and mitigate the risks.

Data Is the New Oil: How to License Datasets for AI Without Losing Control (Part 2 of 2)

In Part 1 of this article, we explained the differences between local and cloud licensing models and why cloud access is ideal from the licensor's perspective but often unrealistic due to standard industry practices. We ended by highlighting the need for data licensors to mitigate the security risks that come with providing local copies of datasets to their clients. In Part 2, we will discuss security architecture and legal contracting strategies for mitigating the risk of misappropriation of your licensed datasets.

This article was written at the invitation of the terrific people at the Outrigger Group and is also published on their blog here.

I. Local vs. Cloud Licensing Models

[See Part 1 here]

A. Onsite Access: Letting the Data Leave the Building

[See Part 1 here]

B. Cloud-Only Access: You Keep the Oil in the Ground

[See Part 1 here]

II. Why Cloud Access Is Ideal, but Rarely the End of the Story

[See Part 1 here]

III. Security Architecture for Onsite Access: What Licensors Should Require

To mitigate the risk of post-license data misuse when providing a copy of the dataset to the licensee, licensors should incorporate a multi-layered security architecture. Each layer makes it incrementally harder for a licensee to retain or misuse the dataset and increases the likelihood of detection and legal enforcement.


Key components for consideration include:

  • Encryption at Rest and in Transit: All dataset files should be encrypted, and encryption keys should be revocable upon termination (a minimal envelope-encryption sketch appears after this list).
  • Access Control and Audit Logging: Strict user-level permissions, immutable logging of data access, and real-time monitoring should be enforced.
  • Watermarking and Dataset Fingerprinting: Embed signals in your data that persist through training and can later be used to detect whether a model memorized or retained your dataset. These techniques, including methods like radioactive data,1 inject subtle, statistically engineered patterns that leave detectable traces in model outputs, even after fine-tuning or augmentation. A simplified canary-based illustration follows this list.
  • Privacy-Preserving Model Training: When licensees are permitted to train models on the dataset, licensors may require the use of differential privacy or related techniques to reduce the risk that trained models memorize or reveal raw data, making it harder for malicious actors to extract sensitive information.2 A toy sketch of the underlying mechanism also appears after this list.
  • Trusted Execution Environments (TEEs): Confidential computing enclaves are hardware-based solutions that isolate sensitive data even from system administrators during training.3
  • Model Unlearning Protocols: While still emerging, unlearning algorithms may allow partial "reversals" of training effects when data use is revoked.4
  • Model Behavior Monitoring: Tools are emerging to test whether models regurgitate memorized data, helping detect misuse even if direct data extraction isn't attempted.5
  • Blockchain for Licensing and Access Logging: Utilize blockchain or distributed ledger systems to create tamper-evident access logs, smart contract-enforced usage terms, and cryptographic proofs of dataset ownership and compliance.6
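
To make a few of these layers concrete, the sketches below use Python. First, a minimal illustration of encryption at rest with a licensor-revocable key, built on the open-source `cryptography` package's Fernet recipe. The key-handling workflow shown here (a licensor-held wrapping key plus a per-delivery data key) is an assumption for illustration, not a prescribed architecture.

```python
# Minimal sketch of envelope encryption with a licensor-revocable key.
# The dataset is encrypted with a local data key; that data key is itself
# encrypted ("wrapped") with a key the licensor controls, so revoking or
# deleting the licensor key prevents any future unwrapping.
from cryptography.fernet import Fernet

# Licensor-held key (in practice stored in the licensor's key-management
# system, never shipped with the dataset).
licensor_key = Fernet.generate_key()

# Per-delivery data key, used to encrypt dataset files at rest.
data_key = Fernet.generate_key()
wrapped_data_key = Fernet(licensor_key).encrypt(data_key)

# Encrypt a dataset record for delivery to the licensee.
ciphertext = Fernet(data_key).encrypt(b"row 1: sensitive training example")

# At access time the licensee asks the licensor to unwrap the data key;
# after termination the licensor declines (or deletes licensor_key),
# revoking access to anything still encrypted at rest.
unwrapped_key = Fernet(licensor_key).decrypt(wrapped_data_key)
plaintext = Fernet(unwrapped_key).decrypt(ciphertext)
assert plaintext == b"row 1: sensitive training example"
```

Note that this protects data only while it sits encrypted on disk; once decrypted for training, copies can persist, which is why the remaining layers emphasize detection rather than containment.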

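Next, a simplified stand-in for watermarking and dataset fingerprinting. The radioactive-data technique cited above embeds statistical signals in feature space; the sketch here uses a cruder but easier-to-read idea: deterministic canary strings derived from a licensor-held secret, which can later be searched for in a suspect model's outputs. The function names (`make_canary`, `embed_canaries`, `detect_canaries`), record fields, and marking rate are hypothetical.

```python
# Minimal sketch (not the radioactive-data method from footnote 1): derive
# secret canary strings from a licensor-held key, embed them in a small
# fraction of records before delivery, and later scan model outputs for them.
import hashlib
import hmac
import random

SECRET_KEY = b"licensor-held secret"  # hypothetical key kept by the licensor

def make_canary(record_id: str) -> str:
    """Deterministically derive a unique, hard-to-guess marker string."""
    digest = hmac.new(SECRET_KEY, record_id.encode(), hashlib.sha256).hexdigest()
    return f"[[{digest[:16]}]]"

def embed_canaries(records: list[dict], rate: float = 0.01, seed: int = 0) -> list[dict]:
    """Append canaries to a random ~1% of records in the delivered copy.

    Assumes each record is a dict with "id" and "text" fields.
    """
    rng = random.Random(seed)
    marked = []
    for rec in records:
        rec = dict(rec)
        if rng.random() < rate:
            rec["text"] = rec["text"] + " " + make_canary(rec["id"])
        marked.append(rec)
    return marked

def detect_canaries(model_outputs: list[str], record_ids: list[str]) -> set[str]:
    """Return record IDs whose canaries appear verbatim in sampled model outputs."""
    corpus = "\n".join(model_outputs)
    return {rid for rid in record_ids if make_canary(rid) in corpus}
```

Verbatim canaries are easy to strip if discovered, which is why published methods work at the statistical level, but the workflow is the same: mark before delivery, probe after termination.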

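Finally for this list, a toy sketch of the core mechanism behind differentially private training: per-example gradient clipping plus calibrated Gaussian noise (DP-SGD), shown here for logistic regression with NumPy. The clipping norm and noise multiplier below are illustrative only and do not correspond to a negotiated privacy budget; in practice a licensor would typically require a vetted library and documentation of the resulting (epsilon, delta) guarantee.

```python
# Toy DP-SGD sketch: clip each example's gradient, add Gaussian noise, average.
# Parameters are illustrative, not a calibrated privacy guarantee.
import numpy as np

def dp_sgd_logistic_regression(X, y, *, epochs=5, lr=0.1, batch_size=32,
                               clip_norm=1.0, noise_multiplier=1.1, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            grads = []
            for i in batch:
                p = 1.0 / (1.0 + np.exp(-X[i] @ w))                     # sigmoid
                g = (p - y[i]) * X[i]                                   # per-example gradient
                g *= min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))  # clip to bound influence
                grads.append(g)
            noise = rng.normal(0.0, noise_multiplier * clip_norm, size=d)
            w -= lr * (np.sum(grads, axis=0) + noise) / len(batch)
    return w

# Example usage with synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X @ np.array([1.0, -1.0, 0.5, 0.0, 2.0]) > 0).astype(float)
w_private = dp_sgd_logistic_regression(X, y)
```
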
Each of these technologies comes (or will come) with its own cost, implementation burden, and legal implications. None are robust or mature enough to claw back data or force model forgetting, but they serve as valuable post-termination detection and enforcement mechanisms. Think of them not as containment tools, but as forensic triggers: they can't prevent misuse, but they can help identify and prove it, especially when paired with strong contractual remedies and audit rights.

For licensors navigating the power asymmetry of data licensing in the GenAI era, the principle remains the same: trust but verify. That means designing both the architecture and the agreement to assume good faith while still preparing for bad behavior.

IV. Legal Strategies: Contracts as Containment

Technology can slow down misuse. But ultimately, it's the contract that draws the red lines, and the willingness to enforce them that keeps licensees honest.

If you're licensing high-value datasets for AI training, your agreement shouldn't resemble a boilerplate software end-user license agreement (EULA) when what you're licensing is the crown jewel of your AI value proposition. It should be purpose-built for the realities of data leakage, model memorization, and post-termination risk, and tailored to the architecture you choose.

Here are some key strategic considerations:

  • Purpose-limited use: Define exactly what the licensee is permitted to do, and nothing more.
  • Post-termination obligations: Set clear requirements for deletion, certification, and (where applicable) model handling.
  • Reverse engineering and misuse: Prohibit any attempt to extract or replicate the dataset from trained models.
  • Audit rights: Even if never exercised, they serve as a powerful deterrent.
  • Remedies with bite: Strong agreements support injunctive relief, damages, and attorneys' fees.
  • Security expectations: Especially for onsite access, basic standards are necessary, including encryption, access control, and breach notice.
  • Watermark acknowledgment: If you're embedding detection mechanisms, reserve the right to use them, and prohibit tampering.
  • Blockchain-recorded compliance: Incorporate blockchain-based verification of access logs, license conditions, and dataset provenance as enforceable elements of the agreement (a simplified tamper-evident log sketch follows this list).
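
As a concrete reference point for the audit-rights and blockchain-recorded-compliance items above, the sketch below shows the minimal tamper-evidence idea in Python: each access-log entry commits to the hash of the previous entry, so later alteration is detectable on verification. A production deployment would anchor these hashes to a distributed ledger or a trusted timestamping service; the field names here are hypothetical.

```python
# Minimal tamper-evident access log: each entry includes the hash of the
# previous entry, so any later alteration breaks the chain on verification.
import hashlib
import json
import time

def append_entry(log: list[dict], event: dict) -> dict:
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"ts": time.time(), "event": event, "prev_hash": prev_hash}
    body["hash"] = hashlib.sha256(
        json.dumps({k: body[k] for k in ("ts", "event", "prev_hash")},
                   sort_keys=True).encode()
    ).hexdigest()
    log.append(body)
    return body

def verify_chain(log: list[dict]) -> bool:
    prev_hash = "0" * 64
    for entry in log:
        expected = hashlib.sha256(
            json.dumps({"ts": entry["ts"], "event": entry["event"],
                        "prev_hash": prev_hash}, sort_keys=True).encode()
        ).hexdigest()
        if entry["hash"] != expected or entry["prev_hash"] != prev_hash:
            return False
        prev_hash = entry["hash"]
    return True

# Usage: record dataset accesses as they happen, then verify during an audit.
log: list[dict] = []
append_entry(log, {"user": "analyst-01", "action": "read", "object": "dataset-v3"})
append_entry(log, {"user": "analyst-02", "action": "export", "object": "dataset-v3"})
assert verify_chain(log)
```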

Bottom line: Your architecture and your license should speak the same language, or risk talking past each other when it matters most. A contract without technical controls is wishful thinking. Technical controls without a contract are an invitation for exploitation.

V. Conclusion: Control Is a Design Choice and a Legal Strategy

Licensing datasets for AI training is no longer a simple matter of "give and get." In a world where models can memorize, regenerate, and distribute the data they're trained on, the risk isn't just that your dataset might be copied; it's that it might be embedded.

That's why dataset control in the AI era isn't just about infrastructure. It's about intentionality. Every decision (cloud vs. onsite access, watermarking, audit rights, model retention) is a tradeoff between usability and security, speed and certainty, collaboration and containment.

There is no one-size-fits-all architecture, and no contract clause that can substitute for good design. But with the right mix of legal terms, technical controls, and business judgment, licensors can protect their data assets without stifling innovation. And if your dataset really is the new oil, then you need to treat it like a natural resource: valuable, finite, and worth protecting with everything you've got.

© 2025 Ko IP & AI Law PLLC



  1. See Alexandre Sablayrolles et al., Radioactive Data: Tracing Through Training, Proceedings of the 37th International Conference on Machine Learning, PMLR 119:8326 (2020), available at https://proceedings.mlr.press/v119/sablayrolles20a.html.
  2. See Joseph P. Near et al., Guidelines for Evaluating Differential Privacy Guarantees, National Institute of Standards and Technology, NIST Special Publication 800-226 (Mar. 2025), available at https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-226.pdf.
  3. Confidential computing enclaves, also known as Trusted Execution Environments (TEEs), are hardware-based security features that allow sensitive data to be processed in an isolated environment, protected even from system administrators and host operating systems. These enclaves encrypt data in memory during computation, offering an additional layer of protection beyond encryption at rest or in transit. See Sam Lugani & Nelly Porter, Expanded Confidential Computing portfolio and introducing Confidential Accelerators for AI Workloads, Google Cloud Blog (Apr. 10, 2024), available at https://cloud.google.com/blog/products/identity-security/expanding-confidential-computing-for-ai-workloads-next24.
  4. See Ken Ziyu Liu, Machine Unlearning in 2024, Stanford AI Lab Blog (May 2024), available at https://ai.stanford.edu/~kzliu/blog/unlearning.
  5. See Buse Gul Atli Tekgul & N. Asokan, On the Effectiveness of Dataset Watermarking in Adversarial Settings, arXiv:2202.12506 (Feb. 2022), available at https://arxiv.org/abs/2202.12506.
  6. See Primavera De Filippi & Aaron Wright, Blockchain and the Law: The Rule of Code (Harvard Univ. Press 2018).
