Ko IP & AI Law PLLC

Arizona patent lawyer focused on intellectual property & artificial intelligence law. Own your ideas, implement your AI, and mitigate the risks.

Data Is the New Oil: How to License Datasets for AI Without Losing Control (Part 2 of 2)

In Part 1 of this article, we explained the differences between local and cloud licensing models and why cloud access is ideal from the licensor's perspective but often unrealistic due to standard industry practices. We ended by highlighting the need for data licensors to mitigate the security risks that come with providing local copies of datasets to their clients. In Part 2, we will discuss security architecture and legal contracting strategies for mitigating the risk of misappropriation of your licensed datasets.

This article was written at the invitation of the terrific people at the Outrigger Group and is also published on their blog here.

I. Local vs. Cloud Licensing Models

[See Part 1 here]

A. Onsite Access: Letting the Data Leave the Building

[See Part 1 here]

B. Cloud-Only Access: You Keep the Oil in the Ground

[See Part 1 here]

II. Why Cloud Access Is Ideal, but Rarely the End of the Story

[See Part 1 here]

III. Security Architecture for Onsite Access: What Licensors Should Require

To mitigate the risk of post-license data misuse when providing a copy of the dataset to the licensee, licensors should incorporate a multi-layered security architecture. Each layer makes it incrementally harder for a licensee to retain or misuse the dataset and increases the likelihood of detection and legal enforcement.


Key components for consideration include:

  • Encryption at Rest and in Transit: All dataset files should be encrypted, and encryption keys should be revocable upon termination (a minimal envelope-encryption sketch appears after this list).
  • Access Control and Audit Logging: Strict user-level permissions, immutable logging of data access, and real-time monitoring should be enforced.
  • Watermarking and Dataset Fingerprinting: Embed signals in your data that persist through training and can later be used to detect whether a model memorized or retained your dataset. These techniques, including methods like radioactive data,1 inject subtle, statistically engineered patterns that leave detectable traces in model outputs, even after fine-tuning or augmentation. A simplified canary-based illustration follows this list.
  • Privacy-Preserving Model Training: When licensees are permitted to train models on the dataset, licensors may require the use of differential privacy or related techniques to reduce the risk that trained models memorize or reveal raw data, making it harder for malicious actors to extract sensitive information.2 A toy sketch of the underlying mechanism also appears after this list.
  • Trusted Execution Environments (TEEs): Confidential computing enclaves are hardware-based solutions that isolate sensitive data even from system administrators during training.3
  • Model Unlearning Protocols: While still emerging, unlearning algorithms may allow partial "reversals" of training effects when data use is revoked.4
  • Model Behavior Monitoring: Tools are emerging to test whether models regurgitate memorized data, helping detect misuse even if direct data extraction isn't attempted.5
  • Blockchain for Licensing and Access Logging: Utilize blockchain or distributed ledger systems to create tamper-evident access logs, smart contract-enforced usage terms, and cryptographic proofs of dataset ownership and compliance.6
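
To make a few of these layers concrete, the sketches below use Python. First, a minimal illustration of encryption at rest with a licensor-revocable key, built on the open-source `cryptography` package's Fernet recipe. The key-handling workflow shown here (a licensor-held wrapping key plus a per-delivery data key) is an assumption for illustration, not a prescribed architecture.

```python
# Minimal sketch of envelope encryption with a licensor-revocable key.
# The dataset is encrypted with a local data key; that data key is itself
# encrypted ("wrapped") with a key the licensor controls, so revoking or
# deleting the licensor key prevents any future unwrapping.
from cryptography.fernet import Fernet

# Licensor-held key (in practice stored in the licensor's key-management
# system, never shipped with the dataset).
licensor_key = Fernet.generate_key()

# Per-delivery data key, used to encrypt dataset files at rest.
data_key = Fernet.generate_key()
wrapped_data_key = Fernet(licensor_key).encrypt(data_key)

# Encrypt a dataset record for delivery to the licensee.
ciphertext = Fernet(data_key).encrypt(b"row 1: sensitive training example")

# At access time the licensee asks the licensor to unwrap the data key;
# after termination the licensor declines (or deletes licensor_key),
# revoking access to anything still encrypted at rest.
unwrapped_key = Fernet(licensor_key).decrypt(wrapped_data_key)
plaintext = Fernet(unwrapped_key).decrypt(ciphertext)
assert plaintext == b"row 1: sensitive training example"
```

Note that this protects data only while it sits encrypted on disk; once decrypted for training, copies can persist, which is why the remaining layers emphasize detection rather than containment.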

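Next, a simplified stand-in for watermarking and dataset fingerprinting. The radioactive-data technique cited above embeds statistical signals in feature space; the sketch here uses a cruder but easier-to-read idea: deterministic canary strings derived from a licensor-held secret, which can later be searched for in a suspect model's outputs. The function names (`make_canary`, `embed_canaries`, `detect_canaries`), record fields, and marking rate are hypothetical.

```python
# Minimal sketch (not the radioactive-data method from footnote 1): derive
# secret canary strings from a licensor-held key, embed them in a small
# fraction of records before delivery, and later scan model outputs for them.
import hashlib
import hmac
import random

SECRET_KEY = b"licensor-held secret"  # hypothetical key kept by the licensor

def make_canary(record_id: str) -> str:
    """Deterministically derive a unique, hard-to-guess marker string."""
    digest = hmac.new(SECRET_KEY, record_id.encode(), hashlib.sha256).hexdigest()
    return f"[[{digest[:16]}]]"

def embed_canaries(records: list[dict], rate: float = 0.01, seed: int = 0) -> list[dict]:
    """Append canaries to a random ~1% of records in the delivered copy.

    Assumes each record is a dict with "id" and "text" fields.
    """
    rng = random.Random(seed)
    marked = []
    for rec in records:
        rec = dict(rec)
        if rng.random() < rate:
            rec["text"] = rec["text"] + " " + make_canary(rec["id"])
        marked.append(rec)
    return marked

def detect_canaries(model_outputs: list[str], record_ids: list[str]) -> set[str]:
    """Return record IDs whose canaries appear verbatim in sampled model outputs."""
    corpus = "\n".join(model_outputs)
    return {rid for rid in record_ids if make_canary(rid) in corpus}
```

Verbatim canaries are easy to strip if discovered, which is why published methods work at the statistical level, but the workflow is the same: mark before delivery, probe after termination.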

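Finally for this list, a toy sketch of the core mechanism behind differentially private training: per-example gradient clipping plus calibrated Gaussian noise (DP-SGD), shown here for logistic regression with NumPy. The clipping norm and noise multiplier below are illustrative only and do not correspond to a negotiated privacy budget; in practice a licensor would typically require a vetted library and documentation of the resulting (epsilon, delta) guarantee.

```python
# Toy DP-SGD sketch: clip each example's gradient, add Gaussian noise, average.
# Parameters are illustrative, not a calibrated privacy guarantee.
import numpy as np

def dp_sgd_logistic_regression(X, y, *, epochs=5, lr=0.1, batch_size=32,
                               clip_norm=1.0, noise_multiplier=1.1, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            grads = []
            for i in batch:
                p = 1.0 / (1.0 + np.exp(-X[i] @ w))                     # sigmoid
                g = (p - y[i]) * X[i]                                   # per-example gradient
                g *= min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))  # clip to bound influence
                grads.append(g)
            noise = rng.normal(0.0, noise_multiplier * clip_norm, size=d)
            w -= lr * (np.sum(grads, axis=0) + noise) / len(batch)
    return w

# Example usage with synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X @ np.array([1.0, -1.0, 0.5, 0.0, 2.0]) > 0).astype(float)
w_private = dp_sgd_logistic_regression(X, y)
```
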
Each of these technologies comes (or will come) with its own cost, implementation burden, and legal implications. None are robust or mature enough to claw back data or force model forgetting, but they serve as valuable post-termination detection and enforcement mechanisms. Think of them not as containment tools, but as forensic triggers: they can't prevent misuse, but they can help identify and prove it, especially when paired with strong contractual remedies and audit rights.

For licensors navigating the power asymmetry of data licensing in the GenAI era, the principle remains the same: trust but verify. That means designing both the architecture and the agreement to assume good faith while still preparing for bad behavior.

IV. Legal Strategies: Contracts as Containment

Technology can slow down misuse. But ultimately, it's the contract that draws the red lines, and the willingness to enforce them that keeps licensees honest.

If you're licensing high-value datasets for AI training, your agreement shouldn't resemble a boilerplate software end-user license agreement (EULA) when what you're licensing is the crown jewel of your AI value proposition. It should be purpose-built for the realities of data leakage, model memorization, and post-termination risk, and tailored to the architecture you choose.

Here are some key strategic considerations:

  • Purpose-limited use: Define exactly what the licensee is permitted to do, and nothing more.
  • Post-termination obligations: Set clear requirements for deletion, certification, and (where applicable) model handling.
  • Reverse engineering and misuse: Prohibit any attempt to extract or replicate the dataset from trained models.
  • Audit rights: Even if never exercised, they serve as a powerful deterrent.
  • Remedies with bite: Strong agreements support injunctive relief, damages, and attorneys' fees.
  • Security expectations: Especially for onsite access, basic standards are necessary, including encryption, access control, and breach notice.
  • Watermark acknowledgment: If you're embedding detection mechanisms, reserve the right to use them, and prohibit tampering.
  • Blockchain-recorded compliance: Incorporate blockchain-based verification of access logs, license conditions, and dataset provenance as enforceable elements of the agreement (a simplified tamper-evident log sketch follows this list).
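
As a concrete reference point for the audit-rights and blockchain-recorded-compliance items above, the sketch below shows the minimal tamper-evidence idea in Python: each access-log entry commits to the hash of the previous entry, so later alteration is detectable on verification. A production deployment would anchor these hashes to a distributed ledger or a trusted timestamping service; the field names here are hypothetical.

```python
# Minimal tamper-evident access log: each entry includes the hash of the
# previous entry, so any later alteration breaks the chain on verification.
import hashlib
import json
import time

def append_entry(log: list[dict], event: dict) -> dict:
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"ts": time.time(), "event": event, "prev_hash": prev_hash}
    body["hash"] = hashlib.sha256(
        json.dumps({k: body[k] for k in ("ts", "event", "prev_hash")},
                   sort_keys=True).encode()
    ).hexdigest()
    log.append(body)
    return body

def verify_chain(log: list[dict]) -> bool:
    prev_hash = "0" * 64
    for entry in log:
        expected = hashlib.sha256(
            json.dumps({"ts": entry["ts"], "event": entry["event"],
                        "prev_hash": prev_hash}, sort_keys=True).encode()
        ).hexdigest()
        if entry["hash"] != expected or entry["prev_hash"] != prev_hash:
            return False
        prev_hash = entry["hash"]
    return True

# Usage: record dataset accesses as they happen, then verify during an audit.
log: list[dict] = []
append_entry(log, {"user": "analyst-01", "action": "read", "object": "dataset-v3"})
append_entry(log, {"user": "analyst-02", "action": "export", "object": "dataset-v3"})
assert verify_chain(log)
```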

Bottom line: Your architecture and your license should speak the same language, or risk talking past each other when it matters most. A contract without technical controls is wishful thinking. Technical controls without a contract are an invitation for exploitation.

V. Conclusion: Control Is a Design Choice and a Legal Strategy

Licensing datasets for AI training is no longer a simple matter of "give and get." In a world where models can memorize, regenerate, and distribute the data they're trained on, the risk isn't just that your dataset might be copied; it's that it might be embedded.

That's why dataset control in the AI era isn't just about infrastructure. It's about intentionality. Every decision (cloud vs. onsite access, watermarking, audit rights, model retention) is a tradeoff between usability and security, speed and certainty, collaboration and containment.

There is no one-size-fits-all architecture, and no contract clause that can substitute for good design. But with the right mix of legal terms, technical controls, and business judgment, licensors can protect their data assets without stifling innovation. And if your dataset really is the new oil, then you need to treat it like a natural resource: valuable, finite, and worth protecting with everything you've got.

© 2025 Ko IP & AI Law PLLC



  1. See Alexandre Sablayrolles et al., Radioactive Data: Tracing Through Training, Proceedings of the 37th International Conference on Machine Learning, PMLR 119:8326 (2020), available at https://proceedings.mlr.press/v119/sablayrolles20a.html.
  2. See Joseph P. Near et al., Guidelines for Evaluating Differential Privacy Guarantees, National Institute of Standards and Technology, NIST Special Publication 800-226 (Mar. 2025), available at https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-226.pdf.
  3. Confidential computing enclaves, also known as Trusted Execution Environments (TEEs), are hardware-based security features that allow sensitive data to be processed in an isolated environment, protected even from system administrators and host operating systems. These enclaves encrypt data in memory during computation, offering an additional layer of protection beyond encryption at rest or in transit. See Sam Lugani & Nelly Porter, Expanded Confidential Computing portfolio and introducing Confidential Accelerators for AI Workloads, Google Cloud Blog (Apr. 10, 2024), available at https://cloud.google.com/blog/products/identity-security/expanding-confidential-computing-for-ai-workloads-next24.
  4. See Ken Ziyu Liu, Machine Unlearning in 2024, Stanford AI Lab Blog (May 2024), available at https://ai.stanford.edu/~kzliu/blog/unlearning.
  5. See Buse Gul Atli Tekgul & N. Asokan, On the Effectiveness of Dataset Watermarking in Adversarial Settings, arXiv:2202.12506 (Feb. 2022), available at https://arxiv.org/abs/2202.12506.
  6. See Primavera De Filippi & Aaron Wright, Blockchain and the Law: The Rule of Code (Harvard Univ. Press 2018).
