In Part 1 of this article, we explained the differences between local and cloud licensing models and why cloud access is ideal from the licensor's perspective but often unrealistic given standard industry practices. We ended by highlighting the need for data licensors to mitigate the security risks that come with providing local copies of datasets to their clients. In Part 2, we discuss security architecture and legal contracting strategies for mitigating the risk of misappropriation of your licensed datasets.
This article was written at the invitation of the terrific people at the Outrigger Group and is also published on their blog here.
I. Local vs. Cloud Licensing Models
[See Part 1 here]
A. Onsite Access: Letting the Data Leave the Building
[See Part 1 here]
B. Cloud-Only Access: You Keep the Oil in the Ground
[See Part 1 here]
II. Why Cloud Access Is Ideal, but Rarely the End of the Story
[See Part 1 here]
III. Security Architecture for Onsite Access: What Licensors Should Require
To mitigate the risk of post-license data misuse when providing a copy of the dataset to the licensee, licensors should incorporate a multi-layered security architecture. Each layer makes it incrementally harder for a licensee to retain or misuse the dataset and increases the likelihood of detection and legal enforcement.
Key components for consideration include:
- Encryption at Rest and in Transit: All dataset files should be encrypted, and encryption keys should be revocable upon termination (see the first sketch after this list).
- Access Control and Audit Logging: Strict user-level permissions, immutable logging of data access, and real-time monitoring should be enforced (a tamper-evident logging sketch follows this list).
- Watermarking and Dataset Fingerprinting: Embed signals in your data that persist through training and can later be used to detect whether a model memorized or retained your dataset. These techniques, including methods like radioactive data,[1] inject subtle, statistically engineered patterns that leave detectable traces in model outputs, even after fine-tuning or augmentation (a simplified canary sketch follows this list).
- Privacy-Preserving Model Training: When licensees are permitted to train models on the dataset, licensors may require the use of differential privacy or related techniques to reduce the risk that trained models memorize or reveal raw data, making it harder for malicious actors to extract sensitive information.[2] (A minimal DP-SGD sketch follows this list.)
- Trusted Execution Environments (TEEs): Confidential computing enclaves are hardware-based solutions that isolate sensitive data, even from system administrators, during training.[3]
- Model Unlearning Protocols: While still emerging, unlearning algorithms may allow partial "reversals" of training effects when data use is revoked.[4]
- Model Behavior Monitoring: Tools are emerging to test whether models regurgitate memorized data, helping detect misuse even if direct data extraction isn't attempted.[5]
- Blockchain for Licensing and Access Logging: Use blockchain or distributed ledger systems to create tamper-evident access logs, smart contract-enforced usage terms, and cryptographic proofs of dataset ownership and compliance.[6]
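To make the encryption layer concrete, here is a minimal sketch of encryption at rest with a licensor-revocable key, using the open-source `cryptography` package. The file names and key-handling flow are illustrative assumptions; in practice the key would live in a licensor-controlled key-management service, so that revocation upon termination amounts to withholding the key. Note the honest limitation: the licensee must decrypt to train, so this layer limits exposure at rest rather than preventing retention of decrypted copies.

```python
# Minimal sketch: encrypting a dataset file at rest with a revocable key.
# Requires the `cryptography` package (pip install cryptography).
# File names and the key flow are illustrative assumptions.
from cryptography.fernet import Fernet

# Licensor generates and retains the data-encryption key.
# Revocation on termination = destroying or withholding this key.
key = Fernet.generate_key()
cipher = Fernet(key)

# Encrypt the dataset before delivery; only ciphertext sits on disk.
with open("dataset.parquet", "rb") as f:
    ciphertext = cipher.encrypt(f.read())
with open("dataset.parquet.enc", "wb") as f:
    f.write(ciphertext)

# Licensee can decrypt only while the licensor still serves the key.
plaintext = Fernet(key).decrypt(ciphertext)
```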
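For the audit-logging bullet (and, in spirit, the blockchain bullet), a tamper-evident log can be as simple as a hash chain: each entry commits to the hash of the previous entry, so any after-the-fact edit or deletion breaks the chain on verification. This is a self-contained illustration, not a production logging system:

```python
# Sketch of a tamper-evident (hash-chained) access log in plain Python.
import hashlib
import json
import time

def append_entry(log, user, action):
    """Append an access event that commits to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    entry = {"ts": time.time(), "user": user, "action": action, "prev": prev_hash}
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    log.append(entry)

def verify_chain(log):
    """Recompute every hash; any edited or removed entry breaks the chain."""
    prev = "0" * 64
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True

log = []
append_entry(log, "analyst@licensee.example", "read:shard-0001")
append_entry(log, "analyst@licensee.example", "export:denied")
assert verify_chain(log)
```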
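The watermarking bullet cites radioactive data, which statistically perturbs features; the sketch below is not that technique but a much cruder cousin of the same idea: seeding each licensee's copy with unique "canary" strings and later probing a suspect model for them. `query_model` is a hypothetical stand-in for whatever interface you have to the model, and the prompt format is an assumption.

```python
# Simplified canary-based fingerprinting, NOT the cited radioactive-data
# method. `query_model` is a hypothetical callable returning model text.
import hashlib
import secrets

def make_canary(licensee_id: str) -> str:
    """A unique, unguessable marker seeded into this licensee's copy."""
    nonce = secrets.token_hex(8)
    tag = hashlib.sha256(f"{licensee_id}:{nonce}".encode()).hexdigest()[:16]
    return f"zx-canary-{tag}"

def probe(query_model, canary: str) -> bool:
    """If the model completes the canary's unique tail, that is evidence
    (not proof) that the marked copy appeared in its training data."""
    prefix, tail = canary[:-6], canary[-6:]
    completion = query_model(f"Continue exactly: {prefix}")
    return tail in completion
```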
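Finally, for privacy-preserving training: the core of DP-SGD, the most common differentially private training method, is per-example gradient clipping plus calibrated Gaussian noise. The NumPy sketch below shows that single step with illustrative constants; a real deployment would use a vetted library (e.g., Opacus or TensorFlow Privacy) and a proper privacy accountant to track the overall privacy budget.

```python
# Core DP-SGD step (clip + noise) in plain NumPy; constants illustrative.
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, seed=None):
    """Clip each example's gradient to bound its influence, average,
    then add Gaussian noise scaled to the clipping bound."""
    rng = np.random.default_rng(seed)
    clipped = [
        g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
        for g in per_example_grads
    ]
    mean = np.mean(clipped, axis=0)
    noise = rng.normal(
        0.0, noise_multiplier * clip_norm / len(clipped), size=mean.shape
    )
    return mean + noise

# Toy usage: three per-example gradients for a two-parameter model.
grads = [np.array([0.5, -1.2]), np.array([3.0, 0.1]), np.array([-0.2, 0.4])]
print(dp_sgd_step(grads, seed=0))
```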
Each of these technologies comes (or will come) with its own cost, implementation burden, and legal implications. None is robust or mature enough to claw back data or force a model to forget, but they serve as valuable post-termination detection and enforcement mechanisms. Think of them not as containment tools but as forensic triggers: they can't prevent misuse, but they can help identify and prove it, especially when paired with strong contractual remedies and audit rights.
For licensors navigating the power asymmetry of data licensing in the GenAI era, the principle remains the same: trust but verify. That means designing the architecture, and the agreement, to assume good faith while still preparing for bad behavior.
IV. Legal Strategies: Contracts as Containment
Technology can slow down misuse. But ultimately, it's the contract that draws the red lines, and the willingness to enforce them that keeps licensees honest.
If you’re licensing high-value datasets for AI training, your agreement shouldnโt resemble a boilerplate software end-user licensing agreement (EULA) when what youโre licensing is the crown jewel of your AI value proposition. It should be purpose-built for the realities of data leakage, model memorization, and post-termination riskโand tailored to the architecture you choose.
Here are some key strategic considerations:
- Purpose-limited use: Define exactly what the licensee is permitted to do, and nothing more.
- Post-termination obligations: Set clear requirements for deletion, certification, and (where applicable) model handling.
- Reverse engineering and misuse: Prohibit any attempt to extract or replicate the dataset from trained models.
- Audit rights: Even if never exercised, they serve as a powerful deterrent.
- Remedies with bite: Strong agreements support injunctive relief, damages, and attorneys' fees.
- Security expectations: Especially for onsite access, set baseline standards, including encryption, access control, and breach notification.
- Watermark acknowledgment: If you're embedding detection mechanisms, reserve the right to use them, and prohibit tampering.
- Blockchain-recorded compliance: Incorporate blockchain-based verification of access logs, license conditions, and dataset provenance as enforceable elements of the agreement.
Bottom line: Your architecture and your license should speak the same language, or they risk talking past each other when it matters most. A contract without technical controls is wishful thinking. Technical controls without a contract are an invitation for exploitation.
V. Conclusion: Control Is a Design Choice and a Legal Strategy
Licensing datasets for AI training is no longer a simple matter of "give and get." In a world where models can memorize, regenerate, and distribute the data they're trained on, the risk isn't just that your dataset might be copied; it's that it might be embedded.
That's why dataset control in the AI era isn't just about infrastructure. It's about intentionality. Every decision (cloud vs. onsite access, watermarking, audit rights, model retention) is a tradeoff between usability and security, speed and certainty, collaboration and containment.
There is no one-size-fits-all architecture, and no contract clause that can substitute for good design. But with the right mix of legal terms, technical controls, and business judgment, licensors can protect their data assets without stifling innovation. And if your dataset really is the new oil, then you need to treat it like a natural resource: valuable, finite, and worth protecting with everything you've got.
© 2025 Ko IP & AI Law PLLC
1. See Alexandre Sablayrolles et al., Radioactive Data: Tracing Through Training, Proceedings of the 37th International Conference on Machine Learning, PMLR 119:8326 (2020), available at https://proceedings.mlr.press/v119/sablayrolles20a.html.
2. See Joseph P. Near et al., Guidelines for Evaluating Differential Privacy Guarantees, National Institute of Standards and Technology, NIST Special Publication 800-226 (Mar. 2025), available at https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-226.pdf.
3. Confidential computing enclaves, also known as Trusted Execution Environments (TEEs), are hardware-based security features that allow sensitive data to be processed in an isolated environment, protected even from system administrators and host operating systems. These enclaves encrypt data in memory during computation, offering an additional layer of protection beyond encryption at rest or in transit. See Sam Lugani & Nelly Porter, Expanded Confidential Computing portfolio and introducing Confidential Accelerators for AI Workloads, Google Cloud Blog (Apr. 10, 2024), available at https://cloud.google.com/blog/products/identity-security/expanding-confidential-computing-for-ai-workloads-next24.
4. See Ken Ziyu Liu, Machine Unlearning in 2024, Stanford AI Lab Blog (May 2024), available at https://ai.stanford.edu/~kzliu/blog/unlearning.
5. See Buse Gul Atli Tekgul & N. Asokan, On the Effectiveness of Dataset Watermarking in Adversarial Settings, arXiv:2202.12506 (Feb. 2022), available at https://arxiv.org/abs/2202.12506.
6. See Primavera De Filippi & Aaron Wright, Blockchain and the Law: The Rule of Code (Harvard Univ. Press 2018).