When GPT Spills the Tea:

Comprehensive Assessment of Knowledge File Leakage in GPTs

ACL 2025

1CISPA Helmholtz Center for Information Security, 2Flexera

TL;DR

  • This study reveals that knowledge files in LLM agents can be easily leaked through various vectors in the LLM agent data supply chain, leading to potential copyright infringement and other security risks.
  • Through analysing 651K GPTs, we identify five leakage vectors: 1) metadata, 2) GPT initialization, 3) retrieval, 4) sandboxed execution environments (SEE), and 5) prompts.
  • Activating the built-in tool Code Interpreter can enable adversaries to directly download original knowledge files with a 95.95% success rate.
  • 28.80% of leaked files are copyrighted, including digital copies from major publishers and internal materials from a listed company.

Research Gap

  • ML security: Most work focuses on attacks via model inputs/outputs, ignoring the broader data flow.
  • Web security: GPTs handle files across clients, servers, and databases, but leakage risks in the full data supply chain are underexplored.
Knowledge file data in GPT data supply chain

Knowledge File Leakage Assessment

We propose the workflow for assessing knowledge file leakage, including four phases:

  1. Data discovery: Identify three primary sources of knowledge file data, i.e., metadata flows , responses
  2. Data classification: Classify knowledge file data into seven dimensions based on their sensitivity and significance
  3. Risk assessment: Assess knowledge file leakage across the three data sources and seven dimensions
  4. Mitigation: Provide actionable suggestions for both GPT builders and platform providers
Overview of the DSPM-driven risk assessment workflow

Leakage Vectors of Knowledge Files

With the above workflow, we identify 5 key knowledge file leakage vectors in the GPT data supply chain:

: fully accessible; : partially accessible or potentially contains hallucinations. "CI" denotes Code Interpreter.
Leakage Vector Data Source Leakage Cause CI Leaked Data
ID Type Count Size Title Content File
Metadata Metadata Excessive Information Exposure - - - - -
Initialization Flow Excessive Information Exposure - - -
Retrieval Flow Excessive Information Exposure - - - - -
SEE Response Broken Access Control
Prompt Response Broken Access Control - -

🔍 Watch our live demo of knowledge file leakage in action!



How to Mitigate?

To mitigate the risks of knowledge file leakage, we provide the following suggestions for GPT builders and platform providers (details can be found in the full paper ):

👨‍💻 GPT Builders
  • Disable the Code Interpreter tool when not necessary
  • Avoid uploading copyrighted or sensitive knowledge files
  • Add defense instructions into the GPT system prompt
  • … …
💻 Platform Providers
  • Pre-check knowledge files before allowing users to update them
  • Enhance the security of the Code Interpreter tool
  • Redesign the API
  • … …

Other Platforms

To demonstrate the generability of our workflow, we also apply it to two other LLM platforms: Poe and FlowGPT. Details can be found in the full paper.


Ethics and Disclosures

This study involves online data collection and analysis of knowledge file leakage in GPTs, raising ethical considerations. Our study was approved by our institution’s Ethical Review Board. Data is securely stored and only authors conduct annotations to prevent copyright issues; all personal information is removed. We responsibly disclosed our findings to the platform providers. We believe sharing these results helps improve the security of GPT systems.


BibTeX

If you find this useful in your research, please consider citing:

@inproceedings{SSBZ25,
  author = {Xinyue Shen and Yun Shen and Michael Backes and Yang Zhang},
  title = {{When GPT Spills the Tea: Comprehensive Assessment of Knowledge File Leakage in GPTs}},
  booktitle = {{Annual Meeting of the Association for Computational Linguistics (ACL)}},
  publisher = {ACL},
  year = {2025}
}