Fine-tuning large language models on sensitive data without leaking that data through the trained model remains one of the hardest open problems in applied AI privacy.
This report presents a reproducible benchmark across three open-weight models (8B–70B parameters) that measures perplexity degradation and downstream task accuracy as a function of the (ε, δ) differential privacy budget.
We release the training pipeline, evaluation harness, and a decision matrix linking ε ranges to product risk tiers.