Membership inference attacks on retrieval-augmented generation

Retrieval-augmented generation (RAG) is widely assumed to preserve the privacy of the underlying corpus because individual documents are never incorporated into the model weights.

Through a series of black-box experiments against four production-style RAG stacks, we show that an attacker can determine, with more than 80% precision, whether a target document is in the index, using only the model's responses.

We outline two mitigations: query-time differential privacy on retrieval scores, and a defence-in-depth pattern using semantic redaction at indexing time.
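
To make the first mitigation concrete, the following is a minimal sketch of one way query-time noise could be applied to retrieval scores before the top-k documents are selected. It is an illustration under stated assumptions, not the implementation evaluated here: the function name noisy_top_k, its parameters, the choice of Laplace noise, and the sensitivity value are all hypothetical, and the sketch does not by itself establish a formal privacy guarantee, which would depend on how the score sensitivity is bounded and how the privacy budget is accounted for across the k selections and across queries.

```python
import random
from typing import List, Optional, Sequence


def noisy_top_k(
    scores: Sequence[float],
    k: int,
    epsilon: float,
    sensitivity: float = 1.0,  # assumed bound on one document's effect on a score
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Return the indices of the k highest scores after adding Laplace noise.

    Hypothetical helper for illustration only. The noise scale is set to
    sensitivity / epsilon, the usual Laplace-mechanism calibration.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and len(scores)")

    rng = rng or random.Random()
    scale = sensitivity / epsilon

    # The difference of two independent Exp(1) draws is a Laplace(0, 1) sample.
    noisy = [
        s + scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
        for s in scores
    ]

    # Rank documents by their noisy score, highest first.
    order = sorted(range(len(scores)), key=lambda i: noisy[i], reverse=True)
    return order[:k]
```

With a small epsilon the ranking is dominated by noise, so the presence of any single document has less influence on what is retrieved, at the cost of retrieval quality. With a very large epsilon the function behaves like ordinary top-k selection.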