Submit #833997: yoanbernabeu grepai v0.35.0-1-gf6dbf8d Cache Poisoning

Title: yoanbernabeu grepai v0.35.0-1-gf6dbf8d Cache Poisoning
Description:

## Vulnerability Title

Cache Poisoning via Hash-Key Confusion in Postgres Embedding Cache

## Affected Component

`indexer/chunker.go` and `store/postgres.go`

Repository: https://github.com/yoanbernabeu/grepai

## Summary

An attacker with the ability to index a project against a shared Postgres backend can craft a chunk with the same raw-content hash as a victim chunk, causing cache-key reuse in the embedding cache. This leads to cross-project reuse of the victim project's embedding vector and negatively impacts project isolation and vector-index integrity.

## Technical Details

The vulnerability occurs because grepai computes a chunk `content_hash` over only raw chunk content and then uses that value as the sole lookup key for the Postgres embedding cache.

**Where the Hash is Computed**

`indexer/chunker.go` computes a SHA-256 digest over `chunkContent` only:

```go
hash := sha256.Sum256([]byte(fmt.Sprintf("%s:%d:%d:%s", filePath, pos, end, chunkContent)))
contentHash := sha256.Sum256([]byte(chunkContent))
chunkID := fmt.Sprintf("%s_%d", filePath, chunkIndex)
```

`ChunkWithContext` then changes the actual text that is embedded by adding file-path context, meaning two chunks with the same raw content can share `content_hash` even if the text sent to the embedder differs by file path context.

**Vulnerable Cache Lookup**

In `store/postgres.go`, the digest is used as the sole cache key for retrieving an embedding vector from a shared Postgres table. The query returns `vector` without filtering by `project_id`:

```go
func (s *PostgresStore) LookupByContentHash(ctx context.Context, contentHash string) ([]float32, bool, error) {
	// ...
	err := s.pool.QueryRow(ctx,
		`SELECT vector FROM chunks WHERE content_hash = $1 AND vector IS NOT NULL LIMIT 1`,
		contentHash,
	).Scan(&vec)
	// ...
}
```

Two chunks can have identical raw content and therefore identical `content_hash`, while belonging to different projects, users, repositories, or embedding configurations.
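
The collision described above can be illustrated with a minimal Go sketch. It is not grepai's code: the hex encoding of the digest and the `embeddedText` helper (a hypothetical stand-in for `ChunkWithContext`, whose real output format is not shown in this report) are assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentHash follows the report's description: SHA-256 over the raw
// chunk content only. Hex encoding is an assumption for readability.
func contentHash(chunkContent string) string {
	sum := sha256.Sum256([]byte(chunkContent))
	return hex.EncodeToString(sum[:])
}

// embeddedText is a hypothetical stand-in for ChunkWithContext: it adds
// file-path context to the text that would be sent to the embedder.
func embeddedText(filePath, chunkContent string) string {
	return fmt.Sprintf("File: %s\n\n%s", filePath, chunkContent)
}

func main() {
	content := "func Add(a, b int) int { return a + b }"

	// Same raw content indexed by two different projects.
	victimHash := contentHash(content)
	attackerHash := contentHash(content)

	// The cache key is identical even though the embedded text differs.
	fmt.Println("content_hash equal:", victimHash == attackerHash)
	fmt.Println("embedded text equal:",
		embeddedText("victim/math.go", content) == embeddedText("attacker/util.go", content))
}
```

Under these assumptions the first comparison is true and the second is false, which is the mismatch between cache identity and embedded text that the report describes.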
The vulnerable cache lookup treats such chunks as equivalent because it only compares `content_hash`.

## Impact

This vulnerability allows attackers to:

- Reuse an embedding vector generated under another project in a shared Postgres backend.
- Poison or corrupt vector-index state and semantic search behavior across project boundaries.
- Infer possible cache existence for guessed content through cache-hit behavior, timing, logs, or skipped embedding work.

The attack can be persistent: once a wrong vector is reused and saved into the attacker's project index, the poisoned index state can remain until the affected project is re-indexed or the cache/index entries are invalidated.

## Proof of Concept

The PoC proves that two non-equivalent projects sharing one Postgres database can reuse the same embedding vector because `LookupByContentHash` ignores `project_id`.

```python
#!/usr/bin/env python3
"""Minimal conceptual PoC for Postgres embedding cache key confusion."""


def vulnerable_cache_key(obj):
    return obj["content_hash"]


victim = {
    "project_id": "proj_victim",
    "content_hash": "sha256:4f9c4a1d2b7e",
    "vector": [9.9, 8.8, 7.7],
}

attacker = {
    "project_id": "proj_attacker",
    "content_hash": "sha256:4f9c4a1d2b7e",
}

assert victim["project_id"] != attacker["project_id"]
assert vulnerable_cache_key(victim) == vulnerable_cache_key(attacker)

embedding_cache = {}
embedding_cache[vulnerable_cache_key(victim)] = victim["vector"]
attacker_vector = embedding_cache[vulnerable_cache_key(attacker)]

print("victim key: ", vulnerable_cache_key(victim))
print("attacker key:", vulnerable_cache_key(attacker))
print("attacker received vector:", attacker_vector)
print("Security invariant broken: attacker project reused victim project vector.")
```

Official E2E verification output with a real `pgvector/pgvector:pg17` container:

```text
tester-1 | === RUN TestPoC
tester-1 | VULNERABILITY CONFIRMED: project "proj_attacker" resolved victim vector [9.9 8.8 7.7] via content_hash-only lookup for "sha256:4f9c4a1d2b7e"
tester-1 | --- PASS: TestPoC (0.12s)
tester-1 | PASS
tester-1 | ok github.com/yoanbernabeu/grepai/store 0.123s
```

## Remediation

Update the Postgres embedding cache lookup so cache identity is scoped by project and embedder configuration. Avoid looking up vectors by `content_hash` alone.

Example direction:

```sql
-- In store/postgres.go
SELECT vector
FROM chunks
WHERE project_id = $1
  AND content_hash = $2
  AND vector IS NOT NULL
LIMIT 1
```

Additional mitigations:

1. Complete Field Coverage: Include `project_id` and embedder identity fields (provider, model, dimensions).
2. Domain Separation: Separate embedding cache namespaces across projects, workspaces, and tenants.
3. Read-Time Revalidation: Verify that the cached entry belongs to the current project before returning.
4. State Invalidation: Invalidate or migrate Postgres cache/index entries generated with the vulnerable key scheme.

## References

- Vulnerable cache lookup: `https://github.com/yoanbernabeu/grepai/blob/f6dbf8dbb74c1f80aec37531964665bb71bdc6db/store/postgres.go#L376-L395`
- Project-scoped Postgres writes: `https://github.com/yoanbernabeu/grepai/blob/f6dbf8dbb74c1f80aec37531964665bb71bdc6db/store/postgres.go#L102-L114`
- Cache use in the indexer: `https://github.com/yoanbernabeu/grepai/blob/f6dbf8dbb74c1f80aec37531964665bb71bdc6db/indexer/indexer.go#L630-L647`
Source: https://github.com/yoanbernabeu/grepai/issues/249
User: Dem000 (UID 98389)
Submission: 05/20/2026 10:27 (21 days ago)
Moderation: 06/07/2026 11:57 (18 days later)
Status: Accepted
VulDB entry: 369101 [yoanbernabeu grepai up to 0.35.0 Postgres Embedding Cache indexer/chunker.go PostgresStore.LookupByContentHash content_hash weak hash]
Points: 20
