| Title | yoanbernabeu grepai v0.35.0-1-gf6dbf8d Cache Poisoning |
|---|
| Description | ## Vulnerability Title
Cache Poisoning via Hash-Key Confusion in Postgres Embedding Cache
## Affected Component
`indexer/chunker.go` and `store/postgres.go`
Repository: https://github.com/yoanbernabeu/grepai
## Summary
An attacker who can index a project against a shared Postgres backend can submit a chunk whose raw content, and therefore whose `content_hash`, matches a victim chunk, causing the embedding cache to reuse the victim's entry. The attacker's project then receives the embedding vector computed for the victim project, which breaks project isolation and undermines vector-index integrity.
## Technical Details
The vulnerability occurs because grepai computes a chunk `content_hash` over only raw chunk content and then uses that value as the sole lookup key for the Postgres embedding cache.
**Where the Hash is Computed**
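`indexer/chunker.go` computes `contentHash` as a SHA-256 digest over `chunkContent` only. The separate `hash` value in the same snippet also covers the file path and offsets, but `contentHash` is the value that becomes the cache key: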
```go
hash := sha256.Sum256([]byte(fmt.Sprintf("%s:%d:%d:%s", filePath, pos, end, chunkContent)))
contentHash := sha256.Sum256([]byte(chunkContent))
chunkID := fmt.Sprintf("%s_%d", filePath, chunkIndex)
```
`ChunkWithContext` then changes the actual text that is embedded by adding file-path context, meaning two chunks with the same raw content can share `content_hash` even if the text sent to the embedder differs by file path context.
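The mismatch between the hashed text and the embedded text can be illustrated with a short sketch. It assumes, purely for illustration, that the context prefix has the form `File: <path>`; the real format produced by `ChunkWithContext` is not shown in this report, and `with_context` is a hypothetical stand-in for it.

```python
import hashlib


def content_hash(chunk_content: str) -> str:
    """SHA-256 over the raw chunk content only, mirroring contentHash."""
    return hashlib.sha256(chunk_content.encode("utf-8")).hexdigest()


def with_context(file_path: str, chunk_content: str) -> str:
    """Hypothetical stand-in for ChunkWithContext; the prefix format is assumed."""
    return f"File: {file_path}\n\n{chunk_content}"


# Two chunks with identical raw content at different (made-up) paths.
chunk_a = {"path": "victim/cmd/main.go", "content": "func main() {}\n"}
chunk_b = {"path": "attacker/tools/main.go", "content": "func main() {}\n"}

hash_a = content_hash(chunk_a["content"])
hash_b = content_hash(chunk_b["content"])
text_a = with_context(chunk_a["path"], chunk_a["content"])
text_b = with_context(chunk_b["path"], chunk_b["content"])

print("same content_hash:     ", hash_a == hash_b)
print("embedder input differs:", text_a != text_b)
```

Under these assumptions the cache key is identical for both chunks even though the embedder would be given different inputs.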
**Vulnerable Cache Lookup**
In `store/postgres.go`, the digest is used as the sole cache key for retrieving an embedding vector from a shared Postgres table. The query returns `vector` without filtering by `project_id`:
```go
func (s *PostgresStore) LookupByContentHash(ctx context.Context, contentHash string) ([]float32, bool, error) {
	// ...
	err := s.pool.QueryRow(ctx,
		`SELECT vector FROM chunks WHERE content_hash = $1 AND vector IS NOT NULL LIMIT 1`,
		contentHash,
	).Scan(&vec)
	// ...
}
```
Two chunks can have identical raw content and therefore identical `content_hash`, while belonging to different projects, users, repositories, or embedding configurations. The vulnerable cache lookup treats them as equivalent because it only compares `content_hash`.
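The effect of the missing predicate can be reproduced in a few lines. The sketch below uses an in-memory SQLite table as a stand-in for the Postgres `chunks` table; the three-column schema, the text-encoded vector, and the helper names `lookup_unscoped` and `lookup_scoped` are simplifying assumptions, not the project's actual schema or API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the shared chunks table (assumed schema).
conn.execute("CREATE TABLE chunks (project_id TEXT, content_hash TEXT, vector TEXT)")
# Only the victim project has indexed this content so far.
conn.execute(
    "INSERT INTO chunks VALUES (?, ?, ?)",
    ("proj_victim", "sha256:4f9c4a1d2b7e", "[9.9, 8.8, 7.7]"),
)


def lookup_unscoped(content_hash):
    """Mirrors the reported query: content_hash is the only key."""
    row = conn.execute(
        "SELECT vector FROM chunks "
        "WHERE content_hash = ? AND vector IS NOT NULL LIMIT 1",
        (content_hash,),
    ).fetchone()
    return row[0] if row else None


def lookup_scoped(project_id, content_hash):
    """Same lookup with an added project_id predicate, shown for contrast."""
    row = conn.execute(
        "SELECT vector FROM chunks "
        "WHERE project_id = ? AND content_hash = ? AND vector IS NOT NULL LIMIT 1",
        (project_id, content_hash),
    ).fetchone()
    return row[0] if row else None


unscoped_hit = lookup_unscoped("sha256:4f9c4a1d2b7e")
scoped_hit = lookup_scoped("proj_attacker", "sha256:4f9c4a1d2b7e")
print("unscoped lookup from attacker project:", unscoped_hit)
print("scoped lookup from attacker project:  ", scoped_hit)
```

In this model the unscoped lookup returns the victim's row to any caller that knows the hash, while the scoped variant returns nothing for a project that has not stored that chunk itself.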
## Impact
This vulnerability allows attackers to:
- Reuse an embedding vector generated under another project in a shared Postgres backend.
- Poison or corrupt vector-index state and semantic search behavior across project boundaries.
- Infer possible cache existence for guessed content through cache-hit behavior, timing, logs, or skipped embedding work.
The attack can be persistent: once a wrong vector is reused and saved into the attacker's project index, the poisoned index state can remain until the affected project is re-indexed or the cache/index entries are invalidated.
## Proof of Concept
The PoC demonstrates that two distinct projects sharing one Postgres database can end up with the same embedding vector because `LookupByContentHash` ignores `project_id`.
```python
#!/usr/bin/env python3
"""Minimal conceptual PoC for Postgres embedding cache key confusion."""


def vulnerable_cache_key(obj):
    # The cache key ignores project_id entirely.
    return obj["content_hash"]


victim = {
    "project_id": "proj_victim",
    "content_hash": "sha256:4f9c4a1d2b7e",
    "vector": [9.9, 8.8, 7.7],
}
attacker = {
    "project_id": "proj_attacker",
    "content_hash": "sha256:4f9c4a1d2b7e",
}

assert victim["project_id"] != attacker["project_id"]
assert vulnerable_cache_key(victim) == vulnerable_cache_key(attacker)

# The victim populates the shared cache; the attacker reads the same entry.
embedding_cache = {}
embedding_cache[vulnerable_cache_key(victim)] = victim["vector"]
attacker_vector = embedding_cache[vulnerable_cache_key(attacker)]

print("victim key:  ", vulnerable_cache_key(victim))
print("attacker key:", vulnerable_cache_key(attacker))
print("attacker received vector:", attacker_vector)
print("Security invariant broken: attacker project reused victim project vector.")
```
Official E2E verification output with a real `pgvector/pgvector:pg17` container:
```text
tester-1 | === RUN TestPoC
tester-1 | VULNERABILITY CONFIRMED: project "proj_attacker" resolved victim vector [9.9 8.8 7.7] via content_hash-only lookup for "sha256:4f9c4a1d2b7e"
tester-1 | --- PASS: TestPoC (0.12s)
tester-1 | PASS
tester-1 | ok github.com/yoanbernabeu/grepai/store 0.123s
```
## Remediation
Update the Postgres embedding cache lookup so cache identity is scoped by project and embedder configuration. Avoid looking up vectors by `content_hash` alone.
Example direction:
```sql
-- In store/postgres.go: query used by LookupByContentHash, scoped to the calling project
SELECT vector
FROM chunks
WHERE project_id = $1
  AND content_hash = $2
  AND vector IS NOT NULL
LIMIT 1
```
Additional mitigations:
1. Complete Field Coverage: Include `project_id` and embedder identity fields (provider, model, dimensions).
2. Domain Separation: Separate embedding cache namespaces across projects, workspaces, and tenants.
3. Read-Time Revalidation: Verify that the cached entry belongs to the current project before returning.
4. State Invalidation: Invalidate or migrate Postgres cache/index entries generated with the vulnerable key scheme.
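One way to realize the first two mitigations is to derive the cache key from every field that determines the embedding, not from the content hash alone. The following is a minimal sketch under stated assumptions: `scoped_cache_key` and its parameter names are hypothetical, the provider and model values are placeholders, and canonical JSON is only one of several unambiguous encodings that could be used.

```python
import hashlib
import json


def scoped_cache_key(project_id, provider, model, dimensions, content_hash):
    """Bind the content hash to the project and the embedder identity."""
    fields = {
        "project_id": project_id,
        "provider": provider,
        "model": model,
        "dimensions": dimensions,
        "content_hash": content_hash,
    }
    # Canonical JSON keeps field boundaries unambiguous, so no two distinct
    # field combinations serialize to the same byte string.
    encoded = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


shared_hash = "sha256:4f9c4a1d2b7e"
victim_key = scoped_cache_key("proj_victim", "provider-a", "model-x", 768, shared_hash)
attacker_key = scoped_cache_key("proj_attacker", "provider-a", "model-x", 768, shared_hash)
other_model_key = scoped_cache_key("proj_victim", "provider-a", "model-y", 768, shared_hash)

print("keys differ across projects:", victim_key != attacker_key)
print("keys differ across models:  ", victim_key != other_model_key)
```

With a key of this shape, identical raw content in two projects, or in one project under two embedder configurations, no longer resolves to the same cache entry. The trade-off is that legitimate cross-project deduplication is lost.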
## References
- Vulnerable cache lookup: `https://github.com/yoanbernabeu/grepai/blob/f6dbf8dbb74c1f80aec37531964665bb71bdc6db/store/postgres.go#L376-L395`
- Project-scoped Postgres writes: `https://github.com/yoanbernabeu/grepai/blob/f6dbf8dbb74c1f80aec37531964665bb71bdc6db/store/postgres.go#L102-L114`
- Cache use in the indexer: `https://github.com/yoanbernabeu/grepai/blob/f6dbf8dbb74c1f80aec37531964665bb71bdc6db/indexer/indexer.go#L630-L647` |
|---|
| Source | ⚠️ https://github.com/yoanbernabeu/grepai/issues/249 |
|---|
| User | Dem000 (UID 98389) |
|---|
| Submission | 05/20/2026 10:27 (21 days ago) |
|---|
| Moderation | 06/07/2026 11:57 (18 days later) |
|---|
| Status | Accepted |
|---|
| VulDB entry | 369101 [yoanbernabeu grepai up to 0.35.0 Postgres Embedding Cache indexer/chunker.go PostgresStore.LookupByContentHash content_hash weak hash] |
|---|
| Points | 20 |
|---|