| Description | Cache Poisoning via Partial-Object Hashing in Vision Feature Cache
Summary
An attacker who can submit vision inference requests to an exo cluster can craft an image whose raw pixel byte sequence matches another user's image but whose dimensions differ. Both images then produce the same feature-cache key and the same prefix-cache content_hash,
so the victim's inference silently uses vision features computed from the attacker's image. Every API client sharing the same exo process is exposed to this.
Affected Versions
- Affected: exo version <= 0.3.70, main branch at and before commit 629c55d6ba201014ab45c48b0e9f984495a30f34
- Confirmed affected: exo version 0.3.70, commit 629c55d6ba201014ab45c48b0e9f984495a30f34
- Fixed: PR #2152 (https://github.com/exo-explore/exo/pull/2152), not yet released at the time of reporting
Existing _feature_cache entries and KVPrefixCache entries generated by affected versions are in-memory only and do not persist across restarts. A process restart clears all vulnerable cache state.
Details
The vulnerability occurs because the MLX vision engine computes SHA-256 over PIL.Image.tobytes() output and then uses the resulting digest as both a feature cache key and a prefix cache content hash. However, tobytes() returns raw pixel bytes
without any dimension information, so two images with different width×height but identical pixel byte sequences are treated as the same image.
Where the Hash is Computed
The following code is from version 0.3.70, commit 629c55d6ba201014ab45c48b0e9f984495a30f34.
Feature cache key:
# https://github.com/exo-explore/exo/blob/629c55d6ba201014ab45c48b0e9f984495a30f34/src/exo/worker/engines/mlx/vision.py#L739-L744
def _image_cache_key(self, images: list[Base64Image]) -> str:
    h = hashlib.sha256()
    for img in images:
        pil = decode_base64_image(img)
        h.update(pil.tobytes())
    return h.hexdigest()
Prefix cache content hash:
# https://github.com/exo-explore/exo/blob/629c55d6ba201014ab45c48b0e9f984495a30f34/src/exo/worker/engines/mlx/vision.py#L711-L712
img = decode_base64_image(images[i])
region.content_hash = hashlib.sha256(img.tobytes()).hexdigest()
Both call decode_base64_image(), which converts the image to RGB mode via img.convert("RGB") before returning. The conversion resolves palette-mode (P mode) ambiguity but does not embed dimension metadata into the byte stream.
What Fields Are Included or Excluded
The digest includes:
- Raw RGB pixel byte sequence (PIL.Image.tobytes() after convert("RGB"))
The digest excludes:
- Image width
- Image height
- Original image mode (before conversion)
- Color profile / ICC data
- Any other PIL metadata
The excluded dimension fields directly determine how the vision encoder preprocesses the image. The encoder resizes images by aspect ratio (e.g. 6×4 → 384×256 vs 4×6 → 256×384), producing semantically different feature tensors. Therefore, digest
equality does not imply that the two images will produce equivalent vision features.
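To make the dependence on dimensions concrete, the following is a minimal sketch of an aspect-ratio-preserving resize calculation. The function name resized_dimensions and the long_side=384 default are assumptions chosen to reproduce the example figures above; the actual resize logic of the exo vision encoder is not shown in this report and may differ.

```python
def resized_dimensions(width: int, height: int, long_side: int = 384) -> tuple[int, int]:
    """Hypothetical helper: scale so the longer side equals `long_side`,
    keeping the aspect ratio. Not taken from the exo code base."""
    scale = long_side / max(width, height)
    return round(width * scale), round(height * scale)


# Same pixel count (24 pixels), different shapes.
landscape = resized_dimensions(6, 4)  # 3:2 image
portrait = resized_dimensions(4, 6)   # 2:3 image

print(landscape, portrait)
```

Under these assumptions the two shapes map to different encoder input sizes, which is why features computed for one cannot stand in for the other.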
How the Hash is Used for a Security-Relevant Decision
Feature cache lookup:
# https://github.com/exo-explore/exo/blob/629c55d6ba201014ab45c48b0e9f984495a30f34/src/exo/worker/engines/mlx/vision.py#L757-L766
cache_key = self._image_cache_key(images)
cached = self._feature_cache.pop(cache_key, None)
if cached is not None:
    self._feature_cache[cache_key] = cached
    image_features, n_tokens_per_image = cached
else:
    image_features, n_tokens_per_image = self._encoder.encode_images(images)
    self._feature_cache[cache_key] = (image_features, n_tokens_per_image)
The digest is used as the sole key in a process-level dict (_feature_cache, max 32 entries). A cache hit returns pre-computed vision features without re-encoding, and these features are fed directly into the model's generation pipeline.
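The poisoning sequence can be reproduced in isolation with a small model of this lookup. This is a sketch under stated assumptions, not the exo implementation: RawImage, FeatureCache, and fake_encode are hypothetical stand-ins, the "features" are simply the dimensions the encoder saw, and the eviction loop is an assumed LRU policy built around the 32-entry limit mentioned above.

```python
import hashlib
from collections import OrderedDict
from dataclasses import dataclass


@dataclass(frozen=True)
class RawImage:
    """Hypothetical stand-in for a decoded RGB image."""
    width: int
    height: int
    pixels: bytes  # width * height * 3 bytes, row-major


MAX_ENTRIES = 32  # limit stated in this report


def vulnerable_key(images: list[RawImage]) -> str:
    h = hashlib.sha256()
    for img in images:
        h.update(img.pixels)  # dimensions never enter the digest
    return h.hexdigest()


def fake_encode(images: list[RawImage]) -> list[tuple[int, int]]:
    """Stand-in encoder: the 'features' record which shape was encoded."""
    return [(img.width, img.height) for img in images]


class FeatureCache:
    def __init__(self) -> None:
        self._entries: OrderedDict = OrderedDict()

    def get_features(self, images: list[RawImage]):
        key = vulnerable_key(images)
        cached = self._entries.pop(key, None)
        if cached is None:
            cached = fake_encode(images)      # cache miss: encode now
        self._entries[key] = cached           # (re)insert as most recent
        while len(self._entries) > MAX_ENTRIES:
            self._entries.popitem(last=False)  # assumed LRU eviction
        return cached


pixels = bytes(range(72))            # 24 RGB pixels
attacker = RawImage(4, 6, pixels)    # portrait, submitted first
victim = RawImage(6, 4, pixels)      # landscape, submitted second

cache = FeatureCache()
attacker_features = cache.get_features([attacker])
victim_features = cache.get_features([victim])
print(victim_features)
```

In this model the victim's request is a cache hit and returns the entry created for the attacker's 4×6 image, even though the victim submitted a 6×4 image.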
Prefix cache validation:
# https://github.com/exo-explore/exo/blob/629c55d6ba201014ab45c48b0e9f984495a30f34/src/exo/worker/engines/mlx/cache.py#L417-L424
if query_r.content_hash != cached_r.content_hash:
    match_length = cached_r.start_pos
    break
The content_hash is used to validate whether a prefix cache entry's media region corresponds to the same image. If the hashes match, the prefix cache reuses KV states computed from the cached image's features.
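The effect of this check can be illustrated with a simplified model. Everything below is an assumption made for illustration: the MediaRegion fields other than start_pos and content_hash, the function name validated_match_length, and the loop around the comparison are hypothetical and only approximate the role of the quoted lines.

```python
from dataclasses import dataclass


@dataclass
class MediaRegion:
    """Hypothetical media region inside a token sequence."""
    start_pos: int
    end_pos: int
    content_hash: str


def validated_match_length(
    match_length: int,
    query_regions: list[MediaRegion],
    cached_regions: list[MediaRegion],
) -> int:
    """Truncate a token-level prefix match at the first media region
    whose content hash differs between the query and the cached entry."""
    for query_r, cached_r in zip(query_regions, cached_regions):
        if cached_r.start_pos >= match_length:
            break  # region lies beyond the matched prefix
        if query_r.content_hash != cached_r.content_hash:
            match_length = cached_r.start_pos
            break
    return match_length


cached = [MediaRegion(10, 50, "abc")]
same_hash_query = [MediaRegion(10, 50, "abc")]
other_hash_query = [MediaRegion(10, 50, "def")]

kept = validated_match_length(80, same_hash_query, cached)
truncated = validated_match_length(80, other_hash_query, cached)
print(kept, truncated)
```

A differing hash cuts the reusable prefix back to the start of the image region. When the attacker's and the victim's images share a content_hash, the check has nothing to object to and the full prefix, including KV states derived from the other image, is reused.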
Why Hash Equality Does Not Imply Security Equivalence
The issue is not a raw cryptographic break of SHA-256. The issue is application-level hash confusion: the application treats digest equality as image equivalence, but the digest does not encode the spatial dimensions that determine how the image is
preprocessed and encoded into features.
Two images with identical pixel byte sequences but different width×height produce the same SHA-256 digest. However, the vision encoder's aspect-ratio-aware resize produces different feature tensors for each, so digest equality does not imply feature
equivalence.
How the Attacker Constructs a Conflicting Object
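The collision is deterministic and requires no brute force. Given any W×H image where W ≠ H, swapping width and height yields a colliding image (more generally, any other width and height with the same product W·H yields the same byte count and therefore also collides):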
Image A (victim): width=6, height=4, RGB pixels = [R0,G0,B0, R1,G1,B1, ..., R23,G23,B23]
Image B (attacker): width=4, height=6, RGB pixels = [R0,G0,B0, R1,G1,B1, ..., R23,G23,B23]
Both images have identical tobytes() output (72 bytes), therefore identical SHA-256 digest. But after aspect-ratio resize:
- Image A (landscape 3:2) resizes to e.g. 384×256
- Image B (portrait 2:3) resizes to e.g. 256×384
The resulting feature tensors are semantically different.
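The construction can be checked with the standard library alone. The sketch below treats a 72-byte buffer as row-major RGB data; the helper pixel_at is hypothetical and only illustrates how the same bytes map to different coordinates once the width changes.

```python
import hashlib

pixels = bytes(range(72))  # 24 RGB pixels: R0,G0,B0, ..., R23,G23,B23


def pixel_at(data: bytes, width: int, x: int, y: int) -> tuple[int, ...]:
    """Return the RGB triple at (x, y) for a row-major buffer of the given width."""
    offset = (y * width + x) * 3
    return tuple(data[offset:offset + 3])


# Image A is 6x4 and Image B is 4x6, but both hash the same 72 bytes.
digest_a = hashlib.sha256(pixels).hexdigest()
digest_b = hashlib.sha256(pixels).hexdigest()

# The first pixel of the second row differs between the two layouts.
a_row1 = pixel_at(pixels, 6, 0, 1)  # pixel index 6 in Image A
b_row1 = pixel_at(pixels, 4, 0, 1)  # pixel index 4 in Image B

print(digest_a == digest_b, a_row1, b_row1)
```

With Pillow, the equivalent of Image B would be built by passing the victim's bytes to Image.frombytes("RGB", (4, 6), data); the stdlib version is used here only to keep the example dependency-free.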
The attacker's requirement is knowledge of the victim's pixel byte sequence. In scenarios where images are predictable (standard test images, templated charts, well-known icons, shared reference images), this prerequisite is realistic.
Version-Specific Behavior
The _feature_cache and KVPrefixCache are process-level in-memory structures shared across all API requests within a single exo process. exo exposes OpenAI, Claude, and Ollama-compatible API endpoints, so multiple API clients may share the same
cache. The vulnerability exists in all versions of exo that include the MLX vision engine with caching (version 0.3.70 and the main branch at commit 629c55d).
Comparison with a Secure Path
Fixed version (PR #2152 (https://github.com/exo-explore/exo/pull/2152)):
# _image_cache_key (fixed)
def _image_cache_key(self, images: list[Base64Image]) -> str:
    h = hashlib.sha256()
    for img in images:
        pil = decode_base64_image(img)
        h.update(f"{pil.width}x{pil.height}".encode())
        h.update(pil.tobytes())
    return h.hexdigest()

# content_hash (fixed)
img = decode_base64_image(images[i])
h = hashlib.sha256(f"{img.width}x{img.height}".encode())
h.update(img.tobytes())
region.content_hash = h.hexdigest()
The fix prepends "{width}x{height}" to each image's hash input, so two images with identical pixel bytes but different dimensions no longer produce the same digest.
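The difference between the two hashing schemes can be verified on a stand-in image type. RawImage, old_key, and fixed_key are hypothetical names used only for this sketch; fixed_key mirrors the hashing order shown in the fixed content_hash snippet above.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RawImage:
    """Hypothetical stand-in for a decoded RGB image."""
    width: int
    height: int
    pixels: bytes


def old_key(img: RawImage) -> str:
    # Vulnerable scheme: pixel bytes only.
    return hashlib.sha256(img.pixels).hexdigest()


def fixed_key(img: RawImage) -> str:
    # Fixed scheme: dimensions are hashed before the pixel bytes.
    h = hashlib.sha256(f"{img.width}x{img.height}".encode())
    h.update(img.pixels)
    return h.hexdigest()


pixels = bytes(range(72))
image_a = RawImage(6, 4, pixels)
image_b = RawImage(4, 6, pixels)

print(old_key(image_a) == old_key(image_b))      # collision under the old scheme
print(fixed_key(image_a) == fixed_key(image_b))  # distinct under the fixed scheme
```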
Impact
This vulnerability allows attackers to:
- Poison the vision feature cache: an attacker-submitted image with swapped dimensions occupies the cache slot, causing subsequent requests with the victim's image to receive incorrect vision features
- Bypass prefix cache validation: the _validate_media_match check passes on colliding content_hash values, reusing KV states computed from incorrect image features
- Corrupt model inference output: the model generates responses based on vision features from a different image, silently producing wrong answers for vision tasks
The attack is transient: _feature_cache and KVPrefixCache are in-memory structures that are cleared on process restart. However, within a running process, poisoned entries persist until evicted by the LRU policy (max 32 feature cache entries).
|
|---|