# Image Cache Hash Collision via Missing Dimension Metadata
| Field | Value |
|-------|-------|
| **Advisory ID** | SWIFT-2026-001 |
| **Affected Component** | `swift/template/base.py` → `Template._save_pil_image()` |
| **Affected Versions** | All versions up to and including commit `8d1e071d5` |
| **Severity** | Medium (CVSS 3.1 Base: 5.3) |
| **Attack Vector** | Network (user-supplied image input) |
| **CWE** | CWE-345 (Insufficient Verification of Data Authenticity) |
| **Status** | Open |
---
## Summary
`Template._save_pil_image()` uses `SHA256(image.tobytes())` as the cache key for content-addressed image storage. However, PIL's `tobytes()` returns a **flat byte stream without any dimensional metadata** (width, height, mode). Two visually distinct images with identical raw pixel bytes but different dimensions therefore produce the same hash and collide in the cache. The first image cached is silently served for every later image that shares its hash, so the model receives incorrect input.
---
## Affected Code
**File:** `swift/template/base.py`, lines 697–707
```python
@staticmethod
def _save_pil_image(image: Image.Image) -> str:
    img_bytes = image.tobytes()  # ← no dimensional metadata
    img_hash = hashlib.sha256(img_bytes).hexdigest()  # ← hash of raw pixels only
    tmp_dir = os.path.join(get_cache_dir(), 'tmp', 'images')
    logger.info_once(f'create tmp_dir: {tmp_dir}')
    os.makedirs(tmp_dir, exist_ok=True)
    img_path = os.path.join(tmp_dir, f'{img_hash}.png')
    if not os.path.exists(img_path):
        image.save(img_path)  # ← only first image is saved
    return img_path  # ← all collisions get this path
```
**Call site:** `swift/template/base.py`, lines 305–308
```python
if images and not load_images_origin:  # fix pt & qwen-vl
    for i, image in enumerate(images):
        if isinstance(image, Image.Image):
            images[i] = self._save_pil_image(image)
```
---
## Root Cause
`PIL.Image.tobytes()` serializes pixel data as a contiguous byte array:
```
Output: R₁G₁B₁ R₂G₂B₂ R₃G₃B₃ ... (W × H × 3 bytes for RGB)
```
This output does **not** encode:
- **Width (W)** and **Height (H)** — only the product `W × H` is implied by the byte count
- **Image mode** — while the upstream pipeline normalizes to RGB, the function itself does not enforce this
Therefore, for any pair of images where `W₁ × H₁ == W₂ × H₂` and the flat pixel sequences are identical, the SHA-256 digest will collide.
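The mode ambiguity can be shown in the same way as the dimension ambiguity. The sketch below is illustrative only and assumes Pillow is installed; the 12-byte buffer and the image sizes are arbitrary choices for the example, not values taken from the codebase.

```python
import hashlib

from PIL import Image

# 12 arbitrary bytes: enough for 4 RGB pixels (4 x 3) or 3 RGBA pixels (3 x 4).
raw = bytes(range(12))

rgb_img = Image.frombytes('RGB', (4, 1), raw)    # 4 pixels, 3 bytes each
rgba_img = Image.frombytes('RGBA', (3, 1), raw)  # 3 pixels, 4 bytes each

# tobytes() returns the same flat buffer for both images ...
same_bytes = rgb_img.tobytes() == rgba_img.tobytes()

# ... so a hash over that buffer alone cannot tell them apart.
key_rgb = hashlib.sha256(rgb_img.tobytes()).hexdigest()
key_rgba = hashlib.sha256(rgba_img.tobytes()).hexdigest()

print('same bytes:', same_bytes)
print('same key:  ', key_rgb == key_rgba)
print('layouts:   ', (rgb_img.mode, rgb_img.size), (rgba_img.mode, rgba_img.size))
```

As noted above, the upstream pipeline normalizes images to RGB, so this case only matters for call paths that skip that normalization.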
---
## Proof of Concept
### Minimal Reproduction
```python
"""
PoC: Demonstrate hash collision in _save_pil_image()
Two visually DIFFERENT images produce the same SHA256 cache key.
Run this script — it will assert that both images share the same hash
while being visually distinct (horizontal stripes vs diagonal pattern).
"""
import hashlib
import numpy as np
from PIL import Image
# ── Step 1: Construct raw pixel data (shared by both images) ──────────
# Total pixels: 120 × 80 = 80 × 120 = 9600 pixels = 28800 bytes (RGB)
W_A, H_A = 120, 80 # landscape
W_B, H_B = 80, 120 # portrait
assert W_A * H_A == W_B * H_B # same total pixel count
total_pixels = W_A * H_A
pixels = np.zeros((total_pixels, 3), dtype=np.uint8)
# Paint horizontal stripes (every 120 pixels = one row of image A)
for i in range(total_pixels):
    row_in_A = i // W_A
    if row_in_A % 10 < 5:
        pixels[i] = [255, 60, 60]  # red stripe
    else:
        pixels[i] = [60, 60, 255]  # blue stripe
raw_bytes = pixels.tobytes()
# ── Step 2: Create two images from the SAME bytes ────────────────────
img_a = Image.frombytes('RGB', (W_A, H_A), raw_bytes) # 120×80
img_b = Image.frombytes('RGB', (W_B, H_B), raw_bytes) # 80×120
# ── Step 3: Verify hash collision ─────────────────────────────────────
hash_a = hashlib.sha256(img_a.tobytes()).hexdigest()
hash_b = hashlib.sha256(img_b.tobytes()).hexdigest()
assert hash_a == hash_b, "Hashes should collide!"
print(f"[COLLISION] Both images produce the same SHA-256:")
print(f" Image A: {W_A}×{H_A} hash={hash_a[:16]}...")
print(f" Image B: {W_B}×{H_B} hash={hash_b[:16]}...")
# ── Step 4: Save both images for visual inspection ───────────────────
# Separate paths are used here so the two layouts can be compared side by side.
# In _save_pil_image both images map to the same `<hash>.png` path: the first
# call writes img_a, and the `if not os.path.exists` check skips saving img_b.
img_a.save('/tmp/poc_image_a_120x80.png')
img_b.save('/tmp/poc_image_b_80x120.png')
print(f"\n[VISUAL DIFF] Open both files to see the difference:")
print(f" Image A (120×80): /tmp/poc_image_a_120x80.png → clean horizontal stripes")
print(f" Image B (80×120): /tmp/poc_image_b_80x120.png → diagonal/broken pattern")
print(f"\nIn production, image B would NEVER be saved.")
print(f"The model would receive image A when image B was submitted.")
# ── Step 5: Verify they are visually different ────────────────────────
arr_a = np.array(img_a) # shape (80, 120, 3)
arr_b = np.array(img_b) # shape (120, 80, 3)
assert arr_a.shape != arr_b.shape, "Shapes must differ"
print(f"\n[CONFIRMED] Shape A={arr_a.shape}, Shape B={arr_b.shape}")
print(f"[CONFIRMED] Pixel arrays are NOT equivalent — images are visually distinct.")
```
### Expected Output
```
[COLLISION] Both images produce the same SHA-256:
 Image A: 120×80 hash=<first 16 hex digits>...
 Image B: 80×120 hash=<same 16 hex digits>...

[VISUAL DIFF] Open both files to see the difference:
 Image A (120×80): /tmp/poc_image_a_120x80.png → clean horizontal stripes
 Image B (80×120): /tmp/poc_image_b_80x120.png → diagonal/broken pattern

In production, image B would NEVER be saved.
The model would receive image A when image B was submitted.

[CONFIRMED] Shape A=(80, 120, 3), Shape B=(120, 80, 3)
[CONFIRMED] Pixel arrays are NOT equivalent — images are visually distinct.
```
### Visual Comparison
```
Image A (120×80), clean stripes:    Image B (80×120), broken pattern:
████████████████████████            ████████████████
████████████████████████            ████████████░░░░
████████████████████████            ░░░░░░░░░░░░████
████████████████████████            ████████████████
████████████████████████            ████████░░░░░░░░
░░░░░░░░░░░░░░░░░░░░░░░░            ░░░░░░░░████████
░░░░░░░░░░░░░░░░░░░░░░░░            ████████████████
░░░░░░░░░░░░░░░░░░░░░░░░            ████░░░░░░░░░░░░
░░░░░░░░░░░░░░░░░░░░░░░░            ░░░░░░░░░░░░████
░░░░░░░░░░░░░░░░░░░░░░░░            ████████████████
Same bytes, different row width → completely different visual layout
```
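The PoC writes the two images to separate paths so they can be inspected. To see the cache effect itself, the lookup can be replayed against a temporary directory. This is a minimal sketch: `save_pil_image_vulnerable` is a hypothetical stand-in that copies only the hashing and existence check from the affected code and takes the cache directory as a parameter instead of calling `get_cache_dir()`.

```python
import hashlib
import os
import tempfile

from PIL import Image


def save_pil_image_vulnerable(image: Image.Image, cache_dir: str) -> str:
    """Hypothetical stand-in for the affected logic: key = SHA-256 of raw pixels."""
    img_hash = hashlib.sha256(image.tobytes()).hexdigest()
    img_path = os.path.join(cache_dir, f'{img_hash}.png')
    if not os.path.exists(img_path):
        image.save(img_path)
    return img_path


# Same flat bytes, two layouts (sizes taken from the PoC above).
raw = bytes(i % 256 for i in range(120 * 80 * 3))
img_a = Image.frombytes('RGB', (120, 80), raw)
img_b = Image.frombytes('RGB', (80, 120), raw)

with tempfile.TemporaryDirectory() as cache_dir:
    path_a = save_pil_image_vulnerable(img_a, cache_dir)  # writes img_a
    path_b = save_pil_image_vulnerable(img_b, cache_dir)  # cache hit, nothing written
    with Image.open(path_b) as served:
        served_size = served.size  # what a later reader of path_b would load

print('same path:', path_a == path_b)
print('submitted:', img_b.size, 'served:', served_size)
```

Both calls return the same path, and the file read back for the second request has the 120×80 layout of image A even though an 80×120 image was submitted.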
---
## Impact
### 1. Inference Cache Poisoning
A shared inference service caches image A. A subsequent request with a crafted image B (same pixel bytes, different dimensions) hits the cache and receives image A. The multimodal model generates a response based on the **wrong image**.
### 2. GRPO / RLHF Training Data Corruption
In reinforcement learning training pipelines (GRPO, DPO, RLHF), if two training samples contain images that collide:
- Sample 2's image is silently replaced with Sample 1's cached image
- The reward model scores based on the wrong image-text pairing
- The policy model learns from corrupted reward signals
### 3. Dataset Deduplication False Positives
If image hashes are used for deduplication, visually distinct images are incorrectly identified as duplicates and removed, reducing effective training data.
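One way to check whether a given set of images is affected is to group them by the raw-pixel key and flag any key shared by more than one layout. The helper below is a hedged sketch, not part of the codebase: `find_layout_collisions` and the sample images are hypothetical, and it assumes the images are already loaded as PIL objects.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

from PIL import Image

Layout = Tuple[str, Tuple[int, int]]  # (mode, (width, height))


def find_layout_collisions(images: Iterable[Image.Image]) -> Dict[str, List[Layout]]:
    """Return raw-pixel keys that are shared by images with different layouts."""
    layouts_by_key: Dict[str, Set[Layout]] = defaultdict(set)
    for image in images:
        key = hashlib.sha256(image.tobytes()).hexdigest()
        layouts_by_key[key].add((image.mode, image.size))
    return {
        key: sorted(layouts)
        for key, layouts in layouts_by_key.items()
        if len(layouts) > 1
    }


raw = bytes(12)  # 4 black RGB pixels
samples = [
    Image.frombytes('RGB', (4, 1), raw),
    Image.frombytes('RGB', (2, 2), raw),       # same bytes, different layout
    Image.frombytes('RGB', (4, 1), raw),       # true duplicate of the first image
    Image.frombytes('RGB', (1, 1), bytes(3)),  # unrelated image
]

collisions = find_layout_collisions(samples)
for key, layouts in collisions.items():
    print(key[:16], layouts)
```

True duplicates (same mode, size, and bytes) are not reported; only keys that mix layouts are, which are the cases where deduplication or caching by this key would conflate distinct images.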
---
## Remediation
### Recommended Fix
Include image dimensions and mode in the hash input:
```python
@staticmethod
def _save_pil_image(image: Image.Image) -> str:
    # Fix: include dimensional metadata to prevent cross-dimension collision
    img_meta = f"{image.mode}:{image.width}:{image.height}:".encode()
    img_hash = hashlib.sha256(img_meta + image.tobytes()).hexdigest()
    tmp_dir = os.path.join(get_cache_dir(), 'tmp', 'images')
    logger.info_once(f'create tmp_dir: {tmp_dir}')
    os.makedirs(tmp_dir, exist_ok=True)
    img_path = os.path.join(tmp_dir, f'{img_hash}.png')
    if not os.path.exists(img_path):
        image.save(img_path)
    return img_path
```
### Why This Fix Is Sufficient
After the upstream `convert('RGB')` + `rescale_image` pipeline:
- `mode` is always `'RGB'` — including it provides defense-in-depth against future call paths that skip `convert('RGB')`
- `width` and `height` are the **only** remaining ambiguity in `tobytes()` output
- Adding these as a prefix to the hash input makes the key unambiguous: two images share a key only if their mode, width, height, and pixel bytes all match (barring a SHA-256 collision)
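A quick regression check for the key derivation might look like the sketch below. `cache_key` is a hypothetical helper that isolates the hashing step of the recommended fix; the image sizes are the ones used in the PoC.

```python
import hashlib

from PIL import Image


def cache_key(image: Image.Image) -> str:
    """Key as in the recommended fix: mode and size prefixed to the raw pixels."""
    img_meta = f'{image.mode}:{image.width}:{image.height}:'.encode()
    return hashlib.sha256(img_meta + image.tobytes()).hexdigest()


raw = bytes(i % 256 for i in range(120 * 80 * 3))
img_a = Image.frombytes('RGB', (120, 80), raw)
img_b = Image.frombytes('RGB', (80, 120), raw)
img_a_copy = img_a.copy()

key_a, key_b, key_a_copy = cache_key(img_a), cache_key(img_b), cache_key(img_a_copy)

# Different layouts of the same bytes no longer share a key ...
assert key_a != key_b
# ... while identical images still hit the same cache entry.
assert key_a == key_a_copy
```

The second assertion matters as much as the first: the fix should separate colliding layouts without reducing cache hits for genuinely identical images.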
### Migration
This change will **invalidate existing cached images** (hash values change). The cache directory (`{ca