| Description | ## Summary
`mlrun.utils.helpers.calculate_dataframe_hash` relies on `pandas.util.hash_pandas_object` as its sole fingerprinting mechanism. Pandas normalizes several dtypes to `uint64` before hashing (`bool`, `datetime64`, and sub-64-bit integers) and combines the per-column hashes with invertible modular arithmetic (mod 2^64), so semantically different DataFrames can produce identical hash values. Because the hash is used to construct artifact storage paths (`{artifact_path}/{hash}{suffix}`), this leads to **artifact path collisions**, resulting in silent data overwrites or stale data reads.
## Affected Version
- Verified on **v1.11.0-rc42** (commit `f802de1f4`)
- Affects all versions containing `calculate_dataframe_hash`
## Affected Code
### Root Cause
[`mlrun/utils/helpers.py:1626-1628`](https://github.com/mlrun/mlrun/blob/development/mlrun/utils/helpers.py#L1626-L1628)
```python
def calculate_dataframe_hash(dataframe: pandas.DataFrame):
    # https://stackoverflow.com/questions/49883236/...
    return hashlib.sha1(pandas.util.hash_pandas_object(dataframe).values).hexdigest()
```
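To make the root cause concrete, the following sketch inspects what `hash_pandas_object` hands to SHA-1 for a small DataFrame. The sample data is illustrative and not taken from MLRun; the point is only that the intermediate result is one `uint64` per row, with no dtype information attached.

```python
import hashlib

import pandas as pd

# Illustrative sample data (an assumption, not from the MLRun code base).
df = pd.DataFrame({"flag": [True, False, True], "value": [10, 20, 30]})

# One hash per row; the result carries values only, not the column dtypes.
row_hashes = pd.util.hash_pandas_object(df)

# SHA-1 is computed over the raw bytes of those per-row hashes.
digest = hashlib.sha1(row_hashes.values).hexdigest()

print(row_hashes.dtype, len(row_hashes), digest)
```

Any two DataFrames whose per-row hashes match will therefore share the final digest, whatever their schemas are.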
### Impact Point
[`mlrun/artifacts/dataset.py:250-261`](https://github.com/mlrun/mlrun/blob/development/mlrun/artifacts/dataset.py#L250-L261) — `resolve_dataframe_target_hash_path` uses this hash to derive `target_path`; the hash is also stored as `metadata.hash`.
```python
def resolve_dataframe_target_hash_path(self, dataframe, artifact_path: str):
    dataframe_hash = mlrun.utils.helpers.calculate_dataframe_hash(dataframe)
    suffix = self._resolve_suffix()
    artifact_path = (
        artifact_path + "/" if not artifact_path.endswith("/") else artifact_path
    )
    target_path = f"{artifact_path}{dataframe_hash}{suffix}"
    return dataframe_hash, target_path
```
### Caller
[`mlrun/artifacts/dataset.py:214-220`](https://github.com/mlrun/mlrun/blob/development/mlrun/artifacts/dataset.py#L214-L220) — `DatasetArtifact.upload()` assigns the hash to `metadata.hash` and the derived path to `spec.target_path`.
## Collision Types
### Type 1: Boolean vs Integer (Natural, High Likelihood)
Before hashing, `True` becomes `uint64(1)` and `False` becomes `uint64(0)`, which is exactly how the integers `1` and `0` are represented.
```python
df_bool = pd.DataFrame({"flag": [True, False, True], "value": [10, 20, 30]})
df_int = pd.DataFrame({"flag": [1, 0, 1], "value": [10, 20, 30]})
assert calculate_dataframe_hash(df_bool) == calculate_dataframe_hash(df_int)
# ✅ Collision confirmed
```
**Likelihood**: **High** — boolean/integer coercion is extremely common in ML pipelines (label encoding, feature preprocessing).
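The normalization behind this collision can be observed without going through MLRun at all. The sketch below first compares the `uint64` representations with NumPy and then compares the per-row hashes pandas produces for a boolean Series and an integer Series; the sample values are illustrative.

```python
import numpy as np
import pandas as pd

# After conversion to uint64 the two arrays are indistinguishable.
bool_as_u8 = np.array([True, False, True]).astype("u8")
int_as_u8 = np.array([1, 0, 1], dtype=np.int64).view("u8")
same_input = np.array_equal(bool_as_u8, int_as_u8)

# The per-row hashes pandas derives from them match as well.
bool_hashes = pd.util.hash_pandas_object(pd.Series([True, False, True]))
int_hashes = pd.util.hash_pandas_object(pd.Series([1, 0, 1]))
same_hashes = bool_hashes.equals(int_hashes)

print(same_input, same_hashes)
```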
### Type 2: Datetime vs Integer (Natural, Medium Likelihood)
`datetime64[ns]` is hashed via `.view("i8")`, producing the same bit pattern as an `int64` with the same nanosecond value.
```python
ts = pd.Timestamp("1970-01-01") + pd.Timedelta(nanoseconds=100)
df_time = pd.DataFrame({"col": pd.array([ts], dtype="datetime64[ns]")})
df_nint = pd.DataFrame({"col": pd.array([100], dtype="int64")})
assert calculate_dataframe_hash(df_time) == calculate_dataframe_hash(df_nint)
# ✅ Collision confirmed
```
**Likelihood**: **Medium** — occurs when feature engineering converts timestamps to numeric representations.
### Type 3: Integer Precision Promotion (Natural, High Likelihood)
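Sub-64-bit integer types (`int8`, `int16`, `int32`) are reinterpreted as unsigned integers of the same width and then widened to `uint64` before hashing, so non-negative values hash identically regardless of integer width. Negative values are not expected to collide this way: `int8(-3)` reinterprets as 253, while `int64(-3)` reinterprets as 2^64 - 3.
```python
df_i8 = pd.DataFrame({"sensor": np.array([5, 3, 127], dtype=np.int8)})
df_i64 = pd.DataFrame({"sensor": np.array([5, 3, 127], dtype=np.int64)})
assert calculate_dataframe_hash(df_i8) == calculate_dataframe_hash(df_i64)
# ✅ Collision for non-negative values
```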
**Likelihood**: **High** — dtype narrowing/widening is common when data passes through different systems (edge devices → server, CSV parsing → typed storage).
### Type 4: Cross-Column Linear Overflow (Adversarial)
Pandas combines per-column hashes row by row with an XOR-then-multiply step (`out = (out ^ col_hash) * multiplier`) under uint64 modular arithmetic. Because every step is invertible (the multipliers are odd), an attacker can solve for values that produce the same combined hash.
```python
df1 = pd.DataFrame({"A": [1604090909467468979, 2], "B": [4, 4]})
df2 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
assert calculate_dataframe_hash(df1) == calculate_dataframe_hash(df2)
# ✅ Collision confirmed
```
**Likelihood**: **Low** for natural data, but **trivial to construct** intentionally.
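Why such collisions are easy to construct can be shown with a toy model of a per-row combiner built only from invertible steps: each value is mixed with xorshifts and odd multiplications, XOR-ed into an accumulator, and the accumulator is multiplied by an odd constant mod 2^64. The constants, the multiplier schedule, and all function names below are assumptions chosen for illustration, not a verified copy of pandas internals. The same solving strategy applies to any combiner whose steps can be undone.

```python
MASK = (1 << 64) - 1      # emulate uint64 wrap-around
C1 = 0xBF58476D1CE4E5B9   # assumed mixing constants (splitmix64-style)
C2 = 0x94D049BB133111EB
SEED = 0x345678           # assumed initial accumulator
M0 = 1000003              # assumed first multiplier (odd, so invertible mod 2**64)


def mix(v: int) -> int:
    """Scramble one 64-bit value; every step is a bijection on uint64."""
    v &= MASK
    v ^= v >> 30
    v = (v * C1) & MASK
    v ^= v >> 27
    v = (v * C2) & MASK
    v ^= v >> 31
    return v


def undo_xorshift(x: int, shift: int) -> int:
    """Invert y ^= y >> shift by recovering `shift` more top bits per pass."""
    y = x
    for _ in range(64 // shift + 1):
        y = x ^ (y >> shift)
    return y


def unmix(h: int) -> int:
    """Exact inverse of mix()."""
    h = undo_xorshift(h, 31)
    h = (h * pow(C2, -1, 1 << 64)) & MASK
    h = undo_xorshift(h, 27)
    h = (h * pow(C1, -1, 1 << 64)) & MASK
    return undo_xorshift(h, 30)


def combine(values) -> int:
    """Toy row hash: XOR each mixed value in, then multiply by an odd number."""
    out, mult = SEED, M0
    for v in values:
        out = ((out ^ mix(v)) * mult) & MASK
        mult += 2  # keeps the multiplier odd; the real schedule is an assumption
    return out


def forge_first_value(target_row, second_value: int) -> int:
    """Solve for `a` so that combine([a, second_value]) == combine(target_row)."""
    a0, b0 = target_row
    state_after_first = ((SEED ^ mix(a0)) * M0) & MASK
    # Accumulator state needed after column one for the second XOR to cancel out.
    needed = state_after_first ^ mix(b0) ^ mix(second_value)
    return unmix(((needed * pow(M0, -1, 1 << 64)) & MASK) ^ SEED)


forged = forge_first_value((1, 3), 4)
print(forged, combine([forged, 4]) == combine([1, 3]))
```

Solving takes a handful of integer operations, which is why a 64-bit non-cryptographic row hash should not be relied on against crafted input, even when SHA-1 is applied afterwards.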
### Type 5: Combined Attack (Multiple Types Together)
```python
df_original = pd.DataFrame({
    "is_active": [True, False],
    "created_at": pd.to_datetime(["1970-01-01T00:00:00.000000001",
                                  "1970-01-01T00:00:00.000000002"]),
    "score": np.array([10, 20], dtype=np.int8),
})
df_shadow = pd.DataFrame({
    "is_active": [1, 0],
    "created_at": [1, 2],
    "score": np.array([10, 20], dtype=np.int64),
})
assert calculate_dataframe_hash(df_original) == calculate_dataframe_hash(df_shadow)
# ✅ Collision confirmed — completely different schema, same hash
```
## Impact
### Artifact Path Collision
`resolve_dataframe_target_hash_path` generates `target_path = f"{artifact_path}/{hash}{suffix}"`. Two different DataFrames with the same hash produce the same path, leading to:
1. **Silent data overwrite** — a later upload replaces an earlier artifact at the same path without warning.
2. **Stale data reads** — a new dataset is skipped because its hash matches an existing artifact, causing the pipeline to use outdated data.
3. **Metadata corruption** — `metadata.hash` no longer uniquely identifies the artifact content.
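The overwrite case can be simulated locally with a dictionary standing in for object storage. In this sketch `target_path_for` is a hypothetical helper that only mirrors the path formatting shown in `resolve_dataframe_target_hash_path`; the `.parquet` suffix, the `/artifacts` prefix, and the dataset names are assumptions.

```python
import hashlib

import pandas as pd


def calculate_dataframe_hash(dataframe: pd.DataFrame) -> str:
    # Same expression as the MLRun helper quoted in this report.
    return hashlib.sha1(pd.util.hash_pandas_object(dataframe).values).hexdigest()


def target_path_for(dataframe, artifact_path: str, suffix: str = ".parquet") -> str:
    # Hypothetical stand-in for the path formatting step.
    if not artifact_path.endswith("/"):
        artifact_path += "/"
    return f"{artifact_path}{calculate_dataframe_hash(dataframe)}{suffix}"


df_bool = pd.DataFrame({"flag": [True, False, True], "value": [10, 20, 30]})
df_int = pd.DataFrame({"flag": [1, 0, 1], "value": [10, 20, 30]})

store = {}  # in-memory stand-in for the artifact store
for name, df in [("bool_flags", df_bool), ("int_flags", df_int)]:
    store[target_path_for(df, "/artifacts")] = name  # second write replaces the first

print(store)
```

Two uploads leave a single entry in the store, and it holds whichever dataset was written last.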
### Attack Scenario
```
User A uploads feature table (df_original) → hash "abc123" → stored at /artifacts/abc123.parquet
User B uploads crafted data (df_shadow) → hash "abc123" → same path
→ Result (a): overwrite → User A now reads poisoned data
→ Result (b): skip → User B reads User A's data (data leakage)
```
### Affected Workflows
| Workflow | Risk | Trigger |
|----------|------|---------|
| Feature Store ingestion | Incremental update skipped | Type 1, 3 (bool/int, dtype mismatch) |
| Dataset artifact caching | Wrong artifact served | All types |
| Model monitoring | Drift undetected | Type 1 (bool vs int labels) |
| Multi-tenant projects | Cross-user data collision | Type 4, 5 (adversarial) |
## CVSS Assessment
- **Vector**: `AV:N/AC:L/PR:L/UI:N/S:U/C:L/I:H/A:N`
- **Score**: **7.1 (High)** when the vector above is evaluated with the CVSS v3.1 base score formula
- Rationale: Authenticated user with project access can cause data integrity loss; Types 1 and 3 require no special crafting.
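The base score for a vector like the one above can be recomputed from the published CVSS v3.1 base score formula. The sketch below is a minimal calculator that assumes the vector uses v3.1 base metrics only (no temporal or environmental metrics); the function names are illustrative.

```python
import math

# Metric weights from the CVSS v3.1 specification (base metrics only).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}
# Privileges Required depends on whether scope is unchanged (U) or changed (C).
PR_WEIGHTS = {
    "U": {"N": 0.85, "L": 0.62, "H": 0.27},
    "C": {"N": 0.85, "L": 0.68, "H": 0.5},
}


def roundup(value: float) -> float:
    """Round up to one decimal place, as defined in the v3.1 specification."""
    as_int = round(value * 100000)
    if as_int % 10000 == 0:
        return as_int / 100000.0
    return (math.floor(as_int / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    m = dict(part.split(":") for part in vector.split("/"))
    scope_changed = m["S"] == "C"

    # Impact sub-score.
    iss = 1 - (
        (1 - WEIGHTS["C"][m["C"]])
        * (1 - WEIGHTS["I"][m["I"]])
        * (1 - WEIGHTS["A"][m["A"]])
    )
    if scope_changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss

    # Exploitability sub-score.
    exploitability = (
        8.22
        * WEIGHTS["AV"][m["AV"]]
        * WEIGHTS["AC"][m["AC"]]
        * PR_WEIGHTS[m["S"]][m["PR"]]
        * WEIGHTS["UI"][m["UI"]]
    )

    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if scope_changed:
        total *= 1.08
    return roundup(min(total, 10))


score = base_score("AV:N/AC:L/PR:L/UI:N/S:U/C:L/I:H/A:N")
print(score)
```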
## Suggested Fix
### Option A: Serialization-based hash (Recommended)
Replace `hash_pandas_object`-based hashing with a serialization-based approach that preserves dtype information:
```python
def calculate_dataframe_hash(dataframe: pandas.DataFrame):
    # Serialize to parquet (preserves schema/dtype), then hash the bytes
    buf = dataframe.to_parquet()  # returns bytes
    return hashlib.sha256(buf).hexdigest()
```
**Advantages:**
- Parquet encoding preserves exact dtypes — `bool`, `int8`, `int64`, `datetime64` all produce different byte representations.
- SHA-256 provides stronger collision resistance (upgrade from SHA-1).
- No dependency on pandas internal hashing implementation details.
**Trade-offs:**
- Slightly higher computational cost (serialization overhead).
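- Non-determinism risk: the serialized bytes depend on the parquet engine, its version, and writer options (for example the embedded `created_by` metadata and the compression codec), so the same DataFrame may hash differently across environments. Pin the engine and writer settings if the hash must be stable across environments.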
### Option B: Minimal change — prepend dtype info
```python
def calculate_dataframe_hash(dataframe: pandas.DataFrame):
    # Include dtype info in the hash to prevent cross-type collisions
    dtype_bytes = str(dataframe.dtypes.to_dict()).encode()
    data_bytes = pandas.util.hash_pandas_object(dataframe).values.tobytes()
    return hashlib.sha256(dtype_bytes + data_bytes).hexdigest()
```
This is a smaller change that addresses Types 1-3 (dtype collisions) while keeping the existing approach. Type 4 remains possible: it is unlikely to occur with natural data, but a deliberately crafted DataFrame with matching dtypes can still collide.
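A quick check of the dtype-prefix idea against the natural collision types might look as follows. The function name `dtype_aware_hash` is hypothetical and the sample frames are illustrative; the body is the Option B expression with `pd` as the pandas alias.

```python
import hashlib

import numpy as np
import pandas as pd


def dtype_aware_hash(dataframe: pd.DataFrame) -> str:
    # Prefix the per-row hashes with a textual description of column dtypes.
    dtype_bytes = str(dataframe.dtypes.to_dict()).encode()
    data_bytes = pd.util.hash_pandas_object(dataframe).values.tobytes()
    return hashlib.sha256(dtype_bytes + data_bytes).hexdigest()


df_bool = pd.DataFrame({"flag": [True, False, True], "value": [10, 20, 30]})
df_int = pd.DataFrame({"flag": [1, 0, 1], "value": [10, 20, 30]})
df_i8 = pd.DataFrame({"sensor": np.array([5, 3, 127], dtype=np.int8)})
df_i64 = pd.DataFrame({"sensor": np.array([5, 3, 127], dtype=np.int64)})

bool_int_differ = dtype_aware_hash(df_bool) != dtype_aware_hash(df_int)
width_differ = dtype_aware_hash(df_i8) != dtype_aware_hash(df_i64)
stable = dtype_aware_hash(df_bool) == dtype_aware_hash(df_bool.copy())

print(bool_int_differ, width_differ, stable)
```

Because the dtype dictionary also contains the column names, renaming a column changes this hash as well, which may or may not be desirable for caching.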
> **Note**: Both options change the hash output format, which means existing artifact paths will no longer match. A migration strategy or backward-compatible fallback may be needed.
## Reproduction Script
```python
"""
Run: python reproduce_hash_collision.py
Expected: All assertions pass, demonstrating collisions.
"""
import hashlib
import numpy as np
import pandas as pd


def calculate_dataframe_hash(dataframe: pd.DataFrame):
    """Exact copy from mlrun/utils/helpers.py:1626"""
    return hashlib.sha1(
        pd.util.hash_pandas_object(dataframe).values
    ).hexdigest()


def test_bool_vs_int():
    df1 = pd.DataFrame({"flag": [True, False], "val": [10, 20]})
    df2 = pd.DataFrame({"flag": [1, 0], "val": [10, 20]})
    h1, h2 = calculate_dataframe_hash(df1), calculate_dataframe_hash(df2)
    assert h1 == h2, f"Expected collision: {h1} != {h2}"
    print(f"[PASS] bool vs int: {h1}")


def test_datetime_vs_int():
    ts = pd.Timestamp("1970-01-01") + pd.Timedelta(nanoseconds=100)
    df1 = pd.DataFrame({"col": pd.array([ts], dtype="datetime64[ns]")})
    df2 = pd.DataFrame({"col": pd.array([100], dtype="int64")})
    h1, h2 = calculate_dataframe_hash(df1), calculate_dataframe_hash(df2)
    assert h1 == h2, f"Expected collision: {h1} != {h2}"
    print(f"[PASS] datetime vs int: {h1}")


def test_int8_vs_int64():
    # Non-negative values only: a negative int8 is reinterpreted as an unsigned
    # 8-bit value before widening, so it does not match the same int64 value.
    df1 = pd.DataFrame({"s": np.array([5, 3, 127], dtype=np.int8)})
    df2 = pd.DataFrame({"s": np.array([5, 3, 127], dtype=np.int64)})
    h1, h2 = calculate_dataframe_hash(df1), calculate_dataframe_hash(df2)
    assert h1 == h2, f"Expected collision: {h1} != {h2}"
    print(f"[PASS] int8 vs int64 |
|---|