## Summary
MLflow's dataset digest computation mechanism (`mlflow.data.digest_utils`) contains multiple weaknesses that allow an attacker to construct a semantically different dataset that produces an identical digest to a known legitimate dataset. This enables dataset spoofing, bypassing of validation pipelines, and corruption of experiment tracking integrity.
**Affected Component:** `mlflow/data/digest_utils.py`
**Severity:** High (CVSS 3.1 Base Score: 7.5)
**Attack Vector:** Network / Local
**CWE:** CWE-328 (Use of Weak Hash), CWE-354 (Improper Validation of Integrity Check Value)
---
## Affected Versions
All current versions of MLflow that use `compute_pandas_digest()`, `compute_numpy_digest()`, and `get_normalized_md5_digest()` in `mlflow/data/digest_utils.py`.
Affected dataset types include:
- `PandasDataset` (`mlflow/data/pandas_dataset.py`)
- `NumpyDataset` (`mlflow/data/numpy_dataset.py`)
- `HuggingFaceDataset` (`mlflow/data/huggingface_dataset.py`)
- `TensorFlowDataset` (`mlflow/data/tensorflow_dataset.py`)
- Any downstream consumer relying on the `(dataset_name, dataset_digest)` pair for identity
---
## Vulnerability Details
### 1. MD5 Truncation to 32 Bits (Critical)
**Location:** `mlflow/data/digest_utils.py`, line 107
```python
return md5.hexdigest()[:8]
```
The final digest is truncated from 128 bits (full MD5) to just 32 bits (8 hex characters). By the birthday paradox, the probability of at least one collision among `n` random digests is approximately `1 - e^(-n^2 / 2^33)`. This means:
| Number of datasets | Collision probability |
|----|-----|
| 10,000 | ~1.2% |
| 50,000 | ~25% |
| 65,536 | ~39% |
| 77,000 | ~50% |
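The figures in this table follow directly from the birthday approximation quoted above. The sketch below evaluates that formula for a 32-bit digest space; the function names are illustrative and not part of MLflow.

```python
import math

DIGEST_BITS = 32  # 8 hex characters, as produced by hexdigest()[:8]
SPACE = 2 ** DIGEST_BITS


def collision_probability(n: int, space: int = SPACE) -> float:
    """Approximate probability of at least one collision among n random digests."""
    return 1.0 - math.exp(-(n * n) / (2 * space))


def datasets_for_probability(p: float, space: int = SPACE) -> float:
    """Approximate number of random digests needed to reach collision probability p."""
    return math.sqrt(-2 * space * math.log(1.0 - p))


if __name__ == "__main__":
    for n in (10_000, 50_000, 65_536):
        print(f"{n:>7,} datasets -> {collision_probability(n):.1%}")
    print(f"50% reached near {datasets_for_probability(0.5):,.0f} datasets")
```

Changing `DIGEST_BITS` to 128 or 256 shows how quickly the same dataset counts become negligible once the digest is no longer truncated.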
A targeted brute-force second-preimage attack (finding a dataset that matches a known digest) against the full 128-bit MD5 output would need on the order of 2^128 trials. With truncation to 32 bits, each random candidate matches with probability 2^-32, so the expected cost drops to about **2^32 ≈ 4.3 billion** trials, and to no computation at all with the structural shortcuts described below.
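To make the effect of truncation concrete without a multi-billion-iteration run, the following sketch attacks a deliberately shorter 16-bit truncation (4 hex characters) of plain MD5 using only the standard library. It does not call MLflow; `truncated_md5` and `find_second_preimage` are hypothetical helpers, and the input bytes are placeholders.

```python
import hashlib
import itertools

TRUNCATED_HEX_CHARS = 4  # 16 bits here; the digest described above keeps 8 (32 bits)


def truncated_md5(data: bytes, hex_chars: int = TRUNCATED_HEX_CHARS) -> str:
    """Return the first hex_chars characters of the MD5 hex digest."""
    return hashlib.md5(data, usedforsecurity=False).hexdigest()[:hex_chars]


def find_second_preimage(target: bytes, hex_chars: int = TRUNCATED_HEX_CHARS) -> tuple[bytes, int]:
    """Try counter-derived inputs until one matches the target's truncated digest."""
    wanted = truncated_md5(target, hex_chars)
    for trials in itertools.count(1):
        candidate = f"candidate-{trials}".encode()
        if candidate != target and truncated_md5(candidate, hex_chars) == wanted:
            return candidate, trials
    raise AssertionError("unreachable: itertools.count() never ends")


if __name__ == "__main__":
    original = b"legitimate dataset bytes"
    forged, trials = find_second_preimage(original)
    print(f"match after {trials:,} trials: {forged!r}")
```

Each additional hex character kept multiplies the expected work by 16, so moving from 4 to 8 characters raises it from roughly 2^16 to roughly 2^32 trials.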
### 2. Non-Numeric/Non-String Columns Excluded from Digest (Critical)
**Location:** `mlflow/data/digest_utils.py`, lines 28-35
```python
string_columns = trimmed_df.columns[(df.map(type) == str).all(0)]
numeric_columns = trimmed_df.select_dtypes(include=[np.number]).columns
desired_columns = string_columns.union(numeric_columns)
trimmed_df = trimmed_df[desired_columns]
```
Columns of type `datetime64`, `timedelta64`, `bool`, `category`, mixed `object`, and any custom extension types are silently dropped before hashing. An attacker can produce two datasets with identical numeric and string columns but arbitrarily different content in excluded column types, yielding **identical digests with zero computational effort**.
### 3. Only First 10,000 Rows Hashed (High)
**Location:** `mlflow/data/digest_utils.py`, line 25
```python
trimmed_df = df.head(MAX_ROWS)
```
Only the first 10,000 rows of the DataFrame are included in the hash computation. While the total row count `len(df)` is included (line 40), any two DataFrames that share the same first 10,000 rows, the same column names, and the same total row count will produce identical digests — regardless of the content of rows 10,001 and beyond.
### 4. Type Coercion in `pd.util.hash_pandas_object` (Medium)
**Location:** `mlflow/data/digest_utils.py`, line 39
```python
pd.util.hash_pandas_object(trimmed_df).values
```
The underlying `pd.util.hash_pandas_object` and `pd.util.hash_array` perform implicit dtype promotion before hashing. This creates collision opportunities:
- **Integer width collisions:** Values like `np.int8(42)` and `np.int64(42)` may hash identically after internal promotion to a common 64-bit representation.
- **Float precision collisions:** `np.float32` values promoted to `np.float64` during hashing may collide with native `np.float64` values that differ at the float32 precision boundary.
- **Timestamp-integer aliasing:** A `pd.Timestamp` is internally stored as `int64` nanoseconds. While timestamp columns are currently filtered out (see issue #2), if this filtering is relaxed in future versions, integer columns could collide with timestamp columns sharing the same nanosecond representation.
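The integer-width case can be illustrated with a toy model. The sketch below is not pandas code: it assumes, as described above, that values are widened to a common unsigned 64-bit representation before hashing, and shows that mixing a dtype tag into the input separates the two cases. All helper names are hypothetical.

```python
import hashlib
import struct


def widen_to_u64(raw: bytes) -> bytes:
    """Reinterpret a little-endian integer as unsigned and zero-extend it to 8 bytes."""
    return struct.pack("<Q", int.from_bytes(raw, "little", signed=False))


def width_blind_hash(raw: bytes) -> str:
    """Hash only the widened value; the original width is not part of the input."""
    return hashlib.sha256(widen_to_u64(raw)).hexdigest()


def width_aware_hash(raw: bytes, dtype_name: str) -> str:
    """Prefix a length-delimited dtype name so equal values of different dtypes differ."""
    tag = dtype_name.encode()
    prefix = struct.pack("<I", len(tag)) + tag
    return hashlib.sha256(prefix + widen_to_u64(raw)).hexdigest()


int8_42 = struct.pack("<b", 42)   # 1-byte encoding of 42
int64_42 = struct.pack("<q", 42)  # 8-byte encoding of 42

# Without the dtype tag, the two encodings are indistinguishable after widening.
assert width_blind_hash(int8_42) == width_blind_hash(int64_42)
# With the tag, the same value under different dtypes hashes differently.
assert width_aware_hash(int8_42, "int8") != width_aware_hash(int64_42, "int64")
```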
---
## Proof of Concept
### PoC 1: Column Filtering Bypass (Zero Effort)
```python
import pandas as pd
from mlflow.data.digest_utils import compute_pandas_digest
# Legitimate dataset
df_legitimate = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0],
    "label": [0, 1, 0],
    "review_date": pd.to_datetime(["2024-01-01", "2024-06-15", "2024-12-31"]),
    "is_validated": [True, True, True],
})
# Malicious dataset: non-numeric/non-string columns altered
df_malicious = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0],
    "label": [0, 1, 0],
    "review_date": pd.to_datetime(["2099-01-01", "2099-01-01", "2099-01-01"]),
    "is_validated": [False, False, False],
})
assert compute_pandas_digest(df_legitimate) == compute_pandas_digest(df_malicious)
# Passes — digests are identical despite different dates and booleans
```
### PoC 2: Row Truncation Bypass (Zero Effort)
```python
import numpy as np
import pandas as pd
from mlflow.data.digest_utils import compute_pandas_digest
np.random.seed(42)
shared_head = pd.DataFrame({"value": np.random.randn(10000)})
df_legitimate = pd.concat([
    shared_head,
    pd.DataFrame({"value": np.zeros(5000)}),  # benign tail
], ignore_index=True)
df_malicious = pd.concat([
    shared_head,
    pd.DataFrame({"value": np.full(5000, 999.0)}),  # poisoned tail
], ignore_index=True)
assert compute_pandas_digest(df_legitimate) == compute_pandas_digest(df_malicious)
# Passes — rows beyond 10,000 are invisible to the digest
```
### PoC 3: Brute-Force 32-bit Collision
```python
import itertools

import pandas as pd
from mlflow.data.digest_utils import compute_pandas_digest

target_digest = compute_pandas_digest(pd.DataFrame({"x": [1.0, 2.0, 3.0]}))
# Each candidate matches with probability 2^-32, so expect roughly
# 2^32 (about 4.3 billion) iterations before a hit.
for i in itertools.count():
    candidate = pd.DataFrame({"x": [float(i)]})
    if compute_pandas_digest(candidate) == target_digest:
        print(f"Collision found: i={i}, digest={target_digest}")
        break
```
---
## Impact
### Dataset Identity Spoofing
MLflow identifies datasets by the `(name, digest)` tuple in the tracking store. A collision allows a malicious dataset to be treated as a previously validated, legitimate dataset. Downstream systems that cache validation results or skip re-processing based on digest matching will accept the spoofed dataset without inspection.
### Experiment Tracking Corruption
Evaluation metrics are associated with datasets via `dataset_name + dataset_digest` (see `mlflow/tracking/client.py`). Digest collisions cause metrics from different datasets to be incorrectly merged or attributed, corrupting experiment comparisons and model selection decisions.
### ML Pipeline Poisoning
In CI/CD-integrated ML pipelines, dataset digests may serve as cache keys for data validation, preprocessing, or feature engineering steps. A colliding malicious dataset could bypass these steps entirely, allowing poisoned data to flow into model training undetected.
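The cache-key scenario can be sketched in a few lines. Everything here is hypothetical (the cache layout, the `validate_with_cache` helper, the validator, and the digest string `a1b2c3d4`); it only illustrates how a verdict cached under a `(name, digest)` pair is reused for different data that presents the same pair.

```python
from typing import Callable, Sequence

# Hypothetical cache of validation verdicts keyed by (dataset_name, dataset_digest).
ValidationCache = dict[tuple[str, str], bool]


def validate_with_cache(
    name: str,
    digest: str,
    rows: Sequence[float],
    validator: Callable[[Sequence[float]], bool],
    cache: ValidationCache,
) -> bool:
    """Run the validator only when the (name, digest) pair has not been seen before."""
    key = (name, digest)
    if key not in cache:
        cache[key] = validator(rows)
    return cache[key]


validator_calls: list[int] = []


def no_outliers(rows: Sequence[float]) -> bool:
    """Toy validator that records each invocation and rejects large values."""
    validator_calls.append(len(rows))
    return all(abs(value) < 100 for value in rows)


cache: ValidationCache = {}
# The legitimate data is validated once and the verdict is cached.
assert validate_with_cache("train", "a1b2c3d4", [1.0, 2.0], no_outliers, cache) is True
# Poisoned data presented under the same (name, digest) pair is never inspected.
assert validate_with_cache("train", "a1b2c3d4", [1.0, 999.0], no_outliers, cache) is True
assert len(validator_calls) == 1
```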
### Audit Trail Compromise
Dataset digests stored in the MLflow tracking backend serve as the data lineage record. Collisions undermine the trustworthiness of the entire audit trail — it becomes impossible to determine, after the fact, whether a given model was trained on the legitimate or malicious dataset.
---
## Recommended Remediation
### Short Term (Patch)
1. **Extend digest length:** Use the full MD5 hexdigest (32 characters / 128 bits) or migrate to SHA-256 (64 characters / 256 bits). This raises the brute-force cost from ~2^16 (birthday) to ~2^64 or ~2^128 respectively.
```python
# Before
return md5.hexdigest()[:8]
# After (minimum fix): keep the full 128-bit MD5 hex digest
return md5.hexdigest()
```
2. **Hash all columns:** Serialize excluded column types (e.g., via `df.to_json()` or column-wise `repr()`) and include them in the digest input.
```python
# Include all columns, not just string/numeric
all_columns_hash = pd.util.hash_pandas_object(
    trimmed_df.astype(str)
).values
```
3. **Include dtype information in digest:** Encode `df.dtypes` as part of the hash input to prevent type coercion collisions.
```python
dtype_bytes = str(df.dtypes.to_dict()).encode()
elements.append(dtype_bytes)
```
### Medium Term
4. **Hash beyond the first 10,000 rows:** Adopt a head-and-tail sampling strategy (already used by `EvaluationDataset`) or compute a streaming hash over the full dataset.
```python
if len(df) <= MAX_ROWS:
    trimmed_df = df
else:
    # Sampling both ends still leaves the middle rows unhashed
    trimmed_df = pd.concat([df.head(MAX_ROWS // 2), df.tail(MAX_ROWS // 2)])
```
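For the streaming alternative, a minimal standard-library sketch is shown here. It is not MLflow's implementation: `streaming_digest` is a hypothetical helper, and `repr()` stands in for whatever stable cell serialization a real fix would choose. With pandas, rows could be supplied by `df.itertuples(index=False)`.

```python
import hashlib
from typing import Iterable, Sequence


def streaming_digest(rows: Iterable[Sequence[object]]) -> str:
    """SHA-256 over every row, fed incrementally so memory use stays flat."""
    hasher = hashlib.sha256()
    row_count = 0
    for row in rows:
        # Record the cell count so row boundaries cannot be shifted.
        hasher.update(len(row).to_bytes(8, "little"))
        for cell in row:
            encoded = repr(cell).encode()
            # Length-prefix each cell so adjacent cells cannot be re-split.
            hasher.update(len(encoded).to_bytes(8, "little"))
            hasher.update(encoded)
        row_count += 1
    hasher.update(row_count.to_bytes(8, "little"))
    return hasher.hexdigest()


# Same first 10,000 rows, different tails: the digests now differ.
head = [(float(i),) for i in range(10_000)]
benign = head + [(0.0,)] * 5_000
poisoned = head + [(999.0,)] * 5_000
assert streaming_digest(benign) != streaming_digest(poisoned)
```

The trade-off is cost: hashing every row scales linearly with dataset size, which is presumably why the current code caps the row count.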
5. **Add schema to digest:** Include column names, dtypes, and shape as structured input to the hash to make the digest sensitive to structural changes.
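One way to fold the schema into the digest is to serialize it canonically and hash it ahead of the content. The sketch below assumes the caller supplies column names, dtype names, shape, and the already-computed content bytes; the helper names are hypothetical. With pandas, the mapping could be built from `df.dtypes.items()` and the shape from `df.shape`.

```python
import hashlib
import json


def schema_bytes(columns: dict[str, str], shape: tuple[int, int]) -> bytes:
    """Canonical bytes describing column order, dtype names, and shape."""
    payload = {
        # A list of pairs preserves column order explicitly.
        "columns": [[name, dtype] for name, dtype in columns.items()],
        "shape": list(shape),
    }
    return json.dumps(payload, separators=(",", ":"), ensure_ascii=True).encode()


def digest_with_schema(columns: dict[str, str], shape: tuple[int, int], content: bytes) -> str:
    """SHA-256 over a length-prefixed schema block followed by the content bytes."""
    schema = schema_bytes(columns, shape)
    hasher = hashlib.sha256()
    hasher.update(len(schema).to_bytes(8, "little"))
    hasher.update(schema)
    hasher.update(content)
    return hasher.hexdigest()


# Identical content bytes, different declared dtype: the digests differ.
same_content = b"placeholder-for-row-hashes"
as_int8 = digest_with_schema({"x": "int8"}, (3, 1), same_content)
as_int64 = digest_with_schema({"x": "int64"}, (3, 1), same_content)
assert as_int8 != as_int64
```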
### Long Term
6. **Migrate to a collision-resistant hash function:** Replace MD5 (even untruncated) with SHA-256 for all digest computations, since practical collision attacks against MD5 are well known and it is considered cryptographically broken.
7. **Document digest security guarantees:** Clearly document that dataset digests are intended for deduplication convenience and are not a security mechanism — or upgrade them to be one.
---
## Attack Complexity Assessment
| Attack Vector | Complexity | Effort | Prerequisites |
|---|---|---|---|
| Column filtering bypass | None | Trivial — modify excluded columns | Knowledge of target dataset schema |
| Row truncation bypass | None | Trivial — modify rows > 10,000 | Dataset has > 10,000 rows |
| 32-bit brute force | Low | About 2^32 digest evaluations on average; practical on commodity hardware with an optimized hasher | Target digest value |
| Type coercion collision | Medium | Requires understanding of pandas internals | Specific dtype combinations |