Submission #831462 info: mlflow 3.10.0 Digest Collision

Title	mlflow 3.10.0 Digest Collision
Description

## Summary

MLflow's dataset digest computation mechanism (`mlflow.data.digest_utils`) contains multiple weaknesses that allow an attacker to construct a semantically different dataset that produces an identical digest to a known legitimate dataset. This enables dataset spoofing, bypassing of validation pipelines, and corruption of experiment tracking integrity.

- Affected Component: `mlflow/data/digest_utils.py`
- Severity: High (CVSS 3.1 Base Score: 7.5)
- Attack Vector: Network / Local
- CWE: CWE-328 (Use of Weak Hash), CWE-354 (Improper Validation of Integrity Check Value)

---

## Affected Versions

All current versions of MLflow that use `compute_pandas_digest()`, `compute_numpy_digest()`, and `get_normalized_md5_digest()` in `mlflow/data/digest_utils.py`.

Affected dataset types include:

- `PandasDataset` (`mlflow/data/pandas_dataset.py`)
- `NumpyDataset` (`mlflow/data/numpy_dataset.py`)
- `HuggingFaceDataset` (`mlflow/data/huggingface_dataset.py`)
- `TensorFlowDataset` (`mlflow/data/tensorflow_dataset.py`)
- Any downstream consumer relying on the `(dataset_name, dataset_digest)` pair for identity

---

## Vulnerability Details

### 1. MD5 Truncation to 32 Bits (Critical)

Location: `mlflow/data/digest_utils.py`, line 107

```python
return md5.hexdigest()[:8]
```

The final digest is truncated from 128 bits (full MD5) to just 32 bits (8 hex characters). By the birthday paradox, the probability of at least one collision among `n` random digests is approximately `1 - e^(-n^2 / 2^33)`. This means:

| Number of datasets | Collision probability |
|----|-----|
| 10,000 | ~1.2% |
| 50,000 | ~25% |
| 65,536 | ~39% |

A targeted brute-force second-preimage attack (finding a dataset matching a known digest) would require about 2^128 trials against full MD5. With 32-bit truncation this drops to about 2^32 ≈ 4.3 billion trials on average, and far fewer with the structural shortcuts described below.

### 2. Non-Numeric/Non-String Columns Excluded from Digest (Critical)

Location: `mlflow/data/digest_utils.py`, lines 28-35

```python
string_columns = trimmed_df.columns[(df.map(type) == str).all(0)]
numeric_columns = trimmed_df.select_dtypes(include=[np.number]).columns
desired_columns = string_columns.union(numeric_columns)
trimmed_df = trimmed_df[desired_columns]
```

Columns of type `datetime64`, `timedelta64`, `bool`, `category`, mixed `object`, and any custom extension types are silently dropped before hashing. An attacker can produce two datasets with identical numeric and string columns but arbitrarily different content in excluded column types, yielding identical digests with zero computational effort.

### 3. Only First 10,000 Rows Hashed (High)

Location: `mlflow/data/digest_utils.py`, line 25

```python
trimmed_df = df.head(MAX_ROWS)
```

Only the first 10,000 rows of the DataFrame are included in the hash computation. While the total row count `len(df)` is included (line 40), any two DataFrames that share the same first 10,000 rows, the same column names, and the same total row count will produce identical digests, regardless of the content of rows 10,001 and beyond.

### 4. Type Coercion in `pd.util.hash_pandas_object` (Medium)

Location: `mlflow/data/digest_utils.py`, line 39

```python
pd.util.hash_pandas_object(trimmed_df).values
```

The underlying `pd.util.hash_pandas_object` and `pd.util.hash_array` perform implicit dtype promotion before hashing. This creates collision opportunities:

- Integer width collisions: Values like `np.int8(42)` and `np.int64(42)` may hash identically after internal promotion to a common 64-bit representation.
- Float precision collisions: `np.float32` values promoted to `np.float64` during hashing may collide with native `np.float64` values that differ at the float32 precision boundary.
- Timestamp-integer aliasing: A `pd.Timestamp` is internally stored as `int64` nanoseconds. While timestamp columns are currently filtered out (see issue #2), if this filtering is relaxed in future versions, integer columns could collide with timestamp columns sharing the same nanosecond representation.

---

## Proof of Concept

### PoC 1: Column Filtering Bypass (Zero Effort)

```python
import pandas as pd
from mlflow.data.digest_utils import compute_pandas_digest

# Legitimate dataset
df_legitimate = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0],
    "label": [0, 1, 0],
    "review_date": pd.to_datetime(["2024-01-01", "2024-06-15", "2024-12-31"]),
    "is_validated": [True, True, True],
})

# Malicious dataset: non-numeric/non-string columns altered
df_malicious = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0],
    "label": [0, 1, 0],
    "review_date": pd.to_datetime(["2099-01-01", "2099-01-01", "2099-01-01"]),
    "is_validated": [False, False, False],
})

assert compute_pandas_digest(df_legitimate) == compute_pandas_digest(df_malicious)
# Passes: digests are identical despite different dates and booleans
```

### PoC 2: Row Truncation Bypass (Zero Effort)

```python
import numpy as np
import pandas as pd
from mlflow.data.digest_utils import compute_pandas_digest

np.random.seed(42)
shared_head = pd.DataFrame({"value": np.random.randn(10000)})

df_legitimate = pd.concat([
    shared_head,
    pd.DataFrame({"value": np.zeros(5000)}),  # benign tail
], ignore_index=True)

df_malicious = pd.concat([
    shared_head,
    pd.DataFrame({"value": np.full(5000, 999.0)}),  # poisoned tail
], ignore_index=True)

assert compute_pandas_digest(df_legitimate) == compute_pandas_digest(df_malicious)
# Passes: rows beyond 10,000 are invisible to the digest
```

### PoC 3: Brute-Force 32-bit Collision

```python
import pandas as pd
from mlflow.data.digest_utils import compute_pandas_digest

target_digest = compute_pandas_digest(pd.DataFrame({"x": [1.0, 2.0, 3.0]}))

# Search bound for the demo. A match against one fixed 32-bit digest is
# expected only after about 2**32 candidates, so this loop may end without a hit.
for i in range(2**25):
    candidate = pd.DataFrame({"x": [float(i)]})
    if compute_pandas_digest(candidate) == target_digest:
        print(f"Collision found: i={i}, digest={target_digest}")
        break
```

---

## Impact

### Dataset Identity Spoofing

MLflow identifies datasets by the `(name, digest)` tuple in the tracking store. A collision allows a malicious dataset to be treated as a previously validated, legitimate dataset. Downstream systems that cache validation results or skip re-processing based on digest matching will accept the spoofed dataset without inspection.

### Experiment Tracking Corruption

Evaluation metrics are associated with datasets via `dataset_name + dataset_digest` (see `mlflow/tracking/client.py`). Digest collisions cause metrics from different datasets to be incorrectly merged or attributed, corrupting experiment comparisons and model selection decisions.

### ML Pipeline Poisoning

In CI/CD-integrated ML pipelines, dataset digests may serve as cache keys for data validation, preprocessing, or feature engineering steps. A colliding malicious dataset could bypass these steps entirely, allowing poisoned data to flow into model training undetected.

### Audit Trail Compromise

Dataset digests stored in the MLflow tracking backend serve as the data lineage record. Collisions undermine the trustworthiness of the entire audit trail: it becomes impossible to determine, after the fact, whether a given model was trained on the legitimate or malicious dataset.

---

## Recommended Remediation

### Short Term (Patch)

1. Extend digest length: Use the full MD5 hexdigest (32 characters / 128 bits) or migrate to SHA-256 (64 characters / 256 bits). This raises the birthday-collision cost from ~2^16 to ~2^64 or ~2^128 respectively.

   ```python
   # Before
   return md5.hexdigest()[:8]

   # After (minimum fix); note that 16 hex characters keep only 64 bits,
   # which gives a birthday bound of about 2^32
   return hashlib.sha256(md5.digest()).hexdigest()[:16]
   ```

2. Hash all columns: Serialize excluded column types (e.g., via `df.to_json()` or column-wise `repr()`) and include them in the digest input.

   ```python
   # Include all columns, not just string/numeric
   all_columns_hash = pd.util.hash_pandas_object(
       trimmed_df.astype(str)
   ).values
   ```

3. Include dtype information in digest: Encode `df.dtypes` as part of the hash input to prevent type coercion collisions.

   ```python
   dtype_bytes = str(df.dtypes.to_dict()).encode()
   elements.append(dtype_bytes)
   ```

### Medium Term

4. Hash beyond the first 10,000 rows: Adopt a head-and-tail sampling strategy (already used by `EvaluationDataset`) or compute a streaming hash over the full dataset.

   ```python
   if len(df) <= MAX_ROWS:
       # Small frames are hashed whole, so no row is selected twice
       trimmed_df = df
   else:
       trimmed_df = pd.concat([df.head(MAX_ROWS // 2), df.tail(MAX_ROWS // 2)])
   ```

5. Add schema to digest: Include column names, dtypes, and shape as structured input to the hash to make the digest sensitive to structural changes.

### Long Term

6. Migrate to a cryptographic hash function: Replace MD5 (even untruncated) with SHA-256 for all digest computations, as MD5 is considered cryptographically broken.

7. Document digest security guarantees: Clearly document that dataset digests are intended for deduplication convenience and are not a security mechanism, or upgrade them to be one.

---

## Attack Complexity Assessment

| Attack Vector | Complexity | Effort | Prerequisites |
|---|---|---|---|
| Column filtering bypass | None | Trivial: modify excluded columns | Knowledge of target dataset schema |
| Row truncation bypass | None | Trivial: modify rows > 10,000 | Dataset has > 10,000 rows |
| 32-bit brute force | Low | Seconds to minutes on commodity hardware | Target digest value |
| Type coercion collision | Medium | Requires understanding of pandas internals | Specific dtype combinations |
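
The birthday estimate and the effect of the 32-bit truncation from section 1 can be checked without MLflow. The sketch below uses only the Python standard library. The helper names `collision_probability`, `truncated_md5`, and `find_birthday_collision` are hypothetical, and hashing the byte strings `b"0"`, `b"1"`, ... is an assumption standing in for hashing real datasets.

```python
import hashlib
import math


def collision_probability(n: int, bits: int = 32) -> float:
    """Birthday approximation: P(at least one collision among n random digests)."""
    return 1.0 - math.exp(-(n * n) / 2 ** (bits + 1))


def truncated_md5(data: bytes, hex_chars: int = 8) -> str:
    """MD5 hex digest cut to `hex_chars` characters (8 characters = 32 bits)."""
    return hashlib.md5(data).hexdigest()[:hex_chars]


def find_birthday_collision(hex_chars: int = 8, limit: int = 2_000_000):
    """Hash b"0", b"1", ... until two inputs share a truncated digest.

    Returns (first_input, second_input, digest), or None if `limit` is reached.
    """
    seen = {}  # truncated digest -> first integer that produced it
    for i in range(limit):
        digest = truncated_md5(str(i).encode(), hex_chars)
        if digest in seen:
            return seen[digest], i, digest
        seen[digest] = i
    return None


if __name__ == "__main__":
    # Reproduce the collision-probability table for a 32-bit digest.
    for n in (10_000, 50_000, 65_536):
        print(f"{n:>7,} digests -> {collision_probability(n):.1%}")
    # Any-pair (birthday) collision, expected after roughly 2**16 inputs.
    print(find_birthday_collision())
```

The birthday search finds some colliding pair quickly. Matching one specific target digest, as in PoC 3, is the much more expensive second-preimage case.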
Source	⚠️ https://github.com/mlflow/mlflow/issues/22419
User	Dem0 (UID 82596)
Submitted	2026-05-16 10:55 (20 days ago)
Moderated	2026-06-04 07:07 (19 days later)
Status	Accepted
VulDB entry	368252 [MLflow up to 3.10.0 Dataset Digest Computation digest_utils.py mlflow.data.digest_utils weak encryption]
Points	20
