Submit #831419: mlrun v1.12.0-rc3 Hash Collisioninfo

Titel	mlrun v1.12.0-rc3 Hash Collision
Beschreibung	## Summary `mlrun.utils.helpers.calculate_dataframe_hash` relies on `pandas.util.hash_pandas_object` as its sole fingerprinting mechanism. Due to the way pandas internally hashes different dtypes (bool→uint64, datetime64→int64, int8→uint64 type promotion) and combines column hashes via linear arithmetic (mod 2^64), semantically different DataFrames can produce identical hash values. Since the hash is used to construct artifact storage paths (`{artifact_path}/{hash}{suffix}`), this leads to artifact path collision, resulting in silent data overwrite or stale data reads. ## Affected Version - Verified on v1.11.0-rc42 (commit `f802de1f4`) - Affects all versions containing `calculate_dataframe_hash` ## Affected Code ### Root Cause [`mlrun/utils/helpers.py:1626-1628`](https://github.com/mlrun/mlrun/blob/development/mlrun/utils/helpers.py#L1626-L1628) ```python def calculate_dataframe_hash(dataframe: pandas.DataFrame): # https://stackoverflow.com/questions/49883236/... return hashlib.sha1(pandas.util.hash_pandas_object(dataframe).values).hexdigest() ``` ### Impact Point [`mlrun/artifacts/dataset.py:250-261`](https://github.com/mlrun/mlrun/blob/development/mlrun/artifacts/dataset.py#L250-L261) — `resolve_dataframe_target_hash_path` uses this hash to derive `target_path`; the hash is also stored as `metadata.hash`. ```python def resolve_dataframe_target_hash_path(self, dataframe, artifact_path: str): dataframe_hash = mlrun.utils.helpers.calculate_dataframe_hash(dataframe) suffix = self._resolve_suffix() artifact_path = ( artifact_path + "/" if not artifact_path.endswith("/") else artifact_path ) target_path = f"{artifact_path}{dataframe_hash}{suffix}" return dataframe_hash, target_path ``` ### Caller [`mlrun/artifacts/dataset.py:214-220`](https://github.com/mlrun/mlrun/blob/development/mlrun/artifacts/dataset.py#L214-L220) — `DatasetArtifact.upload()` assigns the hash to `metadata.hash` and the derived path to `spec.target_path`. ## Collision Types ### Type 1: Boolean vs Integer (Natural, High Likelihood) `True` → `uint64(1)`, `False` → `uint64(0)`. Identical to integer `1` and `0`. ```python df_bool = pd.DataFrame({"flag": [True, False, True], "value": [10, 20, 30]}) df_int = pd.DataFrame({"flag": [1, 0, 1], "value": [10, 20, 30]}) assert calculate_dataframe_hash(df_bool) == calculate_dataframe_hash(df_int) # ✅ Collision confirmed ``` Likelihood: High — boolean/integer coercion is extremely common in ML pipelines (label encoding, feature preprocessing). ### Type 2: Datetime vs Integer (Natural, Medium Likelihood) `datetime64[ns]` is hashed via `.view("i8")`, producing the same bit pattern as an `int64` with the same nanosecond value. ```python ts = pd.Timestamp("1970-01-01") + pd.Timedelta(nanoseconds=100) df_time = pd.DataFrame({"col": pd.array([ts], dtype="datetime64[ns]")}) df_nint = pd.DataFrame({"col": pd.array([100], dtype="int64")}) assert calculate_dataframe_hash(df_time) == calculate_dataframe_hash(df_nint) # ✅ Collision confirmed ``` Likelihood: Medium — occurs when feature engineering converts timestamps to numeric representations. ### Type 3: Integer Precision Promotion (Natural, High Likelihood) All sub-64-bit integer types (`int8`, `int16`, `int32`) are upcast to `uint64` before hashing. ```python df_i8 = pd.DataFrame({"sensor": np.array([5, -3, 127], dtype=np.int8)}) df_i64 = pd.DataFrame({"sensor": np.array([5, -3, 127], dtype=np.int64)}) assert calculate_dataframe_hash(df_i8) == calculate_dataframe_hash(df_i64) # ✅ Collision confirmed ``` Likelihood: High — dtype narrowing/widening is common when data passes through different systems (edge devices → server, CSV parsing → typed storage). ### Type 4: Cross-Column Linear Overflow (Adversarial) Pandas combines per-column hashes via `result = result * multiplier + col_hash` under uint64 modular arithmetic. An attacker can solve for values that produce the same combined hash. ```python df1 = pd.DataFrame({"A": [1604090909467468979, 2], "B": [4, 4]}) df2 = pd.DataFrame({"A": [1, 2], "B": [3, 4]}) assert calculate_dataframe_hash(df1) == calculate_dataframe_hash(df2) # ✅ Collision confirmed ``` Likelihood: Low for natural data, but trivial to construct intentionally. ### Type 5: Combined Attack (Multiple Types Together) ```python df_original = pd.DataFrame({ "is_active": [True, False], "created_at": pd.to_datetime(["1970-01-01T00:00:00.000000001", "1970-01-01T00:00:00.000000002"]), "score": np.array([10, 20], dtype=np.int8), }) df_shadow = pd.DataFrame({ "is_active": [1, 0], "created_at": [1, 2], "score": np.array([10, 20], dtype=np.int64), }) assert calculate_dataframe_hash(df_original) == calculate_dataframe_hash(df_shadow) # ✅ Collision confirmed — completely different schema, same hash ``` ## Impact ### Artifact Path Collision `resolve_dataframe_target_hash_path` generates `target_path = f"{artifact_path}/{hash}{suffix}"`. Two different DataFrames with the same hash produce the same path, leading to: 1. Silent data overwrite — a later upload replaces an earlier artifact at the same path without warning. 2. Stale data reads — a new dataset is skipped because its hash matches an existing artifact, causing the pipeline to use outdated data. 3. Metadata corruption — `metadata.hash` no longer uniquely identifies the artifact content. ### Attack Scenario ``` User A uploads feature table (df_original) → hash "abc123" → stored at /artifacts/abc123.parquet User B uploads crafted data (df_shadow) → hash "abc123" → same path → Result (a): overwrite → User A now reads poisoned data → Result (b): skip → User B reads User A's data (data leakage) ``` ### Affected Workflows \| Workflow \| Risk \| Trigger \| \|----------\|------\|---------\| \| Feature Store ingestion \| Incremental update skipped \| Type 1, 3 (bool/int, dtype mismatch) \| \| Dataset artifact caching \| Wrong artifact served \| All types \| \| Model monitoring \| Drift undetected \| Type 1 (bool vs int labels) \| \| Multi-tenant projects \| Cross-user data collision \| Type 4, 5 (adversarial) \| ## CVSS Assessment - Vector: `AV:N/AC:L/PR:L/UI:N/S:U/C:L/I:H/A:N` - Score: approximately 6.5 (Medium) - Rationale: Authenticated user with project access can cause data integrity loss; Types 1 and 3 require no special crafting. ## Suggested Fix ### Option A: Serialization-based hash (Recommended) Replace `hash_pandas_object`-based hashing with a serialization-based approach that preserves dtype information: ```python def calculate_dataframe_hash(dataframe: pandas.DataFrame): # Serialize to parquet (preserves schema/dtype), then hash the bytes buf = dataframe.to_parquet() # returns bytes return hashlib.sha256(buf).hexdigest() ``` Advantages: - Parquet encoding preserves exact dtypes — `bool`, `int8`, `int64`, `datetime64` all produce different byte representations. - SHA-256 provides stronger collision resistance (upgrade from SHA-1). - No dependency on pandas internal hashing implementation details. Trade-offs: - Slightly higher computational cost (serialization overhead). - Non-determinism risk: parquet metadata may include timestamps. Use `coerce_timestamps` and disable `write_statistics` if determinism is required. ### Option B: Minimal change — prepend dtype info ```python def calculate_dataframe_hash(dataframe: pandas.DataFrame): # Include dtype info in the hash to prevent cross-type collisions dtype_bytes = str(dataframe.dtypes.to_dict()).encode() data_bytes = pandas.util.hash_pandas_object(dataframe).values.tobytes() return hashlib.sha256(dtype_bytes + data_bytes).hexdigest() ``` This is a smaller change that addresses Types 1-3 (dtype collisions) while keeping the existing approach. Type 4 (linear overflow) remains theoretically possible but is not practically exploitable without adversarial intent. > Note: Both options change the hash output format, which means existing artifact paths will no longer match. A migration strategy or backward-compatible fallback may be needed. ## Reproduction Script ```python """ Run: python reproduce_hash_collision.py Expected: All assertions pass, demonstrating collisions. """ import hashlib import numpy as np import pandas as pd def calculate_dataframe_hash(dataframe: pd.DataFrame): """Exact copy from mlrun/utils/helpers.py:1626""" return hashlib.sha1( pd.util.hash_pandas_object(dataframe).values ).hexdigest() def test_bool_vs_int(): df1 = pd.DataFrame({"flag": [True, False], "val": [10, 20]}) df2 = pd.DataFrame({"flag": [1, 0], "val": [10, 20]}) h1, h2 = calculate_dataframe_hash(df1), calculate_dataframe_hash(df2) assert h1 == h2, f"Expected collision: {h1} != {h2}" print(f"[PASS] bool vs int: {h1}") def test_datetime_vs_int(): ts = pd.Timestamp("1970-01-01") + pd.Timedelta(nanoseconds=100) df1 = pd.DataFrame({"col": pd.array([ts], dtype="datetime64[ns]")}) df2 = pd.DataFrame({"col": pd.array([100], dtype="int64")}) h1, h2 = calculate_dataframe_hash(df1), calculate_dataframe_hash(df2) assert h1 == h2, f"Expected collision: {h1} != {h2}" print(f"[PASS] datetime vs int: {h1}") def test_int8_vs_int64(): df1 = pd.DataFrame({"s": np.array([5, -3, 127], dtype=np.int8)}) df2 = pd.DataFrame({"s": np.array([5, -3, 127], dtype=np.int64)}) h1, h2 = calculate_dataframe_hash(df1), calculate_dataframe_hash(df2) assert h1 == h2, f"Expected collision: {h1} != {h2}" print(f"[PASS] int8 vs int64
Quelle	⚠️ https://github.com/mlrun/mlrun/issues/9691
Benutzer	Dem0 (UID 82596)
Einreichung	16.05.2026 06:37 (vor 3 Monaten)
Moderieren	03.06.2026 17:40 (18 days later)
Status	Akzeptiert
VulDB Eintrag	368136 [mlrun bis 1.12.0-rc3 DataFrame Hash mlrun/utils/helpers.py mlrun.utils.helpers.calculate_dataframe_hash schwache Verschlüsselung]
Punkte	20

◂ Zurück Übersicht Weiter ▸

Do you need the next level of professionalism?

Upgrade your account now!