| Title | onnx onnx-mlir v0.5.0.0 cache key collision |
|---|
| Description | ## Affected Component
```text
Component: torch_onnxmlir backend session cache
Affected file: src/Runtime/python/torch_onnxmlir/src/torch_onnxmlir/backend.py
Affected function: generate_hash_key()
Related cache object: global_session_cache / SessionCache
```
## Affected Versions
```text
Affected: onnx-mlir revisions containing the lightweight hashing implementation
that generated placeholder cache key material from tensor shape without tensor
dtype.
Confirmed affected: parent commit 27a08138aa182c526bb559e38a8921901a0b4646.
Fixed: commit 1a25fe4155065fb8c5de4c3fe55c39531cc18e8a, merged through PR #3427
on 2026-03-30. GitHub reports merge commit
72c5187ff6d13c2c2b3d3789b8f5faf99f08a5b4.
Exact released version range: not verified. Confirm with the onnx-mlir
maintainers or release notes before submitting as a final VulDB entry.
```
## Vulnerability Class
```text
Cache key collision / improper cache key generation / incomplete comparison
with missing factor
```
Suggested CWE mapping:
```text
CWE-1023: Incomplete Comparison with Missing Factors
```
Secondary CWE candidates:
```text
CWE-345: Insufficient Verification of Data Authenticity
CWE-706: Use of Incorrectly-Resolved Name or Reference
```
## Severity
Suggested conservative library-level severity:
```text
Medium
CVSS 3.1: 5.5
Vector: CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:N/I:H/A:N
```
Rationale:
```text
The issue corrupts inference integrity but does not directly provide code
execution or memory corruption. Exploitation requires the attacker to influence
models or example inputs in the same cache namespace. In a networked or
multi-tenant inference service that exposes model compilation or dtype selection
to untrusted users, the deployment-specific score may be higher.
```
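The suggested 5.5 score can be reproduced from the vector above. The following is a minimal sketch of the CVSS 3.1 base score arithmetic for the Scope: Unchanged case only, using the metric weights and round-up rule from the CVSS 3.1 specification. The helper names (`roundup`, `base_score_scope_unchanged`) are illustrative and not part of any library.

```python
import math

# CVSS 3.1 metric weights. The PR values apply to Scope: Unchanged only.
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "CIA": {"H": 0.56, "L": 0.22, "N": 0.0},
}


def roundup(value):
    """Round up to one decimal place, following the CVSS 3.1 definition."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score_scope_unchanged(vector):
    """Compute the base score of a CVSS 3.1 vector that has S:U."""
    # Skip the leading "CVSS:3.1" element and parse "KEY:VALUE" pairs.
    metrics = dict(part.split(":") for part in vector.split("/")[1:])
    if metrics["S"] != "U":
        raise ValueError("this sketch only handles Scope: Unchanged")
    cia = WEIGHTS["CIA"]
    iss = 1 - (1 - cia[metrics["C"]]) * (1 - cia[metrics["I"]]) * (1 - cia[metrics["A"]])
    impact = 6.42 * iss
    exploitability = (
        8.22
        * WEIGHTS["AV"][metrics["AV"]]
        * WEIGHTS["AC"][metrics["AC"]]
        * WEIGHTS["PR"][metrics["PR"]]
        * WEIGHTS["UI"][metrics["UI"]]
    )
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


score = base_score_scope_unchanged("CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:N/I:H/A:N")
print(score)  # 5.5
```

With only Integrity set to High, the impact sub-score is about 3.60 and the exploitability sub-score about 1.83, so the sum of roughly 5.43 rounds up to 5.5.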
## Summary
```text
onnx-mlir contains a cache key collision issue in the torch_onnxmlir backend.
When generate_hash_key() used the lightweight hashing path, placeholder nodes
were represented by placeholder index and tensor shape only. Tensor dtype was
not included in the cache key material.
Two torch.fx GraphModule instances with the same graph structure and input shape
but different input dtypes, such as torch.float32 and torch.float64, could
therefore produce the same cache key. The second model could silently reuse the
first model's compiled onnx-mlir shared object and return incorrect inference
results without an error or warning.
```
## Technical Description
The vulnerable logic is in `generate_hash_key()` when
`use_lightweight_hashing=True`, which is the default value of that argument in
the affected function.
Before PR #3427, placeholder nodes with tensor metadata were added to
`graph_info` using only a stable placeholder counter and the tensor shape:
```python
shape_str = ",".join(shape)
node_info.append(
    f"om_placeholder_{placeholder_counter}_[{shape_str}]"
)
```
The generated placeholder component was identical across dtypes:
```text
om_placeholder_0_[1,768]
```
The rest of the lightweight cache key is derived from this graph string and the
compile options:
```text
graph_hash = sha256(graph_str)
options_hash = sha256(optimized_pickle(compile_options))
cache_key = "_om_" + graph_hash + options_hash
```
Because dtype was excluded from `graph_str`, two models that differ only by
input dtype could resolve to the same `cache_key` when their compile options
also matched. `TorchONNXMLIR` then uses that key to query `global_session_cache`.
On a cache hit, it reuses the cached `InferenceSession` and compiled `.so`
instead of exporting and compiling the current model.
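The derivation described above can be imitated without PyTorch or onnx-mlir. The following is a rough sketch, not the backend's code: `sketch_cache_key` is a hypothetical stand-in, dtypes are passed as plain strings, the `"|"` separator between node entries is an assumption, and the standard `pickle` module stands in for the `optimized_pickle` step.

```python
import hashlib
import pickle


def sketch_cache_key(placeholders, compile_options, include_dtype):
    """Hypothetical stand-in for the lightweight key derivation.

    placeholders: list of (shape, dtype_name) pairs, one per placeholder node.
    include_dtype: False imitates the pre-fix key material, True the fixed one.
    """
    node_info = []
    for counter, (shape, dtype_name) in enumerate(placeholders):
        shape_str = ",".join(str(dim) for dim in shape)
        part = f"om_placeholder_{counter}_[{shape_str}]"
        if include_dtype:
            part += f"_{dtype_name}"
        node_info.append(part)
    # Assumed separator; the real graph string contains other node entries too.
    graph_str = "|".join(node_info)
    graph_hash = hashlib.sha256(graph_str.encode()).hexdigest()
    # Plain pickle of sorted items stands in for optimized_pickle(compile_options).
    options_bytes = pickle.dumps(sorted(compile_options.items()))
    options_hash = hashlib.sha256(options_bytes).hexdigest()
    return "_om_" + graph_hash + options_hash


options = {"opt_level": 3}
model_a = [((1, 768), "torch.float32")]
model_b = [((1, 768), "torch.float64")]

old_a = sketch_cache_key(model_a, options, include_dtype=False)
old_b = sketch_cache_key(model_b, options, include_dtype=False)
new_a = sketch_cache_key(model_a, options, include_dtype=True)
new_b = sketch_cache_key(model_b, options, include_dtype=True)

print(old_a == old_b)  # shape-only key material: the two keys collide
print(new_a == new_b)  # dtype-aware key material: the two keys differ
```

Because the options hash is identical for both models, the placeholder string is the only place where the two keys can diverge.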
The heavy hashing path is not affected. When `use_lightweight_hashing=False`,
`OMFxGraphCachePickler` reduces tensors through PyTorch
`extract_tensor_metadata_for_cache_key()`, which includes tensor metadata such
as dtype.
## Attack Requirements
```text
The attacker must be able to cause at least two model compilation or inference
requests to share the same torch_onnxmlir cache namespace. The attacker also
needs influence over model/input dtype or over two otherwise equivalent models
that differ only by dtype.
The issue is most relevant for long-running Python processes, notebooks,
benchmarking systems, or model-serving deployments that compile multiple dtype
variants of the same model while sharing the same onnx-mlir session cache.
```
## Collision Conditions
```text
Two requests can collide when all of the following hold:
1. The FX graph structure is equivalent.
2. Placeholder order and count are equivalent.
3. Placeholder shapes are equivalent, including symbolic dimension pattern.
4. Compile options serialize to the same optimized pickle.
5. Relevant sampled parameter values and non-placeholder node inputs match.
6. Placeholder tensor dtype differs.
```
Example:
```text
Model A input: shape [1, 768], dtype torch.float32
Model B input: shape [1, 768], dtype torch.float64
Compile options: identical
Before the fix:
Model A placeholder key part: om_placeholder_0_[1,768]
Model B placeholder key part: om_placeholder_0_[1,768]
Result: same graph hash and same cache key
After the fix:
Model A placeholder key part: om_placeholder_0_[1,768]_torch.float32
Model B placeholder key part: om_placeholder_0_[1,768]_torch.float64
Result: different graph hash and different cache key
```
## Proof of Concept
The following is a minimal conceptual proof for the vulnerable revision. It
shows the expected assertion before and after the patch; exact GraphModule
construction depends on the PyTorch version used by the test environment.
```python
import torch
import torch.fx

from torch_onnxmlir.backend import generate_hash_key


def make_graph_module(dtype):
    class Model(torch.nn.Module):
        def forward(self, x):
            return x + x

    model = Model().eval()
    example = torch.ones((1, 768), dtype=dtype)
    gm = torch.fx.symbolic_trace(model)
    # The backend's hashing path relies on example_value metadata on
    # placeholder nodes. Test harnesses can populate this through the normal
    # torch.compile/export path or directly for a focused unit test.
    for node in gm.graph.nodes:
        if node.op == "placeholder":
            node.meta["example_value"] = example
            break
    return gm


options = {"opt_level": 3}
gm_f32 = make_graph_module(torch.float32)
gm_f64 = make_graph_module(torch.float64)
key_f32 = generate_hash_key(gm_f32, options)
key_f64 = generate_hash_key(gm_f64, options)

# Vulnerable revision: keys are equal because dtype is not in the placeholder
# key material, so this assertion fails. Fixed revision: keys differ.
assert key_f32 != key_f64
```
In an affected runtime, the practical failure sequence is:
```text
1. Compile or run the float32 model first.
2. The compiled onnx-mlir session is stored under the shape-only cache key.
3. Compile or run the float64 model in the same cache namespace.
4. The backend computes the same cache key and reuses the float32 compiled
session.
5. Inference completes without an exception but returns results generated by
the wrong compiled artifact.
```
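The failure sequence above can be illustrated with a toy cache. This is only a sketch: `ToySessionCache`, `shape_only_key`, and `shape_and_dtype_key` are hypothetical names, the "compiled session" is a plain string, and none of it reflects the real `SessionCache` interface.

```python
class ToySessionCache:
    """Hypothetical dict-backed cache; not the real SessionCache."""

    def __init__(self):
        self._entries = {}
        self.compile_count = 0

    def get_or_compile(self, cache_key, dtype_name):
        # On a miss, "compile" an artifact for the requested dtype.
        if cache_key not in self._entries:
            self.compile_count += 1
            self._entries[cache_key] = f"session_compiled_for_{dtype_name}"
        # On a hit, return whatever was stored first under this key.
        return self._entries[cache_key]


def shape_only_key(shape):
    """Pre-fix style key part: placeholder index and shape only."""
    return "om_placeholder_0_[" + ",".join(str(d) for d in shape) + "]"


def shape_and_dtype_key(shape, dtype_name):
    """Post-fix style key part: dtype appended to the shape-only part."""
    return f"{shape_only_key(shape)}_{dtype_name}"


shape = (1, 768)

vulnerable = ToySessionCache()
first = vulnerable.get_or_compile(shape_only_key(shape), "torch.float32")
second = vulnerable.get_or_compile(shape_only_key(shape), "torch.float64")
print(second)  # the float64 request receives the float32 artifact

fixed = ToySessionCache()
fixed_first = fixed.get_or_compile(
    shape_and_dtype_key(shape, "torch.float32"), "torch.float32"
)
fixed_second = fixed.get_or_compile(
    shape_and_dtype_key(shape, "torch.float64"), "torch.float64"
)
print(fixed.compile_count)  # each dtype variant is compiled separately
```

In the toy model the second request never triggers a compile under the shape-only scheme, which mirrors step 4: the cache hit is indistinguishable from a legitimate one.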
## Impact
```text
Successful exploitation causes silent inference integrity failure. The backend
can return numerically incorrect output while preserving the expected output
shape and without logging an error.
Possible consequences include wrong evaluation metrics, incorrect benchmark
results, incorrect production inference decisions, and cache poisoning between
dtype variants of the same model in shared model-serving environments.
This is not known to be a memory-safety issue and does not by itself imply
arbitrary code execution. Confidentiality impact is deployment-dependent and is
not the primary impact.
```
## Scope and Limitations
```text
The collision requires shared cache state. The PR discussion describes the cache
as in-memory, and the current SessionCache implementation also supports loading
and writing compiled sessions under TORCHONNXMLIR_CACHE_DIR, defaulting to
~/.cache/. Deployments should consider both process lifetime and configured
cache directory reuse.
The same_hash_counter optimization does not solve the issue because it can only
compare the already-colliding generated hash values.
The issue affects the lightweight hashing path. The heavy pickler-based hashing
path includes tensor metadata and is not affected by this specific dtype
omission.
```
## Fix / Mitigation
PR #3427 fixes the collision by adding tensor dtype to the placeholder key
material:
```python
shape_str = ",".join(shape)
dtype = node.meta["example_value"].dtype
node_info.append(
    f"om_placeholder_{placeholder_counter}_[{shape_str}]_{dtype}"
)
```
Recommended mitigation:
```text
1. Upgrade to a revision containing commit
1a25fe4155065fb8c5de4c3fe55c39531cc18e8a or the PR #3427 merge.
2. Restart long-running processes after upgrade.
3. If TORCHONNXMLIR_CACHE_DIR is used, clear stale onnx-mlir compiled-session
entries associated with the old shape-only cache scheme. Avoid deleting an
entire shared cache directory blindly.
4. Add regression tests that assert different dtypes produce different
generate_hash_key() values for otherwise equivalent GraphModules.
```
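For mitigation step 3, a dry-run listing can help with reviewing candidates before anything is removed. The sketch below assumes that on-disk entries carry the `_om_` cache key prefix in their file or directory names; that naming is an assumption and should be checked against the actual `SessionCache` layout. `list_candidate_entries` is a hypothetical helper and deletes nothing.

```python
from pathlib import Path


def list_candidate_entries(cache_dir, key_prefix="_om_"):
    """Return entries in cache_dir whose names start with key_prefix.

    Dry run only: nothing is deleted. The assumption that on-disk entries are
    named after the "_om_" cache key is NOT confirmed by the source.
    """
    root = Path(cache_dir).expanduser()
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir() if p.name.startswith(key_prefix))


if __name__ == "__main__":
    import os

    # Falls back to the default location mentioned in this report.
    cache_dir = os.environ.get("TORCHONNXMLIR_CACHE_DIR", "~/.cache/")
    for entry in list_candidate_entries(cache_dir):
        print(entry)
```

Reviewing the printed list by hand keeps unrelated files in a shared cache directory untouched.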
## Disclosure Timeline
```text
2026-03-25: PR #3427 opened with root cause, impact, and patch.
2026-03-26: Maintainer review approved the change.
2026-03-27: Maintainer confirmed an unrelated macOS failure was not caused by
the patch and stated the PR was good to merge.
2026-03-30: PR #3427 merged into onnx-mlir main.
2026-03-30: Jenkins Linux amd64 and Linux s390x builds for the patch passed.
```
## References
```text
Upstream PR:
https://github.com/onnx/onnx-mlir/pull/3427
Patch commit:
https://github.com/onnx/onnx-mlir/commit/1a25fe4155065fb8c5de4c3fe5 |
|---|
| Source | https://github.com/onnx/onnx-mlir/pull/3427 |
|---|
| User | Dem00 (UID 84913) |
|---|
| Submission | 18/05/2026 08:22 AM (19 days ago) |
|---|
| Moderation | 05/06/2026 08:43 AM (18 days later) |
|---|
| Status | Accepted |
|---|
| VulDB entry | 368865 [onnx onnx-mlir up to 0.5.0.0 Placeholder Node Cache backend.py generate_hash_key weak encryption] |
|---|
| Points | 20 |
|---|