Submit #832358: onnx onnx-mlir v0.5.0.0 cache key collision

Title	onnx onnx-mlir v0.5.0.0 cache key collision
Description

## Affected Component

```text
Component: torch_onnxmlir backend session cache
Affected file: src/Runtime/python/torch_onnxmlir/src/torch_onnxmlir/backend.py
Affected function: generate_hash_key()
Related cache object: global_session_cache / SessionCache
```

## Affected Versions

```text
Affected: onnx-mlir revisions containing the lightweight hashing implementation
that generated placeholder cache key material from tensor shape without tensor dtype.

Confirmed affected: parent commit 27a08138aa182c526bb559e38a8921901a0b4646.

Fixed: commit 1a25fe4155065fb8c5de4c3fe55c39531cc18e8a, merged through PR #3427
on 2026-03-30. GitHub reports merge commit 72c5187ff6d13c2c2b3d3789b8f5faf99f08a5b4.

Exact released version range: not verified. Confirm with the onnx-mlir maintainers
or release notes before submitting as a final VulDB entry.
```

## Vulnerability Class

```text
Cache key collision / improper cache key generation / incomplete comparison with missing factor
```

Suggested CWE mapping:

```text
CWE-1023: Incomplete Comparison with Missing Factors
```

Secondary CWE candidates:

```text
CWE-345: Insufficient Verification of Data Authenticity
CWE-706: Use of Incorrectly-Resolved Name or Reference
```

## Severity

Suggested conservative library-level severity:

```text
Medium
CVSS 3.1: 5.5
Vector: CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:N/I:H/A:N
```

Rationale:

```text
The issue corrupts inference integrity but does not directly provide code
execution or memory corruption. Exploitation requires the attacker to influence
models or example inputs in the same cache namespace.

In a networked or multi-tenant inference service that exposes model compilation
or dtype selection to untrusted users, the deployment-specific score may be higher.
```

## Summary

```text
onnx-mlir contains a cache key collision issue in the torch_onnxmlir backend.
When generate_hash_key() used the lightweight hashing path, placeholder nodes
were represented by placeholder index and tensor shape only. Tensor dtype was
not included in the cache key material.

Two torch.fx GraphModule instances with the same graph structure and input shape
but different input dtypes, such as torch.float32 and torch.float64, could
therefore produce the same cache key. The second model could silently reuse the
first model's compiled onnx-mlir shared object and return incorrect inference
results without an error or warning.
```

## Technical Description

The vulnerable logic is in `generate_hash_key()` when `use_lightweight_hashing=True`, which is the default argument in the affected function.

Before PR #3427, placeholder nodes with tensor metadata were added to `graph_info` using only a stable placeholder counter and the tensor shape:

```python
shape_str = ",".join(shape)
node_info.append(
    f"om_placeholder_{placeholder_counter}_[{shape_str}]"
)
```

The generated placeholder component was identical across dtypes:

```text
om_placeholder_0_[1,768]
```

The rest of the lightweight cache key is derived from this graph string and the compile options:

```text
graph_hash = sha256(graph_str)
options_hash = sha256(optimized_pickle(compile_options))
cache_key = "_om_" + graph_hash + options_hash
```

Because dtype was excluded from `graph_str`, two models that differ only by input dtype could resolve to the same `cache_key` when their compile options also matched. `TorchONNXMLIR` then uses that key to query `global_session_cache`. On a cache hit, it reuses the cached `InferenceSession` and compiled `.so` instead of exporting and compiling the current model.

The heavy hashing path is not affected. When `use_lightweight_hashing=False`, `OMFxGraphCachePickler` reduces tensors through PyTorch `extract_tensor_metadata_for_cache_key()`, which includes tensor metadata such as dtype.

## Attack Requirements

```text
The attacker must be able to cause at least two model compilation or inference
requests to share the same torch_onnxmlir cache namespace. The attacker also
needs influence over model/input dtype or over two otherwise equivalent models
that differ only by dtype.

The issue is most relevant for long-running Python processes, notebooks,
benchmarking systems, or model-serving deployments that compile multiple dtype
variants of the same model while sharing the same onnx-mlir session cache.
```

## Collision Conditions

```text
Two requests can collide when all of the following hold:

1. The FX graph structure is equivalent.
2. Placeholder order and count are equivalent.
3. Placeholder shapes are equivalent, including symbolic dimension pattern.
4. Compile options serialize to the same optimized pickle.
5. Relevant sampled parameter values and non-placeholder node inputs match.
6. Placeholder tensor dtype differs.
```

Example:

```text
Model A input: shape [1, 768], dtype torch.float32
Model B input: shape [1, 768], dtype torch.float64
Compile options: identical

Before the fix:
  Model A placeholder key part: om_placeholder_0_[1,768]
  Model B placeholder key part: om_placeholder_0_[1,768]
  Result: same graph hash and same cache key

After the fix:
  Model A placeholder key part: om_placeholder_0_[1,768]_torch.float32
  Model B placeholder key part: om_placeholder_0_[1,768]_torch.float64
  Result: different graph hash and different cache key
```

## Proof of Concept

The following is a minimal conceptual proof for the vulnerable revision. It shows the expected assertion before and after the patch; exact GraphModule construction depends on the PyTorch version used by the test environment.

```python
import torch
from torch_onnxmlir.backend import generate_hash_key


def make_graph_module(dtype):
    class Model(torch.nn.Module):
        def forward(self, x):
            return x + x

    model = Model().eval()
    example = torch.ones((1, 768), dtype=dtype)
    gm = torch.fx.symbolic_trace(model)

    # The backend's hashing path relies on example_value metadata on
    # placeholder nodes. Test harnesses can populate this through the normal
    # torch.compile/export path or directly for a focused unit test.
    for node in gm.graph.nodes:
        if node.op == "placeholder":
            node.meta["example_value"] = example
            break
    return gm


options = {"opt_level": 3}
gm_f32 = make_graph_module(torch.float32)
gm_f64 = make_graph_module(torch.float64)

key_f32 = generate_hash_key(gm_f32, options)
key_f64 = generate_hash_key(gm_f64, options)

# Vulnerable revision: keys are equal because dtype is not in the placeholder
# key material. Fixed revision: keys differ.
assert key_f32 != key_f64
```

In an affected runtime, the practical failure sequence is:

```text
1. Compile or run the float32 model first.
2. The compiled onnx-mlir session is stored under the shape-only cache key.
3. Compile or run the float64 model in the same cache namespace.
4. The backend computes the same cache key and reuses the float32 compiled session.
5. Inference completes without an exception but returns results generated by
   the wrong compiled artifact.
```

## Impact

```text
Successful exploitation causes silent inference integrity failure. The backend
can return numerically incorrect output while preserving the expected output
shape and without logging an error.

Possible consequences include wrong evaluation metrics, incorrect benchmark
results, incorrect production inference decisions, and cache poisoning between
dtype variants of the same model in shared model-serving environments.

This is not known to be a memory-safety issue and does not by itself imply
arbitrary code execution. Confidentiality impact is deployment-dependent and is
not the primary impact.
```

## Scope and Limitations

```text
The collision requires shared cache state. The PR discussion describes the cache
as in-memory, and the current SessionCache implementation also supports loading
and writing compiled sessions under TORCHONNXMLIR_CACHE_DIR, defaulting to
~/.cache/. Deployments should consider both process lifetime and configured
cache directory reuse.

The same_hash_counter optimization does not solve the issue because it can only
compare the already-colliding generated hash values.

The issue affects the lightweight hashing path. The heavy pickler-based hashing
path includes tensor metadata and is not affected by this specific dtype omission.
```

## Fix / Mitigation

PR #3427 fixes the collision by adding tensor dtype to the placeholder key material:

```python
shape_str = ",".join(shape)
dtype = node.meta["example_value"].dtype
node_info.append(
    f"om_placeholder_{placeholder_counter}_[{shape_str}]_{dtype}"
)
```

Recommended mitigation:

```text
1. Upgrade to a revision containing commit 1a25fe4155065fb8c5de4c3fe55c39531cc18e8a
   or the PR #3427 merge.
2. Restart long-running processes after upgrade.
3. If TORCHONNXMLIR_CACHE_DIR is used, clear stale onnx-mlir compiled-session
   entries associated with the old shape-only cache scheme. Avoid deleting an
   entire shared cache directory blindly.
4. Add regression tests that assert different dtypes produce different
   generate_hash_key() values for otherwise equivalent GraphModules.
```

## Disclosure Timeline

```text
2026-03-25: PR #3427 opened with root cause, impact, and patch.
2026-03-26: Maintainer review approved the change.
2026-03-27: Maintainer confirmed an unrelated macOS failure was not caused by
            the patch and stated the PR was good to merge.
2026-03-30: PR #3427 merged into onnx-mlir main.
2026-03-30: Jenkins Linux amd64 and Linux s390x builds for the patch passed.
```

## References

```text
Upstream PR: https://github.com/onnx/onnx-mlir/pull/3427
Patch commit: https://github.com/onnx/onnx-mlir/commit/1a25fe4155065fb8c5de4c3fe5
```
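
The key derivation described under Technical Description can be modelled without PyTorch or onnx-mlir installed. The sketch below is illustrative only and is not the backend's code: `placeholder_part`, `cache_key`, the `include_dtype` flag, and the `"|"` separator are hypothetical names and choices made for this example, dtypes are passed as plain strings, and `pickle.dumps` over sorted option items stands in for the backend's optimized pickle. It only mirrors the `"_om_" + graph_hash + options_hash` layout quoted above to show why a shape-only placeholder fragment collides and a dtype-aware fragment does not.

```python
import hashlib
import pickle


def placeholder_part(index, shape, dtype, include_dtype):
    """Build the key fragment for one placeholder (hypothetical helper)."""
    shape_str = ",".join(str(dim) for dim in shape)
    part = f"om_placeholder_{index}_[{shape_str}]"
    if include_dtype:
        # Models the fixed scheme: the dtype becomes part of the key material.
        part += f"_{dtype}"
    return part


def cache_key(placeholders, compile_options, include_dtype):
    """Assemble "_om_" + graph_hash + options_hash for a list of placeholders."""
    parts = [
        placeholder_part(i, shape, dtype, include_dtype)
        for i, (shape, dtype) in enumerate(placeholders)
    ]
    graph_str = "|".join(parts)  # separator is an assumption of this sketch
    graph_hash = hashlib.sha256(graph_str.encode()).hexdigest()
    # Stand-in for the backend's optimized pickle of the compile options.
    options_bytes = pickle.dumps(sorted(compile_options.items()))
    options_hash = hashlib.sha256(options_bytes).hexdigest()
    return "_om_" + graph_hash + options_hash


options = {"opt_level": 3}
model_f32 = [((1, 768), "torch.float32")]
model_f64 = [((1, 768), "torch.float64")]

old_f32 = cache_key(model_f32, options, include_dtype=False)
old_f64 = cache_key(model_f64, options, include_dtype=False)
new_f32 = cache_key(model_f32, options, include_dtype=True)
new_f64 = cache_key(model_f64, options, include_dtype=True)

# A plain dict plays the role of the shared session cache.
session_cache = {old_f32: "session compiled for float32"}

print(old_f32 == old_f64)          # shape-only keys collide
print(session_cache.get(old_f64))  # float64 lookup hits the float32 entry
print(new_f32 == new_f64)          # dtype-aware keys stay distinct
```

With the shape-only scheme, the float64 lookup returns the entry stored for float32, which corresponds to step 4 of the failure sequence. Including the dtype changes the graph string, so the two variants hash to different keys even though the options hash is unchanged.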
Source	⚠️ https://github.com/onnx/onnx-mlir/pull/3427
User	Dem00 (UID 84913)
Submission	18/05/2026 08:22 AM (19 days ago)
Moderation	05/06/2026 08:43 AM (18 days later)
Status	Accepted
VulDB entry	368865 [onnx onnx-mlir up to 0.5.0.0 Placeholder Node Cache backend.py generate_hash_key weak encryption]
Points	20
