CVE-2024-35931 in Linux

Summary

by MITRE • 05/19/2024

In the Linux kernel, the following vulnerability has been resolved:

drm/amdgpu: Skip do PCI error slot reset during RAS recovery

Why: The PCI error slot reset may be triggered after a UE (uncorrectable error) is injected into the UMC multiple times, which caused a system hang.
[ 557.371857] amdgpu 0000:af:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 557.373718] [drm] PCIE GART of 512M enabled.
[ 557.373722] [drm] PTB located at 0x0000031FED700000
[ 557.373788] [drm] VRAM is lost due to GPU reset!
[ 557.373789] [drm] PSP is resuming...
[ 557.547012] mlx5_core 0000:55:00.0: mlx5_pci_err_detected Device state = 1 pci_status: 0. Exit, result = 3, need reset
[ 557.547067] [drm] PCI error: detected callback, state(1)!!
[ 557.547069] [drm] No support for XGMI hive yet...
[ 557.548125] mlx5_core 0000:55:00.0: mlx5_pci_slot_reset Device state = 1 pci_status: 0. Enter
[ 557.607763] mlx5_core 0000:55:00.0: wait vital counter value 0x16b5b after 1 iterations
[ 557.607777] mlx5_core 0000:55:00.0: mlx5_pci_slot_reset Device state = 1 pci_status: 1. Exit, err = 0, result = 5, recovered
[ 557.610492] [drm] PCI error: slot reset callback!!
...
[ 560.689382] amdgpu 0000:3f:00.0: amdgpu: GPU reset(2) succeeded!
[ 560.689546] amdgpu 0000:5a:00.0: amdgpu: GPU reset(2) succeeded!
[ 560.689562] general protection fault, probably for non-canonical address 0x5f080b54534f611f: 0000 [#1] SMP NOPTI
[ 560.701008] CPU: 16 PID: 2361 Comm: kworker/u448:9 Tainted: G OE 5.15.0-91-generic #101-Ubuntu
[ 560.712057] Hardware name: Microsoft C278A/C278A, BIOS C2789.5.BS.1C11.AG.1 11/08/2023
[ 560.720959] Workqueue: amdgpu-reset-hive amdgpu_ras_do_recovery [amdgpu]
[ 560.728887] RIP: 0010:amdgpu_device_gpu_recover.cold+0xbf1/0xcf5 [amdgpu]
[ 560.736891] Code: ff 41 89 c6 e9 1b ff ff ff 44 0f b6 45 b0 e9 4f ff ff ff be 01 00 00 00 4c 89 e7 e8 76 c9 8b ff 44 0f b6 45 b0 e9 3c fd ff ff 83 ba 18 02 00 00 00 0f 84 6a f8 ff ff 48 8d 7a 78 be 01 00 00
[ 560.757967] RSP: 0018:ffa0000032e53d80 EFLAGS: 00010202
[ 560.763848] RAX: ffa00000001dfd10 RBX: ffa0000000197090 RCX: ffa0000032e53db0
[ 560.771856] RDX: 5f080b54534f5f07 RSI: 0000000000000000 RDI: ff11000128100010
[ 560.779867] RBP: ffa0000032e53df0 R08: 0000000000000000 R09: ffffffffffe77f08
[ 560.787879] R10: 0000000000ffff0a R11: 0000000000000001 R12: 0000000000000000
[ 560.795889] R13: ffa0000032e53e00 R14: 0000000000000000 R15: 0000000000000000
[ 560.803889] FS: 0000000000000000(0000) GS:ff11007e7e800000(0000) knlGS:0000000000000000
[ 560.812973] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 560.819422] CR2: 000055a04c118e68 CR3: 0000000007410005 CR4: 0000000000771ee0
[ 560.827433] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 560.835433] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 560.843444] PKRU: 55555554
[ 560.846480] Call Trace:
[ 560.849225]
[ 560.851580] ? show_trace_log_lvl+0x1d6/0x2ea
[ 560.856488] ? show_trace_log_lvl+0x1d6/0x2ea
[ 560.861379] ? amdgpu_ras_do_recovery+0x1b2/0x210 [amdgpu]
[ 560.867778] ? show_regs.part.0+0x23/0x29
[ 560.872293] ? __die_body.cold+0x8/0xd
[ 560.876502] ? die_addr+0x3e/0x60
[ 560.880238] ? exc_general_protection+0x1c5/0x410
[ 560.885532] ? asm_exc_general_protection+0x27/0x30
[ 560.891025] ? amdgpu_device_gpu_recover.cold+0xbf1/0xcf5 [amdgpu]
[ 560.898323] amdgpu_ras_do_recovery+0x1b2/0x210 [amdgpu]
[ 560.904520] process_one_work+0x228/0x3d0
How: In RAS recovery, a mode-1 reset is issued from RAS fatal error handling and is expected to reset all the nodes in a hive, so there is no need to issue another mode-1 reset during this procedure.


Analysis

by VulDB Data Team • 09/24/2025

CVE-2024-35931 affects the amdgpu driver, the Linux kernel's DRM driver for AMD graphics hardware. The flaw appears during RAS (Reliability, Availability, and Serviceability) recovery, in which the driver tries to recover from uncorrectable memory errors. A PCI error slot reset can be triggered while RAS recovery is still in progress, and the overlap leads to a system hang and a kernel fault. The commit message reports the problem after uncorrectable errors (UEs) were injected into the UMC (Unified Memory Controller) multiple times.

The root cause is an interaction between two recovery mechanisms in the amdgpu driver. During RAS recovery, fatal error handling issues a mode-1 reset that is expected to reset every node in the GPU hive. According to the log in the commit message, the PCI error handling callbacks (error detected, then slot reset) also fired while that recovery was under way, so a further reset was attempted on top of the one already in progress. Shortly afterwards the kernel reported a general protection fault on a non-canonical address, with the instruction pointer in `amdgpu_device_gpu_recover.cold` and `amdgpu_ras_do_recovery` as the caller on the `amdgpu-reset-hive` workqueue. The trace does not show which state was damaged, but it is consistent with the two reset paths interfering with each other.
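The log sequence described above can be turned into a rough triage check. The following is a minimal sketch, not an official detection rule: it assumes dmesg-style lines with a bracketed timestamp, and it uses only message fragments copied from the log excerpt in the summary. The function name `find_signature` and the returned dictionary keys are hypothetical.

```python
import re

# Matches the "[ 557.547067] " timestamp prefix on each kernel log line.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")

# Message fragments copied from the log excerpt in the advisory.
SLOT_RESET = "PCI error: slot reset callback"
FAULT = "general protection fault"
RAS_WORKER = "amdgpu_ras_do_recovery"


def find_signature(lines):
    """Look for: slot reset callback, then a GP fault, then the RAS worker.

    This is a heuristic. A match only means the log resembles the excerpt
    in the advisory, not that the host is proven to be affected.
    """
    slot_reset_at = None
    fault_at = None
    ras_worker_seen = False
    for line in lines:
        m = TIMESTAMP.match(line.strip())
        if not m:
            continue  # skip lines without a timestamp prefix
        ts, msg = float(m.group(1)), m.group(2)
        if SLOT_RESET in msg and slot_reset_at is None:
            slot_reset_at = ts
        elif FAULT in msg and slot_reset_at is not None and fault_at is None:
            fault_at = ts
        elif RAS_WORKER in msg and fault_at is not None:
            ras_worker_seen = True
    return {
        "slot_reset_at": slot_reset_at,
        "fault_at": fault_at,
        "matches": fault_at is not None and ras_worker_seen,
    }
```

Feeding it the output of `dmesg` split into lines returns the timestamps of the first slot reset callback and the following fault, plus a `matches` flag that is true only when all three markers appear in that order.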

The operational impact is a loss of availability on systems that rely on AMD GPUs, particularly multi-GPU servers in data centers and high-performance computing environments. A hung system means complete service disruption until someone intervenes or the machine is rebooted. Exposure is highest where uncorrectable GPU memory errors, and therefore RAS recovery, are most likely to occur, for example under long-running machine learning or scientific computing workloads. The flaw could also be used for denial of service if an attacker is able to trigger the error injection sequence repeatedly.

The fix changes the amdgpu driver so that it skips the PCI error slot reset while RAS recovery is in progress. Because the mode-1 reset issued by RAS fatal error handling is already expected to reset every node in the hive, a second reset adds nothing and only creates the conflict that leads to the hang. More generally, overlapping recovery operations on the same device should be avoided, since they can turn a recoverable error into a cascading failure. Systems are protected by running a kernel that includes this change.
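The guard described above can be illustrated with a small model. This is a sketch in Python for readability, not the kernel patch (which is C code inside the amdgpu driver); every name here, including `Device`, `in_ras_recovery`, and `pci_slot_reset`, is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class SlotResetResult(Enum):
    RECOVERED = "recovered"    # report success without touching the device
    RESET_DONE = "reset_done"  # this handler performed its own reset


@dataclass
class Device:
    in_ras_recovery: bool = False  # hypothetical flag owned by the RAS worker
    resets_issued: int = 0         # counts resets so the effect is observable


def begin_ras_recovery(dev: Device) -> None:
    """Model the RAS worker starting its own reset of the device."""
    dev.in_ras_recovery = True
    dev.resets_issued += 1


def end_ras_recovery(dev: Device) -> None:
    """Model the RAS worker finishing recovery."""
    dev.in_ras_recovery = False


def pci_slot_reset(dev: Device) -> SlotResetResult:
    """Model the PCI slot reset callback with the guard in place."""
    if dev.in_ras_recovery:
        # A reset is already under way, so do not start a second one.
        return SlotResetResult.RECOVERED
    dev.resets_issued += 1
    return SlotResetResult.RESET_DONE
```

In this model, calling `pci_slot_reset` between `begin_ras_recovery` and `end_ras_recovery` leaves `resets_issued` unchanged, while the same call outside recovery performs a reset as usual. A real driver would also need to make the check safe against concurrent updates of the flag, which this single-threaded sketch does not attempt.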

From a cybersecurity perspective, this vulnerability shows why kernel drivers need careful state handling, especially around hardware recovery. The closest weakness class is CWE-362 (Concurrent Execution using Shared Resource with Improper Synchronization, i.e. a race condition); it also falls under the broader category CWE-399 (Resource Management Errors). In the ATT&CK framework the entry is associated with T1490, Inhibit System Recovery. That mapping is loose, since the technique describes deliberate adversary actions, whereas here a driver defect undermines recovery. Drivers should be validated and tested in recovery scenarios before deployment, particularly on mission-critical systems that depend on GPU reliability.

Reservation: 05/17/2024

Disclosure: 05/19/2024

Moderation: accepted

CPE: ready

EPSS: 0.00017

KEV: no

Activities: very low
