CVE-2021-47041 in Linux
Summary
by MITRE • 02/28/2024
In the Linux kernel, the following vulnerability has been resolved:
nvmet-tcp: fix incorrect locking in state_change sk callback
We are not changing anything in the TCP connection state so we should not take a write_lock but rather a read lock.
This caused a deadlock when running nvmet-tcp and nvme-tcp on the same system, where the state_change callbacks on the host and on the controller side have a causal relationship. Lockdep reported the following with blktests:
================================
WARNING: inconsistent lock state
5.12.0-rc3 #1 Tainted: G I
--------------------------------
inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-R} usage.
nvme/1324 [HC0[0]:SC0[0]:HE1:SE1] takes:
ffff888363151000 (clock-AF_INET){++-?}-{2:2}, at: nvme_tcp_state_change+0x21/0x150 [nvme_tcp]
{IN-SOFTIRQ-W} state was registered at:
__lock_acquire+0x79b/0x18d0
lock_acquire+0x1ca/0x480
_raw_write_lock_bh+0x39/0x80
nvmet_tcp_state_change+0x21/0x170 [nvmet_tcp]
tcp_fin+0x2a8/0x780
tcp_data_queue+0xf94/0x1f20
tcp_rcv_established+0x6ba/0x1f00
tcp_v4_do_rcv+0x502/0x760
tcp_v4_rcv+0x257e/0x3430
ip_protocol_deliver_rcu+0x69/0x6a0
ip_local_deliver_finish+0x1e2/0x2f0
ip_local_deliver+0x1a2/0x420
ip_rcv+0x4fb/0x6b0
__netif_receive_skb_one_core+0x162/0x1b0
process_backlog+0x1ff/0x770
__napi_poll.constprop.0+0xa9/0x5c0
net_rx_action+0x7b3/0xb30
__do_softirq+0x1f0/0x940
do_softirq+0xa1/0xd0
__local_bh_enable_ip+0xd8/0x100
ip_finish_output2+0x6b7/0x18a0
__ip_queue_xmit+0x706/0x1aa0
__tcp_transmit_skb+0x2068/0x2e20
tcp_write_xmit+0xc9e/0x2bb0
__tcp_push_pending_frames+0x92/0x310
inet_shutdown+0x158/0x300
__nvme_tcp_stop_queue+0x36/0x270 [nvme_tcp]
nvme_tcp_stop_queue+0x87/0xb0 [nvme_tcp]
nvme_tcp_teardown_admin_queue+0x69/0xe0 [nvme_tcp]
nvme_do_delete_ctrl+0x100/0x10c [nvme_core]
nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
kernfs_fop_write_iter+0x2c7/0x460
new_sync_write+0x36c/0x610
vfs_write+0x5c0/0x870
ksys_write+0xf9/0x1d0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
irq event stamp: 10687
hardirqs last enabled at (10687): [] _raw_spin_unlock_irqrestore+0x2d/0x40
hardirqs last disabled at (10686): [] _raw_spin_lock_irqsave+0x68/0x90
softirqs last enabled at (10684): [] __do_softirq+0x608/0x940
softirqs last disabled at (10649): [] do_softirq+0xa1/0xd0
other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0
       ----
  lock(clock-AF_INET);
  <Interrupt>
    lock(clock-AF_INET);
*** DEADLOCK ***
5 locks held by nvme/1324:
#0: ffff8884a01fe470 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
#1: ffff8886e435c090 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x216/0x460
#2: ffff888104d90c38 (kn->active#255){++++}-{0:0}, at: kernfs_remove_self+0x22d/0x330
#3: ffff8884634538d0 (&queue->queue_lock){+.+.}-{3:3}, at: nvme_tcp_stop_queue+0x52/0xb0 [nvme_tcp]
#4: ffff888363150d30 (sk_lock-AF_INET){+.+.}-{0:0}, at: inet_shutdown+0x59/0x300
stack backtrace:
CPU: 26 PID: 1324 Comm: nvme Tainted: G I 5.12.0-rc3 #1
Hardware name: Dell Inc. PowerEdge R640/06NR82, BIOS 2.10.0 11/12/2020
Call Trace:
dump_stack+0x93/0xc2
mark_lock_irq.cold+0x2c/0xb3
? verify_lock_unused+0x390/0x390
? stack_trace_consume_entry+0x160/0x160
? lock_downgrade+0x100/0x100
? save_trace+0x88/0x5e0
? _raw_spin_unlock_irqrestore+0x2d/0x40
mark_lock+0x530/0x1470
? mark_lock_irq+0x1d10/0x1d10
? enqueue_timer+0x660/0x660
mark_usage+0x215/0x2a0
__lock_acquire+0x79b/0x18d0
? tcp_schedule_loss_probe.part.0+0x38c/0x520
lock_acquire+0x1ca/0x480
? nvme_tcp_state_change+0x21/0x150 [nvme_tcp]
? rcu_read_unlock+0x40/0x40
? tcp_mtu_probe+0x1ae0/0x1ae0
? kmalloc_reserve+0xa0/0xa0
? sysfs_file_ops+0x170/0x170
_raw_read_lock+0x3d/0xa0
? nvme_tcp_state_change+0x21/0x150 [nvme_tcp]
nvme_tcp_state_change+0x21/0x150 [nvme_tcp]
? sysfs_file_ops ---truncated---
Analysis
by VulDB Data Team • 12/06/2024
CVE-2021-47041 affects the Linux kernel's NVMe over TCP implementation, specifically the target-side driver, nvmet-tcp. The target's TCP state_change socket callback took a write lock even though it does not change anything in the TCP connection state, so a read lock is sufficient. When nvmet-tcp (target) and nvme-tcp (host) run on the same system, the state_change callbacks on the two sides are causally related, and the write lock can lead to a deadlock. Lockdep reports this as an inconsistent lock state, {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-R}: the lock class clock-AF_INET (the socket callback lock, sk_callback_lock) was previously write-locked in softirq context and is now being read-locked with softirqs enabled. If a softirq interrupts the reader on the same CPU and tries to take the write lock, it waits on a lock that the interrupted context can never release.
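The usage rule behind the {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-R} message can be sketched in a few lines of C. This is a simplified model for illustration, not lockdep's implementation; the enum values and the function usage_conflicts are hypothetical names chosen for this example.

```c
#include <stdbool.h>

/* Simplified per-lock-class usage bits, loosely modelled on the
 * states lockdep prints. All names here are illustrative. */
enum lock_usage {
    USED_IN_SOFTIRQ_W = 1u << 0, /* write-locked inside a softirq      */
    USED_IN_SOFTIRQ_R = 1u << 1, /* read-locked inside a softirq       */
    SOFTIRQ_ON_W      = 1u << 2, /* write-locked with softirqs enabled */
    SOFTIRQ_ON_R      = 1u << 3  /* read-locked with softirqs enabled  */
};

/* Return true if recording new_usage on a lock class whose earlier
 * usages are in seen would be inconsistent. A writer conflicts with
 * any holder on the other side of the softirq boundary; a reader
 * conflicts only with a writer on the other side. */
static bool usage_conflicts(unsigned seen, enum lock_usage new_usage)
{
    switch (new_usage) {
    case USED_IN_SOFTIRQ_W:
        return (seen & (SOFTIRQ_ON_W | SOFTIRQ_ON_R)) != 0;
    case USED_IN_SOFTIRQ_R:
        return (seen & SOFTIRQ_ON_W) != 0;
    case SOFTIRQ_ON_W:
        return (seen & (USED_IN_SOFTIRQ_W | USED_IN_SOFTIRQ_R)) != 0;
    case SOFTIRQ_ON_R:
        return (seen & USED_IN_SOFTIRQ_W) != 0;
    }
    return false;
}
```

In this model, usage_conflicts(USED_IN_SOFTIRQ_W, SOFTIRQ_ON_R) is true, which corresponds to the reported case. With the target callback taking a read lock instead, the history would be USED_IN_SOFTIRQ_R, and usage_conflicts(USED_IN_SOFTIRQ_R, SOFTIRQ_ON_R) is false.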
The flaw resides in the target-side nvmet_tcp_state_change function, which took the lock with write_lock_bh (_raw_write_lock_bh in the trace). The host-side nvme_tcp_state_change function takes the same lock class with read_lock (_raw_read_lock). The lockdep warning is not about a circular dependency between different locks. It concerns a single lock class, clock-AF_INET, that is acquired for writing in softirq context and for reading with softirqs enabled. The write-side usage was recorded during host controller teardown: nvme_sysfs_delete leads through nvme_do_delete_ctrl, nvme_tcp_teardown_admin_queue and nvme_tcp_stop_queue to __nvme_tcp_stop_queue, which calls inet_shutdown. Transmitting the resulting packet triggers softirq processing on the same CPU, where the receive path (tcp_v4_rcv down to tcp_fin) invokes nvmet_tcp_state_change and its write lock. The conflicting read-side acquisition occurs in nvme_tcp_state_change, run by process nvme/1324 with softirqs enabled while it holds queue->queue_lock and the sk_lock-AF_INET lock taken in inet_shutdown.
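The interrupt scenario from the lockdep report can be illustrated with a toy, single-CPU model of a reader/writer lock. This is a sketch under simplifying assumptions, not the kernel's rwlock_t: the struct toy_rwlock and all function names are hypothetical, and a failed try-lock stands in for what would be an endless spin in the kernel.

```c
#include <stdbool.h>

/* Toy reader/writer lock: any number of readers, or one writer. */
struct toy_rwlock {
    int  readers; /* number of read holders     */
    bool writer;  /* true while write-locked    */
};

static bool toy_try_read_lock(struct toy_rwlock *l)
{
    if (l->writer)
        return false;
    l->readers++;
    return true;
}

static bool toy_try_write_lock(struct toy_rwlock *l)
{
    if (l->writer || l->readers > 0)
        return false;
    l->writer = true;
    return true;
}

/* Model: process context holds the lock for reading with softirqs
 * enabled (the host-side callback). A softirq then runs on the same
 * CPU and the target-side callback tries to take the lock. Returns
 * true if the softirq can make progress. */
static bool softirq_can_proceed(bool target_uses_write_lock)
{
    struct toy_rwlock callback_lock = { 0, false };

    toy_try_read_lock(&callback_lock); /* interrupted reader */

    /* softirq arrives here; the reader cannot run until it returns */
    return target_uses_write_lock
        ? toy_try_write_lock(&callback_lock)
        : toy_try_read_lock(&callback_lock);
}
```

Here softirq_can_proceed(true) returns false, corresponding to the pre-fix behaviour where the writer would wait for a reader that cannot resume. softirq_can_proceed(false) returns true, since two readers can hold the lock at once.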
The operational impact matters mainly for systems that run both the NVMe over TCP initiator (nvme-tcp) and target (nvmet-tcp) on the same host, as in the blktests setup that produced the report. When the deadlock occurs, NVMe I/O on the affected queues stops making progress, and the system may hang or require manual intervention to recover. The report in the summary was captured on kernel 5.12.0-rc3. The issue maps to CWE-667 (Improper Locking) and aligns with ATT&CK technique T1489 (Service Stop). The result is a denial of service that can be triggered through normal NVMe TCP operations, particularly connection teardown such as deleting a controller.
Mitigation for CVE-2021-47041 consists of applying the kernel patch that corrects the locking in nvmet_tcp_state_change by replacing the write_lock_bh call on the socket callback lock with read_lock (and the matching unlock). System administrators should update to a kernel version that includes this fix. Where lockdep is enabled, monitoring kernel logs for "inconsistent lock state" warnings that involve nvme_tcp_state_change or nvmet_tcp_state_change, and watching for hangs during NVMe TCP teardown, can help detect unpatched systems. With the fix, both state_change callbacks take the lock for reading only, so a softirq-context reader can no longer block behind a process-context reader on the same CPU, and the inconsistent usage reported by lockdep disappears.
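A log monitor for the warning described above could start from simple substring checks like the following. This is a minimal sketch: the function names are hypothetical, the matched strings are taken from the report in the summary, and such warnings appear only on kernels built with lock debugging (for example CONFIG_PROVE_LOCKING) enabled.

```c
#include <stdbool.h>
#include <string.h>

/* True if a kernel log line is the header of a lockdep
 * "inconsistent lock state" report. */
static bool is_lock_state_warning(const char *line)
{
    return line != NULL &&
           strstr(line, "WARNING: inconsistent lock state") != NULL;
}

/* True if a log line mentions either NVMe TCP state_change callback
 * (host side: nvme_tcp, target side: nvmet_tcp). */
static bool mentions_nvme_tcp_callback(const char *line)
{
    return line != NULL &&
           (strstr(line, "nvme_tcp_state_change") != NULL ||
            strstr(line, "nvmet_tcp_state_change") != NULL);
}
```

A caller would feed these functions lines from the kernel log and raise an alert when a warning header is followed, within the same report, by a line that mentions one of the callbacks.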