CVE-2024-41010 in Linux
Summary
by MITRE • 07/17/2024
In the Linux kernel, the following vulnerability has been resolved:
bpf: Fix too early release of tcx_entry
Pedro Pinto and later independently also Hyunwoo Kim and Wongi Lee reported an issue that the tcx_entry can be released too early leading to a use after free (UAF) when an active old-style ingress or clsact qdisc with a shared tc block is later replaced by another ingress or clsact instance.
Essentially, the sequence to trigger the UAF (one example) can be as follows:
1. A network namespace is created.
2. An ingress qdisc is created. This allocates a tcx_entry, and &tcx_entry->miniq is stored in the qdisc's miniqp->p_miniq. At the same time, a tcf block with index 1 is created.
3. chain0 is attached to the tcf block. chain0 must be connected to the block linked to the ingress qdisc to later reach the function tcf_chain0_head_change_cb_del(), which triggers the UAF.
4. A clsact qdisc is created and grafted. This causes the ingress qdisc created in step 2 to be removed, thus freeing the previously linked tcx_entry:
rtnetlink_rcv_msg() => tc_modify_qdisc() => qdisc_create() => clsact_init() [a]
=> qdisc_graft() => qdisc_destroy() => __qdisc_destroy() => ingress_destroy() [b]
=> tcx_entry_free() => kfree_rcu() // tcx_entry freed
5. Finally, the network namespace is closed. This registers the cleanup_net worker, and during the process of releasing the remaining clsact qdisc, it accesses the tcx_entry that was already freed in step 4, causing the UAF to occur:
cleanup_net() => ops_exit_list() => default_device_exit_batch() => unregister_netdevice_many() => unregister_netdevice_many_notify() => dev_shutdown() => qdisc_put() => clsact_destroy() [c]
=> tcf_block_put_ext() => tcf_chain0_head_change_cb_del() => tcf_chain_head_change_item() => clsact_chain_head_change() => mini_qdisc_pair_swap() // UAF
There are also other variants. The gist is to add an ingress (or clsact) qdisc with a specific shared block, then replace that qdisc, wait for the tcx_entry kfree_rcu() to be executed, and subsequently access the currently active qdisc's miniq one way or another.
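The ordering problem in the sequence above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: the classes TcxEntry and Device and the helpers qdisc_init(), qdisc_destroy(), miniq_swap() and run_buggy_sequence() are hypothetical names invented for illustration, the `freed` attribute stands in for kfree_rcu() having completed, and attached BPF programs are not modeled.

```python
class TcxEntry:
    """Toy stand-in for the kernel's tcx_entry (pre-fix: one boolean flag)."""

    def __init__(self):
        self.miniq_active = False
        self.freed = False  # True once the modeled kfree_rcu() has run


class Device:
    """Toy net device holding at most one shared entry."""

    def __init__(self):
        self.entry = None


def qdisc_init(dev):
    """Hypothetical model of ingress/clsact init: fetch or create, mark active."""
    if dev.entry is None:
        dev.entry = TcxEntry()
    dev.entry.miniq_active = True
    return dev.entry  # the qdisc keeps a pointer into this entry


def qdisc_destroy(dev, entry):
    """Hypothetical model of ingress/clsact destroy with the boolean flag."""
    entry.miniq_active = False  # also erases the other qdisc's claim
    if not entry.miniq_active:  # always true here; BPF programs not modeled
        dev.entry = None
        entry.freed = True      # models tcx_entry_free() => kfree_rcu()


def miniq_swap(entry):
    """Hypothetical model of a later access to the entry's miniq."""
    if entry.freed:
        raise RuntimeError("use after free: tcx_entry already released")


def run_buggy_sequence():
    """Replays step 2, [a], [b] and [c]; returns True if the UAF is hit."""
    dev = Device()
    ingress = qdisc_init(dev)    # step 2: entry allocated, flag set
    clsact = qdisc_init(dev)     # [a]: replacement reuses the same entry
    qdisc_destroy(dev, ingress)  # [b]: flag cleared, entry freed
    try:
        miniq_swap(clsact)       # [c]: surviving qdisc touches freed entry
    except RuntimeError:
        return True
    return False
```

In this model, run_buggy_sequence() reports the stale access because a single boolean cannot record that two qdiscs hold the entry at the same time during the replacement.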
The correct fix is to turn the miniq_active boolean into a counter. At step 2 above, the counter transitions from 0->1; at step [a] from 1->2 (so that the miniq object remains active during the replacement); then in [b] from 2->1; and finally in [c] from 1->0 with the eventual release. The reference counter therefore stays in the range [0,2], and it does not need to be atomic since all access to the counter is protected by the rtnl mutex. With this in place, the UAF no longer occurs and the tcx_entry is freed at the correct time.
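The counter transitions described above can be traced in a minimal user-space sketch. It is an illustration under assumptions rather than the kernel patch: TcxEntry, Device, qdisc_init(), qdisc_destroy(), miniq_swap() and run_fixed_sequence() are hypothetical names, `freed` models kfree_rcu() having run, and the single-threaded execution stands in for the serialization that the rtnl mutex provides.

```python
class TcxEntry:
    """Toy stand-in for the kernel's tcx_entry (post-fix: a plain counter)."""

    def __init__(self):
        self.miniq_active = 0
        self.freed = False  # True once the modeled kfree_rcu() has run


class Device:
    """Toy net device holding at most one shared entry."""

    def __init__(self):
        self.entry = None


def qdisc_init(dev):
    """Hypothetical model of ingress/clsact init: take one reference."""
    if dev.entry is None:
        dev.entry = TcxEntry()
    dev.entry.miniq_active += 1
    return dev.entry


def qdisc_destroy(dev, entry):
    """Hypothetical model of destroy: drop one reference, free on zero."""
    entry.miniq_active -= 1
    if entry.miniq_active == 0:
        dev.entry = None
        entry.freed = True


def miniq_swap(entry):
    """Hypothetical model of a later access to the entry's miniq."""
    if entry.freed:
        raise RuntimeError("use after free: tcx_entry already released")


def run_fixed_sequence():
    """Replays step 2, [a], [b] and [c], recording the counter each time."""
    dev = Device()
    trace = []
    ingress = qdisc_init(dev)            # step 2: 0 -> 1
    trace.append(ingress.miniq_active)
    clsact = qdisc_init(dev)             # [a]: 1 -> 2
    trace.append(clsact.miniq_active)
    qdisc_destroy(dev, ingress)          # [b]: 2 -> 1, entry stays alive
    trace.append(clsact.miniq_active)
    freed_during_swap = clsact.freed
    miniq_swap(clsact)                   # access is safe: entry not freed
    qdisc_destroy(dev, clsact)           # [c]: 1 -> 0, entry released
    trace.append(clsact.miniq_active)
    return trace, freed_during_swap, clsact.freed
```

Calling run_fixed_sequence() yields the counter trace together with two flags showing that the entry was still allocated when the surviving qdisc accessed it and was released only after the last user went away.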
Analysis
by VulDB Data Team • 11/02/2024
The vulnerability described in CVE-2024-41010 represents a use-after-free condition affecting the Linux kernel's traffic control subsystem, specifically within the BPF (Berkeley Packet Filter) implementation. This flaw occurs when managing ingress or clsact qdiscs with shared tc blocks, creating a scenario where a tcx_entry structure can be prematurely released while still being referenced. The issue manifests through a complex sequence involving network namespace creation, qdisc initialization, and subsequent replacement operations that lead to memory corruption. The vulnerability is classified under CWE-416 as Use After Free, which directly impacts the integrity and stability of kernel memory management.
The technical root cause stems from improper reference counting of the tcx_entry structure during qdisc replacement operations. When an ingress qdisc is created with a shared tc block, the system allocates a tcx_entry and stores a reference to its miniq in the qdisc's miniqp->p_miniq. During the replacement process, the original qdisc is destroyed, triggering tcx_entry_free(), which schedules the tcx_entry for deferred freeing via kfree_rcu(). However, the replacement qdisc still refers to the same tcx_entry, and the destroy path does not account for that remaining user, so the entry is freed and then accessed again during namespace cleanup. The premature release occurs because the system tracks usage with a single boolean flag (miniq_active) rather than a reference counter: clearing the flag for the old qdisc also erases the fact that the new qdisc is still active.
The operational impact of this vulnerability is significant as it can result in kernel memory corruption, system instability, and potential privilege escalation. Attackers could exploit this condition to cause denial of service through kernel crashes or potentially execute arbitrary code with kernel privileges. The vulnerability affects systems running Linux kernels with BPF and traffic control functionality, particularly those managing network namespaces with ingress or clsact qdiscs. The exploit requires specific conditions including the creation of network namespaces with shared tc blocks, followed by qdisc replacement operations, making it more targeted but still dangerous in environments where such operations are common. This aligns with ATT&CK technique T1068 which covers Exploitation for Privilege Escalation.
The fix addresses the core issue by replacing the boolean miniq_active flag with a reference counter that tracks active usage of the tcx_entry structure. This counter transitions through the values 0->1->2->1->0 during the lifecycle of the qdisc operations, ensuring that the tcx_entry remains allocated as long as it is actively referenced. The solution relies on the existing rtnl mutex protection and therefore does not require atomic operations, as all access to the counter is serialized. The tcx_entry is freed only when no active references remain, which prevents the use-after-free that occurred during namespace cleanup. The fix is narrowly scoped to the reference counting mechanism.