In the Linux kernel, the following vulnerability has been resolved:
net: usb: rtl8150: fix memory leak on usb_submit_urb() failure
In async_set_registers(), when usb_submit_urb() fails, the allocated
async_req structure and URB are not freed, causing a memory leak.
The completion callback async_set_reg_cb() is responsible for freeing
these allocations, but it is only called after the URB is successfully
submitted and completes (successfully or with error). If submission
fails, the callback never runs and the memory is leaked.
Fix this by freeing both the URB and the request structure in the error
path when usb_submit_urb() fails.
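As an illustration, here is a minimal sketch of such an error path, assuming the allocation pattern described above (a request structure plus a separately allocated URB); the variable and field names are illustrative rather than the exact rtl8150 code:

	ret = usb_submit_urb(async_urb, GFP_ATOMIC);
	if (ret) {
		if (ret == -ENODEV)
			netif_device_detach(dev->netdev);
		dev_err(&dev->udev->dev,
			"%s: submit failed with %d\n", __func__, ret);
		/* The completion callback will never run, so free here. */
		usb_free_urb(async_urb);
		kfree(req);
	}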
In the Linux kernel, the following vulnerability has been resolved:
iommu: disable SVA when CONFIG_X86 is set
Patch series "Fix stale IOTLB entries for kernel address space", v7.
This series proposes a fix for a security vulnerability related to IOMMU Shared
Virtual Addressing (SVA). In an SVA context, an IOMMU can cache kernel
page table entries. When a kernel page table page is freed and
reallocated for another purpose, the IOMMU might still hold stale,
incorrect entries. This can be exploited to cause a use-after-free or
write-after-free condition, potentially leading to privilege escalation or
data corruption.
This solution introduces a deferred freeing mechanism for kernel page
table pages, which provides a safe window to notify the IOMMU to
invalidate its caches before the page is reused.
This patch (of 8):
In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
shares and walks the CPU's page tables. The x86 architecture maps the
kernel's virtual address space into the upper portion of every process's
page table. Consequently, in an SVA context, the IOMMU hardware can walk
and cache kernel page table entries.
The Linux kernel currently lacks a notification mechanism for kernel page
table changes, specifically when page table pages are freed and reused.
The IOMMU driver is only notified of changes to user virtual address
mappings. This can cause the IOMMU's internal caches to retain stale
entries for kernel VA.
Use-After-Free (UAF) and Write-After-Free (WAF) conditions arise when
kernel page table pages are freed and later reallocated. The IOMMU could
misinterpret the new data as valid page table entries. The IOMMU might
then walk into attacker-controlled memory, leading to arbitrary physical
memory DMA access or privilege escalation. This is also a
Write-After-Free issue, as the IOMMU will potentially continue to write
Accessed and Dirty bits to the freed memory while attempting to walk the
stale page tables.
Currently, SVA contexts are unprivileged and cannot access kernel
mappings. However, the IOMMU will still walk kernel-only page tables all
the way down to the leaf entries, where it realizes the mapping is for the
kernel and errors out. This means the IOMMU still caches these
intermediate page table entries, making the described vulnerability a real
concern.
Disable SVA on the x86 architecture until the IOMMU can be notified to flush
its paging cache before CPU kernel page table pages are freed.
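As a rough sketch of the idea (not the exact upstream change), the gate could be expressed as a runtime CONFIG_X86 check in the SVA enablement path; the helper name below is hypothetical:

static bool iommu_sva_supported(void)
{
	/*
	 * On x86 the IOMMU walks and caches the kernel half of the shared
	 * page tables, and there is currently no notification to flush
	 * those cached entries before a kernel page table page is freed.
	 */
	if (IS_ENABLED(CONFIG_X86))
		return false;

	return true;
}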
In the Linux kernel, the following vulnerability has been resolved:
driver core: fix potential null-ptr-deref in device_add()
I got the following null-ptr-deref report while doing a fault injection test:
BUG: kernel NULL pointer dereference, address: 0000000000000058
CPU: 2 PID: 278 Comm: 37-i2c-ds2482 Tainted: G B W N 6.1.0-rc3+
RIP: 0010:klist_put+0x2d/0xd0
Call Trace:
<TASK>
klist_remove+0xf1/0x1c0
device_release_driver_internal+0x196/0x210
bus_remove_device+0x1bd/0x240
device_add+0xd3d/0x1100
w1_add_master_device+0x476/0x490 [wire]
ds2482_probe+0x303/0x3e0 [ds2482]
This is how it happened:
w1_alloc_dev()
  // The dev->driver is set to w1_master_driver.
  memcpy(&dev->dev, device, sizeof(struct device));

device_add()
  bus_add_device()
  dpm_sysfs_add() // It fails, calls bus_remove_device.
  // error path
  bus_remove_device()
    // The dev->driver is not null, but driver is not bound.
    __device_release_driver()
      klist_remove(&dev->p->knode_driver) <-- It causes null-ptr-deref.
  // normal path
  bus_probe_device() // It's not called yet.
    device_bind_driver()
If dev->driver is set, then in the error path after calling bus_add_device()
in device_add(), bus_remove_device() is called and the device is detached
from its driver. But device_bind_driver() has not been called yet, so this
causes a null-ptr-deref while accessing 'knode_driver'. To fix this, set
dev->driver to null in the error path before calling bus_remove_device().
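A minimal sketch of that shape in the device_add() error path; the label and helper names are meant to illustrate the unwinding structure rather than reproduce the exact upstream code:

 DPMError:
	/*
	 * bus_probe_device()/device_bind_driver() never ran, so clear any
	 * pre-set ->driver before bus_remove_device() walks the driver
	 * release path and touches dev->p->knode_driver.
	 */
	dev->driver = NULL;
	bus_remove_device(dev);
 BusError:
	device_remove_attrs(dev);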
In the Linux kernel, the following vulnerability has been resolved:
HID: uclogic: Correct devm device reference for hidinput input_dev name
Reference the HID device rather than the input device for the devm
allocation of the input_dev name. Referencing the input_dev would lead to a
use-after-free when the input_dev is unregistered and subsequently fires a
uevent that depends on the name. At the point of firing the uevent, the
name would already have been freed by devres management.
Use devm_kasprintf to simplify the logic for allocating memory and
formatting the input_dev name string.
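A minimal sketch of the resulting allocation, assuming a hidinput setup path with the HID device in hdev and the input device in input_dev; the name format shown is illustrative:

	/*
	 * Tie the name's lifetime to the HID device, which outlives the
	 * input device, so the string is still valid when input_dev's
	 * unregister uevent is generated.
	 */
	input_dev->name = devm_kasprintf(&hdev->dev, GFP_KERNEL,
					 "%s Pen", hdev->name);
	if (!input_dev->name)
		return -ENOMEM;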
In the Linux kernel, the following vulnerability has been resolved:
btrfs: fix racy bitfield write in btrfs_clear_space_info_full()
From the memory-barriers.txt document regarding memory barrier ordering
guarantees:
(*) These guarantees do not apply to bitfields, because compilers often
generate code to modify these using non-atomic read-modify-write
sequences. Do not attempt to use bitfields to synchronize parallel
algorithms.
(*) Even in cases where bitfields are protected by locks, all fields
in a given bitfield must be protected by one lock. If two fields
in a given bitfield are protected by different locks, the compiler's
non-atomic read-modify-write sequences can cause an update to one
field to corrupt the value of an adjacent field.
btrfs_space_info has a bitfield sharing an underlying word consisting of
the fields full, chunk_alloc, and flush:
struct btrfs_space_info {
        struct btrfs_fs_info *      fs_info;             /*     0     8 */
        struct btrfs_space_info *   parent;              /*     8     8 */
        ...
        int                         clamp;               /*   172     4 */
        unsigned int                full:1;              /*   176: 0  4 */
        unsigned int                chunk_alloc:1;       /*   176: 1  4 */
        unsigned int                flush:1;             /*   176: 2  4 */
        ...
Therefore, to be safe from parallel read-modify-writes losing a write to
one of the bitfield members protected by a lock, all writes to all the
bitfields must use the lock. They almost universally do, except for
btrfs_clear_space_info_full() which iterates over the space_infos and
writes out found->full = 0 without a lock.
Imagine that we have one thread completing a transaction in which we
finished deleting a block_group and are thus calling
btrfs_clear_space_info_full() while simultaneously the data reclaim
ticket infrastructure is running do_async_reclaim_data_space():
T1                                          T2
btrfs_commit_transaction
  btrfs_clear_space_info_full
    data_sinfo->full = 0
      READ: full:0, chunk_alloc:0, flush:1
                                            do_async_reclaim_data_space(data_sinfo)
                                              spin_lock(&space_info->lock);
                                              if (list_empty(tickets))
                                                space_info->flush = 0;
                                                  READ: full:0, chunk_alloc:0, flush:1
                                                  MOD/WRITE: full:0, chunk_alloc:0, flush:0
                                              spin_unlock(&space_info->lock);
                                              return;
      MOD/WRITE: full:0, chunk_alloc:0, flush:1
and now data_sinfo->flush is 1 but the reclaim worker has exited. This
breaks the invariant that flush is 0 iff there is no work queued or
running. Once this invariant is violated, future allocations that go
into __reserve_bytes() will add tickets to space_info->tickets but will
see space_info->flush is set to 1 and not queue the work. After this,
they will block forever on the resulting ticket, as it is now impossible
to kick the worker again.
I also confirmed by looking at the assembly of the affected kernel that
it is doing RMW operations. For example, to set the flush (3rd) bit to 0,
the assembly is:
andb $0xfb,0x60(%rbx)
and similarly for setting the full (1st) bit to 0:
andb $0xfe,-0x20(%rax)
So I think this is really a bug on practical systems. I have observed
a number of systems in this exact state, but am currently unable to
reproduce it.
Rather than leaving this footgun lying around for the future, take
advantage of the fact that there is room in the struct anyway and that it
is already quite large, and simply change the three bitfield members to
bools. This avoids writes to space_info->full having any effect on
---truncated---
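A sketch of the shape of that change: as plain bools the three flags no longer share a read-modify-write word, so an unlocked store to one of them cannot clobber its neighbours (surrounding members elided):

struct btrfs_space_info {
	/* ... */
	bool full;
	bool chunk_alloc;
	bool flush;
	/* ... */
};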
In the Linux kernel, the following vulnerability has been resolved:
team: Move team device type change at the end of team_port_add
Attempting to add a port device that is already up will expectedly fail,
but not before modifying the team device header_ops.
In the case of the syzbot reproducer, the gre0 device is
already in state UP when it is added as a
port device of team0. This fails, but before it does,
header_ops->create of team0 is changed from eth_header to ipgre_header
in the call to team_dev_type_check_change.
Later, when we end up in ipgre_header(), the struct ip_tunnel pointer points
to nonsense, as the private data of the device still holds a struct team.
Example sequence of iproute2 commands to reproduce the hang/BUG():
ip link add dev team0 type team
ip link add dev gre0 type gre
ip link set dev gre0 up
ip link set dev gre0 master team0
ip link set dev team0 up
ping -I team0 1.1.1.1
Move team_dev_type_check_change down to where all other checks have passed,
as it changes the dev type with no way to restore it in case
one of the checks that follow it fails.
Also make sure to preserve the original mtu assignment:
- If port_dev is not the same type as dev, dev takes mtu from port_dev
- If port_dev is the same type as dev, port_dev takes mtu from dev
This is done by adding a conditional before the call to dev_set_mtu
to prevent it from assigning port_dev->mtu = dev->mtu and instead
letting team_dev_type_check_change assign dev->mtu = port_dev->mtu.
The conditional is needed because the patch moves the call to
team_dev_type_check_change past dev_set_mtu.
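A minimal sketch of the reordered logic in team_port_add() based on the description above; the surrounding code and error labels are illustrative, not the exact diff:

	/* Same-type port: the port still takes its MTU from the team device. */
	if (port_dev->type == dev->type) {
		err = dev_set_mtu(port_dev, dev->mtu);
		if (err) {
			netdev_dbg(dev, "Error %d calling dev_set_mtu\n", err);
			goto err_set_mtu;
		}
	}

	/* ... all remaining checks that may fail ... */

	/*
	 * Only now change the team device's type; for a port of a different
	 * type this also lets team_dev_type_check_change assign
	 * dev->mtu = port_dev->mtu.  There is no way to undo the type
	 * change, so nothing that can fail may come after it.
	 */
	err = team_dev_type_check_change(dev, port_dev);
	if (err)
		goto err_type_check;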
Testing:
- team device driver in-tree selftests
- Add/remove various devices as slaves of team device
- syzbot
In the Linux kernel, the following vulnerability has been resolved:
drm/radeon: delete radeon_fence_process in is_signaled, no deadlock
Delete the attempt to progress the queue when checking if the fence is
signaled. This avoids a deadlock.
dma_fence_ops::signaled can be called with the fence lock in an unknown
state. For radeon, the fence lock is also the wait queue lock. This can
cause a self-deadlock when signaled() tries to make forward progress on
the wait queue. But advancing the queue is unneeded, because incorrectly
returning false from signaled() is perfectly acceptable.
(cherry picked from commit 527ba26e50ec2ca2be9c7c82f3ad42998a75d0db)
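A sketch of the resulting ->signaled() hook, assuming radeon's usual per-ring last_seq bookkeeping; the field names are taken from the driver, but the exact upstream function may differ:

static bool radeon_fence_is_signaled(struct dma_fence *f)
{
	struct radeon_fence *fence = to_radeon_fence(f);
	struct radeon_device *rdev = fence->rdev;

	/*
	 * Only inspect the already-recorded sequence number; never call
	 * radeon_fence_process() here, since that may need the wait queue
	 * lock the caller already holds.  Returning false for a fence that
	 * has in fact signaled is acceptable.
	 */
	return atomic64_read(&rdev->fence_drv[fence->ring].last_seq) >= fence->seq;
}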
In the Linux kernel, the following vulnerability has been resolved:
ksm: use range-walk function to jump over holes in scan_get_next_rmap_item
Currently, scan_get_next_rmap_item() walks every page address in a VMA to
locate mergeable pages. This becomes highly inefficient when scanning
large virtual memory areas that contain mostly unmapped regions, causing
ksmd to use a large amount of CPU without deduplicating many pages.
This patch replaces the per-address lookup with a range walk using
walk_page_range(). The range walker allows KSM to skip over entire
unmapped holes in a VMA, avoiding unnecessary lookups. This problem was
previously discussed in [1].
Consider the following test program which creates a 32 TiB mapping in the
virtual address space but only populates a single page:
#include <unistd.h>
#include <stdio.h>
#include <sys/mman.h>

/* 32 TiB */
const size_t size = 32ul * 1024 * 1024 * 1024 * 1024;

int main() {
	char *area = mmap(NULL, size, PROT_READ | PROT_WRITE,
			  MAP_NORESERVE | MAP_PRIVATE | MAP_ANON, -1, 0);
	if (area == MAP_FAILED) {
		perror("mmap() failed\n");
		return -1;
	}

	/* Populate a single page such that we get an anon_vma. */
	*area = 0;

	/* Enable KSM. */
	madvise(area, size, MADV_MERGEABLE);
	pause();
	return 0;
}
$ ./ksm-sparse &
$ echo 1 > /sys/kernel/mm/ksm/run
Without this patch, ksmd uses 100% of the CPU for a long time (more than 1
hour on my test machine) scanning the whole 32 TiB virtual address space
that contains only one mapped page. This makes ksmd essentially deadlocked,
unable to deduplicate anything of value. With this patch, ksmd walks only
the one mapped page and skips the rest of the 32 TiB virtual address
space, making the scan fast and using little CPU.
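A minimal sketch of the range-walk shape, assuming a pte_entry callback registered through struct mm_walk_ops; the callback and private-data names are illustrative, not the actual KSM implementation:

#include <linux/pagewalk.h>

struct ksm_walk_private {
	struct page *page;	/* first present page found, if any */
	unsigned long addr;
};

/*
 * Called for each entry in existing page tables; regions with no page
 * tables at all are skipped by the walker.
 */
static int ksm_walk_pte_entry(pte_t *pte, unsigned long addr,
			      unsigned long next, struct mm_walk *walk)
{
	struct ksm_walk_private *priv = walk->private;

	if (!pte_present(*pte))
		return 0;		/* keep walking */

	priv->page = pte_page(*pte);
	priv->addr = addr;
	return 1;			/* a positive return stops the walk */
}

static const struct mm_walk_ops ksm_walk_ops = {
	.pte_entry = ksm_walk_pte_entry,
};

With this, walk_page_range(mm, start, end, &ksm_walk_ops, &priv), called with
mmap_lock held, visits only populated page tables instead of every page
address in the VMA.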