In the Linux kernel, the following vulnerability has been resolved:
block: fix race between set_blocksize and read paths
With the new large sector size support, it's now the case that
set_blocksize can change i_blksize and the folio order in a manner that
conflicts with a concurrent reader and causes a kernel crash.
Specifically, let's say that udev-worker calls libblkid to detect the
labels on a block device. The read call can create an order-0 folio to
read the first 4096 bytes from the disk. But then udev is preempted.
Next, someone tries to mount an 8k-sectorsize filesystem from the same
block device. The filesystem calls set_blksize, which sets i_blksize to
8192 and the minimum folio order to 1.
Now udev resumes, still holding the order-0 folio it allocated. It then
tries to schedule a read bio and do_mpage_readahead tries to create
bufferheads for the folio. Unfortunately, blocks_per_folio == 0 because
the page size is 4096 but the blocksize is 8192 so no bufferheads are
attached and the bh walk never sets bdev. We then submit the bio with a
NULL block device and crash.
Therefore, truncate the page cache after flushing but before updating
i_blksize. However, that's not enough -- we also need to lock out file
IO and page faults during the update. Take both the i_rwsem and the
invalidate_lock in exclusive mode for invalidations, and in shared mode
for read/write operations.
I don't know if this is the correct fix, but xfs/259 found it.
In the Linux kernel, the following vulnerability has been resolved:
vhost-scsi: protect vq->log_used with vq->mutex
The vhost-scsi completion path may access vq->log_base when vq->log_used is
already set to false.
vhost-thread QEMU-thread
vhost_scsi_complete_cmd_work()
-> vhost_add_used()
-> vhost_add_used_n()
if (unlikely(vq->log_used))
QEMU disables vq->log_used
via VHOST_SET_VRING_ADDR.
mutex_lock(&vq->mutex);
vq->log_used = false now!
mutex_unlock(&vq->mutex);
QEMU gfree(vq->log_base)
log_used()
-> log_write(vq->log_base)
Assuming the VMM is QEMU. The vq->log_base is from QEMU userpace and can be
reclaimed via gfree(). As a result, this causes invalid memory writes to
QEMU userspace.
The control queue path has the same issue.
In the Linux kernel, the following vulnerability has been resolved:
scsi: target: iscsi: Fix timeout on deleted connection
NOPIN response timer may expire on a deleted connection and crash with
such logs:
Did not receive response to NOPIN on CID: 0, failing connection for I_T Nexus (null),i,0x00023d000125,iqn.2017-01.com.iscsi.target,t,0x3d
BUG: Kernel NULL pointer dereference on read at 0x00000000
NIP strlcpy+0x8/0xb0
LR iscsit_fill_cxn_timeout_err_stats+0x5c/0xc0 [iscsi_target_mod]
Call Trace:
iscsit_handle_nopin_response_timeout+0xfc/0x120 [iscsi_target_mod]
call_timer_fn+0x58/0x1f0
run_timer_softirq+0x740/0x860
__do_softirq+0x16c/0x420
irq_exit+0x188/0x1c0
timer_interrupt+0x184/0x410
That is because nopin response timer may be re-started on nopin timer
expiration.
Stop nopin timer before stopping the nopin response timer to be sure
that no one of them will be re-started.
In the Linux kernel, the following vulnerability has been resolved:
genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie
The IOMMU translation for MSI message addresses has been a 2-step process,
separated in time:
1) iommu_dma_prepare_msi(): A cookie pointer containing the IOVA address
is stored in the MSI descriptor when an MSI interrupt is allocated.
2) iommu_dma_compose_msi_msg(): this cookie pointer is used to compute a
translated message address.
This has an inherent lifetime problem for the pointer stored in the cookie
that must remain valid between the two steps. However, there is no locking
at the irq layer that helps protect the lifetime. Today, this works under
the assumption that the iommu domain is not changed while MSI interrupts
being programmed. This is true for normal DMA API users within the kernel,
as the iommu domain is attached before the driver is probed and cannot be
changed while a driver is attached.
Classic VFIO type1 also prevented changing the iommu domain while VFIO was
running as it does not support changing the "container" after starting up.
However, iommufd has improved this so that the iommu domain can be changed
during VFIO operation. This potentially allows userspace to directly race
VFIO_DEVICE_ATTACH_IOMMUFD_PT (which calls iommu_attach_group()) and
VFIO_DEVICE_SET_IRQS (which calls into iommu_dma_compose_msi_msg()).
This potentially causes both the cookie pointer and the unlocked call to
iommu_get_domain_for_dev() on the MSI translation path to become UAFs.
Fix the MSI cookie UAF by removing the cookie pointer. The translated IOVA
address is already known during iommu_dma_prepare_msi() and cannot change.
Thus, it can simply be stored as an integer in the MSI descriptor.
The other UAF related to iommu_get_domain_for_dev() will be addressed in
patch "iommu: Make iommu_dma_prepare_msi() into a generic operation" by
using the IOMMU group mutex.
In the Linux kernel, the following vulnerability has been resolved:
dm: fix unconditional IO throttle caused by REQ_PREFLUSH
When a bio with REQ_PREFLUSH is submitted to dm, __send_empty_flush()
generates a flush_bio with REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC,
which causes the flush_bio to be throttled by wbt_wait().
An example from v5.4, similar problem also exists in upstream:
crash> bt 2091206
PID: 2091206 TASK: ffff2050df92a300 CPU: 109 COMMAND: "kworker/u260:0"
#0 [ffff800084a2f7f0] __switch_to at ffff80004008aeb8
#1 [ffff800084a2f820] __schedule at ffff800040bfa0c4
#2 [ffff800084a2f880] schedule at ffff800040bfa4b4
#3 [ffff800084a2f8a0] io_schedule at ffff800040bfa9c4
#4 [ffff800084a2f8c0] rq_qos_wait at ffff8000405925bc
#5 [ffff800084a2f940] wbt_wait at ffff8000405bb3a0
#6 [ffff800084a2f9a0] __rq_qos_throttle at ffff800040592254
#7 [ffff800084a2f9c0] blk_mq_make_request at ffff80004057cf38
#8 [ffff800084a2fa60] generic_make_request at ffff800040570138
#9 [ffff800084a2fae0] submit_bio at ffff8000405703b4
#10 [ffff800084a2fb50] xlog_write_iclog at ffff800001280834 [xfs]
#11 [ffff800084a2fbb0] xlog_sync at ffff800001280c3c [xfs]
#12 [ffff800084a2fbf0] xlog_state_release_iclog at ffff800001280df4 [xfs]
#13 [ffff800084a2fc10] xlog_write at ffff80000128203c [xfs]
#14 [ffff800084a2fcd0] xlog_cil_push at ffff8000012846dc [xfs]
#15 [ffff800084a2fda0] xlog_cil_push_work at ffff800001284a2c [xfs]
#16 [ffff800084a2fdb0] process_one_work at ffff800040111d08
#17 [ffff800084a2fe00] worker_thread at ffff8000401121cc
#18 [ffff800084a2fe70] kthread at ffff800040118de4
After commit 2def2845cc33 ("xfs: don't allow log IO to be throttled"),
the metadata submitted by xlog_write_iclog() should not be throttled.
But due to the existence of the dm layer, throttling flush_bio indirectly
causes the metadata bio to be throttled.
Fix this by conditionally adding REQ_IDLE to flush_bio.bi_opf, which makes
wbt_should_throttle() return false to avoid wbt_wait().
In the Linux kernel, the following vulnerability has been resolved:
virtio: break and reset virtio devices on device_shutdown()
Hongyu reported a hang on kexec in a VM. QEMU reported invalid memory
accesses during the hang.
Invalid read at addr 0x102877002, size 2, region '(null)', reason: rejected
Invalid write at addr 0x102877A44, size 2, region '(null)', reason: rejected
...
It was traced down to virtio-console. Kexec works fine if virtio-console
is not in use.
The issue is that virtio-console continues to write to the MMIO even after
underlying virtio-pci device is reset.
Additionally, Eric noticed that IOMMUs are reset before devices, if
devices are not reset on shutdown they continue to poke at guest memory
and get errors from the IOMMU. Some devices get wedged then.
The problem can be solved by breaking all virtio devices on virtio
bus shutdown, then resetting them.
In the Linux kernel, the following vulnerability has been resolved:
orangefs: Do not truncate file size
'len' is used to store the result of i_size_read(), so making 'len'
a size_t results in truncation to 4GiB on 32-bit systems.
In the Linux kernel, the following vulnerability has been resolved:
dm cache: prevent BUG_ON by blocking retries on failed device resumes
A cache device failing to resume due to mapping errors should not be
retried, as the failure leaves a partially initialized policy object.
Repeating the resume operation risks triggering BUG_ON when reloading
cache mappings into the incomplete policy object.
Reproduce steps:
1. create a cache metadata consisting of 512 or more cache blocks,
with some mappings stored in the first array block of the mapping
array. Here we use cache_restore v1.0 to build the metadata.
cat <<EOF >> cmeta.xml
<superblock uuid="" block_size="128" nr_cache_blocks="512" \
policy="smq" hint_width="4">
<mappings>
<mapping cache_block="0" origin_block="0" dirty="false"/>
</mappings>
</superblock>
EOF
dmsetup create cmeta --table "0 8192 linear /dev/sdc 0"
cache_restore -i cmeta.xml -o /dev/mapper/cmeta --metadata-version=2
dmsetup remove cmeta
2. wipe the second array block of the mapping array to simulate
data degradations.
mapping_root=$(dd if=/dev/sdc bs=1c count=8 skip=192 \
2>/dev/null | hexdump -e '1/8 "%u\n"')
ablock=$(dd if=/dev/sdc bs=1c count=8 skip=$((4096*mapping_root+2056)) \
2>/dev/null | hexdump -e '1/8 "%u\n"')
dd if=/dev/zero of=/dev/sdc bs=4k count=1 seek=$ablock
3. try bringing up the cache device. The resume is expected to fail
due to the broken array block.
dmsetup create cmeta --table "0 8192 linear /dev/sdc 0"
dmsetup create cdata --table "0 65536 linear /dev/sdc 8192"
dmsetup create corig --table "0 524288 linear /dev/sdc 262144"
dmsetup create cache --notable
dmsetup load cache --table "0 524288 cache /dev/mapper/cmeta \
/dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 writethrough smq 0"
dmsetup resume cache
4. try resuming the cache again. An unexpected BUG_ON is triggered
while loading cache mappings.
dmsetup resume cache
Kernel logs:
(snip)
------------[ cut here ]------------
kernel BUG at drivers/md/dm-cache-policy-smq.c:752!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 332 Comm: dmsetup Not tainted 6.13.4 #3
RIP: 0010:smq_load_mapping+0x3e5/0x570
Fix by disallowing resume operations for devices that failed the
initial attempt.
In the Linux kernel, the following vulnerability has been resolved:
rseq: Fix segfault on registration when rseq_cs is non-zero
The rseq_cs field is documented as being set to 0 by user-space prior to
registration, however this is not currently enforced by the kernel. This
can result in a segfault on return to user-space if the value stored in
the rseq_cs field doesn't point to a valid struct rseq_cs.
The correct solution to this would be to fail the rseq registration when
the rseq_cs field is non-zero. However, some older versions of glibc
will reuse the rseq area of previous threads without clearing the
rseq_cs field and will also terminate the process if the rseq
registration fails in a secondary thread. This wasn't caught in testing
because in this case the leftover rseq_cs does point to a valid struct
rseq_cs.
What we can do is clear the rseq_cs field on registration when it's
non-zero which will prevent segfaults on registration and won't break
the glibc versions that reuse rseq areas on thread creation.
In the Linux kernel, the following vulnerability has been resolved:
crypto: lzo - Fix compression buffer overrun
Unlike the decompression code, the compression code in LZO never
checked for output overruns. It instead assumes that the caller
always provides enough buffer space, disregarding the buffer length
provided by the caller.
Add a safe compression interface that checks for the end of buffer
before each write. Use the safe interface in crypto/lzo.