In the Linux kernel, the following vulnerability has been resolved:
net/sched: sch_red: Replace direct dequeue call with peek and qdisc_dequeue_peeked
When red qdisc has children (eg qfq qdisc) whose peek() callback is
qdisc_peek_dequeued(), we could get a kernel panic. When the parent of such
qdiscs (eg illustrated in patch #3 as tbf) wants to retrieve an skb from
its child (red in this case), it will do the following:
1a. do a peek() - and when sensing there's an skb the child can offer, then
- the child in this case(red) calls its child's (qfq) peek.
qfq does the right thing and will return the gso_skb queue packet.
Note: if there wasnt a gso_skb entry then qfq will store it there.
1b. invoke a dequeue() on the child (red). And herein lies the problem.
- red will call the child's dequeue() which will essentially just
try to grab something of qfq's queue.
[ 78.667668][ T363] KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
[ 78.667927][ T363] CPU: 1 UID: 0 PID: 363 Comm: ping Not tainted 7.1.0-rc1-00033-g46f74a3f7d57-dirty #790 PREEMPT(full)
[ 78.668263][ T363] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 78.668486][ T363] RIP: 0010:qfq_dequeue+0x446/0xc90 [sch_qfq]
[ 78.668718][ T363] Code: 54 c0 e8 dd 90 00 f1 48 c7 c7 e0 03 54 c0 48 89 de e8 ce 90 00 f1 48 8d 7b 48 b8 ff ff 37 00 48 89 fa 48 c1 e0 2a 48 c1 ea 03 <80> 3c 02 00 74 05 e8 ef a1 e1 f1 48 8b 7b 48 48 8d 54 24 58 48 8d
[ 78.669312][ T363] RSP: 0018:ffff88810de573e0 EFLAGS: 00010216
[ 78.669533][ T363] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 78.669790][ T363] RDX: 0000000000000009 RSI: 0000000000000004 RDI: 0000000000000048
[ 78.670044][ T363] RBP: ffff888110dc4000 R08: ffffffffb1b0885a R09: fffffbfff6ba9078
[ 78.670297][ T363] R10: 0000000000000003 R11: ffff888110e31c80 R12: 0000001880000000
[ 78.670560][ T363] R13: ffff888110dc4150 R14: ffff888110dc42b8 R15: 0000000000000200
[ 78.670814][ T363] FS: 00007f66a8f09c40(0000) GS:ffff888163428000(0000) knlGS:0000000000000000
[ 78.671110][ T363] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 78.671324][ T363] CR2: 000055db4c6a30a8 CR3: 000000010da67000 CR4: 0000000000750ef0
[ 78.671585][ T363] PKRU: 55555554
[ 78.671713][ T363] Call Trace:
[ 78.671843][ T363] <TASK>
[ 78.671936][ T363] ? __pfx_qfq_dequeue+0x10/0x10 [sch_qfq]
[ 78.672148][ T363] ? __pfx__printk+0x10/0x10
[ 78.672322][ T363] ? srso_alias_return_thunk+0x5/0xfbef5
[ 78.672496][ T363] ? lockdep_hardirqs_on_prepare+0xa8/0x1a0
[ 78.672706][ T363] ? srso_alias_return_thunk+0x5/0xfbef5
[ 78.672875][ T363] ? trace_hardirqs_on+0x19/0x1a0
[ 78.673047][ T363] red_dequeue+0x65/0x270 [sch_red]
[ 78.673217][ T363] ? srso_alias_return_thunk+0x5/0xfbef5
[ 78.673385][ T363] tbf_dequeue.cold+0xb0/0x70c [sch_tbf]
[ 78.673566][ T363] __qdisc_run+0x169/0x1900
The right thing to do in #1b is to grab the skb off gso_skb queue.
This patchset fixes that issue by changing #1b to use qdisc_dequeue_peeked()
method instead.
In the Linux kernel, the following vulnerability has been resolved:
net/rds: reset op_nents when zerocopy page pin fails
When iov_iter_get_pages2() fails in rds_message_zcopy_from_user(),
the pinned pages are released with put_page(), and
rm->data.op_mmp_znotifier is cleared. But we fail to properly
clear rm->data.op_nents.
Later when rds_message_purge() is called from rds_sendmsg() the
cleanup loop iterates over the incorrectly non zero number of
op_nents and frees them again.
Fix this by properly resetting op_nents when it should be in
rds_message_zcopy_from_user().
In the Linux kernel, the following vulnerability has been resolved:
crypto: pcrypt - Fix handling of MAY_BACKLOG requests
MAY_BACKLOG requests can return EBUSY. Handle them by checking
for that value and filtering out EINPROGRESS notifications.
In the Linux kernel, the following vulnerability has been resolved:
lib/crypto: mpi: Fix integer underflow in mpi_read_raw_from_sgl()
Yiming reports an integer underflow in mpi_read_raw_from_sgl() when
subtracting "lzeros" from the unsigned "nbytes".
For this to happen, the scatterlist "sgl" needs to occupy more bytes
than the "nbytes" parameter and the first "nbytes + 1" bytes of the
scatterlist must be zero. Under these conditions, the while loop
iterating over the scatterlist will count more zeroes than "nbytes",
subtract the number of zeroes from "nbytes" and cause the underflow.
When commit 2d4d1eea540b ("lib/mpi: Add mpi sgl helpers") originally
introduced the bug, it couldn't be triggered because all callers of
mpi_read_raw_from_sgl() passed a scatterlist whose length was equal to
"nbytes".
However since commit 63ba4d67594a ("KEYS: asymmetric: Use new crypto
interface without scatterlists"), the underflow can now actually be
triggered. When invoking a KEYCTL_PKEY_ENCRYPT system call with a
larger "out_len" than "in_len" and filling the "in" buffer with zeroes,
crypto_akcipher_sync_prep() will create an all-zero scatterlist used for
both the "src" and "dst" member of struct akcipher_request and thereby
fulfil the conditions to trigger the bug:
sys_keyctl()
keyctl_pkey_e_d_s()
asymmetric_key_eds_op()
software_key_eds_op()
crypto_akcipher_sync_encrypt()
crypto_akcipher_sync_prep()
crypto_akcipher_encrypt()
rsa_enc()
mpi_read_raw_from_sgl()
To the user this will be visible as a DoS as the kernel spins forever,
causing soft lockup splats as a side effect.
Fix it.
In the Linux kernel, the following vulnerability has been resolved:
KVM: SVM: Set/clear CR8 write interception when AVIC is (de)activated
Explicitly set/clear CR8 write interception when AVIC is (de)activated to
fix a bug where KVM leaves the interception enabled after AVIC is
activated. E.g. if KVM emulates INIT=>WFS while AVIC is deactivated, CR8
will remain intercepted in perpetuity.
On its own, the dangling CR8 intercept is "just" a performance issue, but
combined with the TPR sync bug fixed by commit d02e48830e3f ("KVM: SVM:
Sync TPR from LAPIC into VMCB::V_TPR even if AVIC is active"), the danging
intercept is fatal to Windows guests as the TPR seen by hardware gets
wildly out of sync with reality.
Note, VMX isn't affected by the bug as TPR_THRESHOLD is explicitly ignored
when Virtual Interrupt Delivery is enabled, i.e. when APICv is active in
KVM's world. I.e. there's no need to trigger update_cr8_intercept(), this
is firmly an SVM implementation flaw/detail.
WARN if KVM gets a CR8 write #VMEXIT while AVIC is active, as KVM should
never enter the guest with AVIC enabled and CR8 writes intercepted.
[Squash fix to avic_deactivate_vmcb. - Paolo]
In the Linux kernel, the following vulnerability has been resolved:
mmc: core: Avoid bitfield RMW for claim/retune flags
Move claimed and retune control flags out of the bitfield word to
avoid unrelated RMW side effects in asynchronous contexts.
The host->claimed bit shared a word with retune flags. Writes to claimed
in __mmc_claim_host() or retune_now in mmc_mq_queue_rq() can overwrite
other bits when concurrent updates happen in other contexts, triggering
spurious WARN_ON(!host->claimed). Convert claimed, can_retune,
retune_now and retune_paused to bool to remove shared-word coupling.
In the Linux kernel, the following vulnerability has been resolved:
rxrpc: Also unshare DATA/RESPONSE packets when paged frags are present
The DATA-packet handler in rxrpc_input_call_event() and the RESPONSE
handler in rxrpc_verify_response() copy the skb to a linear one before
calling into the security ops only when skb_cloned() is true. An skb
that is not cloned but still carries externally-owned paged fragments
(e.g. SKBFL_SHARED_FRAG set by splice() into a UDP socket via
__ip_append_data, or a chained skb_has_frag_list()) falls through to
the in-place decryption path, which binds the frag pages directly into
the AEAD/skcipher SGL via skb_to_sgvec().
Extend the gate to also unshare when skb_has_frag_list() or
skb_has_shared_frag() is true. This catches the splice-loopback vector
and other externally-shared frag sources while preserving the
zero-copy fast path for skbs whose frags are kernel-private (e.g. NIC
page_pool RX, GRO). The OOM/trace handling already in place is reused.
In the Linux kernel, the following vulnerability has been resolved:
unshare: fix unshare_fs() handling
There's an unpleasant corner case in unshare(2), when we have a
CLONE_NEWNS in flags and current->fs hadn't been shared at all; in that
case copy_mnt_ns() gets passed current->fs instead of a private copy,
which causes interesting warts in proof of correctness]
> I guess if private means fs->users == 1, the condition could still be true.
Unfortunately, it's worse than just a convoluted proof of correctness.
Consider the case when we have CLONE_NEWCGROUP in addition to CLONE_NEWNS
(and current->fs->users == 1).
We pass current->fs to copy_mnt_ns(), all right. Suppose it succeeds and
flips current->fs->{pwd,root} to corresponding locations in the new namespace.
Now we proceed to copy_cgroup_ns(), which fails (e.g. with -ENOMEM).
We call put_mnt_ns() on the namespace created by copy_mnt_ns(), it's
destroyed and its mount tree is dissolved, but... current->fs->root and
current->fs->pwd are both left pointing to now detached mounts.
They are pinning those, so it's not a UAF, but it leaves the calling
process with unshare(2) failing with -ENOMEM _and_ leaving it with
pwd and root on detached isolated mounts. The last part is clearly a bug.
There is other fun related to that mess (races with pivot_root(), including
the one between pivot_root() and fork(), of all things), but this one
is easy to isolate and fix - treat CLONE_NEWNS as "allocate a new
fs_struct even if it hadn't been shared in the first place". Sure, we could
go for something like "if both CLONE_NEWNS *and* one of the things that might
end up failing after copy_mnt_ns() call in create_new_namespaces() are set,
force allocation of new fs_struct", but let's keep it simple - the cost
of copy_fs_struct() is trivial.
Another benefit is that copy_mnt_ns() with CLONE_NEWNS *always* gets
a freshly allocated fs_struct, yet to be attached to anything. That
seriously simplifies the analysis...
FWIW, that bug had been there since the introduction of unshare(2) ;-/
In the Linux kernel, the following vulnerability has been resolved:
net/mlx5e: Fix DMA FIFO desync on error CQE SQ recovery
In case of a TX error CQE, a recovery flow is triggered,
mlx5e_reset_txqsq_cc_pc() resets dma_fifo_cc to 0 but not dma_fifo_pc,
desyncing the DMA FIFO producer and consumer.
After recovery, the producer pushes new DMA entries at the old
dma_fifo_pc, while the consumer reads from position 0.
This causes us to unmap stale DMA addresses from before the recovery.
The DMA FIFO is a purely software construct with no HW counterpart.
At the point of reset, all WQEs have been flushed so dma_fifo_cc is
already equal to dma_fifo_pc. There is no need to reset either counter,
similar to how skb_fifo pc/cc are untouched.
Remove the 'dma_fifo_cc = 0' reset.
This fixes the following WARNING:
WARNING: CPU: 0 PID: 0 at drivers/iommu/dma-iommu.c:1240 iommu_dma_unmap_page+0x79/0x90
Modules linked in: mlx5_vdpa vringh vdpa bonding mlx5_ib mlx5_vfio_pci ipip mlx5_fwctl tunnel4 mlx5_core ib_ipoib geneve ip6_gre ip_gre gre nf_tables ip6_tunnel rdma_ucm ib_uverbs ib_umad vfio_pci vfio_pci_core act_mirred act_skbedit act_vlan vhost_net vhost tap ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle cls_matchall nfnetlink_cttimeout act_gact cls_flower sch_ingress vhost_iotlb iptable_raw tunnel6 vfio_iommu_type1 vfio openvswitch nsh rpcsec_gss_krb5 auth_rpcgss oid_registry xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat xt_addrtype br_netfilter overlay zram zsmalloc rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core fuse [last unloaded: nf_tables]
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.13.0-rc5_for_upstream_min_debug_2024_12_30_21_33 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:iommu_dma_unmap_page+0x79/0x90
Code: 2b 4d 3b 21 72 26 4d 3b 61 08 73 20 49 89 d8 44 89 f9 5b 4c 89 f2 4c 89 e6 48 89 ef 5d 41 5c 41 5d 41 5e 41 5f e9 c7 ae 9e ff <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 66 2e 0f 1f 84 00 00 00 00
Call Trace:
<IRQ>
? __warn+0x7d/0x110
? iommu_dma_unmap_page+0x79/0x90
? report_bug+0x16d/0x180
? handle_bug+0x4f/0x90
? exc_invalid_op+0x14/0x70
? asm_exc_invalid_op+0x16/0x20
? iommu_dma_unmap_page+0x79/0x90
? iommu_dma_unmap_page+0x2e/0x90
dma_unmap_page_attrs+0x10d/0x1b0
mlx5e_tx_wi_dma_unmap+0xbe/0x120 [mlx5_core]
mlx5e_poll_tx_cq+0x16d/0x690 [mlx5_core]
mlx5e_napi_poll+0x8b/0xac0 [mlx5_core]
__napi_poll+0x24/0x190
net_rx_action+0x32a/0x3b0
? mlx5_eq_comp_int+0x7e/0x270 [mlx5_core]
? notifier_call_chain+0x35/0xa0
handle_softirqs+0xc9/0x270
irq_exit_rcu+0x71/0xd0
common_interrupt+0x7f/0xa0
</IRQ>
<TASK>
asm_common_interrupt+0x22/0x40