In the Linux kernel, the following vulnerability has been resolved:
md/raid1,raid10: don't ignore IO flags
If blk-wbt is enabled by default, it's found that raid write performance
is quite bad because all IO are throttled by wbt of underlying disks,
due to flag REQ_IDLE is ignored. And turns out this behaviour exist since
blk-wbt is introduced.
Other than REQ_IDLE, other flags should not be ignored as well, for
example REQ_META can be set for filesystems, clearing it can cause priority
reverse problems; And REQ_NOWAIT should not be cleared as well, because
io will wait instead of failing directly in underlying disks.
Fix those problems by keep IO flags from master bio.
Fises: f51d46d0e7cb ("md: add support for REQ_NOWAIT")
In the Linux kernel, the following vulnerability has been resolved:
f2fs: fix potential deadloop in prepare_compress_overwrite()
Jan Prusakowski reported a kernel hang issue as below:
When running xfstests on linux-next kernel (6.14.0-rc3, 6.12) I
encountered a problem in generic/475 test where fsstress process
gets blocked in __f2fs_write_data_pages() and the test hangs.
The options I used are:
MKFS_OPTIONS -- -O compression -O extra_attr -O project_quota -O quota /dev/vdc
MOUNT_OPTIONS -- -o acl,user_xattr -o discard,compress_extension=* /dev/vdc /vdc
INFO: task kworker/u8:0:11 blocked for more than 122 seconds.
Not tainted 6.14.0-rc3-xfstests-lockdep #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u8:0 state:D stack:0 pid:11 tgid:11 ppid:2 task_flags:0x4208160 flags:0x00004000
Workqueue: writeback wb_workfn (flush-253:0)
Call Trace:
<TASK>
__schedule+0x309/0x8e0
schedule+0x3a/0x100
schedule_preempt_disabled+0x15/0x30
__mutex_lock+0x59a/0xdb0
__f2fs_write_data_pages+0x3ac/0x400
do_writepages+0xe8/0x290
__writeback_single_inode+0x5c/0x360
writeback_sb_inodes+0x22f/0x570
wb_writeback+0xb0/0x410
wb_do_writeback+0x47/0x2f0
wb_workfn+0x5a/0x1c0
process_one_work+0x223/0x5b0
worker_thread+0x1d5/0x3c0
kthread+0xfd/0x230
ret_from_fork+0x31/0x50
ret_from_fork_asm+0x1a/0x30
</TASK>
The root cause is: once generic/475 starts toload error table to dm
device, f2fs_prepare_compress_overwrite() will loop reading compressed
cluster pages due to IO error, meanwhile it has held .writepages lock,
it can block all other writeback tasks.
Let's fix this issue w/ below changes:
- add f2fs_handle_page_eio() in prepare_compress_overwrite() to
detect IO error.
- detect cp_error earler in f2fs_read_multi_pages().
In the Linux kernel, the following vulnerability has been resolved:
ax25: Remove broken autobind
Binding AX25 socket by using the autobind feature leads to memory leaks
in ax25_connect() and also refcount leaks in ax25_release(). Memory
leak was detected with kmemleak:
================================================================
unreferenced object 0xffff8880253cd680 (size 96):
backtrace:
__kmalloc_node_track_caller_noprof (./include/linux/kmemleak.h:43)
kmemdup_noprof (mm/util.c:136)
ax25_rt_autobind (net/ax25/ax25_route.c:428)
ax25_connect (net/ax25/af_ax25.c:1282)
__sys_connect_file (net/socket.c:2045)
__sys_connect (net/socket.c:2064)
__x64_sys_connect (net/socket.c:2067)
do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
================================================================
When socket is bound, refcounts must be incremented the way it is done
in ax25_bind() and ax25_setsockopt() (SO_BINDTODEVICE). In case of
autobind, the refcounts are not incremented.
This bug leads to the following issue reported by Syzkaller:
================================================================
ax25_connect(): syz-executor318 uses autobind, please contact jreuter@yaina.de
------------[ cut here ]------------
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 0 PID: 5317 at lib/refcount.c:31 refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
Modules linked in:
CPU: 0 UID: 0 PID: 5317 Comm: syz-executor318 Not tainted 6.14.0-rc4-syzkaller-00278-gece144f151ac #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
...
Call Trace:
<TASK>
__refcount_dec include/linux/refcount.h:336 [inline]
refcount_dec include/linux/refcount.h:351 [inline]
ref_tracker_free+0x6af/0x7e0 lib/ref_tracker.c:236
netdev_tracker_free include/linux/netdevice.h:4302 [inline]
netdev_put include/linux/netdevice.h:4319 [inline]
ax25_release+0x368/0x960 net/ax25/af_ax25.c:1080
__sock_release net/socket.c:647 [inline]
sock_close+0xbc/0x240 net/socket.c:1398
__fput+0x3e9/0x9f0 fs/file_table.c:464
__do_sys_close fs/open.c:1580 [inline]
__se_sys_close fs/open.c:1565 [inline]
__x64_sys_close+0x7f/0x110 fs/open.c:1565
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
...
</TASK>
================================================================
Considering the issues above and the comments left in the code that say:
"check if we can remove this feature. It is broken."; "autobinding in this
may or may not work"; - it is better to completely remove this feature than
to fix it because it is broken and leads to various kinds of memory bugs.
Now calling connect() without first binding socket will result in an
error (-EINVAL). Userspace software that relies on the autobind feature
might get broken. However, this feature does not seem widely used with
this specific driver as it was not reliable at any point of time, and it
is already broken anyway. E.g. ax25-tools and ax25-apps packages for
popular distributions do not use the autobind feature for AF_AX25.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
In the Linux kernel, the following vulnerability has been resolved:
net: Remove RTNL dance for SIOCBRADDIF and SIOCBRDELIF.
SIOCBRDELIF is passed to dev_ioctl() first and later forwarded to
br_ioctl_call(), which causes unnecessary RTNL dance and the splat
below [0] under RTNL pressure.
Let's say Thread A is trying to detach a device from a bridge and
Thread B is trying to remove the bridge.
In dev_ioctl(), Thread A bumps the bridge device's refcnt by
netdev_hold() and releases RTNL because the following br_ioctl_call()
also re-acquires RTNL.
In the race window, Thread B could acquire RTNL and try to remove
the bridge device. Then, rtnl_unlock() by Thread B will release RTNL
and wait for netdev_put() by Thread A.
Thread A, however, must hold RTNL after the unlock in dev_ifsioc(),
which may take long under RTNL pressure, resulting in the splat by
Thread B.
Thread A (SIOCBRDELIF) Thread B (SIOCBRDELBR)
---------------------- ----------------------
sock_ioctl sock_ioctl
`- sock_do_ioctl `- br_ioctl_call
`- dev_ioctl `- br_ioctl_stub
|- rtnl_lock |
|- dev_ifsioc '
' |- dev = __dev_get_by_name(...)
|- netdev_hold(dev, ...) .
/ |- rtnl_unlock ------. |
| |- br_ioctl_call `---> |- rtnl_lock
Race | | `- br_ioctl_stub |- br_del_bridge
Window | | | |- dev = __dev_get_by_name(...)
| | | May take long | `- br_dev_delete(dev, ...)
| | | under RTNL pressure | `- unregister_netdevice_queue(dev, ...)
| | | | `- rtnl_unlock
\ | |- rtnl_lock <-' `- netdev_run_todo
| |- ... `- netdev_run_todo
| `- rtnl_unlock |- __rtnl_unlock
| |- netdev_wait_allrefs_any
|- netdev_put(dev, ...) <----------------'
Wait refcnt decrement
and log splat below
To avoid blocking SIOCBRDELBR unnecessarily, let's not call
dev_ioctl() for SIOCBRADDIF and SIOCBRDELIF.
In the dev_ioctl() path, we do the following:
1. Copy struct ifreq by get_user_ifreq in sock_do_ioctl()
2. Check CAP_NET_ADMIN in dev_ioctl()
3. Call dev_load() in dev_ioctl()
4. Fetch the master dev from ifr.ifr_name in dev_ifsioc()
3. can be done by request_module() in br_ioctl_call(), so we move
1., 2., and 4. to br_ioctl_stub().
Note that 2. is also checked later in add_del_if(), but it's better
performed before RTNL.
SIOCBRADDIF and SIOCBRDELIF have been processed in dev_ioctl() since
the pre-git era, and there seems to be no specific reason to process
them there.
[0]:
unregister_netdevice: waiting for wpan3 to become free. Usage count = 2
ref_tracker: wpan3@ffff8880662d8608 has 1/1 users at
__netdev_tracker_alloc include/linux/netdevice.h:4282 [inline]
netdev_hold include/linux/netdevice.h:4311 [inline]
dev_ifsioc+0xc6a/0x1160 net/core/dev_ioctl.c:624
dev_ioctl+0x255/0x10c0 net/core/dev_ioctl.c:826
sock_do_ioctl+0x1ca/0x260 net/socket.c:1213
sock_ioctl+0x23a/0x6c0 net/socket.c:1318
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl fs/ioctl.c:892 [inline]
__x64_sys_ioctl+0x1a4/0x210 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcb/0x250 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
In the Linux kernel, the following vulnerability has been resolved:
ext4: avoid journaling sb update on error if journal is destroying
Presently we always BUG_ON if trying to start a transaction on a journal marked
with JBD2_UNMOUNT, since this should never happen. However, while ltp running
stress tests, it was observed that in case of some error handling paths, it is
possible for update_super_work to start a transaction after the journal is
destroyed eg:
(umount)
ext4_kill_sb
kill_block_super
generic_shutdown_super
sync_filesystem /* commits all txns */
evict_inodes
/* might start a new txn */
ext4_put_super
flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
jbd2_journal_destroy
journal_kill_thread
journal->j_flags |= JBD2_UNMOUNT;
jbd2_journal_commit_transaction
jbd2_journal_get_descriptor_buffer
jbd2_journal_bmap
ext4_journal_bmap
ext4_map_blocks
...
ext4_inode_error
ext4_handle_error
schedule_work(&sbi->s_sb_upd_work)
/* work queue kicks in */
update_super_work
jbd2_journal_start
start_this_handle
BUG_ON(journal->j_flags &
JBD2_UNMOUNT)
Hence, introduce a new mount flag to indicate journal is destroying and only do
a journaled (and deferred) update of sb if this flag is not set. Otherwise, just
fallback to an un-journaled commit.
Further, in the journal destroy path, we have the following sequence:
1. Set mount flag indicating journal is destroying
2. force a commit and wait for it
3. flush pending sb updates
This sequence is important as it ensures that, after this point, there is no sb
update that might be journaled so it is safe to update the sb outside the
journal. (To avoid race discussed in 2d01ddc86606)
Also, we don't need a similar check in ext4_grp_locked_error since it is only
called from mballoc and AFAICT it would be always valid to schedule work here.
In the Linux kernel, the following vulnerability has been resolved:
btrfs: fix block group refcount race in btrfs_create_pending_block_groups()
Block group creation is done in two phases, which results in a slightly
unintuitive property: a block group can be allocated/deallocated from
after btrfs_make_block_group() adds it to the space_info with
btrfs_add_bg_to_space_info(), but before creation is completely completed
in btrfs_create_pending_block_groups(). As a result, it is possible for a
block group to go unused and have 'btrfs_mark_bg_unused' called on it
concurrently with 'btrfs_create_pending_block_groups'. This causes a
number of issues, which were fixed with the block group flag
'BLOCK_GROUP_FLAG_NEW'.
However, this fix is not quite complete. Since it does not use the
unused_bg_lock, it is possible for the following race to occur:
btrfs_create_pending_block_groups btrfs_mark_bg_unused
if list_empty // false
list_del_init
clear_bit
else if (test_bit) // true
list_move_tail
And we get into the exact same broken ref count and invalid new_bgs
state for transaction cleanup that BLOCK_GROUP_FLAG_NEW was designed to
prevent.
The broken refcount aspect will result in a warning like:
[1272.943527] refcount_t: underflow; use-after-free.
[1272.943967] WARNING: CPU: 1 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
[1272.944731] Modules linked in: btrfs virtio_net xor zstd_compress raid6_pq null_blk [last unloaded: btrfs]
[1272.945550] CPU: 1 UID: 0 PID: 61 Comm: kworker/u32:1 Kdump: loaded Tainted: G W 6.14.0-rc5+ #108
[1272.946368] Tainted: [W]=WARN
[1272.946585] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[1272.947273] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
[1272.947788] RIP: 0010:refcount_warn_saturate+0xba/0x110
[1272.949532] RSP: 0018:ffffbf1200247df0 EFLAGS: 00010282
[1272.949901] RAX: 0000000000000000 RBX: ffffa14b00e3f800 RCX: 0000000000000000
[1272.950437] RDX: 0000000000000000 RSI: ffffbf1200247c78 RDI: 00000000ffffdfff
[1272.950986] RBP: ffffa14b00dc2860 R08: 00000000ffffdfff R09: ffffffff90526268
[1272.951512] R10: ffffffff904762c0 R11: 0000000063666572 R12: ffffa14b00dc28c0
[1272.952024] R13: 0000000000000000 R14: ffffa14b00dc2868 R15: 000001285dcd12c0
[1272.952850] FS: 0000000000000000(0000) GS:ffffa14d33c40000(0000) knlGS:0000000000000000
[1272.953458] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1272.953931] CR2: 00007f838cbda000 CR3: 000000010104e000 CR4: 00000000000006f0
[1272.954474] Call Trace:
[1272.954655] <TASK>
[1272.954812] ? refcount_warn_saturate+0xba/0x110
[1272.955173] ? __warn.cold+0x93/0xd7
[1272.955487] ? refcount_warn_saturate+0xba/0x110
[1272.955816] ? report_bug+0xe7/0x120
[1272.956103] ? handle_bug+0x53/0x90
[1272.956424] ? exc_invalid_op+0x13/0x60
[1272.956700] ? asm_exc_invalid_op+0x16/0x20
[1272.957011] ? refcount_warn_saturate+0xba/0x110
[1272.957399] btrfs_discard_cancel_work.cold+0x26/0x2b [btrfs]
[1272.957853] btrfs_put_block_group.cold+0x5d/0x8e [btrfs]
[1272.958289] btrfs_discard_workfn+0x194/0x380 [btrfs]
[1272.958729] process_one_work+0x130/0x290
[1272.959026] worker_thread+0x2ea/0x420
[1272.959335] ? __pfx_worker_thread+0x10/0x10
[1272.959644] kthread+0xd7/0x1c0
[1272.959872] ? __pfx_kthread+0x10/0x10
[1272.960172] ret_from_fork+0x30/0x50
[1272.960474] ? __pfx_kthread+0x10/0x10
[1272.960745] ret_from_fork_asm+0x1a/0x30
[1272.961035] </TASK>
[1272.961238] ---[ end trace 0000000000000000 ]---
Though we have seen them in the async discard workfn as well. It is
most likely to happen after a relocation finishes which cancels discard,
tears down the block group, etc.
Fix this fully by taking the lock arou
---truncated---
In the Linux kernel, the following vulnerability has been resolved:
net: fix NULL pointer dereference in l3mdev_l3_rcv
When delete l3s ipvlan:
ip link del link eth0 ipvlan1 type ipvlan mode l3s
This may cause a null pointer dereference:
Call trace:
ip_rcv_finish+0x48/0xd0
ip_rcv+0x5c/0x100
__netif_receive_skb_one_core+0x64/0xb0
__netif_receive_skb+0x20/0x80
process_backlog+0xb4/0x204
napi_poll+0xe8/0x294
net_rx_action+0xd8/0x22c
__do_softirq+0x12c/0x354
This is because l3mdev_l3_rcv() visit dev->l3mdev_ops after
ipvlan_l3s_unregister() assign the dev->l3mdev_ops to NULL. The process
like this:
(CPU1) | (CPU2)
l3mdev_l3_rcv() |
check dev->priv_flags: |
master = skb->dev; |
|
| ipvlan_l3s_unregister()
| set dev->priv_flags
| dev->l3mdev_ops = NULL;
|
visit master->l3mdev_ops |
To avoid this by do not set dev->l3mdev_ops when unregister l3s ipvlan.
In the Linux kernel, the following vulnerability has been resolved:
ibmvnic: Use kernel helpers for hex dumps
Previously, when the driver was printing hex dumps, the buffer was cast
to an 8 byte long and printed using string formatters. If the buffer
size was not a multiple of 8 then a read buffer overflow was possible.
Therefore, create a new ibmvnic function that loops over a buffer and
calls hex_dump_to_buffer instead.
This patch address KASAN reports like the one below:
ibmvnic 30000003 env3: Login Buffer:
ibmvnic 30000003 env3: 01000000af000000
<...>
ibmvnic 30000003 env3: 2e6d62692e736261
ibmvnic 30000003 env3: 65050003006d6f63
==================================================================
BUG: KASAN: slab-out-of-bounds in ibmvnic_login+0xacc/0xffc [ibmvnic]
Read of size 8 at addr c0000001331a9aa8 by task ip/17681
<...>
Allocated by task 17681:
<...>
ibmvnic_login+0x2f0/0xffc [ibmvnic]
ibmvnic_open+0x148/0x308 [ibmvnic]
__dev_open+0x1ac/0x304
<...>
The buggy address is located 168 bytes inside of
allocated 175-byte region [c0000001331a9a00, c0000001331a9aaf)
<...>
=================================================================
ibmvnic 30000003 env3: 000000000033766e
In the Linux kernel, the following vulnerability has been resolved:
bonding: check xdp prog when set bond mode
Following operations can trigger a warning[1]:
ip netns add ns1
ip netns exec ns1 ip link add bond0 type bond mode balance-rr
ip netns exec ns1 ip link set dev bond0 xdp obj af_xdp_kern.o sec xdp
ip netns exec ns1 ip link set bond0 type bond mode broadcast
ip netns del ns1
When delete the namespace, dev_xdp_uninstall() is called to remove xdp
program on bond dev, and bond_xdp_set() will check the bond mode. If bond
mode is changed after attaching xdp program, the warning may occur.
Some bond modes (broadcast, etc.) do not support native xdp. Set bond mode
with xdp program attached is not good. Add check for xdp program when set
bond mode.
[1]
------------[ cut here ]------------
WARNING: CPU: 0 PID: 11 at net/core/dev.c:9912 unregister_netdevice_many_notify+0x8d9/0x930
Modules linked in:
CPU: 0 UID: 0 PID: 11 Comm: kworker/u4:0 Not tainted 6.14.0-rc4 #107
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
Workqueue: netns cleanup_net
RIP: 0010:unregister_netdevice_many_notify+0x8d9/0x930
Code: 00 00 48 c7 c6 6f e3 a2 82 48 c7 c7 d0 b3 96 82 e8 9c 10 3e ...
RSP: 0018:ffffc90000063d80 EFLAGS: 00000282
RAX: 00000000ffffffa1 RBX: ffff888004959000 RCX: 00000000ffffdfff
RDX: 0000000000000000 RSI: 00000000ffffffea RDI: ffffc90000063b48
RBP: ffffc90000063e28 R08: ffffffff82d39b28 R09: 0000000000009ffb
R10: 0000000000000175 R11: ffffffff82d09b40 R12: ffff8880049598e8
R13: 0000000000000001 R14: dead000000000100 R15: ffffc90000045000
FS: 0000000000000000(0000) GS:ffff888007a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000d406b60 CR3: 000000000483e000 CR4: 00000000000006f0
Call Trace:
<TASK>
? __warn+0x83/0x130
? unregister_netdevice_many_notify+0x8d9/0x930
? report_bug+0x18e/0x1a0
? handle_bug+0x54/0x90
? exc_invalid_op+0x18/0x70
? asm_exc_invalid_op+0x1a/0x20
? unregister_netdevice_many_notify+0x8d9/0x930
? bond_net_exit_batch_rtnl+0x5c/0x90
cleanup_net+0x237/0x3d0
process_one_work+0x163/0x390
worker_thread+0x293/0x3b0
? __pfx_worker_thread+0x10/0x10
kthread+0xec/0x1e0
? __pfx_kthread+0x10/0x10
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2f/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
---[ end trace 0000000000000000 ]---
In the Linux kernel, the following vulnerability has been resolved:
net: dsa: sja1105: fix kasan out-of-bounds warning in sja1105_table_delete_entry()
There are actually 2 problems:
- deleting the last element doesn't require the memmove of elements
[i + 1, end) over it. Actually, element i+1 is out of bounds.
- The memmove itself should move size - i - 1 elements, because the last
element is out of bounds.
The out-of-bounds element still remains out of bounds after being
accessed, so the problem is only that we touch it, not that it becomes
in active use. But I suppose it can lead to issues if the out-of-bounds
element is part of an unmapped page.