In the Linux kernel, the following vulnerability has been resolved:
mm/vmscan: don't try to reclaim hwpoison folio
Syzkaller reports a bug as follows:
Injecting memory failure for pfn 0x18b00e at process virtual address 0x20ffd000
Memory failure: 0x18b00e: dirty swapcache page still referenced by 2 users
Memory failure: 0x18b00e: recovery action for dirty swapcache page: Failed
page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x20ffd pfn:0x18b00e
memcg:ffff0000dd6d9000
anon flags: 0x5ffffe00482011(locked|dirty|arch_1|swapbacked|hwpoison|node=0|zone=2|lastcpupid=0xfffff)
raw: 005ffffe00482011 dead000000000100 dead000000000122 ffff0000e232a7c9
raw: 0000000000020ffd 0000000000000000 00000002ffffffff ffff0000dd6d9000
page dumped because: VM_BUG_ON_FOLIO(!folio_test_uptodate(folio))
------------[ cut here ]------------
kernel BUG at mm/swap_state.c:184!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
Modules linked in:
CPU: 0 PID: 60 Comm: kswapd0 Not tainted 6.6.0-gcb097e7de84e #3
Hardware name: linux,dummy-virt (DT)
pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : add_to_swap+0xbc/0x158
lr : add_to_swap+0xbc/0x158
sp : ffff800087f37340
x29: ffff800087f37340 x28: fffffc00052c0380 x27: ffff800087f37780
x26: ffff800087f37490 x25: ffff800087f37c78 x24: ffff800087f377a0
x23: ffff800087f37c50 x22: 0000000000000000 x21: fffffc00052c03b4
x20: 0000000000000000 x19: fffffc00052c0380 x18: 0000000000000000
x17: 296f696c6f662865 x16: 7461646f7470755f x15: 747365745f6f696c
x14: 6f6621284f494c4f x13: 0000000000000001 x12: ffff600036d8b97b
x11: 1fffe00036d8b97a x10: ffff600036d8b97a x9 : dfff800000000000
x8 : 00009fffc9274686 x7 : ffff0001b6c5cbd3 x6 : 0000000000000001
x5 : ffff0000c25896c0 x4 : 0000000000000000 x3 : 0000000000000000
x2 : 0000000000000000 x1 : ffff0000c25896c0 x0 : 0000000000000000
Call trace:
add_to_swap+0xbc/0x158
shrink_folio_list+0x12ac/0x2648
shrink_inactive_list+0x318/0x948
shrink_lruvec+0x450/0x720
shrink_node_memcgs+0x280/0x4a8
shrink_node+0x128/0x978
balance_pgdat+0x4f0/0xb20
kswapd+0x228/0x438
kthread+0x214/0x230
ret_from_fork+0x10/0x20
I can reproduce this issue with the following steps:
1) When a dirty swapcache page is isolated by reclaim process and the
page isn't locked, inject memory failure for the page.
me_swapcache_dirty() clears uptodate flag and tries to delete from lru,
but fails. Reclaim process will put the hwpoisoned page back to lru.
2) The process that maps the hwpoisoned page exits, the page is deleted
the page will never be freed and will be in the lru forever.
3) If we trigger a reclaim again and tries to reclaim the page,
add_to_swap() will trigger VM_BUG_ON_FOLIO due to the uptodate flag is
cleared.
To fix it, skip the hwpoisoned page in shrink_folio_list(). Besides, the
hwpoison folio may not be unmapped by hwpoison_user_mappings() yet, unmap
it in shrink_folio_list(), otherwise the folio will fail to be unmaped by
hwpoison_user_mappings() since the folio isn't in lru list.
In the Linux kernel, the following vulnerability has been resolved:
riscv: uprobes: Add missing fence.i after building the XOL buffer
The XOL (execute out-of-line) buffer is used to single-step the
replaced instruction(s) for uprobes. The RISC-V port was missing a
proper fence.i (i$ flushing) after constructing the XOL buffer, which
can result in incorrect execution of stale/broken instructions.
This was found running the BPF selftests "test_progs:
uprobe_autoattach, attach_probe" on the Spacemit K1/X60, where the
uprobes tests randomly blew up.
In the Linux kernel, the following vulnerability has been resolved:
ksmbd: fix WARNING "do not call blocking ops when !TASK_RUNNING"
wait_event_timeout() will set the state of the current
task to TASK_UNINTERRUPTIBLE, before doing the condition check. This
means that ksmbd_durable_scavenger_alive() will try to acquire the mutex
while already in a sleeping state. The scheduler warns us by giving
the following warning:
do not call blocking ops when !TASK_RUNNING; state=2 set at
[<0000000061515a6f>] prepare_to_wait_event+0x9f/0x6c0
WARNING: CPU: 2 PID: 4147 at kernel/sched/core.c:10099 __might_sleep+0x12f/0x160
mutex lock is not needed in ksmbd_durable_scavenger_alive().
In the Linux kernel, the following vulnerability has been resolved:
bpf: Fix kmemleak warning for percpu hashmap
Vlad Poenaru reported the following kmemleak issue:
unreferenced object 0x606fd7c44ac8 (size 32):
backtrace (crc 0):
pcpu_alloc_noprof+0x730/0xeb0
bpf_map_alloc_percpu+0x69/0xc0
prealloc_init+0x9d/0x1b0
htab_map_alloc+0x363/0x510
map_create+0x215/0x3a0
__sys_bpf+0x16b/0x3e0
__x64_sys_bpf+0x18/0x20
do_syscall_64+0x7b/0x150
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Further investigation shows the reason is due to not 8-byte aligned
store of percpu pointer in htab_elem_set_ptr():
*(void __percpu **)(l->key + key_size) = pptr;
Note that the whole htab_elem alignment is 8 (for x86_64). If the key_size
is 4, that means pptr is stored in a location which is 4 byte aligned but
not 8 byte aligned. In mm/kmemleak.c, scan_block() scans the memory based
on 8 byte stride, so it won't detect above pptr, hence reporting the memory
leak.
In htab_map_alloc(), we already have
htab->elem_size = sizeof(struct htab_elem) +
round_up(htab->map.key_size, 8);
if (percpu)
htab->elem_size += sizeof(void *);
else
htab->elem_size += round_up(htab->map.value_size, 8);
So storing pptr with 8-byte alignment won't cause any problem and can fix
kmemleak too.
The issue can be reproduced with bpf selftest as well:
1. Enable CONFIG_DEBUG_KMEMLEAK config
2. Add a getchar() before skel destroy in test_hash_map() in prog_tests/for_each.c.
The purpose is to keep map available so kmemleak can be detected.
3. run './test_progs -t for_each/hash_map &' and a kmemleak should be reported.
In the Linux kernel, the following vulnerability has been resolved:
driver core: fix potential NULL pointer dereference in dev_uevent()
If userspace reads "uevent" device attribute at the same time as another
threads unbinds the device from its driver, change to dev->driver from a
valid pointer to NULL may result in crash. Fix this by using READ_ONCE()
when fetching the pointer, and take bus' drivers klist lock to make sure
driver instance will not disappear while we access it.
Use WRITE_ONCE() when setting the driver pointer to ensure there is no
tearing.
In the Linux kernel, the following vulnerability has been resolved:
bpf: consider that tail calls invalidate packet pointers
Tail-called programs could execute any of the helpers that invalidate
packet pointers. Hence, conservatively assume that each tail call
invalidates packet pointers.
Making the change in bpf_helper_changes_pkt_data() automatically makes
use of check_cfg() logic that computes 'changes_pkt_data' effect for
global sub-programs, such that the following program could be
rejected:
int tail_call(struct __sk_buff *sk)
{
bpf_tail_call_static(sk, &jmp_table, 0);
return 0;
}
SEC("tc")
int not_safe(struct __sk_buff *sk)
{
int *p = (void *)(long)sk->data;
... make p valid ...
tail_call(sk);
*p = 42; /* this is unsafe */
...
}
The tc_bpf2bpf.c:subprog_tc() needs change: mark it as a function that
can invalidate packet pointers. Otherwise, it can't be freplaced with
tailcall_freplace.c:entry_freplace() that does a tail call.
In the Linux kernel, the following vulnerability has been resolved:
bpf: track changes_pkt_data property for global functions
When processing calls to certain helpers, verifier invalidates all
packet pointers in a current state. For example, consider the
following program:
__attribute__((__noinline__))
long skb_pull_data(struct __sk_buff *sk, __u32 len)
{
return bpf_skb_pull_data(sk, len);
}
SEC("tc")
int test_invalidate_checks(struct __sk_buff *sk)
{
int *p = (void *)(long)sk->data;
if ((void *)(p + 1) > (void *)(long)sk->data_end) return TCX_DROP;
skb_pull_data(sk, 0);
*p = 42;
return TCX_PASS;
}
After a call to bpf_skb_pull_data() the pointer 'p' can't be used
safely. See function filter.c:bpf_helper_changes_pkt_data() for a list
of such helpers.
At the moment verifier invalidates packet pointers when processing
helper function calls, and does not traverse global sub-programs when
processing calls to global sub-programs. This means that calls to
helpers done from global sub-programs do not invalidate pointers in
the caller state. E.g. the program above is unsafe, but is not
rejected by verifier.
This commit fixes the omission by computing field
bpf_subprog_info->changes_pkt_data for each sub-program before main
verification pass.
changes_pkt_data should be set if:
- subprogram calls helper for which bpf_helper_changes_pkt_data
returns true;
- subprogram calls a global function,
for which bpf_subprog_info->changes_pkt_data should be set.
The verifier.c:check_cfg() pass is modified to compute this
information. The commit relies on depth first instruction traversal
done by check_cfg() and absence of recursive function calls:
- check_cfg() would eventually visit every call to subprogram S in a
state when S is fully explored;
- when S is fully explored:
- every direct helper call within S is explored
(and thus changes_pkt_data is set if needed);
- every call to subprogram S1 called by S was visited with S1 fully
explored (and thus S inherits changes_pkt_data from S1).
The downside of such approach is that dead code elimination is not
taken into account: if a helper call inside global function is dead
because of current configuration, verifier would conservatively assume
that the call occurs for the purpose of the changes_pkt_data
computation.
In the Linux kernel, the following vulnerability has been resolved:
bpf: check changes_pkt_data property for extension programs
When processing calls to global sub-programs, verifier decides whether
to invalidate all packet pointers in current state depending on the
changes_pkt_data property of the global sub-program.
Because of this, an extension program replacing a global sub-program
must be compatible with changes_pkt_data property of the sub-program
being replaced.
This commit:
- adds changes_pkt_data flag to struct bpf_prog_aux:
- this flag is set in check_cfg() for main sub-program;
- in jit_subprogs() for other sub-programs;
- modifies bpf_check_attach_btf_id() to check changes_pkt_data flag;
- moves call to check_attach_btf_id() after the call to check_cfg(),
because it needs changes_pkt_data flag to be set:
bpf_check:
... ...
- check_attach_btf_id resolve_pseudo_ldimm64
resolve_pseudo_ldimm64 --> bpf_prog_is_offloaded
bpf_prog_is_offloaded check_cfg
check_cfg + check_attach_btf_id
... ...
The following fields are set by check_attach_btf_id():
- env->ops
- prog->aux->attach_btf_trace
- prog->aux->attach_func_name
- prog->aux->attach_func_proto
- prog->aux->dst_trampoline
- prog->aux->mod
- prog->aux->saved_dst_attach_type
- prog->aux->saved_dst_prog_type
- prog->expected_attach_type
Neither of these fields are used by resolve_pseudo_ldimm64() or
bpf_prog_offload_verifier_prep() (for netronome and netdevsim
drivers), so the reordering is safe.
In the Linux kernel, the following vulnerability has been resolved:
net: dsa: free routing table on probe failure
If complete = true in dsa_tree_setup(), it means that we are the last
switch of the tree which is successfully probing, and we should be
setting up all switches from our probe path.
After "complete" becomes true, dsa_tree_setup_cpu_ports() or any
subsequent function may fail. If that happens, the entire tree setup is
in limbo: the first N-1 switches have successfully finished probing
(doing nothing but having allocated persistent memory in the tree's
dst->ports, and maybe dst->rtable), and switch N failed to probe, ending
the tree setup process before anything is tangible from the user's PoV.
If switch N fails to probe, its memory (ports) will be freed and removed
from dst->ports. However, the dst->rtable elements pointing to its ports,
as created by dsa_link_touch(), will remain there, and will lead to
use-after-free if dereferenced.
If dsa_tree_setup_switches() returns -EPROBE_DEFER, which is entirely
possible because that is where ds->ops->setup() is, we get a kasan
report like this:
==================================================================
BUG: KASAN: slab-use-after-free in mv88e6xxx_setup_upstream_port+0x240/0x568
Read of size 8 at addr ffff000004f56020 by task kworker/u8:3/42
Call trace:
__asan_report_load8_noabort+0x20/0x30
mv88e6xxx_setup_upstream_port+0x240/0x568
mv88e6xxx_setup+0xebc/0x1eb0
dsa_register_switch+0x1af4/0x2ae0
mv88e6xxx_register_switch+0x1b8/0x2a8
mv88e6xxx_probe+0xc4c/0xf60
mdio_probe+0x78/0xb8
really_probe+0x2b8/0x5a8
__driver_probe_device+0x164/0x298
driver_probe_device+0x78/0x258
__device_attach_driver+0x274/0x350
Allocated by task 42:
__kasan_kmalloc+0x84/0xa0
__kmalloc_cache_noprof+0x298/0x490
dsa_switch_touch_ports+0x174/0x3d8
dsa_register_switch+0x800/0x2ae0
mv88e6xxx_register_switch+0x1b8/0x2a8
mv88e6xxx_probe+0xc4c/0xf60
mdio_probe+0x78/0xb8
really_probe+0x2b8/0x5a8
__driver_probe_device+0x164/0x298
driver_probe_device+0x78/0x258
__device_attach_driver+0x274/0x350
Freed by task 42:
__kasan_slab_free+0x48/0x68
kfree+0x138/0x418
dsa_register_switch+0x2694/0x2ae0
mv88e6xxx_register_switch+0x1b8/0x2a8
mv88e6xxx_probe+0xc4c/0xf60
mdio_probe+0x78/0xb8
really_probe+0x2b8/0x5a8
__driver_probe_device+0x164/0x298
driver_probe_device+0x78/0x258
__device_attach_driver+0x274/0x350
The simplest way to fix the bug is to delete the routing table in its
entirety. dsa_tree_setup_routing_table() has no problem in regenerating
it even if we deleted links between ports other than those of switch N,
because dsa_link_touch() first checks whether the port pair already
exists in dst->rtable, allocating if not.
The deletion of the routing table in its entirety already exists in
dsa_tree_teardown(), so refactor that into a function that can also be
called from the tree setup error path.
In my analysis of the commit to blame, it is the one which added
dsa_link elements to dst->rtable. Prior to that, each switch had its own
ds->rtable which is freed when the switch fails to probe. But the tree
is potentially persistent memory.