In the Linux kernel, the following vulnerability has been resolved:
ipvs: do not keep dest_dst if dev is going down
There is race between the netdev notifier ip_vs_dst_event()
and the code that caches dst with dev that is going down.
As the FIB can be notified for the closed device after our
handler finishes, it is possible valid route to be returned
and cached resuling in a leaked dev reference until the dest
is not removed.
To prevent new dest_dst to be attached to dest just after the
handler dropped the old one, add a netif_running() check
to make sure the notifier handler is not currently running
for device that is closing.
In the Linux kernel, the following vulnerability has been resolved:
ext4: fix dirtyclusters double decrement on fs shutdown
fstests test generic/388 occasionally reproduces a warning in
ext4_put_super() associated with the dirty clusters count:
WARNING: CPU: 7 PID: 76064 at fs/ext4/super.c:1324 ext4_put_super+0x48c/0x590 [ext4]
Tracing the failure shows that the warning fires due to an
s_dirtyclusters_counter value of -1. IOW, this appears to be a
spurious decrement as opposed to some sort of leak. Further tracing
of the dirty cluster count deltas and an LLM scan of the resulting
output identified the cause as a double decrement in the error path
between ext4_mb_mark_diskspace_used() and the caller
ext4_mb_new_blocks().
First, note that generic/388 is a shutdown vs. fsstress test and so
produces a random set of operations and shutdown injections. In the
problematic case, the shutdown triggers an error return from the
ext4_handle_dirty_metadata() call(s) made from
ext4_mb_mark_context(). The changed value is non-zero at this point,
so ext4_mb_mark_diskspace_used() does not exit after the error
bubbles up from ext4_mb_mark_context(). Instead, the former
decrements both cluster counters and returns the error up to
ext4_mb_new_blocks(). The latter falls into the !ar->len out path
which decrements the dirty clusters counter a second time, creating
the inconsistency.
To avoid this problem and simplify ownership of the cluster
reservation in this codepath, lift the counter reduction to a single
place in the caller. This makes it more clear that
ext4_mb_new_blocks() is responsible for acquiring cluster
reservation (via ext4_claim_free_clusters()) in the !delalloc case
as well as releasing it, regardless of whether it ends up consumed
or returned due to failure.
In the Linux kernel, the following vulnerability has been resolved:
xfrm: fix ip_rt_bug race in icmp_route_lookup reverse path
icmp_route_lookup() performs multiple route lookups to find a suitable
route for sending ICMP error messages, with special handling for XFRM
(IPsec) policies.
The lookup sequence is:
1. First, lookup output route for ICMP reply (dst = original src)
2. Pass through xfrm_lookup() for policy check
3. If blocked (-EPERM) or dst is not local, enter "reverse path"
4. In reverse path, call xfrm_decode_session_reverse() to get fl4_dec
which reverses the original packet's flow (saddr<->daddr swapped)
5. If fl4_dec.saddr is local (we are the original destination), use
__ip_route_output_key() for output route lookup
6. If fl4_dec.saddr is NOT local (we are a forwarding node), use
ip_route_input() to simulate the reverse packet's input path
7. Finally, pass rt2 through xfrm_lookup() with XFRM_LOOKUP_ICMP flag
The bug occurs in step 6: ip_route_input() is called with fl4_dec.daddr
(original packet's source) as destination. If this address becomes local
between the initial check and ip_route_input() call (e.g., due to
concurrent "ip addr add"), ip_route_input() returns a LOCAL route with
dst.output set to ip_rt_bug.
This route is then used for ICMP output, causing dst_output() to call
ip_rt_bug(), triggering a WARN_ON:
------------[ cut here ]------------
WARNING: net/ipv4/route.c:1275 at ip_rt_bug+0x21/0x30, CPU#1
Call Trace:
<TASK>
ip_push_pending_frames+0x202/0x240
icmp_push_reply+0x30d/0x430
__icmp_send+0x1149/0x24f0
ip_options_compile+0xa2/0xd0
ip_rcv_finish_core+0x829/0x1950
ip_rcv+0x2d7/0x420
__netif_receive_skb_one_core+0x185/0x1f0
netif_receive_skb+0x90/0x450
tun_get_user+0x3413/0x3fb0
tun_chr_write_iter+0xe4/0x220
...
Fix this by checking rt2->rt_type after ip_route_input(). If it's
RTN_LOCAL, the route cannot be used for output, so treat it as an error.
The reproducer requires kernel modification to widen the race window,
making it unsuitable as a selftest. It is available at:
https://gist.github.com/mrpre/eae853b72ac6a750f5d45d64ddac1e81
In the Linux kernel, the following vulnerability has been resolved:
power: supply: wm97xx: Fix NULL pointer dereference in power_supply_changed()
In `probe()`, `request_irq()` is called before allocating/registering a
`power_supply` handle. If an interrupt is fired between the call to
`request_irq()` and `power_supply_register()`, the `power_supply` handle
will be used uninitialized in `power_supply_changed()` in
`wm97xx_bat_update()` (triggered from the interrupt handler). This will
lead to a `NULL` pointer dereference since
Fix this racy `NULL` pointer dereference by making sure the IRQ is
requested _after_ the registration of the `power_supply` handle. Since
the IRQ is the last thing requests in the `probe()` now, remove the
error path for freeing it. Instead add one for unregistering the
`power_supply` handle when IRQ request fails.
In the Linux kernel, the following vulnerability has been resolved:
ipvs: skip ipv6 extension headers for csum checks
Protocol checksum validation fails for IPv6 if there are extension
headers before the protocol header. iph->len already contains its
offset, so use it to fix the problem.
In the Linux kernel, the following vulnerability has been resolved:
smack: /smack/doi: accept previously used values
Writing to /smack/doi a value that has ever been
written there in the past disables networking for
non-ambient labels.
E.g.
# cat /smack/doi
3
# netlabelctl -p cipso list
Configured CIPSO mappings (1)
DOI value : 3
mapping type : PASS_THROUGH
# netlabelctl -p map list
Configured NetLabel domain mappings (3)
domain: "_" (IPv4)
protocol: UNLABELED
domain: DEFAULT (IPv4)
protocol: CIPSO, DOI = 3
domain: DEFAULT (IPv6)
protocol: UNLABELED
# cat /smack/ambient
_
# cat /proc/$$/attr/smack/current
_
# ping -c1 10.1.95.12
64 bytes from 10.1.95.12: icmp_seq=1 ttl=64 time=0.964 ms
# echo foo >/proc/$$/attr/smack/current
# ping -c1 10.1.95.12
64 bytes from 10.1.95.12: icmp_seq=1 ttl=64 time=0.956 ms
unknown option 86
# echo 4 >/smack/doi
# echo 3 >/smack/doi
!> [ 214.050395] smk_cipso_doi:691 cipso add rc = -17
# echo 3 >/smack/doi
!> [ 249.402261] smk_cipso_doi:678 remove rc = -2
!> [ 249.402261] smk_cipso_doi:691 cipso add rc = -17
# ping -c1 10.1.95.12
!!> ping: 10.1.95.12: Address family for hostname not supported
# echo _ >/proc/$$/attr/smack/current
# ping -c1 10.1.95.12
64 bytes from 10.1.95.12: icmp_seq=1 ttl=64 time=0.617 ms
This happens because Smack keeps decommissioned DOIs,
fails to re-add them, and consequently refuses to add
the “default” domain map:
# netlabelctl -p cipso list
Configured CIPSO mappings (2)
DOI value : 3
mapping type : PASS_THROUGH
DOI value : 4
mapping type : PASS_THROUGH
# netlabelctl -p map list
Configured NetLabel domain mappings (2)
domain: "_" (IPv4)
protocol: UNLABELED
!> (no ipv4 map for default domain here)
domain: DEFAULT (IPv6)
protocol: UNLABELED
Fix by clearing decommissioned DOI definitions and
serializing concurrent DOI updates with a new lock.
Also:
- allow /smack/doi to live unconfigured, since
adding a map (netlbl_cfg_cipsov4_map_add) may fail.
CIPSO_V4_DOI_UNKNOWN(0) indicates the unconfigured DOI
- add new DOI before removing the old default map,
so the old map remains if the add fails
(2008-02-04, Casey Schaufler)
In the Linux kernel, the following vulnerability has been resolved:
netfilter: nfnetlink_osf: fix divide-by-zero in OSF_WSS_MODULO
nf_osf_match_one() computes ctx->window % f->wss.val in the
OSF_WSS_MODULO branch with no guard for f->wss.val == 0. A
CAP_NET_ADMIN user can add such a fingerprint via nfnetlink; a
subsequent matching TCP SYN divides by zero and panics the kernel.
Reject the bogus fingerprint in nfnl_osf_add_callback() above the
per-option for-loop. f->wss is per-fingerprint, not per-option, so
the check must run regardless of f->opt_num (including 0). Also
reject wss.wc >= OSF_WSS_MAX; nf_osf_match_one() already treats that
as "should not happen".
Crash:
Oops: divide error: 0000 [#1] SMP KASAN NOPTI
RIP: 0010:nf_osf_match_one (net/netfilter/nfnetlink_osf.c:98)
Call Trace:
<IRQ>
nf_osf_match (net/netfilter/nfnetlink_osf.c:220)
xt_osf_match_packet (net/netfilter/xt_osf.c:32)
ipt_do_table (net/ipv4/netfilter/ip_tables.c:348)
nf_hook_slow (net/netfilter/core.c:622)
ip_local_deliver (net/ipv4/ip_input.c:265)
ip_rcv (include/linux/skbuff.h:1162)
__netif_receive_skb_one_core (net/core/dev.c:6181)
process_backlog (net/core/dev.c:6642)
__napi_poll (net/core/dev.c:7710)
net_rx_action (net/core/dev.c:7945)
handle_softirqs (kernel/softirq.c:622)
In the Linux kernel, the following vulnerability has been resolved:
slip: bound decode() reads against the compressed packet length
slhc_uncompress() parses a VJ-compressed TCP header by advancing a
pointer through the packet via decode() and pull16(). Neither helper
bounds-checks against isize, and decode() masks its return with
& 0xffff so it can never return the -1 that callers test for -- those
error paths are dead code.
A short compressed frame whose change byte requests optional fields
lets decode() read past the end of the packet. The over-read bytes
are folded into the cached cstate and reflected into subsequent
reconstructed packets.
Make decode() and pull16() take the packet end pointer and return -1
when exhausted. Add a bounds check before the TCP-checksum read.
The existing == -1 tests now do what they were always meant to.
In the Linux kernel, the following vulnerability has been resolved:
rtmutex: Use waiter::task instead of current in remove_waiter()
remove_waiter() is used by the slowlock paths, but it is also used for
proxy-lock rollback in rt_mutex_start_proxy_lock() when invoked from
futex_requeue().
In the latter case waiter::task is not current, but remove_waiter()
operates on current for the dequeue operation. That results in several
problems:
1) the rbtree dequeue happens without waiter::task::pi_lock being held
2) the waiter task's pi_blocked_on state is not cleared, which leaves a
dangling pointer primed for UAF around.
3) rt_mutex_adjust_prio_chain() operates on the wrong top priority waiter
task
Use waiter::task instead of current in all related operations in
remove_waiter() to cure those problems.
[ tglx: Fixup rt_mutex_adjust_prio_chain(), add a comment and amend the
changelog ]
In the Linux kernel, the following vulnerability has been resolved:
net/sched: sch_red: Replace direct dequeue call with peek and qdisc_dequeue_peeked
When red qdisc has children (eg qfq qdisc) whose peek() callback is
qdisc_peek_dequeued(), we could get a kernel panic. When the parent of such
qdiscs (eg illustrated in patch #3 as tbf) wants to retrieve an skb from
its child (red in this case), it will do the following:
1a. do a peek() - and when sensing there's an skb the child can offer, then
- the child in this case(red) calls its child's (qfq) peek.
qfq does the right thing and will return the gso_skb queue packet.
Note: if there wasnt a gso_skb entry then qfq will store it there.
1b. invoke a dequeue() on the child (red). And herein lies the problem.
- red will call the child's dequeue() which will essentially just
try to grab something of qfq's queue.
[ 78.667668][ T363] KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
[ 78.667927][ T363] CPU: 1 UID: 0 PID: 363 Comm: ping Not tainted 7.1.0-rc1-00033-g46f74a3f7d57-dirty #790 PREEMPT(full)
[ 78.668263][ T363] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 78.668486][ T363] RIP: 0010:qfq_dequeue+0x446/0xc90 [sch_qfq]
[ 78.668718][ T363] Code: 54 c0 e8 dd 90 00 f1 48 c7 c7 e0 03 54 c0 48 89 de e8 ce 90 00 f1 48 8d 7b 48 b8 ff ff 37 00 48 89 fa 48 c1 e0 2a 48 c1 ea 03 <80> 3c 02 00 74 05 e8 ef a1 e1 f1 48 8b 7b 48 48 8d 54 24 58 48 8d
[ 78.669312][ T363] RSP: 0018:ffff88810de573e0 EFLAGS: 00010216
[ 78.669533][ T363] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 78.669790][ T363] RDX: 0000000000000009 RSI: 0000000000000004 RDI: 0000000000000048
[ 78.670044][ T363] RBP: ffff888110dc4000 R08: ffffffffb1b0885a R09: fffffbfff6ba9078
[ 78.670297][ T363] R10: 0000000000000003 R11: ffff888110e31c80 R12: 0000001880000000
[ 78.670560][ T363] R13: ffff888110dc4150 R14: ffff888110dc42b8 R15: 0000000000000200
[ 78.670814][ T363] FS: 00007f66a8f09c40(0000) GS:ffff888163428000(0000) knlGS:0000000000000000
[ 78.671110][ T363] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 78.671324][ T363] CR2: 000055db4c6a30a8 CR3: 000000010da67000 CR4: 0000000000750ef0
[ 78.671585][ T363] PKRU: 55555554
[ 78.671713][ T363] Call Trace:
[ 78.671843][ T363] <TASK>
[ 78.671936][ T363] ? __pfx_qfq_dequeue+0x10/0x10 [sch_qfq]
[ 78.672148][ T363] ? __pfx__printk+0x10/0x10
[ 78.672322][ T363] ? srso_alias_return_thunk+0x5/0xfbef5
[ 78.672496][ T363] ? lockdep_hardirqs_on_prepare+0xa8/0x1a0
[ 78.672706][ T363] ? srso_alias_return_thunk+0x5/0xfbef5
[ 78.672875][ T363] ? trace_hardirqs_on+0x19/0x1a0
[ 78.673047][ T363] red_dequeue+0x65/0x270 [sch_red]
[ 78.673217][ T363] ? srso_alias_return_thunk+0x5/0xfbef5
[ 78.673385][ T363] tbf_dequeue.cold+0xb0/0x70c [sch_tbf]
[ 78.673566][ T363] __qdisc_run+0x169/0x1900
The right thing to do in #1b is to grab the skb off gso_skb queue.
This patchset fixes that issue by changing #1b to use qdisc_dequeue_peeked()
method instead.