PCI devices can make use of a functionality called phantom functions,
that when enabled allows the device to generate requests using the IDs
of functions that are otherwise unpopulated. This allows a device to
extend the number of outstanding requests.
Such phantom functions need an IOMMU context setup, but failure to
setup the context is not fatal when the device is assigned. Not
failing device assignment when such failure happens can lead to the
primary device being assigned to a guest, while some of the phantom
functions are assigned to a different domain.
Incorrect placement of a preprocessor directive in source code results
in logic that doesn't operate as intended when support for HVM guests is
compiled out of Xen.
Recent x86 CPUs offer functionality named Control-flow Enforcement
Technology (CET). A sub-feature of this are Shadow Stacks (CET-SS).
CET-SS is a hardware feature designed to protect against Return Oriented
Programming attacks. When enabled, traditional stacks holding both data
and return addresses are accompanied by so called "shadow stacks",
holding little more than return addresses. Shadow stacks aren't
writable by normal instructions, and upon function returns their
contents are used to check for possible manipulation of a return address
coming from the traditional stack.
In particular certain memory accesses need intercepting by Xen. In
various cases the necessary emulation involves kind of replaying of
the instruction. Such replaying typically involves filling and then
invoking of a stub. Such a replayed instruction may raise an
exceptions, which is expected and dealt with accordingly.
Unfortunately the interaction of both of the above wasn't right:
Recovery involves removal of a call frame from the (traditional) stack.
The counterpart of this operation for the shadow stack was missing.
The current setup of the quarantine page tables assumes that the
quarantine domain (dom_io) has been initialized with an address width
of DEFAULT_DOMAIN_ADDRESS_WIDTH (48) and hence 4 page table levels.
However dom_io being a PV domain gets the AMD-Vi IOMMU page tables
levels based on the maximum (hot pluggable) RAM address, and hence on
systems with no RAM above the 512GB mark only 3 page-table levels are
configured in the IOMMU.
On systems without RAM above the 512GB boundary
amd_iommu_quarantine_init() will setup page tables for the scratch
page with 4 levels, while the IOMMU will be configured to use 3 levels
only, resulting in the last page table directory (PDE) effectively
becoming a page table entry (PTE), and hence a device in quarantine
mode gaining write access to the page destined to be a PDE.
Due to this page table level mismatch, the sink page the device gets
read/write access to is no longer cleared between device assignment,
possibly leading to data leaks.
The fixes for XSA-422 (Branch Type Confusion) and XSA-434 (Speculative
Return Stack Overflow) are not IRQ-safe. It was believed that the
mitigations always operated in contexts with IRQs disabled.
However, the original XSA-254 fix for Meltdown (XPTI) deliberately left
interrupts enabled on two entry paths; one unconditionally, and one
conditionally on whether XPTI was active.
As BTC/SRSO and Meltdown affect different CPU vendors, the mitigations
are not active together by default. Therefore, there is a race
condition whereby a malicious PV guest can bypass BTC/SRSO protections
and launch a BTC/SRSO attack against Xen.
Arm provides multiple helpers to clean & invalidate the cache
for a given region. This is, for instance, used when allocating
guest memory to ensure any writes (such as the ones during scrubbing)
have reached memory before handing over the page to a guest.
Unfortunately, the arithmetics in the helpers can overflow and would
then result to skip the cache cleaning/invalidation. Therefore there
is no guarantee when all the writes will reach the memory.
This undefined behavior was meant to be addressed by XSA-437, but the
approach was not sufficient.
Arm provides multiple helpers to clean & invalidate the cache
for a given region. This is, for instance, used when allocating
guest memory to ensure any writes (such as the ones during scrubbing)
have reached memory before handing over the page to a guest.
Unfortunately, the arithmetics in the helpers can overflow and would
then result to skip the cache cleaning/invalidation. Therefore there
is no guarantee when all the writes will reach the memory.
For migration as well as to work around kernels unaware of L1TF (see
XSA-273), PV guests may be run in shadow paging mode. Since Xen itself
needs to be mapped when PV guests run, Xen and shadowed PV guests run
directly the respective shadow page tables. For 64-bit PV guests this
means running on the shadow of the guest root page table.
In the course of dealing with shortage of memory in the shadow pool
associated with a domain, shadows of page tables may be torn down. This
tearing down may include the shadow root page table that the CPU in
question is presently running on. While a precaution exists to
supposedly prevent the tearing down of the underlying live page table,
the time window covered by that precaution isn't large enough.
When a transaction is committed, C Xenstored will first check
the quota is correct before attempting to commit any nodes. It would
be possible that accounting is temporarily negative if a node has
been removed outside of the transaction.
Unfortunately, some versions of C Xenstored are assuming that the
quota cannot be negative and are using assert() to confirm it. This
will lead to C Xenstored crash when tools are built without -DNDEBUG
(this is the default).
Closing of an event channel in the Linux kernel can result in a deadlock.
This happens when the close is being performed in parallel to an unrelated
Xen console action and the handling of a Xen console interrupt in an
unprivileged guest.
The closing of an event channel is e.g. triggered by removal of a
paravirtual device on the other side. As this action will cause console
messages to be issued on the other side quite often, the chance of
triggering the deadlock is not neglectable.
Note that 32-bit Arm-guests are not affected, as the 32-bit Linux kernel
on Arm doesn't use queued-RW-locks, which are required to trigger the
issue (on Arm32 a waiting writer doesn't block further readers to get
the lock).