PVH guests have their ACPI tables constructed by the toolstack. The
construction involves building the tables in local memory, which are
then copied into guest memory. While actually used parts of the local
memory are filled in correctly, excess space that is being allocated is
left with its prior contents.
Certain PCI devices in a system might be assigned Reserved Memory
Regions (specified via Reserved Memory Region Reporting, "RMRR") for
Intel VT-d or Unity Mapping ranges for AMD-Vi. These are typically used
for platform tasks such as legacy USB emulation.
Since the precise purpose of these regions is unknown, once a device
associated with such a region is active, the mappings of these regions
need to remain continuouly accessible by the device. In the logic
establishing these mappings, error handling was flawed, resulting in
such mappings to potentially remain in place when they should have been
removed again. Respective guests would then gain access to memory
regions which they aren't supposed to have access to.
In x86's APIC (Advanced Programmable Interrupt Controller) architecture,
error conditions are reported in a status register. Furthermore, the OS
can opt to receive an interrupt when a new error occurs.
It is possible to configure the error interrupt with an illegal vector,
which generates an error when an error interrupt is raised.
This case causes Xen to recurse through vlapic_error(). The recursion
itself is bounded; errors accumulate in the the status register and only
generate an interrupt when a new status bit becomes set.
However, the lock protecting this state in Xen will try to be taken
recursively, and deadlock.
An optional feature of PCI MSI called "Multiple Message" allows a
device to use multiple consecutive interrupt vectors. Unlike for MSI-X,
the setting up of these consecutive vectors needs to happen all in one
go. In this handling an error path could be taken in different
situations, with or without a particular lock held. This error path
wrongly releases the lock even when it is not currently held.
Unlike 32-bit PV guests, HVM guests may switch freely between 64-bit and
other modes. This in particular means that they may set registers used
to pass 32-bit-mode hypercall arguments to values outside of the range
32-bit code would be able to set them to.
When processing of hypercalls takes a considerable amount of time,
the hypervisor may choose to invoke a hypercall continuation. Doing so
involves putting (perhaps updated) hypercall arguments in respective
registers. For guests not running in 64-bit mode this further involves
a certain amount of translation of the values.
Unfortunately internal sanity checking of these translated values
assumes high halves of registers to always be clear when invoking a
hypercall. When this is found not to be the case, it triggers a
consistency check in the hypervisor and causes a crash.
PCI devices can make use of a functionality called phantom functions,
that when enabled allows the device to generate requests using the IDs
of functions that are otherwise unpopulated. This allows a device to
extend the number of outstanding requests.
Such phantom functions need an IOMMU context setup, but failure to
setup the context is not fatal when the device is assigned. Not
failing device assignment when such failure happens can lead to the
primary device being assigned to a guest, while some of the phantom
functions are assigned to a different domain.
Recent x86 CPUs offer functionality named Control-flow Enforcement
Technology (CET). A sub-feature of this are Shadow Stacks (CET-SS).
CET-SS is a hardware feature designed to protect against Return Oriented
Programming attacks. When enabled, traditional stacks holding both data
and return addresses are accompanied by so called "shadow stacks",
holding little more than return addresses. Shadow stacks aren't
writable by normal instructions, and upon function returns their
contents are used to check for possible manipulation of a return address
coming from the traditional stack.
In particular certain memory accesses need intercepting by Xen. In
various cases the necessary emulation involves kind of replaying of
the instruction. Such replaying typically involves filling and then
invoking of a stub. Such a replayed instruction may raise an
exceptions, which is expected and dealt with accordingly.
Unfortunately the interaction of both of the above wasn't right:
Recovery involves removal of a call frame from the (traditional) stack.
The counterpart of this operation for the shadow stack was missing.
The current setup of the quarantine page tables assumes that the
quarantine domain (dom_io) has been initialized with an address width
of DEFAULT_DOMAIN_ADDRESS_WIDTH (48) and hence 4 page table levels.
However dom_io being a PV domain gets the AMD-Vi IOMMU page tables
levels based on the maximum (hot pluggable) RAM address, and hence on
systems with no RAM above the 512GB mark only 3 page-table levels are
configured in the IOMMU.
On systems without RAM above the 512GB boundary
amd_iommu_quarantine_init() will setup page tables for the scratch
page with 4 levels, while the IOMMU will be configured to use 3 levels
only, resulting in the last page table directory (PDE) effectively
becoming a page table entry (PTE), and hence a device in quarantine
mode gaining write access to the page destined to be a PDE.
Due to this page table level mismatch, the sink page the device gets
read/write access to is no longer cleared between device assignment,
possibly leading to data leaks.
The fixes for XSA-422 (Branch Type Confusion) and XSA-434 (Speculative
Return Stack Overflow) are not IRQ-safe. It was believed that the
mitigations always operated in contexts with IRQs disabled.
However, the original XSA-254 fix for Meltdown (XPTI) deliberately left
interrupts enabled on two entry paths; one unconditionally, and one
conditionally on whether XPTI was active.
As BTC/SRSO and Meltdown affect different CPU vendors, the mitigations
are not active together by default. Therefore, there is a race
condition whereby a malicious PV guest can bypass BTC/SRSO protections
and launch a BTC/SRSO attack against Xen.
When a transaction is committed, C Xenstored will first check
the quota is correct before attempting to commit any nodes. It would
be possible that accounting is temporarily negative if a node has
been removed outside of the transaction.
Unfortunately, some versions of C Xenstored are assuming that the
quota cannot be negative and are using assert() to confirm it. This
will lead to C Xenstored crash when tools are built without -DNDEBUG
(this is the default).