When setting up interrupt remapping for legacy PCI(-X) devices,
including PCI(-X) bridges, a lookup of the upstream bridge is required.
This lookup, itself involving acquiring of a lock, is done in a context
where acquiring that lock is unsafe. This can lead to a deadlock.
Certain instructions need intercepting and emulating by Xen. In some
cases Xen emulates the instruction by replaying it, using an executable
stub. Some instructions may raise an exception, which is supposed to be
handled gracefully. Certain replayed instructions have additional logic
to set up and recover the changes to the arithmetic flags.
For replayed instructions where the flags recovery logic is used, the
metadata for exception handling was incorrect, preventing Xen from
handling the the exception gracefully, treating it as fatal instead.
The hypervisor contains code to accelerate VGA memory accesses for HVM
guests, when the (virtual) VGA is in "standard" mode. Locking involved
there has an unusual discipline, leaving a lock acquired past the
return from the function that acquired it. This behavior results in a
problem when emulating an instruction with two memory accesses, both of
which touch VGA memory (plus some further constraints which aren't
relevant here). When emulating the 2nd access, the lock that is already
being held would be attempted to be re-acquired, resulting in a
deadlock.
This deadlock was already found when the code was first introduced, but
was analysed incorrectly and the fix was incomplete. Analysis in light
of the new finding cannot find a way to make the existing locking
discipline work.
In staging, this logic has all been removed because it was discovered
to be accidentally disabled since Xen 4.7. Therefore, we are fixing the
locking problem by backporting the removal of most of the feature. Note
that even with the feature disabled, the lock would still be acquired
for any accesses to the VGA MMIO region.
PVH guests have their ACPI tables constructed by the toolstack. The
construction involves building the tables in local memory, which are
then copied into guest memory. While actually used parts of the local
memory are filled in correctly, excess space that is being allocated is
left with its prior contents.
Certain PCI devices in a system might be assigned Reserved Memory
Regions (specified via Reserved Memory Region Reporting, "RMRR") for
Intel VT-d or Unity Mapping ranges for AMD-Vi. These are typically used
for platform tasks such as legacy USB emulation.
Since the precise purpose of these regions is unknown, once a device
associated with such a region is active, the mappings of these regions
need to remain continuouly accessible by the device. In the logic
establishing these mappings, error handling was flawed, resulting in
such mappings to potentially remain in place when they should have been
removed again. Respective guests would then gain access to memory
regions which they aren't supposed to have access to.
When multiple devices share resources and one of them is to be passed
through to a guest, security of the entire system and of respective
guests individually cannot really be guaranteed without knowing
internals of any of the involved guests. Therefore such a configuration
cannot really be security-supported, yet making that explicit was so far
missing.
Resources the sharing of which is known to be problematic include, but
are not limited to
- - PCI Base Address Registers (BARs) of multiple devices mapping to the
same page (4k on x86),
- - INTx lines.
In x86's APIC (Advanced Programmable Interrupt Controller) architecture,
error conditions are reported in a status register. Furthermore, the OS
can opt to receive an interrupt when a new error occurs.
It is possible to configure the error interrupt with an illegal vector,
which generates an error when an error interrupt is raised.
This case causes Xen to recurse through vlapic_error(). The recursion
itself is bounded; errors accumulate in the the status register and only
generate an interrupt when a new status bit becomes set.
However, the lock protecting this state in Xen will try to be taken
recursively, and deadlock.
An optional feature of PCI MSI called "Multiple Message" allows a
device to use multiple consecutive interrupt vectors. Unlike for MSI-X,
the setting up of these consecutive vectors needs to happen all in one
go. In this handling an error path could be taken in different
situations, with or without a particular lock held. This error path
wrongly releases the lock even when it is not currently held.
Unlike 32-bit PV guests, HVM guests may switch freely between 64-bit and
other modes. This in particular means that they may set registers used
to pass 32-bit-mode hypercall arguments to values outside of the range
32-bit code would be able to set them to.
When processing of hypercalls takes a considerable amount of time,
the hypervisor may choose to invoke a hypercall continuation. Doing so
involves putting (perhaps updated) hypercall arguments in respective
registers. For guests not running in 64-bit mode this further involves
a certain amount of translation of the values.
Unfortunately internal sanity checking of these translated values
assumes high halves of registers to always be clear when invoking a
hypercall. When this is found not to be the case, it triggers a
consistency check in the hypervisor and causes a crash.
Because of a logical error in XSA-407 (Branch Type Confusion), the
mitigation is not applied properly when it is intended to be used.
XSA-434 (Speculative Return Stack Overflow) uses the same
infrastructure, so is equally impacted.
For more details, see:
https://xenbits.xen.org/xsa/advisory-407.html
https://xenbits.xen.org/xsa/advisory-434.html