/* * ============================ * MSI MESSAGE FORMATS (on x86) * ============================ * * Message Signaled Interrupts are simply DMA transactions from the device. * It really is just "write 32 bits when you want attention." * The MSI (or MSI-X) message configured in the device is just the 64 bits of * the address to write to, and the 32 bits to write there. * * You can use this to do polled I/O by telling the device to write into a * data structure of your own choosing, then checking to see when it does so. * * Or you can tell the device to poke at MMIO on *another* device, for example * when it's finished receiving a packet and it's time for the next device to * process that packet. * * Of course, the way it's *supposed* to be used is to poke MMIO on another * device whose *sole* purpose is to raise an interrupt to the CPU. * * It's mostly been forgotten now, but on Intel chipsets used with the Pentium * and P6 family CPUs, the MMIO device used for this was the I/O APIC. There * was an "IRQ Pin Assertion Register" at 0xFEC00020, and a device could write * a pin number to that register to artificially assert an input pin. So * devices could be configured to use this, and as far as the rest of the * system was concerned it would be as if they actually had a line interrupt * wired to the corresponding pin on the I/O APIC. The I/O APIC would then * send the interrupt to the CPU via the APIC serial bus, just like for true * line interrupts. * * For Pentium 4 and Xeon onwards, Intel moved away from the APIC serial bus * and started to use the main system bus for interrupts. Devices can now * issue MMIO writes directly to the APIC at address 0x00000000FEExxxxx. * * When the APIC receives a write transaction across the system bus, it looks * at the low 20 bits of the address as well as the data being written. These * convey all the information about which interrupt vector to raise on which * CPU, and a few more details besides.
Some of those details include special * cases like cluster delivery modes and ways to deliver NMI/INIT/etc. which * we won't go into here. * * This is MSI as we currently know it, and even the I/O APIC now effectively * turns line interrupts into MSIs by sending them on the system bus this way. * * * Compatibility Format * -------------------- * * Originally, there was only one way of interpreting the bits in the MSI * message. This is what Intel documentation now calls "Compatibility Format" * (§5.1.2.1 of the VT-d spec). It is as follows: * * Address: 1111.1110.1110.dddd.dddd.0000.0000.rmxx * 0xFEE . Dest ID . Rsvd .↑↑↑ * ||└-Don't Care * |└-Destination Mode * └-Redirection Hint * * Data: 0000.0000.0000.0000.TL00.0DDD.vvvv.vvvv * Reserved .↑↑ ↑ . Vector * || └-Delivery Mode * |└-Trigger Mode Level * └-Trigger Mode * * Crucially, this format has only 8 bits for the Destination ID. Since 0xFF * is the broadcast address, this allows only up to 255 CPUs to be supported. * * For many years the Reserved bits in bits 4-11 of the address were labelled * in some Intel documentation as "Extended Destination ID", but never used. * * The vector to be delivered to the destination CPU is in the low bits of the * data. For devices with multiple interrupts, modern PCI MSI-X allows the * full address+data bits for each one to be configured independently, so they * can target arbitrary vectors on arbitrary CPUs. * * However, the older PCI multi-MSI standard only allowed the base MSI to be * configured, and every additional interrupt supported by the device was * signalled just by adding to the value of the data field. This means that * multi-MSI devices could raise a set of consecutive vectors on the *same* * CPU for different interrupts, but not raise interrupts to different CPUs. * * * I/O APIC Redirection Table Entries * ---------------------------------- * * As noted above, the I/O APIC is now just a device for turning line-level * interrupts into MSI messages.
Each pin on the I/O APIC has a Redirection * Table Entry (RTE) which configures the MSI message to be sent. * * The 64 bits in the original definition of the I/O APIC RTE map to all the * fields of the resulting MSI, including the Extended Destination ID. It's * just that they appear to have been shuffled into a strange order, because * back in the mists of time they actually corresponded more closely to the * message format on the APIC serial bus. * * RTE[63-32]: dddd.dddd.eeee.eeee.xxxx.xxxx.xxxx.xxxx * Dest ID .ExtDestId. Reserved * * RTE[31-0]: xxxx.xxxx.xxxx.xxxM.TRPs.mDDD.vvvv.vvvv * ↑ ↑↑↑↑ ↑ ↑ . Vector * | |||| | └-Delivery Mode * | |||| └-Destination Mode * | |||└-Delivery Status (RO) * | ||└-Pin polarity * | |└-Remote IRR (RO) * | └-Trigger Mode * └- Mask * * These days, the field definitions are largely fictional because the I/O * APIC doesn't actually interpret most of those bits, and just passes them on * in an MSI message (with an important caveat noted below). The definitions * still make sense when the MSI generated by the I/O APIC is received as a * Compatibility Format MSI by a standard APIC, but when it is received by an * IOMMU and interpreted as a different format (as described later), they make * a lot less sense. It's much better to think of the RTE just as a weird * arrangement of the bits of the MSI message which will be generated, with * some remaining fields which *are* still used by the I/O APIC itself (mask, * polarity, status etc.): * * RTE[63-32]: aaaa.aaaa.aaaa.aaaa.xxxx.xxxx.xxxx.xxxx * MSI Address [19-4] .
Don't Care * * RTE[31-0]: xxxx.xxxx.xxxx.xxxM.DRPs.Addd.dddd.dddd * ↑ ↑↑↑↑ ↑ MSI Data[10-0] * | |||| | * | |||| └- MSI Address[2] * | |||└-Delivery Status (RO) * | ||└-Pin polarity * | |└-Remote IRR (RO) * | └-MSI Data[15] * └- Mask * * * You can see this in VMMs like QEMU, where the I/O APIC emulation just takes * the RTE and swizzles the bits around to create address+data of an MSI * message, adding the standard 0xFEExxxxx to the generated address. QEMU then * literally forwards that MSI as a memory transaction in the physical address * space to which the I/O APIC is attached. The memory transaction is then * passed through the standard address decoding just as DMA writes from * devices would be. It is ultimately received and handled by either the APIC * or the IOMMU which handles the corresponding address space. * * Conversely, operating systems can configure the I/O APIC RTE by first * composing an MSI message in the format expected by the upstream APIC or * IOMMU which will receive it, and then just swizzling the bits into the * appropriate places. * * (Some operating systems, including old versions of Linux, instead have * complex special cases within the I/O APIC code, with special knowledge of * the upstream IOMMU formats. Or hooks into the IOMMU drivers to generate I/O * APIC RTEs directly, instead of just composing an MSI message the generic * way and deriving the RTE from that.) * * There is a caveat to this simplicity though, and it has to do with the way * that the I/O APIC handles level-triggered interrupts. When the interrupt is * first asserted, the I/O APIC sends the MSI message upstream to be handled. * Upon completing the interrupt, the CPU sends an "End of Interrupt" (EOI) to * the I/O APIC. At that point, the I/O APIC needs to send a new interrupt if * the level on the input pin is still asserted. * * The EOI from the CPU tells the I/O APIC which *vector* the CPU has finished * processing.
And thus the I/O APIC still looks at the low 8 bits of the RTE, * which correspond to the low 8 bits of the MSI data, to determine which * interrupt is being EOI'd. So even if the IOMMU receiving the MSI message * does not even care about the contents of those bits (e.g. the Intel IOMMU * as described below), the operating system still needs to put appropriate * values in those bits for level-triggered interrupts. Likewise, bit 15 of * the RTE, which corresponds to bit 15 of the MSI data, is the bit which * indicates that a given pin is level-triggered. * * * Intel "Remappable Format" * ------------------------- * * When Intel started supporting more than 255 CPUs, the 8-bit limit in what * was not yet called "Compatibility Format" became a problem. To support * the full 32 bits of logical x2APIC IDs they had to come up with another * solution. Since MSIs are basically just a DMA write, the logical place for * this was the IOMMU, which already intercepts DMA writes from devices. So * they invented "Interrupt Remapping". The "Remappable Format" MSI does not * directly encode which vector to send to which CPU; instead it just * identifies an index into an IOMMU table (the Interrupt Remapping Table). * * The Interrupt Remapping Table Entry (IRTE) contains all the information * which was once present in the MSI address+data, but allows for a full 32 * bits of destination ID. (It can also be used for posted interrupts, * delivering the interrupt *directly* to a vCPU in guest mode). * * To signal a Remappable Format MSI, Intel used bit 4 of the MSI address, * which is the lowest of the bits which were previously labelled "Extended * Destination ID". With an Intel IOMMU doing Interrupt Remapping, devices * can send both Remappable Format MSIs, *and* Compatibility Format, and the * IOMMU will only actually remap the former. (It can be told to block the * latter, for security reasons.) * * Intel calls the IRTE index the "handle". 
In the simple case, the full 16 * bits of the handle are conveyed in the address of the MSI (bits 19-5 and * bit 2), and the data written to that address is completely ignored. * * However, this would not support the legacy multi-MSI devices which only * have one MSI address/data configuration register and simply add one to the * data for each consecutive interrupt source. So the Intel IOMMU also has an * optional "subhandle" in the low bits of the data. If bit 3 of the address * (Subhandle Valid) is set, the IOMMU adds this subhandle to the handle * extracted from the address, and uses the result as the index into its * Interrupt Remapping Table. This even allows legacy multi-MSI devices to * target different CPUs with their different interrupt sources, which they * could not before. * * Address: 1111.1110.1110.hhhh.hhhh.hhhh.hhh1.shxx * 0xFEE . Handle[14:0] .↑↑↑ * ||└-Don't Care * |└-Handle[15] * └-Subhandle Valid (SHV) * * Data: 0000.0000.0000.0000.ssss.ssss.ssss.ssss * Reserved . Subhandle (if SHV==1 in address) * * As described earlier, the I/O APIC has legacy reasons to care about the * bits which end up in bits 7-0 and bit 15 of the data, which were once the * vector and trigger mode respectively. Since the operating system has no * need to set SHV=1 for MSIs generated by the I/O APIC, the IOMMU can ignore * the data completely, and the operating system is free to place whatever * values it likes in there to keep the I/O APIC happy for level-triggered * interrupts. * * * AMD Remappable MSI * ------------------ * * AMD's IOMMU is completely different to Intel's, and they didn't make * things anywhere near as complicated. When the IOMMU is enabled, a * device cannot send "Compatibility Format" MSIs any more, so there is * no need to tell one format from the other. AMD just used the low 11 * bits of the data as the IRTE index, and nothing else matters. * * Address: 1111.1110.1110.xxxx.xxxx.xxxx.xxxx.xxxx * 0xFEE .
Don't Care * * Data: xxxx.xxxx.xxxx.xxxx.xxxx.xiii.iiii.iiii * Don't Care IRTE Index * * The reason for using only 11 bits of IRTE index is because, as described * above, the I/O APIC actually *does* care about bit 15 of the MSI data, (or, * more accurately, it cares about the RTE bit which gets shuffled into bit 15 * of the MSI data). That's the original "Trigger Mode" bit, which lets the * I/O APIC know that this is a level-triggered interrupt. * * Although the Intel IOMMU has a single Interrupt Remapping Table and a * single number space for IRTE indices across the whole system, the AMD * IOMMU has a table per device, so multiple devices may use IRTE index * number zero, for example. This, sadly, becomes important later. * * * The 15-bit MSI extension * ------------------------ * * The problem with IOMMUs is that they were designed to support DMA * translation, and there is no architectural way to disable that and offer * guests an IOMMU which *only* supports Interrupt Remapping. We really don't * want guests doing their own DMA translation, as it has severe performance * and security implications. * * So KVM, Hyper-V and Xen all define a virt extension which uses 7 of the * original "Extended Destination ID" bits to give support for up to 32768 * virtual CPUs. (This extension avoids the low bit which Intel used to * indicate Remappable Format). This format is exactly like the Compatibility * Format, except that bits 5-11 of the MSI address are used as bits 8-14 * of the destination APIC ID: * * Address: 1111.1110.1110.dddd.dddd.DDDD.DDD0.rmxx * 0xFEE . Dest ID .ExtDest .↑↑↑ * ||└-Don't Care * |└-Destination Mode * └-Redirection Hint * * Data: 0000.0000.0000.0000.TL00.0DDD.vvvv.vvvv * Reserved .↑↑ ↑ .
Vector * || └-Delivery Mode * |└-Trigger Mode Level * └-Trigger Mode * * We have thus far mostly glossed over the distinction between logical and * physical destination IDs, indicated by the Destination Mode bit, because * these MSI formats are merely a transport for that information and have * little to do with its interpretation. * * However, we should note that in certain cases, the distinction between * logical and physical mode does matter. In x2APIC mode, each logical * "cluster" contains 16 CPUs. Logical mode addressing splits the 32-bit * destination ID into two parts; the top 16 bits contain the "cluster ID", * which is the physical APIC ID divided by 16. The low 16 bits are a bitmask * of which CPUs within that cluster should be eligible to receive the * interrupt. So, for example, an interrupt could be targeted at CPUs 21, 23, * 24, and 25 by using the logical destination ID 0x0001.03a0. * * Astute readers will have noticed that with only 15 bits of destination * ID, logical mode can only address the first cluster (CPUs 0-15), and in fact * can't even set the bit for CPU#15 either. * * So when using this 15-bit MSI format, it is expected that guests will set * the Destination Mode bit to zero to use physical addressing mode, where the * destination ID in the MSI message is simply the physical APIC ID of the * single CPU which is the target of the interrupt. Enlightened operating * systems ought to be capable of this for themselves, but hypervisors can * give them a helpful nudge by setting bit 19 ("Force APIC physical * destination mode") in the Fixed Feature Flags field of the Fixed ACPI * Description Table (FADT). A strict reading of the ACPI specification would * suggest that this flag is only for xAPIC mode, but both Windows and Linux * do honour it in x2APIC mode too. * * * Xen MSI → PIRQ mapping * ---------------------- * * All of the above are implementable in real hardware. 
Actual external PCI * devices can perform memory transactions to addresses in the physical * address range 0x00000000FEExxxxx, which reach the APIC and cause * interrupts to be injected into the relevant CPU. * * But Xen guests know that they are running in a virtual machine. So they * know that the PCI config space is a complete fiction. For example, if they * set up a BAR of a given device with a certain address, that is a *guest* * physical address. The hypervisor probably doesn't even change anything on * the device itself; it just adjusts the EPT page tables to make the * corresponding BAR *appear* in the guest physical address space at the * desired location. * * MSI messages in a virtual environment are similarly fictional. If the guest * configures an MSI message in a PCI device with a certain vCPU APIC ID and * vector, the real hardware wouldn't know what to do with that. (Well, we * could design an IOMMU which *could* cope with that, let guests write * directly to the PCI devices' MSI tables, and use the resulting MSIs for * posted interrupts as a first-class citizen, but nobody's done that.) * * In practice, what happens is that the hypervisor registers its *own* * handler for the hardware interrupt in question (routing it to a given * vector on a given *host* CPU, typically handled by VFIO in the KVM case). * When that host interrupt handler is triggered, the hypervisor needs to * inject an interrupt to the guest vCPU accordingly. From that point, it's * just the same as raising an MSI from an *emulated* PCI device. Most * hypervisors, including Xen and KVM, do *not* have a mechanism to simply * write to guest memory *instead* of injecting an interrupt. So if the guest * configured the MSI to target an address outside the 0x00000000FEExxxxx * range, it just gets dropped. (Boo, no DPDK polled-mode implementations * abusing MSIs for memory writes, in virt guests!) 
* * This means that we can abuse the high 32 bits of the address even in a * guest-visible way, right? Nothing would ever go wrong... * * Xen was the first to do this. It needed a way to map MSI from PCI devices * to a 'PIRQ', which is a form of Xen paravirtualised interrupt which binds * to Xen Event Channels. By using vector#0, Xen guests indicate a special * MSI message which is to be routed to a PIRQ. The actual PIRQ# is then in * the original Destination ID field... and the high bits of the address. * * (We'll gloss over the way that Xen snoops on these even while masked, and * actually unmasks the MSI when the guest binds to the corresponding PIRQ, * because there's only so much pain I can inflict on the reader in one * sitting.) * * AddrHi: DDDD.DDDD.DDDD.DDDD.DDDD.DDDD.0000.0000 * PIRQ#[31-8] . Rsvd * * AddrLo: 1111.1110.1110.dddd.dddd.0000.0000.xxxx * 0xFEE .PIRQ[7-0]. Rsvd .Don't Care * * Data: xxxx.xxxx.xxxx.xxxx.xxxx.xxxx.0000.0000 * Don't Care . Vector == 0 * * When Xen attempts to raise such an MSI to the guest, it doesn't inject it * via the virtual APIC at all. It is routed to the PIRQ and thus to the Xen * event channel mechanism instead. * * * KVM X2APIC MSI API * ------------------ * * KVM has an ioctl() for injecting MSI interrupts, and routing table entries * which cause MSIs to be injected to the guest when triggered. For * convenience, KVM originally just used the Compatibility Format MSI message * as its userspace ABI for configuring these. This got less convenient when * x2APIC came along and we needed an extra 24 bits for the Destination ID. * * KVM's solution was to abuse the high 32 bits of the address. If this was a * true memory transaction, such a write would miss the APIC completely and * scribble over guest memory at an address like 0x00000100FEExxxxx. But in * this case it's just an ABI between KVM and userspace, using bits which * would otherwise be completely redundant.
KVM uses the high 24 bits of the * MSI address (bits 40-63) as the high 24 bits of the destination ID. * * AddrHi: DDDD.DDDD.DDDD.DDDD.DDDD.DDDD.0000.0000 * Destination ID [31-8] . Rsvd * * AddrLo: 1111.1110.1110.dddd.dddd.0000.0000.rmxx * 0xFEE . ↑ . Rsvd .↑↑↑ * DestID[7-0] ||└-Don't Care * |└-Destination Mode * └-Redirection Hint * * Data: 0000.0000.0000.0000.TL00.0DDD.vvvv.vvvv * Reserved .↑↑ ↑ . Vector * || └-Delivery Mode * |└-Trigger Mode Level * └-Trigger Mode * * This hack is not visible to a KVM guest. What a KVM guest programs into * the MSI descriptors of passthrough or emulated PCI devices is completely * different, and (at this point in our tale of woe, at least) never sets * the high 32 bits of the target address to anything but zero. * * * IOMMU interrupts * ---------------- * * Since an IOMMU is responsible for remapping interrupts so they can reach * CPUs with higher APIC IDs, how do we actually configure the events from * the IOMMU itself? * * Intel uses the same format as the KVM x2APIC API (which may in fact have * been why KVM did it that way). Since it's never going to be an actual * memory transaction, it's safe to abuse the high bits of the address. Intel * offers { Data, Address, Upper Address } registers for each type of event * that the IOMMU can generate for itself, with the high 24 bits of the * destination ID in the high 24 bits of the address as shown above for KVM. * * AMD's IOMMU uses a completely different 64-bit register format (e.g. XT * IOMMU General Interrupt Control Register) which doesn't pretend very hard * to look like an MSI at all. But just happens to have the DestMode at bit * 2, like in the MSI address. And just happens to have the vector and * Delivery Mode (from the low 9 bits of the MSI data) in the low 9 bits of * its high word (bits 32-40 of the register). And then just throws the * actual destination ID in around them in some other bits: * * Low32: dddd.dddd.dddd.dddd.dddd.dddd.xxxx.xmxx * Destination ID [23-0] .
↑ . ↑↑ * Don't |└-Don't Care * Care └-Destination Mode * * High32: DDDD.DDDD.xxxx.xxxx.xxxx.xxxD.vvvv.vvvv * DestId[31-24] ↑. Vector * └-Delivery Mode * * * Windows, part 1: Intel IOMMU with no DMA translation * ---------------------------------------------------- * * As noted above, the 15-bit extension was invented to avoid the need for * an IOMMU, because it is undesirable to offer a virtual IOMMU to guests * with support for them to do their own additional level of DMA translation. * * However, although Hyper-V exposes the 15-bit MSI feature, Windows as a * guest OS does not use it. In order to support Windows guests with more * than 255 vCPUs, a hack was found for the Intel IOMMU. Although there is no * official way to advertise that the IOMMU does not support DMA translation, * there *are* "Supported Adjusted Guest Address Width" bits which advertise * the ability to use 3-level, 4-level, and/or 5-level page tables. If * Windows encounters an IOMMU which sets *none* of these bits, Windows will * quietly refrain from attempting to use that IOMMU for DMA translation, but * will still use it for Interrupt Remapping. * * However, this only works correctly if Windows is running on an Intel CPU. * When Windows runs on an AMD CPU, it will happily configure and use the * Intel IOMMU, but misconfigures the MSI messages that it programs into the * devices. For I/O APIC interrupts, Windows programs the IRTE in the Intel * IOMMU correctly... but then configures the I/O APIC using the AMD format * (with the IRTE index where the vector would have been). A hack to the * virtual Intel IOMMU emulation can make it cope with this bug... but sadly * it *only* works for I/O APIC interrupts. For actual PCI MSI, Windows still * configures the device with an AMD-style remappable MSI but *doesn't* * actually configure the IRTE in the IOMMU at all. 
This is probably because * Intel's IRT is system-wide, while AMD has one per device; Windows does * seem to think it's using a separate IRTE space, so the first MSI vector * gets IRTE index 0 which conflicts with I/O APIC pin 0. * * So for PCI, the hypervisor has no idea where Windows intended a given MSI * to be routed, and cannot work around the Windows bugs to support >255 AMD * vCPUs this way. * * * Windows, part 2: No IOMMU * ------------------------- * * If we do *not* offer an IOMMU to a Windows guest which has CPUs with high * APIC IDs, we encounter a *different* Windows bug, which is easier to work * around. Windows doesn't use the 15-bit extension described above, but it * *does* just throw the high bits of the destination ID into bits 32-55 of * the MSI address. * * (This obviously only works for devices which can generate 64-bit MSIs, * which does not include the I/O APIC or HPET. Persuading Windows to set * up the I/O APIC when there are CPUs with high APIC IDs is a different * issue, and not covered here.) * * Done without negotiation or discovery of any hypervisor feature, this abuse * of high address bits arguably ought to cause the device to write to an * address in guest *memory* and miss the APIC at 0x00000000FEExxxxx * altogether, but we already admitted almost no hypervisors actually *do* * that. (QEMU is the exception here, because for *emulated* PCI devices, * pci_msi_trigger() does actually generate true write cycles in the * corresponding DMA address space.) * * We can cope with this Windows bug and even use it to our advantage, by * spotting the high bits in the MSI address and using them. It does require * that we have an API which is specifically for *MSI*, not to be conflated * with actual DMA writes. So QEMU's pci_msi_trigger() would have to do * things differently. 
But let's pretend, for the sake of argument, that I'm * typing this C-comment essay into a VMM other than QEMU, which already * does think that way and has a cleaner separation of emulated-PCI vs. the * VFIO or true emulation which can back it, and *does* always handle MSIs * explicitly. * * In that case, all the translation function has to do, in addition to * invoking all the IOMMU and Xen and 15-bit translators as QEMU's * kvm_arch_fixup_msi_route() function already does, is add one more trivial * special case. This format is the same as the KVM x2APIC API format, with * the top 32 bits of the address shifted by 8 bits: * * AddrHi: 0000.0000.DDDD.DDDD.DDDD.DDDD.DDDD.DDDD * Rsvd . Destination ID bits 8-31 * * AddrLo: 1111.1110.1110.dddd.dddd.0000.0000.rmxx * 0xFEE . Dest ID . Rsvd .↑↑↑ * ||└-Don't Care * |└-Destination Mode * └-Redirection Hint * * Data: 0000.0000.0000.0000.TL00.0DDD.vvvv.vvvv * Reserved .↑↑ ↑ . Vector * || └-Delivery Mode * |└-Trigger Mode Level * └-Trigger Mode */
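
/*
 * Illustrative sketch only
 * ------------------------
 *
 * To make the bit layouts above a little more concrete, here is a minimal
 * sketch of the "one more trivial special case" idea: a decoder for the
 * non-remapped formats (Compatibility, 15-bit extension, KVM x2APIC ABI and
 * the Windows high-bits variant), plus the I/O APIC RTE swizzle. Every name
 * here (struct msi_msg, enum msi_ext, msi_decode(), ioapic_rte_to_msi()) is
 * hypothetical and not taken from QEMU, KVM or any other VMM. It assumes the
 * caller already knows which extension applies, and it simply refuses
 * Remappable Format messages, which would need an IRTE lookup instead. The
 * Xen PIRQ and AMD formats are left out to keep it short.
 */

```c
#include <stdbool.h>
#include <stdint.h>

/* All names below are hypothetical; they do not come from any real VMM. */
struct msi_msg {
    uint32_t addr_hi;   /* MSI address bits 63-32 */
    uint32_t addr_lo;   /* MSI address bits 31-0 */
    uint32_t data;      /* the 32 bits written to that address */
};

enum msi_ext {
    MSI_EXT_NONE,       /* Compatibility Format only: 8-bit destination ID */
    MSI_EXT_15BIT,      /* address bits 11-5 hold destination ID bits 14-8 */
    MSI_EXT_KVM_X2APIC, /* address bits 63-40 hold destination ID bits 31-8 */
    MSI_EXT_WINDOWS,    /* address bits 55-32 hold destination ID bits 31-8 */
};

struct msi_target {
    uint32_t dest_id;
    uint8_t vector;         /* data bits 7-0 */
    uint8_t delivery_mode;  /* data bits 10-8 */
    bool logical_dest;      /* address bit 2 (Destination Mode) */
    bool redir_hint;        /* address bit 3 (Redirection Hint) */
    bool level_triggered;   /* data bit 15 (Trigger Mode) */
};

/* Swizzle an I/O APIC RTE into the MSI message it would generate. */
static inline struct msi_msg ioapic_rte_to_msi(uint64_t rte)
{
    struct msi_msg m = { 0, 0, 0 };

    m.addr_lo = 0xFEE00000u
              | (((uint32_t)(rte >> 48) & 0xFFFFu) << 4) /* RTE[63-48] -> address[19-4] */
              | (((uint32_t)(rte >> 11) & 1u) << 2);     /* RTE[11]    -> address[2] */
    m.data = (uint32_t)rte & 0x87FFu;                    /* RTE[15] and RTE[10-0] */
    return m;
}

/*
 * Decode a non-remapped MSI. Returns false if the message misses the
 * 0xFEExxxxx window or is in Intel Remappable Format (address bit 4 set).
 */
static inline bool msi_decode(const struct msi_msg *m, enum msi_ext ext,
                              struct msi_target *t)
{
    if ((m->addr_lo & 0xFFF00000u) != 0xFEE00000u)
        return false;
    if (m->addr_lo & (1u << 4))
        return false;

    t->dest_id = (m->addr_lo >> 12) & 0xFFu;
    switch (ext) {
    case MSI_EXT_NONE:
        if (m->addr_hi)
            return false;   /* a real write here would land in memory */
        break;
    case MSI_EXT_15BIT:
        if (m->addr_hi)
            return false;
        t->dest_id |= ((m->addr_lo >> 5) & 0x7Fu) << 8;
        break;
    case MSI_EXT_KVM_X2APIC:
        t->dest_id |= m->addr_hi & 0xFFFFFF00u;
        break;
    case MSI_EXT_WINDOWS:
        t->dest_id |= (m->addr_hi & 0x00FFFFFFu) << 8;
        break;
    }

    t->redir_hint = (m->addr_lo >> 3) & 1u;
    t->logical_dest = (m->addr_lo >> 2) & 1u;
    t->level_triggered = (m->data >> 15) & 1u;
    t->delivery_mode = (m->data >> 8) & 7u;
    t->vector = m->data & 0xFFu;
    return true;
}
```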