
Commit 59c5c58c authored by Paolo Bonzini

Merge tag 'kvm-ppc-next-5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD

PPC KVM update for 5.2

* Support for guests to access the new POWER9 XIVE interrupt controller
  hardware directly, reducing interrupt latency and overhead for guests.

* In-kernel implementation of the H_PAGE_INIT hypercall.

* Reduce memory usage of sparsely-populated IOMMU tables.

* Several bug fixes.

Second PPC KVM update for 5.2

* Fix a bug, fix a spelling mistake, remove some useless code.
parents f93f7ede 4894fbcc
+32 −0
@@ -56,3 +56,35 @@ POWER9. Loads and stores to the watchpoint locations will not be
trapped in GDB. The watchpoint is remembered, so if the guest is
migrated back to the POWER8 host, it will start working again.

Force enabling the DAWR
=============================
Kernels (since ~v5.2) have an option to force enable the DAWR via:

  echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous

This enables the DAWR even on POWER9.

This is a dangerous setting, USE AT YOUR OWN RISK.

Some users may not care about a bad user crashing their box
(i.e. single-user/desktop systems) and really want the DAWR.  This
allows them to force enable the DAWR.

This flag can also be used to disable DAWR access. Once this is
cleared, all DAWR access should be cleared immediately and your
machine is once again safe from crashing.

Userspace may get confused by toggling this. If DAWR is force
enabled/disabled between getting the number of breakpoints (via
PTRACE_GETHWDBGINFO) and setting the breakpoint, userspace will get an
inconsistent view of what's available. Similarly for guests.

For the DAWR to be enabled in a KVM guest, the DAWR needs to be force
enabled in the host AND the guest. For this reason, this won't work on
POWERVM as it doesn't allow the HCALL to work. Writes of 'Y' to the
dawr_enable_dangerous file will fail if the hypervisor doesn't support
writing the DAWR.

To double check the DAWR is working, run this kernel selftest:
  tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
Any errors/failures/skips mean something is wrong.
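
The same toggle can be driven from a program instead of the shell. The
following is a minimal sketch only: the helper name dawr_write_flag() is
hypothetical, and writing 'N' to clear the flag is an assumption (the
text above documents only 'Y').

```c
#include <errno.h>
#include <stdio.h>

/* Path taken from the text above; debugfs must be mounted. */
#define DAWR_ENABLE_PATH "/sys/kernel/debug/powerpc/dawr_enable_dangerous"

/*
 * Hypothetical helper: write 'Y' (enable != 0) or 'N' (enable == 0) to
 * the given file. Returns 0 on success, a negative errno value on failure.
 * Writing 'N' to clear the flag is an assumption; only 'Y' is documented.
 */
static inline int dawr_write_flag(const char *path, int enable)
{
	FILE *f = fopen(path, "w");
	int ret = 0;

	if (!f)
		return -errno;
	if (fputc(enable ? 'Y' : 'N', f) == EOF)
		ret = -EIO;
	/* fclose() flushes the buffer, so a rejected write surfaces here. */
	if (fclose(f) == EOF && ret == 0)
		ret = -errno;
	return ret;
}
```

A caller would pass DAWR_ENABLE_PATH and treat a negative return as "the
hypervisor or kernel refused the change", matching the failure case
described above.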
+10 −0
@@ -1967,6 +1967,7 @@ registers, find a list below:
  PPC   | KVM_REG_PPC_TLB3PS            | 32
  PPC   | KVM_REG_PPC_EPTCFG            | 32
  PPC   | KVM_REG_PPC_ICP_STATE         | 64
  PPC   | KVM_REG_PPC_VP_STATE          | 128
  PPC   | KVM_REG_PPC_TB_OFFSET         | 64
  PPC   | KVM_REG_PPC_SPMC1             | 32
  PPC   | KVM_REG_PPC_SPMC2             | 32
@@ -4487,6 +4488,15 @@ struct kvm_sync_regs {
        struct kvm_vcpu_events events;
};

6.75 KVM_CAP_PPC_IRQ_XIVE

Architectures: ppc
Target: vcpu
Parameters: args[0] is the XIVE device fd
            args[1] is the XIVE CPU number (server ID) for this vcpu

This capability connects the vcpu to an in-kernel XIVE device.
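
As a rough illustration of the parameters listed above, the sketch below
fills a request structure for this capability. The structure is a local
stand-in for struct kvm_enable_cap and the helper name is hypothetical;
real code should take both the structure and the capability number from
<linux/kvm.h>.

```c
#include <stdint.h>
#include <string.h>

/*
 * Local stand-in for the KVM_ENABLE_CAP argument structure, for
 * illustration only. Real code should use struct kvm_enable_cap.
 */
struct enable_cap_sketch {
	uint32_t cap;
	uint32_t flags;
	uint64_t args[4];
	uint8_t  pad[64];
};

/*
 * Hypothetical helper: prepare the request that connects a vcpu to an
 * in-kernel XIVE device. cap_nr is expected to be the value of
 * KVM_CAP_PPC_IRQ_XIVE from the kernel headers.
 */
static inline void xive_fill_enable_cap(struct enable_cap_sketch *req,
					uint32_t cap_nr, int xive_fd,
					uint32_t server)
{
	memset(req, 0, sizeof(*req));
	req->cap = cap_nr;
	req->args[0] = (uint64_t)xive_fd;	/* XIVE device fd (non-negative) */
	req->args[1] = server;			/* XIVE CPU number (server ID) */
}
```

Because the target is a vcpu, the filled request would be issued with the
KVM_ENABLE_CAP ioctl on the vcpu file descriptor.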

7. Capabilities that can be enabled on VMs
------------------------------------------

+197 −0
POWER9 eXternal Interrupt Virtualization Engine (XIVE Gen1)
===========================================================

Device types supported:
  KVM_DEV_TYPE_XIVE     POWER9 XIVE Interrupt Controller generation 1

This device acts as a VM interrupt controller. It provides the KVM
interface to configure the interrupt sources of a VM in the underlying
POWER9 XIVE interrupt controller.

Only one XIVE instance may be instantiated. A guest XIVE device
requires a POWER9 host and the guest OS should have support for the
XIVE native exploitation interrupt mode. If not, it should run using
the legacy interrupt mode, referred to as XICS (POWER7/8).

* Device Mappings

  The KVM device exposes different MMIO ranges of the XIVE HW which
  are required for interrupt management. These are exposed to the
  guest in VMAs populated with a custom VM fault handler.

  1. Thread Interrupt Management Area (TIMA)

  Each thread has an associated Thread Interrupt Management context
  composed of a set of registers. These registers let the thread
  handle priority management and interrupt acknowledgment. The most
  important are :

      - Interrupt Pending Buffer     (IPB)
      - Current Processor Priority   (CPPR)
      - Notification Source Register (NSR)

  They are exposed to software in four different pages, each offering
  a view with a different privilege level. The first page is for the
  physical thread context and the second for the hypervisor. Only the
  third (operating system) and the fourth (user level) are exposed to
  the guest.
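
The page ordering described above can be written down as a small
enumeration. This is only a sketch: the identifiers are illustrative and
are not taken from the kernel sources.

```c
#include <stdbool.h>

/* The four TIMA views, in page order. Names are illustrative. */
enum tima_page {
	TIMA_PAGE_PHYS = 0,	/* physical thread context */
	TIMA_PAGE_HV   = 1,	/* hypervisor */
	TIMA_PAGE_OS   = 2,	/* operating system */
	TIMA_PAGE_USER = 3,	/* user level */
};

/* Only the OS and user-level views are exposed to the guest. */
static inline bool tima_page_guest_visible(enum tima_page page)
{
	return page == TIMA_PAGE_OS || page == TIMA_PAGE_USER;
}
```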

  2. Event State Buffer (ESB)

  Each source is associated with an Event State Buffer (ESB), exposed
  as an even/odd pair of pages which provides commands to manage the
  source: to trigger it, to EOI it, or to turn it off, for instance.

  3. Device pass-through

  When a device is passed through into the guest, the source
  interrupts are from a different HW controller (PHB4) and the ESB
  pages exposed to the guest should accommodate this change.

  The passthru_irq helpers, kvmppc_xive_set_mapped() and
  kvmppc_xive_clr_mapped(), are called when the device HW irqs are
  mapped into or unmapped from the guest IRQ number space. The KVM
  device extends these helpers to clear the ESB pages of the guest IRQ
  number being mapped and then lets the VM fault handler repopulate
  them. The handler will insert the ESB page corresponding to the HW
  interrupt of the device being passed through or the initial IPI ESB
  page if the device has been removed.

  The ESB remapping is fully transparent to the guest and the OS
  device driver. All handling is done within VFIO and the above
  helpers in KVM-PPC.

* Groups:

  1. KVM_DEV_XIVE_GRP_CTRL
  Provides global controls on the device
  Attributes:
    1.1 KVM_DEV_XIVE_RESET (write only)
    Resets the interrupt controller configuration for sources and event
    queues. To be used by kexec and kdump.
    Errors: none

    1.2 KVM_DEV_XIVE_EQ_SYNC (write only)
    Sync all the sources and queues and mark the EQ pages dirty. This
    is to make sure that a consistent memory state is captured when
    migrating the VM.
    Errors: none

  2. KVM_DEV_XIVE_GRP_SOURCE (write only)
  Initializes a new source in the XIVE device and masks it.
  Attributes:
    Interrupt source number  (64-bit)
  The kvm_device_attr.addr points to a __u64 value:
  bits:     | 63   ....  2 |   1   |   0
  values:   |    unused    | level | type
  - type:  0:MSI 1:LSI
  - level: assertion level in case of an LSI.
  Errors:
    -E2BIG:  Interrupt source number is out of range
    -ENOMEM: Could not create a new source block
    -EFAULT: Invalid user pointer for attr->addr.
    -ENXIO:  Could not allocate underlying HW interrupt
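
The __u64 value for this group can be built from the two documented bits.
A minimal sketch, assuming the macro and function names below (they are
illustrative, not the kernel's own identifiers):

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions from the layout above; names are illustrative. */
#define XIVE_SRC_TYPE_LSI	(1ULL << 0)	/* bit 0: 0 = MSI, 1 = LSI */
#define XIVE_SRC_LEVEL		(1ULL << 1)	/* bit 1: assertion level */

/*
 * Build the value pointed to by kvm_device_attr.addr. The level bit is
 * only meaningful for an LSI, so it is left clear for an MSI.
 */
static inline uint64_t xive_source_value(bool lsi, bool asserted)
{
	uint64_t v = 0;

	if (lsi) {
		v |= XIVE_SRC_TYPE_LSI;
		if (asserted)
			v |= XIVE_SRC_LEVEL;
	}
	return v;
}
```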

  3. KVM_DEV_XIVE_GRP_SOURCE_CONFIG (write only)
  Configures source targeting
  Attributes:
    Interrupt source number  (64-bit)
  The kvm_device_attr.addr points to a __u64 value:
  bits:     | 63   ....  33 |  32  | 31 .. 3 |  2 .. 0
  values:   |    eisn       | mask |  server | priority
  - priority: 0-7 interrupt priority level
  - server: CPU number chosen to handle the interrupt
  - mask: mask flag (unused)
  - eisn: Effective Interrupt Source Number
  Errors:
    -ENOENT: Unknown source number
    -EINVAL: Not initialized source number
    -EINVAL: Invalid priority
    -EINVAL: Invalid CPU number.
    -EFAULT: Invalid user pointer for attr->addr.
    -ENXIO:  CPU event queues not configured or configuration of the
             underlying HW interrupt failed
    -EBUSY:  No CPU available to serve interrupt
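
The source configuration word packs four fields into one __u64. The
sketch below shows one way to assemble it with range checks derived from
the bit widths above; the macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Field positions from the bit layout above; names are illustrative. */
#define SRC_CFG_PRIORITY_MAX	7u		/* 3 bits:  2..0  */
#define SRC_CFG_SERVER_SHIFT	3
#define SRC_CFG_SERVER_MAX	0x1fffffffu	/* 29 bits: 31..3 */
#define SRC_CFG_MASK_SHIFT	32
#define SRC_CFG_EISN_SHIFT	33
#define SRC_CFG_EISN_MAX	0x7fffffffu	/* 31 bits: 63..33 */

/*
 * Pack the fields into *out. Returns 0 on success, or -1 if a field
 * does not fit in its bit range.
 */
static inline int xive_source_config_pack(uint32_t priority, uint32_t server,
					  bool masked, uint32_t eisn,
					  uint64_t *out)
{
	if (priority > SRC_CFG_PRIORITY_MAX ||
	    server > SRC_CFG_SERVER_MAX ||
	    eisn > SRC_CFG_EISN_MAX)
		return -1;

	*out = (uint64_t)priority |
	       ((uint64_t)server << SRC_CFG_SERVER_SHIFT) |
	       ((uint64_t)(masked ? 1 : 0) << SRC_CFG_MASK_SHIFT) |
	       ((uint64_t)eisn << SRC_CFG_EISN_SHIFT);
	return 0;
}
```

Checking the ranges in userspace is a design choice of this sketch; the
kernel still performs its own validation and reports the errors listed
above.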

  4. KVM_DEV_XIVE_GRP_EQ_CONFIG (read-write)
  Configures an event queue of a CPU
  Attributes:
    EQ descriptor identifier (64-bit)
  The EQ descriptor identifier is a tuple (server, priority) :
  bits:     | 63   ....  32 | 31 .. 3 |  2 .. 0
  values:   |    unused     |  server | priority
  The kvm_device_attr.addr points to :
    struct kvm_ppc_xive_eq {
	__u32 flags;
	__u32 qshift;
	__u64 qaddr;
	__u32 qtoggle;
	__u32 qindex;
	__u8  pad[40];
    };
  - flags: queue flags
    KVM_XIVE_EQ_ALWAYS_NOTIFY (required)
	forces notification without using the coalescing mechanism
	provided by the XIVE END ESBs.
  - qshift: queue size (power of 2)
  - qaddr: real address of queue
  - qtoggle: current queue toggle bit
  - qindex: current queue index
  - pad: reserved for future use
  Errors:
    -ENOENT: Invalid CPU number
    -EINVAL: Invalid priority
    -EINVAL: Invalid flags
    -EINVAL: Invalid queue size
    -EINVAL: Invalid queue address
    -EFAULT: Invalid user pointer for attr->addr.
    -EIO:    Configuration of the underlying HW failed
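
To make the layout concrete, here is a sketch of the EQ descriptor
identifier and a userspace mirror of the structure above. The type and
function names are illustrative; real code should use struct
kvm_ppc_xive_eq from the kernel headers.

```c
#include <stdint.h>

/* Mirror of the structure shown above, using fixed-width types. */
struct xive_eq_sketch {
	uint32_t flags;
	uint32_t qshift;
	uint64_t qaddr;
	uint32_t qtoggle;
	uint32_t qindex;
	uint8_t  pad[40];
};

/* 4 + 4 + 8 + 4 + 4 + 40 bytes, with no implicit padding expected. */
_Static_assert(sizeof(struct xive_eq_sketch) == 64, "unexpected EQ layout");

/* EQ identifier: server in bits 31..3, priority in bits 2..0. */
static inline uint64_t xive_eq_id(uint32_t server, uint32_t priority)
{
	return ((uint64_t)(server & 0x1fffffffu) << 3) | (priority & 0x7u);
}

static inline uint32_t xive_eq_id_server(uint64_t id)
{
	return (uint32_t)((id >> 3) & 0x1fffffffu);
}

static inline uint32_t xive_eq_id_priority(uint64_t id)
{
	return (uint32_t)(id & 0x7u);
}
```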

  5. KVM_DEV_XIVE_GRP_SOURCE_SYNC (write only)
  Synchronize the source to flush event notifications
  Attributes:
    Interrupt source number  (64-bit)
  Errors:
    -ENOENT: Unknown source number
    -EINVAL: Not initialized source number

* VCPU state

  The XIVE IC maintains VP interrupt state in an internal structure
  called the NVT. When a VP is not dispatched on a HW processor
  thread, this structure can be updated by HW if the VP is the target
  of an event notification.

  It is important for migration to capture the cached IPB from the NVT
  as it synthesizes the priorities of the pending interrupts. We
  capture a bit more to report debug information.

  KVM_REG_PPC_VP_STATE (2 * 64bits)
  bits:     |  63  ....  32  |  31  ....  0  |
  values:   |   TIMA word0   |   TIMA word1  |
  bits:     | 127       ..........       64  |
  values:   |            unused              |
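
The two TIMA words can be extracted from the 128-bit value as sketched
below. The structure and helper names are illustrative, and placing bits
63..0 in the first 64-bit element is an assumption of this sketch; the
text above does not state how the two halves are ordered in memory.

```c
#include <stdint.h>

/*
 * 128-bit KVM_REG_PPC_VP_STATE value split into two 64-bit halves.
 * Assumption: the first element carries bits 63..0.
 */
struct vp_state_sketch {
	uint64_t lo;	/* bits 63..32: TIMA word0, bits 31..0: TIMA word1 */
	uint64_t hi;	/* bits 127..64: unused */
};

static inline struct vp_state_sketch vp_state_make(uint32_t word0,
						   uint32_t word1)
{
	struct vp_state_sketch s;

	s.lo = ((uint64_t)word0 << 32) | word1;
	s.hi = 0;
	return s;
}

static inline uint32_t vp_state_word0(const struct vp_state_sketch *s)
{
	return (uint32_t)(s->lo >> 32);
}

static inline uint32_t vp_state_word1(const struct vp_state_sketch *s)
{
	return (uint32_t)(s->lo & 0xffffffffu);
}
```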

* Migration:

  Saving the state of a VM using the XIVE native exploitation mode
  should follow a specific sequence. When the VM is stopped :

  1. Mask all sources (PQ=01) to stop the flow of events.

  2. Sync the XIVE device with the KVM control KVM_DEV_XIVE_EQ_SYNC to
  flush any in-flight event notification and to stabilize the EQs. At
  this stage, the EQ pages are marked dirty to make sure they are
  transferred in the migration sequence.

  3. Capture the state of the source targeting, the EQs configuration
  and the state of thread interrupt context registers.

  Restore is similar :

  1. Restore the EQ configuration, as targeting depends on it.
  2. Restore targeting
  3. Restore the thread interrupt contexts
  4. Restore the source states
  5. Let the vCPU run
+8 −0
@@ -90,10 +90,18 @@ static inline void hw_breakpoint_disable(void)
extern void thread_change_pc(struct task_struct *tsk, struct pt_regs *regs);
int hw_breakpoint_handler(struct die_args *args);

extern int set_dawr(struct arch_hw_breakpoint *brk);
extern bool dawr_force_enable;
static inline bool dawr_enabled(void)
{
	return dawr_force_enable;
}

#else	/* CONFIG_HAVE_HW_BREAKPOINT */
static inline void hw_breakpoint_disable(void) { }
static inline void thread_change_pc(struct task_struct *tsk,
					struct pt_regs *regs) { }
static inline bool dawr_enabled(void) { return false; }
#endif	/* CONFIG_HAVE_HW_BREAKPOINT */
#endif	/* __KERNEL__ */
#endif	/* _PPC_BOOK3S_64_HW_BREAKPOINT_H */
+10 −1
@@ -201,6 +201,8 @@ struct kvmppc_spapr_tce_iommu_table {
	struct kref kref;
};

#define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))

struct kvmppc_spapr_tce_table {
	struct list_head list;
	struct kvm *kvm;
@@ -210,6 +212,7 @@ struct kvmppc_spapr_tce_table {
	u64 offset;		/* in pages */
	u64 size;		/* window size in pages */
	struct list_head iommu_tables;
	struct mutex alloc_lock;
	struct page *pages[0];
};

@@ -222,6 +225,7 @@ extern struct kvm_device_ops kvm_xics_ops;
struct kvmppc_xive;
struct kvmppc_xive_vcpu;
extern struct kvm_device_ops kvm_xive_ops;
extern struct kvm_device_ops kvm_xive_native_ops;

struct kvmppc_passthru_irqmap;

@@ -312,7 +316,11 @@ struct kvm_arch {
#endif
#ifdef CONFIG_KVM_XICS
	struct kvmppc_xics *xics;
	struct kvmppc_xive *xive;    /* Current XIVE device in use */
	struct {
		struct kvmppc_xive *native;
		struct kvmppc_xive *xics_on_xive;
	} xive_devices;
	struct kvmppc_passthru_irqmap *pimap;
#endif
	struct kvmppc_ops *kvm_ops;
@@ -449,6 +457,7 @@ struct kvmppc_passthru_irqmap {
#define KVMPPC_IRQ_DEFAULT	0
#define KVMPPC_IRQ_MPIC		1
#define KVMPPC_IRQ_XICS		2 /* Includes a XIVE option */
#define KVMPPC_IRQ_XIVE		3 /* XIVE native exploitation mode */

#define MMIO_HPTE_CACHE_SIZE	4
