Commit ffcb09f2 authored by Radim Krčmář
PPC KVM update for 4.10:

 * Support for KVM guests on POWER9 using the hashed page table MMU.
 * Updates and improvements to the halt-polling support on PPC, from
   Suraj Jitindar Singh.
 * An optimization to speed up emulated MMIO, from Yongji Xie.
 * Various other minor cleanups.
parents bf65014d 6ccad8ce
+2 −0
@@ -6,6 +6,8 @@ cpuid.txt
	- KVM-specific cpuid leaves (x86).
devices/
	- KVM_CAP_DEVICE_CTRL userspace API.
halt-polling.txt
	- notes on halt-polling
hypercalls.txt
	- KVM hypercalls.
locking.txt
+3 −0
@@ -2023,6 +2023,8 @@ registers, find a list below:
  PPC   | KVM_REG_PPC_WORT              | 64
  PPC	| KVM_REG_PPC_SPRG9             | 64
  PPC	| KVM_REG_PPC_DBSR              | 32
  PPC   | KVM_REG_PPC_TIDR              | 64
  PPC   | KVM_REG_PPC_PSSCR             | 64
  PPC   | KVM_REG_PPC_TM_GPR0           | 64
          ...
  PPC   | KVM_REG_PPC_TM_GPR31          | 64
@@ -2039,6 +2041,7 @@ registers, find a list below:
  PPC   | KVM_REG_PPC_TM_VSCR           | 32
  PPC   | KVM_REG_PPC_TM_DSCR           | 64
  PPC   | KVM_REG_PPC_TM_TAR            | 64
  PPC   | KVM_REG_PPC_TM_XER            | 64
        |                               |
  MIPS  | KVM_REG_MIPS_R0               | 64
          ...
+127 −0
The KVM halt polling system
===========================

The KVM halt polling system provides a feature within KVM whereby the latency
of a guest can, under some circumstances, be reduced by polling in the host
for some time period after the guest has elected to no longer run by ceding.
That is, when a guest vcpu has ceded, or in the case of powerpc when all of the
vcpus of a single vcore have ceded, the host kernel polls for wakeup conditions
before giving up the cpu to the scheduler in order to let something else run.

Polling provides a latency advantage in cases where the guest can be run again
very quickly, by at least saving a trip through the scheduler, which normally
costs on the order of a few microseconds, although performance benefits are
workload dependent. In the event that no wakeup source arrives during the
polling interval, or some other task on the runqueue is runnable, the scheduler
is invoked. Thus halt polling is especially useful on workloads with very short
wakeup periods, where the time spent halt polling is minimised and the time
savings of not invoking the scheduler are distinguishable.

The generic halt polling code is implemented in:

	virt/kvm/kvm_main.c: kvm_vcpu_block()

The powerpc kvm-hv specific case is implemented in:

	arch/powerpc/kvm/book3s_hv.c: kvmppc_vcore_blocked()
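
The polling behaviour described in this section can be illustrated with a small
user-space simulation. This is only a sketch under stated assumptions, not the
code in kvm_vcpu_block() or kvmppc_vcore_blocked(): the struct, the enum and the
function sketch_halt_poll() are hypothetical names, and time is a simulated
counter rather than a real clock.

```c
#include <stdint.h>

/* Hypothetical simulated environment; none of these names exist in KVM. */
struct poll_env {
	uint64_t now_ns;           /* simulated clock */
	uint64_t wakeup_at_ns;     /* time at which a wakeup source arrives */
	uint64_t other_task_at_ns; /* time at which another task becomes runnable */
};

enum poll_result {
	POLL_WOKEN,    /* wakeup seen while polling, scheduler avoided */
	POLL_SCHEDULE  /* polling stopped, the scheduler must be invoked */
};

/*
 * Poll for at most poll_ns, advancing the simulated clock by step_ns per
 * iteration. Polling stops early if a wakeup arrives or if some other task
 * becomes runnable on this cpu.
 */
static enum poll_result sketch_halt_poll(struct poll_env *env,
					 uint64_t poll_ns, uint64_t step_ns)
{
	uint64_t deadline = env->now_ns + poll_ns;

	if (step_ns == 0)
		step_ns = 1; /* avoid spinning forever in the simulation */

	while (env->now_ns < deadline) {
		if (env->now_ns >= env->wakeup_at_ns)
			return POLL_WOKEN;
		if (env->now_ns >= env->other_task_at_ns)
			return POLL_SCHEDULE;
		env->now_ns += step_ns;
	}
	return POLL_SCHEDULE; /* polling interval expired without a wakeup */
}
```

A wakeup that lands inside the polling interval returns POLL_WOKEN without a
trip through the scheduler; in both other cases the caller would go on to
schedule.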

Halt Polling Interval
=====================

The maximum time for which to poll before invoking the scheduler, referred to
as the halt polling interval, is increased and decreased based on the perceived
effectiveness of the polling in an attempt to limit pointless polling.
This value is stored in either the vcpu struct:

	kvm_vcpu->halt_poll_ns

or in the case of powerpc kvm-hv, in the vcore struct:

	kvmppc_vcore->halt_poll_ns

Thus this is a per vcpu (or vcore) value.

During polling, if a wakeup source is received within the halt polling interval,
the interval is left unchanged. In the event that a wakeup source isn't
received during the polling interval (and thus schedule is invoked) there are
two options: either the polling interval and total block time[0] were less than
the global max polling interval (see module params below), or the total block
time was greater than the global max polling interval.

In the event that both the polling interval and total block time were less than
the global max polling interval then the polling interval can be increased in
the hope that next time during the longer polling interval the wakeup source
will be received while the host is polling and the latency benefits will be
received. The polling interval is grown in the function grow_halt_poll_ns() and
is multiplied by the module parameter halt_poll_ns_grow.

In the event that the total block time was greater than the global max polling
interval then the host will never poll for long enough (limited by the global
max) to wake up during the polling interval so it may as well be shrunk in order
to avoid pointless polling. The polling interval is shrunk in the function
shrink_halt_poll_ns() and is divided by the module parameter
halt_poll_ns_shrink, or set to 0 iff halt_poll_ns_shrink == 0.

It is worth noting that this adjustment process attempts to home in on some
steady state polling interval but will only really do a good job for wakeups
which come at an approximately constant rate, otherwise there will be constant
adjustment of the polling interval.

[0] total block time: the time between when the halt polling function is
		      invoked and a wakeup source received (irrespective of
		      whether the scheduler is invoked within that function).
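
The adjustment rules above can be summarised in a short C sketch. It follows
the description in this section rather than the kernel source: struct
poll_params and the sketch_*() functions are hypothetical names, and the
starting value used when growing from an interval of 0 (SKETCH_BASE_POLL_NS) is
an assumption made for illustration only.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the three module parameters. */
struct poll_params {
	uint64_t halt_poll_ns;        /* global max polling interval */
	uint64_t halt_poll_ns_grow;   /* multiplier applied when growing */
	uint64_t halt_poll_ns_shrink; /* divisor applied when shrinking, 0 resets */
};

/* Assumption: value used to start polling again from an interval of 0. */
#define SKETCH_BASE_POLL_NS 10000ULL

static uint64_t sketch_grow(uint64_t cur, const struct poll_params *p)
{
	uint64_t val = cur ? cur * p->halt_poll_ns_grow : SKETCH_BASE_POLL_NS;

	/* The global max is the ceiling for every per-vcpu interval. */
	return val > p->halt_poll_ns ? p->halt_poll_ns : val;
}

static uint64_t sketch_shrink(uint64_t cur, const struct poll_params *p)
{
	/* Divide by halt_poll_ns_shrink, or reset to 0 when it is 0. */
	return p->halt_poll_ns_shrink ? cur / p->halt_poll_ns_shrink : 0;
}

/*
 * cur:      current per-vcpu (or per-vcore) polling interval
 * block_ns: total block time, see footnote [0]
 * Returns the polling interval to use for the next block.
 */
static uint64_t sketch_adjust(uint64_t cur, uint64_t block_ns,
			      const struct poll_params *p)
{
	if (block_ns <= cur)
		return cur; /* wakeup arrived while polling: unchanged */
	if (block_ns > p->halt_poll_ns)
		return sketch_shrink(cur, p); /* polling could never have won */
	if (cur < p->halt_poll_ns && block_ns < p->halt_poll_ns)
		return sketch_grow(cur, p); /* a longer poll might catch it */
	return cur;
}
```

With halt_poll_ns = 200000, halt_poll_ns_grow = 2 and halt_poll_ns_shrink = 0,
an interval of 10000 ns doubles after a 50000 ns block and drops to 0 after a
300000 ns block.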

Module Parameters
=================

The kvm module has 3 tuneable module parameters to adjust the global max
polling interval as well as the rate at which the polling interval is grown and
shrunk. These variables are defined in include/linux/kvm_host.h and as module
parameters in virt/kvm/kvm_main.c, or arch/powerpc/kvm/book3s_hv.c in the
powerpc kvm-hv case.

Module Parameter    |	     Description	      |	     Default Value
--------------------------------------------------------------------------------
halt_poll_ns	    | The global max polling interval | KVM_HALT_POLL_NS_DEFAULT
		    | which defines the ceiling value |
		    | of the polling interval for     | (per arch value)
		    | each vcpu. 		      |
--------------------------------------------------------------------------------
halt_poll_ns_grow   | The value by which the halt     |	2
		    | polling interval is multiplied  |
		    | in the grow_halt_poll_ns()      |
		    | function.			      |
--------------------------------------------------------------------------------
halt_poll_ns_shrink | The value by which the halt     |	0
		    | polling interval is divided in  |
		    | the shrink_halt_poll_ns()	      |
		    | function.			      |
--------------------------------------------------------------------------------

These module parameters can be set from the sysfs files in:

	/sys/module/kvm/parameters/

Note that these module parameters are system wide values and cannot be tuned
on a per vm basis.
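
As a rough illustration, the parameter files can be read and written like any
other small text file. The helpers below are hypothetical (read_param() and
write_param() are not part of any kernel or library API) and assume that each
file holds a single decimal value; writing the real files typically requires
root privileges.

```c
#include <stdio.h>

/*
 * Read one unsigned decimal value from a parameter file, for example
 * /sys/module/kvm/parameters/halt_poll_ns.
 * Returns 0 on success and -1 on failure.
 */
static int read_param(const char *path, unsigned long long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%llu", out) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Write a new value to a parameter file.
 * Returns 0 on success and -1 on failure.
 */
static int write_param(const char *path, unsigned long long val)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%llu\n", val) > 0;
	if (fclose(f) != 0)
		ok = 0; /* the write may only fail when the file is flushed */
	return ok ? 0 : -1;
}
```

For example, write_param("/sys/module/kvm/parameters/halt_poll_ns", 0) would
set the global max polling interval to 0 on a system where the caller is
allowed to write that file.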

Further Notes
=============

- Care should be taken when setting the halt_poll_ns module parameter as a
large value has the potential to drive the cpu usage to 100% on a machine which
would be almost entirely idle otherwise. This is because even if a guest has
wakeups during which very little work is done and which are quite far apart, if
the period is shorter than the global max polling interval (halt_poll_ns) then
the host will always poll for the entire block time and thus cpu utilisation
will go to 100%.

- Halt polling essentially presents a trade off between power usage and latency
and the module parameters should be used to tune the balance between the two.
Idle cpu time is essentially converted to host kernel time with the aim of
decreasing latency when entering the guest.

- Halt polling will only be conducted by the host when no other tasks are
runnable on that cpu, otherwise the polling will cease immediately and
schedule will be invoked to allow that other task to run. Thus this doesn't
allow a guest to mount a denial of service attack on the cpu.
+44 −0
@@ -14,6 +14,9 @@

#include <linux/threads.h>
#include <linux/kprobes.h>
#ifdef CONFIG_KVM
#include <linux/kvm_host.h>
#endif

#include <uapi/asm/ucontext.h>

@@ -109,4 +112,45 @@ void early_setup_secondary(void);
/* time */
void accumulate_stolen_time(void);

/* kvm */
#ifdef CONFIG_KVM
long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
			 unsigned long ioba, unsigned long tce);
long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
				  unsigned long liobn, unsigned long ioba,
				  unsigned long tce_list, unsigned long npages);
long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
			   unsigned long liobn, unsigned long ioba,
			   unsigned long tce_value, unsigned long npages);
long int kvmppc_rm_h_confer(struct kvm_vcpu *vcpu, int target,
                            unsigned int yield_count);
long kvmppc_h_random(struct kvm_vcpu *vcpu);
void kvmhv_commence_exit(int trap);
long kvmppc_realmode_machine_check(struct kvm_vcpu *vcpu);
void kvmppc_subcore_enter_guest(void);
void kvmppc_subcore_exit_guest(void);
long kvmppc_realmode_hmi_handler(void);
long kvmppc_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
                    long pte_index, unsigned long pteh, unsigned long ptel);
long kvmppc_h_remove(struct kvm_vcpu *vcpu, unsigned long flags,
                     unsigned long pte_index, unsigned long avpn);
long kvmppc_h_bulk_remove(struct kvm_vcpu *vcpu);
long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long flags,
                      unsigned long pte_index, unsigned long avpn,
                      unsigned long va);
long kvmppc_h_read(struct kvm_vcpu *vcpu, unsigned long flags,
                   unsigned long pte_index);
long kvmppc_h_clear_ref(struct kvm_vcpu *vcpu, unsigned long flags,
                        unsigned long pte_index);
long kvmppc_h_clear_mod(struct kvm_vcpu *vcpu, unsigned long flags,
                        unsigned long pte_index);
long kvmppc_hpte_hv_fault(struct kvm_vcpu *vcpu, unsigned long addr,
                          unsigned long slb_v, unsigned int status, bool data);
unsigned long kvmppc_rm_h_xirr(struct kvm_vcpu *vcpu);
int kvmppc_rm_h_ipi(struct kvm_vcpu *vcpu, unsigned long server,
                    unsigned long mfrr);
int kvmppc_rm_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr);
int kvmppc_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr);
#endif

#endif /* _ASM_POWERPC_ASM_PROTOTYPES_H */
+39 −8
@@ -70,7 +70,9 @@

#define HPTE_V_SSIZE_SHIFT	62
#define HPTE_V_AVPN_SHIFT	7
#define HPTE_V_COMMON_BITS	ASM_CONST(0x000fffffffffffff)
#define HPTE_V_AVPN		ASM_CONST(0x3fffffffffffff80)
#define HPTE_V_AVPN_3_0		ASM_CONST(0x000fffffffffff80)
#define HPTE_V_AVPN_VAL(x)	(((x) & HPTE_V_AVPN) >> HPTE_V_AVPN_SHIFT)
#define HPTE_V_COMPARE(x,y)	(!(((x) ^ (y)) & 0xffffffffffffff80UL))
#define HPTE_V_BOLTED		ASM_CONST(0x0000000000000010)
@@ -80,14 +82,16 @@
#define HPTE_V_VALID		ASM_CONST(0x0000000000000001)

/*
 * ISA 3.0 have a different HPTE format.
 * ISA 3.0 has a different HPTE format.
 */
#define HPTE_R_3_0_SSIZE_SHIFT	58
#define HPTE_R_3_0_SSIZE_MASK	(3ull << HPTE_R_3_0_SSIZE_SHIFT)
#define HPTE_R_PP0		ASM_CONST(0x8000000000000000)
#define HPTE_R_TS		ASM_CONST(0x4000000000000000)
#define HPTE_R_KEY_HI		ASM_CONST(0x3000000000000000)
#define HPTE_R_RPN_SHIFT	12
#define HPTE_R_RPN		ASM_CONST(0x0ffffffffffff000)
#define HPTE_R_RPN_3_0		ASM_CONST(0x01fffffffffff000)
#define HPTE_R_PP		ASM_CONST(0x0000000000000003)
#define HPTE_R_PPP		ASM_CONST(0x8000000000000003)
#define HPTE_R_N		ASM_CONST(0x0000000000000004)
@@ -316,11 +320,42 @@ static inline unsigned long hpte_encode_avpn(unsigned long vpn, int psize,
	 */
	v = (vpn >> (23 - VPN_SHIFT)) & ~(mmu_psize_defs[psize].avpnm);
	v <<= HPTE_V_AVPN_SHIFT;
	if (!cpu_has_feature(CPU_FTR_ARCH_300))
	v |= ((unsigned long) ssize) << HPTE_V_SSIZE_SHIFT;
	return v;
}

/*
 * ISA v3.0 defines a new HPTE format, which differs from the old
 * format in having smaller AVPN and ARPN fields, and the B field
 * in the second dword instead of the first.
 */
static inline unsigned long hpte_old_to_new_v(unsigned long v)
{
	/* trim AVPN, drop B */
	return v & HPTE_V_COMMON_BITS;
}

static inline unsigned long hpte_old_to_new_r(unsigned long v, unsigned long r)
{
	/* move B field from 1st to 2nd dword, trim ARPN */
	return (r & ~HPTE_R_3_0_SSIZE_MASK) |
		(((v) >> HPTE_V_SSIZE_SHIFT) << HPTE_R_3_0_SSIZE_SHIFT);
}

static inline unsigned long hpte_new_to_old_v(unsigned long v, unsigned long r)
{
	/* insert B field */
	return (v & HPTE_V_COMMON_BITS) |
		((r & HPTE_R_3_0_SSIZE_MASK) <<
		 (HPTE_V_SSIZE_SHIFT - HPTE_R_3_0_SSIZE_SHIFT));
}

static inline unsigned long hpte_new_to_old_r(unsigned long r)
{
	/* clear out B field */
	return r & ~HPTE_R_3_0_SSIZE_MASK;
}
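
The four conversion helpers above can be exercised outside the kernel with a
small stand-alone sketch. The constants are copied from this hunk, with
ASM_CONST() replaced by plain ULL literals; the sketch_*() names and the use of
uint64_t instead of unsigned long are assumptions made so that the example
builds anywhere. The round trip is only lossless when the old-format value has
no AVPN bits above HPTE_V_COMMON_BITS and no bits set in the positions covered
by HPTE_R_3_0_SSIZE_MASK.

```c
#include <stdint.h>

/* Values copied from the hunk above, as plain 64-bit literals. */
#define HPTE_V_SSIZE_SHIFT	62
#define HPTE_V_COMMON_BITS	0x000fffffffffffffULL
#define HPTE_R_3_0_SSIZE_SHIFT	58
#define HPTE_R_3_0_SSIZE_MASK	(3ULL << HPTE_R_3_0_SSIZE_SHIFT)

/* Old to new, first dword: trim AVPN, drop B. */
static uint64_t sketch_old_to_new_v(uint64_t v)
{
	return v & HPTE_V_COMMON_BITS;
}

/* Old to new, second dword: move B from the first dword into bits 58-59. */
static uint64_t sketch_old_to_new_r(uint64_t v, uint64_t r)
{
	return (r & ~HPTE_R_3_0_SSIZE_MASK) |
		((v >> HPTE_V_SSIZE_SHIFT) << HPTE_R_3_0_SSIZE_SHIFT);
}

/* New to old, first dword: re-insert B at bits 62-63. */
static uint64_t sketch_new_to_old_v(uint64_t v, uint64_t r)
{
	return (v & HPTE_V_COMMON_BITS) |
		((r & HPTE_R_3_0_SSIZE_MASK) <<
		 (HPTE_V_SSIZE_SHIFT - HPTE_R_3_0_SSIZE_SHIFT));
}

/* New to old, second dword: clear out B. */
static uint64_t sketch_new_to_old_r(uint64_t r)
{
	return r & ~HPTE_R_3_0_SSIZE_MASK;
}
```

For an old-format pair with B = 1, v = 0x4000000000123481 and r = 0x12345190,
the new-format pair is v = 0x123481 and r = 0x0400000012345190, and converting
back yields the original values.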

/*
 * This function sets the AVPN and L fields of the HPTE  appropriately
 * using the base page size and actual page size.
@@ -341,12 +376,8 @@ static inline unsigned long hpte_encode_v(unsigned long vpn, int base_psize,
 * aligned for the requested page size
 */
static inline unsigned long hpte_encode_r(unsigned long pa, int base_psize,
					  int actual_psize, int ssize)
					  int actual_psize)
{

	if (cpu_has_feature(CPU_FTR_ARCH_300))
		pa |= ((unsigned long) ssize) << HPTE_R_3_0_SSIZE_SHIFT;

	/* A 4K page needs no special encoding */
	if (actual_psize == MMU_PAGE_4K)
		return pa & HPTE_R_RPN;