Commit 5e83f6fb authored by Linus Torvalds

Merge branch 'kvm-updates/2.6.36' of git://git.kernel.org/pub/scm/virt/kvm/kvm

* 'kvm-updates/2.6.36' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (198 commits)
  KVM: VMX: Fix host GDT.LIMIT corruption
  KVM: MMU: using __xchg_spte more smarter
  KVM: MMU: cleanup spte set and accssed/dirty tracking
  KVM: MMU: don't atomicly set spte if it's not present
  KVM: MMU: fix page dirty tracking lost while sync page
  KVM: MMU: fix broken page accessed tracking with ept enabled
  KVM: MMU: add missing reserved bits check in speculative path
  KVM: MMU: fix mmu notifier invalidate handler for huge spte
  KVM: x86 emulator: fix xchg instruction emulation
  KVM: x86: Call mask notifiers from pic
  KVM: x86: never re-execute instruction with enabled tdp
  KVM: Document KVM_GET_SUPPORTED_CPUID2 ioctl
  KVM: x86: emulator: inc/dec can have lock prefix
  KVM: MMU: Eliminate redundant temporaries in FNAME(fetch)
  KVM: MMU: Validate all gptes during fetch, not just those used for new pages
  KVM: MMU: Simplify spte fetch() function
  KVM: MMU: Add gpte_valid() helper
  KVM: MMU: Add validate_direct_spte() helper
  KVM: MMU: Add drop_large_spte() helper
  KVM: MMU: Use __set_spte to link shadow pages
  ...
parents fe445c6e 3444d7da
+0 −21
@@ -487,17 +487,6 @@ Who: Jan Kiszka <jan.kiszka@web.de>

----------------------------

What:	KVM memory aliases support
When:	July 2010
Why:	Memory aliasing support is used for speeding up guest vga access
	through the vga windows.

	Modern userspace no longer uses this feature, so it's just bitrotted
	code and can be removed with no impact.
Who:	Avi Kivity <avi@redhat.com>

----------------------------

What:	xtime, wall_to_monotonic
When:	2.6.36+
Files:	kernel/time/timekeeping.c include/linux/time.h
@@ -508,16 +497,6 @@ Who: John Stultz <johnstul@us.ibm.com>

----------------------------

What:	KVM kernel-allocated memory slots
When:	July 2010
Why:	Since 2.6.25, kvm supports user-allocated memory slots, which are
	much more flexible than kernel-allocated slots.  All current userspace
	supports the newer interface and this code can be removed with no
	impact.
Who:	Avi Kivity <avi@redhat.com>

----------------------------

What:	KVM paravirt mmu host support
When:	January 2011
Why:	The paravirt mmu host support is slower than non-paravirt mmu, both
+174 −34
@@ -126,6 +126,10 @@ user fills in the size of the indices array in nmsrs, and in return
kvm adjusts nmsrs to reflect the actual number of msrs and fills in
the indices array with their numbers.

Note: if kvm indicates that it supports MCE (KVM_CAP_MCE), then the MCE bank MSRs are
not returned in the MSR list, as different vcpus can have a different number
of banks, as set via the KVM_X86_SETUP_MCE ioctl.

4.4 KVM_CHECK_EXTENSION

Capability: basic
@@ -160,29 +164,7 @@ Type: vm ioctl
Parameters: struct kvm_memory_region (in)
Returns: 0 on success, -1 on error

struct kvm_memory_region {
	__u32 slot;
	__u32 flags;
	__u64 guest_phys_addr;
	__u64 memory_size; /* bytes */
};

/* for kvm_memory_region::flags */
#define KVM_MEM_LOG_DIRTY_PAGES  1UL

This ioctl allows the user to create or modify a guest physical memory
slot.  When changing an existing slot, it may be moved in the guest
physical memory space, or its flags may be modified.  It may not be
resized.  Slots may not overlap.

The flags field supports just one flag, KVM_MEM_LOG_DIRTY_PAGES, which
instructs kvm to keep track of writes to memory within the slot.  See
the KVM_GET_DIRTY_LOG ioctl.

It is recommended to use the KVM_SET_USER_MEMORY_REGION ioctl instead
of this API, if available.  This newer API allows placing guest memory
at specified locations in the host address space, yielding better
control and easy access.
This ioctl is obsolete and has been removed.

4.6 KVM_CREATE_VCPU

@@ -226,17 +208,7 @@ Type: vm ioctl
Parameters: struct kvm_memory_alias (in)
Returns: 0 (success), -1 (error)

struct kvm_memory_alias {
	__u32 slot;  /* this has a different namespace than memory slots */
	__u32 flags;
	__u64 guest_phys_addr;
	__u64 memory_size;
	__u64 target_phys_addr;
};

Defines a guest physical address space region as an alias to another
region.  Useful for aliased address, for example the VGA low memory
window. Should not be used with userspace memory.
This ioctl is obsolete and has been removed.

4.9 KVM_RUN

@@ -892,6 +864,174 @@ arguments.
This ioctl is only useful after KVM_CREATE_IRQCHIP.  Without an in-kernel
irqchip, the multiprocessing state must be maintained by userspace.

4.39 KVM_SET_IDENTITY_MAP_ADDR

Capability: KVM_CAP_SET_IDENTITY_MAP_ADDR
Architectures: x86
Type: vm ioctl
Parameters: unsigned long identity (in)
Returns: 0 on success, -1 on error

This ioctl defines the physical address of a one-page region in the guest
physical address space.  The region must be within the first 4GB of the
guest physical address space and must not conflict with any memory slot
or any mmio address.  The guest may malfunction if it accesses this memory
region.

This ioctl is required on Intel-based hosts.  This is needed on Intel hardware
because of a quirk in the virtualization implementation (see the internals
documentation when it pops into existence).

4.40 KVM_SET_BOOT_CPU_ID

Capability: KVM_CAP_SET_BOOT_CPU_ID
Architectures: x86, ia64
Type: vm ioctl
Parameters: unsigned long vcpu_id
Returns: 0 on success, -1 on error

Define which vcpu is the Bootstrap Processor (BSP).  Values are the same
as the vcpu id in KVM_CREATE_VCPU.  If this ioctl is not called, the default
is vcpu 0.

4.41 KVM_GET_XSAVE

Capability: KVM_CAP_XSAVE
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_xsave (out)
Returns: 0 on success, -1 on error

struct kvm_xsave {
	__u32 region[1024];
};

This ioctl copies the current vcpu's xsave struct to userspace.

4.42 KVM_SET_XSAVE

Capability: KVM_CAP_XSAVE
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_xsave (in)
Returns: 0 on success, -1 on error

struct kvm_xsave {
	__u32 region[1024];
};

This ioctl copies userspace's xsave struct to the kernel.

4.43 KVM_GET_XCRS

Capability: KVM_CAP_XCRS
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_xcrs (out)
Returns: 0 on success, -1 on error

struct kvm_xcr {
	__u32 xcr;
	__u32 reserved;
	__u64 value;
};

struct kvm_xcrs {
	__u32 nr_xcrs;
	__u32 flags;
	struct kvm_xcr xcrs[KVM_MAX_XCRS];
	__u64 padding[16];
};

This ioctl copies the current vcpu's xcrs to userspace.

4.44 KVM_SET_XCRS

Capability: KVM_CAP_XCRS
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_xcrs (in)
Returns: 0 on success, -1 on error

struct kvm_xcr {
	__u32 xcr;
	__u32 reserved;
	__u64 value;
};

struct kvm_xcrs {
	__u32 nr_xcrs;
	__u32 flags;
	struct kvm_xcr xcrs[KVM_MAX_XCRS];
	__u64 padding[16];
};

This ioctl sets the vcpu's xcrs to the values userspace specified.

4.45 KVM_GET_SUPPORTED_CPUID

Capability: KVM_CAP_EXT_CPUID
Architectures: x86
Type: system ioctl
Parameters: struct kvm_cpuid2 (in/out)
Returns: 0 on success, -1 on error

struct kvm_cpuid2 {
	__u32 nent;
	__u32 padding;
	struct kvm_cpuid_entry2 entries[0];
};

#define KVM_CPUID_FLAG_SIGNIFCANT_INDEX 1
#define KVM_CPUID_FLAG_STATEFUL_FUNC    2
#define KVM_CPUID_FLAG_STATE_READ_NEXT  4

struct kvm_cpuid_entry2 {
	__u32 function;
	__u32 index;
	__u32 flags;
	__u32 eax;
	__u32 ebx;
	__u32 ecx;
	__u32 edx;
	__u32 padding[3];
};

This ioctl returns x86 cpuid features which are supported by both the hardware
and kvm.  Userspace can use the information returned by this ioctl to
construct cpuid information (for KVM_SET_CPUID2) that is consistent with
hardware, kernel, and userspace capabilities, and with user requirements (for
example, the user may wish to constrain cpuid to emulate older hardware,
or for feature consistency across a cluster).

Userspace invokes KVM_GET_SUPPORTED_CPUID by passing a kvm_cpuid2 structure
with the 'nent' field indicating the number of entries in the variable-size
array 'entries'.  If the number of entries is too low to describe the cpu
capabilities, an error (E2BIG) is returned.  If the number is too high,
the 'nent' field is adjusted and an error (ENOMEM) is returned.  If the
number is just right, the 'nent' field is adjusted to the number of valid
entries in the 'entries' array, which is then filled.

The entries returned are the host cpuid as returned by the cpuid instruction,
with unknown or unsupported features masked out.  The fields in each entry
are defined as follows:

  function: the eax value used to obtain the entry
  index: the ecx value used to obtain the entry (for entries that are
         affected by ecx)
  flags: an OR of zero or more of the following:
        KVM_CPUID_FLAG_SIGNIFCANT_INDEX:
           if the index field is valid
        KVM_CPUID_FLAG_STATEFUL_FUNC:
           if cpuid for this function returns different values for successive
           invocations; there will be several entries with the same function,
           all with this flag set
        KVM_CPUID_FLAG_STATE_READ_NEXT:
           for KVM_CPUID_FLAG_STATEFUL_FUNC entries, set if this entry is
           the first entry to be read by a cpu
   eax, ebx, ecx, edx: the values returned by the cpuid instruction for
         this function/index combination
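As an illustration of how the fields above fit together, the following is a
minimal sketch of looking up one entry in a filled 'entries' array.  The
struct is a local mirror of the documented layout (real code would use
<linux/kvm.h>), and find_cpuid_entry() is a hypothetical helper, not part of
the kvm API.  It honours KVM_CPUID_FLAG_SIGNIFCANT_INDEX but deliberately
ignores the stateful-function flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the documented layout; an assumption made for this sketch. */
#define KVM_CPUID_FLAG_SIGNIFCANT_INDEX 1

struct cpuid_entry2 {
	uint32_t function;
	uint32_t index;
	uint32_t flags;
	uint32_t eax;
	uint32_t ebx;
	uint32_t ecx;
	uint32_t edx;
	uint32_t padding[3];
};

/*
 * Hypothetical helper: return the first entry matching 'function'.
 * 'index' is compared only when the entry marks its index as significant.
 * Returns NULL when no entry matches.
 */
static inline const struct cpuid_entry2 *
find_cpuid_entry(const struct cpuid_entry2 *entries, uint32_t nent,
		 uint32_t function, uint32_t index)
{
	for (uint32_t i = 0; i < nent; i++) {
		const struct cpuid_entry2 *e = &entries[i];

		if (e->function != function)
			continue;
		if ((e->flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX) &&
		    e->index != index)
			continue;
		return e;
	}
	return NULL;
}
```

Userspace would typically run such a lookup over the array returned by
KVM_GET_SUPPORTED_CPUID before building the table passed to KVM_SET_CPUID2.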

5. The kvm_run structure

Application code obtains a pointer to the kvm_run structure by
+48 −4
@@ -77,10 +77,10 @@ Memory

Guest memory (gpa) is part of the user address space of the process that is
using kvm.  Userspace defines the translation between guest addresses and user
addresses (gpa->hva); note that two gpas may alias to the same gva, but not
addresses (gpa->hva); note that two gpas may alias to the same hva, but not
vice versa.

These gvas may be backed using any method available to the host: anonymous
These hvas may be backed using any method available to the host: anonymous
memory, file backed memory, and device memory.  Memory might be paged by the
host at any time.

@@ -161,7 +161,7 @@ Shadow pages contain the following information:
  role.cr4_pae:
    Contains the value of cr4.pae for which the page is valid (e.g. whether
    32-bit or 64-bit gptes are in use).
  role.cr4_nxe:
  role.nxe:
    Contains the value of efer.nxe for which the page is valid.
  role.cr0_wp:
    Contains the value of cr0.wp for which the page is valid.
@@ -180,7 +180,9 @@ Shadow pages contain the following information:
    guest pages as leaves.
  gfns:
    An array of 512 guest frame numbers, one for each present pte.  Used to
    perform a reverse map from a pte to a gfn.
    perform a reverse map from a pte to a gfn.  When role.direct is set, any
    element of this array can be calculated from the gfn field when needed; in
    this case, the array of gfns is not allocated.  See role.direct and gfn.
  slot_bitmap:
    A bitmap containing one bit per memory slot.  If the page contains a pte
    mapping a page from memory slot n, then bit n of slot_bitmap will be set
@@ -296,6 +298,48 @@ Host translation updates:
  - look up affected sptes through reverse map
  - drop (or update) translations

Emulating cr0.wp
================

If tdp is not enabled, the host must keep cr0.wp=1 so page write protection
works for the guest kernel, not guest userspace.  When the guest
cr0.wp=1, this does not present a problem.  However when the guest cr0.wp=0,
we cannot map the permissions for gpte.u=1, gpte.w=0 to any spte (the
semantics require allowing any guest kernel access plus user read access).

We handle this by mapping the permissions to two possible sptes, depending
on fault type:

- kernel write fault: spte.u=0, spte.w=1 (allows full kernel access,
  disallows user access)
- read fault: spte.u=1, spte.w=0 (allows full read access, disallows kernel
  write access)

(user write faults generate a #PF)
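
The choice between the two sptes can be sketched as a small function.  This
is only an illustration of the rule above for a guest pte with gpte.u=1,
gpte.w=0 while the guest has cr0.wp=0; the type and function names are
hypothetical and do not correspond to the actual mmu code.

```c
#include <stdbool.h>

/* Permission bits of the shadow pte chosen for the fault. */
struct spte_perms {
	bool user;	/* spte.u */
	bool writable;	/* spte.w */
};

enum fault_kind {
	FAULT_KERNEL_WRITE,
	FAULT_READ,
};

/*
 * Hypothetical helper: pick shadow permissions for gpte.u=1, gpte.w=0
 * under guest cr0.wp=0.  User write faults are not handled here because
 * they are reflected to the guest as a #PF.
 */
static inline struct spte_perms wp0_spte_perms(enum fault_kind kind)
{
	struct spte_perms p;

	if (kind == FAULT_KERNEL_WRITE) {
		/* full kernel access, no user access */
		p.user = false;
		p.writable = true;
	} else {
		/* full read access, no kernel write access */
		p.user = true;
		p.writable = false;
	}
	return p;
}
```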

Large pages
===========

The mmu supports all combinations of large and small guest and host pages.
Supported page sizes include 4k, 2M, 4M, and 1G.  4M pages are treated as
two separate 2M pages, on both guest and host, since the mmu always uses PAE
paging.

To instantiate a large spte, four constraints must be satisfied:

- the spte must point to a large host page
- the guest pte must be a large pte of at least equivalent size (if tdp is
  enabled, there is no guest pte and this condition is satisfied)
- if the spte will be writeable, the large page frame may not overlap any
  write-protected pages
- the guest page must be wholly contained by a single memory slot

To check the last two conditions, the mmu maintains a ->write_count set of
arrays for each memory slot and large page size.  Every write protected page
causes its write_count to be incremented, thus preventing instantiation of
a large spte.  The frames at the end of an unaligned memory slot have
artificially inflated ->write_counts so they can never be instantiated.
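
A minimal sketch of this bookkeeping for a single large page size (2M frames
of 512 4k pages) follows.  The structure, the fixed array bound and all
function names are assumptions made for illustration; the real mmu keeps one
such array per memory slot and per large page size.  The sketch also assumes
that a partially covered frame at either edge of the slot is inflated.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGES_PER_LARGE  512u	/* 4k pages per 2M frame */
#define MAX_LARGE_FRAMES 64u	/* arbitrary bound for this sketch */

struct slot_sketch {
	uint64_t base_gfn;
	uint64_t npages;
	int write_count[MAX_LARGE_FRAMES];
};

/* Index of the large frame containing gfn, relative to the slot start. */
static inline uint64_t large_idx(const struct slot_sketch *s, uint64_t gfn)
{
	return gfn / PAGES_PER_LARGE - s->base_gfn / PAGES_PER_LARGE;
}

/* Caller must keep npages > 0 and the slot within MAX_LARGE_FRAMES frames. */
static inline void slot_init(struct slot_sketch *s, uint64_t base_gfn,
			     uint64_t npages)
{
	uint64_t end = base_gfn + npages;

	memset(s, 0, sizeof(*s));
	s->base_gfn = base_gfn;
	s->npages = npages;
	/* Frames only partly inside the slot start out permanently blocked. */
	if (base_gfn % PAGES_PER_LARGE)
		s->write_count[0] = 1;
	if (end % PAGES_PER_LARGE)
		s->write_count[large_idx(s, end - 1)] = 1;
}

static inline void write_protect_page(struct slot_sketch *s, uint64_t gfn)
{
	s->write_count[large_idx(s, gfn)]++;
}

static inline void write_unprotect_page(struct slot_sketch *s, uint64_t gfn)
{
	s->write_count[large_idx(s, gfn)]--;
}

/* A writeable large spte is allowed only when nothing in the frame counts. */
static inline bool large_frame_allowed(const struct slot_sketch *s,
				       uint64_t gfn)
{
	return s->write_count[large_idx(s, gfn)] == 0;
}
```

A single counter per frame covers both of the last two constraints: a
write-protected page and a slot boundary each make the count non-zero.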

Further reading
===============

+153 −0
KVM-specific MSRs.
Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
=====================================================

KVM makes use of some custom MSRs to service some requests.
At present, this facility is only used by kvmclock.

Custom MSRs have a range reserved for them, that goes from
0x4b564d00 to 0x4b564dff. There are MSRs outside this area,
but they are deprecated and their use is discouraged.
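
As a small illustration, a guest or a test tool could classify an MSR index
against the reserved range as sketched below.  The macro and function names
are hypothetical; only the two range bounds come from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bounds of the reserved range; the macro names are illustrative only. */
#define KVM_CUSTOM_MSR_FIRST 0x4b564d00u
#define KVM_CUSTOM_MSR_LAST  0x4b564dffu

/* True if msr lies inside the range reserved for KVM custom MSRs. */
static inline bool msr_in_kvm_custom_range(uint32_t msr)
{
	return msr >= KVM_CUSTOM_MSR_FIRST && msr <= KVM_CUSTOM_MSR_LAST;
}
```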

Custom MSR list
---------------

The current supported Custom MSR list is:

MSR_KVM_WALL_CLOCK_NEW:   0x4b564d00

	data: 4-byte aligned physical address of a memory area which must be
	in guest RAM. This memory is expected to hold a copy of the following
	structure:

	struct pvclock_wall_clock {
		u32   version;
		u32   sec;
		u32   nsec;
	} __attribute__((__packed__));

	whose data will be filled in by the hypervisor. The hypervisor is only
	guaranteed to update this data at the moment of MSR write.
	Users that want to reliably query this information more than once have
	to write more than once to this MSR. Fields have the following meanings:

		version: guest has to check version before and after grabbing
		time information and check that they are both equal and even.
		An odd version indicates an in-progress update.

		sec: number of seconds for wallclock.

		nsec: number of nanoseconds for wallclock.

	Note that although MSRs are per-CPU entities, the effect of this
	particular MSR is global.

	Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
	leaf prior to usage.

MSR_KVM_SYSTEM_TIME_NEW:  0x4b564d01

	data: 4-byte aligned physical address of a memory area which must be in
	guest RAM, plus an enable bit in bit 0. This memory is expected to hold
	a copy of the following structure:

	struct pvclock_vcpu_time_info {
		u32   version;
		u32   pad0;
		u64   tsc_timestamp;
		u64   system_time;
		u32   tsc_to_system_mul;
		s8    tsc_shift;
		u8    flags;
		u8    pad[2];
	} __attribute__((__packed__)); /* 32 bytes */

	whose data will be filled in by the hypervisor periodically. Only one
	write, or registration, is needed for each VCPU. The interval between
	updates of this structure is arbitrary and implementation-dependent.
	The hypervisor may update this structure at any time it sees fit until
	anything with bit0 == 0 is written to it.

	Fields have the following meanings:

		version: guest has to check version before and after grabbing
		time information and check that they are both equal and even.
		An odd version indicates an in-progress update.

		tsc_timestamp: the tsc value at the current VCPU at the time
		of the update of this structure. Guests can subtract this value
		from current tsc to derive a notion of elapsed time since the
		structure update.

		system_time: a host notion of monotonic time, including sleep
		time at the time this structure was last updated. Unit is
		nanoseconds.

		tsc_to_system_mul: a function of the tsc frequency. One has
		to multiply any tsc-related quantity by this value to get
		a value in nanoseconds, besides dividing by 2^tsc_shift.

		tsc_shift: cycle to nanosecond divider, as a power of two, to
		allow for shift rights. One has to shift right any tsc-related
		quantity by this value to get a value in nanoseconds, besides
		multiplying by tsc_to_system_mul.

		With this information, guests can derive per-CPU time by
		doing:

			time = (current_tsc - tsc_timestamp)
			time = (time * tsc_to_system_mul) >> tsc_shift
			time = time + system_time

		flags: bits in this field indicate extended capabilities
		coordinated between the guest and the hypervisor. Availability
		of specific flags has to be checked in 0x40000001 cpuid leaf.
		Current flags are:

		 flag bit   | cpuid bit    | meaning
		-------------------------------------------------------------
			    |	           | time measures taken across
		     0      |	   24      | multiple cpus are guaranteed to
			    |		   | be monotonic
		-------------------------------------------------------------

	Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
	leaf prior to usage.
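
The version check and the time derivation described above can be sketched in
C as follows.  The struct is a local mirror of the documented layout using
<stdint.h> types, and both function names are hypothetical.  The computation
transcribes the three-step formula as written; it assumes tsc_shift is not
negative and that the 64-bit product does not overflow, and it does not try
to settle the exact fixed-point scaling, which the text leaves unspecified.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of the documented structure (32 bytes when packed). */
struct pvclock_vcpu_time_info {
	uint32_t version;
	uint32_t pad0;
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
	uint8_t  flags;
	uint8_t  pad[2];
} __attribute__((__packed__));

/*
 * A snapshot is usable only if the version read before and after copying
 * the fields is the same and even; an odd value means an update was in
 * progress.
 */
static inline bool pvclock_snapshot_ok(uint32_t before, uint32_t after)
{
	return before == after && (before & 1u) == 0;
}

/*
 * Literal transcription of the formula in the text:
 *   time = current_tsc - tsc_timestamp
 *   time = (time * tsc_to_system_mul) >> tsc_shift
 *   time = time + system_time
 * Assumes 0 <= tsc_shift < 64 and no overflow of the product.
 */
static inline uint64_t
pvclock_time_ns(const struct pvclock_vcpu_time_info *ti, uint64_t current_tsc)
{
	uint64_t time = current_tsc - ti->tsc_timestamp;

	time = (time * ti->tsc_to_system_mul) >> ti->tsc_shift;
	return time + ti->system_time;
}
```

A guest would read version, copy the fields, read version again, and retry
until pvclock_snapshot_ok() holds before trusting the computed time.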


MSR_KVM_WALL_CLOCK:  0x11

	data and functioning: same as MSR_KVM_WALL_CLOCK_NEW. Use that instead.

	This MSR falls outside the reserved KVM range and may be removed in the
	future. Its usage is deprecated.

	Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
	leaf prior to usage.

MSR_KVM_SYSTEM_TIME: 0x12

	data and functioning: same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead.

	This MSR falls outside the reserved KVM range and may be removed in the
	future. Its usage is deprecated.

	Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
	leaf prior to usage.

	The suggested algorithm for detecting kvmclock presence is then:

		if (!kvm_para_available())    /* refer to cpuid.txt */
			return NON_PRESENT;

		flags = cpuid_eax(0x40000001);
		if (flags & (1 << 3)) {		/* bit 3: new MSRs */
			msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
			msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
			return PRESENT;
		} else if (flags & (1 << 0)) {	/* bit 0: legacy MSRs */
			msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
			msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
			return PRESENT;
		} else
			return NON_PRESENT;
+38 −0
Review checklist for kvm patches
================================

1.  The patch must follow Documentation/CodingStyle and
    Documentation/SubmittingPatches.

2.  Patches should be against kvm.git master branch.

3.  If the patch introduces or modifies a new userspace API:
    - the API must be documented in Documentation/kvm/api.txt
    - the API must be discoverable using KVM_CHECK_EXTENSION

4.  New state must include support for save/restore.

5.  New features must default to off (userspace should explicitly request them).
    Performance improvements can and should default to on.

6.  New cpu features should be exposed via KVM_GET_SUPPORTED_CPUID2

7.  Emulator changes should be accompanied by unit tests for qemu-kvm.git
    kvm/test directory.

8.  Changes should be vendor neutral when possible.  Changes to common code
    are better than duplicating changes to vendor code.

9.  Similarly, prefer changes to arch independent code over changes to arch
    dependent code.

10. User/kernel interfaces and guest/host interfaces must be 64-bit clean
    (all variables and sizes naturally aligned on 64-bit; use specific types
    only - u64 rather than ulong).

11. New guest visible features must either be documented in a hardware manual
    or be accompanied by documentation.

12. Features must be robust against reset and kexec - for example, shared
    host/guest memory must be unshared to prevent the host from writing to
    guest memory that the guest has not reserved for this purpose.
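
As an illustration of item 10, the sketch below shows one way to check at
compile time that an interface structure is 64-bit clean.  The structure name
and its fields are hypothetical and exist only for this example; the point is
the use of fixed-width types and explicit layout assertions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interface structure; name and fields are illustrative only. */
struct example_region {
	uint32_t slot;
	uint32_t flags;
	uint64_t guest_phys_addr;
	uint64_t size;
};

/*
 * The layout must be identical on 32-bit and 64-bit builds: no
 * 'unsigned long', no implicit padding, and every 64-bit field at an
 * 8-byte aligned offset.
 */
_Static_assert(sizeof(struct example_region) == 24,
	       "unexpected padding in example_region");
_Static_assert(offsetof(struct example_region, guest_phys_addr) % 8 == 0,
	       "guest_phys_addr must be 8-byte aligned");
_Static_assert(offsetof(struct example_region, size) % 8 == 0,
	       "size must be 8-byte aligned");
```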