Merge branch 'kvm-updates/2.6.36' of git://git.kernel.org/pub/scm/virt/kvm/kvm (5e83f6fb)

Documentation/feature-removal-schedule.txt

+0 −21

Original line number	Diff line number	Diff line
		@@ -487,17 +487,6 @@ Who: Jan Kiszka <jan.kiszka@web.de>

		----------------------------

		What: KVM memory aliases support
		When: July 2010
		Why: Memory aliasing support is used for speeding up guest vga access
		through the vga windows.

		Modern userspace no longer uses this feature, so it's just bitrotted
		code and can be removed with no impact.
		Who: Avi Kivity <avi@redhat.com>

		----------------------------

		What: xtime, wall_to_monotonic
		When: 2.6.36+
		Files: kernel/time/timekeeping.c include/linux/time.h
		@@ -508,16 +497,6 @@ Who: John Stultz <johnstul@us.ibm.com>

		----------------------------

		What: KVM kernel-allocated memory slots
		When: July 2010
		Why: Since 2.6.25, kvm supports user-allocated memory slots, which are
		much more flexible than kernel-allocated slots. All current userspace
		supports the newer interface and this code can be removed with no
		impact.
		Who: Avi Kivity <avi@redhat.com>

		----------------------------

		What: KVM paravirt mmu host support
		When: January 2011
		Why: The paravirt mmu host support is slower than non-paravirt mmu, both

Documentation/kvm/api.txt

+174 −34

		@@ -126,6 +126,10 @@ user fills in the size of the indices array in nmsrs, and in return
		kvm adjusts nmsrs to reflect the actual number of msrs and fills in
		the indices array with their numbers.

		Note: if kvm indicates support for MCE (KVM_CAP_MCE), then the MCE bank MSRs are
		not returned in the MSR list, as different vcpus can have a different number
		of banks, as set via the KVM_X86_SETUP_MCE ioctl.

		4.4 KVM_CHECK_EXTENSION

		Capability: basic
		@@ -160,29 +164,7 @@ Type: vm ioctl
		Parameters: struct kvm_memory_region (in)
		Returns: 0 on success, -1 on error

		struct kvm_memory_region {
		__u32 slot;
		__u32 flags;
		__u64 guest_phys_addr;
		__u64 memory_size; /* bytes */
		};

		/* for kvm_memory_region::flags */
		#define KVM_MEM_LOG_DIRTY_PAGES 1UL

		This ioctl allows the user to create or modify a guest physical memory
		slot. When changing an existing slot, it may be moved in the guest
		physical memory space, or its flags may be modified. It may not be
		resized. Slots may not overlap.

		The flags field supports just one flag, KVM_MEM_LOG_DIRTY_PAGES, which
		instructs kvm to keep track of writes to memory within the slot. See
		the KVM_GET_DIRTY_LOG ioctl.

		It is recommended to use the KVM_SET_USER_MEMORY_REGION ioctl instead
		of this API, if available. This newer API allows placing guest memory
		at specified locations in the host address space, yielding better
		control and easy access.
		This ioctl is obsolete and has been removed.

		4.6 KVM_CREATE_VCPU

		@@ -226,17 +208,7 @@ Type: vm ioctl
		Parameters: struct kvm_memory_alias (in)
		Returns: 0 (success), -1 (error)

		struct kvm_memory_alias {
		__u32 slot; /* this has a different namespace than memory slots */
		__u32 flags;
		__u64 guest_phys_addr;
		__u64 memory_size;
		__u64 target_phys_addr;
		};

		Defines a guest physical address space region as an alias to another
		region. Useful for aliased address, for example the VGA low memory
		window. Should not be used with userspace memory.
		This ioctl is obsolete and has been removed.

		4.9 KVM_RUN

		@@ -892,6 +864,174 @@ arguments.
		This ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel
		irqchip, the multiprocessing state must be maintained by userspace.

		4.39 KVM_SET_IDENTITY_MAP_ADDR

		Capability: KVM_CAP_SET_IDENTITY_MAP_ADDR
		Architectures: x86
		Type: vm ioctl
		Parameters: unsigned long identity (in)
		Returns: 0 on success, -1 on error

		This ioctl defines the physical address of a one-page region in the guest
		physical address space. The region must be within the first 4GB of the
		guest physical address space and must not conflict with any memory slot
		or any mmio address. The guest may malfunction if it accesses this memory
		region.

		This ioctl is required on Intel-based hosts. This is needed on Intel hardware
		because of a quirk in the virtualization implementation (see the internals
		documentation when it pops into existence).
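The placement constraints above can be checked in userspace before the ioctl is issued. The following is a minimal sketch, not part of the KVM API: `struct slot_range` and `identity_map_addr_ok()` are hypothetical names, the page size is assumed to be 4k, and mmio ranges are not checked because userspace tracks them in its own way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_4K 4096ULL        /* assumed size of the one-page region */
#define FOUR_GB      (1ULL << 32)

/* Hypothetical userspace record of a registered memory slot. */
struct slot_range {
	uint64_t guest_phys_addr;
	uint64_t memory_size;       /* bytes */
};

/*
 * Return true if a one-page region at 'addr' lies entirely below 4GB
 * and does not overlap any of the given memory slots.
 */
static bool identity_map_addr_ok(uint64_t addr,
				 const struct slot_range *slots, size_t n)
{
	size_t i;

	assert(slots != NULL || n == 0);

	if (addr > FOUR_GB - PAGE_SIZE_4K)
		return false;

	for (i = 0; i < n; i++) {
		uint64_t start = slots[i].guest_phys_addr;
		uint64_t end = start + slots[i].memory_size;

		/* half-open interval overlap test */
		if (addr < end && start < addr + PAGE_SIZE_4K)
			return false;
	}
	return true;
}
```

An address that passes this check would then be handed to KVM_SET_IDENTITY_MAP_ADDR on the vm file descriptor.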

		4.40 KVM_SET_BOOT_CPU_ID

		Capability: KVM_CAP_SET_BOOT_CPU_ID
		Architectures: x86, ia64
		Type: vm ioctl
		Parameters: unsigned long vcpu_id
		Returns: 0 on success, -1 on error

		Define which vcpu is the Bootstrap Processor (BSP). Values are the same
		as the vcpu id in KVM_CREATE_VCPU. If this ioctl is not called, the default
		is vcpu 0.

		4.41 KVM_GET_XSAVE

		Capability: KVM_CAP_XSAVE
		Architectures: x86
		Type: vcpu ioctl
		Parameters: struct kvm_xsave (out)
		Returns: 0 on success, -1 on error

		struct kvm_xsave {
		__u32 region[1024];
		};

		This ioctl copies the current vcpu's xsave struct to userspace.

		4.42 KVM_SET_XSAVE

		Capability: KVM_CAP_XSAVE
		Architectures: x86
		Type: vcpu ioctl
		Parameters: struct kvm_xsave (in)
		Returns: 0 on success, -1 on error

		struct kvm_xsave {
		__u32 region[1024];
		};

		This ioctl copies userspace's xsave struct to the kernel.

		4.43 KVM_GET_XCRS

		Capability: KVM_CAP_XCRS
		Architectures: x86
		Type: vcpu ioctl
		Parameters: struct kvm_xcrs (out)
		Returns: 0 on success, -1 on error

		struct kvm_xcr {
		__u32 xcr;
		__u32 reserved;
		__u64 value;
		};

		struct kvm_xcrs {
		__u32 nr_xcrs;
		__u32 flags;
		struct kvm_xcr xcrs[KVM_MAX_XCRS];
		__u64 padding[16];
		};

		This ioctl copies the current vcpu's xcrs to userspace.

		4.44 KVM_SET_XCRS

		Capability: KVM_CAP_XCRS
		Architectures: x86
		Type: vcpu ioctl
		Parameters: struct kvm_xcrs (in)
		Returns: 0 on success, -1 on error

		struct kvm_xcr {
		__u32 xcr;
		__u32 reserved;
		__u64 value;
		};

		struct kvm_xcrs {
		__u32 nr_xcrs;
		__u32 flags;
		struct kvm_xcr xcrs[KVM_MAX_XCRS];
		__u64 padding[16];
		};

		This ioctl sets the vcpu's xcrs to the values specified by userspace.

		4.45 KVM_GET_SUPPORTED_CPUID

		Capability: KVM_CAP_EXT_CPUID
		Architectures: x86
		Type: system ioctl
		Parameters: struct kvm_cpuid2 (in/out)
		Returns: 0 on success, -1 on error

		struct kvm_cpuid2 {
		__u32 nent;
		__u32 padding;
		struct kvm_cpuid_entry2 entries[0];
		};

		#define KVM_CPUID_FLAG_SIGNIFCANT_INDEX 1
		#define KVM_CPUID_FLAG_STATEFUL_FUNC 2
		#define KVM_CPUID_FLAG_STATE_READ_NEXT 4

		struct kvm_cpuid_entry2 {
		__u32 function;
		__u32 index;
		__u32 flags;
		__u32 eax;
		__u32 ebx;
		__u32 ecx;
		__u32 edx;
		__u32 padding[3];
		};

		This ioctl returns x86 cpuid features which are supported by both the hardware
		and kvm. Userspace can use the information returned by this ioctl to
		construct cpuid information (for KVM_SET_CPUID2) that is consistent with
		hardware, kernel, and userspace capabilities, and with user requirements (for
		example, the user may wish to constrain cpuid to emulate older hardware,
		or for feature consistency across a cluster).

		Userspace invokes KVM_GET_SUPPORTED_CPUID by passing a kvm_cpuid2 structure
		with the 'nent' field indicating the number of entries in the variable-size
		array 'entries'. If the number of entries is too low to describe the cpu
		capabilities, an error (E2BIG) is returned. If the number is too high,
		the 'nent' field is adjusted and an error (ENOMEM) is returned. If the
		number is just right, the 'nent' field is adjusted to the number of valid
		entries in the 'entries' array, which is then filled.

		The entries returned are the host cpuid as returned by the cpuid instruction,
		with unknown or unsupported features masked out. The fields in each entry
		are defined as follows:

		function: the eax value used to obtain the entry
		index: the ecx value used to obtain the entry (for entries that are
		affected by ecx)
		flags: an OR of zero or more of the following:
		KVM_CPUID_FLAG_SIGNIFCANT_INDEX:
		if the index field is valid
		KVM_CPUID_FLAG_STATEFUL_FUNC:
		if cpuid for this function returns different values for successive
		invocations; there will be several entries with the same function,
		all with this flag set
		KVM_CPUID_FLAG_STATE_READ_NEXT:
		for KVM_CPUID_FLAG_STATEFUL_FUNC entries, set if this entry is
		the first entry to be read by a cpu
		eax, ebx, ecx, edx: the values returned by the cpuid instruction for
		this function/index combination
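Once the entries array has been filled, userspace typically needs to locate the entry for a given leaf. The sketch below shows one way to honor the index flag described above: the index is compared only when KVM_CPUID_FLAG_SIGNIFCANT_INDEX is set. The struct is a local mirror of kvm_cpuid_entry2 so the example compiles without kernel headers, and `find_cpuid_entry()` is a hypothetical helper; stateful functions are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the flag quoted above (spelling as in the header). */
#define KVM_CPUID_FLAG_SIGNIFCANT_INDEX 1

/* Local mirror of struct kvm_cpuid_entry2. */
struct cpuid_entry2 {
	uint32_t function;
	uint32_t index;
	uint32_t flags;
	uint32_t eax, ebx, ecx, edx;
	uint32_t padding[3];
};

/*
 * Return the first entry matching 'function'. The 'index' argument is
 * compared only for entries whose index field is marked as valid.
 * Returns NULL if no entry matches.
 */
static const struct cpuid_entry2 *
find_cpuid_entry(const struct cpuid_entry2 *entries, uint32_t nent,
		 uint32_t function, uint32_t index)
{
	uint32_t i;

	assert(entries != NULL || nent == 0);

	for (i = 0; i < nent; i++) {
		const struct cpuid_entry2 *e = &entries[i];

		if (e->function != function)
			continue;
		if ((e->flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX) &&
		    e->index != index)
			continue;
		return e;
	}
	return NULL;
}
```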

		5. The kvm_run structure

		Application code obtains a pointer to the kvm_run structure by

Documentation/kvm/mmu.txt

+48 −4

		@@ -77,10 +77,10 @@ Memory

		Guest memory (gpa) is part of the user address space of the process that is
		using kvm. Userspace defines the translation between guest addresses and user
		addresses (gpa->hva); note that two gpas may alias to the same gva, but not
		addresses (gpa->hva); note that two gpas may alias to the same hva, but not
		vice versa.

		These gvas may be backed using any method available to the host: anonymous
		These hvas may be backed using any method available to the host: anonymous
		memory, file backed memory, and device memory. Memory might be paged by the
		host at any time.

		@@ -161,7 +161,7 @@ Shadow pages contain the following information:
		role.cr4_pae:
		Contains the value of cr4.pae for which the page is valid (e.g. whether
		32-bit or 64-bit gptes are in use).
		role.cr4_nxe:
		role.nxe:
		Contains the value of efer.nxe for which the page is valid.
		role.cr0_wp:
		Contains the value of cr0.wp for which the page is valid.
		@@ -180,7 +180,9 @@ Shadow pages contain the following information:
		guest pages as leaves.
		gfns:
		An array of 512 guest frame numbers, one for each present pte. Used to
		perform a reverse map from a pte to a gfn.
		perform a reverse map from a pte to a gfn. When role.direct is set, any
		element of this array can be calculated from the gfn field when used; in
		this case, the array of gfns is not allocated. See role.direct and gfn.
		slot_bitmap:
		A bitmap containing one bit per memory slot. If the page contains a pte
		mapping a page from memory slot n, then bit n of slot_bitmap will be set
		@@ -296,6 +298,48 @@ Host translation updates:
		- look up affected sptes through reverse map
		- drop (or update) translations

		Emulating cr0.wp
		================

		If tdp is not enabled, the host must keep cr0.wp=1 so page write protection
		works for the guest kernel, not just guest userspace. When the guest
		cr0.wp=1, this does not present a problem. However when the guest cr0.wp=0,
		we cannot map the permissions for gpte.u=1, gpte.w=0 to any spte (the
		semantics require allowing any guest kernel access plus user read access).

		We handle this by mapping the permissions to two possible sptes, depending
		on fault type:

		- kernel write fault: spte.u=0, spte.w=1 (allows full kernel access,
		disallows user access)
		- read fault: spte.u=1, spte.w=0 (allows full read access, disallows kernel
		write access)

		(user write faults generate a #PF)
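The fault-dependent choice above can be written down as a small decision function. This is an illustrative sketch only; the type and function names are hypothetical and do not correspond to kvm's internal code. It covers the single case discussed here: a gpte with u=1, w=0 while the guest runs with cr0.wp=0.

```c
#include <assert.h>
#include <stdbool.h>

struct spte_perms {
	bool user;      /* spte.u */
	bool writable;  /* spte.w */
};

enum fault_kind { FAULT_READ, FAULT_KERNEL_WRITE };

/* Pick the spte permissions for gpte.u=1, gpte.w=0 under guest cr0.wp=0. */
static struct spte_perms wp0_spte_perms(enum fault_kind kind)
{
	struct spte_perms p;

	if (kind == FAULT_KERNEL_WRITE) {
		/* full kernel access, no user access */
		p.user = false;
		p.writable = true;
	} else {
		/* read access for everyone, no kernel write access */
		p.user = true;
		p.writable = false;
	}
	return p;
}
```

Neither spte grants user write access, which is why a user write fault is simply forwarded to the guest as a #PF.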

		Large pages
		===========

		The mmu supports all combinations of large and small guest and host pages.
		Supported page sizes include 4k, 2M, 4M, and 1G. 4M pages are treated as
		two separate 2M pages, on both guest and host, since the mmu always uses PAE
		paging.

		To instantiate a large spte, four constraints must be satisfied:

		- the spte must point to a large host page
		- the guest pte must be a large pte of at least equivalent size (if tdp is
		enabled, there is no guest pte and this condition is satisfied)
		- if the spte will be writeable, the large page frame may not overlap any
		write-protected pages
		- the guest page must be wholly contained by a single memory slot

		To check the last two conditions, the mmu maintains a ->write_count set of
		arrays for each memory slot and large page size. Every write protected page
		causes its write_count to be incremented, thus preventing instantiation of
		a large spte. The frames at the end of an unaligned memory slot have
		artificially inflated ->write_counts so they can never be instantiated.
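The write_count bookkeeping can be modelled in a few lines. The sketch below is a simplified userspace model under stated assumptions, not the kernel's data structure: it tracks only the 2M page size, `struct slot_model` and all function names are hypothetical, and both an unaligned start and an unaligned end of the slot are treated as inflated.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define GFNS_PER_2M 512ULL  /* number of 4k frames in one 2M frame */

struct slot_model {
	uint64_t base_gfn;
	uint64_t npages;
	uint64_t nframes;          /* 2M frames touched by the slot */
	unsigned int *write_count; /* one counter per 2M frame */
};

static uint64_t frame_index(const struct slot_model *s, uint64_t gfn)
{
	return gfn / GFNS_PER_2M - s->base_gfn / GFNS_PER_2M;
}

/* Allocate the counters and inflate partially covered frames. */
static int slot_init(struct slot_model *s, uint64_t base_gfn, uint64_t npages)
{
	uint64_t end_gfn = base_gfn + npages;

	assert(npages > 0);
	s->base_gfn = base_gfn;
	s->npages = npages;
	s->nframes = (end_gfn - 1) / GFNS_PER_2M - base_gfn / GFNS_PER_2M + 1;
	s->write_count = calloc(s->nframes, sizeof(*s->write_count));
	if (!s->write_count)
		return -1;
	if (base_gfn % GFNS_PER_2M)
		s->write_count[0] = 1;
	if (end_gfn % GFNS_PER_2M)
		s->write_count[s->nframes - 1] = 1;
	return 0;
}

/* Called when the 4k page at 'gfn' becomes write protected. */
static void account_write_protect(struct slot_model *s, uint64_t gfn)
{
	s->write_count[frame_index(s, gfn)]++;
}

/* A writeable large spte is allowed only if nothing in the frame is protected. */
static int large_spte_allowed(const struct slot_model *s, uint64_t gfn)
{
	return s->write_count[frame_index(s, gfn)] == 0;
}
```

Because the inflated counters are never decremented, a single comparison against zero covers both the write-protection check and the slot-containment check.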

		Further reading
		===============

Documentation/kvm/msr.txt

0 → 100644

+153 −0

		KVM-specific MSRs.
		Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
		=====================================================

		KVM makes use of some custom MSRs to service some requests.
		At present, this facility is only used by kvmclock.

		Custom MSRs have a range reserved for them, that goes from
		0x4b564d00 to 0x4b564dff. There are MSRs outside this area,
		but they are deprecated and their use is discouraged.

		Custom MSR list
		---------------

		The current supported Custom MSR list is:

		MSR_KVM_WALL_CLOCK_NEW: 0x4b564d00

		data: 4-byte aligned physical address of a memory area which must be
		in guest RAM. This memory is expected to hold a copy of the following
		structure:

		struct pvclock_wall_clock {
		u32 version;
		u32 sec;
		u32 nsec;
		} __attribute__((__packed__));

		whose data will be filled in by the hypervisor. The hypervisor is only
		guaranteed to update this data at the moment of MSR write.
		Users that want to reliably query this information more than once have
		to write more than once to this MSR. Fields have the following meanings:

		version: guest has to check version before and after grabbing
		time information and check that they are both equal and even.
		An odd version indicates an in-progress update.

		sec: number of seconds for wallclock.

		nsec: number of nanoseconds for wallclock.

		Note that although MSRs are per-CPU entities, the effect of this
		particular MSR is global.

		Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
		leaf prior to usage.

		MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01

		data: 4-byte aligned physical address of a memory area which must be in
		guest RAM, plus an enable bit in bit 0. This memory is expected to hold
		a copy of the following structure:

		struct pvclock_vcpu_time_info {
		u32 version;
		u32 pad0;
		u64 tsc_timestamp;
		u64 system_time;
		u32 tsc_to_system_mul;
		s8 tsc_shift;
		u8 flags;
		u8 pad[2];
		} __attribute__((__packed__)); /* 32 bytes */

		whose data will be filled in by the hypervisor periodically. Only one
		write, or registration, is needed for each VCPU. The interval between
		updates of this structure is arbitrary and implementation-dependent.
		The hypervisor may update this structure at any time it sees fit until
		anything with bit0 == 0 is written to it.

		Fields have the following meanings:

		version: guest has to check version before and after grabbing
		time information and check that they are both equal and even.
		An odd version indicates an in-progress update.

		tsc_timestamp: the tsc value at the current VCPU at the time
		of the update of this structure. Guests can subtract this value
		from current tsc to derive a notion of elapsed time since the
		structure update.

		system_time: a host notion of monotonic time, including sleep
		time at the time this structure was last updated. Unit is
		nanoseconds.

		tsc_to_system_mul: multiplier to be used when converting a
		tsc-related quantity to nanoseconds. It is a 32-bit binary
		fraction, so the product has to be shifted right by 32 bits.

		tsc_shift: shift to be used when converting a tsc-related
		quantity to nanoseconds. The shift is applied before the
		multiplication by tsc_to_system_mul. A positive value denotes
		a left shift, a negative value a right shift.

		With this information, guests can derive per-CPU time by
		doing:

		time = (current_tsc - tsc_timestamp)
		if (tsc_shift >= 0)
			time <<= tsc_shift;
		else
			time >>= -tsc_shift;
		time = (time * tsc_to_system_mul) >> 32
		time = time + system_time

		flags: bits in this field indicate extended capabilities
		coordinated between the guest and the hypervisor. Availability
		of specific flags has to be checked in 0x40000001 cpuid leaf.
		Current flags are:

		flag bit | cpuid bit | meaning
		-------------------------------------------------------------
		         |           | time measures taken across
		   0     |    24     | multiple cpus are guaranteed to
		         |           | be monotonic
		-------------------------------------------------------------

		Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
		leaf prior to usage.
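A guest-side read of this structure combines the version check with the time conversion. The following is a sketch under stated assumptions rather than a reference implementation: it assumes tsc_to_system_mul is a 32-bit binary fraction (so the product is shifted right by 32) and that a negative tsc_shift means a right shift. `scale_delta()` and `pvclock_read_ns()` are hypothetical names, the tsc value is passed in by the caller, and C11 fences stand in for the read barriers a real guest would use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct pvclock_vcpu_time_info {
	uint32_t version;
	uint32_t pad0;
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
	uint8_t  flags;
	uint8_t  pad[2];
} __attribute__((__packed__)); /* 32 bytes */

/* Apply the shift, then compute (delta * mul) >> 32 in 64-bit arithmetic. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	if (shift >= 0)
		delta <<= shift;
	else
		delta >>= -shift;

	/* Split delta so that neither partial product exceeds 64 bits. */
	return (delta >> 32) * mul + (((delta & 0xffffffffULL) * mul) >> 32);
}

/*
 * Return nanoseconds for 'current_tsc'. Retries while the version is
 * odd (update in progress) or changes during the read.
 */
static uint64_t pvclock_read_ns(const volatile struct pvclock_vcpu_time_info *ti,
				uint64_t current_tsc)
{
	uint32_t v1, v2;
	uint64_t ns;

	do {
		v1 = ti->version;
		atomic_thread_fence(memory_order_acquire);
		ns = ti->system_time +
		     scale_delta(current_tsc - ti->tsc_timestamp,
				 ti->tsc_to_system_mul, ti->tsc_shift);
		atomic_thread_fence(memory_order_acquire);
		v2 = ti->version;
	} while ((v1 & 1) || v1 != v2);

	return ns;
}
```

Taking the tsc as a parameter keeps the sketch testable; a real guest would sample the tsc inside the retry loop so that the sample and the structure contents belong to the same update.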


		MSR_KVM_WALL_CLOCK: 0x11

		data and functioning: same as MSR_KVM_WALL_CLOCK_NEW. Use that instead.

		This MSR falls outside the reserved KVM range and may be removed in the
		future. Its usage is deprecated.

		Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
		leaf prior to usage.

		MSR_KVM_SYSTEM_TIME: 0x12

		data and functioning: same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead.

		This MSR falls outside the reserved KVM range and may be removed in the
		future. Its usage is deprecated.

		Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
		leaf prior to usage.

		The suggested algorithm for detecting kvmclock presence is then:

		if (!kvm_para_available())      /* refer to cpuid.txt */
			return NON_PRESENT;

		flags = cpuid_eax(0x40000001);
		if (flags & (1 << 3)) {         /* bit 3: new MSRs */
			msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
			msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
			return PRESENT;
		} else if (flags & (1 << 0)) {  /* bit 0: deprecated MSRs */
			msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
			msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
			return PRESENT;
		} else
			return NON_PRESENT;

Documentation/kvm/review-checklist.txt

0 → 100644

+38 −0

		Review checklist for kvm patches
		================================

		1. The patch must follow Documentation/CodingStyle and
		Documentation/SubmittingPatches.

		2. Patches should be against kvm.git master branch.

		3. If the patch introduces a new userspace API or modifies an existing one:
		- the API must be documented in Documentation/kvm/api.txt
		- the API must be discoverable using KVM_CHECK_EXTENSION

		4. New state must include support for save/restore.

		5. New features must default to off (userspace should explicitly request them).
		Performance improvements can and should default to on.

		6. New cpu features should be exposed via KVM_GET_SUPPORTED_CPUID.

		7. Emulator changes should be accompanied by unit tests in the qemu-kvm.git
		kvm/test directory.

		8. Changes should be vendor neutral when possible. Changes to common code
		are better than duplicating changes to vendor code.

		9. Similarly, prefer changes to arch independent code over changes to arch
		dependent code.

		10. User/kernel interfaces and guest/host interfaces must be 64-bit clean
		(all variables and sizes naturally aligned on 64-bit; use specific types
		only - u64 rather than ulong).

		11. New guest visible features must either be documented in a hardware manual
		or be accompanied by documentation.

		12. Features must be robust against reset and kexec - for example, shared
		host/guest memory must be unshared to prevent the host from writing to
		guest memory that the guest has not reserved for this purpose.
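As an illustration of item 10, the sketch below lays out a made-up interface structure with fixed-width types, natural alignment, and explicit padding, and checks the layout at compile time. `struct example_region_args` is hypothetical and exists only for this example; a kernel header would use the __u32/__u64 types rather than <stdint.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ioctl argument, for illustration only. */
struct example_region_args {
	uint32_t slot;
	uint32_t flags;
	uint64_t guest_phys_addr;  /* naturally aligned at offset 8 */
	uint64_t size;             /* bytes */
	uint32_t nr_entries;
	uint32_t pad;              /* explicit padding, size stays a multiple of 8 */
};

/* The layout must be identical for 32-bit and 64-bit userspace. */
_Static_assert(sizeof(struct example_region_args) == 32,
	       "unexpected structure size");
_Static_assert(offsetof(struct example_region_args, guest_phys_addr) == 8,
	       "guest_phys_addr is not naturally aligned");
_Static_assert(offsetof(struct example_region_args, nr_entries) == 24,
	       "unexpected offset of nr_entries");
```

With no implicit padding and no `unsigned long` members, the same binary layout is seen by 32-bit userspace on a 64-bit kernel, so no compat translation is needed.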