
Commit dcb2674a authored by Greg Kroah-Hartman

Merge 4.9.77 into android-4.9-o



Changes in 4.9.77
	dm bufio: fix shrinker scans when (nr_to_scan < retain_target)
	mac80211: Add RX flag to indicate ICV stripped
	ath10k: rebuild crypto header in rx data frames
	KVM: Fix stack-out-of-bounds read in write_mmio
	can: gs_usb: fix return value of the "set_bittiming" callback
	IB/srpt: Disable RDMA access by the initiator
	MIPS: Validate PR_SET_FP_MODE prctl(2) requests against the ABI of the task
	MIPS: Factor out NT_PRFPREG regset access helpers
	MIPS: Guard against any partial write attempt with PTRACE_SETREGSET
	MIPS: Consistently handle buffer counter with PTRACE_SETREGSET
	MIPS: Fix an FCSR access API regression with NT_PRFPREG and MSA
	MIPS: Also verify sizeof `elf_fpreg_t' with PTRACE_SETREGSET
	MIPS: Disallow outsized PTRACE_SETREGSET NT_PRFPREG regset accesses
	kvm: vmx: Scrub hardware GPRs at VM-exit
	platform/x86: wmi: Call acpi_wmi_init() later
	x86/acpi: Handle SCI interrupts above legacy space gracefully
	ALSA: pcm: Remove incorrect snd_BUG_ON() usages
	ALSA: pcm: Add missing error checks in OSS emulation plugin builder
	ALSA: pcm: Abort properly at pending signal in OSS read/write loops
	ALSA: pcm: Allow aborting mutex lock at OSS read/write loops
	ALSA: aloop: Release cable upon open error path
	ALSA: aloop: Fix inconsistent format due to incomplete rule
	ALSA: aloop: Fix racy hw constraints adjustment
	x86/acpi: Reduce code duplication in mp_override_legacy_irq()
	zswap: don't param_set_charp while holding spinlock
	lan78xx: use skb_cow_head() to deal with cloned skbs
	sr9700: use skb_cow_head() to deal with cloned skbs
	smsc75xx: use skb_cow_head() to deal with cloned skbs
	cx82310_eth: use skb_cow_head() to deal with cloned skbs
	xhci: Fix ring leak in failure path of xhci_alloc_virt_device()
	8021q: fix a memory leak for VLAN 0 device
	ip6_tunnel: disable dst caching if tunnel is dual-stack
	net: core: fix module type in sock_diag_bind
	RDS: Heap OOB write in rds_message_alloc_sgs()
	RDS: null pointer dereference in rds_atomic_free_op
	sh_eth: fix TSU resource handling
	sh_eth: fix SH7757 GEther initialization
	net: stmmac: enable EEE in MII, GMII or RGMII only
	ipv6: fix possible mem leaks in ipv6_make_skb()
	ethtool: do not print warning for applications using legacy API
	mlxsw: spectrum_router: Fix NULL pointer deref
	net/sched: Fix update of lastuse in act modules implementing stats_update
	crypto: algapi - fix NULL dereference in crypto_remove_spawns()
	rbd: set max_segments to USHRT_MAX
	x86/microcode/intel: Extend BDW late-loading with a revision check
	KVM: x86: Add memory barrier on vmcs field lookup
	drm/vmwgfx: Potential off by one in vmw_view_add()
	kaiser: Set _PAGE_NX only if supported
	iscsi-target: Make TASK_REASSIGN use proper se_cmd->cmd_kref
	target: Avoid early CMD_T_PRE_EXECUTE failures during ABORT_TASK
	bpf: move fixup_bpf_calls() function
	bpf: refactor fixup_bpf_calls()
	bpf: prevent out-of-bounds speculation
	bpf, array: fix overflow in max_entries and undefined behavior in index_mask
	USB: serial: cp210x: add IDs for LifeScan OneTouch Verio IQ
	USB: serial: cp210x: add new device ID ELV ALC 8xxx
	usb: misc: usb3503: make sure reset is low for at least 100us
	USB: fix usbmon BUG trigger
	usbip: remove kernel addresses from usb device and urb debug msgs
	usbip: fix vudc_rx: harden CMD_SUBMIT path to handle malicious input
	usbip: vudc_tx: fix v_send_ret_submit() vulnerability to null xfer buffer
	staging: android: ashmem: fix a race condition in ASHMEM_SET_SIZE ioctl
	Bluetooth: Prevent stack info leak from the EFS element.
	uas: ignore UAS for Norelsys NS1068(X) chips
	e1000e: Fix e1000_check_for_copper_link_ich8lan return value.
	x86/Documentation: Add PTI description
	x86/cpu: Factor out application of forced CPU caps
	x86/cpufeatures: Make CPU bugs sticky
	x86/cpufeatures: Add X86_BUG_CPU_INSECURE
	x86/pti: Rename BUG_CPU_INSECURE to BUG_CPU_MELTDOWN
	x86/cpufeatures: Add X86_BUG_SPECTRE_V[12]
	x86/cpu: Merge bugs.c and bugs_64.c
	sysfs/cpu: Add vulnerability folder
	x86/cpu: Implement CPU vulnerabilites sysfs functions
	x86/cpu/AMD: Make LFENCE a serializing instruction
	x86/cpu/AMD: Use LFENCE_RDTSC in preference to MFENCE_RDTSC
	sysfs/cpu: Fix typos in vulnerability documentation
	x86/alternatives: Fix optimize_nops() checking
	x86/alternatives: Add missing '\n' at end of ALTERNATIVE inline asm
	x86/mm/32: Move setup_clear_cpu_cap(X86_FEATURE_PCID) earlier
	objtool, modules: Discard objtool annotation sections for modules
	objtool: Detect jumps to retpoline thunks
	objtool: Allow alternatives to be ignored
	x86/asm: Use register variable to get stack pointer value
	x86/retpoline: Add initial retpoline support
	x86/spectre: Add boot time option to select Spectre v2 mitigation
	x86/retpoline/crypto: Convert crypto assembler indirect jumps
	x86/retpoline/entry: Convert entry assembler indirect jumps
	x86/retpoline/ftrace: Convert ftrace assembler indirect jumps
	x86/retpoline/hyperv: Convert assembler indirect jumps
	x86/retpoline/xen: Convert Xen hypercall indirect jumps
	x86/retpoline/checksum32: Convert assembler indirect jumps
	x86/retpoline/irq32: Convert assembler indirect jumps
	x86/retpoline: Fill return stack buffer on vmexit
	selftests/x86: Add test_vsyscall
	x86/retpoline: Remove compile time warning
	objtool: Fix retpoline support for pre-ORC objtool
	x86/pti/efi: broken conversion from efi to kernel page table
	Linux 4.9.77

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
parents b73dcc7a b8cf9ff7
+16 −0
@@ -350,3 +350,19 @@ Contact: Linux ARM Kernel Mailing list <linux-arm-kernel@lists.infradead.org>
Description:	AArch64 CPU registers
		'identification' directory exposes the CPU ID registers for
		 identifying model and revision of the CPU.

What:		/sys/devices/system/cpu/vulnerabilities
		/sys/devices/system/cpu/vulnerabilities/meltdown
		/sys/devices/system/cpu/vulnerabilities/spectre_v1
		/sys/devices/system/cpu/vulnerabilities/spectre_v2
Date:		January 2018
Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
Description:	Information about CPU vulnerabilities

		The files are named after the code names of CPU
		vulnerabilities. The output of those files reflects the
		state of the CPUs in the system. Possible output values:

		"Not affected"	  CPU is not affected by the vulnerability
		"Vulnerable"	  CPU is affected and no mitigation in effect
		"Mitigation: $M"  CPU is affected and mitigation $M is in effect
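
The three output values listed above can be told apart with a simple prefix match. The following C sketch is illustrative only: the enum and the function name `classify_vuln` are hypothetical and not part of the kernel ABI. It assumes the caller has already read one line from a file under /sys/devices/system/cpu/vulnerabilities/.

```c
#include <string.h>

/* Hypothetical names for this sketch; not defined by the kernel. */
enum vuln_state {
	VULN_NOT_AFFECTED,
	VULN_VULNERABLE,
	VULN_MITIGATED,
	VULN_UNKNOWN,
};

/*
 * Classify one line read from a vulnerabilities file.  Matching is by
 * prefix, so a trailing newline does not matter.  For VULN_MITIGATED,
 * *mitigation points at the text that follows "Mitigation: " inside
 * the caller's buffer; otherwise it is set to NULL.
 */
static enum vuln_state classify_vuln(const char *line, const char **mitigation)
{
	static const char prefix[] = "Mitigation: ";

	*mitigation = NULL;
	if (strncmp(line, "Not affected", 12) == 0)
		return VULN_NOT_AFFECTED;
	if (strncmp(line, "Vulnerable", 10) == 0)
		return VULN_VULNERABLE;
	if (strncmp(line, prefix, sizeof(prefix) - 1) == 0) {
		*mitigation = line + sizeof(prefix) - 1;
		return VULN_MITIGATED;
	}
	return VULN_UNKNOWN;
}
```

A caller would typically fopen() one of the files, fgets() a single line, and pass the buffer to this function.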
+42 −7
@@ -2697,6 +2697,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
	nosmt		[KNL,S390] Disable symmetric multithreading (SMT).
			Equivalent to smt=1.

	nospectre_v2	[X86] Disable all mitigations for the Spectre variant 2
			(indirect branch prediction) vulnerability. System may
			allow data leaks with this option, which is equivalent
			to spectre_v2=off.

	noxsave		[BUGS=X86] Disables x86 extended register state save
			and restore using xsave. The kernel will fallback to
			enabling legacy floating-point and sse state.
@@ -2769,8 +2774,6 @@ bytes respectively. Such letter suffixes can also be entirely omitted.

	nojitter	[IA-64] Disables jitter checking for ITC timers.

-	nopti		[X86-64] Disable KAISER isolation of kernel from user.
-
	no-kvmclock	[X86,KVM] Disable paravirtualized KVM clock driver

	no-kvmapf	[X86,KVM] Disable paravirtualized asynchronous page
@@ -3333,11 +3336,20 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
	pt.		[PARIDE]
			See Documentation/blockdev/paride.txt.

-	pti=		[X86_64]
-			Control KAISER user/kernel address space isolation:
-			on - enable
-			off - disable
-			auto - default setting
+	pti=		[X86_64] Control Page Table Isolation of user and
+			kernel address spaces.  Disabling this feature
+			removes hardening, but improves performance of
+			system calls and interrupts.
+
+			on   - unconditionally enable
+			off  - unconditionally disable
+			auto - kernel detects whether your CPU model is
+			       vulnerable to issues that PTI mitigates
+
+			Not specifying this option is equivalent to pti=auto.
+
+	nopti		[X86_64]
+			Equivalent to pti=off

	pty.legacy_count=
			[KNL] Number of legacy pty's. Overwrites compiled-in
@@ -3943,6 +3955,29 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
	sonypi.*=	[HW] Sony Programmable I/O Control Device driver
			See Documentation/laptops/sonypi.txt

	spectre_v2=	[X86] Control mitigation of Spectre variant 2
			(indirect branch speculation) vulnerability.

			on   - unconditionally enable
			off  - unconditionally disable
			auto - kernel detects whether your CPU model is
			       vulnerable

			Selecting 'on' will, and 'auto' may, choose a
			mitigation method at run time according to the
			CPU, the available microcode, the setting of the
			CONFIG_RETPOLINE configuration option, and the
			compiler with which the kernel was built.

			Specific mitigations can also be selected manually:

			retpoline	  - replace indirect branches
			retpoline,generic - Google's original retpoline
			retpoline,amd     - AMD-specific minimal thunk

			Not specifying this option is equivalent to
			spectre_v2=auto.

	spia_io_base=	[HW,MTD]
	spia_fio_base=
	spia_pedr=
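
As a rough illustration of the spectre_v2= and nospectre_v2 values documented in this hunk, the following C sketch maps a boot command line to one of the documented choices. The enum, the function names, and the rule that the last matching token wins are assumptions made for this example; this is not the kernel's parser.

```c
#include <string.h>

/* Hypothetical enum for this sketch; the kernel's own names may differ. */
enum sv2_choice {
	SV2_AUTO,
	SV2_ON,
	SV2_OFF,
	SV2_RETPOLINE,
	SV2_RETPOLINE_GENERIC,
	SV2_RETPOLINE_AMD,
	SV2_INVALID,
};

/* Map the text after "spectre_v2=" (len bytes, not NUL-terminated). */
static enum sv2_choice sv2_value(const char *v, size_t len)
{
	static const struct {
		const char *name;
		enum sv2_choice choice;
	} table[] = {
		{ "on", SV2_ON },
		{ "off", SV2_OFF },
		{ "auto", SV2_AUTO },
		{ "retpoline", SV2_RETPOLINE },
		{ "retpoline,generic", SV2_RETPOLINE_GENERIC },
		{ "retpoline,amd", SV2_RETPOLINE_AMD },
	};
	size_t i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++)
		if (strlen(table[i].name) == len &&
		    memcmp(v, table[i].name, len) == 0)
			return table[i].choice;
	return SV2_INVALID;
}

/* Scan whitespace-separated tokens; an absent option means "auto". */
static enum sv2_choice parse_spectre_v2(const char *cmdline)
{
	static const char key[] = "spectre_v2=";
	const size_t klen = sizeof(key) - 1;
	enum sv2_choice choice = SV2_AUTO;
	const char *p = cmdline;

	for (;;) {
		size_t len;

		p += strspn(p, " \t\n");	/* skip separators */
		len = strcspn(p, " \t\n");	/* length of this token */
		if (len == 0)
			break;
		if (len == 12 && strncmp(p, "nospectre_v2", 12) == 0)
			choice = SV2_OFF;	/* same as spectre_v2=off */
		else if (len >= klen && strncmp(p, key, klen) == 0)
			choice = sv2_value(p + klen, len - klen);
		p += len;
	}
	return choice;
}
```

Tokens are compared whole rather than searched as substrings, because "nospectre_v2" itself contains "spectre_v2".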
+186 −0
Overview
========

Page Table Isolation (pti, previously known as KAISER[1]) is a
countermeasure against attacks on the shared user/kernel address
space such as the "Meltdown" approach[2].

To mitigate this class of attacks, we create an independent set of
page tables for use only when running userspace applications.  When
the kernel is entered via syscalls, interrupts or exceptions, the
page tables are switched to the full "kernel" copy.  When the system
switches back to user mode, the user copy is used again.

The userspace page tables contain only a minimal amount of kernel
data: only what is needed to enter/exit the kernel such as the
entry/exit functions themselves and the interrupt descriptor table
(IDT).  There are a few strictly unnecessary things that get mapped
such as the first C function when entering an interrupt (see
comments in pti.c).

This approach helps to ensure that side-channel attacks leveraging
the paging structures do not function when PTI is enabled.  It can be
enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
Once enabled at compile-time, it can be disabled at boot with the
'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).

Page Table Management
=====================

When PTI is enabled, the kernel manages two sets of page tables.
The first set is very similar to the single set which is present in
kernels without PTI.  This includes a complete mapping of userspace
that the kernel can use for things like copy_to_user().

Although _complete_, the user portion of the kernel page tables is
crippled by setting the NX bit in the top level.  This ensures
that any missed kernel->user CR3 switch will immediately crash
userspace upon executing its first instruction.

The userspace page tables map only the kernel data needed to enter
and exit the kernel.  This data is entirely contained in the 'struct
cpu_entry_area' structure which is placed in the fixmap which gives
each CPU's copy of the area a compile-time-fixed virtual address.

For new userspace mappings, the kernel makes the entries in its
page tables like normal.  The only difference is when the kernel
makes entries in the top (PGD) level.  In addition to setting the
entry in the main kernel PGD, a copy of the entry is made in the
userspace page tables' PGD.

This sharing at the PGD level also inherently shares all the lower
layers of the page tables.  This leaves a single, shared set of
userspace page tables to manage.  One PTE to lock, one set of
accessed bits, dirty bits, etc...
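
The PGD-level mirroring described in this section can be pictured with a toy model. The C sketch below is not kernel code: `struct toy_mm`, `toy_set_pgd()` and the constants are hypothetical, and the model ignores the few kernel-half entries (such as the cpu_entry_area mapping) that the real userspace tables also contain. It only shows that one store of a userspace entry updates both top-level tables, with NX set in the kernel copy.

```c
#include <stdint.h>
#include <string.h>

#define TOY_PTRS_PER_PGD	512	/* x86-64 top-level entries per table */
#define TOY_USER_ENTRIES	256	/* lower half of the table maps userspace */
#define TOY_NX			(1ULL << 63)	/* no-execute bit in an entry */

/* Two adjacent 4k tables, i.e. one order-1 allocation. */
struct toy_mm {
	uint64_t kernel_pgd[TOY_PTRS_PER_PGD];
	uint64_t user_pgd[TOY_PTRS_PER_PGD];
};

/* Hypothetical stand-in for a PTI-aware set_pgd(). */
static void toy_set_pgd(struct toy_mm *mm, unsigned int idx, uint64_t entry)
{
	if (idx < TOY_USER_ENTRIES) {
		/* Userspace copy: the entry is used as-is. */
		mm->user_pgd[idx] = entry;
		/* Kernel copy: same lower levels, but NX at the top level. */
		mm->kernel_pgd[idx] = entry | TOY_NX;
	} else {
		/* Kernel-half entry: not mirrored in this simplified model. */
		mm->kernel_pgd[idx] = entry;
	}
}
```

Because both top-level entries point at the same lower-level tables, everything below the PGD exists only once, which is the sharing the text describes.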

Overhead
========

Protection against side-channel attacks is important.  But,
this protection comes at a cost:

1. Increased Memory Use
  a. Each process now needs an order-1 PGD instead of order-0.
     (Consumes an additional 4k per process).
  b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
     aligned so that it can be mapped by setting a single PMD
     entry.  This consumes nearly 2MB of RAM once the kernel
     is decompressed, but no space in the kernel image itself.

2. Runtime Cost
  a. CR3 manipulation to switch between the page table copies
     must be done at interrupt, syscall, and exception entry
     and exit (it can be skipped when the kernel is interrupted,
     though.)  Moves to CR3 are on the order of a hundred
     cycles, and are required at every entry and exit.
  b. A "trampoline" must be used for SYSCALL entry.  This
     trampoline depends on a smaller set of resources than the
     non-PTI SYSCALL entry code, so requires mapping fewer
     things into the userspace page tables.  The downside is
     that stacks must be switched at entry time.
  c. Global pages are disabled for all kernel structures not
     mapped into both kernel and userspace page tables.  This
     feature of the MMU allows different processes to share TLB
     entries mapping the kernel.  Losing the feature means more
     TLB misses after a context switch.  The actual loss of
     performance is very small, however, never exceeding 1%.
  d. Process Context IDentifiers (PCID) is a CPU feature that
     allows us to skip flushing the entire TLB when switching page
     tables by setting a special bit in CR3 when the page tables
     are changed.  This makes switching the page tables (at context
     switch, or kernel entry/exit) cheaper.  But, on systems with
     PCID support, the context switch code must flush both the user
     and kernel entries out of the TLB.  The user PCID TLB flush is
     deferred until the exit to userspace, minimizing the cost.
     See intel.com/sdm for the gory PCID/INVPCID details.
  e. The userspace page tables must be populated for each new
     process.  Even without PTI, the shared kernel mappings
     are created by copying top-level (PGD) entries into each
     new process.  But, with PTI, there are now *two* kernel
     mappings: one in the kernel page tables that maps everything
     and one for the entry/exit structures.  At fork(), we need to
     copy both.
  f. In addition to the fork()-time copying, there must also
     be an update to the userspace PGD any time a set_pgd() is done
     on a PGD used to map userspace.  This ensures that the kernel
     and userspace copies always map the same userspace
     memory.
  g. On systems without PCID support, each CR3 write flushes
     the entire TLB.  That means that each syscall, interrupt
     or exception flushes the TLB.
  h. INVPCID is a TLB-flushing instruction which allows flushing
     of TLB entries for non-current PCIDs.  Some systems support
     PCIDs, but do not support INVPCID.  On these systems, addresses
     can only be flushed from the TLB for the current PCID.  When
     flushing a kernel address, we need to flush all PCIDs, so a
     single kernel address flush will require a TLB-flushing CR3
     write upon the next use of every PCID.
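
Item h above can be illustrated with a small bookkeeping model. This C sketch is purely hypothetical (the struct, the function names and the pool size of four PCIDs are assumptions for the example) and does not mirror the kernel's TLB code. It only counts how many flushing CR3 writes one kernel-address flush causes later on when INVPCID is unavailable.

```c
#include <stdbool.h>

#define TOY_NR_PCIDS 4	/* assumption: small fixed pool for illustration */

struct toy_tlb {
	unsigned int current;			/* PCID currently loaded */
	bool stale[TOY_NR_PCIDS];		/* must be flushed on next use */
	unsigned int flushing_cr3_writes;	/* deferred flushes paid so far */
};

/*
 * Flush one kernel address.  Without INVPCID only the current PCID can
 * be flushed directly, so every other PCID is marked stale instead.
 */
static void toy_flush_kernel_addr(struct toy_tlb *t)
{
	unsigned int i;

	for (i = 0; i < TOY_NR_PCIDS; i++)
		if (i != t->current)
			t->stale[i] = true;
}

/* Switch PCIDs; a stale PCID costs one TLB-flushing CR3 write. */
static void toy_switch_pcid(struct toy_tlb *t, unsigned int next)
{
	if (t->stale[next]) {
		t->stale[next] = false;
		t->flushing_cr3_writes++;
	}
	t->current = next;
}
```

In this model a single call to toy_flush_kernel_addr() leads to one flushing write for each of the other PCIDs, paid the next time each one is used.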

Possible Future Work
====================
1. We can be more careful about not actually writing to CR3
   unless its value is actually changed.
2. Allow PTI to be enabled/disabled at runtime in addition to the
   boot-time switching.

Testing
========

To test stability of PTI, the following test procedure is recommended,
ideally doing all of these in parallel:

1. Set CONFIG_DEBUG_ENTRY=y
2. Run several copies of all of the tools/testing/selftests/x86/ tests
   (excluding MPX and protection_keys) in a loop on multiple CPUs for
   several minutes.  These tests frequently uncover corner cases in the
   kernel entry code.  In general, old kernels might cause these tests
   themselves to crash, but they should never crash the kernel.
3. Run the 'perf' tool in a mode (top or record) that generates many
   frequent performance monitoring non-maskable interrupts (see "NMI"
   in /proc/interrupts).  This exercises the NMI entry/exit code which
   is known to trigger bugs in code paths that did not expect to be
   interrupted, including nested NMIs.  Using "-c" boosts the rate of
   NMIs, and using two -c with separate counters encourages nested NMIs
   and less deterministic behavior.

	while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done

4. Launch a KVM virtual machine.
5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
   This has been a lightly-tested code path and needs extra scrutiny.

Debugging
=========

Bugs in PTI cause a few different signatures of crashes
that are worth noting here.

 * Failures of the selftests/x86 code.  Usually a bug in one of the
   more obscure corners of entry_64.S
 * Crashes in early boot, especially around CPU bringup.  Bugs
   in the trampoline code or mappings cause these.
 * Crashes at the first interrupt.  Caused by bugs in entry_64.S,
   like screwing up a page table switch.  Also caused by
   incorrectly mapping the IRQ handler entry code.
 * Crashes at the first NMI.  The NMI code is separate from main
   interrupt handlers and can have bugs that do not affect
   normal interrupts.  Also caused by incorrectly mapping NMI
   code.  NMIs that interrupt the entry code must be very
   careful and can be the cause of crashes that show up when
   running perf.
 * Kernel crashes at the first exit to userspace.  entry_64.S
   bugs, or failing to map some of the exit code.
 * Crashes at first interrupt that interrupts userspace. The paths
   in entry_64.S that return to userspace are sometimes separate
   from the ones that return to the kernel.
 * Double faults: overflowing the kernel stack because of page
   faults upon page faults.  Caused by touching non-pti-mapped
   data in the entry code, or forgetting to switch to kernel
   CR3 before calling into C functions which are not pti-mapped.
 * Userspace segfaults early in boot, sometimes manifesting
   as mount(8) failing to mount the rootfs.  These have
   tended to be TLB invalidation issues.  Usually invalidating
   the wrong PCID, or otherwise missing an invalidation.

1. https://gruss.cc/files/kaiser.pdf
2. https://meltdownattack.com/meltdown.pdf
+1 −1
VERSION = 4
PATCHLEVEL = 9
-SUBLEVEL = 76
+SUBLEVEL = 77
EXTRAVERSION =
NAME = Roaring Lionus

+3 −3
@@ -112,7 +112,7 @@ int kvm_handle_mmio_return(struct kvm_vcpu *vcpu, struct kvm_run *run)
		}

		trace_kvm_mmio(KVM_TRACE_MMIO_READ, len, run->mmio.phys_addr,
-			       data);
+			       &data);
		data = vcpu_data_host_to_guest(vcpu, data, len);
		vcpu_set_reg(vcpu, vcpu->arch.mmio_decode.rt, data);
	}
@@ -182,14 +182,14 @@ int io_mem_abort(struct kvm_vcpu *vcpu, struct kvm_run *run,
		data = vcpu_data_guest_to_host(vcpu, vcpu_get_reg(vcpu, rt),
					       len);

-		trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, len, fault_ipa, data);
+		trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, len, fault_ipa, &data);
		kvm_mmio_write_buf(data_buf, len, data);

		ret = kvm_io_bus_write(vcpu, KVM_MMIO_BUS, fault_ipa, len,
				       data_buf);
	} else {
		trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, len,
-			       fault_ipa, 0);
+			       fault_ipa, NULL);

		ret = kvm_io_bus_read(vcpu, KVM_MMIO_BUS, fault_ipa, len,
				      data_buf);