Donate to e Foundation | Murena handsets with /e/OS | Own a part of Murena! Learn more

Commit b1b6f83a authored by Linus Torvalds's avatar Linus Torvalds
Browse files

Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mm changes from Ingo Molnar:
 "PCID support, 5-level paging support, Secure Memory Encryption support

  The main changes in this cycle are support for three new, complex
  hardware features of x86 CPUs:

   - Add 5-level paging support, which is a new hardware feature on
     upcoming Intel CPUs allowing up to 128 PB of virtual address space
     and 4 PB of physical RAM space - a 512-fold increase over the old
     limits. (Supercomputers of the future forecasting hurricanes on an
     ever warming planet can certainly make good use of more RAM.)

     Many of the necessary changes went upstream in previous cycles,
     v4.14 is the first kernel that can enable 5-level paging.

     This feature is activated via CONFIG_X86_5LEVEL=y - disabled by
     default.

     (By Kirill A. Shutemov)

   - Add 'encrypted memory' support, which is a new hardware feature on
     upcoming AMD CPUs ('Secure Memory Encryption', SME) allowing system
     RAM to be encrypted and decrypted (mostly) transparently by the
     CPU, with a little help from the kernel to transition to/from
     encrypted RAM. Such RAM should be more secure against various
     attacks like RAM access via the memory bus and should make the
     radio signature of memory bus traffic harder to intercept (and
     decrypt) as well.

     This feature is activated via CONFIG_AMD_MEM_ENCRYPT=y - disabled
     by default.

     (By Tom Lendacky)

   - Enable PCID optimized TLB flushing on newer Intel CPUs: PCID is a
     hardware feature that attaches an address space tag to TLB entries
     and thus allows to skip TLB flushing in many cases, even if we
     switch mm's.

     (By Andy Lutomirski)

  All three of these features were in the works for a long time, and
  it's coincidence of the three independent development paths that they
  are all enabled in v4.14 at once"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (65 commits)
  x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)
  x86/mm: Use pr_cont() in dump_pagetable()
  x86/mm: Fix SME encryption stack ptr handling
  kvm/x86: Avoid clearing the C-bit in rsvd_bits()
  x86/CPU: Align CR3 defines
  x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
  acpi, x86/mm: Remove encryption mask from ACPI page protection type
  x86/mm, kexec: Fix memory corruption with SME on successive kexecs
  x86/mm/pkeys: Fix typo in Documentation/x86/protection-keys.txt
  x86/mm/dump_pagetables: Speed up page tables dump for CONFIG_KASAN=y
  x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID
  x86: Enable 5-level paging support via CONFIG_X86_5LEVEL=y
  x86/mm: Allow userspace have mappings above 47-bit
  x86/mm: Prepare to expose larger address space to userspace
  x86/mpx: Do not allow MPX if we have mappings above 47-bit
  x86/mm: Rename tasksize_32bit/64bit to task_size_32bit/64bit()
  x86/xen: Redefine XEN_ELFNOTE_INIT_P2M using PUD_SIZE * PTRS_PER_PUD
  x86/mm/dump_pagetables: Fix printout of p4d level
  x86/mm/dump_pagetables: Generalize address normalization
  x86/boot: Fix memremap() related build failure
  ...
parents 5f82e71a 9e52fc2b
Loading
Loading
Loading
Loading
+13 −0
Original line number Diff line number Diff line
@@ -2233,6 +2233,17 @@
			memory contents and reserves bad memory
			regions that are detected.

	mem_encrypt=	[X86-64] AMD Secure Memory Encryption (SME) control
			Valid arguments: on, off
			Default (depends on kernel configuration option):
			  on  (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y)
			  off (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=n)
			mem_encrypt=on:		Activate SME
			mem_encrypt=off:	Do not activate SME

			Refer to Documentation/x86/amd-memory-encryption.txt
			for details on when memory encryption can be activated.

	mem_sleep_default=	[SUSPEND] Default system suspend mode:
			s2idle  - Suspend-To-Idle
			shallow - Power-On Suspend or equivalent (if supported)
@@ -2697,6 +2708,8 @@
	nopat		[X86] Disable PAT (page attribute table extension of
			pagetables) support.

	nopcid		[X86-64] Disable the PCID cpu feature.

	norandmaps	Don't use address space randomization.  Equivalent to
			echo 0 > /proc/sys/kernel/randomize_va_space

+68 −0
Original line number Diff line number Diff line
Secure Memory Encryption (SME) is a feature found on AMD processors.

SME provides the ability to mark individual pages of memory as encrypted using
the standard x86 page tables.  A page that is marked encrypted will be
automatically decrypted when read from DRAM and encrypted when written to
DRAM.  SME can therefore be used to protect the contents of DRAM from physical
attacks on the system.

A page is encrypted when a page table entry has the encryption bit set (see
below on how to determine its position).  The encryption bit can also be
specified in the cr3 register, allowing the PGD table to be encrypted. Each
successive level of page tables can also be encrypted by setting the encryption
bit in the page table entry that points to the next table. This allows the full
page table hierarchy to be encrypted. Note, this means that just because the
encryption bit is set in cr3, doesn't imply the full hierarchy is encyrpted.
Each page table entry in the hierarchy needs to have the encryption bit set to
achieve that. So, theoretically, you could have the encryption bit set in cr3
so that the PGD is encrypted, but not set the encryption bit in the PGD entry
for a PUD which results in the PUD pointed to by that entry to not be
encrypted.

Support for SME can be determined through the CPUID instruction. The CPUID
function 0x8000001f reports information related to SME:

	0x8000001f[eax]:
		Bit[0] indicates support for SME
	0x8000001f[ebx]:
		Bits[5:0]  pagetable bit number used to activate memory
			   encryption
		Bits[11:6] reduction in physical address space, in bits, when
			   memory encryption is enabled (this only affects
			   system physical addresses, not guest physical
			   addresses)

If support for SME is present, MSR 0xc00100010 (MSR_K8_SYSCFG) can be used to
determine if SME is enabled and/or to enable memory encryption:

	0xc0010010:
		Bit[23]   0 = memory encryption features are disabled
			  1 = memory encryption features are enabled

Linux relies on BIOS to set this bit if BIOS has determined that the reduction
in the physical address space as a result of enabling memory encryption (see
CPUID information above) will not conflict with the address space resource
requirements for the system.  If this bit is not set upon Linux startup then
Linux itself will not set it and memory encryption will not be possible.

The state of SME in the Linux kernel can be documented as follows:
	- Supported:
	  The CPU supports SME (determined through CPUID instruction).

	- Enabled:
	  Supported and bit 23 of MSR_K8_SYSCFG is set.

	- Active:
	  Supported, Enabled and the Linux kernel is actively applying
	  the encryption bit to page table entries (the SME mask in the
	  kernel is non-zero).

SME can also be enabled and activated in the BIOS. If SME is enabled and
activated in the BIOS, then all memory accesses will be encrypted and it will
not be necessary to activate the Linux memory encryption support.  If the BIOS
merely enables SME (sets bit 23 of the MSR_K8_SYSCFG), then Linux can activate
memory encryption by default (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y) or
by supplying mem_encrypt=on on the kernel command line.  However, if BIOS does
not enable SME, then Linux will not be able to activate memory encryption, even
if configured to do so by default or the mem_encrypt=on command line parameter
is specified.
+3 −3
Original line number Diff line number Diff line
@@ -34,7 +34,7 @@ with a key. In this example WRPKRU is wrapped by a C function
called pkey_set().

	int real_prot = PROT_READ|PROT_WRITE;
	pkey = pkey_alloc(0, PKEY_DENY_WRITE);
	pkey = pkey_alloc(0, PKEY_DISABLE_WRITE);
	ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
	... application runs here
@@ -42,9 +42,9 @@ called pkey_set().
Now, if the application needs to update the data at 'ptr', it can
gain access, do the update, then remove its write access:

	pkey_set(pkey, 0); // clear PKEY_DENY_WRITE
	pkey_set(pkey, 0); // clear PKEY_DISABLE_WRITE
	*ptr = foo; // assign something
	pkey_set(pkey, PKEY_DENY_WRITE); // set PKEY_DENY_WRITE again
	pkey_set(pkey, PKEY_DISABLE_WRITE); // set PKEY_DISABLE_WRITE again

Now when it frees the memory, it will also free the pkey since it
is no longer in use:
+64 −0
Original line number Diff line number Diff line
== Overview ==

Original x86-64 was limited by 4-level paing to 256 TiB of virtual address
space and 64 TiB of physical address space. We are already bumping into
this limit: some vendors offers servers with 64 TiB of memory today.

To overcome the limitation upcoming hardware will introduce support for
5-level paging. It is a straight-forward extension of the current page
table structure adding one more layer of translation.

It bumps the limits to 128 PiB of virtual address space and 4 PiB of
physical address space. This "ought to be enough for anybody" ©.

QEMU 2.9 and later support 5-level paging.

Virtual memory layout for 5-level paging is described in
Documentation/x86/x86_64/mm.txt

== Enabling 5-level paging ==

CONFIG_X86_5LEVEL=y enables the feature.

So far, a kernel compiled with the option enabled will be able to boot
only on machines that supports the feature -- see for 'la57' flag in
/proc/cpuinfo.

The plan is to implement boot-time switching between 4- and 5-level paging
in the future.

== User-space and large virtual address space ==

On x86, 5-level paging enables 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information. It collides with valid pointers with 5-level paging and
leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 47-bit by default.

But userspace can ask for allocation from full address space by
specifying hint address (with or without MAP_FIXED) above 47-bits.

If hint address set above 47-bit, but MAP_FIXED is not specified, we try
to look for unmapped area by specified address. If it's already
occupied, we look for unmapped area in *full* address space, rather than
from 47-bit window.

A high hint address would only affect the allocation in question, but not
any future mmap()s.

Specifying high hint address on older kernel or on machine without 5-level
paging support is safe. The hint will be ignored and kernel will fall back
to allocation from 47-bit address space.

This approach helps to easily make application's memory allocator aware
about large address space without manually tracking allocated virtual
address space.

One important case we need to handle here is interaction with MPX.
MPX (without MAWA extension) cannot handle addresses above 47-bit, so we
need to make sure that MPX cannot be enabled we already have VMA above
the boundary and forbid creating such VMAs once MPX is enabled.
+0 −2
Original line number Diff line number Diff line
@@ -112,8 +112,6 @@ static inline void arch_acpi_set_pdc_bits(u32 *buf)
	buf[2] |= ACPI_PDC_EST_CAPABILITY_SMP;
}

#define acpi_unlazy_tlb(x)

#ifdef CONFIG_ACPI_NUMA
extern cpumask_t early_cpu_possible_map;
#define for_each_possible_early_cpu(cpu)  \
Loading