
Commit 5beb4930 authored by Rik van Riel, committed by Linus Torvalds

mm: change anon_vma linking to fix multi-process server scalability issue



The old anon_vma code can lead to scalability issues with heavily forking
workloads.  Specifically, each anon_vma will be shared between the parent
process and all its child processes.

In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes.  However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.

This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock.  This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach into the tens of
thousands.  Real workloads are still a factor of 10 less process-intensive
than AIM7, but they are catching up.

This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA.  At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated.  The parent's anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
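
The linking described above can be pictured with a small user-space model. This is only a sketch: `struct avc_link`, `struct vma_model`, and the helper functions are hypothetical names invented for illustration, not the kernel's definitions. It shows a child VMA inheriting every anon_vma link from its parent at fork time and then gaining one anon_vma of its own for future COWed pages.

```c
#include <stdlib.h>

/* Simplified user-space model; these are NOT the kernel's definitions. */
struct anon_vma {
	int id;
};

/* Hypothetical link object tying one anon_vma to one VMA. */
struct avc_link {
	struct anon_vma *anon_vma;
	struct avc_link *next;
};

struct vma_model {
	struct avc_link *chain;    /* every anon_vma this VMA is linked to */
	struct anon_vma *anon_vma; /* where this process's COWed pages go */
};

/* Link one anon_vma to a VMA.  Allocates, so it can fail. */
static int vma_link_anon_vma(struct vma_model *vma, struct anon_vma *av)
{
	struct avc_link *link = malloc(sizeof(*link));

	if (!link)
		return -1;
	link->anon_vma = av;
	link->next = vma->chain;
	vma->chain = link;
	return 0;
}

/* Drop every link held by a VMA. */
static void vma_unlink_all(struct vma_model *vma)
{
	while (vma->chain) {
		struct avc_link *next = vma->chain->next;

		free(vma->chain);
		vma->chain = next;
	}
}

/*
 * Model of fork: the child keeps links to all of the parent's anon_vmas
 * (non-COWed pages may still live there) and gets its own anon_vma.
 */
static int vma_fork(struct vma_model *child, const struct vma_model *parent,
		    struct anon_vma *own)
{
	const struct avc_link *link;

	child->chain = NULL;
	child->anon_vma = NULL;
	for (link = parent->chain; link; link = link->next) {
		if (vma_link_anon_vma(child, link->anon_vma))
			goto fail;
	}
	if (vma_link_anon_vma(child, own))
		goto fail;
	child->anon_vma = own;
	return 0;
fail:
	vma_unlink_all(child);
	return -1;
}

/* Number of anon_vmas a VMA is linked to. */
static int vma_chain_length(const struct vma_model *vma)
{
	int n = 0;
	const struct avc_link *link;

	for (link = vma->chain; link; link = link->next)
		n++;
	return n;
}
```

In this model a parent VMA with one anon_vma yields a child VMA with two links, while the parent's own chain is unchanged.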

This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
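
The figure of 2 can be checked with a back-of-the-envelope calculation. The sketch below rests on assumptions that go beyond the text: every page in each child has been COWed into that child's own anon_vma (so it is reachable through 1 VMA), and the parent keeps the same number of pages in its original anon_vma (reachable through the parent plus all N children). The function names are made up for this example.

```c
/* Old scheme: every page sits in one anon_vma shared by parent + children. */
static double avg_vmas_walked_old(unsigned long nchildren)
{
	return (double)nchildren + 1.0;
}

/* New scheme, under the assumptions stated above. */
static double avg_vmas_walked_new(unsigned long nchildren,
				  unsigned long pages_per_proc)
{
	/* COWed child pages: one VMA to check per page. */
	double child_pages = (double)nchildren * (double)pages_per_proc;
	/* Parent pages: parent VMA plus one VMA per child. */
	double parent_pages = (double)pages_per_proc;
	double walked = child_pages * 1.0 +
			parent_pages * ((double)nchildren + 1.0);

	return walked / (child_pages + parent_pages);
}
```

With N children the new average works out to (2N + 1) / (N + 1), which approaches 2 as N grows, while the old average grows linearly with N.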

The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations.  This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures.  This in
turn means error handling needs to be added to the calling functions.
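
The caller-side pattern this implies can be sketched in user space. Everything here is illustrative: `struct range`, `model_adjust`, and `model_shift` are hypothetical stand-ins, and the `simulate_oom` parameter exists only to force the failure path. The shape mirrors the idea that an adjust which needs a new link may return -ENOMEM and must be checked, while one that only shrinks needs no allocation.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a VMA: an address range plus one link. */
struct range {
	unsigned long start;
	unsigned long end;
	void *link;	/* models an allocated anon_vma link */
};

/*
 * Hypothetical adjust operation.  If a link is needed and missing it
 * must allocate one, so it can fail; on failure the range is untouched.
 */
static int model_adjust(struct range *r, unsigned long start,
			unsigned long end, int need_link, int simulate_oom)
{
	if (need_link && !r->link) {
		void *link = simulate_oom ? NULL : malloc(16);

		if (!link)
			return -ENOMEM;
		r->link = link;
	}
	r->start = start;
	r->end = end;
	return 0;
}

/* Caller that first grows the range downwards, then shrinks it. */
static int model_shift(struct range *r, unsigned long shift, int simulate_oom)
{
	unsigned long new_start = r->start - shift;
	unsigned long old_end = r->end;
	unsigned long new_end = old_end - shift;

	/* Growing may need a new link, so the result must be checked. */
	if (model_adjust(r, new_start, old_end, 1, simulate_oom))
		return -ENOMEM;

	/* Shrinking needs no allocation in this model, so it cannot fail. */
	model_adjust(r, new_start, new_end, 0, 0);
	return 0;
}
```

The key design point is that the failing call leaves the range unmodified, so the caller can simply propagate the error.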

A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock.  To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
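
A minimal sketch of how such a shared bit might be used follows. The flag value 0x01000000 and the ifdef arrangement come from this patch's mm.h change; the `CONFIG_MMU` define, `struct vma_model`, and `count_walkable` are assumptions made so the example compiles on its own, and the walker is not the kernel's rmap code.

```c
#define CONFIG_MMU 1	/* assumption: this sketch models an MMU build */

#ifdef CONFIG_MMU
#define VM_LOCK_RMAP	0x01000000UL	/* do not follow this rmap */
#else
#define VM_MAPPED_COPY	0x01000000UL	/* nommu: mapped copy of data */
#endif

/* Only one of the two names may exist in any given build. */
#if defined(VM_LOCK_RMAP) && defined(VM_MAPPED_COPY)
#error "VM_LOCK_RMAP and VM_MAPPED_COPY share a bit"
#endif

/* Hypothetical stand-in for a VMA; only the flags matter here. */
struct vma_model {
	unsigned long vm_flags;
};

/*
 * Hypothetical walker: counts the VMAs it would visit, skipping any
 * that are flagged as still being linked.
 */
static int count_walkable(const struct vma_model *vmas, int n)
{
	int i, visited = 0;

	for (i = 0; i < n; i++) {
		if (vmas[i].vm_flags & VM_LOCK_RMAP)
			continue;	/* incomplete VMA: do not follow */
		visited++;
	}
	return visited;
}
```

Code that refers to the name not defined for the current configuration fails at compile time, which is the property the ifdef is meant to guarantee.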

Some test results:

Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.

With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time; there never seems to be any spike in
system time.  The anon_vma lock contention appears to be resolved.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
parent 648bcc77
+1 −0
@@ -2315,6 +2315,7 @@ pfm_smpl_buffer_alloc(struct task_struct *task, struct file *filp, pfm_context_t
 		DPRINT(("Cannot allocate vma\n"));
 		goto error_kmem;
 	}
+	INIT_LIST_HEAD(&vma->anon_vma_chain);
 
 	/*
 	 * partially initialize the vma for the sampling buffer
+2 −0
@@ -117,6 +117,7 @@ ia64_init_addr_space (void)
 	 */
 	vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
 	if (vma) {
+		INIT_LIST_HEAD(&vma->anon_vma_chain);
 		vma->vm_mm = current->mm;
 		vma->vm_start = current->thread.rbs_bot & PAGE_MASK;
 		vma->vm_end = vma->vm_start + PAGE_SIZE;
@@ -135,6 +136,7 @@ ia64_init_addr_space (void)
 	if (!(current->personality & MMAP_PAGE_ZERO)) {
 		vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
 		if (vma) {
+			INIT_LIST_HEAD(&vma->anon_vma_chain);
 			vma->vm_mm = current->mm;
 			vma->vm_end = PAGE_SIZE;
 			vma->vm_page_prot = __pgprot(pgprot_val(PAGE_READONLY) | _PAGE_MA_NAT);
+4 −2
@@ -246,6 +246,7 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 	vma->vm_start = vma->vm_end - PAGE_SIZE;
 	vma->vm_flags = VM_STACK_FLAGS;
 	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
+	INIT_LIST_HEAD(&vma->anon_vma_chain);
 	err = insert_vm_struct(mm, vma);
 	if (err)
 		goto err;
@@ -516,7 +517,8 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 	/*
 	 * cover the whole range: [new_start, old_end)
 	 */
-	vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL);
+	if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
+		return -ENOMEM;
 
 	/*
 	 * move the page tables downwards, on failure we rely on
@@ -547,7 +549,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 	tlb_finish_mmu(tlb, new_end, old_end);
 
 	/*
-	 * shrink the vma to just the new range.
+	 * Shrink the vma to just the new range.  Always succeeds.
 	 */
 	vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
 
+5 −1
@@ -97,7 +97,11 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_NORESERVE	0x00200000	/* should the VM suppress accounting */
 #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
 #define VM_NONLINEAR	0x00800000	/* Is non-linear (remap_file_pages) */
+#ifdef CONFIG_MMU
+#define VM_LOCK_RMAP	0x01000000	/* Do not follow this rmap (mmu mmap) */
+#else
 #define VM_MAPPED_COPY	0x01000000	/* T if mapped copy of data (nommu mmap) */
+#endif
 #define VM_INSERTPAGE	0x02000000	/* The vma has had "vm_insert_page()" done on it */
 #define VM_ALWAYSDUMP	0x04000000	/* Always include in core dumps */
 
@@ -1216,7 +1220,7 @@ static inline void vma_nonlinear_insert(struct vm_area_struct *vma,
 
 /* mmap.c */
 extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
-extern void vma_adjust(struct vm_area_struct *vma, unsigned long start,
+extern int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert);
 extern struct vm_area_struct *vma_merge(struct mm_struct *,
 	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
+2 −1
@@ -163,7 +163,8 @@ struct vm_area_struct {
 	 * can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack
 	 * or brk vma (with NULL file) can only be in an anon_vma list.
 	 */
-	struct list_head anon_vma_node;	/* Serialized by anon_vma->lock */
+	struct list_head anon_vma_chain; /* Serialized by mmap_sem &
+					  * page_table_lock */
 	struct anon_vma *anon_vma;	/* Serialized by page_table_lock */
 
 	/* Function pointers to deal with this struct. */