
Commit 73154383 authored by Linus Torvalds

Merge branch 'akpm' (incoming from Andrew)

Merge first batch of fixes from Andrew Morton:

 - A couple of kthread changes

 - A few minor audit patches

 - A number of fbdev patches.  Florian remains AWOL so I'm picking up
   some of these.

 - A few kbuild things

 - ocfs2 updates

 - Almost all of the MM queue

(And in the meantime, I already have the second big batch from Andrew
pending in my mailbox ;^)

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (149 commits)
  memcg: take reference before releasing rcu_read_lock
  mem hotunplug: fix kfree() of bootmem memory
  mmKconfig: add an option to disable bounce
  mm, nobootmem: do memset() after memblock_reserve()
  mm, nobootmem: clean-up of free_low_memory_core_early()
  fs/buffer.c: remove unnecessary init operation after allocating buffer_head.
  numa, cpu hotplug: change links of CPU and node when changing node number by onlining CPU
  mm: fix memory_hotplug.c printk format warning
  mm: swap: mark swap pages writeback before queueing for direct IO
  swap: redirty page if page write fails on swap file
  mm, memcg: give exiting processes access to memory reserves
  thp: fix huge zero page logic for page with pfn == 0
  memcg: avoid accessing memcg after releasing reference
  fs: fix fsync() error reporting
  memblock: fix missing comment of memblock_insert_region()
  mm: Remove unused parameter of pages_correctly_reserved()
  firmware, memmap: fix firmware_map_entry leak
  mm/vmstat: add note on safety of drain_zonestat
  mm: thp: add split tail pages to shrink page list in page reclaim
  mm: allow for outstanding swap writeback accounting
  ...
parents 362ed48d ca0dde97
+69 −1
@@ -40,6 +40,7 @@ Features:
 - soft limit
 - moving (recharging) account at moving a task is selectable.
 - usage threshold notifier
 - memory pressure notifier
 - oom-killer disable knob and oom-notifier
 - Root cgroup has no limit controls.

@@ -65,6 +66,7 @@ Brief summary of control files.
 memory.stat			 # show various statistics
 memory.use_hierarchy		 # set/show hierarchical account enabled
 memory.force_empty		 # trigger forced move charge to parent
 memory.pressure_level		 # set memory pressure notifications
 memory.swappiness		 # set/show swappiness parameter of vmscan
				 (See sysctl's vm.swappiness)
 memory.move_charge_at_immigrate # set/show controls of moving charges
@@ -762,7 +764,73 @@ At reading, current status of OOM is shown.
	under_oom	 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
				 be stopped.)

11. TODO
11. Memory Pressure

The pressure level notifications can be used to monitor the memory
allocation cost; based on the pressure, applications can implement
different strategies of managing their memory resources. The pressure
levels are defined as follows:

The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache level. Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (e.g. shut
down unimportant services early).

The "medium" level means that the system is experiencing medium memory
pressure; the system might be swapping, paging out active file caches,
etc. Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.

The "critical" level means that the system is actively thrashing, it is
about to run out of memory (OOM), or even the in-kernel OOM killer is on
its way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.

The events are propagated upward until the event is handled, i.e. the
events are not pass-through. For example, suppose you have three cgroups,
A->B->C, you set up an event listener on each of A, B and C, and group C
experiences some pressure. In this situation, only group C will receive
the notification; groups A and B will not. This is done to avoid excessive
"broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing. So, organize the
cgroups wisely, or propagate the events manually (or ask us to implement
pass-through events, explaining why you would need them).
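
The propagation rule above can be modelled in a few lines of userspace C.
This is only an illustrative sketch, not kernel code: struct cgroup_node,
its has_listener flag and deliver_pressure_event() are hypothetical names
invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of a memory cgroup in a hierarchy. */
struct cgroup_node {
	const char *name;
	struct cgroup_node *parent;	/* NULL at the top of the hierarchy */
	int has_listener;		/* nonzero if an eventfd is registered */
};

/*
 * Walk from the group under pressure toward the root and return the
 * first group that has a listener. The walk stops there, so ancestors
 * of that group do not see the event. Returns NULL if nobody listens.
 */
const struct cgroup_node *
deliver_pressure_event(const struct cgroup_node *origin)
{
	const struct cgroup_node *g;

	assert(origin != NULL);
	for (g = origin; g != NULL; g = g->parent)
		if (g->has_listener)
			return g;
	return NULL;
}
```

With listeners on A, B and C, an event raised in C is returned for C only.
If C has no listener, the same event is handled by B instead.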

The file memory.pressure_level is only used to set up an eventfd. To
register a notification, an application must:

- create an eventfd using eventfd(2);
- open memory.pressure_level;
- write a string like "<event_fd> <fd of memory.pressure_level> <level>"
  to cgroup.event_control.

An application will be notified through the eventfd when memory pressure
is at the specified level (or higher). Read/write operations on
memory.pressure_level are not implemented.
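
The three registration steps can be sketched in C roughly as follows. This
is a minimal sketch under assumptions: struct pressure_listener,
format_event_control() and pressure_listener_open() are hypothetical names
made up for this example. The descriptor for memory.pressure_level is
deliberately kept open while listening, since this text does not say
whether it may be closed after registration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical handle: both descriptors stay open while listening. */
struct pressure_listener {
	int efd;	/* eventfd that becomes readable on a notification */
	int pfd;	/* open descriptor for memory.pressure_level */
};

/*
 * Build "<event_fd> <fd of memory.pressure_level> <level>".
 * Returns 0 on success, -1 if the result does not fit in buf.
 */
int format_event_control(char *buf, size_t len, int efd, int pfd,
			 const char *level)
{
	int n;

	assert(buf != NULL && level != NULL);
	n = snprintf(buf, len, "%d %d %s", efd, pfd, level);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Register for `level` ("low", "medium" or "critical") in the cgroup
 * directory `dir`. Returns 0 on success, -1 on any failure.
 */
int pressure_listener_open(struct pressure_listener *l, const char *dir,
			   const char *level)
{
	char path[4096], line[64];
	int cfd;

	l->pfd = -1;
	l->efd = eventfd(0, 0);			/* step 1 */
	if (l->efd < 0)
		return -1;

	snprintf(path, sizeof(path), "%s/memory.pressure_level", dir);
	l->pfd = open(path, O_RDONLY);		/* step 2 */
	if (l->pfd < 0)
		goto fail;

	snprintf(path, sizeof(path), "%s/cgroup.event_control", dir);
	cfd = open(path, O_WRONLY);		/* step 3 */
	if (cfd < 0)
		goto fail;
	if (format_event_control(line, sizeof(line), l->efd, l->pfd, level) ||
	    write(cfd, line, strlen(line)) < 0) {
		close(cfd);
		goto fail;
	}
	close(cfd);
	return 0;

fail:
	if (l->pfd >= 0)
		close(l->pfd);
	if (l->efd >= 0)
		close(l->efd);
	l->efd = l->pfd = -1;
	return -1;
}
```

After a successful call, a blocking read(2) of 8 bytes from efd returns
each time a notification arrives, as with any eventfd.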

Test:

   Here is a small script example that makes a new cgroup, sets up a
   memory limit, sets up a notification in the cgroup and then makes the
   child cgroup experience a critical pressure:

   # cd /sys/fs/cgroup/memory/
   # mkdir foo
   # cd foo
   # cgroup_event_listener memory.pressure_level low &
   # echo 8000000 > memory.limit_in_bytes
   # echo 8000000 > memory.memsw.limit_in_bytes
   # echo $$ > tasks
   # dd if=/dev/zero | read x

   (Expect a bunch of notifications, and eventually, the oom-killer will
   trigger.)

12. TODO

1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first
+50 −0
@@ -18,6 +18,7 @@ files can be found in mm/swap.c.

Currently, these files are in /proc/sys/vm:

- admin_reserve_kbytes
- block_dump
- compact_memory
- dirty_background_bytes
@@ -53,11 +54,41 @@ Currently, these files are in /proc/sys/vm:
- percpu_pagelist_fraction
- stat_interval
- swappiness
- user_reserve_kbytes
- vfs_cache_pressure
- zone_reclaim_mode

==============================================================

admin_reserve_kbytes

The amount of free memory in the system that should be reserved for users
with the capability cap_sys_admin.

admin_reserve_kbytes defaults to min(3% of free pages, 8MB)

That should provide enough for the admin to log in and kill a process,
if necessary, under the default overcommit 'guess' mode.

Systems running under overcommit 'never' should increase this to account
for the full Virtual Memory Size of programs used to recover. Otherwise,
root may not be able to log in to recover the system.

How do you calculate a minimum useful reserve?

sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

For overcommit 'guess', we can sum resident set sizes (RSS).
On x86_64 this is about 8MB.

For overcommit 'never', we can take the max of their virtual sizes (VSZ)
and add the sum of their RSS.
On x86_64 this is about 128MB.
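
The two estimates above can be written out as a short calculation. This is
an illustrative sketch only: struct proc_mem and both function names are
hypothetical, and the RSS/VSZ figures would come from measuring the real
recovery programs (for example with ps).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one recovery program's memory use, in kB. */
struct proc_mem {
	const char *name;
	unsigned long rss_kb;	/* resident set size */
	unsigned long vsz_kb;	/* virtual memory size */
};

/* Overcommit 'guess': sum of the resident set sizes. */
unsigned long reserve_for_guess_kb(const struct proc_mem *procs, size_t n)
{
	unsigned long sum = 0;
	size_t i;

	assert(procs != NULL || n == 0);
	for (i = 0; i < n; i++)
		sum += procs[i].rss_kb;
	return sum;
}

/* Overcommit 'never': largest virtual size plus the sum of the RSS. */
unsigned long reserve_for_never_kb(const struct proc_mem *procs, size_t n)
{
	unsigned long max_vsz = 0;
	size_t i;

	assert(procs != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (procs[i].vsz_kb > max_vsz)
			max_vsz = procs[i].vsz_kb;
	return max_vsz + reserve_for_guess_kb(procs, n);
}
```

The result is in kB, the same unit as admin_reserve_kbytes.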

Changing this takes effect whenever an application requests memory.

==============================================================

block_dump

block_dump enables block I/O debugging when set to a nonzero value. More
@@ -542,6 +573,7 @@ memory until it actually runs out.

When this flag is 2, the kernel uses a "never overcommit"
policy that attempts to prevent any overcommit of memory.
Note that user_reserve_kbytes affects this policy.

This feature can be very useful because there are a lot of
programs that malloc() huge amounts of memory "just-in-case"
@@ -645,6 +677,24 @@ The default value is 60.

==============================================================

- user_reserve_kbytes

When overcommit_memory is set to 2, "never overcommit" mode, reserve
min(3% of current process size, user_reserve_kbytes) of free memory.
This is intended to prevent a user from starting a single memory hogging
process, such that they cannot recover (kill the hog).

user_reserve_kbytes defaults to min(3% of the current process size, 128MB).

If this is reduced to zero, then the user will be allowed to allocate
all free memory with a single process, minus admin_reserve_kbytes.
Any subsequent attempts to execute a command will result in
"fork: Cannot allocate memory".

Changing this takes effect whenever an application requests memory.
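
As a worked example of the formula above, the reserve for one process could
be computed like this. This is a sketch that applies the documented "3%"
literally; the function name is hypothetical and the kernel's own
arithmetic may round differently.

```c
#include <assert.h>

/*
 * Reserve, in kB, for a process of size process_kb under overcommit
 * 'never': min(3% of current process size, user_reserve_kbytes).
 */
unsigned long user_reserve_kb(unsigned long process_kb,
			      unsigned long user_reserve_kbytes)
{
	unsigned long three_percent =
		(unsigned long)((unsigned long long)process_kb * 3 / 100);
	unsigned long reserve = three_percent < user_reserve_kbytes ?
				three_percent : user_reserve_kbytes;

	/* The reserve can never exceed the sysctl value. */
	assert(reserve <= user_reserve_kbytes);
	return reserve;
}
```

For a process of 1000000 kB with user_reserve_kbytes at 131072 (128MB),
the 3% term (30000 kB) is the smaller one. With user_reserve_kbytes set to
zero the reserve is zero, which matches the case described above.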

==============================================================

vfs_cache_pressure
------------------

+7 −1
@@ -8,7 +8,9 @@ The Linux kernel supports the following overcommit handling modes
		default.

1	-	Always overcommit. Appropriate for some scientific
		applications.
		applications. A classic example is code using sparse arrays
		and just relying on the virtual memory consisting almost
		entirely of zero pages.

2	-	Don't overcommit. The total address space commit
		for the system is not permitted to exceed swap + a
@@ -18,6 +20,10 @@ The Linux kernel supports the following overcommit handling modes
		pages but will receive errors on memory allocation as
		appropriate.

		Useful for applications that want to guarantee their
		memory allocations will be available in the future
		without having to initialize every page.

The overcommit policy is set via the sysctl `vm.overcommit_memory'.

The overcommit percentage is set via `vm.overcommit_ratio'.
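
The sparse-array case mentioned under mode 1 can be illustrated with an
anonymous mapping. This is a minimal sketch, assuming a platform where
all-zero bytes represent 0.0 for double (true for IEEE 754);
sparse_array_alloc() and sparse_array_free() are hypothetical helper names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for MAP_ANONYMOUS on strict compilers */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Map n_elems doubles of zero-filled anonymous memory. Untouched pages
 * consume address space but no physical memory until first written.
 * Returns NULL on failure, which is the expected outcome for very large
 * sizes when the kernel refuses to overcommit.
 */
double *sparse_array_alloc(size_t n_elems)
{
	void *p;

	assert(n_elems > 0);
	if (n_elems > SIZE_MAX / sizeof(double))
		return NULL;
	p = mmap(NULL, n_elems * sizeof(double), PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}

/* Release an array obtained from sparse_array_alloc(). */
void sparse_array_free(double *a, size_t n_elems)
{
	munmap(a, n_elems * sizeof(double));
}
```

A caller can index anywhere in the array; only the pages it actually
writes are backed by real memory.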
+2 −3
@@ -185,7 +185,6 @@ nautilus_machine_check(unsigned long vector, unsigned long la_ptr)
	mb();
}

extern void free_reserved_mem(void *, void *);
extern void pcibios_claim_one_bus(struct pci_bus *);

static struct resource irongate_io = {
@@ -239,8 +238,8 @@ nautilus_init_pci(void)
	if (pci_mem < memtop)
		memtop = pci_mem;
	if (memtop > alpha_mv.min_mem_address) {
		free_reserved_mem(__va(alpha_mv.min_mem_address),
				  __va(memtop));
		free_reserved_area((unsigned long)__va(alpha_mv.min_mem_address),
				   (unsigned long)__va(memtop), 0, NULL);
		printk("nautilus_init_pci: %ldk freed\n",
			(memtop - alpha_mv.min_mem_address) >> 10);
	}
+3 −21
@@ -31,6 +31,7 @@
#include <asm/console.h>
#include <asm/tlb.h>
#include <asm/setup.h>
#include <asm/sections.h>

extern void die_if_kernel(char *,struct pt_regs *,long);

@@ -281,8 +282,6 @@ printk_memory_info(void)
{
	unsigned long codesize, reservedpages, datasize, initsize, tmp;
	extern int page_is_ram(unsigned long) __init;
	extern char _text, _etext, _data, _edata;
	extern char __init_begin, __init_end;

	/* printk all informations */
	reservedpages = 0;
@@ -317,33 +316,16 @@ mem_init(void)
}
#endif /* CONFIG_DISCONTIGMEM */

void
free_reserved_mem(void *start, void *end)
{
	void *__start = start;
	for (; __start < end; __start += PAGE_SIZE) {
		ClearPageReserved(virt_to_page(__start));
		init_page_count(virt_to_page(__start));
		free_page((long)__start);
		totalram_pages++;
	}
}

void
free_initmem(void)
{
	extern char __init_begin, __init_end;

	free_reserved_mem(&__init_begin, &__init_end);
	printk ("Freeing unused kernel memory: %ldk freed\n",
		(&__init_end - &__init_begin) >> 10);
	free_initmem_default(0);
}

#ifdef CONFIG_BLK_DEV_INITRD
void
free_initrd_mem(unsigned long start, unsigned long end)
{
	free_reserved_mem((void *)start, (void *)end);
	printk ("Freeing initrd memory: %ldk freed\n", (end - start) >> 10);
	free_reserved_area(start, end, 0, "initrd");
}
#endif