Merge branch 'akpm' (incoming from Andrew) (73154383) · Commits · e / devices / android_kernel_teracube_2e

Documentation/cgroups/memory.txt

+69 −1

Original line number	Diff line number	Diff line
		@@ -40,6 +40,7 @@ Features:
		- soft limit
		- moving (recharging) account at moving a task is selectable.
		- usage threshold notifier
		- memory pressure notifier
		- oom-killer disable knob and oom-notifier
		- Root cgroup has no limit controls.

		@@ -65,6 +66,7 @@ Brief summary of control files.
		memory.stat # show various statistics
		memory.use_hierarchy # set/show hierarchical account enabled
		memory.force_empty # trigger forced move charge to parent
		memory.pressure_level # set memory pressure notifications
		memory.swappiness # set/show swappiness parameter of vmscan
		(See sysctl's vm.swappiness)
		memory.move_charge_at_immigrate # set/show controls of moving charges
		@@ -762,7 +764,73 @@ At reading, current status of OOM is shown.
		under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
		be stopped.)

		11. TODO
		11. Memory Pressure

		The pressure level notifications can be used to monitor the memory
		allocation cost; based on the pressure, applications can implement
		different strategies of managing their memory resources. The pressure
		levels are defined as follows:

		The "low" level means that the system is reclaiming memory for new
		allocations. Monitoring this reclaiming activity might be useful for
		maintaining cache level. Upon notification, the program (typically
		"Activity Manager") might analyze vmstat and act in advance (e.g.
		shut down unimportant services early).

		The "medium" level means that the system is experiencing medium memory
		pressure; the system might be swapping, paging out active file caches,
		etc. Upon this event applications may decide to further analyze
		vmstat/zoneinfo/memcg or internal memory usage statistics and free any
		resources that can be easily reconstructed or re-read from a disk.

		The "critical" level means that the system is actively thrashing; it is
		about to run out of memory (OOM) or even the in-kernel OOM killer is on its
		way to trigger. Applications should do whatever they can to help the
		system. It might be too late to consult with vmstat or any other
		statistics, so it's advisable to take an immediate action.

		The events are propagated upward until the event is handled, i.e. the
		events are not pass-through. Here is what this means: for example you have
		three cgroups: A->B->C. Now you set up an event listener on cgroups A, B
		and C, and suppose group C experiences some pressure. In this situation,
		only group C will receive the notification, i.e. groups A and B will not
		receive it. This is done to avoid excessive "broadcasting" of messages,
		which disturbs the system and which is especially bad if we are low on
		memory or thrashing. So, organize the cgroups wisely, or propagate the
		events manually (or ask us to implement the pass-through events,
		explaining why you would need them).
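
The propagation rule above can be pictured with a small user-space model.
This is only a sketch: struct group and deliver_pressure_event() are
hypothetical names, not kernel structures or functions, and the model
ignores pressure levels and treats any registered listener as handling the
event.

```c
#include <stddef.h>

/* Toy model of a cgroup; hypothetical, not a kernel structure. */
struct group {
	struct group *parent;	/* NULL for the topmost group */
	int has_listener;	/* an event listener is registered here */
	int notified;		/* number of events delivered here */
};

/*
 * Deliver a pressure event that originated in @origin. Walk upward and
 * stop at the first group that has a listener: the event is handled
 * there and is not passed through to its ancestors.
 * Returns the group that received the event, or NULL if nobody listens.
 */
static struct group *deliver_pressure_event(struct group *origin)
{
	struct group *g;

	for (g = origin; g != NULL; g = g->parent) {
		if (g->has_listener) {
			g->notified++;
			return g;
		}
	}
	return NULL;
}
```

With listeners on A, B and C, an event that originates in C stops at C. If
C had no listener, the same event would be delivered to B instead, and A
would still see nothing.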

		The file memory.pressure_level is only used to set up an eventfd. To
		register a notification, an application must:

		- create an eventfd using eventfd(2);
		- open memory.pressure_level;
		- write a string like "<event_fd> <fd of memory.pressure_level> <level>"
		to cgroup.event_control.

		The application will be notified through eventfd when memory pressure is at
		the specific level (or higher). Read/write operations to
		memory.pressure_level are not implemented.
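
The three registration steps might look roughly like this in C. This is a
minimal sketch, not the cgroup_event_listener tool: the helper names
format_event_control() and register_pressure_listener() are made up for
illustration, the cgroup directory is supplied by the caller, and closing
the two file descriptors right after registration is an assumption of this
sketch.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/eventfd.h>

/* Build the "event_fd pressure_level_fd level" control string. */
static int format_event_control(char *buf, size_t len, int efd, int pfd,
				const char *level)
{
	int n = snprintf(buf, len, "%d %d %s", efd, pfd, level);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Register for pressure notifications at @level ("low", "medium" or
 * "critical") on the memory cgroup mounted at @cgroup_dir.
 * Returns the eventfd to read from, or -1 on error.
 */
static int register_pressure_listener(const char *cgroup_dir, const char *level)
{
	char path[4096], line[64];
	int efd, pfd = -1, cfd = -1, n;

	efd = eventfd(0, 0);			/* step 1: create an eventfd */
	if (efd < 0)
		return -1;

	snprintf(path, sizeof(path), "%s/memory.pressure_level", cgroup_dir);
	pfd = open(path, O_RDONLY);		/* step 2: open pressure_level */
	if (pfd < 0)
		goto fail;

	snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup_dir);
	cfd = open(path, O_WRONLY);
	if (cfd < 0)
		goto fail;

	/* step 3: write the control string to cgroup.event_control */
	n = format_event_control(line, sizeof(line), efd, pfd, level);
	if (n < 0 || write(cfd, line, (size_t)n) != n)
		goto fail;

	/* Assumption: these are only needed while registering. */
	close(cfd);
	close(pfd);
	return efd;

fail:
	if (cfd >= 0)
		close(cfd);
	if (pfd >= 0)
		close(pfd);
	close(efd);
	return -1;
}
```

A caller would then block in read(2) on the returned descriptor, reading an
8-byte counter; each successful read means at least one notification
arrived.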

		Test:

		Here is a small example script that makes a new cgroup, sets up a
		memory limit, sets up a notification in the cgroup and then makes the
		cgroup experience critical pressure:

		# cd /sys/fs/cgroup/memory/
		# mkdir foo
		# cd foo
		# cgroup_event_listener memory.pressure_level low &
		# echo 8000000 > memory.limit_in_bytes
		# echo 8000000 > memory.memsw.limit_in_bytes
		# echo $$ > tasks
		# dd if=/dev/zero | read x

		(Expect a bunch of notifications, and eventually, the oom-killer will
		trigger.)

		12. TODO

		1. Add support for accounting huge pages (as a separate controller)
		2. Make per-cgroup scanner reclaim not-shared pages first

Documentation/sysctl/vm.txt

+50 −0

		@@ -18,6 +18,7 @@ files can be found in mm/swap.c.

		Currently, these files are in /proc/sys/vm:

		- admin_reserve_kbytes
		- block_dump
		- compact_memory
		- dirty_background_bytes
		@@ -53,11 +54,41 @@ Currently, these files are in /proc/sys/vm:
		- percpu_pagelist_fraction
		- stat_interval
		- swappiness
		- user_reserve_kbytes
		- vfs_cache_pressure
		- zone_reclaim_mode

		==============================================================

		admin_reserve_kbytes

		The amount of free memory in the system that should be reserved for users
		with the capability cap_sys_admin.

		admin_reserve_kbytes defaults to min(3% of free pages, 8MB)

		That should provide enough for the admin to log in and kill a process,
		if necessary, under the default overcommit 'guess' mode.

		Systems running under overcommit 'never' should increase this to account
		for the full Virtual Memory Size of programs used to recover. Otherwise,
		root may not be able to log in to recover the system.

		How do you calculate a minimum useful reserve?

		sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

		For overcommit 'guess', we can sum resident set sizes (RSS).
		On x86_64 this is about 8MB.

		For overcommit 'never', we can take the max of their virtual sizes (VSZ)
		and add the sum of their RSS.
		On x86_64 this is about 128MB.

		Changing this takes effect whenever an application requests memory.
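
The arithmetic above can be written out as a short user-space sketch. The
function and struct names are hypothetical, the 3% figure is taken from the
text rather than from the kernel source, and the per-process numbers a
caller passes in would have to be measured (for example with ps).

```c
#include <stddef.h>

/* Memory footprint of one recovery tool, in kB (hypothetical type). */
struct proc_mem {
	unsigned long rss_kb;	/* resident set size */
	unsigned long vsz_kb;	/* virtual memory size */
};

/* Default as described in the text: min(3% of free pages, 8MB). */
static unsigned long default_admin_reserve_kb(unsigned long free_kb)
{
	unsigned long three_percent = free_kb / 100 * 3;
	unsigned long cap_kb = 8UL * 1024;

	return three_percent < cap_kb ? three_percent : cap_kb;
}

/* Overcommit 'guess': sum the resident set sizes. */
static unsigned long reserve_for_guess_kb(const struct proc_mem *p, size_t n)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += p[i].rss_kb;
	return sum;
}

/* Overcommit 'never': largest virtual size plus the sum of the RSS. */
static unsigned long reserve_for_never_kb(const struct proc_mem *p, size_t n)
{
	unsigned long max_vsz = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (p[i].vsz_kb > max_vsz)
			max_vsz = p[i].vsz_kb;
	return max_vsz + reserve_for_guess_kb(p, n);
}
```

Feeding in the RSS and VSZ of sshd, a shell and top gives a lower bound for
admin_reserve_kbytes under each overcommit mode.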

		==============================================================

		block_dump

		block_dump enables block I/O debugging when set to a nonzero value. More
		@@ -542,6 +573,7 @@ memory until it actually runs out.

		When this flag is 2, the kernel uses a "never overcommit"
		policy that attempts to prevent any overcommit of memory.
		Note that user_reserve_kbytes affects this policy.

		This feature can be very useful because there are a lot of
		programs that malloc() huge amounts of memory "just-in-case"
		@@ -645,6 +677,24 @@ The default value is 60.

		==============================================================

		- user_reserve_kbytes

		When overcommit_memory is set to 2, "never overcommit" mode, reserve
		min(3% of current process size, user_reserve_kbytes) of free memory.
		This is intended to prevent a user from starting a single memory hogging
		process, such that they cannot recover (kill the hog).

		user_reserve_kbytes defaults to min(3% of the current process size, 128MB).

		If this is reduced to zero, then the user will be allowed to allocate
		all free memory with a single process, minus admin_reserve_kbytes.
		Any subsequent attempts to execute a command will result in
		"fork: Cannot allocate memory".

		Changing this takes effect whenever an application requests memory.
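
As a rough illustration of the min() rule above, the reserve that applies
to a given process could be computed as follows. The function name is
hypothetical and the 3% factor follows the text, not the kernel source.

```c
/*
 * Reserve applied in "never overcommit" mode, per the text:
 * min(3% of current process size, user_reserve_kbytes). All values in kB.
 */
static unsigned long effective_user_reserve_kb(unsigned long process_size_kb,
					       unsigned long user_reserve_kbytes)
{
	unsigned long three_percent = process_size_kb / 100 * 3;

	return three_percent < user_reserve_kbytes ?
	       three_percent : user_reserve_kbytes;
}
```

For small processes the 3% term dominates, for large ones the sysctl value
caps the reserve, and setting the sysctl to zero removes the reserve
entirely.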

		==============================================================

		vfs_cache_pressure
		------------------

Documentation/vm/overcommit-accounting

+7 −1

		@@ -8,7 +8,9 @@ The Linux kernel supports the following overcommit handling modes
		default.

		1 - Always overcommit. Appropriate for some scientific
		applications.
		applications. A classic example is code using sparse arrays
		and just relying on the virtual memory consisting almost
		entirely of zero pages.

		2 - Don't overcommit. The total address space commit
		for the system is not permitted to exceed swap + a
		@@ -18,6 +20,10 @@ The Linux kernel supports the following overcommit handling modes
		pages but will receive errors on memory allocation as
		appropriate.

		Useful for applications that want to guarantee their
		memory allocations will be available in the future
		without having to initialize every page.

		The overcommit policy is set via the sysctl `vm.overcommit_memory'.

		The overcommit percentage is set via `vm.overcommit_ratio'.

arch/alpha/kernel/sys_nautilus.c

+2 −3

		@@ -185,7 +185,6 @@ nautilus_machine_check(unsigned long vector, unsigned long la_ptr)
		mb();
		}

		extern void free_reserved_mem(void *, void *);
		extern void pcibios_claim_one_bus(struct pci_bus *);

		static struct resource irongate_io = {
		@@ -239,8 +238,8 @@ nautilus_init_pci(void)
		if (pci_mem < memtop)
		memtop = pci_mem;
		if (memtop > alpha_mv.min_mem_address) {
		free_reserved_mem(__va(alpha_mv.min_mem_address),
		__va(memtop));
		free_reserved_area((unsigned long)__va(alpha_mv.min_mem_address),
		(unsigned long)__va(memtop), 0, NULL);
		printk("nautilus_init_pci: %ldk freed\n",
		(memtop - alpha_mv.min_mem_address) >> 10);
		}

arch/alpha/mm/init.c

+3 −21

		@@ -31,6 +31,7 @@
		#include <asm/console.h>
		#include <asm/tlb.h>
		#include <asm/setup.h>
		#include <asm/sections.h>

		extern void die_if_kernel(char *,struct pt_regs *,long);

		@@ -281,8 +282,6 @@ printk_memory_info(void)
		{
		unsigned long codesize, reservedpages, datasize, initsize, tmp;
		extern int page_is_ram(unsigned long) __init;
		extern char _text, _etext, _data, _edata;
		extern char __init_begin, __init_end;

		/* printk all informations */
		reservedpages = 0;
		@@ -317,33 +316,16 @@ mem_init(void)
		}
		#endif /* CONFIG_DISCONTIGMEM */

		void
		free_reserved_mem(void *start, void *end)
		{
		void *__start = start;
		for (; __start < end; __start += PAGE_SIZE) {
		ClearPageReserved(virt_to_page(__start));
		init_page_count(virt_to_page(__start));
		free_page((long)__start);
		totalram_pages++;
		}
		}

		void
		free_initmem(void)
		{
		extern char __init_begin, __init_end;

		free_reserved_mem(&__init_begin, &__init_end);
		printk ("Freeing unused kernel memory: %ldk freed\n",
		(&__init_end - &__init_begin) >> 10);
		free_initmem_default(0);
		}

		#ifdef CONFIG_BLK_DEV_INITRD
		void
		free_initrd_mem(unsigned long start, unsigned long end)
		{
		free_reserved_mem((void *)start, (void *)end);
		printk ("Freeing initrd memory: %ldk freed\n", (end - start) >> 10);
		free_reserved_area(start, end, 0, "initrd");
		}
		#endif