
Commit 9c2b957d authored by Linus Torvalds

Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf events changes for v3.4 from Ingo Molnar:

 - New "hardware based branch profiling" feature both on the kernel and
   the tooling side, on CPUs that support it.  (modern x86 Intel CPUs
   with the 'LBR' hardware feature currently.)

   This new feature is basically a sophisticated 'magnifying glass' for
   branch execution - something that is pretty difficult to extract from
   regular, function histogram centric profiles.

   The simplest mode is activated via 'perf record -b', and the result
   looks like this in perf report:

	$ perf record -b any_call,u -e cycles:u branchy

	$ perf report -b --sort=symbol
	    52.34%  [.] main                   [.] f1
	    24.04%  [.] f1                     [.] f3
	    23.60%  [.] f1                     [.] f2
	     0.01%  [k] _IO_new_file_xsputn    [k] _IO_file_overflow
	     0.01%  [k] _IO_vfprintf_internal  [k] _IO_new_file_xsputn
	     0.01%  [k] _IO_vfprintf_internal  [k] strchrnul
	     0.01%  [k] __printf               [k] _IO_vfprintf_internal
	     0.01%  [k] main                   [k] __printf

   This output shows from/to branch columns and shows the highest
   percentage (from,to) jump combinations - i.e.  the most likely taken
   branches in the system.  "branches" can also include function calls
   and any other synchronous and asynchronous transitions of the
   instruction pointer that are not 'next instruction' - such as system
   calls, traps, interrupts, etc.

   This feature comes with (hopefully intuitive) flat ASCII and TUI
   support in perf report.

 - Various 'perf annotate' visual improvements for us assembly junkies.
   It will now recognize function calls in the TUI and by hitting enter
   you can follow the call (recursively) and back, amongst other
   improvements.

 - Multiple threads/processes recording support in perf record, perf
   stat, perf top - which is activated via a comma-list of PIDs:

	perf top -p 21483,21485
	perf stat -p 21483,21485 -ddd
	perf record -p 21483,21485

 - Support for per UID views, via the --uid parameter to perf top, perf
   report, etc.  For example 'perf top --uid mingo' will only show the
   tasks that I am running, excluding other users, root, etc.

 - Jump label restructurings and improvements - this includes the
   factoring out of the (hopefully much clearer) include/linux/static_key.h
   generic facility:

	struct static_key key = STATIC_KEY_INIT_FALSE;

	...

	if (static_key_false(&key))
	        do unlikely code
	else
	        do likely code

	...
	static_key_slow_inc(&key);
	...
	static_key_slow_dec(&key);
	...

   The static_key_false() branch will be generated into the code with as
   little impact on the likely code path as possible.  The
   static_key_slow_*() APIs flip the branch via live kernel code patching.

   This facility can now be used more widely within the kernel to
   micro-optimize hot branches whose likelihood matches the static-key
   usage and fast/slow cost patterns.

 - SW function tracer improvements: perf support and filtering support.

 - Various hardenings of the perf.data ABI, to make older perf.data
   files work more smoothly with newer tool versions, to make new
   features integrate more smoothly, to support cross-endian
   recording/analyzing workflows better, etc.

 - Restructuring of the kprobes code, the splitting out of 'optprobes',
   and a corner case bugfix.

 - Allow the tracing of kernel console output (printk).

 - Improvements/fixes to user-space RDPMC support, allowing user-space
   self-profiling code to extract PMU counts without performing any
   system calls, while playing nice with the kernel side (a user-space
   sketch follows after this list).

 - 'perf bench' improvements

 - ... and lots of internal restructurings, cleanups and fixes that made
   these features possible.  And, as usual, this list is incomplete as
   there were also lots of other improvements.
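
As an illustration of the user-space RDPMC self-profiling mentioned above,
here is a minimal sketch (not part of this merge; the event choice, the
generic __sync_synchronize() barriers and the omission of capability checks
are simplifying assumptions) that counts its own instructions via the
mmap'ed perf_event page and the x86 RDPMC instruction:

	#include <linux/perf_event.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <string.h>
	#include <stdint.h>
	#include <stdio.h>

	/* Read hardware counter 'idx' directly, no system call involved. */
	static uint64_t rdpmc(uint32_t idx)
	{
		uint32_t lo, hi;
		asm volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(idx));
		return (uint64_t)hi << 32 | lo;
	}

	int main(void)
	{
		struct perf_event_attr attr;
		struct perf_event_mmap_page *pc;
		uint64_t count;
		uint32_t seq, idx;
		int fd;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = PERF_TYPE_HARDWARE;
		attr.config = PERF_COUNT_HW_INSTRUCTIONS;
		attr.exclude_kernel = 1;

		/* Self-monitoring event for this task, any CPU. */
		fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
		if (fd < 0)
			return 1;

		/* The first mmap'ed page exports the seqlock, counter index
		 * and base offset needed for a syscall-free read. */
		pc = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);
		if (pc == MAP_FAILED)
			return 1;

		/* ... code to be measured would run here ... */

		do {				/* retry if the counter was rescheduled */
			seq = pc->lock;
			__sync_synchronize();
			idx = pc->index;	/* HW counter index + 1; 0 if RDPMC unusable */
			count = pc->offset;
			if (idx)
				count += rdpmc(idx - 1);
			__sync_synchronize();
		} while (pc->lock != seq);

		printf("instructions so far: %llu\n", (unsigned long long)count);
		return 0;
	}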

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (120 commits)
  perf report: Fix annotate double quit issue in branch view mode
  perf report: Remove duplicate annotate choice in branch view mode
  perf/x86: Prettify pmu config literals
  perf report: Enable TUI in branch view mode
  perf report: Auto-detect branch stack sampling mode
  perf record: Add HEADER_BRANCH_STACK tag
  perf record: Provide default branch stack sampling mode option
  perf tools: Make perf able to read files from older ABIs
  perf tools: Fix ABI compatibility bug in print_event_desc()
  perf tools: Enable reading of perf.data files from different ABI rev
  perf: Add ABI reference sizes
  perf report: Add support for taken branch sampling
  perf record: Add support for sampling taken branch
  perf tools: Add code to support PERF_SAMPLE_BRANCH_STACK
  x86/kprobes: Split out optprobe related code to kprobes-opt.c
  x86/kprobes: Fix a bug which can modify kernel code permanently
  x86/kprobes: Fix instruction recovery on optimized path
  perf: Add callback to flush branch_stack on context switch
  perf: Disable PERF_SAMPLE_BRANCH_* when not supported
  perf/x86: Add LBR software filter support for Intel CPUs
  ...
parents 0bbfcaff bea95c15
Documentation/lockup-watchdogs.txt
+63 −0
===============================================================
Softlockup detector and hardlockup detector (aka nmi_watchdog)
===============================================================

The Linux kernel can act as a watchdog to detect both soft and hard
lockups.

A 'softlockup' is defined as a bug that causes the kernel to loop in
kernel mode for more than 20 seconds (see "Implementation" below for
details), without giving other tasks a chance to run. The current
stack trace is displayed upon detection and, by default, the system
will stay locked up. Alternatively, the kernel can be configured to
panic; a sysctl, "kernel.softlockup_panic", a kernel parameter,
"softlockup_panic" (see "Documentation/kernel-parameters.txt" for
details), and a compile option, "BOOTPARAM_SOFTLOCKUP_PANIC", are
provided for this.

A 'hardlockup' is defined as a bug that causes the CPU to loop in
kernel mode for more than 10 seconds (see "Implementation" below for
details), without letting other interrupts have a chance to run.
Similarly to the softlockup case, the current stack trace is displayed
upon detection and the system will stay locked up unless the default
behavior is changed, which can be done through a compile time knob,
"BOOTPARAM_HARDLOCKUP_PANIC", and a kernel parameter, "nmi_watchdog"
(see "Documentation/kernel-parameters.txt" for details).

The panic option can be used in combination with panic_timeout (this
timeout is set through the confusingly named "kernel.panic" sysctl),
to cause the system to reboot automatically after a specified amount
of time.

=== Implementation ===

The soft and hard lockup detectors are built on top of the hrtimer and
perf subsystems, respectively. A direct consequence of this is that,
in principle, they should work in any architecture where these
subsystems are present.

A periodic hrtimer runs to generate interrupts and kick the watchdog
task. An NMI perf event is generated every "watchdog_thresh"
(compile-time initialized to 10 and configurable through sysctl of the
same name) seconds to check for hardlockups. If any CPU in the system
does not receive any hrtimer interrupt during that time the
'hardlockup detector' (the handler for the NMI perf event) will
generate a kernel warning or call panic, depending on the
configuration.

The watchdog task is a high priority kernel thread that updates a
timestamp every time it is scheduled. If that timestamp is not updated
for 2*watchdog_thresh seconds (the softlockup threshold) the
'softlockup detector' (coded inside the hrtimer callback function)
will dump useful debug information to the system log, after which it
will call panic if it was instructed to do so or resume execution of
other kernel code.

The period of the hrtimer is 2*watchdog_thresh/5 (4 seconds with the
default threshold of 10), which means it has two or three chances to
generate an interrupt before the hardlockup detector kicks in.

As explained above, a kernel knob is provided that allows
administrators to configure the period of the hrtimer and the perf
event. The right value for a particular environment is a trade-off
between fast response to lockups and detection overhead.

Documentation/nmi_watchdog.txt

deleted file mode 100644
+0 −83

[NMI watchdog is available for x86 and x86-64 architectures]

Is your system locking up unpredictably? No keyboard activity, just
a frustrating complete hard lockup? Do you want to help us debug
such lockups? If yes to all of these, then this document is definitely for you.

On many x86/x86-64 type hardware there is a feature that enables
us to generate 'watchdog NMI interrupts'.  (NMI: Non Maskable Interrupt,
which gets executed even if the system is otherwise locked up hard.)
This can be used to debug hard kernel lockups.  By executing periodic
NMI interrupts, the kernel can monitor whether any CPU has locked up,
and print out debugging messages if so.

In order to use the NMI watchdog, you need to have APIC support in your
kernel. For SMP kernels, APIC support gets compiled in automatically. For
UP, enable either CONFIG_X86_UP_APIC (Processor type and features -> Local
APIC support on uniprocessors) or CONFIG_X86_UP_IOAPIC (Processor type and
features -> IO-APIC support on uniprocessors) in your kernel config.
CONFIG_X86_UP_APIC is for uniprocessor machines without an IO-APIC.
CONFIG_X86_UP_IOAPIC is for uniprocessor with an IO-APIC. [Note: certain
kernel debugging options, such as Kernel Stack Meter or Kernel Tracer,
may implicitly disable the NMI watchdog.]

For x86-64, the needed APIC is always compiled in.

Using local APIC (nmi_watchdog=2) needs the first performance register, so
you can't use it for other purposes (such as high precision performance
profiling.) However, at least oprofile and the perfctr driver disable the
local APIC NMI watchdog automatically.

To actually enable the NMI watchdog, use the 'nmi_watchdog=N' boot
parameter.  Eg. the relevant lilo.conf entry:

        append="nmi_watchdog=1"

For SMP machines and UP machines with an IO-APIC use nmi_watchdog=1.
For UP machines without an IO-APIC use nmi_watchdog=2; this only works
for some processor types.  If in doubt, boot with nmi_watchdog=1 and
check the NMI count in /proc/interrupts; if the count is zero then
reboot with nmi_watchdog=2 and check the NMI count.  If it is still
zero then log a problem, you probably have a processor that needs to be
added to the nmi code.

A 'lockup' is the following scenario: if any CPU in the system does not
execute the periodic local timer interrupt for more than 5 seconds, then
the NMI handler generates an oops and kills the process. This
'controlled crash' (and the resulting kernel messages) can be used to
debug the lockup. Thus whenever the lockup happens, wait 5 seconds and
the oops will show up automatically. If the kernel produces no messages
then the system has crashed so hard (eg. hardware-wise) that either it
cannot even accept NMI interrupts, or the crash has made the kernel
unable to print messages.

Be aware that when using the local APIC, the frequency of NMI interrupts
it generates depends on the system load. The local APIC NMI watchdog,
lacking a better source, uses the "cycles unhalted" event. As you may
guess it doesn't tick when the CPU is in the halted state (which happens
when the system is idle), but if your system locks up on anything but the
"hlt" processor instruction, the watchdog will trigger very soon as the
"cycles unhalted" event will happen every clock tick. If it locks up on
"hlt", then you are out of luck -- the event will not happen at all and the
watchdog won't trigger. This is a shortcoming of the local APIC watchdog
-- unfortunately there is no "clock ticks" event that would work all the
time. The I/O APIC watchdog is driven externally and has no such shortcoming.
But its NMI frequency is much higher, resulting in a more significant hit
to the overall system performance.

On x86 nmi_watchdog is disabled by default so you have to enable it with
a boot time parameter.

It's possible to disable the NMI watchdog in run-time by writing "0" to
/proc/sys/kernel/nmi_watchdog. Writing "1" to the same file will re-enable
the NMI watchdog. Notice that you still need to use "nmi_watchdog=" parameter
at boot time.

NOTE: In kernels prior to 2.4.2-ac18 the NMI-oopser is enabled unconditionally
on x86 SMP boxes.

[ feel free to send bug reports, suggestions and patches to
  Ingo Molnar <mingo@redhat.com> or the Linux SMP mailing
  list at <linux-smp@vger.kernel.org> ]
Documentation/static-keys.txt
+286 −0
			Static Keys
			-----------

By: Jason Baron <jbaron@redhat.com>

0) Abstract

Static keys allow the inclusion of seldom-used features in
performance-sensitive fast-path kernel code, via a GCC feature and a code
patching technique. A quick example:

	struct static_key key = STATIC_KEY_INIT_FALSE;

	...

        if (static_key_false(&key))
                do unlikely code
        else
                do likely code

	...
	static_key_slow_inc(&key);
	...
	static_key_slow_dec(&key);
	...

The static_key_false() branch will be generated into the code with as little
impact to the likely code path as possible.


1) Motivation


Currently, tracepoints are implemented using a conditional branch. The
conditional check requires checking a global variable for each tracepoint.
Although the overhead of this check is small, it increases when the memory
cache comes under pressure (memory cache lines for these global variables may
be shared with other memory accesses). As we increase the number of tracepoints
in the kernel this overhead may become more of an issue. In addition,
tracepoints are often dormant (disabled) and provide no direct kernel
functionality. Thus, it is highly desirable to reduce their impact as much as
possible. Although tracepoints are the original motivation for this work, other
kernel code paths should be able to make use of the static keys facility.


2) Solution


gcc (v4.5) adds a new 'asm goto' statement that allows branching to a label:

http://gcc.gnu.org/ml/gcc-patches/2009-07/msg01556.html

Using the 'asm goto', we can create branches that are either taken or not taken
by default, without the need to check memory. Then, at run-time, we can patch
the branch site to change the branch direction.
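
As a rough illustration of the underlying compiler feature, here is a
simplified stand-in (not the kernel's actual arch_static_branch()
implementation, which additionally records the patch site in a special
__jump_table section):

	/* The inline asm body is a 5-byte x86 no-op in the straight-line
	 * path; 'asm goto' tells GCC the asm may also jump to 'l_yes'.
	 * Run-time code patching can later turn the no-op into a jmp. */
	static __always_inline bool sketch_static_branch(void)
	{
		asm goto(".byte 0x0f, 0x1f, 0x44, 0x00, 0x00\n\t"
			 : : : : l_yes);
		return false;
	l_yes:
		return true;
	}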

For example, if we have a simple branch that is disabled by default:

	if (static_key_false(&key))
		printk("I am the true branch\n");

Thus, by default the 'printk' will not be emitted. And the code generated will
consist of a single atomic 'no-op' instruction (5 bytes on x86), in the
straight-line code path. When the branch is 'flipped', we will patch the
'no-op' in the straight-line codepath with a 'jump' instruction to the
out-of-line true branch. Thus, changing branch direction is expensive but
branch selection is basically 'free'. That is the basic tradeoff of this
optimization.

This lowlevel patching mechanism is called 'jump label patching', and it gives
the basis for the static keys facility.

3) Static key label API, usage and examples:


In order to make use of this optimization you must first define a key:

	struct static_key key;

Which is initialized as:

	struct static_key key = STATIC_KEY_INIT_TRUE;

or:

	struct static_key key = STATIC_KEY_INIT_FALSE;

If the key is not initialized, it defaults to false. The 'struct static_key'
must be a 'global'. That is, it can't be allocated on the stack or dynamically
allocated at run-time.

The key is then used in code as:

        if (static_key_false(&key))
                do unlikely code
        else
                do likely code

Or:

        if (static_key_true(&key))
                do likely code
        else
                do unlikely code

A key that is initialized via 'STATIC_KEY_INIT_FALSE', must be used in a
'static_key_false()' construct. Likewise, a key initialized via
'STATIC_KEY_INIT_TRUE' must be used in a 'static_key_true()' construct. A
single key can be used in many branches, but all the branches must match the
way that the key has been initialized.

The branch(es) can then be switched via:

	static_key_slow_inc(&key);
	...
	static_key_slow_dec(&key);

Thus, 'static_key_slow_inc()' means 'make the branch true', and
'static_key_slow_dec()' means 'make the branch false' with appropriate
reference counting. For example, if the key is initialized true, a
static_key_slow_dec(), will switch the branch to false. And a subsequent
static_key_slow_inc(), will change the branch back to true. Likewise, if the
key is initialized false, a 'static_key_slow_inc()', will change the branch to
true. And then a 'static_key_slow_dec()', will again make the branch false.
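
For instance, with a hypothetical key (invented name, purely illustrative)
that starts out false, the reference counting behaves as follows:

	static struct static_key example_key = STATIC_KEY_INIT_FALSE;

	static_key_slow_inc(&example_key);	/* count 0 -> 1: branch becomes true */
	static_key_slow_inc(&example_key);	/* count 1 -> 2: branch stays true */
	static_key_slow_dec(&example_key);	/* count 2 -> 1: branch stays true */
	static_key_slow_dec(&example_key);	/* count 1 -> 0: branch becomes false */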

An example usage in the kernel is the implementation of tracepoints:

        static inline void trace_##name(proto)                          \
        {                                                               \
                if (static_key_false(&__tracepoint_##name.key))		\
                        __DO_TRACE(&__tracepoint_##name,                \
                                TP_PROTO(data_proto),                   \
                                TP_ARGS(data_args),                     \
                                TP_CONDITION(cond));                    \
        }

Tracepoints are disabled by default, and can be placed in performance critical
pieces of the kernel. Thus, by using a static key, the tracepoints can have
absolutely minimal impact when not in use.


4) Architecture level code patching interface, 'jump labels'


There are a few functions and macros that architectures must implement in order
to take advantage of this optimization. If there is no architecture support, we
simply fall back to a traditional, load, test, and jump sequence.
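
In that fallback case the key degenerates, conceptually, into an ordinary
test of its enable count - roughly along these lines (a sketch, not the
exact generic code):

	static __always_inline bool static_key_false(struct static_key *key)
	{
		if (unlikely(atomic_read(&key->enabled) > 0))
			return true;
		return false;
	}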

* select HAVE_ARCH_JUMP_LABEL, see: arch/x86/Kconfig

* #define JUMP_LABEL_NOP_SIZE, see: arch/x86/include/asm/jump_label.h

* __always_inline bool arch_static_branch(struct static_key *key), see:
					arch/x86/include/asm/jump_label.h

* void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type),
					see: arch/x86/kernel/jump_label.c

* __init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type),
					see: arch/x86/kernel/jump_label.c


* struct jump_entry, see: arch/x86/include/asm/jump_label.h


5) Static keys / jump label analysis, results (x86_64):


As an example, let's add the following branch to 'getppid()', such that the
system call now looks like:

SYSCALL_DEFINE0(getppid)
{
        int pid;

+       if (static_key_false(&key))
+               printk("I am the true branch\n");

        rcu_read_lock();
        pid = task_tgid_vnr(rcu_dereference(current->real_parent));
        rcu_read_unlock();

        return pid;
}

The resulting instructions with jump labels, as generated by GCC, are:

ffffffff81044290 <sys_getppid>:
ffffffff81044290:       55                      push   %rbp
ffffffff81044291:       48 89 e5                mov    %rsp,%rbp
ffffffff81044294:       e9 00 00 00 00          jmpq   ffffffff81044299 <sys_getppid+0x9>
ffffffff81044299:       65 48 8b 04 25 c0 b6    mov    %gs:0xb6c0,%rax
ffffffff810442a0:       00 00
ffffffff810442a2:       48 8b 80 80 02 00 00    mov    0x280(%rax),%rax
ffffffff810442a9:       48 8b 80 b0 02 00 00    mov    0x2b0(%rax),%rax
ffffffff810442b0:       48 8b b8 e8 02 00 00    mov    0x2e8(%rax),%rdi
ffffffff810442b7:       e8 f4 d9 00 00          callq  ffffffff81051cb0 <pid_vnr>
ffffffff810442bc:       5d                      pop    %rbp
ffffffff810442bd:       48 98                   cltq
ffffffff810442bf:       c3                      retq
ffffffff810442c0:       48 c7 c7 e3 54 98 81    mov    $0xffffffff819854e3,%rdi
ffffffff810442c7:       31 c0                   xor    %eax,%eax
ffffffff810442c9:       e8 71 13 6d 00          callq  ffffffff8171563f <printk>
ffffffff810442ce:       eb c9                   jmp    ffffffff81044299 <sys_getppid+0x9>

Without the jump label optimization it looks like:

ffffffff810441f0 <sys_getppid>:
ffffffff810441f0:       8b 05 8a 52 d8 00       mov    0xd8528a(%rip),%eax        # ffffffff81dc9480 <key>
ffffffff810441f6:       55                      push   %rbp
ffffffff810441f7:       48 89 e5                mov    %rsp,%rbp
ffffffff810441fa:       85 c0                   test   %eax,%eax
ffffffff810441fc:       75 27                   jne    ffffffff81044225 <sys_getppid+0x35>
ffffffff810441fe:       65 48 8b 04 25 c0 b6    mov    %gs:0xb6c0,%rax
ffffffff81044205:       00 00
ffffffff81044207:       48 8b 80 80 02 00 00    mov    0x280(%rax),%rax
ffffffff8104420e:       48 8b 80 b0 02 00 00    mov    0x2b0(%rax),%rax
ffffffff81044215:       48 8b b8 e8 02 00 00    mov    0x2e8(%rax),%rdi
ffffffff8104421c:       e8 2f da 00 00          callq  ffffffff81051c50 <pid_vnr>
ffffffff81044221:       5d                      pop    %rbp
ffffffff81044222:       48 98                   cltq
ffffffff81044224:       c3                      retq
ffffffff81044225:       48 c7 c7 13 53 98 81    mov    $0xffffffff81985313,%rdi
ffffffff8104422c:       31 c0                   xor    %eax,%eax
ffffffff8104422e:       e8 60 0f 6d 00          callq  ffffffff81715193 <printk>
ffffffff81044233:       eb c9                   jmp    ffffffff810441fe <sys_getppid+0xe>
ffffffff81044235:       66 66 2e 0f 1f 84 00    data32 nopw %cs:0x0(%rax,%rax,1)
ffffffff8104423c:       00 00 00 00

Thus, the disabled jump label case adds a 'mov', 'test' and 'jne'
instruction, whereas the jump label case just has a 'no-op' or 'jmp 0'. (The
'jmp 0' is patched to a 5-byte atomic no-op instruction at boot-time.) Thus,
the disabled jump label case adds:

6 (mov) + 2 (test) + 2 (jne) = 10 bytes - 5 bytes ('jmp 0') = 5 additional bytes.

If we then include the padding bytes, the jump label code saves 16 total bytes
of instruction memory for this small function. In this case the non-jump label
function is 80 bytes long. Thus, we have saved 20% of the instruction
footprint. We can in fact improve this even further, since the 5-byte no-op
really can be a 2-byte no-op since we can reach the branch with a 2-byte jmp.
However, we have not yet implemented optimal no-op sizes (they are currently
hard-coded).

Since there are a number of static key API uses in the scheduler paths,
'pipe-test' (also known as 'perf bench sched pipe') can be used to show the
performance improvement. Testing done on 3.3.0-rc2:

jump label disabled:

 Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):

        855.700314 task-clock                #    0.534 CPUs utilized            ( +-  0.11% )
           200,003 context-switches          #    0.234 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 39.58% )
               487 page-faults               #    0.001 M/sec                    ( +-  0.02% )
     1,474,374,262 cycles                    #    1.723 GHz                      ( +-  0.17% )
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
     1,178,049,567 instructions              #    0.80  insns per cycle          ( +-  0.06% )
       208,368,926 branches                  #  243.507 M/sec                    ( +-  0.06% )
         5,569,188 branch-misses             #    2.67% of all branches          ( +-  0.54% )

       1.601607384 seconds time elapsed                                          ( +-  0.07% )

jump label enabled:

 Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):

        841.043185 task-clock                #    0.533 CPUs utilized            ( +-  0.12% )
           200,004 context-switches          #    0.238 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 40.87% )
               487 page-faults               #    0.001 M/sec                    ( +-  0.05% )
     1,432,559,428 cycles                    #    1.703 GHz                      ( +-  0.18% )
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
     1,175,363,994 instructions              #    0.82  insns per cycle          ( +-  0.04% )
       206,859,359 branches                  #  245.956 M/sec                    ( +-  0.04% )
         4,884,119 branch-misses             #    2.36% of all branches          ( +-  0.85% )

       1.579384366 seconds time elapsed

The percentage of saved branches is 0.7%, and we've saved 12% on
'branch-misses'. This is where we would expect to get the most savings, since
this optimization is about reducing the number of branches. In addition, we've
saved 0.2% on instructions, 2.8% on cycles and 1.4% on elapsed time.
Documentation/trace/ftrace.txt
+7 −0
@@ -226,6 +226,13 @@ Here is the list of current tracers that may be configured.
	Traces and records the max latency that it takes for
	the highest priority task to get scheduled after
	it has been woken up.
        Traces all tasks as an average developer would expect.

  "wakeup_rt"

        Traces and records the max latency that it takes for just
        RT tasks (as the current "wakeup" does). This is useful
        for those interested in wake up timings of RT tasks.

  "hw-branch-tracer"

arch/Kconfig
+20 −9
@@ -47,18 +47,29 @@ config KPROBES
	  If in doubt, say "N".

config JUMP_LABEL
       bool "Optimize trace point call sites"
       bool "Optimize very unlikely/likely branches"
       depends on HAVE_ARCH_JUMP_LABEL
       help
         This option enables a transparent branch optimization that
	 makes certain almost-always-true or almost-always-false branch
	 conditions even cheaper to execute within the kernel.

	 Certain performance-sensitive kernel code, such as trace points,
	 scheduler functionality, networking code and KVM have such
	 branches and include support for this optimization technique.

         If it is detected that the compiler has support for "asm goto",
	 the kernel will compile trace point locations with just a
	 nop instruction. When trace points are enabled, the nop will
	 be converted to a jump to the trace function. This technique
	 lowers overhead and stress on the branch prediction of the
	 processor.

	 On i386, options added to the compiler flags may increase
	 the size of the kernel slightly.
	 the kernel will compile such branches with just a nop
	 instruction. When the condition flag is toggled to true, the
	 nop will be converted to a jump instruction to execute the
	 conditional block of instructions.

	 This technique lowers overhead and stress on the branch prediction
	 of the processor and generally makes the kernel faster. The update
	 of the condition is slower, but those are always very rare.

	 ( On 32-bit x86, the necessary options added to the compiler
	   flags may increase the size of the kernel slightly. )

config OPTPROBES
	def_bool y