
Commit 968ea6d8 authored by Rusty Russell

Merge ../linux-2.6-x86

Conflicts:

	arch/x86/kernel/io_apic.c
	kernel/sched.c
	kernel/sched_stats.h
parents 7be75853 8299608f
+32 −0
CPU Accounting Controller
-------------------------

The CPU accounting controller is used to group tasks using cgroups and
account the CPU usage of these groups of tasks.

The CPU accounting controller supports multi-hierarchy groups. An accounting
group accumulates the CPU usage of all of its child groups and the tasks
directly present in its group.

Accounting groups can be created by first mounting the cgroup filesystem.

# mkdir /cgroups
# mount -t cgroup -ocpuacct none /cgroups

With the above step, the initial or the parent accounting group
becomes visible at /cgroups. At bootup, this group includes all the
tasks in the system. /cgroups/tasks lists the tasks in this cgroup.
/cgroups/cpuacct.usage gives the CPU time (in nanoseconds) obtained by
this group, which is essentially the CPU time obtained by all the tasks
in the system.

New accounting groups can be created under the parent group /cgroups.

# cd /cgroups
# mkdir g1
# echo $$ > g1/tasks

The above steps create a new group g1 and move the current shell
process (bash) into it. The CPU time consumed by this shell and its
children can be read from g1/cpuacct.usage; the same time is also
accumulated in /cgroups/cpuacct.usage.
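
As an illustration of reading the counter, the following shell sketch
converts a cpuacct.usage value from nanoseconds to milliseconds. The
function name cpuacct_usage_ms is hypothetical and not part of the
kernel or of any standard tool; it only assumes that the file holds a
single integer.

```shell
# Hypothetical helper (not part of the kernel): print the CPU time
# recorded in a cpuacct.usage file, converted from ns to whole ms.
cpuacct_usage_ms() {
    # $1: path to a cpuacct.usage file
    usage_ns=$(cat "$1") || return 1
    echo $((usage_ns / 1000000))
}
```

With the mount point used in this document, it could be called as
"cpuacct_usage_ms /cgroups/g1/cpuacct.usage".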
+117 −32
@@ -82,7 +82,7 @@ of ftrace. Here is a list of some of the key files:
		tracer is not adding more data, they will display
		the same information every time they are read.

  iter_ctrl: This file lets the user control the amount of data
  trace_options: This file lets the user control the amount of data
		that is displayed in one of the above output
		files.

@@ -94,7 +94,7 @@ of ftrace. Here is a list of some of the key files:
		only be recorded if the latency is greater than
		the value in this file. (in microseconds)

  trace_entries: This sets or displays the number of bytes each CPU
  buffer_size_kb: This sets or displays the number of kilobytes each CPU
		buffer can hold. The tracer buffers are the same size
		for each CPU. The displayed number is the size of the
		CPU buffer and not total size of all buffers. The
@@ -127,6 +127,8 @@ of ftrace. Here is a list of some of the key files:
		be traced. If a function exists in both set_ftrace_filter
		and set_ftrace_notrace,	the function will _not_ be traced.

  set_ftrace_pid: Have the function tracer only trace a single thread.

  available_filter_functions: This lists the functions that ftrace
		has processed and can trace. These are the function
		names that you can pass to "set_ftrace_filter" or
@@ -316,23 +318,23 @@ The above is mostly meaningful for kernel developers.
  The rest is the same as the 'trace' file.


iter_ctrl
---------
trace_options
-------------

The iter_ctrl file is used to control what gets printed in the trace
The trace_options file is used to control what gets printed in the trace
output. To see what is available, simply cat the file:

  cat /debug/tracing/iter_ctrl
  cat /debug/tracing/trace_options
  print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \
 noblock nostacktrace nosched-tree
 noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj

To disable one of the options, echo in the option prepended with "no".

  echo noprint-parent > /debug/tracing/iter_ctrl
  echo noprint-parent > /debug/tracing/trace_options

To enable an option, leave off the "no".

  echo sym-offset > /debug/tracing/iter_ctrl
  echo sym-offset > /debug/tracing/trace_options
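
The "no" prefix convention can be wrapped in two small shell helpers.
This is only a sketch: the function names are hypothetical, and the
path of the options file is passed in as an argument rather than
assumed.

```shell
# Hypothetical helpers. $1 is the path of the options file, e.g.
# /debug/tracing/trace_options, and $2 is an option name without "no".

# Succeed if the option is listed without its "no" prefix.
trace_option_enabled() {
    tr -s '[:space:]' '\n' < "$1" | grep -qxF -- "$2"
}

# Enable the option when $3 is "on", disable it otherwise.
set_trace_option() {
    if [ "$3" = on ]; then
        echo "$2" > "$1"
    else
        echo "no$2" > "$1"
    fi
}
```

For example, "set_trace_option /debug/tracing/trace_options sym-offset on"
writes the same string as the echo command above.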

Here are the available options:

@@ -378,6 +380,20 @@ Here are the available options:
		When a trace is recorded, so is the stack of functions.
		This allows for back traces of trace sites.

  userstacktrace - This option changes the trace.
		   It records a stacktrace of the current userspace thread.

  sym-userobj - when user stacktraces are enabled, look up which object the
		address belongs to, and print a relative address.
		This is especially useful when ASLR is on; otherwise you don't
		get a chance to resolve the address to object/file/line after
		the app is no longer running.

		The lookup is performed when you read trace, trace_pipe, or
		latency_trace. Example:

		a.out-1623  [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]

  sched-tree - TBD (any users??)


@@ -1059,6 +1075,83 @@ For simple one time traces, the above is sufficient. For anything else,
a search through /proc/mounts may be needed to find where the debugfs
file-system is mounted.


Single thread tracing
---------------------

By writing into /debug/tracing/set_ftrace_pid you can trace a
single thread. For example:

# cat /debug/tracing/set_ftrace_pid
no pid
# echo 3111 > /debug/tracing/set_ftrace_pid
# cat /debug/tracing/set_ftrace_pid
3111
# echo function > /debug/tracing/current_tracer
# cat /debug/tracing/trace | head
 # tracer: function
 #
 #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
 #              | |       |          |         |
     yum-updatesd-3111  [003]  1637.254676: finish_task_switch <-thread_return
     yum-updatesd-3111  [003]  1637.254681: hrtimer_cancel <-schedule_hrtimeout_range
     yum-updatesd-3111  [003]  1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel
     yum-updatesd-3111  [003]  1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel
     yum-updatesd-3111  [003]  1637.254685: fget_light <-do_sys_poll
     yum-updatesd-3111  [003]  1637.254686: pipe_poll <-do_sys_poll
# echo -1 > /debug/tracing/set_ftrace_pid
# cat /debug/tracing/trace |head
 # tracer: function
 #
 #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
 #              | |       |          |         |
 ##### CPU 3 buffer started ####
     yum-updatesd-3111  [003]  1701.957688: free_poll_entry <-poll_freewait
     yum-updatesd-3111  [003]  1701.957689: remove_wait_queue <-free_poll_entry
     yum-updatesd-3111  [003]  1701.957691: fput <-free_poll_entry
     yum-updatesd-3111  [003]  1701.957692: audit_syscall_exit <-sysret_audit
     yum-updatesd-3111  [003]  1701.957693: path_put <-audit_syscall_exit

If you want to trace a command from the moment it starts executing,
you could use something like this simple program:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

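int main (int argc, char **argv)
{
        /* argv[1] names the command to run, so one argument is required */
        if (argc < 2)
                exit(-1);

        if (fork() > 0) {
                int fd, ffd;
                char line[64];
                int s;

                /* stop tracing while the pid filter is set up */
                ffd = open("/debug/tracing/current_tracer", O_WRONLY);
                if (ffd < 0)
                        exit(-1);
                write(ffd, "nop", 3);

                /* restrict the function tracer to this process */
                fd = open("/debug/tracing/set_ftrace_pid", O_WRONLY);
                if (fd < 0)
                        exit(-1);
                s = sprintf(line, "%d\n", getpid());
                write(fd, line, s);

                write(ffd, "function", 8);

                close(fd);
                close(ffd);

                /* exec keeps the pid, so the command itself is traced */
                execvp(argv[1], argv+1);
        }

        return 0;
}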
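
The same sequence of writes can be sketched in shell. This is not a
replacement for the program above, only an illustration of the order
of operations. The function name trace_one_command is hypothetical, and
the tracing directory is passed as the first argument instead of being
hard-coded.

```shell
# Hypothetical helper: run a command with the function tracer limited
# to that command's pid.
# $1: tracing directory (e.g. /debug/tracing), remaining args: command.
trace_one_command() {
    tracing_dir=$1
    shift
    # Stop tracing while the pid filter is set up.
    echo nop > "$tracing_dir/current_tracer" || return 1
    # The inner shell writes its own pid ($$), enables the function
    # tracer, then exec replaces the shell with the command, which
    # keeps the same pid. "$0" is the tracing directory here.
    sh -c 'echo $$ > "$0/set_ftrace_pid" &&
           echo function > "$0/current_tracer" &&
           exec "$@"' "$tracing_dir" "$@"
}
```

For example: "trace_one_command /debug/tracing ls -l".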

dynamic ftrace
--------------

@@ -1158,7 +1251,11 @@ These are the only wild cards which are supported.

  <match>*<match> will not work.

 # echo hrtimer_* > /debug/tracing/set_ftrace_filter
Note: It is better to use quotes to enclose the wild cards, otherwise
  the shell may expand the parameters into names of files in the local
  directory.
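
The effect of shell expansion can be seen in a scratch directory. The
sketch below is only a demonstration; the directory and the file name
hrtimer_notes.txt are made up for the purpose.

```shell
# Create a scratch directory holding one file that matches the pattern.
demo_dir=$(mktemp -d)
touch "$demo_dir/hrtimer_notes.txt"

# Unquoted: the shell replaces the pattern with the matching file name.
unquoted=$(cd "$demo_dir" && echo hrtimer_*)
# Quoted: the pattern is passed through unchanged.
quoted=$(cd "$demo_dir" && echo 'hrtimer_*')

echo "unquoted: $unquoted"   # unquoted: hrtimer_notes.txt
echo "quoted:   $quoted"     # quoted:   hrtimer_*

rm -rf "$demo_dir"
```

In the unquoted case the file name, not the wild card, would have been
written to set_ftrace_filter.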

 # echo 'hrtimer_*' > /debug/tracing/set_ftrace_filter

Produces:

@@ -1213,7 +1310,7 @@ Again, now we want to append.
 # echo sys_nanosleep > /debug/tracing/set_ftrace_filter
 # cat /debug/tracing/set_ftrace_filter
sys_nanosleep
 # echo hrtimer_* >> /debug/tracing/set_ftrace_filter
 # echo 'hrtimer_*' >> /debug/tracing/set_ftrace_filter
 # cat /debug/tracing/set_ftrace_filter
hrtimer_run_queues
hrtimer_run_pending
@@ -1299,41 +1396,29 @@ trace entries
-------------

Having too much or not enough data can be troublesome in diagnosing
an issue in the kernel. The file trace_entries is used to modify
an issue in the kernel. The file buffer_size_kb is used to modify
the size of the internal trace buffers. The number listed
is the number of kilobytes that each CPU buffer can hold. To know
the full size, multiply the number of possible CPUs by the
per-CPU size.

 # cat /debug/tracing/trace_entries
65620
 # cat /debug/tracing/buffer_size_kb
1408 (units kilobytes)

Note, to modify this, you must have tracing completely disabled. To do that,
echo "nop" into the current_tracer. If the current_tracer is not set
to "nop", an EINVAL error will be returned.

 # echo nop > /debug/tracing/current_tracer
 # echo 100000 > /debug/tracing/trace_entries
 # cat /debug/tracing/trace_entries
100045


Notice that we echoed in 100,000 but the size is 100,045. The entries
are held in individual pages. It allocates the number of pages it takes
to fulfill the request. If more entries may fit on the last page
then they will be added.

 # echo 1 > /debug/tracing/trace_entries
 # cat /debug/tracing/trace_entries
85

This shows us that 85 entries can fit in a single page.
 # echo 10000 > /debug/tracing/buffer_size_kb
 # cat /debug/tracing/buffer_size_kb
10000 (units kilobytes)
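
Since the value applies to each CPU, the total is the per-CPU value
multiplied by the number of possible CPUs. A minimal shell sketch of
that arithmetic follows. The function name is hypothetical, and the
CPU count is supplied by the caller because this document does not say
where to read it from.

```shell
# Hypothetical helper: total trace buffer size in kilobytes.
# $1: path to buffer_size_kb, $2: number of possible CPUs.
total_trace_buffer_kb() {
    # The file may read "1408 (units kilobytes)"; keep the leading number.
    per_cpu_kb=$(awk '{ print $1; exit }' "$1")
    [ -n "$per_cpu_kb" ] || return 1
    echo $((per_cpu_kb * $2))
}
```

For example: "total_trace_buffer_kb /debug/tracing/buffer_size_kb 4".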

The number of pages which will be allocated is limited to a percentage
of available memory. Allocating too much will produce an error.

 # echo 1000000000000 > /debug/tracing/trace_entries
 # echo 1000000000000 > /debug/tracing/buffer_size_kb
-bash: echo: write error: Cannot allocate memory
 # cat /debug/tracing/trace_entries
 # cat /debug/tracing/buffer_size_kb
85
+8 −0
@@ -750,6 +750,14 @@ and is between 256 and 4096 characters. It is defined in the file
			parameter will force ia64_sal_cache_flush to call
			ia64_pal_cache_flush instead of SAL_CACHE_FLUSH.

	ftrace=[tracer]
			[ftrace] will set and start the specified tracer
			as early as possible in order to facilitate early
			boot debugging.

	ftrace_dump_on_oops
			[ftrace] will dump the trace buffers on oops.

	gamecon.map[2|3]=
			[HW,JOY] Multisystem joystick and NES/SNES/PSX pad
			support via parallel port (up to 5 devices per port)
+33 −18
@@ -71,35 +71,50 @@ Look at the current lock statistics:

# less /proc/lock_stat

01 lock_stat version 0.2
01 lock_stat version 0.3
02 -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
03                               class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
04 -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
05
06               &inode->i_data.tree_lock-W:            15          21657           0.18     1093295.30 11547131054.85             58          10415           0.16          87.51        6387.60
07               &inode->i_data.tree_lock-R:             0              0           0.00           0.00           0.00          23302         231198           0.25           8.45       98023.38
08               --------------------------
09                 &inode->i_data.tree_lock              0          [<ffffffff8027c08f>] add_to_page_cache+0x5f/0x190
10
11 ...............................................................................................................................................................................................
12
13                              dcache_lock:          1037           1161           0.38          45.32         774.51           6611         243371           0.15         306.48       77387.24
14                              -----------
15                              dcache_lock            180          [<ffffffff802c0d7e>] sys_getcwd+0x11e/0x230
16                              dcache_lock            165          [<ffffffff802c002a>] d_alloc+0x15a/0x210
17                              dcache_lock             33          [<ffffffff8035818d>] _atomic_dec_and_lock+0x4d/0x70
18                              dcache_lock              1          [<ffffffff802beef8>] shrink_dcache_parent+0x18/0x130
06                          &mm->mmap_sem-W:           233            538 18446744073708       22924.27      607243.51           1342          45806           1.71        8595.89     1180582.34
07                          &mm->mmap_sem-R:           205            587 18446744073708       28403.36      731975.00           1940         412426           0.58      187825.45     6307502.88
08                          ---------------
09                            &mm->mmap_sem            487          [<ffffffff8053491f>] do_page_fault+0x466/0x928
10                            &mm->mmap_sem            179          [<ffffffff802a6200>] sys_mprotect+0xcd/0x21d
11                            &mm->mmap_sem            279          [<ffffffff80210a57>] sys_mmap+0x75/0xce
12                            &mm->mmap_sem             76          [<ffffffff802a490b>] sys_munmap+0x32/0x59
13                          ---------------
14                            &mm->mmap_sem            270          [<ffffffff80210a57>] sys_mmap+0x75/0xce
15                            &mm->mmap_sem            431          [<ffffffff8053491f>] do_page_fault+0x466/0x928
16                            &mm->mmap_sem            138          [<ffffffff802a490b>] sys_munmap+0x32/0x59
17                            &mm->mmap_sem            145          [<ffffffff802a6200>] sys_mprotect+0xcd/0x21d
18
19 ...............................................................................................................................................................................................
20
21                              dcache_lock:           621            623           0.52         118.26        1053.02           6745          91930           0.29         316.29      118423.41
22                              -----------
23                              dcache_lock            179          [<ffffffff80378274>] _atomic_dec_and_lock+0x34/0x54
24                              dcache_lock            113          [<ffffffff802cc17b>] d_alloc+0x19a/0x1eb
25                              dcache_lock             99          [<ffffffff802ca0dc>] d_rehash+0x1b/0x44
26                              dcache_lock            104          [<ffffffff802cbca0>] d_instantiate+0x36/0x8a
27                              -----------
28                              dcache_lock            192          [<ffffffff80378274>] _atomic_dec_and_lock+0x34/0x54
29                              dcache_lock             98          [<ffffffff802ca0dc>] d_rehash+0x1b/0x44
30                              dcache_lock             72          [<ffffffff802cc17b>] d_alloc+0x19a/0x1eb
31                              dcache_lock            112          [<ffffffff802cbca0>] d_instantiate+0x36/0x8a

This excerpt shows the first two lock class statistics. Line 01 shows the
output version - each time the format changes this will be updated. Lines 02-04
show the header with column descriptions. Lines 05-10 and 13-18 show the actual
show the header with column descriptions. Lines 05-18 and 20-31 show the actual
statistics. These statistics come in two parts; the actual stats separated by a
short separator (line 08, 14) from the contention points.
short separator (line 08, 13) from the contention points.

The first lock (05-10) is a read/write lock, and shows two lines above the
The first lock (05-18) is a read/write lock, and shows two lines above the
short separator. The contention points don't match the column descriptors,
they have two: contentions and [<IP>] symbol.
they have two: contentions and [<IP>] symbol. The second set of contention
points are the points we're contending with.

The integer part of the time values is in us.
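
As an illustration of how the columns relate, the average wait per
contention is waittime-total divided by contentions. The shell sketch
below computes it for each lock class line. The function name is
hypothetical, and the input is assumed to be in the format shown above
but without the leading two-digit line labels (01, 02, ...) used in
the excerpt.

```shell
# Hypothetical helper: for every lock class line in a lock_stat-format
# file ($1), print the class name and waittime-total / contentions.
# Fields: $1 class name (ends in ":"), $3 contentions, $6 waittime-total.
# Classes with zero contentions are skipped.
lock_avg_wait() {
    awk '$1 ~ /:$/ && $3 > 0 { printf "%s %.2f\n", $1, $6 / $3 }' "$1"
}
```

For example: "lock_avg_wait /proc/lock_stat". The result is in the same
unit as the time columns.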

View the top contending locks:

+24 −5
@@ -51,11 +51,16 @@ to call) for the specific marker through marker_probe_register() and can be
activated by calling marker_arm(). Marker deactivation can be done by calling
marker_disarm() as many times as marker_arm() has been called. Removing a probe
is done through marker_probe_unregister(); it will disarm the probe.
marker_synchronize_unregister() must be called before the end of the module exit
function to make sure there is no caller left using the probe. This, and the
fact that preemption is disabled around the probe call, make sure that probe
removal and module unload are safe. See the "Probe example" section below for a
sample probe module.

marker_synchronize_unregister() must be called between probe unregistration and
the first occurrence of
- the end of module exit function,
  to make sure there is no caller left using the probe;
- the free of any resource used by the probes,
  to make sure the probes won't be accessing invalid data.
This, and the fact that preemption is disabled around the probe call, make sure
that probe removal and module unload are safe. See the "Probe example" section
below for a sample probe module.

The marker mechanism supports inserting multiple instances of the same marker.
Markers can be put in inline functions, inlined static functions, and
@@ -70,6 +75,20 @@ a printk warning which identifies the inconsistency:

"Format mismatch for probe probe_name (format), marker (format)"

Another way to use markers is to simply define the marker without generating any
function call to actually call into the marker. This is useful in combination
with tracepoint probes in a scheme like this:

void probe_tracepoint_name(unsigned int arg1, struct task_struct *tsk);

DEFINE_MARKER_TP(marker_eventname, tracepoint_name, probe_tracepoint_name,
	"arg1 %u pid %d");

notrace void probe_tracepoint_name(unsigned int arg1, struct task_struct *tsk)
{
	struct marker *marker = &GET_MARKER(kernel_irq_entry);
	/* write data to trace buffers ... */
}

* Probe / marker example
