Merge "Merge android-4.9-172 (2dbf78b) into msm-4.9" (ed76534e) · Commits · e / devices / android_kernel_fairphone_FP3

Documentation/ABI/testing/sysfs-fs-f2fs

+7 −0

Original line number	Diff line number	Diff line
		@@ -86,6 +86,13 @@ Description:
		The unit size is one block; only values in the range [1, 512] are
		currently supported.

		What: /sys/fs/f2fs/<disk>/umount_discard_timeout
		Date: January 2019
		Contact: "Jaegeuk Kim" <jaegeuk@kernel.org>
		Description:
		Set timeout to issue discard commands during umount.
		Default: 5 secs

		What: /sys/fs/f2fs/<disk>/max_victim_search
		Date: January 2014
		Contact: "Jaegeuk Kim" <jaegeuk.kim@samsung.com>

Documentation/accounting/psi.txt

0 → 100644

+180 −0

		================================
		PSI - Pressure Stall Information
		================================

		:Date: April, 2018
		:Author: Johannes Weiner <hannes@cmpxchg.org>

		When CPU, memory or IO devices are contended, workloads experience
		latency spikes, throughput losses, and run the risk of OOM kills.

		Without an accurate measure of such contention, users are forced to
		either play it safe and under-utilize their hardware resources, or
		roll the dice and frequently suffer the disruptions resulting from
		excessive overcommit.

		The psi feature identifies and quantifies the disruptions caused by
		such resource crunches and the time impact they have on complex
		workloads or even entire systems.

		Having an accurate measure of productivity losses caused by resource
		scarcity aids users in sizing workloads to hardware--or provisioning
		hardware according to workload demand.

		As psi aggregates this information in realtime, systems can be managed
		dynamically using techniques such as load shedding, migrating jobs to
		other systems or data centers, or strategically pausing or killing low
		priority or restartable batch jobs.

		This allows maximizing hardware utilization without sacrificing
		workload health or risking major disruptions such as OOM kills.

		Pressure interface
		==================

		Pressure information for each resource is exported through the
		respective file in /proc/pressure/ -- cpu, memory, and io.

		The format for CPU is as such:

		some avg10=0.00 avg60=0.00 avg300=0.00 total=0

		and for memory and IO:

		some avg10=0.00 avg60=0.00 avg300=0.00 total=0
		full avg10=0.00 avg60=0.00 avg300=0.00 total=0

		The "some" line indicates the share of time in which at least some
		tasks are stalled on a given resource.

		The "full" line indicates the share of time in which all non-idle
		tasks are stalled on a given resource simultaneously. In this state
		actual CPU cycles are going to waste, and a workload that spends
		extended time in this state is considered to be thrashing. This has
		severe impact on performance, and it's useful to distinguish this
		situation from a state where some tasks are stalled but the CPU is
		still doing productive work. As such, time spent in this subset of the
		stall state is tracked separately and exported in the "full" averages.
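
A line in the format shown above can be parsed with a single sscanf() call. This is a minimal sketch, not part of the kernel documentation; the struct psi_line type and the psi_parse_line() helper are hypothetical names introduced only for illustration.

```c
#include <stdio.h>
#include <string.h>

/* One parsed line of a pressure file (hypothetical helper type). */
struct psi_line {
	char kind[8];                /* "some" or "full" */
	double avg10, avg60, avg300; /* share of time stalled per window */
	unsigned long long total;    /* cumulative stall time */
};

/* Returns 0 on success, -1 if the line does not match the format. */
static int psi_parse_line(const char *line, struct psi_line *out)
{
	int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		       out->kind, &out->avg10, &out->avg60, &out->avg300,
		       &out->total);

	if (n != 5)
		return -1;
	if (strcmp(out->kind, "some") != 0 && strcmp(out->kind, "full") != 0)
		return -1;
	return 0;
}
```

A reader of /proc/pressure/memory or /proc/pressure/io would call this once per line; /proc/pressure/cpu in this version only carries the "some" line.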

		The ratios are tracked as recent trends over ten, sixty, and three
		hundred second windows, which gives insight into short term events as
		well as medium and long term trends. The total absolute stall time is
		tracked and exported as well, to allow detection of latency spikes
		which wouldn't necessarily make a dent in the time averages, or to
		average trends over custom time frames.
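
Averaging over a custom time frame, as mentioned above, amounts to sampling the total= field twice and dividing the difference by the elapsed time. The sketch below assumes that both the total values and the elapsed time are expressed in microseconds; psi_stall_percent() is a hypothetical helper name.

```c
#include <stdint.h>

/*
 * Share of a custom time frame spent stalled, in percent.
 * total_start_us and total_end_us are two samples of the total= field;
 * elapsed_us is the wall-clock time between the samples.
 * Assumption: all three values are in microseconds.
 * Returns a negative value on invalid input.
 */
static double psi_stall_percent(uint64_t total_start_us,
				uint64_t total_end_us,
				uint64_t elapsed_us)
{
	if (elapsed_us == 0 || total_end_us < total_start_us)
		return -1.0;
	return 100.0 * (double)(total_end_us - total_start_us) /
	       (double)elapsed_us;
}
```

For instance, if total grows by 500000 between two samples taken 2000000us apart, the result is 25 percent.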

		Monitoring for pressure thresholds
		==================================

		Users can register triggers and use poll() to be woken up when resource
		pressure exceeds certain thresholds.

		A trigger describes the maximum cumulative stall time over a specific
		time window, e.g. 100ms of total stall time within any 500ms window to
		generate a wakeup event.

		To register a trigger, the user has to open the psi interface file under
		/proc/pressure/ representing the resource to be monitored and write the
		desired threshold and time window. The open file descriptor should be
		used to wait for trigger events using select(), poll() or epoll().
		The following format is used:

		<some|full> <stall amount in us> <time window in us>

		For example, writing "some 150000 1000000" into /proc/pressure/memory
		would add a 150ms threshold for partial memory stall measured within
		a 1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
		would add a 50ms threshold for full io stall measured within a 1sec
		time window.
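
The trigger string can be assembled from its three fields rather than written by hand. This is a sketch under the format described above; psi_format_trigger() is a hypothetical helper, not a kernel or libc function.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build "<some|full> <stall amount in us> <time window in us>" in buf.
 * Returns the string length on success, -1 on a bad kind or short buffer.
 */
static int psi_format_trigger(char *buf, size_t len, const char *kind,
			      unsigned long stall_us, unsigned long window_us)
{
	int n;

	if (strcmp(kind, "some") != 0 && strcmp(kind, "full") != 0)
		return -1;
	n = snprintf(buf, len, "%s %lu %lu", kind, stall_us, window_us);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

Calling it with "some", 150000 and 1000000 yields the first example string from the text.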

		Triggers can be set on more than one psi metric, and more than one trigger
		can be specified for the same psi metric. However, each trigger requires a
		separate file descriptor so that it can be polled separately from the
		others; therefore a separate open() syscall should be made for each
		trigger, even when opening the same psi interface file.
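
The one-descriptor-per-trigger rule can be wrapped in a small helper that performs a fresh open() for every trigger. This is a sketch only; psi_register_trigger() is a hypothetical name, and the caller owns the returned descriptor.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Open a psi interface file and register one trigger on it.
 * Each call performs its own open(), so every trigger gets its own
 * file descriptor. Returns the descriptor, or -1 on error.
 */
static int psi_register_trigger(const char *path, const char *trig)
{
	int fd = open(path, O_RDWR | O_NONBLOCK);

	if (fd < 0)
		return -1;
	if (write(fd, trig, strlen(trig) + 1) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

Two triggers on memory pressure would then be two calls with the same path, for example one with "some 150000 1000000" and one with "full 50000 1000000", each returning a distinct descriptor to place in its own struct pollfd.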

		A monitor activates only when the system enters the stall state for the
		monitored psi metric and deactivates upon exit from the stall state. While
		the system is in the stall state, psi signal growth is monitored at a rate
		of 10 times per tracking window.

		The kernel accepts window sizes ranging from 500ms to 10s; the minimum
		monitoring update interval is therefore 50ms and the maximum is 1s. The
		minimum limit is set to prevent overly frequent polling. The maximum limit
		is chosen as a high enough number after which monitors are most likely not
		needed and psi averages can be used instead.
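
The relationship between the window size and the update interval follows from the figures above: the window is sampled 10 times, so the interval is one tenth of the window. A small sketch of that arithmetic, with hypothetical macro and function names:

```c
/* Limits and sampling rate as stated in the text; names are illustrative. */
#define PSI_WINDOW_MIN_US      500000UL   /* 500ms */
#define PSI_WINDOW_MAX_US      10000000UL /* 10s */
#define PSI_UPDATES_PER_WINDOW 10UL

/*
 * Returns the monitoring update interval in microseconds for a given
 * tracking window, or -1 if the window is outside the accepted range.
 */
static long psi_update_interval_us(unsigned long window_us)
{
	if (window_us < PSI_WINDOW_MIN_US || window_us > PSI_WINDOW_MAX_US)
		return -1;
	return (long)(window_us / PSI_UPDATES_PER_WINDOW);
}
```

A userspace tool could run such a check before writing a trigger, so that an out-of-range window is caught before the write.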

		When activated, psi monitor stays active for at least the duration of one
		tracking window to avoid repeated activations/deactivations when system is
		bouncing in and out of the stall state.

		Notifications to the userspace are rate-limited to one per tracking window.

		The trigger will de-register when the file descriptor used to define the
		trigger is closed.

		Userspace monitor usage example
		===============================

		#include <errno.h>
		#include <fcntl.h>
		#include <stdio.h>
		#include <poll.h>
		#include <string.h>
		#include <unistd.h>

		/*
		 * Monitor memory partial stall with 1s tracking window size
		 * and 150ms threshold.
		 */
		int main(void) {
			const char trig[] = "some 150000 1000000";
			struct pollfd fds;
			int n;

			fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
			if (fds.fd < 0) {
				printf("/proc/pressure/memory open error: %s\n",
					strerror(errno));
				return 1;
			}
			fds.events = POLLPRI;

			if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
				printf("/proc/pressure/memory write error: %s\n",
					strerror(errno));
				return 1;
			}

			printf("waiting for events...\n");
			while (1) {
				n = poll(&fds, 1, -1);
				if (n < 0) {
					printf("poll error: %s\n", strerror(errno));
					return 1;
				}
				if (fds.revents & POLLERR) {
					printf("got POLLERR, event source is gone\n");
					return 0;
				}
				if (fds.revents & POLLPRI) {
					printf("event triggered!\n");
				} else {
					printf("unknown event received: 0x%x\n", fds.revents);
					return 1;
				}
			}

			return 0;
		}

		Cgroup2 interface
		=================

		In a system with a CONFIG_CGROUPS=y kernel and the cgroup2 filesystem
		mounted, pressure stall information is also tracked for tasks grouped
		into cgroups. Each subdirectory in the cgroupfs mountpoint contains
		cpu.pressure, memory.pressure, and io.pressure files; the format is
		the same as the /proc/pressure/ files.

		Per-cgroup psi monitors can be specified and used the same way as
		system-wide ones.
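
Since the per-cgroup files are named after the resource, the path to monitor can be derived from the cgroup directory. The sketch below only builds the path; cgroup_pressure_path() is a hypothetical helper, and the directory passed in depends on where cgroup2 is mounted on a given system.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build "<cgroup_dir>/<resource>.pressure" in buf, where resource is
 * "cpu", "memory" or "io". Returns 0 on success, -1 on error.
 */
static int cgroup_pressure_path(char *buf, size_t len,
				const char *cgroup_dir, const char *resource)
{
	int n;

	if (strcmp(resource, "cpu") != 0 && strcmp(resource, "memory") != 0 &&
	    strcmp(resource, "io") != 0)
		return -1;
	n = snprintf(buf, len, "%s/%s.pressure", cgroup_dir, resource);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}
```

The resulting path can be opened and polled exactly like the corresponding /proc/pressure/ file.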

Documentation/arm/kernel_mode_neon.txt

+2 −2

		@@ -6,7 +6,7 @@ TL;DR summary
		* Use only NEON instructions, or VFP instructions that don't rely on support
		code
		* Isolate your NEON code in a separate compilation unit, and compile it with
		- '-mfpu=neon -mfloat-abi=softfp'
		+ '-march=armv7-a -mfpu=neon -mfloat-abi=softfp'
		* Put kernel_neon_begin() and kernel_neon_end() calls around the calls into your
		NEON code
		* Don't sleep in your NEON code, and be aware that it will be executed with
		@@ -87,7 +87,7 @@ instructions appearing in unexpected places if no special care is taken.
		Therefore, the recommended and only supported way of using NEON/VFP in the
		kernel is by adhering to the following rules:
		* isolate the NEON code in a separate compilation unit and compile it with
		- '-mfpu=neon -mfloat-abi=softfp';
		+ '-march=armv7-a -mfpu=neon -mfloat-abi=softfp';
		* issue the calls to kernel_neon_begin(), kernel_neon_end() as well as the calls
		into the unit containing the NEON code from a compilation unit which is not
		built with the GCC flag '-mfpu=neon' set.

Documentation/cgroup-v2.txt

+18 −0

		@@ -717,6 +717,12 @@ All time durations are in microseconds.
		$PERIOD duration. If only one number is written, $MAX is
		updated.

		cpu.pressure
		A read-only nested-key file which exists on non-root cgroups.

		Shows pressure stall information for CPU. See
		Documentation/accounting/psi.txt for details.


		5-2. Memory

		@@ -925,6 +931,12 @@ PAGE_SIZE multiple when read back.
		Swap usage hard limit. If a cgroup's swap usage reaches this
		limit, anonymous memory of the cgroup will not be swapped out.

		memory.pressure
		A read-only nested-key file which exists on non-root cgroups.

		Shows pressure stall information for memory. See
		Documentation/accounting/psi.txt for details.


		5-2-2. Usage Guidelines

		@@ -1055,6 +1067,12 @@ blk-mq devices.

		8:16 rbps=2097152 wbps=max riops=max wiops=max

		io.pressure
		A read-only nested-key file which exists on non-root cgroups.

		Shows pressure stall information for IO. See
		Documentation/accounting/psi.txt for details.


		5-3-2. Writeback

Documentation/device-mapper/dm-bow.txt

0 → 100644

+99 −0

		dm_bow (backup on write)
		========================

		dm_bow is a device mapper driver that uses the free space on a device to back up
		data that is overwritten. The changes can then be committed by a simple state
		change, or rolled back by removing the dm_bow device and running a command line
		utility over the underlying device.

		dm_bow has three states, set by writing ‘1’ or ‘2’ to /sys/block/dm-?/bow/state.
		It is only possible to go from state 0 (initial state) to state 1, and then from
		state 1 to state 2.

		State 0: dm_bow collects all trims to the device and assumes that these mark
		free space on the overlying file system that can be safely used. Typically the
		mount code would create the dm_bow device, mount the file system, call the
		FITRIM ioctl on the file system then switch to state 1. These trims are not
		propagated to the underlying device.

		State 1: All writes to the device cause the underlying data to be backed up to
		the free (trimmed) area as needed in such a way as they can be restored.
		However, the writes, with one exception, then happen exactly as they would
		without dm_bow, so the device is always in a good final state. The exception is
		that sector 0 is used to keep a log of the latest changes, both to indicate that
		we are in this state and to allow rollback. See below for all details. If there
		isn't enough free space, writes are failed with -ENOSPC.

		State 2: The transition to state 2 triggers replacing the special sector 0 with
		the normal sector 0, and the freeing of all state information. dm_bow then
		becomes a pass-through driver, allowing the device to continue to be used with
		minimal performance impact.
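
The allowed state changes form a one-way sequence, which can be expressed as a small check. This is an illustrative sketch of the rule stated above, not the driver's code; the enum and function names are hypothetical.

```c
/* The three dm_bow states, numbered as in the text. Names are illustrative. */
enum bow_state {
	BOW_STATE_0 = 0, /* collecting trims */
	BOW_STATE_1 = 1, /* backing up on write */
	BOW_STATE_2 = 2  /* committed, pass-through */
};

/* Returns 1 if writing 'to' while in state 'from' is a legal transition. */
static int bow_transition_allowed(enum bow_state from, enum bow_state to)
{
	return (from == BOW_STATE_0 && to == BOW_STATE_1) ||
	       (from == BOW_STATE_1 && to == BOW_STATE_2);
}
```

Under this rule there is no way back to an earlier state and no direct jump from state 0 to state 2.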

		Usage
		=====
		dm-bow takes one command line parameter, the name of the underlying device.

		dm-bow will typically be used in the following way. dm-bow will be loaded with a
		suitable underlying device and the resultant device will be mounted. A file
		system trim will be issued via the FITRIM ioctl, then the device will be
		switched to state 1. The file system will now be used as normal. At some point,
		the changes can either be committed by switching to state 2, or rolled back by
		unmounting the file system, removing the dm-bow device and running the command
		line utility. Note that rebooting the device will be equivalent to unmounting
		and removing, but the command line utility must still be run.

		Details of operation in state 1
		===============================

		dm_bow maintains a type for all sectors. A sector can be any of:

		SECTOR0
		SECTOR0_CURRENT
		UNCHANGED
		FREE
		CHANGED
		BACKUP

		SECTOR0 is the first sector on the device, and is used to hold the log of
		changes. This is the one exception noted in the description of state 1.

		SECTOR0_CURRENT is a sector picked from the FREE sectors, and is where reads and
		writes from the true sector zero are redirected to. Note that like any backup
		sector, if the sector is written to directly, it must be moved again.

		UNCHANGED means that the sector has not been changed since we entered state 1.
		Thus if it is written to or trimmed, the contents must first be backed up.

		FREE means that the sector was trimmed in state 0 and has not yet been written
		to or used for backup. On being written to, a FREE sector is changed to CHANGED.

		CHANGED means that the sector has been modified, and can be further modified
		without further backup.

		BACKUP means that this is a free sector being used as a backup. On being written
		to, the contents must first be backed up again.
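
The per-type rules above can be summarized as a single predicate: does a direct write to a sector of this type require its current contents to be preserved first? The sketch below is one reading of the text, not the driver's implementation. In particular, treating SECTOR0 as not needing preservation (because accesses to the true sector zero are redirected to SECTOR0_CURRENT) is an interpretation, and the function name is hypothetical.

```c
#include <stdbool.h>

/* Sector types as listed in the text. */
enum bow_type {
	SECTOR0,
	SECTOR0_CURRENT,
	UNCHANGED,
	FREE,
	CHANGED,
	BACKUP
};

/*
 * Returns true if the existing contents must be backed up (or moved)
 * before a direct write to a sector of the given type.
 */
static bool bow_write_must_preserve_first(enum bow_type type)
{
	switch (type) {
	case UNCHANGED:       /* original data, not yet backed up */
	case BACKUP:          /* holds a backup that must be moved */
	case SECTOR0_CURRENT: /* behaves like a backup sector */
		return true;
	case FREE:            /* becomes CHANGED, nothing to save */
	case CHANGED:         /* already backed up once */
	case SECTOR0:         /* assumption: redirected, see lead-in */
		return false;
	}
	return false;
}
```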

		All backup operations are logged to the first sector. The log sector has the
		format:
		--------------------------------------------------------
		| Magic | Count | Sequence | Log entry | Log entry | …
		--------------------------------------------------------

		Magic is a magic number. Count is the number of log entries. Sequence is 0
		initially. A log entry is

		-----------------------------------
		| Source | Dest | Size | Checksum |
		-----------------------------------

		When SECTOR0 is full, the log sector is backed up and another empty log sector
		is created with a sequence number one higher. The first entry in any log
		sector with sequence > 0 therefore must be the log of the backing up of the
		previous log sector. Note that sequence is not strictly needed, but is a
		useful sanity check and potentially limits the time spent trying to restore
		a corrupted snapshot.
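
The log layout described above could be modeled with two structs. The text does not give field widths, so the widths, the explicit padding field and all names below are assumptions chosen for illustration; the real on-disk format may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* One log entry: Source, Dest, Size, Checksum (widths are assumptions). */
struct bow_log_entry {
	uint64_t source;
	uint64_t dest;
	uint32_t size;
	uint32_t checksum;
};

/* Log sector header followed by the entries (widths are assumptions). */
struct bow_log_sector {
	uint32_t magic;
	uint32_t count;    /* number of valid entries */
	uint32_t sequence; /* 0 initially, one higher per new log sector */
	uint32_t pad;      /* hypothetical padding to align the entries */
	struct bow_log_entry entries[];
};

/* Number of entries that fit in a log sector of the given size. */
static size_t bow_log_capacity(size_t sector_bytes)
{
	size_t header = offsetof(struct bow_log_sector, entries);

	if (sector_bytes < header)
		return 0;
	return (sector_bytes - header) / sizeof(struct bow_log_entry);
}

/* True when the log is full and must be backed up and restarted. */
static int bow_log_is_full(uint32_t count, size_t sector_bytes)
{
	return count >= bow_log_capacity(sector_bytes);
}
```

With these assumed widths on common ABIs, a 4096-byte log would hold 170 entries before a new log sector with the next sequence number is needed.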

		On entering state 1, dm_bow has a list of free sectors. All other sectors are
		unchanged. SECTOR0_CURRENT is selected from the free sectors and the contents
		of sector 0 are copied there. Sector 0 is then backed up, which triggers the
		first log entry to be written.