Merge tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block (0e9da3fb) · Commits · e / devices / android_kernel_fairphone_FP5

Documentation/ABI/testing/sysfs-block

+10 −2

Original line number	Diff line number	Diff line
		@@ -244,7 +244,7 @@ Description:

		What: /sys/block/<disk>/queue/zoned
		Date: September 2016
		Contact: Damien Le Moal <damien.lemoal@hgst.com>
		Contact: Damien Le Moal <damien.lemoal@wdc.com>
		Description:
		zoned indicates if the device is a zoned block device
		and the zone model of the device if it is indeed zoned.
		@@ -259,6 +259,14 @@ Description:
		zone commands, they will be treated as regular block
		devices and zoned will report "none".

		What: /sys/block/<disk>/queue/nr_zones
		Date: November 2018
		Contact: Damien Le Moal <damien.lemoal@wdc.com>
		Description:
		nr_zones indicates the total number of zones of a zoned block
		device ("host-aware" or "host-managed" zone model). For regular
		block devices, the value is always 0.

		What: /sys/block/<disk>/queue/chunk_sectors
		Date: September 2016
		Contact: Hannes Reinecke <hare@suse.com>
		@@ -268,6 +276,6 @@ Description:
		indicates the size in 512B sectors of the RAID volume
		stripe segment. For a zoned block device, either
		host-aware or host-managed, chunk_sectors indicates the
		size of 512B sectors of the zones of the device, with
		size in 512B sectors of the zones of the device, with
		the eventual exception of the last zone of the device
		which may be smaller.

Documentation/admin-guide/cgroup-v2.rst

+5 −3

Original line number	Diff line number	Diff line
		@@ -1879,8 +1879,10 @@ following two functions.

		wbc_init_bio(@wbc, @bio)
		Should be called for each bio carrying writeback data and
		associates the bio with the inode's owner cgroup. Can be
		called anytime between bio allocation and submission.
		associates the bio with the inode's owner cgroup and the
		corresponding request queue. This must be called after
		a queue (device) has been associated with the bio and
		before submission.

		wbc_account_io(@wbc, @page, @bytes)
		Should be called for each data segment being written out.
		@@ -1899,7 +1901,7 @@ the configuration, the bio may be executed at a lower priority and if
		the writeback session is holding shared resources, e.g. a journal
		entry, may lead to priority inversion. There is no one easy solution
		for the problem. Filesystems can try to work around specific problem
		cases by skipping wbc_init_bio() or using bio_associate_blkcg()
		cases by skipping wbc_init_bio() and using bio_associate_blkg()
		directly.

Documentation/block/biodoc.txt

+0 −88

Original line number	Diff line number	Diff line
		@@ -65,7 +65,6 @@ Description of Contents:
		3.2.3 I/O completion
		3.2.4 Implications for drivers that do not interpret bios (don't handle
		multiple segments)
		3.2.5 Request command tagging
		3.3 I/O submission
		4. The I/O scheduler
		5. Scalability related changes
		@@ -708,93 +707,6 @@ is crossed on completion of a transfer. (The endrequest functions should
		be used if only if the request has come down from block/bio path, not for
		direct access requests which only specify rq->buffer without a valid rq->bio)

		3.2.5 Generic request command tagging

		3.2.5.1 Tag helpers

		Block now offers some simple generic functionality to help support command
		queueing (typically known as tagged command queueing), ie manage more than
		one outstanding command on a queue at any given time.

		blk_queue_init_tags(struct request_queue *q, int depth)

		Initialize internal command tagging structures for a maximum
		depth of 'depth'.

		blk_queue_free_tags((struct request_queue *q)

		Teardown tag info associated with the queue. This will be done
		automatically by block if blk_queue_cleanup() is called on a queue
		that is using tagging.

		The above are initialization and exit management, the main helpers during
		normal operations are:

		blk_queue_start_tag(struct request_queue q, struct request rq)

		Start tagged operation for this request. A free tag number between
		0 and 'depth' is assigned to the request (rq->tag holds this number),
		and 'rq' is added to the internal tag management. If the maximum depth
		for this queue is already achieved (or if the tag wasn't started for
		some other reason), 1 is returned. Otherwise 0 is returned.

		blk_queue_end_tag(struct request_queue q, struct request rq)

		End tagged operation on this request. 'rq' is removed from the internal
		book keeping structures.

		To minimize struct request and queue overhead, the tag helpers utilize some
		of the same request members that are used for normal request queue management.
		This means that a request cannot both be an active tag and be on the queue
		list at the same time. blk_queue_start_tag() will remove the request, but
		the driver must remember to call blk_queue_end_tag() before signalling
		completion of the request to the block layer. This means ending tag
		operations before calling end_that_request_last()! For an example of a user
		of these helpers, see the IDE tagged command queueing support.

		3.2.5.2 Tag info

		Some block functions exist to query current tag status or to go from a
		tag number to the associated request. These are, in no particular order:

		blk_queue_tagged(q)

		Returns 1 if the queue 'q' is using tagging, 0 if not.

		blk_queue_tag_request(q, tag)

		Returns a pointer to the request associated with tag 'tag'.

		blk_queue_tag_depth(q)

		Return current queue depth.

		blk_queue_tag_queue(q)

		Returns 1 if the queue can accept a new queued command, 0 if we are
		at the maximum depth already.

		blk_queue_rq_tagged(rq)

		Returns 1 if the request 'rq' is tagged.

		3.2.5.2 Internal structure

		Internally, block manages tags in the blk_queue_tag structure:

		struct blk_queue_tag {
		struct request *tag_index; / array or pointers to rq */
		unsigned long tag_map; / bitmap of free tags */
		struct list_head busy_list; /* fifo list of busy tags */
		int busy; /* queue depth */
		int max_depth; /* max queue depth */
		};

		Most of the above is simple and straight forward, however busy_list may need
		a bit of explaining. Normally we don't care too much about request ordering,
		but in the event of any barrier requests in the tag queue we need to ensure
		that requests are restarted in the order they were queue.

		3.3 I/O Submission

		The routine submit_bio() is used to submit a single io. Higher level i/o

Documentation/block/cfq-iosched.txt

deleted100644 → 0

+0 −291

Original line number	Diff line number	Diff line
		CFQ (Complete Fairness Queueing)
		===============================

		The main aim of CFQ scheduler is to provide a fair allocation of the disk
		I/O bandwidth for all the processes which requests an I/O operation.

		CFQ maintains the per process queue for the processes which request I/O
		operation(synchronous requests). In case of asynchronous requests, all the
		requests from all the processes are batched together according to their
		process's I/O priority.

		CFQ ioscheduler tunables
		========================

		slice_idle
		----------
		This specifies how long CFQ should idle for next request on certain cfq queues
		(for sequential workloads) and service trees (for random workloads) before
		queue is expired and CFQ selects next queue to dispatch from.

		By default slice_idle is a non-zero value. That means by default we idle on
		queues/service trees. This can be very helpful on highly seeky media like
		single spindle SATA/SAS disks where we can cut down on overall number of
		seeks and see improved throughput.

		Setting slice_idle to 0 will remove all the idling on queues/service tree
		level and one should see an overall improved throughput on faster storage
		devices like multiple SATA/SAS disks in hardware RAID configuration. The down
		side is that isolation provided from WRITES also goes down and notion of
		IO priority becomes weaker.

		So depending on storage and workload, it might be useful to set slice_idle=0.
		In general I think for SATA/SAS disks and software RAID of SATA/SAS disks
		keeping slice_idle enabled should be useful. For any configurations where
		there are multiple spindles behind single LUN (Host based hardware RAID
		controller or for storage arrays), setting slice_idle=0 might end up in better
		throughput and acceptable latencies.

		back_seek_max
		-------------
		This specifies, given in Kbytes, the maximum "distance" for backward seeking.
		The distance is the amount of space from the current head location to the
		sectors that are backward in terms of distance.

		This parameter allows the scheduler to anticipate requests in the "backward"
		direction and consider them as being the "next" if they are within this
		distance from the current head location.

		back_seek_penalty
		-----------------
		This parameter is used to compute the cost of backward seeking. If the
		backward distance of request is just 1/back_seek_penalty from a "front"
		request, then the seeking cost of two requests is considered equivalent.

		So scheduler will not bias toward one or the other request (otherwise scheduler
		will bias toward front request). Default value of back_seek_penalty is 2.

		fifo_expire_async
		-----------------
		This parameter is used to set the timeout of asynchronous requests. Default
		value of this is 248ms.

		fifo_expire_sync
		----------------
		This parameter is used to set the timeout of synchronous requests. Default
		value of this is 124ms. In case to favor synchronous requests over asynchronous
		one, this value should be decreased relative to fifo_expire_async.

		group_idle
		-----------
		This parameter forces idling at the CFQ group level instead of CFQ
		queue level. This was introduced after a bottleneck was observed
		in higher end storage due to idle on sequential queue and allow dispatch
		from a single queue. The idea with this parameter is that it can be run with
		slice_idle=0 and group_idle=8, so that idling does not happen on individual
		queues in the group but happens overall on the group and thus still keeps the
		IO controller working.
		Not idling on individual queues in the group will dispatch requests from
		multiple queues in the group at the same time and achieve higher throughput
		on higher end storage.

		Default value for this parameter is 8ms.

		low_latency
		-----------
		This parameter is used to enable/disable the low latency mode of the CFQ
		scheduler. If enabled, CFQ tries to recompute the slice time for each process
		based on the target_latency set for the system. This favors fairness over
		throughput. Disabling low latency (setting it to 0) ignores target latency,
		allowing each process in the system to get a full time slice.

		By default low latency mode is enabled.

		target_latency
		--------------
		This parameter is used to calculate the time slice for a process if cfq's
		latency mode is enabled. It will ensure that sync requests have an estimated
		latency. But if sequential workload is higher(e.g. sequential read),
		then to meet the latency constraints, throughput may decrease because of less
		time for each process to issue I/O request before the cfq queue is switched.

		Though this can be overcome by disabling the latency_mode, it may increase
		the read latency for some applications. This parameter allows for changing
		target_latency through the sysfs interface which can provide the balanced
		throughput and read latency.

		Default value for target_latency is 300ms.

		slice_async
		-----------
		This parameter is same as of slice_sync but for asynchronous queue. The
		default value is 40ms.

		slice_async_rq
		--------------
		This parameter is used to limit the dispatching of asynchronous request to
		device request queue in queue's slice time. The maximum number of request that
		are allowed to be dispatched also depends upon the io priority. Default value
		for this is 2.

		slice_sync
		----------
		When a queue is selected for execution, the queues IO requests are only
		executed for a certain amount of time(time_slice) before switching to another
		queue. This parameter is used to calculate the time slice of synchronous
		queue.

		time_slice is computed using the below equation:-
		time_slice = slice_sync + (slice_sync/5 * (4 - prio)). To increase the
		time_slice of synchronous queue, increase the value of slice_sync. Default
		value is 100ms.

		quantum
		-------
		This specifies the number of request dispatched to the device queue. In a
		queue's time slice, a request will not be dispatched if the number of request
		in the device exceeds this parameter. This parameter is used for synchronous
		request.

		In case of storage with several disk, this setting can limit the parallel
		processing of request. Therefore, increasing the value can improve the
		performance although this can cause the latency of some I/O to increase due
		to more number of requests.

		CFQ Group scheduling
		====================

		CFQ supports blkio cgroup and has "blkio." prefixed files in each
		blkio cgroup directory. It is weight-based and there are four knobs
		for configuration - weight[_device] and leaf_weight[_device].
		Internal cgroup nodes (the ones with children) can also have tasks in
		them, so the former two configure how much proportion the cgroup as a
		whole is entitled to at its parent's level while the latter two
		configure how much proportion the tasks in the cgroup have compared to
		its direct children.

		Another way to think about it is assuming that each internal node has
		an implicit leaf child node which hosts all the tasks whose weight is
		configured by leaf_weight[_device]. Let's assume a blkio hierarchy
		composed of five cgroups - root, A, B, AA and AB - with the following
		weights where the names represent the hierarchy.

		weight leaf_weight
		root : 125 125
		A : 500 750
		B : 250 500
		AA : 500 500
		AB : 1000 500

		root never has a parent making its weight is meaningless. For backward
		compatibility, weight is always kept in sync with leaf_weight. B, AA
		and AB have no child and thus its tasks have no children cgroup to
		compete with. They always get 100% of what the cgroup won at the
		parent level. Considering only the weights which matter, the hierarchy
		looks like the following.

		root
		/ \| \
		A B leaf
		500 250 125
		/ \| \
		AA AB leaf
		500 1000 750

		If all cgroups have active IOs and competing with each other, disk
		time will be distributed like the following.

		Distribution below root. The total active weight at this level is
		A:500 + B:250 + C:125 = 875.

		root-leaf : 125 / 875 =~ 14%
		A : 500 / 875 =~ 57%
		B(-leaf) : 250 / 875 =~ 28%

		A has children and further distributes its 57% among the children and
		the implicit leaf node. The total active weight at this level is
		AA:500 + AB:1000 + A-leaf:750 = 2250.

		A-leaf : ( 750 / 2250) * A =~ 19%
		AA(-leaf) : ( 500 / 2250) * A =~ 12%
		AB(-leaf) : (1000 / 2250) * A =~ 25%

		CFQ IOPS Mode for group scheduling
		===================================
		Basic CFQ design is to provide priority based time slices. Higher priority
		process gets bigger time slice and lower priority process gets smaller time
		slice. Measuring time becomes harder if storage is fast and supports NCQ and
		it would be better to dispatch multiple requests from multiple cfq queues in
		request queue at a time. In such scenario, it is not possible to measure time
		consumed by single queue accurately.

		What is possible though is to measure number of requests dispatched from a
		single queue and also allow dispatch from multiple cfq queue at the same time.
		This effectively becomes the fairness in terms of IOPS (IO operations per
		second).

		If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches
		to IOPS mode and starts providing fairness in terms of number of requests
		dispatched. Note that this mode switching takes effect only for group
		scheduling. For non-cgroup users nothing should change.

		CFQ IO scheduler Idling Theory
		===============================
		Idling on a queue is primarily about waiting for the next request to come
		on same queue after completion of a request. In this process CFQ will not
		dispatch requests from other cfq queues even if requests are pending there.

		The rationale behind idling is that it can cut down on number of seeks
		on rotational media. For example, if a process is doing dependent
		sequential reads (next read will come on only after completion of previous
		one), then not dispatching request from other queue should help as we
		did not move the disk head and kept on dispatching sequential IO from
		one queue.

		CFQ has following service trees and various queues are put on these trees.

		sync-idle sync-noidle async

		All cfq queues doing synchronous sequential IO go on to sync-idle tree.
		On this tree we idle on each queue individually.

		All synchronous non-sequential queues go on sync-noidle tree. Also any
		synchronous write request which is not marked with REQ_IDLE goes on this
		service tree. On this tree we do not idle on individual queues instead idle
		on the whole group of queues or the tree. So if there are 4 queues waiting
		for IO to dispatch we will idle only once last queue has dispatched the IO
		and there is no more IO on this service tree.

		All async writes go on async service tree. There is no idling on async
		queues.

		CFQ has some optimizations for SSDs and if it detects a non-rotational
		media which can support higher queue depth (multiple requests at in
		flight at a time), then it cuts down on idling of individual queues and
		all the queues move to sync-noidle tree and only tree idle remains. This
		tree idling provides isolation with buffered write queues on async tree.

		FAQ
		===
		Q1. Why to idle at all on queues not marked with REQ_IDLE.

		A1. We only do tree idle (all queues on sync-noidle tree) on queues not marked
		with REQ_IDLE. This helps in providing isolation with all the sync-idle
		queues. Otherwise in presence of many sequential readers, other
		synchronous IO might not get fair share of disk.

		For example, if there are 10 sequential readers doing IO and they get
		100ms each. If a !REQ_IDLE request comes in, it will be scheduled
		roughly after 1 second. If after completion of !REQ_IDLE request we
		do not idle, and after a couple of milli seconds a another !REQ_IDLE
		request comes in, again it will be scheduled after 1second. Repeat it
		and notice how a workload can lose its disk share and suffer due to
		multiple sequential readers.

		fsync can generate dependent IO where bunch of data is written in the
		context of fsync, and later some journaling data is written. Journaling
		data comes in only after fsync has finished its IO (atleast for ext4
		that seemed to be the case). Now if one decides not to idle on fsync
		thread due to !REQ_IDLE, then next journaling write will not get
		scheduled for another second. A process doing small fsync, will suffer
		badly in presence of multiple sequential readers.

		Hence doing tree idling on threads using !REQ_IDLE flag on requests
		provides isolation from multiple sequential readers and at the same
		time we do not idle on individual threads.

		Q2. When to specify REQ_IDLE
		A2. I would think whenever one is doing synchronous write and expecting
		more writes to be dispatched from same context soon, should be able
		to specify REQ_IDLE on writes and that probably should work well for
		most of the cases.

Documentation/block/queue-sysfs.txt

+28 −1

Original line number	Diff line number	Diff line
		@@ -64,7 +64,7 @@ guess, the kernel will put the process issuing IO to sleep for an amount
		of time, before entering a classic poll loop. This mode might be a
		little slower than pure classic polling, but it will be more efficient.
		If set to a value larger than 0, the kernel will put the process issuing
		IO to sleep for this amont of microseconds before entering classic
		IO to sleep for this amount of microseconds before entering classic
		polling.

		iostats (RW)
		@@ -194,4 +194,31 @@ blk-throttle makes decision based on the samplings. Lower time means cgroups
		have more smooth throughput, but higher CPU overhead. This exists only when
		CONFIG_BLK_DEV_THROTTLING_LOW is enabled.

		zoned (RO)
		----------
		This indicates if the device is a zoned block device and the zone model of the
		device if it is indeed zoned. The possible values indicated by zoned are
		"none" for regular block devices and "host-aware" or "host-managed" for zoned
		block devices. The characteristics of host-aware and host-managed zoned block
		devices are described in the ZBC (Zoned Block Commands) and ZAC
		(Zoned Device ATA Command Set) standards. These standards also define the
		"drive-managed" zone model. However, since drive-managed zoned block devices
		do not support zone commands, they will be treated as regular block devices
		and zoned will report "none".

		nr_zones (RO)
		-------------
		For zoned block devices (zoned attribute indicating "host-managed" or
		"host-aware"), this indicates the total number of zones of the device.
		This is always 0 for regular block devices.

		chunk_sectors (RO)
		------------------
		This has different meaning depending on the type of the block device.
		For a RAID device (dm-raid), chunk_sectors indicates the size in 512B sectors
		of the RAID volume stripe segment. For a zoned block device, either host-aware
		or host-managed, chunk_sectors indicates the size in 512B sectors of the zones
		of the device, with the eventual exception of the last zone of the device which
		may be smaller.

		Jens Axboe <jens.axboe@oracle.com>, February 2009