Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block (36869cb9) · e / devices / android_kernel_teracube_emerald

Documentation/ABI/testing/sysfs-block

+42 −0

		@@ -235,3 +235,45 @@ Description:
		write_same_max_bytes is 0, write same is not supported
		by the device.

		What: /sys/block/<disk>/queue/write_zeroes_max_bytes
		Date: November 2016
		Contact: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
		Description:
		Devices that support write zeroes operation in which a
		single request can be issued to zero out the range of
		contiguous blocks on storage without having any payload
		in the request. This can be used to optimize writing zeroes
		to the devices. write_zeroes_max_bytes indicates how many
		bytes can be written in a single write zeroes command. If
		write_zeroes_max_bytes is 0, write zeroes is not supported
		by the device.
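As an illustration of how user space might consume this attribute, here is a minimal Python sketch. It assumes a Linux system whose sysfs exposes `write_zeroes_max_bytes`. The helper names `write_zeroes_limit` and `commands_needed` and the `sysfs_root` parameter are hypothetical. `sysfs_root` exists only so the code can run against a scratch directory. The sketch follows the entry's rule that a value of 0 means the operation is unsupported.

```python
from pathlib import Path
from typing import Optional


def write_zeroes_limit(disk: str, sysfs_root: str = "/sys") -> Optional[int]:
    """Return the write zeroes limit in bytes for `disk`, or None if unsupported.

    Reads <sysfs_root>/block/<disk>/queue/write_zeroes_max_bytes. A value of 0
    means the device does not support write zeroes.
    """
    attr = Path(sysfs_root) / "block" / disk / "queue" / "write_zeroes_max_bytes"
    max_bytes = int(attr.read_text().strip())
    return max_bytes if max_bytes > 0 else None


def commands_needed(length: int, max_bytes: int) -> int:
    """How many write zeroes commands of at most `max_bytes` cover `length` bytes.

    Plain ceiling division; the block layer performs its own request splitting.
    """
    if max_bytes <= 0:
        raise ValueError("write zeroes is not supported (max_bytes is 0)")
    return -(-length // max_bytes)
```

On a real machine, `write_zeroes_limit("sda")` would read `/sys/block/sda/queue/write_zeroes_max_bytes`. The disk name here is only an example.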

		What: /sys/block/<disk>/queue/zoned
		Date: September 2016
		Contact: Damien Le Moal <damien.lemoal@hgst.com>
		Description:
		zoned indicates if the device is a zoned block device
		and the zone model of the device if it is indeed zoned.
		The possible values indicated by zoned are "none" for
		regular block devices and "host-aware" or "host-managed"
		for zoned block devices. The characteristics of
		host-aware and host-managed zoned block devices are
		described in the ZBC (Zoned Block Commands) and ZAC
		(Zoned Device ATA Command Set) standards. These standards
		also define the "drive-managed" zone model. However,
		since drive-managed zoned block devices do not support
		zone commands, they will be treated as regular block
		devices and zoned will report "none".
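The following hedged sketch interprets the values listed above. `ZoneModel`, `parse_zoned` and `is_zoned` are hypothetical names. The only inputs assumed are the three strings this entry documents. Drive-managed devices report "none", so the sketch rejects any other string (including "drive-managed") instead of guessing.

```python
from enum import Enum


class ZoneModel(Enum):
    NONE = "none"                  # regular device, or a drive-managed zoned device
    HOST_AWARE = "host-aware"
    HOST_MANAGED = "host-managed"


def parse_zoned(value: str) -> ZoneModel:
    """Convert the text of /sys/block/<disk>/queue/zoned into a ZoneModel.

    Raises ValueError for anything other than the three documented strings.
    """
    return ZoneModel(value.strip())


def is_zoned(model: ZoneModel) -> bool:
    """True for host-aware and host-managed devices, which support zone commands."""
    return model is not ZoneModel.NONE
```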

		What: /sys/block/<disk>/queue/chunk_sectors
		Date: September 2016
		Contact: Hannes Reinecke <hare@suse.com>
		Description:
		chunk_sectors has different meaning depending on the type
		of the disk. For a RAID device (dm-raid), chunk_sectors
		indicates the size in 512B sectors of the RAID volume
		stripe segment. For a zoned block device, either
		host-aware or host-managed, chunk_sectors indicates the
		size of 512B sectors of the zones of the device, with
		the eventual exception of the last zone of the device
		which may be smaller.
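The following Python sketch works through the chunk_sectors arithmetic for a zoned device. It assumes the disk capacity is also known in 512-byte sectors, as reported by `/sys/block/<disk>/size`. `zone_layout` is a hypothetical helper. Following the entry, every zone is chunk_sectors long except possibly the last, so the zone count is a ceiling division.

```python
from typing import Tuple

SECTOR_SIZE = 512  # chunk_sectors (and the disk size attribute) count 512-byte sectors


def zone_layout(capacity_sectors: int, chunk_sectors: int) -> Tuple[int, int, int]:
    """Return (zone_count, zone_bytes, last_zone_bytes) for a zoned device.

    Every zone is chunk_sectors long except possibly the last one, which
    holds whatever capacity remains.
    """
    if chunk_sectors <= 0 or capacity_sectors <= 0:
        raise ValueError("capacity and chunk_sectors must be positive")
    full_zones, remainder = divmod(capacity_sectors, chunk_sectors)
    zone_count = full_zones + (1 if remainder else 0)
    last_zone_sectors = remainder if remainder else chunk_sectors
    return zone_count, chunk_sectors * SECTOR_SIZE, last_zone_sectors * SECTOR_SIZE
```

For example, 256 MiB zones correspond to chunk_sectors = 524288. `zone_layout(3 * 524288 + 1000, 524288)` returns `(4, 268435456, 512000)`: three full zones plus a smaller final zone.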

Documentation/block/biodoc.txt

+3 −3

		@@ -348,7 +348,7 @@ Drivers can now specify a request prepare function (q->prep_rq_fn) that the
		block layer would invoke to pre-build device commands for a given request,
		or perform other preparatory processing for the request. This is routine is
		called by elv_next_request(), i.e. typically just before servicing a request.
-		(The prepare function would not be called for requests that have REQ_DONTPREP
+		(The prepare function would not be called for requests that have RQF_DONTPREP
		enabled)

		Aside:
		@@ -553,8 +553,8 @@ struct request {
		struct request_list *rl;
		}

-		See the rq_flag_bits definitions for an explanation of the various flags
-		available. Some bits are used by the block layer or i/o scheduler.
+		See the req_ops and req_flag_bits definitions for an explanation of the various
+		flags available. Some bits are used by the block layer or i/o scheduler.

		The behaviour of the various sector counts are almost the same as before,
		except that since we have multi-segment bios, current_nr_sectors refers

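This toy Python model makes the RQF_DONTPREP rule concrete. The prepare hook runs the first time a request is peeked and sets RQF_DONTPREP. If the same request is handed out again, for example after a requeue, the hook is skipped. `Request`, `prep_fn`, `next_request` and the bit value of `RQF_DONTPREP` are all hypothetical stand-ins, not the kernel's definitions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

RQF_DONTPREP = 1 << 0  # hypothetical bit value; only the flag's meaning is modeled


@dataclass
class Request:
    tag: int
    rq_flags: int = 0
    command: Optional[str] = None  # stands in for a pre-built device command


def prep_fn(rq: Request) -> None:
    """Hypothetical driver prepare hook: build the command once and mark the
    request with RQF_DONTPREP so it is not rebuilt if the request is requeued."""
    rq.command = f"CMD[{rq.tag}]"
    rq.rq_flags |= RQF_DONTPREP


def next_request(queue: List[Request],
                 prep: Optional[Callable[[Request], None]]) -> Optional[Request]:
    """Return the request at the head of the queue, running the prepare hook
    first unless the request already has RQF_DONTPREP set."""
    if not queue:
        return None
    rq = queue[0]
    if prep is not None and not (rq.rq_flags & RQF_DONTPREP):
        prep(rq)
    return rq
```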
Documentation/block/cfq-iosched.txt

+16 −16

		@@ -240,11 +240,11 @@ All cfq queues doing synchronous sequential IO go on to sync-idle tree.
		On this tree we idle on each queue individually.

		All synchronous non-sequential queues go on sync-noidle tree. Also any
-		request which are marked with REQ_NOIDLE go on this service tree. On this
-		tree we do not idle on individual queues instead idle on the whole group
-		of queues or the tree. So if there are 4 queues waiting for IO to dispatch
-		we will idle only once last queue has dispatched the IO and there is
-		no more IO on this service tree.
+		synchronous write request which is not marked with REQ_IDLE goes on this
+		service tree. On this tree we do not idle on individual queues instead idle
+		on the whole group of queues or the tree. So if there are 4 queues waiting
+		for IO to dispatch we will idle only once last queue has dispatched the IO
+		and there is no more IO on this service tree.

		All async writes go on async service tree. There is no idling on async
		queues.
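The tree-selection rules in this hunk (after the change) can be summarised as a decision function. This is a simplified, hypothetical model. `service_tree` and its boolean inputs are illustrative. Real CFQ judges whether a queue is sequential from its recent seek history, not from a flag.

```python
from enum import Enum


class ServiceTree(Enum):
    SYNC_IDLE = "sync-idle"      # idle on each queue individually
    SYNC_NOIDLE = "sync-noidle"  # idle once for the whole tree
    ASYNC = "async"              # no idling


def service_tree(is_sync: bool, is_sequential: bool,
                 is_write: bool, req_idle: bool) -> ServiceTree:
    """Choose a CFQ service tree for IO, following the rules in this file."""
    if not is_sync:
        return ServiceTree.ASYNC
    if is_write and not req_idle:
        # A synchronous write without REQ_IDLE goes to sync-noidle even if sequential.
        return ServiceTree.SYNC_NOIDLE
    if not is_sequential:
        return ServiceTree.SYNC_NOIDLE
    return ServiceTree.SYNC_IDLE
```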
		@@ -257,17 +257,17 @@ tree idling provides isolation with buffered write queues on async tree.

		FAQ
		===
-		Q1. Why to idle at all on queues marked with REQ_NOIDLE.
+		Q1. Why to idle at all on queues not marked with REQ_IDLE.

-		A1. We only do tree idle (all queues on sync-noidle tree) on queues marked
-		with REQ_NOIDLE. This helps in providing isolation with all the sync-idle
+		A1. We only do tree idle (all queues on sync-noidle tree) on queues not marked
+		with REQ_IDLE. This helps in providing isolation with all the sync-idle
		queues. Otherwise in presence of many sequential readers, other
		synchronous IO might not get fair share of disk.

		For example, if there are 10 sequential readers doing IO and they get
-		100ms each. If a REQ_NOIDLE request comes in, it will be scheduled
-		roughly after 1 second. If after completion of REQ_NOIDLE request we
-		do not idle, and after a couple of milli seconds a another REQ_NOIDLE
+		100ms each. If a !REQ_IDLE request comes in, it will be scheduled
+		roughly after 1 second. If after completion of !REQ_IDLE request we
+		do not idle, and after a couple of milli seconds a another !REQ_IDLE
		request comes in, again it will be scheduled after 1second. Repeat it
		and notice how a workload can lose its disk share and suffer due to
		multiple sequential readers.
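The arithmetic behind the FAQ example can be written as a toy model. `total_delay_ms`, the `gap_ms` between dependent requests and the `idle_window_ms` default are assumptions made for illustration. The model also ignores the service time of the small requests themselves. With 10 readers at 100 ms each, every !REQ_IDLE request that misses the tree idle pays a full 1 second round.

```python
def round_ms(n_readers: int, slice_ms: int) -> int:
    """Time for every sequential reader to use its slice once."""
    return n_readers * slice_ms


def total_delay_ms(n_requests: int, n_readers: int, slice_ms: int,
                   gap_ms: int, tree_idle: bool, idle_window_ms: int = 8) -> int:
    """Toy model: queueing delay for a chain of dependent !REQ_IDLE requests.

    Each request is issued gap_ms after the previous one completes. Without
    tree idling (or if the gap exceeds the idle window) every request waits a
    full round of reader slices; with tree idling only the first one does.
    Service time of the small requests themselves is ignored.
    """
    if n_requests <= 0:
        return 0
    full_round = round_ms(n_readers, slice_ms)
    if not tree_idle or gap_ms > idle_window_ms:
        return n_requests * full_round
    return full_round + (n_requests - 1) * gap_ms
```

Under these assumptions, five back-to-back fsync-style writes cost about 5 seconds without tree idling. With tree idling they cost about 1 second.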
		@@ -276,16 +276,16 @@ A1. We only do tree idle (all queues on sync-noidle tree) on queues marked
		context of fsync, and later some journaling data is written. Journaling
		data comes in only after fsync has finished its IO (atleast for ext4
		that seemed to be the case). Now if one decides not to idle on fsync
-		thread due to REQ_NOIDLE, then next journaling write will not get
+		thread due to !REQ_IDLE, then next journaling write will not get
		scheduled for another second. A process doing small fsync, will suffer
		badly in presence of multiple sequential readers.

-		Hence doing tree idling on threads using REQ_NOIDLE flag on requests
+		Hence doing tree idling on threads using !REQ_IDLE flag on requests
		provides isolation from multiple sequential readers and at the same
		time we do not idle on individual threads.

-		Q2. When to specify REQ_NOIDLE
-		A2. I would think whenever one is doing synchronous write and not expecting
+		Q2. When to specify REQ_IDLE
+		A2. I would think whenever one is doing synchronous write and expecting
		more writes to be dispatched from same context soon, should be able
-		to specify REQ_NOIDLE on writes and that probably should work well for
+		to specify REQ_IDLE on writes and that probably should work well for
		most of the cases.

Documentation/block/null_blk.txt

+1 −1

		@@ -72,4 +72,4 @@ use_per_node_hctx=[0/1]: Default: 0
		queue for each CPU node in the system.

		use_lightnvm=[0/1]: Default: 0
-		Register device with LightNVM. Requires blk-mq to be used.
+		Register device with LightNVM. Requires blk-mq and CONFIG_NVM to be enabled.
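Here is a hedged sketch of a pre-flight check for the new requirement. It assumes the module parameters and the kernel's .config are available as plain values. `lightnvm_problems` and `parse_kconfig` are hypothetical helpers. The constant 2 for blk-mq assumes null_blk's documented `queue_mode` values (0 bio-based, 1 single-queue, 2 multi-queue, default 2).

```python
from typing import Dict, List

NULL_Q_MQ = 2  # assumed queue_mode value for blk-mq (0 = bio-based, 1 = single-queue)


def parse_kconfig(text: str) -> Dict[str, str]:
    """Parse 'CONFIG_FOO=value' lines from a kernel .config, skipping comments."""
    options: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        options[key] = value
    return options


def lightnvm_problems(params: Dict[str, int], kconfig: Dict[str, str]) -> List[str]:
    """List the unmet requirements for loading null_blk with use_lightnvm=1."""
    if params.get("use_lightnvm", 0) != 1:
        return []  # nothing to check when LightNVM registration is off
    problems = []
    if params.get("queue_mode", NULL_Q_MQ) != NULL_Q_MQ:
        problems.append("use_lightnvm=1 requires blk-mq (queue_mode=2)")
    if kconfig.get("CONFIG_NVM") != "y":
        problems.append("use_lightnvm=1 requires CONFIG_NVM=y")
    return problems
```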

Documentation/block/queue-sysfs.txt

+23 −0

		@@ -58,6 +58,20 @@ When read, this file shows the total number of block IO polls and how
		many returned success. Writing '0' to this file will disable polling
		for this device. Writing any non-zero value will enable this feature.

		io_poll_delay (RW)
		------------------
		If polling is enabled, this controls what kind of polling will be
		performed. It defaults to -1, which is classic polling. In this mode,
		the CPU will repeatedly ask for completions without giving up any time.
		If set to 0, a hybrid polling mode is used, where the kernel will attempt
		to make an educated guess at when the IO will complete. Based on this
		guess, the kernel will put the process issuing IO to sleep for an amount
		of time, before entering a classic poll loop. This mode might be a
		little slower than pure classic polling, but it will be more efficient.
		If set to a value larger than 0, the kernel will put the process issuing
		IO to sleep for this amont of microseconds before entering classic
		polling.

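This sketch shows how the three io_poll_delay settings map to behaviour. `poll_plan` and `PollMode` are hypothetical. The hybrid branch assumes the "educated guess" is half of the mean completion time observed so far. That is one plausible heuristic and should be read as an assumption, as should the choice to spin immediately when no statistics exist yet. Values are in microseconds, matching the text.

```python
from enum import Enum
from typing import Optional, Tuple


class PollMode(Enum):
    CLASSIC = "classic"          # io_poll_delay = -1: spin without sleeping
    HYBRID = "hybrid"            # io_poll_delay = 0: sleep for an estimate, then spin
    FIXED_SLEEP = "fixed-sleep"  # io_poll_delay > 0: sleep that many microseconds, then spin


def poll_plan(io_poll_delay: int,
              mean_completion_usec: Optional[float] = None) -> Tuple[PollMode, float]:
    """Return the polling mode and the sleep (in microseconds) before spinning."""
    if io_poll_delay == -1:
        return PollMode.CLASSIC, 0.0
    if io_poll_delay == 0:
        if mean_completion_usec is None:
            return PollMode.HYBRID, 0.0  # this sketch's choice: no statistics, so spin at once
        # ASSUMPTION: the "educated guess" is half the mean completion time.
        return PollMode.HYBRID, mean_completion_usec / 2
    if io_poll_delay > 0:
        return PollMode.FIXED_SLEEP, float(io_poll_delay)
    raise ValueError("io_poll_delay must be -1, 0 or a positive number of microseconds")
```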
		iostats (RW)
		-------------
		This file is used to control (on/off) the iostats accounting of the
		@@ -169,5 +183,14 @@ This is the number of bytes the device can write in a single write-same
		command. A value of '0' means write-same is not supported by this
		device.

		wb_lat_usec (RW)
		----------------
		If the device is registered for writeback throttling, then this file shows
		the target minimum read latency. If this latency is exceeded in a given
		window of time (see wb_window_usec), then the writeback throttling will start
		scaling back writes. Writing a value of '0' to this file disables the
		feature. Writing a value of '-1' to this file resets the value to the
		default setting.


		Jens Axboe <jens.axboe@oracle.com>, February 2009
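The following toy model captures the wb_lat_usec semantics described in the last hunk. Writing 0 disables throttling and writing -1 restores the default. Throttling kicks in when the minimum read latency in a window exceeds the target. `WbLatency` is a hypothetical class. The default latency is supplied by the caller because the text does not state it.

```python
from typing import Iterable


class WbLatency:
    """Toy model of the wb_lat_usec knob; all values are in microseconds."""

    def __init__(self, default_usec: int) -> None:
        self.default_usec = default_usec  # supplied by the caller; not specified in the text
        self.target_usec = default_usec

    def write(self, value: int) -> None:
        """Model a write to wb_lat_usec: 0 disables, -1 restores the default."""
        if value == -1:
            self.target_usec = self.default_usec
        elif value >= 0:
            self.target_usec = value
        else:
            raise ValueError("expected -1, 0 or a positive latency")

    @property
    def enabled(self) -> bool:
        return self.target_usec > 0

    def should_scale_back(self, window_read_latencies_usec: Iterable[int]) -> bool:
        """True if the minimum read latency seen in the window exceeds the target."""
        samples = list(window_read_latencies_usec)
        return self.enabled and bool(samples) and min(samples) > self.target_usec
```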