
Commit 73ba2fb3 authored by Linus Torvalds

Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "First pull request for this merge window, there will also be a
  followup request with some stragglers.

  This pull request contains:

   - Fix for a thundering herd issue in the wbt block code (Anchal
     Agarwal)

   - A few NVMe pull requests:
      * Improved tracepoints (Keith)
      * Larger inline data support for RDMA (Steve Wise)
      * RDMA setup/teardown fixes (Sagi)
      * Effects log support for NVMe target (Chaitanya Kulkarni)
      * Buffered IO support for NVMe target (Chaitanya Kulkarni)
      * TP4004 (ANA) support (Christoph)
      * Various NVMe fixes

   - Block io-latency controller support. Much needed support for
     properly containing block devices. (Josef)

   - Series improving how we handle sense information on the stack
     (Kees)

   - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

   - Zoned device support for null_blk (Matias)

   - AIX partition fixes (Mauricio Faria de Oliveira)

   - DIF checksum code made generic (Max Gurtovoy)

   - Add support for discard in iostats (Michael Callahan / Tejun)

   - Set of updates for BFQ (Paolo)

   - Removal of async write support for bsg (Christoph)

   - Bio page dirtying and clone fixups (Christoph)

   - Set of bcache fix/changes (via Coly)

   - Series improving blk-mq queue setup/teardown speed (Ming)

   - Series improving merging performance on blk-mq (Ming)

   - Lots of other fixes and cleanups from a slew of folks"

* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
  blkcg: Make blkg_root_lookup() work for queues in bypass mode
  bcache: fix error setting writeback_rate through sysfs interface
  null_blk: add lock drop/acquire annotation
  Blk-throttle: reduce tail io latency when iops limit is enforced
  block: paride: pd: mark expected switch fall-throughs
  block: Ensure that a request queue is dissociated from the cgroup controller
  block: Introduce blk_exit_queue()
  blkcg: Introduce blkg_root_lookup()
  block: Remove two superfluous #include directives
  blk-mq: count the hctx as active before allocating tag
  block: bvec_nr_vecs() returns value for wrong slab
  bcache: trivial - remove tailing backslash in macro BTREE_FLAG
  bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
  bcache: set max writeback rate when I/O request is idle
  bcache: add code comments for bset.c
  bcache: fix mistaken comments in request.c
  bcache: fix mistaken code comments in bcache.h
  bcache: add a comment in super.c
  bcache: avoid unncessary cache prefetch bch_btree_node_get()
  bcache: display rate debug parameters to 0 when writeback is not running
  ...
parents 958f338e b86d865c
+10 −0
@@ -5,6 +5,7 @@ Description:
 		The /proc/diskstats file displays the I/O statistics
 		of block devices. Each line contains the following 14
 		fields:
+
 		 1 - major number
 		 2 - minor number
 		 3 - device name
@@ -19,4 +20,13 @@ Description:
 		12 - I/Os currently in progress
 		13 - time spent doing I/Os (ms)
 		14 - weighted time spent doing I/Os (ms)
+
+		Kernel 4.18+ appends four more fields for discard
+		tracking, putting the total at 18:
+
+		15 - discards completed successfully
+		16 - discards merged
+		17 - sectors discarded
+		18 - time spent discarding
+
 		For more details refer to Documentation/iostats.txt
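As a quick illustration (editor's sketch, not part of the commit), the 14-field and 18-field formats described above can be told apart by token count; the field names below follow the documentation, while the helper function itself is hypothetical:

```python
# Sketch: parse a /proc/diskstats line, handling both the classic
# 14-field format and the 18-field format with discard counters
# that kernel 4.18+ appends.

FIELDS_14 = [
    "major", "minor", "device",
    "reads_completed", "reads_merged", "sectors_read", "read_ms",
    "writes_completed", "writes_merged", "sectors_written", "write_ms",
    "io_in_progress", "io_ms", "weighted_io_ms",
]
DISCARD_FIELDS = [
    "discards_completed", "discards_merged", "sectors_discarded", "discard_ms",
]

def parse_diskstats_line(line):
    parts = line.split()
    # 4.18+ lines carry four extra discard fields after the original 14.
    names = FIELDS_14 + (DISCARD_FIELDS if len(parts) >= 18 else [])
    stats = dict(zip(names, parts))
    # Everything except the device name is an integer counter.
    return {k: (v if k == "device" else int(v)) for k, v in stats.items()}

# The 4.18+ example line quoted in Documentation/iostats.txt below:
sample = ("3 0 hda 446216 784926 9550688 4382310 424847 312726 "
          "5922052 19310380 0 3376340 23705160 0 0 0 0")
print(parse_diskstats_line(sample)["sectors_read"])  # 9550688
```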
+88 −4
@@ -51,6 +51,9 @@ v1 is available under Documentation/cgroup-v1/.
     5-3. IO
       5-3-1. IO Interface Files
       5-3-2. Writeback
+      5-3-3. IO Latency
+        5-3-3-1. How IO Latency Throttling Works
+        5-3-3-2. IO Latency Interface Files
     5-4. PID
       5-4-1. PID Interface Files
     5-5. Device
@@ -1314,17 +1317,19 @@ IO Interface Files
 	Lines are keyed by $MAJ:$MIN device numbers and not ordered.
 	The following nested keys are defined.
 
-	  ======	===================
+	  ======	=====================
 	  rbytes	Bytes read
 	  wbytes	Bytes written
 	  rios		Number of read IOs
 	  wios		Number of write IOs
-	  ======	===================
+	  dbytes	Bytes discarded
+	  dios		Number of discard IOs
+	  ======	=====================
 
 	An example read output follows:
 
-	  8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353
-	  8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252
+	  8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
+	  8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021
 
   io.weight
 	A read-write flat-keyed file which exists on non-root cgroups.
@@ -1446,6 +1451,85 @@ writeback as follows.
 	vm.dirty[_background]_ratio.
 
 
+IO Latency
+~~~~~~~~~~
+
+This is a cgroup v2 controller for IO workload protection.  You provide a group
+with a latency target, and if the average latency exceeds that target the
+controller will throttle any peers that have a lower latency target than the
+protected workload.
+
+The limits are only applied at the peer level in the hierarchy.  This means that
+in the diagram below, only groups A, B, and C will influence each other, and
+groups D and F will influence each other.  Group G will influence nobody.
+
+			[root]
+		/	   |		\
+		A	   B		C
+	       /  \        |
+	      D    F	   G
+
+
+So the ideal way to configure this is to set io.latency in groups A, B, and C.
+Generally you do not want to set a value lower than the latency your device
+supports.  Experiment to find the value that works best for your workload.
+Start at higher than the expected latency for your device and watch the
+avg_lat value in io.stat for your workload group to get an idea of the
+latency you see during normal operation.  Use the avg_lat value as a basis for
+your real setting, setting at 10-15% higher than the value in io.stat.
+
+How IO Latency Throttling Works
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+io.latency is work conserving; so as long as everybody is meeting their latency
+target the controller doesn't do anything.  Once a group starts missing its
+target it begins throttling any peer group that has a higher target than itself.
+This throttling takes 2 forms:
+
+- Queue depth throttling.  This is the number of outstanding IO's a group is
+  allowed to have.  We will clamp down relatively quickly, starting at no limit
+  and going all the way down to 1 IO at a time.
+
+- Artificial delay induction.  There are certain types of IO that cannot be
+  throttled without possibly adversely affecting higher priority groups.  This
+  includes swapping and metadata IO.  These types of IO are allowed to occur
+  normally, however they are "charged" to the originating group.  If the
+  originating group is being throttled you will see the use_delay and delay
+  fields in io.stat increase.  The delay value is how many microseconds that are
+  being added to any process that runs in this group.  Because this number can
+  grow quite large if there is a lot of swapping or metadata IO occurring we
+  limit the individual delay events to 1 second at a time.
+
+Once the victimized group starts meeting its latency target again it will start
+unthrottling any peer groups that were throttled previously.  If the victimized
+group simply stops doing IO the global counter will unthrottle appropriately.
+
+IO Latency Interface Files
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+  io.latency
+	This takes a similar format as the other controllers.
+
+		"MAJOR:MINOR target=<target time in microseconds>"
+
+  io.stat
+	If the controller is enabled you will see extra stats in io.stat in
+	addition to the normal ones.
+
+	  depth
+		This is the current queue depth for the group.
+
+	  avg_lat
+		This is an exponential moving average with a decay rate of 1/exp
+		bound by the sampling interval.  The decay rate interval can be
+		calculated by multiplying the win value in io.stat by the
+		corresponding number of samples based on the win value.
+
+	  win
+		The sampling window size in milliseconds.  This is the minimum
+		duration of time between evaluation events.  Windows only elapse
+		with IO activity.  Idle periods extend the most recent window.
+
 PID
 ---
 
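Following the tuning guidance in the added documentation (watch avg_lat in io.stat, then set the target 10-15% higher), one could pick a target like this; the io.stat line and the helper names are illustrative editor's assumptions, and the depth/avg_lat/win keys only appear when the controller is enabled:

```python
# Sketch: derive an io.latency target from an observed io.stat avg_lat,
# per the "set 10-15% above the value in io.stat" guidance above.
# The sample line below is hypothetical.

def parse_io_stat_line(line):
    """Split one io.stat line into (device, {key: int})."""
    device, *pairs = line.split()
    return device, {k: int(v) for k, v in (p.split("=") for p in pairs)}

def suggested_target(avg_lat_us, headroom=0.15):
    """Target = observed average latency plus 10-15% headroom."""
    return int(avg_lat_us * (1 + headroom))

dev, stats = parse_io_stat_line(
    "8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 "
    "dbytes=0 dios=0 depth=8 avg_lat=2000 win=100")
target = suggested_target(stats["avg_lat"])
# One would then write f"{dev} target={target}" to the group's
# io.latency file ("MAJOR:MINOR target=<target time in microseconds>").
print(dev, target)  # 8:16 2300
```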
+7 −0
@@ -85,3 +85,10 @@ shared_tags=[0/1]: Default: 0
  0: Tag set is not shared.
  1: Tag set shared between devices for blk-mq. Only makes sense with
     nr_devices > 1, otherwise there's no tag set to share.
+
+zoned=[0/1]: Default: 0
+  0: Block device is exposed as a random-access block device.
+  1: Block device is exposed as a host-managed zoned block device.
+
+zone_size=[MB]: Default: 256
+  Per zone size when exposed as a zoned block device. Must be a power of two.
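The power-of-two constraint on zone_size above can be checked with the usual bit trick; this is an editor's sketch with illustrative helper names, not part of null_blk itself:

```python
# Sketch: validate a null_blk zone_size (must be a power of two, per the
# parameter description above) and compute how many zones a device of a
# given size would expose.

def is_power_of_two(n):
    # A power of two has exactly one bit set, so n & (n - 1) clears it.
    return n > 0 and (n & (n - 1)) == 0

def zone_count(device_size_mb, zone_size_mb=256):
    if not is_power_of_two(zone_size_mb):
        raise ValueError("zone_size must be a power of two")
    # With zoned=1, null_blk exposes host-managed zones of zone_size MB each.
    return device_size_mb // zone_size_mb

print(zone_count(1024))  # 4 zones of the default 256 MB
```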
+16 −12
@@ -31,28 +31,32 @@ write ticks milliseconds total wait time for write requests
 in_flight       requests      number of I/Os currently in flight
 io_ticks        milliseconds  total time this block device has been active
 time_in_queue   milliseconds  total wait time for all requests
+discard I/Os    requests      number of discard I/Os processed
+discard merges  requests      number of discard I/Os merged with in-queue I/O
+discard sectors sectors       number of sectors discarded
+discard ticks   milliseconds  total wait time for discard requests
 
-read I/Os, write I/Os
-=====================
+read I/Os, write I/Os, discard I/Os
+===================================
 
 These values increment when an I/O request completes.
 
-read merges, write merges
-=========================
+read merges, write merges, discard merges
+=========================================
 
 These values increment when an I/O request is merged with an
 already-queued I/O request.
 
-read sectors, write sectors
-===========================
+read sectors, write sectors, discard sectors
+============================================
 
-These values count the number of sectors read from or written to this
-block device.  The "sectors" in question are the standard UNIX 512-byte
-sectors, not any device- or filesystem-specific block size.  The
-counters are incremented when the I/O completes.
+These values count the number of sectors read from, written to, or
+discarded from this block device.  The "sectors" in question are the
+standard UNIX 512-byte sectors, not any device- or filesystem-specific
+block size.  The counters are incremented when the I/O completes.
 
-read ticks, write ticks
-=======================
+read ticks, write ticks, discard ticks
+======================================
 
 These values count the number of milliseconds that I/O requests have
 waited on this block device.  If there are multiple I/O requests waiting,
+15 −0
@@ -31,6 +31,9 @@ Here are examples of these different formats::
 
      3    0   hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
      3    1   hda1 35486 38030 38030 38030
 
+   4.18+ diskstats:
+      3    0   hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 0 0 0 0
+
 On 2.4 you might execute ``grep 'hda ' /proc/partitions``. On 2.6+, you have
 a choice of ``cat /sys/block/hda/stat`` or ``grep 'hda ' /proc/diskstats``.
 
@@ -101,6 +104,18 @@ Field 11 -- weighted # of milliseconds spent doing I/Os
     last update of this field.  This can provide an easy measure of both
     I/O completion time and the backlog that may be accumulating.
 
+Field 12 -- # of discards completed
+    This is the total number of discards completed successfully.
+
+Field 13 -- # of discards merged
+    See the description of field 2
+
+Field 14 -- # of sectors discarded
+    This is the total number of sectors discarded successfully.
+
+Field 15 -- # of milliseconds spent discarding
+    This is the total number of milliseconds spent by all discards (as
+    measured from __make_request() to end_that_request_last()).
+
 To avoid introducing performance bottlenecks, no locks are held while
 modifying these counters.  This implies that minor inaccuracies may be
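Since the sector counters above use the standard 512-byte unit and increment on completion, discard throughput between two samples can be estimated as follows; this is an editor's sketch and the sample numbers are made up:

```python
# Sketch: estimate discard throughput from two samples of the
# "# of sectors discarded" counter (Field 14 above).  Sectors are the
# standard UNIX 512-byte unit, regardless of device block size.
SECTOR_BYTES = 512

def discard_bytes_per_sec(sectors_before, sectors_after, interval_s):
    """Bytes discarded per second between two counter samples."""
    return (sectors_after - sectors_before) * SECTOR_BYTES / interval_s

# Hypothetical samples taken 10 seconds apart:
rate = discard_bytes_per_sec(50_000, 151_040, 10.0)
print(rate)  # (151040 - 50000) * 512 / 10 = 5173248.0 bytes/s
```

Because no locks are held while the kernel updates these counters, small sampling inaccuracies are expected, as the text below notes.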