Donate to e Foundation | Murena handsets with /e/OS | Own a part of Murena! Learn more

Commit 7caa4715 authored by Tejun Heo's avatar Tejun Heo Committed by Jens Axboe
Browse files

blkcg: implement blk-iocost



This patchset implements IO cost model based work-conserving
proportional controller.

While io.latency provides the capability to comprehensively prioritize
and protect IOs depending on the cgroups, its protection is binary -
the lowest latency target cgroup which is suffering is protected at
the cost of all others.  In many use cases including stacking multiple
workload containers in a single system, it's necessary to distribute
IO capacity with better granularity.

One challenge of controlling IO resources is the lack of trivially
observable cost metric.  The most common metrics - bandwidth and iops
- can be off by orders of magnitude depending on the device type and
IO pattern.  However, the cost isn't a complete mystery.  Given
several key attributes, we can make fairly reliable predictions on how
expensive a given stream of IOs would be, at least compared to other
IO patterns.

The function which determines the cost of a given IO is the IO cost
model for the device.  This controller distributes IO capacity based
on the costs estimated by such model.  The more accurate the cost
model the better but the controller adapts based on IO completion
latency and as long as the relative costs across differents IO
patterns are consistent and sensible, it'll adapt to the actual
performance of the device.

Currently, the only implemented cost model is a simple linear one with
a few sets of default parameters for different classes of device.
This covers most common devices reasonably well.  All the
infrastructure to tune and add different cost models is already in
place and a later patch will also allow using bpf progs for cost
models.

Please see the top comment in blk-iocost.c and documentation for
more details.

v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
    for a divide-by-zero bug in current_hweight() triggered by zero
    inuse_sum.

Signed-off-by: default avatarTejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
parent 6f816b4b
Loading
Loading
Loading
Loading
+94 −0
Original line number Diff line number Diff line
@@ -1435,6 +1435,100 @@ IO Interface Files
	  8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
	  8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021

  io.cost.qos
	A read-write nested-keyed file with exists only on the root
	cgroup.

	This file configures the Quality of Service of the IO cost
	model based controller (CONFIG_BLK_CGROUP_IOCOST) which
	currently implements "io.weight" proportional control.  Lines
	are keyed by $MAJ:$MIN device numbers and not ordered.  The
	line for a given device is populated on the first write for
	the device on "io.cost.qos" or "io.cost.model".  The following
	nested keys are defined.

	  ======	=====================================
	  enable	Weight-based control enable
	  ctrl		"auto" or "user"
	  rpct		Read latency percentile    [0, 100]
	  rlat		Read latency threshold
	  wpct		Write latency percentile   [0, 100]
	  wlat		Write latency threshold
	  min		Minimum scaling percentage [1, 10000]
	  max		Maximum scaling percentage [1, 10000]
	  ======	=====================================

	The controller is disabled by default and can be enabled by
	setting "enable" to 1.  "rpct" and "wpct" parameters default
	to zero and the controller uses internal device saturation
	state to adjust the overall IO rate between "min" and "max".

	When a better control quality is needed, latency QoS
	parameters can be configured.  For example::

	  8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0

	shows that on sdb, the controller is enabled, will consider
	the device saturated if the 95th percentile of read completion
	latencies is above 75ms or write 150ms, and adjust the overall
	IO issue rate between 50% and 150% accordingly.

	The lower the saturation point, the better the latency QoS at
	the cost of aggregate bandwidth.  The narrower the allowed
	adjustment range between "min" and "max", the more conformant
	to the cost model the IO behavior.  Note that the IO issue
	base rate may be far off from 100% and setting "min" and "max"
	blindly can lead to a significant loss of device capacity or
	control quality.  "min" and "max" are useful for regulating
	devices which show wide temporary behavior changes - e.g. a
	ssd which accepts writes at the line speed for a while and
	then completely stalls for multiple seconds.

	When "ctrl" is "auto", the parameters are controlled by the
	kernel and may change automatically.  Setting "ctrl" to "user"
	or setting any of the percentile and latency parameters puts
	it into "user" mode and disables the automatic changes.  The
	automatic mode can be restored by setting "ctrl" to "auto".

  io.cost.model
	A read-write nested-keyed file with exists only on the root
	cgroup.

	This file configures the cost model of the IO cost model based
	controller (CONFIG_BLK_CGROUP_IOCOST) which currently
	implements "io.weight" proportional control.  Lines are keyed
	by $MAJ:$MIN device numbers and not ordered.  The line for a
	given device is populated on the first write for the device on
	"io.cost.qos" or "io.cost.model".  The following nested keys
	are defined.

	  =====		================================
	  ctrl		"auto" or "user"
	  model		The cost model in use - "linear"
	  =====		================================

	When "ctrl" is "auto", the kernel may change all parameters
	dynamically.  When "ctrl" is set to "user" or any other
	parameters are written to, "ctrl" become "user" and the
	automatic changes are disabled.

	When "model" is "linear", the following model parameters are
	defined.

	  =============	========================================
	  [r|w]bps	The maximum sequential IO throughput
	  [r|w]seqiops	The maximum 4k sequential IOs per second
	  [r|w]randiops	The maximum 4k random IOs per second
	  =============	========================================

	From the above, the builtin linear model determines the base
	costs of a sequential and random IO and the cost coefficient
	for the IO size.  While simple, this model can cover most
	common device classes acceptably.

	The IO cost model isn't expected to be accurate in absolute
	sense and is scaled to the device behavior dynamically.

  io.weight
	A read-write flat-keyed file which exists on non-root cgroups.
	The default is "default 100".
+10 −0
Original line number Diff line number Diff line
@@ -135,6 +135,16 @@ config BLK_CGROUP_IOLATENCY

	Note, this is an experimental interface and could be changed someday.

config BLK_CGROUP_IOCOST
	bool "Enable support for cost model based cgroup IO controller"
	depends on BLK_CGROUP=y
	select BLK_RQ_ALLOC_TIME
	---help---
	Enabling this option enables the .weight interface for cost
	model based proportional IO control.  The IO controller
	distributes IO capacity between different groups based on
	their share of the overall weight distribution.

config BLK_WBT_MQ
	bool "Multiqueue writeback throttling"
	default y
+1 −0
Original line number Diff line number Diff line
@@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_BSGLIB) += bsg-lib.o
obj-$(CONFIG_BLK_CGROUP)	+= blk-cgroup.o
obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
obj-$(CONFIG_BLK_CGROUP_IOLATENCY)	+= blk-iolatency.o
obj-$(CONFIG_BLK_CGROUP_IOCOST)	+= blk-iocost.o
obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
obj-$(CONFIG_MQ_IOSCHED_KYBER)	+= kyber-iosched.o
bfq-y				:= bfq-iosched.o bfq-wf2q.o bfq-cgroup.o

block/blk-iocost.c

0 → 100644
+2371 −0

File added.

Preview size limit exceeded, changes collapsed.

+3 −0
Original line number Diff line number Diff line
@@ -15,6 +15,7 @@ struct blk_mq_debugfs_attr;
enum rq_qos_id {
	RQ_QOS_WBT,
	RQ_QOS_LATENCY,
	RQ_QOS_COST,
};

struct rq_wait {
@@ -84,6 +85,8 @@ static inline const char *rq_qos_id_to_name(enum rq_qos_id id)
		return "wbt";
	case RQ_QOS_LATENCY:
		return "latency";
	case RQ_QOS_COST:
		return "cost";
	}
	return "unknown";
}
Loading