
Commit 6597ac8a authored by Linus Torvalds
Pull device mapper updates from Mike Snitzer:

 - DM core cleanups:

     * blk-mq request-based DM no longer uses any mempools now that
       partial completions are no longer handled as part of cloned
       requests

 - DM raid cleanups and support for MD raid0

 - DM cache core advances and a new stochastic-multi-queue (smq) cache
   replacement policy

     * smq is the new default dm-cache policy

 - DM thinp cleanups and much more efficient large discard support

 - DM statistics support for request-based DM and nanosecond resolution
   timestamps

 - Fixes to DM stripe, DM log-writes, DM raid1 and DM crypt

* tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (39 commits)
  dm stats: add support for request-based DM devices
  dm stats: collect and report histogram of IO latencies
  dm stats: support precise timestamps
  dm stats: fix divide by zero if 'number_of_areas' arg is zero
  dm cache: switch the "default" cache replacement policy from mq to smq
  dm space map metadata: fix occasional leak of a metadata block on resize
  dm thin metadata: fix a race when entering fail mode
  dm thin: fail messages with EOPNOTSUPP when pool cannot handle messages
  dm thin: range discard support
  dm thin metadata: add dm_thin_remove_range()
  dm thin metadata: add dm_thin_find_mapped_range()
  dm btree: add dm_btree_remove_leaves()
  dm stats: Use kvfree() in dm_kvfree()
  dm cache: age and write back cache entries even without active IO
  dm cache: prefix all DMERR and DMINFO messages with cache device name
  dm cache: add fail io mode and needs_check flag
  dm cache: wake the worker thread every time we free a migration object
  dm cache: add stochastic-multi-queue (smq) policy
  dm cache: boost promotion of blocks that will be overwritten
  dm cache: defer whole cells
  ...
parents e4bc13ad e262f347
Documentation/device-mapper/cache-policies.txt +64 −3
@@ -25,10 +25,10 @@ trying to see when the io scheduler has let the ios run.
Overview of supplied cache replacement policies
===============================================

-multiqueue
-----------
+multiqueue (mq)
+---------------

-This policy is the default.
+This policy has been deprecated in favor of the smq policy (see below).

The multiqueue policy has three sets of 16 queues: one set for entries
waiting for the cache and another two for those in the cache (a set for
@@ -73,6 +73,67 @@ If you're trying to quickly warm a new cache device you may wish to
reduce these to encourage promotion.  Remember to switch them back to
their defaults after the cache fills though.

Stochastic multiqueue (smq)
---------------------------

This policy is the default.

The stochastic multi-queue (smq) policy addresses some of the problems
with the multiqueue (mq) policy.

The smq policy (vs mq) offers the promise of less memory utilization,
improved performance and increased adaptability in the face of changing
workloads.  SMQ also does not have any cumbersome tuning knobs.

Users may switch from "mq" to "smq" simply by appropriately reloading a
DM table that is using the cache target.  Doing so will cause all of the
mq policy's hints to be dropped.  Also, performance of the cache may
degrade slightly until smq recalculates the origin device's hotspots
that should be cached.
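
For example (an illustrative sequence, not taken from this commit; the
device name 'my-cache' and all table values are hypothetical), a cache
device can be switched to smq with no policy arguments by loading a new
table and resuming:

    dmsetup suspend my-cache
    dmsetup load my-cache --table '0 41943040 cache /dev/mapper/meta \
        /dev/mapper/ssd /dev/mapper/origin 512 1 writeback smq 0'
    dmsetup resume my-cache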

Memory usage:
The mq policy uses a lot of memory; 88 bytes per cache block on a
64-bit machine.

SMQ uses 28-bit indexes to implement its data structures rather than
pointers.  It avoids storing an explicit hit count for each block.  It
has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of
the entries (each hotspot block covers a larger area than a single
cache block).

All of this means smq uses ~25 bytes per cache block.  Still a lot of
memory, but a substantial improvement nonetheless.
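
As a rough illustration of the indexing trick (a hypothetical sketch,
not the actual dm-cache-policy-smq structures; 'entry_pool', 'to_entry'
and 'to_index' are invented names), entries can live in one
preallocated array and link to each other through packed 28-bit
indexes instead of 64-bit pointers:

    #include <stdint.h>
    #include <stddef.h>

    #define NULL_INDEX ((1u << 28) - 1)    /* sentinel: "no entry" */

    /*
     * Links are 28-bit array indexes packed into 32-bit words, not
     * 8-byte pointers, so a doubly linked node costs 8 bytes instead
     * of 16.  The pool can hold at most (1 << 28) - 1 entries because
     * one index value is reserved as the NULL sentinel.
     */
    struct entry {
            uint32_t next : 28;
            uint32_t level : 4;     /* multiqueue level of this entry */
            uint32_t prev : 28;
            uint32_t dirty : 1;
            uint32_t unused : 3;
    };

    static struct entry *entry_pool;   /* one allocation at load time */

    static inline struct entry *to_entry(uint32_t i)
    {
            return i == NULL_INDEX ? NULL : entry_pool + i;
    }

    static inline uint32_t to_index(const struct entry *e)
    {
            return e ? (uint32_t)(e - entry_pool) : NULL_INDEX;
    }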

Level balancing:
MQ places entries in different levels of the multiqueue structures
based on their hit count (~ln(hit count)).  This means the bottom
levels generally have the most entries, and the top ones have very
few.  Having unbalanced levels like this reduces the efficacy of the
multiqueue.

SMQ does not maintain a hit count; instead it swaps hit entries with
the least recently used entry from the level above.  The overall
ordering is a side effect of this stochastic process.  With this
scheme we can decide how many entries occupy each multiqueue level,
resulting in better promotion/demotion decisions.
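
A toy model of that swap (hypothetical code, much simplified from the
real policy): each level holds a fixed number of slots ordered from
most to least recently used, and a hit trades places with the LRU slot
one level up rather than incrementing a counter:

    #include <stdio.h>

    enum { NR_LEVELS = 4, PER_LEVEL = 4 };

    /*
     * levels[l][0] is the MRU slot, levels[l][PER_LEVEL - 1] the LRU
     * slot; per-level populations never change, so levels stay balanced.
     */
    static int levels[NR_LEVELS][PER_LEVEL];

    static void hit(int lvl, int pos)
    {
            int tmp;

            if (lvl == NR_LEVELS - 1)
                    return;            /* already in the top level */

            /* swap the hit block with the LRU block one level up */
            tmp = levels[lvl + 1][PER_LEVEL - 1];
            levels[lvl + 1][PER_LEVEL - 1] = levels[lvl][pos];
            levels[lvl][pos] = tmp;
    }

    int main(void)
    {
            int l, p;

            for (l = 0; l < NR_LEVELS; l++)     /* block ids 0..15 */
                    for (p = 0; p < PER_LEVEL; p++)
                            levels[l][p] = l * PER_LEVEL + p;

            hit(0, 2);  /* block 2 is hit: it moves up, block 7 drops */
            printf("level 1 LRU slot now holds block %d\n",
                   levels[1][PER_LEVEL - 1]);
            return 0;
    }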

Adaptability:
The MQ policy maintains a hit count for each cache block.  For a
different block to get promoted to the cache, its hit count has to
exceed the lowest count currently in the cache.  This means it can take a
long time for the cache to adapt between varying IO patterns.
Periodically degrading the hit counts could help with this, but I
haven't found a nice general solution.

SMQ doesn't maintain hit counts, so a lot of this problem just goes
away.  In addition it tracks performance of the hotspot queue, which
is used to decide which blocks to promote.  If the hotspot queue is
performing badly then it starts moving entries more quickly between
levels.  This lets it adapt to new IO patterns very quickly.

Performance:
Testing SMQ shows substantially better performance than MQ.

cleaner
-------

Documentation/device-mapper/cache.txt +7 −2
@@ -221,6 +221,7 @@ Status
<#read hits> <#read misses> <#write hits> <#write misses>
<#demotions> <#promotions> <#dirty> <#features> <features>*
<#core args> <core args>* <policy name> <#policy args> <policy args>*
<cache metadata mode>

metadata block size	 : Fixed block size for each metadata block in
			     sectors
@@ -251,8 +252,12 @@ core args : Key/value pairs for tuning the core
			     e.g. migration_threshold
policy name		 : Name of the policy
#policy args		 : Number of policy arguments to follow (must be even)
-policy args		 : Key/value pairs
-			     e.g. sequential_threshold
+policy args		 : Key/value pairs e.g. sequential_threshold
cache metadata mode      : ro if read-only, rw if read-write
	In serious cases where even a read-only mode is deemed unsafe
	no further I/O will be permitted and the status will just
	contain the string 'Fail'.  The userspace recovery tools
	should then be used.
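
For instance (a constructed illustration following the field order
above, not real kernel output), the tail of a status line for a
healthy writethrough cache running the smq policy might read:

    ... 1 writethrough 2 migration_threshold 2048 smq 0 rw

where the final 'rw' is the new cache metadata mode field.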

Messages
--------
Documentation/device-mapper/dm-raid.txt +2 −0
@@ -224,3 +224,5 @@ Version History
	New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt.
1.5.1   Add ability to restore transiently failed devices on resume.
1.5.2   'mismatch_cnt' is zero unless [last_]sync_action is "check".
1.6.0   Add discard support (and devices_handle_discard_safely module param).
1.7.0   Add support for MD RAID0 mappings.
Documentation/device-mapper/statistics.txt +37 −4
@@ -13,9 +13,14 @@ the range specified.
The I/O statistics counters for each step-sized area of a region are
in the same format as /sys/block/*/stat or /proc/diskstats (see:
Documentation/iostats.txt).  But two extra counters (12 and 13) are
-provided: total time spent reading and writing in milliseconds.  All
-these counters may be accessed by sending the @stats_print message to
-the appropriate DM device via dmsetup.
+provided: total time spent reading and writing.  When the histogram
+argument is used, a 14th parameter is reported that represents the
+histogram of latencies.  All these counters may be accessed by sending
+the @stats_print message to the appropriate DM device via dmsetup.
+
+The reported times are in milliseconds and the granularity depends on
+the kernel ticks.  When the option precise_timestamps is used, the
+reported times are in nanoseconds.

Each region has a corresponding unique identifier, which we call a
region_id, that is assigned when the region is created.	 The region_id
@@ -33,7 +38,9 @@ memory is used by reading
Messages
========

-    @stats_create <range> <step> [<program_id> [<aux_data>]]
+    @stats_create <range> <step>
+		[<number_of_optional_arguments> <optional_arguments>...]
+		[<program_id> [<aux_data>]]

	Create a new region and return the region_id.

@@ -48,6 +55,29 @@ Messages
	  "/<number_of_areas>" - the range is subdivided into the specified
				 number of areas.

	<number_of_optional_arguments>
	  The number of optional arguments

	<optional_arguments>
	  The following optional arguments are supported
	  precise_timestamps - use precise timer with nanosecond resolution
		instead of the "jiffies" variable.  When this argument is
		used, the resulting times are in nanoseconds instead of
		milliseconds.  Precise timestamps are a little bit slower
		to obtain than jiffies-based timestamps.
	  histogram:n1,n2,n3,n4,... - collect histogram of latencies.  The
		numbers n1, n2, etc are times that represent the boundaries
		of the histogram.  If precise_timestamps is not used, the
		times are in milliseconds, otherwise they are in
		nanoseconds.  For each range, the kernel will report the
		number of requests that completed within this range. For
		example, if we use "histogram:10,20,30", the kernel will
		report four numbers a:b:c:d. a is the number of requests
		that took 0-10 ms to complete, b is the number of requests
		that took 10-20 ms to complete, c is the number of requests
		that took 20-30 ms to complete and d is the number of
		requests that took more than 30 ms to complete.

	<program_id>
	  An optional parameter.  A name that uniquely identifies
	  the userspace owner of the range.  This groups ranges together
@@ -55,6 +85,9 @@ Messages
	  created and ignore those created by others.
	  The kernel returns this string back in the output of
	  @stats_list message, but it doesn't use it for anything else.
	  If the number of optional arguments is omitted, the program_id
	  must not be a number; otherwise it would be interpreted as the
	  number of optional arguments.

	<aux_data>
	  An optional parameter.  A word that provides auxiliary data
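
Putting the new @stats_create arguments together (an illustrative
invocation; the device name 'statdev' and the boundary values are
hypothetical), a region covering the whole device, split into 100
areas, with nanosecond timestamps and a four-bucket latency histogram,
could be created with:

    dmsetup message statdev 0 "@stats_create - /100 2 precise_timestamps histogram:5000000,10000000,20000000"

Because precise_timestamps is given, the histogram boundaries above
are in nanoseconds (5 ms, 10 ms and 20 ms).  The counters can then be
read back by sending "@stats_print <region_id>" through the same
dmsetup message interface.
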
drivers/md/Kconfig +12 −0
@@ -304,6 +304,18 @@ config DM_CACHE_MQ
         This is meant to be a general purpose policy.  It prioritises
         reads over writes.

config DM_CACHE_SMQ
       tristate "Stochastic MQ Cache Policy (EXPERIMENTAL)"
       depends on DM_CACHE
       default y
       ---help---
         A cache policy that uses a multiqueue ordered by recent hits
         to select which blocks should be promoted and demoted.
         This is meant to be a general purpose policy.  It prioritises
         reads over writes.  This SMQ policy (vs MQ) offers the promise
         of less memory utilization, improved performance and increased
         adaptability in the face of changing workloads.
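
For instance (assuming the standard Kconfig symbol-to-config naming;
the module name is an inference, since the Makefile is not shown in
this excerpt), building the policy as a module would appear in .config
as:

    CONFIG_DM_CACHE_SMQ=m

producing a dm-cache-smq module alongside the existing mq one, so both
policies can coexist and be selected per target via the DM table.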

config DM_CACHE_CLEANER
       tristate "Cleaner Cache Policy (EXPERIMENTAL)"
       depends on DM_CACHE