Commit 7426d628 authored by Linus Torvalds
Pull device-mapper updates from Mike Snitzer:
 "Add the ability to collect I/O statistics on user-defined regions of a
  device-mapper device.  This dm-stats code required the reintroduction
  of a div64_u64_rem() helper, but as a separate method that doesn't
  slow down div64_u64() -- especially on 32-bit systems.

  Allow the error target to replace request-based DM devices (e.g.
  multipath) in addition to bio-based DM devices.

  Various other small code fixes and improvements to thin-provisioning,
  DM cache and the DM ioctl interface"

* tag 'dm-3.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm stripe: silence a couple sparse warnings
  dm: add statistics support
  dm thin: always return -ENOSPC if no_free_space is set
  dm ioctl: cleanup error handling in table_load
  dm ioctl: increase granularity of type_lock when loading table
  dm ioctl: prevent rename to empty name or uuid
  dm thin: set pool read-only if breaking_sharing fails block allocation
  dm thin: prefix pool error messages with pool device name
  dm: allow error target to replace bio-based and request-based targets
  math64: New separate div64_u64_rem helper
  dm space map: optimise sm_ll_dec and sm_ll_inc
  dm btree: prefetch child nodes when walking tree for a dm_btree_del
  dm btree: use pop_frame in dm_btree_del to cleanup code
  dm cache: eliminate holes in cache structure
  dm cache: fix stacking of geometry limits
  dm thin: fix stacking of geometry limits
  dm thin: add data block size limits to Documentation
  dm cache: add data block size limits to code and Documentation
  dm cache: document metadata device is exclussive to a cache
  dm: stop using WQ_NON_REENTRANT
parents 4d7696f1 7fff5e8f
+4 −2
@@ -50,14 +50,16 @@ other parameters detailed later):
   which are dirty, and extra hints for use by the policy object.
   This information could be put on the cache device, but having it
   separate allows the volume manager to configure it differently,
-  e.g. as a mirror for extra robustness.
+  e.g. as a mirror for extra robustness.  This metadata device may only
+  be used by a single cache device.

Fixed block size
----------------

The origin is divided up into blocks of a fixed size.  This block size
is configurable when you first create the cache.  Typically we've been
-using block sizes of 256k - 1024k.
+using block sizes of 256KB - 1024KB.  The block size must be between 64
+(32KB) and 2097152 (1GB) and a multiple of 64 (32KB).

Having a fixed block size simplifies the target a lot.  But it is
something of a compromise.  For instance, a small part of a block may be
+186 −0
DM statistics
=============

Device Mapper supports the collection of I/O statistics on user-defined
regions of a DM device.  If no regions are defined, no statistics are
collected, so there is no performance impact.  Only bio-based DM
devices are currently supported.

Each user-defined region specifies a starting sector, length and step.
Individual statistics will be collected for each step-sized area within
the range specified.

The I/O statistics counters for each step-sized area of a region are
in the same format as /sys/block/*/stat or /proc/diskstats (see:
Documentation/iostats.txt).  In addition, two extra counters (12 and 13)
are provided: the total time spent reading and the total time spent
writing, both in milliseconds.  All these counters may be accessed by
sending the @stats_print message to the appropriate DM device via
dmsetup.

Each region has a corresponding unique identifier, which we call a
region_id, that is assigned when the region is created.  The region_id
must be supplied when querying statistics about the region, deleting the
region, etc.  Unique region_ids enable multiple userspace programs to
request and process statistics for the same DM device without stepping
on each other's data.

The creation of DM statistics will allocate memory via kmalloc or
fall back to using vmalloc space.  At most, 1/4 of the overall system
memory may be allocated by DM statistics.  The admin can see how much
memory is used by reading
/sys/module/dm_mod/parameters/stats_current_allocated_bytes
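
A monitoring tool could read that parameter with ordinary file I/O.  The
following is a minimal userspace sketch, not part of this commit; the
helper name read_u64_param is hypothetical, and it assumes the file holds
a single unsigned decimal number.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a single unsigned decimal value from a sysfs-style file.
 * Returns 0 on success and -1 if the file cannot be opened or parsed.
 * (Hypothetical helper; assumes the file contains one decimal number.)
 */
int read_u64_param(const char *path, unsigned long long *value)
{
	FILE *f;
	int ok;

	assert(path != NULL && value != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

A caller would pass the path quoted above and treat a -1 return as "dm_mod
is not loaded or the parameter is unavailable".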

Messages
========

    @stats_create <range> <step> [<program_id> [<aux_data>]]

	Create a new region and return the region_id.

	<range>
	  "-" - whole device
	  "<start_sector>+<length>" - a range of <length> 512-byte sectors
				      starting with <start_sector>.

	<step>
	  "<area_size>" - the range is subdivided into areas each containing
			  <area_size> sectors.
	  "/<number_of_areas>" - the range is subdivided into the specified
				 number of areas.

	<program_id>
	  An optional parameter.  A name that uniquely identifies
	  the userspace owner of the range.  This groups ranges together
	  so that userspace programs can identify the ranges they
	  created and ignore those created by others.
	  The kernel returns this string back in the output of
	  @stats_list message, but it doesn't use it for anything else.

	<aux_data>
	  An optional parameter.  A word that provides auxiliary data
	  that is useful to the client program that created the range.
	  The kernel returns this string back in the output of
	  @stats_list message, but it doesn't use this value for anything.
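
The two <step> forms can be related with a little arithmetic.  The sketch
below is illustrative userspace C, not code from this commit: the struct
and function names are hypothetical, and rounding the area size up when
the range does not divide evenly is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of how a region is split into areas. */
struct area_layout {
	uint64_t area_size;	/* sectors per area (the last area may be shorter) */
	uint64_t n_areas;	/* number of areas covering the range */
};

/* Divide and round up; b must be non-zero. */
static uint64_t div_round_up(uint64_t a, uint64_t b)
{
	return a / b + (a % b != 0);
}

/* <step> given as "<area_size>": areas of a fixed size. */
struct area_layout layout_from_area_size(uint64_t length, uint64_t area_size)
{
	struct area_layout l;

	assert(length > 0 && area_size > 0);
	l.area_size = area_size;
	l.n_areas = div_round_up(length, area_size);
	return l;
}

/* <step> given as "/<number_of_areas>": size derived from the count. */
struct area_layout layout_from_area_count(uint64_t length, uint64_t count)
{
	uint64_t size;

	assert(length > 0 && count > 0);
	size = div_round_up(length, count);	/* assumed rounding direction */
	return layout_from_area_size(length, size);
}
```

Under these assumptions a 1000-sector range with step "/3" yields three
areas of up to 334 sectors, and a requested count larger than the range
length yields one area per sector.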

    @stats_delete <region_id>

	Delete the region with the specified id.

	<region_id>
	  region_id returned from @stats_create

    @stats_clear <region_id>

	Clear all the counters except the in-flight I/O counters.

	<region_id>
	  region_id returned from @stats_create

    @stats_list [<program_id>]

	List all regions registered with @stats_create.

	<program_id>
	  An optional parameter.
	  If this parameter is specified, only matching regions
	  are returned.
	  If it is not specified, all regions are returned.

	Output format:
	  <region_id>: <start_sector>+<length> <step> <program_id> <aux_data>

    @stats_print <region_id> [<starting_line> <number_of_lines>]

	Print counters for each step-sized area of a region.

	<region_id>
	  region_id returned from @stats_create

	<starting_line>
	  The index of the starting line in the output.
	  If omitted, all lines are returned.

	<number_of_lines>
	  The number of lines to include in the output.
	  If omitted, all lines are returned.

	Output format for each step-sized area of a region:

	  <start_sector>+<length> counters

	  The first 11 counters have the same meaning as
	  /sys/block/*/stat or /proc/diskstats.

	  Please refer to Documentation/iostats.txt for details.

	  1. the number of reads completed
	  2. the number of reads merged
	  3. the number of sectors read
	  4. the number of milliseconds spent reading
	  5. the number of writes completed
	  6. the number of writes merged
	  7. the number of sectors written
	  8. the number of milliseconds spent writing
	  9. the number of I/Os currently in progress
	  10. the number of milliseconds spent doing I/Os
	  11. the weighted number of milliseconds spent doing I/Os

	  Additional counters:
	  12. the total time spent reading in milliseconds
	  13. the total time spent writing in milliseconds
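
A client consuming this output could parse each line as follows.  This is
a minimal userspace sketch under the assumption that every line carries
exactly the 13 counters listed above, separated by whitespace; the names
stats_area, parse_stats_line and DM_STATS_NR_COUNTERS are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define DM_STATS_NR_COUNTERS 13	/* 11 diskstats-style counters + 2 extra */

struct stats_area {
	unsigned long long start;	/* first sector of the area */
	unsigned long long length;	/* area length in sectors */
	unsigned long long counter[DM_STATS_NR_COUNTERS];
};

/*
 * Parse one "<start_sector>+<length> c1 ... c13" line.
 * Returns 0 on success, -1 if the line does not match that shape.
 */
int parse_stats_line(const char *line, struct stats_area *out)
{
	int consumed = 0;
	const char *p;
	char *end;
	int i;

	assert(line != NULL && out != NULL);

	if (sscanf(line, "%llu+%llu%n", &out->start, &out->length, &consumed) != 2)
		return -1;

	p = line + consumed;
	for (i = 0; i < DM_STATS_NR_COUNTERS; i++) {
		out->counter[i] = strtoull(p, &end, 10);
		if (end == p)
			return -1;	/* fewer counters than expected */
		p = end;
	}
	return 0;
}
```

Counter N from the list above ends up in counter[N - 1], so the in-flight
I/O count (counter 9) is counter[8].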

    @stats_print_clear <region_id> [<starting_line> <number_of_lines>]

	Atomically print and then clear all the counters except the
	in-flight I/O counters.  Useful when the client consuming the
	statistics does not want to lose any statistics (those updated
	between printing and clearing).

	<region_id>
	  region_id returned from @stats_create

	<starting_line>
	  The index of the starting line in the output.
	  If omitted, all lines are printed and then cleared.

	<number_of_lines>
	  The number of lines to process.
	  If omitted, all lines are printed and then cleared.

    @stats_set_aux <region_id> <aux_data>

	Store auxiliary data aux_data for the specified region.

	<region_id>
	  region_id returned from @stats_create

	<aux_data>
	  The string that identifies data which is useful to the client
	  program that created the range.  The kernel returns this
	  string back in the output of @stats_list message, but it
	  doesn't use this value for anything.

Examples
========

Subdivide the DM device 'vol' into 100 pieces and start collecting
statistics on them:

  dmsetup message vol 0 @stats_create - /100

Set the auxiliary data string to "foo bar baz" (the escape for each
space must also be escaped, otherwise the shell will consume them):

  dmsetup message vol 0 @stats_set_aux 0 foo\\ bar\\ baz
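
A program that builds the message itself, rather than going through a
shell, only needs the single level of escaping that reaches the kernel.
The sketch below is a hypothetical helper, assuming (as the example above
implies) that a backslash makes the following character literal in a DM
message argument.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy src into dst, putting a backslash before every space and
 * backslash.  Returns the escaped length, or -1 if dst is too small.
 * (Hypothetical helper; the escaping rule is an assumption.)
 */
int escape_dm_arg(const char *src, char *dst, size_t dst_size)
{
	size_t n = 0;

	assert(src != NULL && dst != NULL);

	for (; *src; src++) {
		int special = (*src == ' ' || *src == '\\');

		if (n + (special ? 2 : 1) >= dst_size)
			return -1;	/* keep room for the terminating NUL */
		if (special)
			dst[n++] = '\\';
		dst[n++] = *src;
	}
	if (n >= dst_size)
		return -1;
	dst[n] = '\0';
	return (int)n;
}
```

Escaping "foo bar baz" this way gives the string that dmsetup receives
after the shell has processed the command line shown above.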

List the statistics:

  dmsetup message vol 0 @stats_list

Print the statistics:

  dmsetup message vol 0 @stats_print 0

Delete the statistics:

  dmsetup message vol 0 @stats_delete 0
+8 −7
@@ -99,13 +99,14 @@ Using an existing pool device
		 $data_block_size $low_water_mark"

$data_block_size gives the smallest unit of disk space that can be
-allocated at a time expressed in units of 512-byte sectors.  People
-primarily interested in thin provisioning may want to use a value such
-as 1024 (512KB).  People doing lots of snapshotting may want a smaller value
-such as 128 (64KB).  If you are not zeroing newly-allocated data,
-a larger $data_block_size in the region of 256000 (128MB) is suggested.
-$data_block_size must be the same for the lifetime of the
-metadata device.
+allocated at a time expressed in units of 512-byte sectors.
+$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a
+multiple of 128 (64KB).  $data_block_size cannot be changed after the
+thin-pool is created.  People primarily interested in thin provisioning
+may want to use a value such as 1024 (512KB).  People doing lots of
+snapshotting may want a smaller value such as 128 (64KB).  If you are
+not zeroing newly-allocated data, a larger $data_block_size in the
+region of 256000 (128MB) is suggested.

$low_water_mark is expressed in blocks of size $data_block_size.  If
free space on the data device drops below this level then a dm event
+1 −1
@@ -3,7 +3,7 @@
#

dm-mod-y	+= dm.o dm-table.o dm-target.o dm-linear.o dm-stripe.o \
-		   dm-ioctl.o dm-io.o dm-kcopyd.o dm-sysfs.o
+		   dm-ioctl.o dm-io.o dm-kcopyd.o dm-sysfs.o dm-stats.o
dm-multipath-y	+= dm-path-selector.o dm-mpath.o
dm-snapshot-y	+= dm-snap.o dm-exception-store.o dm-snap-transient.o \
		    dm-snap-persistent.o
+35 −24
@@ -67,9 +67,11 @@ static void free_bitset(unsigned long *bits)
#define MIGRATION_COUNT_WINDOW 10

/*
- * The block size of the device holding cache data must be >= 32KB
+ * The block size of the device holding cache data must be
+ * between 32KB and 1GB.
 */
#define DATA_DEV_BLOCK_SIZE_MIN_SECTORS (32 * 1024 >> SECTOR_SHIFT)
+#define DATA_DEV_BLOCK_SIZE_MAX_SECTORS (1024 * 1024 * 1024 >> SECTOR_SHIFT)

/*
 * FIXME: the cache is read/write for the time being.
@@ -101,6 +103,8 @@ struct cache {
	struct dm_target *ti;
	struct dm_target_callbacks callbacks;

+	struct dm_cache_metadata *cmd;
+
	/*
	 * Metadata is written to this device.
	 */
@@ -116,11 +120,6 @@ struct cache {
	 */
	struct dm_dev *cache_dev;

-	/*
-	 * Cache features such as write-through.
-	 */
-	struct cache_features features;
-
	/*
	 * Size of the origin device in _complete_ blocks and native sectors.
	 */
@@ -138,8 +137,6 @@ struct cache {
	uint32_t sectors_per_block;
	int sectors_per_block_shift;

-	struct dm_cache_metadata *cmd;
-
	spinlock_t lock;
	struct bio_list deferred_bios;
	struct bio_list deferred_flush_bios;
@@ -148,8 +145,8 @@ struct cache {
	struct list_head completed_migrations;
	struct list_head need_commit_migrations;
	sector_t migration_threshold;
-	atomic_t nr_migrations;
	wait_queue_head_t migration_wait;
+	atomic_t nr_migrations;

	/*
	 * cache_size entries, dirty if set
@@ -160,9 +157,16 @@ struct cache {
	/*
	 * origin_blocks entries, discarded if set.
	 */
-	uint32_t discard_block_size; /* a power of 2 times sectors per block */
	dm_dblock_t discard_nr_blocks;
	unsigned long *discard_bitset;
+	uint32_t discard_block_size; /* a power of 2 times sectors per block */
+
+	/*
+	 * Rather than reconstructing the table line for the status we just
+	 * save it and regurgitate.
+	 */
+	unsigned nr_ctr_args;
+	const char **ctr_args;

	struct dm_kcopyd_client *copier;
	struct workqueue_struct *wq;
@@ -187,14 +191,12 @@ struct cache {
	bool loaded_mappings:1;
	bool loaded_discards:1;

-	struct cache_stats stats;
-
	/*
-	 * Rather than reconstructing the table line for the status we just
-	 * save it and regurgitate.
+	 * Cache features such as write-through.
	 */
-	unsigned nr_ctr_args;
-	const char **ctr_args;
+	struct cache_features features;
+
+	struct cache_stats stats;
};

struct per_bio_data {
@@ -1687,24 +1689,25 @@ static int parse_origin_dev(struct cache_args *ca, struct dm_arg_set *as,
static int parse_block_size(struct cache_args *ca, struct dm_arg_set *as,
			    char **error)
{
-	unsigned long tmp;
+	unsigned long block_size;

	if (!at_least_one_arg(as, error))
		return -EINVAL;

-	if (kstrtoul(dm_shift_arg(as), 10, &tmp) || !tmp ||
-	    tmp < DATA_DEV_BLOCK_SIZE_MIN_SECTORS ||
-	    tmp & (DATA_DEV_BLOCK_SIZE_MIN_SECTORS - 1)) {
+	if (kstrtoul(dm_shift_arg(as), 10, &block_size) || !block_size ||
+	    block_size < DATA_DEV_BLOCK_SIZE_MIN_SECTORS ||
+	    block_size > DATA_DEV_BLOCK_SIZE_MAX_SECTORS ||
+	    block_size & (DATA_DEV_BLOCK_SIZE_MIN_SECTORS - 1)) {
		*error = "Invalid data block size";
		return -EINVAL;
	}

-	if (tmp > ca->cache_sectors) {
+	if (block_size > ca->cache_sectors) {
		*error = "Data block size is larger than the cache device";
		return -EINVAL;
	}

-	ca->block_size = tmp;
+	ca->block_size = block_size;

	return 0;
}
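
The checks in parse_block_size() can be tried outside the kernel.  The
following is a standalone userspace sketch of the same rules, not the
kernel function itself: cache_block_size_ok and the shortened macro names
are hypothetical, and argument parsing and error strings are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors */
#define BLOCK_SIZE_MIN_SECTORS (32UL * 1024 >> SECTOR_SHIFT)		/* 64 */
#define BLOCK_SIZE_MAX_SECTORS (1024UL * 1024 * 1024 >> SECTOR_SHIFT)	/* 2097152 */

/*
 * Return true if a block size (in sectors) is within 32KB..1GB, is a
 * multiple of 32KB, and does not exceed the cache device size.
 */
bool cache_block_size_ok(unsigned long sectors, unsigned long cache_sectors)
{
	assert(cache_sectors > 0);

	if (sectors < BLOCK_SIZE_MIN_SECTORS || sectors > BLOCK_SIZE_MAX_SECTORS)
		return false;
	/* The minimum is a power of two, so a mask tests "multiple of 64". */
	if (sectors & (BLOCK_SIZE_MIN_SECTORS - 1))
		return false;
	return sectors <= cache_sectors;
}
```

The mask test only works because the minimum (64 sectors) is a power of
two; a non-power-of-two granularity would need a modulo instead.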
@@ -2609,9 +2612,17 @@ static void set_discard_limits(struct cache *cache, struct queue_limits *limits)
static void cache_io_hints(struct dm_target *ti, struct queue_limits *limits)
{
	struct cache *cache = ti->private;
+	uint64_t io_opt_sectors = limits->io_opt >> SECTOR_SHIFT;

+	/*
+	 * If the system-determined stacked limits are compatible with the
+	 * cache's blocksize (io_opt is a factor) do not override them.
+	 */
+	if (io_opt_sectors < cache->sectors_per_block ||
+	    do_div(io_opt_sectors, cache->sectors_per_block)) {
		blk_limits_io_min(limits, 0);
		blk_limits_io_opt(limits, cache->sectors_per_block << SECTOR_SHIFT);
+	}
	set_discard_limits(cache, limits);
}
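
The condition in cache_io_hints() keeps the stacked io_opt only when it is
a whole multiple of the cache block size.  The sketch below restates that
test in userspace C for illustration; io_opt_needs_override is a
hypothetical name, and a plain modulo stands in for the kernel's do_div().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors */

/*
 * Return true if the stacked io_opt (in bytes) is incompatible with the
 * cache block size (in sectors) and should be replaced by it.
 */
bool io_opt_needs_override(uint32_t io_opt_bytes, uint32_t sectors_per_block)
{
	uint64_t io_opt_sectors = io_opt_bytes >> SECTOR_SHIFT;

	assert(sectors_per_block > 0);

	if (io_opt_sectors < sectors_per_block)
		return true;	/* unset (0) or smaller than one block */
	return io_opt_sectors % sectors_per_block != 0;
}
```

With 128-sector (64KB) blocks, an io_opt of 128KB is kept, while 96KB, 32KB
or an unset value of 0 would all be overridden.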
