
Commit a7e546f1 authored by Linus Torvalds

Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull block-related fixes from Jens Axboe:

 - Improvements to the buffered and direct write IO plugging from
   Fengguang.

 - Abstract out the mapping of a bio in a request, and use that to
   provide a blk_bio_map_sg() helper.  Useful for mapping just a bio
   instead of a full request.

 - Regression fix from Hugh, fixing up a patch that went into the
   previous release cycle (and marked stable, too) attempting to prevent
   a loop in __getblk_slow().

 - Updates to discard requests, fixing up the sizing and how we align
   them.  Also a change to disallow merging of discard requests, since
   that doesn't really work properly yet.

 - A few drbd fixes.

 - Documentation updates.

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: replace __getblk_slow misfix by grow_dev_page fix
  drbd: Write all pages of the bitmap after an online resize
  drbd: Finish requests that completed while IO was frozen
  drbd: fix drbd wire compatibility for empty flushes
  Documentation: update tunable options in block/cfq-iosched.txt
  Documentation: update tunable options in block/queue-sysfs.txt
  Documentation: update missing index files in block/00-INDEX
  block: move down direct IO plugging
  block: remove plugging at buffered write time
  block: disable discard request merge temporarily
  bio: Fix potential memory leak in bio_find_or_create_slab()
  block: Don't use static to define "void *p" in show_partition_start()
  block: Add blk_bio_map_sg() helper
  block: Introduce __blk_segment_map_sg() helper
  fs/block-dev.c:fix performance regression in O_DIRECT writes to md block devices
  block: split discard into aligned requests
  block: reorganize rounding of max_discard_sectors
parents da31ce72 676ce6d5
Documentation/block/00-INDEX +8 −2
@@ -3,15 +3,21 @@
 biodoc.txt
 	- Notes on the Generic Block Layer Rewrite in Linux 2.5
 capability.txt
-	- Generic Block Device Capability (/sys/block/<disk>/capability)
+	- Generic Block Device Capability (/sys/block/<device>/capability)
+cfq-iosched.txt
+	- CFQ IO scheduler tunables
+data-integrity.txt
+	- Block data integrity
 deadline-iosched.txt
 	- Deadline IO scheduler tunables
 ioprio.txt
 	- Block io priorities (in CFQ scheduler)
+queue-sysfs.txt
+	- Queue's sysfs entries
 request.txt
 	- The members of struct request (in include/linux/blkdev.h)
 stat.txt
-	- Block layer statistics in /sys/block/<dev>/stat
+	- Block layer statistics in /sys/block/<device>/stat
 switching-sched.txt
 	- Switching I/O schedulers at runtime
 writeback_cache_control.txt
Documentation/block/cfq-iosched.txt +77 −0
CFQ (Complete Fairness Queueing)
===============================

The main aim of the CFQ scheduler is to provide a fair allocation of the disk
I/O bandwidth among all the processes that request I/O operations.

CFQ maintains a per-process queue for processes that issue synchronous
requests. In the case of asynchronous requests, the requests from all
processes are batched together according to the issuing process's I/O
priority.

CFQ ioscheduler tunables
========================

@@ -25,6 +36,72 @@ there are multiple spindles behind single LUN (Host based hardware RAID
controller or for storage arrays), setting slice_idle=0 might end up in better
throughput and acceptable latencies.

back_seek_max
-------------
This specifies, in Kbytes, the maximum "distance" for backward seeking.
The distance is the amount of space between the current head location and
the sectors that lie behind it.

This parameter allows the scheduler to anticipate requests in the "backward"
direction and consider them as being the "next" request if they are within
this distance of the current head location.

back_seek_penalty
-----------------
This parameter is used to compute the cost of backward seeking. If the
backward distance of a request is just 1/back_seek_penalty of its "front"
distance, the seek costs of the two requests are considered equivalent.

The scheduler will then not bias toward one or the other request (otherwise
it would bias toward the front request). The default value of
back_seek_penalty is 2.
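
To make the arithmetic concrete: with the default penalty of 2, a request
100 sectors behind the head is costed like a request 200 sectors ahead of
it. A minimal userspace C sketch of this heuristic (the helper name and
values are illustrative; in the kernel the comparison happens in CFQ's
request-choosing code):

#include <stdio.h>

typedef unsigned long long sector_t;

#define COST_INFINITY (~0ULL)	/* "never pick this request" */

static sector_t seek_cost(sector_t head, sector_t sector,
			  sector_t back_seek_max, unsigned int back_seek_penalty)
{
	if (sector >= head)			/* forward seek: plain distance */
		return sector - head;
	if (head - sector <= back_seek_max)	/* backward seek: penalized */
		return (head - sector) * back_seek_penalty;
	return COST_INFINITY;			/* too far behind the head */
}

int main(void)
{
	/* back_seek_max is in Kbytes in sysfs; assume it has already been
	 * converted to sectors. 100 back vs. 200 front cost the same. */
	printf("back 100: %llu, front 200: %llu\n",
	       seek_cost(1000, 900, 32768, 2),
	       seek_cost(1000, 1200, 32768, 2));
	return 0;
}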

fifo_expire_async
-----------------
This parameter is used to set the timeout of asynchronous requests. The
default value is 248ms.

fifo_expire_sync
----------------
This parameter is used to set the timeout of synchronous requests. The
default value is 124ms. To favor synchronous requests over asynchronous
ones, decrease this value relative to fifo_expire_async.

slice_async
-----------
This parameter is the same as slice_sync but for the asynchronous queue.
The default value is 40ms.

slice_async_rq
--------------
This parameter is used to limit the dispatching of asynchronous requests
to the device request queue during the queue's slice time. The maximum
number of requests that are allowed to be dispatched also depends upon the
IO priority. The default value is 2.

slice_sync
----------
When a queue is selected for execution, its IO requests are only executed
for a certain amount of time (time_slice) before switching to another
queue. This parameter is used to calculate the time slice of the
synchronous queue.

time_slice is computed using the following equation:
time_slice = slice_sync + (slice_sync/5 * (4 - prio)). To increase the
time_slice of the synchronous queue, increase the value of slice_sync. The
default value is 100ms.
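
As a quick check of the formula, a standalone C snippet using the default
slice_sync of 100ms (the loop simply walks the usual 0-7 priority range):

#include <stdio.h>

/* time_slice = slice_sync + (slice_sync/5 * (4 - prio)), as given above */
static int time_slice(int slice_sync, int prio)
{
	return slice_sync + (slice_sync / 5 * (4 - prio));
}

int main(void)
{
	/* With slice_sync = 100ms this prints 180ms for prio 0 down to
	 * 40ms for prio 7: higher priority, longer slice. */
	int prio;

	for (prio = 0; prio <= 7; prio++)
		printf("prio %d: %d ms\n", prio, time_slice(100, prio));
	return 0;
}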

quantum
-------
This specifies the number of requests dispatched to the device queue. During
a queue's time slice, a request will not be dispatched if the number of
requests in the device exceeds this parameter. This parameter is used for
synchronous requests.

For storage with several disks, this setting can limit the parallel
processing of requests. Therefore, increasing the value can improve
performance, although it can also cause the latency of some I/O to increase
because more requests are kept outstanding.

CFQ IOPS Mode for group scheduling
===================================
Basic CFQ design is to provide priority based time slices. Higher priority
Documentation/block/queue-sysfs.txt +64 −0
@@ -9,20 +9,71 @@ These files are the ones found in the /sys/block/xxx/queue/ directory.
Files denoted with a RO postfix are readonly and the RW postfix means
read-write.

add_random (RW)
----------------
This file allows one to turn off the disk entropy contribution. The default
value of this file is '1' (on).

discard_granularity (RO)
-----------------------
This shows the size of the device's internal allocation unit in bytes, if
reported by the device. A value of '0' means the device does not support
the discard functionality.

discard_max_bytes (RO)
----------------------
Devices that support discard functionality may have internal limits on
the number of bytes that can be trimmed or unmapped in a single operation.
The discard_max_bytes parameter is set by the device driver to the maximum
number of bytes that can be discarded in a single operation. Discard
requests issued to the device must not exceed this limit. A discard_max_bytes
value of 0 means that the device does not support discard functionality.

discard_zeroes_data (RO)
------------------------
When read, this file will show whether discarded blocks are zeroed by the
device. If its value is '1' the blocks are zeroed; otherwise they are not.

hw_sector_size (RO)
-------------------
This is the hardware sector size of the device, in bytes.

iostats (RW)
-------------
This file is used to control (on/off) the iostats accounting of the
disk.

logical_block_size (RO)
-----------------------
This is the logical block size of the device, in bytes.

max_hw_sectors_kb (RO)
----------------------
This is the maximum number of kilobytes supported in a single data transfer.

max_integrity_segments (RO)
---------------------------
When read, this file shows the maximum number of integrity segments, as
set by the block layer, that a hardware controller can handle.

max_sectors_kb (RW)
-------------------
This is the maximum number of kilobytes that the block layer will allow
for a filesystem request. Must be smaller than or equal to the maximum
size allowed by the hardware.

max_segments (RO)
-----------------
Maximum number of segments of the device.

max_segment_size (RO)
---------------------
Maximum segment size of the device.

minimum_io_size (RO)
--------------------
This is the smallest preferred io size reported by the device.

nomerges (RW)
-------------
This enables the user to disable the lookup logic involved with IO
@@ -45,11 +96,24 @@ per-block-cgroup request pool. IOW, if there are N block cgroups,
each request queue may have up to N request pools, each independently
regulated by nr_requests.

optimal_io_size (RO)
--------------------
This is the optimal io size reported by the device.

physical_block_size (RO)
------------------------
This is the physical block size of device, in bytes.

read_ahead_kb (RW)
------------------
Maximum number of kilobytes to read-ahead for filesystems on this block
device.

rotational (RW)
---------------
This file is used to indicate whether the device is of rotational or
non-rotational type.

rq_affinity (RW)
----------------
If this option is '1', the block layer will migrate request completions to the
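
Since all of these attributes are plain text files under /sys, they are
easy to query from userspace. An illustrative C program (the device name
"sda" and the chosen attributes are just examples):

#include <stdio.h>

static int read_queue_attr(const char *disk, const char *attr,
			   char *buf, int len)
{
	char path[256];
	FILE *f;

	/* e.g. /sys/block/sda/queue/rotational */
	snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", disk, attr);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return 0;
}

int main(void)
{
	char buf[64];

	if (read_queue_attr("sda", "rotational", buf, sizeof(buf)) == 0)
		printf("rotational: %s", buf);	/* value includes a newline */
	if (read_queue_attr("sda", "max_sectors_kb", buf, sizeof(buf)) == 0)
		printf("max_sectors_kb: %s", buf);
	return 0;
}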
block/blk-lib.c +28 −13
@@ -44,6 +44,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	struct request_queue *q = bdev_get_queue(bdev);
 	int type = REQ_WRITE | REQ_DISCARD;
 	unsigned int max_discard_sectors;
+	unsigned int granularity, alignment, mask;
 	struct bio_batch bb;
 	struct bio *bio;
 	int ret = 0;
@@ -54,18 +55,20 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	if (!blk_queue_discard(q))
 		return -EOPNOTSUPP;
 
+	/* Zero-sector (unknown) and one-sector granularities are the same.  */
+	granularity = max(q->limits.discard_granularity >> 9, 1U);
+	mask = granularity - 1;
+	alignment = (bdev_discard_alignment(bdev) >> 9) & mask;
+
 	/*
 	 * Ensure that max_discard_sectors is of the proper
-	 * granularity
+	 * granularity, so that requests stay aligned after a split.
 	 */
 	max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
+	max_discard_sectors = round_down(max_discard_sectors, granularity);
 	if (unlikely(!max_discard_sectors)) {
 		/* Avoid infinite loop below. Being cautious never hurts. */
 		return -EOPNOTSUPP;
-	} else if (q->limits.discard_granularity) {
-		unsigned int disc_sects = q->limits.discard_granularity >> 9;
-
-		max_discard_sectors &= ~(disc_sects - 1);
 	}
 
 	if (flags & BLKDEV_DISCARD_SECURE) {
@@ -79,25 +82,37 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	bb.wait = &wait;
 
 	while (nr_sects) {
+		unsigned int req_sects;
+		sector_t end_sect;
+
 		bio = bio_alloc(gfp_mask, 1);
 		if (!bio) {
 			ret = -ENOMEM;
 			break;
 		}
 
+		req_sects = min_t(sector_t, nr_sects, max_discard_sectors);
+
+		/*
+		 * If splitting a request, and the next starting sector would be
+		 * misaligned, stop the discard at the previous aligned sector.
+		 */
+		end_sect = sector + req_sects;
+		if (req_sects < nr_sects && (end_sect & mask) != alignment) {
+			end_sect =
+				round_down(end_sect - alignment, granularity)
+				+ alignment;
+			req_sects = end_sect - sector;
+		}
+
 		bio->bi_sector = sector;
 		bio->bi_end_io = bio_batch_end_io;
 		bio->bi_bdev = bdev;
 		bio->bi_private = &bb;
 
-		if (nr_sects > max_discard_sectors) {
-			bio->bi_size = max_discard_sectors << 9;
-			nr_sects -= max_discard_sectors;
-			sector += max_discard_sectors;
-		} else {
-			bio->bi_size = nr_sects << 9;
-			nr_sects = 0;
-		}
+		bio->bi_size = req_sects << 9;
+		nr_sects -= req_sects;
+		sector = end_sect;
 
 		atomic_inc(&bb.done);
 		submit_bio(type, bio);
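
The splitting rule is easiest to see with concrete numbers. A standalone
sketch that mirrors the new loop, for a hypothetical device with a
power-of-two granularity of 8 sectors, an alignment offset of 2 sectors and
a 16-sector discard limit (userspace C; round_down_to() is hand-rolled here):

#include <stdio.h>

typedef unsigned long long sector_t;

static sector_t round_down_to(sector_t v, sector_t g)
{
	return v - (v % g);
}

int main(void)
{
	/* Hypothetical limits: granularity 8, alignment offset 2, max 16. */
	sector_t granularity = 8, alignment = 2, mask = granularity - 1;
	sector_t max_discard_sectors = 16;
	sector_t sector = 0, nr_sects = 40;	/* discard sectors [0, 40) */

	while (nr_sects) {
		sector_t req_sects = nr_sects < max_discard_sectors ?
				     nr_sects : max_discard_sectors;
		sector_t end_sect = sector + req_sects;

		/* Same rule as the kernel loop: when splitting, stop at the
		 * previous aligned sector if the next start would be
		 * misaligned. */
		if (req_sects < nr_sects && (end_sect & mask) != alignment) {
			end_sect = round_down_to(end_sect - alignment,
						 granularity) + alignment;
			req_sects = end_sect - sector;
		}

		/* Prints [0, 10), [10, 26), [26, 40): only the first chunk
		 * is shortened, after which every split lands on a
		 * granularity boundary plus the alignment offset. */
		printf("discard [%llu, %llu)\n", sector, end_sect);
		nr_sects -= req_sects;
		sector = end_sect;
	}
	return 0;
}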
block/blk-merge.c +82 −35
@@ -110,43 +110,28 @@ static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
 	return 0;
 }
 
-/*
- * map a request to scatterlist, return number of sg entries setup. Caller
- * must make sure sg can hold rq->nr_phys_segments entries
- */
-int blk_rq_map_sg(struct request_queue *q, struct request *rq,
-		  struct scatterlist *sglist)
+static void
+__blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
+		     struct scatterlist *sglist, struct bio_vec **bvprv,
+		     struct scatterlist **sg, int *nsegs, int *cluster)
 {
-	struct bio_vec *bvec, *bvprv;
-	struct req_iterator iter;
-	struct scatterlist *sg;
-	int nsegs, cluster;
-
-	nsegs = 0;
-	cluster = blk_queue_cluster(q);
-
-	/*
-	 * for each bio in rq
-	 */
-	bvprv = NULL;
-	sg = NULL;
-	rq_for_each_segment(bvec, rq, iter) {
+	int nbytes = bvec->bv_len;
 
-		if (bvprv && cluster) {
-			if (sg->length + nbytes > queue_max_segment_size(q))
+	if (*bvprv && *cluster) {
+		if ((*sg)->length + nbytes > queue_max_segment_size(q))
 			goto new_segment;
 
-			if (!BIOVEC_PHYS_MERGEABLE(bvprv, bvec))
+		if (!BIOVEC_PHYS_MERGEABLE(*bvprv, bvec))
 			goto new_segment;
-			if (!BIOVEC_SEG_BOUNDARY(q, bvprv, bvec))
+		if (!BIOVEC_SEG_BOUNDARY(q, *bvprv, bvec))
 			goto new_segment;
 
-			sg->length += nbytes;
+		(*sg)->length += nbytes;
 	} else {
 new_segment:
-			if (!sg)
-				sg = sglist;
+		if (!*sg)
+			*sg = sglist;
 		else {
 			/*
 			 * If the driver previously mapped a shorter
@@ -158,14 +143,39 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 			 * termination bit to avoid doing a full
 			 * sg_init_table() in drivers for each command.
 			 */
-			sg->page_link &= ~0x02;
-			sg = sg_next(sg);
+			(*sg)->page_link &= ~0x02;
+			*sg = sg_next(*sg);
 		}
 
-			sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
-			nsegs++;
+		sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
+		(*nsegs)++;
 	}
-		bvprv = bvec;
+	*bvprv = bvec;
+}
+
+/*
+ * map a request to scatterlist, return number of sg entries setup. Caller
+ * must make sure sg can hold rq->nr_phys_segments entries
+ */
+int blk_rq_map_sg(struct request_queue *q, struct request *rq,
+		  struct scatterlist *sglist)
+{
+	struct bio_vec *bvec, *bvprv;
+	struct req_iterator iter;
+	struct scatterlist *sg;
+	int nsegs, cluster;
+
+	nsegs = 0;
+	cluster = blk_queue_cluster(q);
+
+	/*
+	 * for each bio in rq
+	 */
+	bvprv = NULL;
+	sg = NULL;
+	rq_for_each_segment(bvec, rq, iter) {
+		__blk_segment_map_sg(q, bvec, sglist, &bvprv, &sg,
+				     &nsegs, &cluster);
+	} /* segments in rq */
 
 
@@ -199,6 +209,43 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 }
 EXPORT_SYMBOL(blk_rq_map_sg);
 
+/**
+ * blk_bio_map_sg - map a bio to a scatterlist
+ * @q: request_queue in question
+ * @bio: bio being mapped
+ * @sglist: scatterlist being mapped
+ *
+ * Note:
+ *    Caller must make sure sg can hold bio->bi_phys_segments entries
+ *
+ * Will return the number of sg entries setup
+ */
+int blk_bio_map_sg(struct request_queue *q, struct bio *bio,
+		   struct scatterlist *sglist)
+{
+	struct bio_vec *bvec, *bvprv;
+	struct scatterlist *sg;
+	int nsegs, cluster;
+	unsigned long i;
+
+	nsegs = 0;
+	cluster = blk_queue_cluster(q);
+
+	bvprv = NULL;
+	sg = NULL;
+	bio_for_each_segment(bvec, bio, i) {
+		__blk_segment_map_sg(q, bvec, sglist, &bvprv, &sg,
+				     &nsegs, &cluster);
+	} /* segments in bio */
+
+	if (sg)
+		sg_mark_end(sg);
+
+	BUG_ON(bio->bi_phys_segments && nsegs > bio->bi_phys_segments);
+	return nsegs;
+}
+EXPORT_SYMBOL(blk_bio_map_sg);
+
 static inline int ll_new_hw_segment(struct request_queue *q,
 				    struct request *req,
 				    struct bio *bio)
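
Finally, a hypothetical driver fragment showing how the new helper might be
called. Only blk_bio_map_sg() comes from this patch; sg_init_table() and
bio_phys_segments() are pre-existing kernel APIs, and the wrapper itself is
illustrative:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/scatterlist.h>

/* Map a single bio (rather than a whole request) to a scatterlist. */
static int my_map_one_bio(struct request_queue *q, struct bio *bio,
			  struct scatterlist *sgl, unsigned int max_ents)
{
	/* Per the kernel-doc above, the caller's table must be able to
	 * hold bio->bi_phys_segments entries. */
	if (bio_phys_segments(q, bio) > max_ents)
		return -EINVAL;

	sg_init_table(sgl, max_ents);

	/* Returns the number of entries used; the helper terminates the
	 * list itself via sg_mark_end(). */
	return blk_bio_map_sg(q, bio, sgl);
}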