Commit e5021876 authored by Linus Torvalds
Pull MD updates from Shaohua Li:

 - Add Partial Parity Log (ppl) feature found in Intel IMSM raid array
   by Artur Paszkiewicz. This feature is another way to close RAID5
   writehole. The Linux implementation is also available for normal
   RAID5 array if specific superblock bit is set.

 - A number of md-cluster fixes and enabling md-cluster array resize from
   Guoqing Jiang

 - A bunch of patches from Ming Lei and Neil Brown to rewrite MD bio
   handling related code. MD no longer accesses bio bvec or
   bi_phys_segments directly, and it uses the modern bio API for bio
   splitting.

 - Improve RAID5 IO pattern to improve performance for hard disk based
   RAID5/6 from me.

 - Several patches from Song Liu to speed up raid5-cache recovery and
   allow raid5 cache feature disabling in runtime.

 - Fix a performance regression in raid1 resync from Xiao Ni.

 - Other cleanup and fixes from various people.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (84 commits)
  md/raid10: skip spare disk as 'first' disk
  md/raid1: Use a new variable to count flighting sync requests
  md: clear WantReplacement once disk is removed
  md/raid1/10: remove unused queue
  md: handle read-only member devices better.
  md/raid10: wait up frozen array in handle_write_completed
  uapi: fix linux/raid/md_p.h userspace compilation error
  md-cluster: Fix a memleak in an error handling path
  md: support disabling of create-on-open semantics.
  md: allow creation of mdNNN arrays via md_mod/parameters/new_array
  raid5-ppl: use a single mempool for ppl_io_unit and header_page
  md/raid0: fix up bio splitting.
  md/linear: improve bio splitting.
  md/raid5: make chunk_aligned_read() split bios more cleanly.
  md/raid10: simplify handle_read_error()
  md/raid10: simplify the splitting of requests.
  md/raid1: factor out flush_bio_list()
  md/raid1: simplify handle_read_error().
  Revert "block: introduce bio_copy_data_partial"
  md/raid1: simplify alloc_behind_master_bio()
  ...
parents 46f0537b e265eb3a
+29 −3
@@ -401,7 +401,30 @@ All md devices contain:
     once the array becomes non-degraded, and this fact has been
     recorded in the metadata.

  consistency_policy
     This indicates how the array maintains consistency in case of unexpected
     shutdown. It can be:

     none
       Array has no redundancy information, e.g. raid0, linear.

     resync
       Full resync is performed and all redundancy is regenerated when the
       array is started after unclean shutdown.

     bitmap
       Resync assisted by a write-intent bitmap.

     journal
       For raid4/5/6, journal device is used to log transactions and replay
       after unclean shutdown.

     ppl
       For raid5 only, Partial Parity Log is used to close the write hole and
       eliminate resync.

     The accepted values when writing to this file are ``ppl`` and ``resync``,
     used to enable and disable PPL.
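
The read and write rules for ``consistency_policy`` described above can be sketched from userspace as follows. This is a minimal illustration, not part of the kernel change: the helper names (`policy_path`, `read_policy`, `write_policy`) are hypothetical, and the path assumes the md attributes are exposed under `/sys/block/<md>/md/`. Writing the real file needs root and an array that supports PPL.

```python
from pathlib import Path

# Per the text above, only these two values are accepted on write:
# "ppl" enables PPL, "resync" disables it.
WRITABLE_POLICIES = {"ppl", "resync"}


def policy_path(md_name, sysfs_root="/sys/block"):
    """Build the attribute path (assumed layout: <root>/<md>/md/consistency_policy)."""
    return Path(sysfs_root) / md_name / "md" / "consistency_policy"


def read_policy(md_name, sysfs_root="/sys/block"):
    """Return the current policy string, e.g. 'resync', 'bitmap', 'journal' or 'ppl'."""
    return policy_path(md_name, sysfs_root).read_text().strip()


def write_policy(md_name, policy, sysfs_root="/sys/block"):
    """Enable ('ppl') or disable ('resync') PPL; refuse any other value up front."""
    if policy not in WRITABLE_POLICIES:
        raise ValueError(f"only {sorted(WRITABLE_POLICIES)} may be written, got {policy!r}")
    policy_path(md_name, sysfs_root).write_text(policy + "\n")
```

The `sysfs_root` parameter exists only so the helpers can be pointed at a scratch directory instead of a live array.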


As component devices are added to an md array, they appear in the ``md``
@@ -563,6 +586,9 @@ Each directory contains:
	adds bad blocks without acknowledging them. This is largely
	for testing.

      ppl_sector, ppl_size
        Location and size (in sectors) of the space used for Partial Parity Log
        on this device.


An active md device will also contain an entry for each active device
+1 −1
@@ -321,4 +321,4 @@ The algorithm is:

There are some things which are not supported by cluster MD yet.

- update size and change array_sectors.
- change array_sectors.
+44 −0
Partial Parity Log

Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
addressed by PPL is that after a dirty shutdown, parity of a particular stripe
may become inconsistent with data on other member disks. If the array is also
in degraded state, there is no way to recalculate parity, because one of the
disks is missing. This can lead to silent data corruption when rebuilding the
array or using it as degraded - data calculated from parity for array blocks
that have not been touched by a write request during the unclean shutdown can
be incorrect. Such condition is known as the RAID5 Write Hole. Because of
this, md by default does not allow starting a dirty degraded array.

Partial parity for a write operation is the XOR of stripe data chunks not
modified by this write. It is just enough data needed for recovering from the
write hole. XORing partial parity with the modified chunks produces parity for
the stripe, consistent with its state before the write operation, regardless of
which chunk writes have completed. If one of the not modified data disks of
this stripe is missing, this updated parity can be used to recover its
contents. PPL recovery is also performed when starting an array after an
unclean shutdown and all disks are available, eliminating the need to resync
the array. Because of this, using write-intent bitmap and PPL together is not
supported.
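
The partial parity arithmetic described above can be checked with a small userspace model. This is a sketch under simplifying assumptions, not the raid5-ppl implementation: chunks are short `bytes` objects, a stripe is a list of data chunks, and the function names are made up for illustration.

```python
def xor_chunks(chunks, size):
    """XOR together equally sized byte chunks; no chunks yields all zeros."""
    out = bytearray(size)
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)


def partial_parity(stripe, modified):
    """XOR of the data chunks that the write does NOT modify."""
    untouched = [c for i, c in enumerate(stripe) if i not in modified]
    return xor_chunks(untouched, len(stripe[0]))


def recover_parity(ppl, stripe_on_disk, modified):
    """XOR partial parity with whatever the modified chunks hold on disk now."""
    current = [stripe_on_disk[i] for i in modified]
    return xor_chunks([ppl] + current, len(ppl))


# A write targets chunks 0 and 1; the crash hits after only chunk 0 landed.
d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\xaa\xbb"
new_d0 = b"\xff\x00"
ppl = partial_parity([d0, d1, d2], {0, 1})
on_disk = [new_d0, d1, d2]
parity = recover_parity(ppl, on_disk, {0, 1})

# If the disk holding d2 is now missing, the recovered parity rebuilds it,
# while the stale pre-write parity would silently return wrong data.
rebuilt = xor_chunks([parity, new_d0, d1], 2)
stale_parity = xor_chunks([d0, d1, d2], 2)
rebuilt_from_stale = xor_chunks([stale_parity, new_d0, d1], 2)
```

Here `rebuilt` equals `d2`, whereas `rebuilt_from_stale` does not, which is the write hole that the logged partial parity closes.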

When handling a write request PPL writes partial parity before new data and
parity are dispatched to disks. PPL is a distributed log - it is stored on
array member drives in the metadata area, on the parity drive of a particular
stripe.  It does not require a dedicated journaling drive. Write performance is
reduced by up to 30%-40% but it scales with the number of drives in the array
and the journaling drive does not become a bottleneck or a single point of
failure.

Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
not a true journal. It does not protect from losing in-flight data, only from
silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
performed for this stripe (parity is not updated). So it is possible to have
arbitrary data in the written part of a stripe if that disk is lost. In such
case the behavior is the same as in plain raid5.

PPL is available for md version-1 metadata and external (specifically IMSM)
metadata arrays. It can be enabled using mdadm option --consistency-policy=ppl.

Currently, volatile write-back cache should be disabled on all member drives
when using PPL. Otherwise it cannot guarantee consistency in case of power
failure.
+13 −48
@@ -633,20 +633,21 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
}
EXPORT_SYMBOL(bio_clone_fast);

static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
				      struct bio_set *bs, int offset,
				      int size)
/**
 * 	bio_clone_bioset - clone a bio
 * 	@bio_src: bio to clone
 *	@gfp_mask: allocation priority
 *	@bs: bio_set to allocate from
 *
 *	Clone bio. Caller will own the returned bio, but not the actual data it
 *	points to. Reference count of returned bio will be one.
 */
struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
			     struct bio_set *bs)
{
	struct bvec_iter iter;
	struct bio_vec bv;
	struct bio *bio;
	struct bvec_iter iter_src = bio_src->bi_iter;

	/* for supporting partial clone */
	if (offset || size != bio_src->bi_iter.bi_size) {
		bio_advance_iter(bio_src, &iter_src, offset);
		iter_src.bi_size = size;
	}

	/*
	 * Pre immutable biovecs, __bio_clone() used to just do a memcpy from
@@ -670,8 +671,7 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
	 *    __bio_clone_fast() anyways.
	 */

	bio = bio_alloc_bioset(gfp_mask, __bio_segments(bio_src,
			       &iter_src), bs);
	bio = bio_alloc_bioset(gfp_mask, bio_segments(bio_src), bs);
	if (!bio)
		return NULL;
	bio->bi_bdev		= bio_src->bi_bdev;
@@ -688,7 +688,7 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
		bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
		break;
	default:
		__bio_for_each_segment(bv, bio_src, iter, iter_src)
		bio_for_each_segment(bv, bio_src, iter)
			bio->bi_io_vec[bio->bi_vcnt++] = bv;
		break;
	}
@@ -707,43 +707,8 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,

	return bio;
}

/**
 * 	bio_clone_bioset - clone a bio
 * 	@bio_src: bio to clone
 *	@gfp_mask: allocation priority
 *	@bs: bio_set to allocate from
 *
 *	Clone bio. Caller will own the returned bio, but not the actual data it
 *	points to. Reference count of returned bio will be one.
 */
struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
			     struct bio_set *bs)
{
	return __bio_clone_bioset(bio_src, gfp_mask, bs, 0,
				  bio_src->bi_iter.bi_size);
}
EXPORT_SYMBOL(bio_clone_bioset);

/**
 * 	bio_clone_bioset_partial - clone a partial bio
 * 	@bio_src: bio to clone
 *	@gfp_mask: allocation priority
 *	@bs: bio_set to allocate from
 *	@offset: cloned starting from the offset
 *	@size: size for the cloned bio
 *
 *	Clone bio. Caller will own the returned bio, but not the actual data it
 *	points to. Reference count of returned bio will be one.
 */
struct bio *bio_clone_bioset_partial(struct bio *bio_src, gfp_t gfp_mask,
				     struct bio_set *bs, int offset,
				     int size)
{
	return __bio_clone_bioset(bio_src, gfp_mask, bs, offset, size);
}
EXPORT_SYMBOL(bio_clone_bioset_partial);

/**
 *	bio_add_pc_page	-	attempt to add page to bio
 *	@q: the target queue
+1 −1
@@ -18,7 +18,7 @@ dm-cache-cleaner-y += dm-cache-policy-cleaner.o
dm-era-y	+= dm-era-target.o
dm-verity-y	+= dm-verity-target.o
md-mod-y	+= md.o bitmap.o
raid456-y	+= raid5.o raid5-cache.o
raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o

# Note: link order is important.  All raid personalities
# and must come before md.o, as they each initialise 