Donate to e Foundation | Murena handsets with /e/OS | Own a part of Murena! Learn more

Commit e265eb3a authored by Shaohua Li's avatar Shaohua Li
Browse files

Merge branch 'md-next' into md-linus

parents 85724ede b506335e
Loading
Loading
Loading
Loading
+29 −3
Original line number Diff line number Diff line
@@ -401,7 +401,30 @@ All md devices contain:
     once the array becomes non-degraded, and this fact has been
     recorded in the metadata.

  consistency_policy
     This indicates how the array maintains consistency in case of unexpected
     shutdown. It can be:

     none
       Array has no redundancy information, e.g. raid0, linear.

     resync
       Full resync is performed and all redundancy is regenerated when the
       array is started after unclean shutdown.

     bitmap
       Resync assisted by a write-intent bitmap.

     journal
       For raid4/5/6, journal device is used to log transactions and replay
       after unclean shutdown.

     ppl
       For raid5 only, Partial Parity Log is used to close the write hole and
       eliminate resync.

     The accepted values when writing to this file are ``ppl`` and ``resync``,
     used to enable and disable PPL.


As component devices are added to an md array, they appear in the ``md``
@@ -563,6 +586,9 @@ Each directory contains:
	adds bad blocks without acknowledging them. This is largely
	for testing.

      ppl_sector, ppl_size
        Location and size (in sectors) of the space used for Partial Parity Log
        on this device.


An active md device will also contain an entry for each active device
+1 −1
Original line number Diff line number Diff line
@@ -321,4 +321,4 @@ The algorithm is:

There are somethings which are not supported by cluster MD yet.

- update size and change array_sectors.
- change array_sectors.
+44 −0
Original line number Diff line number Diff line
Partial Parity Log

Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
addressed by PPL is that after a dirty shutdown, parity of a particular stripe
may become inconsistent with data on other member disks. If the array is also
in degraded state, there is no way to recalculate parity, because one of the
disks is missing. This can lead to silent data corruption when rebuilding the
array or using it is as degraded - data calculated from parity for array blocks
that have not been touched by a write request during the unclean shutdown can
be incorrect. Such condition is known as the RAID5 Write Hole. Because of
this, md by default does not allow starting a dirty degraded array.

Partial parity for a write operation is the XOR of stripe data chunks not
modified by this write. It is just enough data needed for recovering from the
write hole. XORing partial parity with the modified chunks produces parity for
the stripe, consistent with its state before the write operation, regardless of
which chunk writes have completed. If one of the not modified data disks of
this stripe is missing, this updated parity can be used to recover its
contents. PPL recovery is also performed when starting an array after an
unclean shutdown and all disks are available, eliminating the need to resync
the array. Because of this, using write-intent bitmap and PPL together is not
supported.

When handling a write request PPL writes partial parity before new data and
parity are dispatched to disks. PPL is a distributed log - it is stored on
array member drives in the metadata area, on the parity drive of a particular
stripe.  It does not require a dedicated journaling drive. Write performance is
reduced by up to 30%-40% but it scales with the number of drives in the array
and the journaling drive does not become a bottleneck or a single point of
failure.

Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
not a true journal. It does not protect from losing in-flight data, only from
silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
performed for this stripe (parity is not updated). So it is possible to have
arbitrary data in the written part of a stripe if that disk is lost. In such
case the behavior is the same as in plain raid5.

PPL is available for md version-1 metadata and external (specifically IMSM)
metadata arrays. It can be enabled using mdadm option --consistency-policy=ppl.

Currently, volatile write-back cache should be disabled on all member drives
when using PPL. Otherwise it cannot guarantee consistency in case of power
failure.
+13 −48
Original line number Diff line number Diff line
@@ -633,20 +633,21 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
}
EXPORT_SYMBOL(bio_clone_fast);

static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
				      struct bio_set *bs, int offset,
				      int size)
/**
 * 	bio_clone_bioset - clone a bio
 * 	@bio_src: bio to clone
 *	@gfp_mask: allocation priority
 *	@bs: bio_set to allocate from
 *
 *	Clone bio. Caller will own the returned bio, but not the actual data it
 *	points to. Reference count of returned bio will be one.
 */
struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
			     struct bio_set *bs)
{
	struct bvec_iter iter;
	struct bio_vec bv;
	struct bio *bio;
	struct bvec_iter iter_src = bio_src->bi_iter;

	/* for supporting partial clone */
	if (offset || size != bio_src->bi_iter.bi_size) {
		bio_advance_iter(bio_src, &iter_src, offset);
		iter_src.bi_size = size;
	}

	/*
	 * Pre immutable biovecs, __bio_clone() used to just do a memcpy from
@@ -670,8 +671,7 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
	 *    __bio_clone_fast() anyways.
	 */

	bio = bio_alloc_bioset(gfp_mask, __bio_segments(bio_src,
			       &iter_src), bs);
	bio = bio_alloc_bioset(gfp_mask, bio_segments(bio_src), bs);
	if (!bio)
		return NULL;
	bio->bi_bdev		= bio_src->bi_bdev;
@@ -688,7 +688,7 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
		bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
		break;
	default:
		__bio_for_each_segment(bv, bio_src, iter, iter_src)
		bio_for_each_segment(bv, bio_src, iter)
			bio->bi_io_vec[bio->bi_vcnt++] = bv;
		break;
	}
@@ -707,43 +707,8 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,

	return bio;
}

/**
 * 	bio_clone_bioset - clone a bio
 * 	@bio_src: bio to clone
 *	@gfp_mask: allocation priority
 *	@bs: bio_set to allocate from
 *
 *	Clone bio. Caller will own the returned bio, but not the actual data it
 *	points to. Reference count of returned bio will be one.
 */
struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
			     struct bio_set *bs)
{
	return __bio_clone_bioset(bio_src, gfp_mask, bs, 0,
				  bio_src->bi_iter.bi_size);
}
EXPORT_SYMBOL(bio_clone_bioset);

/**
 * 	bio_clone_bioset_partial - clone a partial bio
 * 	@bio_src: bio to clone
 *	@gfp_mask: allocation priority
 *	@bs: bio_set to allocate from
 *	@offset: cloned starting from the offset
 *	@size: size for the cloned bio
 *
 *	Clone bio. Caller will own the returned bio, but not the actual data it
 *	points to. Reference count of returned bio will be one.
 */
struct bio *bio_clone_bioset_partial(struct bio *bio_src, gfp_t gfp_mask,
				     struct bio_set *bs, int offset,
				     int size)
{
	return __bio_clone_bioset(bio_src, gfp_mask, bs, offset, size);
}
EXPORT_SYMBOL(bio_clone_bioset_partial);

/**
 *	bio_add_pc_page	-	attempt to add page to bio
 *	@q: the target queue
+1 −1
Original line number Diff line number Diff line
@@ -18,7 +18,7 @@ dm-cache-cleaner-y += dm-cache-policy-cleaner.o
dm-era-y	+= dm-era-target.o
dm-verity-y	+= dm-verity-target.o
md-mod-y	+= md.o bitmap.o
raid456-y	+= raid5.o raid5-cache.o
raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o

# Note: link order is important.  All raid personalities
# and must come before md.o, as they each initialise 
Loading