Documentation/admin-guide/md.rst +29 −3

@@ -401,7 +401,30 @@ All md devices contain:
      once the array becomes non-degraded, and this fact has been
      recorded in the metadata.
 
+  consistency_policy
+     This indicates how the array maintains consistency in case of unexpected
+     shutdown. It can be:
+
+     none
+       Array has no redundancy information, e.g. raid0, linear.
+     resync
+       Full resync is performed and all redundancy is regenerated when the
+       array is started after unclean shutdown.
+     bitmap
+       Resync assisted by a write-intent bitmap.
+     journal
+       For raid4/5/6, journal device is used to log transactions and replay
+       after unclean shutdown.
+     ppl
+       For raid5 only, Partial Parity Log is used to close the write hole and
+       eliminate resync.
+
+     The accepted values when writing to this file are ``ppl`` and ``resync``,
+     used to enable and disable PPL.
+
 
 As component devices are added to an md array, they appear in the ``md``

@@ -563,6 +586,9 @@ Each directory contains:
 	adds bad blocks without acknowledging them. This is largely
 	for testing.
 
+      ppl_sector, ppl_size
+        Location and size (in sectors) of the space used for Partial Parity Log
+        on this device.
 
 
 An active md device will also contain an entry for each active device
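For readers who want to consume the new ``consistency_policy`` attribute from userspace, the following is a minimal sketch, not part of the patch. It only maps the five documented strings to an enum and encodes the documented rule that ``ppl`` and ``resync`` are the values accepted on write. The enum, the `MD_POLICY_*` names, and both helper functions are hypothetical names chosen for illustration; they do not exist in the kernel or in mdadm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace enum mirroring the documented attribute values. */
enum md_policy {
	MD_POLICY_UNKNOWN = -1,
	MD_POLICY_NONE,
	MD_POLICY_RESYNC,
	MD_POLICY_BITMAP,
	MD_POLICY_JOURNAL,
	MD_POLICY_PPL
};

/* Strings as listed in md.rst, in the same order as the enum above. */
static const char *const md_policy_names[] = {
	"none", "resync", "bitmap", "journal", "ppl"
};

/* Map the text read from the attribute (trailing newline allowed) to an enum. */
static enum md_policy md_parse_policy(const char *text)
{
	assert(text != NULL);

	size_t len = strcspn(text, "\n");
	size_t n = sizeof(md_policy_names) / sizeof(md_policy_names[0]);

	for (size_t i = 0; i < n; i++) {
		if (strlen(md_policy_names[i]) == len &&
		    strncmp(text, md_policy_names[i], len) == 0)
			return (enum md_policy)i;
	}
	return MD_POLICY_UNKNOWN;
}

/* Per the documentation, only "ppl" and "resync" are accepted on write. */
static bool md_policy_is_writable(enum md_policy policy)
{
	return policy == MD_POLICY_PPL || policy == MD_POLICY_RESYNC;
}
```

A caller would typically read the attribute text from the array's ``md`` sysfs directory, pass it to `md_parse_policy()`, and check `md_policy_is_writable()` before attempting to switch policies.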
Documentation/md/md-cluster.txt +1 −1

@@ -321,4 +321,4 @@ The algorithm is:
 There are somethings which are not supported by cluster MD yet.
 
-- update size and change array_sectors.
+- change array_sectors.
Documentation/md/raid5-ppl.txt 0 → 100644 +44 −0

+Partial Parity Log
+
+Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
+addressed by PPL is that after a dirty shutdown, parity of a particular stripe
+may become inconsistent with data on other member disks. If the array is also
+in degraded state, there is no way to recalculate parity, because one of the
+disks is missing. This can lead to silent data corruption when rebuilding the
+array or using it is as degraded - data calculated from parity for array blocks
+that have not been touched by a write request during the unclean shutdown can
+be incorrect. Such condition is known as the RAID5 Write Hole. Because of
+this, md by default does not allow starting a dirty degraded array.
+
+Partial parity for a write operation is the XOR of stripe data chunks not
+modified by this write. It is just enough data needed for recovering from the
+write hole. XORing partial parity with the modified chunks produces parity for
+the stripe, consistent with its state before the write operation, regardless of
+which chunk writes have completed. If one of the not modified data disks of
+this stripe is missing, this updated parity can be used to recover its
+contents. PPL recovery is also performed when starting an array after an
+unclean shutdown and all disks are available, eliminating the need to resync
+the array. Because of this, using write-intent bitmap and PPL together is not
+supported.
+
+When handling a write request PPL writes partial parity before new data and
+parity are dispatched to disks. PPL is a distributed log - it is stored on
+array member drives in the metadata area, on the parity drive of a particular
+stripe. It does not require a dedicated journaling drive. Write performance is
+reduced by up to 30%-40% but it scales with the number of drives in the array
+and the journaling drive does not become a bottleneck or a single point of
+failure.
+
+Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
+not a true journal. It does not protect from losing in-flight data, only from
+silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
+performed for this stripe (parity is not updated). So it is possible to have
+arbitrary data in the written part of a stripe if that disk is lost. In such
+case the behavior is the same as in plain raid5.
+
+PPL is available for md version-1 metadata and external (specifically IMSM)
+metadata arrays. It can be enabled using mdadm option --consistency-policy=ppl.
+
+Currently, volatile write-back cache should be disabled on all member drives
+when using PPL. Otherwise it cannot guarantee consistency in case of power
+failure.
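The partial-parity arithmetic described in raid5-ppl.txt can be illustrated with a small userspace sketch. This is not the code in raid5-ppl.c; it assumes byte-wise XOR over in-memory chunks, and the function names (`xor_into`, `partial_parity`, `parity_from_partial`) and the `modified[]` flag array are hypothetical. It shows that partial parity is the XOR of the chunks the write does not touch, and that XORing it with whatever the modified chunks currently contain gives a parity from which an untouched chunk can be rebuilt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* XOR len bytes of src into dst. */
static void xor_into(uint8_t *dst, const uint8_t *src, size_t len)
{
	for (size_t i = 0; i < len; i++)
		dst[i] ^= src[i];
}

/*
 * Partial parity: XOR of the data chunks NOT modified by the write.
 * modified[i] is nonzero when the write touches chunk i.
 */
static void partial_parity(uint8_t *pp, const uint8_t *const *chunks,
			   const int *modified, size_t nchunks, size_t len)
{
	assert(nchunks > 0);

	memset(pp, 0, len);
	for (size_t i = 0; i < nchunks; i++)
		if (!modified[i])
			xor_into(pp, chunks[i], len);
}

/*
 * Recovery step: XOR the logged partial parity with the current on-disk
 * contents of the modified chunks, whichever of those writes completed.
 */
static void parity_from_partial(uint8_t *parity, const uint8_t *pp,
				const uint8_t *const *chunks,
				const int *modified, size_t nchunks,
				size_t len)
{
	memcpy(parity, pp, len);
	for (size_t i = 0; i < nchunks; i++)
		if (modified[i])
			xor_into(parity, chunks[i], len);
}
```

With three data chunks and a write that touches only the first, the recomputed parity XORed with the first two chunks yields the third, regardless of whether the first chunk holds old, new, or partially written data. That is why a missing unmodified disk can still be recovered after a dirty shutdown.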
block/bio.c +13 −48

@@ -633,20 +633,21 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
 }
 EXPORT_SYMBOL(bio_clone_fast);
 
-static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
-				      struct bio_set *bs, int offset,
-				      int size)
+/**
+ *	bio_clone_bioset - clone a bio
+ *	@bio_src: bio to clone
+ *	@gfp_mask: allocation priority
+ *	@bs: bio_set to allocate from
+ *
+ *	Clone bio. Caller will own the returned bio, but not the actual data it
+ *	points to. Reference count of returned bio will be one.
+ */
+struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
+			     struct bio_set *bs)
 {
 	struct bvec_iter iter;
 	struct bio_vec bv;
 	struct bio *bio;
-	struct bvec_iter iter_src = bio_src->bi_iter;
-
-	/* for supporting partial clone */
-	if (offset || size != bio_src->bi_iter.bi_size) {
-		bio_advance_iter(bio_src, &iter_src, offset);
-		iter_src.bi_size = size;
-	}
 
 	/*
 	 * Pre immutable biovecs, __bio_clone() used to just do a memcpy from

@@ -670,8 +671,7 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
 	 * __bio_clone_fast() anyways.
 	 */
 
-	bio = bio_alloc_bioset(gfp_mask, __bio_segments(bio_src, &iter_src),
-			       bs);
+	bio = bio_alloc_bioset(gfp_mask, bio_segments(bio_src), bs);
 	if (!bio)
 		return NULL;
 	bio->bi_bdev = bio_src->bi_bdev;

@@ -688,7 +688,7 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
 		bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
 		break;
 	default:
-		__bio_for_each_segment(bv, bio_src, iter, iter_src)
+		bio_for_each_segment(bv, bio_src, iter)
 			bio->bi_io_vec[bio->bi_vcnt++] = bv;
 		break;
 	}

@@ -707,43 +707,8 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
 
 	return bio;
 }
-
-/**
- *	bio_clone_bioset - clone a bio
- *	@bio_src: bio to clone
- *	@gfp_mask: allocation priority
- *	@bs: bio_set to allocate from
- *
- *	Clone bio. Caller will own the returned bio, but not the actual data it
- *	points to. Reference count of returned bio will be one.
- */
-struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
-			     struct bio_set *bs)
-{
-	return __bio_clone_bioset(bio_src, gfp_mask, bs, 0,
-				  bio_src->bi_iter.bi_size);
-}
 EXPORT_SYMBOL(bio_clone_bioset);
 
-/**
- *	bio_clone_bioset_partial - clone a partial bio
- *	@bio_src: bio to clone
- *	@gfp_mask: allocation priority
- *	@bs: bio_set to allocate from
- *	@offset: cloned starting from the offset
- *	@size: size for the cloned bio
- *
- *	Clone bio. Caller will own the returned bio, but not the actual data it
- *	points to. Reference count of returned bio will be one.
- */
-struct bio *bio_clone_bioset_partial(struct bio *bio_src, gfp_t gfp_mask,
-				     struct bio_set *bs, int offset,
-				     int size)
-{
-	return __bio_clone_bioset(bio_src, gfp_mask, bs, offset, size);
-}
-EXPORT_SYMBOL(bio_clone_bioset_partial);
-
 /**
  *	bio_add_pc_page - attempt to add page to bio
  *	@q: the target queue
drivers/md/Makefile +1 −1

@@ -18,7 +18,7 @@ dm-cache-cleaner-y += dm-cache-policy-cleaner.o
 dm-era-y	+= dm-era-target.o
 dm-verity-y	+= dm-verity-target.o
 md-mod-y	+= md.o bitmap.o
-raid456-y	+= raid5.o raid5-cache.o
+raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
 
 # Note: link order is important. All raid personalities
 # and must come before md.o, as they each initialise