Loading Documentation/md-cluster.txt 0 → 100644 +176 −0 Original line number Diff line number Diff line The cluster MD is a shared-device RAID for a cluster. 1. On-disk format Separate write-intent-bitmap are used for each cluster node. The bitmaps record all writes that may have been started on that node, and may not yet have finished. The on-disk layout is: 0 4k 8k 12k ------------------------------------------------------------------- | idle | md super | bm super [0] + bits | | bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] | | bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits | | bm bits [3, contd] | | | During "normal" functioning we assume the filesystem ensures that only one node writes to any given block at a time, so a write request will - set the appropriate bit (if not already set) - commit the write to all mirrors - schedule the bit to be cleared after a timeout. Reads are just handled normally. It is up to the filesystem to ensure one node doesn't read from a location where another node (or the same node) is writing. 2. DLM Locks for management There are two locks for managing the device: 2.1 Bitmap lock resource (bm_lockres) The bm_lockres protects individual node bitmaps. They are named in the form bitmap001 for node 1, bitmap002 for node and so on. When a node joins the cluster, it acquires the lock in PW mode and it stays so during the lifetime the node is part of the cluster. The lock resource number is based on the slot number returned by the DLM subsystem. Since DLM starts node count from one and bitmap slots start from zero, one is subtracted from the DLM slot number to arrive at the bitmap slot number. 3. Communication Each node has to communicate with other nodes when starting or ending resync, and metadata superblock updates. 3.1 Message Types There are 3 types, of messages which are passed 3.1.1 METADATA_UPDATED: informs other nodes that the metadata has been updated, and the node must re-read the md superblock. This is performed synchronously. 3.1.2 RESYNC: informs other nodes that a resync is initiated or ended so that each node may suspend or resume the region. 3.2 Communication mechanism The DLM LVB is used to communicate within nodes of the cluster. There are three resources used for the purpose: 3.2.1 Token: The resource which protects the entire communication system. The node having the token resource is allowed to communicate. 3.2.2 Message: The lock resource which carries the data to communicate. 3.2.3 Ack: The resource, acquiring which means the message has been acknowledged by all nodes in the cluster. The BAST of the resource is used to inform the receive node that a node wants to communicate. The algorithm is: 1. receive status sender receiver receiver ACK:CR ACK:CR ACK:CR 2. sender get EX of TOKEN sender get EX of MESSAGE sender receiver receiver TOKEN:EX ACK:CR ACK:CR MESSAGE:EX ACK:CR Sender checks that it still needs to send a message. Messages received or other events that happened while waiting for the TOKEN may have made this message inappropriate or redundant. 3. sender write LVB. sender down-convert MESSAGE from EX to CR sender try to get EX of ACK [ wait until all receiver has *processed* the MESSAGE ] [ triggered by bast of ACK ] receiver get CR of MESSAGE receiver read LVB receiver processes the message [ wait finish ] receiver release ACK sender receiver receiver TOKEN:EX MESSAGE:CR MESSAGE:CR MESSAGE:CR ACK:EX 4. triggered by grant of EX on ACK (indicating all receivers have processed message) sender down-convert ACK from EX to CR sender release MESSAGE sender release TOKEN receiver upconvert to EX of MESSAGE receiver get CR of ACK receiver release MESSAGE sender receiver receiver ACK:CR ACK:CR ACK:CR 4. Handling Failures 4.1 Node Failure When a node fails, the DLM informs the cluster with the slot. The node starts a cluster recovery thread. The cluster recovery thread: - acquires the bitmap<number> lock of the failed node - opens the bitmap - reads the bitmap of the failed node - copies the set bitmap to local node - cleans the bitmap of the failed node - releases bitmap<number> lock of the failed node - initiates resync of the bitmap on the current node The resync process, is the regular md resync. However, in a clustered environment when a resync is performed, it needs to tell other nodes of the areas which are suspended. Before a resync starts, the node send out RESYNC_START with the (lo,hi) range of the area which needs to be suspended. Each node maintains a suspend_list, which contains the list of ranges which are currently suspended. On receiving RESYNC_START, the node adds the range to the suspend_list. Similarly, when the node performing resync finishes, it send RESYNC_FINISHED to other nodes and other nodes remove the corresponding entry from the suspend_list. A helper function, should_suspend() can be used to check if a particular I/O range should be suspended or not. 4.2 Device Failure Device failures are handled and communicated with the metadata update routine. 5. Adding a new Device For adding a new device, it is necessary that all nodes "see" the new device to be added. For this, the following algorithm is used: 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues ioctl(ADD_NEW_DISC with disc.state set to MD_DISK_CLUSTER_ADD) 2. Node 1 sends NEWDISK with uuid and slot number 3. Other nodes issue kobject_uevent_env with uuid and slot number (Steps 4,5 could be a udev rule) 4. In userspace, the node searches for the disk, perhaps using blkid -t SUB_UUID="" 5. Other nodes issue either of the following depending on whether the disk was found: ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and disc.number set to slot number) ioctl(CLUSTERED_DISK_NACK) 6. Other nodes drop lock on no-new-devs (CR) if device is found 7. Node 1 attempts EX lock on no-new-devs 8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk as SpareLocal 9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED 10. Other nodes get the information whether a disk is added or not by the following METADATA_UPDATED. drivers/md/Kconfig +16 −0 Original line number Diff line number Diff line Loading @@ -175,6 +175,22 @@ config MD_FAULTY In unsure, say N. config MD_CLUSTER tristate "Cluster Support for MD (EXPERIMENTAL)" depends on BLK_DEV_MD depends on DLM default n ---help--- Clustering support for MD devices. This enables locking and synchronization across multiple systems on the cluster, so all nodes in the cluster can access the MD devices simultaneously. This brings the redundancy (and uptime) of RAID levels across the nodes of the cluster. If unsure, say N. source "drivers/md/bcache/Kconfig" config BLK_DEV_DM_BUILTIN Loading drivers/md/Makefile +1 −0 Original line number Diff line number Diff line Loading @@ -30,6 +30,7 @@ obj-$(CONFIG_MD_RAID10) += raid10.o obj-$(CONFIG_MD_RAID456) += raid456.o obj-$(CONFIG_MD_MULTIPATH) += multipath.o obj-$(CONFIG_MD_FAULTY) += faulty.o obj-$(CONFIG_MD_CLUSTER) += md-cluster.o obj-$(CONFIG_BCACHE) += bcache/ obj-$(CONFIG_BLK_DEV_MD) += md-mod.o obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o Loading drivers/md/bitmap.c +167 −22 Original line number Diff line number Diff line Loading @@ -205,6 +205,10 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait) struct block_device *bdev; struct mddev *mddev = bitmap->mddev; struct bitmap_storage *store = &bitmap->storage; int node_offset = 0; if (mddev_is_clustered(bitmap->mddev)) node_offset = bitmap->cluster_slot * store->file_pages; while ((rdev = next_active_rdev(rdev, mddev)) != NULL) { int size = PAGE_SIZE; Loading Loading @@ -433,6 +437,7 @@ void bitmap_update_sb(struct bitmap *bitmap) /* This might have been changed by a reshape */ sb->sync_size = cpu_to_le64(bitmap->mddev->resync_max_sectors); sb->chunksize = cpu_to_le32(bitmap->mddev->bitmap_info.chunksize); sb->nodes = cpu_to_le32(bitmap->mddev->bitmap_info.nodes); sb->sectors_reserved = cpu_to_le32(bitmap->mddev-> bitmap_info.space); kunmap_atomic(sb); Loading Loading @@ -544,6 +549,7 @@ static int bitmap_read_sb(struct bitmap *bitmap) bitmap_super_t *sb; unsigned long chunksize, daemon_sleep, write_behind; unsigned long long events; int nodes = 0; unsigned long sectors_reserved = 0; int err = -EINVAL; struct page *sb_page; Loading @@ -562,6 +568,22 @@ static int bitmap_read_sb(struct bitmap *bitmap) return -ENOMEM; bitmap->storage.sb_page = sb_page; re_read: /* If cluster_slot is set, the cluster is setup */ if (bitmap->cluster_slot >= 0) { sector_t bm_blocks = bitmap->mddev->resync_max_sectors; sector_div(bm_blocks, bitmap->mddev->bitmap_info.chunksize >> 9); /* bits to bytes */ bm_blocks = ((bm_blocks+7) >> 3) + sizeof(bitmap_super_t); /* to 4k blocks */ bm_blocks = DIV_ROUND_UP_SECTOR_T(bm_blocks, 4096); bitmap->mddev->bitmap_info.offset += bitmap->cluster_slot * (bm_blocks << 3); pr_info("%s:%d bm slot: %d offset: %llu\n", __func__, __LINE__, bitmap->cluster_slot, (unsigned long long)bitmap->mddev->bitmap_info.offset); } if (bitmap->storage.file) { loff_t isize = i_size_read(bitmap->storage.file->f_mapping->host); int bytes = isize > PAGE_SIZE ? PAGE_SIZE : isize; Loading @@ -577,12 +599,15 @@ static int bitmap_read_sb(struct bitmap *bitmap) if (err) return err; err = -EINVAL; sb = kmap_atomic(sb_page); chunksize = le32_to_cpu(sb->chunksize); daemon_sleep = le32_to_cpu(sb->daemon_sleep) * HZ; write_behind = le32_to_cpu(sb->write_behind); sectors_reserved = le32_to_cpu(sb->sectors_reserved); nodes = le32_to_cpu(sb->nodes); strlcpy(bitmap->mddev->bitmap_info.cluster_name, sb->cluster_name, 64); /* verify that the bitmap-specific fields are valid */ if (sb->magic != cpu_to_le32(BITMAP_MAGIC)) Loading Loading @@ -619,7 +644,7 @@ static int bitmap_read_sb(struct bitmap *bitmap) goto out; } events = le64_to_cpu(sb->events); if (events < bitmap->mddev->events) { if (!nodes && (events < bitmap->mddev->events)) { printk(KERN_INFO "%s: bitmap file is out of date (%llu < %llu) " "-- forcing full recovery\n", Loading @@ -634,20 +659,40 @@ static int bitmap_read_sb(struct bitmap *bitmap) if (le32_to_cpu(sb->version) == BITMAP_MAJOR_HOSTENDIAN) set_bit(BITMAP_HOSTENDIAN, &bitmap->flags); bitmap->events_cleared = le64_to_cpu(sb->events_cleared); strlcpy(bitmap->mddev->bitmap_info.cluster_name, sb->cluster_name, 64); err = 0; out: kunmap_atomic(sb); /* Assiging chunksize is required for "re_read" */ bitmap->mddev->bitmap_info.chunksize = chunksize; if (nodes && (bitmap->cluster_slot < 0)) { err = md_setup_cluster(bitmap->mddev, nodes); if (err) { pr_err("%s: Could not setup cluster service (%d)\n", bmname(bitmap), err); goto out_no_sb; } bitmap->cluster_slot = md_cluster_ops->slot_number(bitmap->mddev); goto re_read; } out_no_sb: if (test_bit(BITMAP_STALE, &bitmap->flags)) bitmap->events_cleared = bitmap->mddev->events; bitmap->mddev->bitmap_info.chunksize = chunksize; bitmap->mddev->bitmap_info.daemon_sleep = daemon_sleep; bitmap->mddev->bitmap_info.max_write_behind = write_behind; bitmap->mddev->bitmap_info.nodes = nodes; if (bitmap->mddev->bitmap_info.space == 0 || bitmap->mddev->bitmap_info.space > sectors_reserved) bitmap->mddev->bitmap_info.space = sectors_reserved; if (err) if (err) { bitmap_print_sb(bitmap); if (bitmap->cluster_slot < 0) md_cluster_stop(bitmap->mddev); } return err; } Loading Loading @@ -692,9 +737,10 @@ static inline struct page *filemap_get_page(struct bitmap_storage *store, } static int bitmap_storage_alloc(struct bitmap_storage *store, unsigned long chunks, int with_super) unsigned long chunks, int with_super, int slot_number) { int pnum; int pnum, offset = 0; unsigned long num_pages; unsigned long bytes; Loading @@ -703,6 +749,7 @@ static int bitmap_storage_alloc(struct bitmap_storage *store, bytes += sizeof(bitmap_super_t); num_pages = DIV_ROUND_UP(bytes, PAGE_SIZE); offset = slot_number * (num_pages - 1); store->filemap = kmalloc(sizeof(struct page *) * num_pages, GFP_KERNEL); Loading @@ -713,20 +760,22 @@ static int bitmap_storage_alloc(struct bitmap_storage *store, store->sb_page = alloc_page(GFP_KERNEL|__GFP_ZERO); if (store->sb_page == NULL) return -ENOMEM; store->sb_page->index = 0; } pnum = 0; if (store->sb_page) { store->filemap[0] = store->sb_page; pnum = 1; store->sb_page->index = offset; } for ( ; pnum < num_pages; pnum++) { store->filemap[pnum] = alloc_page(GFP_KERNEL|__GFP_ZERO); if (!store->filemap[pnum]) { store->file_pages = pnum; return -ENOMEM; } store->filemap[pnum]->index = pnum; store->filemap[pnum]->index = pnum + offset; } store->file_pages = pnum; Loading Loading @@ -885,6 +934,28 @@ static void bitmap_file_clear_bit(struct bitmap *bitmap, sector_t block) } } static int bitmap_file_test_bit(struct bitmap *bitmap, sector_t block) { unsigned long bit; struct page *page; void *paddr; unsigned long chunk = block >> bitmap->counts.chunkshift; int set = 0; page = filemap_get_page(&bitmap->storage, chunk); if (!page) return -EINVAL; bit = file_page_offset(&bitmap->storage, chunk); paddr = kmap_atomic(page); if (test_bit(BITMAP_HOSTENDIAN, &bitmap->flags)) set = test_bit(bit, paddr); else set = test_bit_le(bit, paddr); kunmap_atomic(paddr); return set; } /* this gets called when the md device is ready to unplug its underlying * (slave) device queues -- before we let any writes go down, we need to * sync the dirty pages of the bitmap file to disk */ Loading Loading @@ -935,7 +1006,7 @@ static void bitmap_set_memory_bits(struct bitmap *bitmap, sector_t offset, int n */ static int bitmap_init_from_disk(struct bitmap *bitmap, sector_t start) { unsigned long i, chunks, index, oldindex, bit; unsigned long i, chunks, index, oldindex, bit, node_offset = 0; struct page *page = NULL; unsigned long bit_cnt = 0; struct file *file; Loading Loading @@ -981,6 +1052,9 @@ static int bitmap_init_from_disk(struct bitmap *bitmap, sector_t start) if (!bitmap->mddev->bitmap_info.external) offset = sizeof(bitmap_super_t); if (mddev_is_clustered(bitmap->mddev)) node_offset = bitmap->cluster_slot * (DIV_ROUND_UP(store->bytes, PAGE_SIZE)); for (i = 0; i < chunks; i++) { int b; index = file_page_index(&bitmap->storage, i); Loading @@ -1001,7 +1075,7 @@ static int bitmap_init_from_disk(struct bitmap *bitmap, sector_t start) bitmap->mddev, bitmap->mddev->bitmap_info.offset, page, index, count); index + node_offset, count); if (ret) goto err; Loading Loading @@ -1207,7 +1281,6 @@ void bitmap_daemon_work(struct mddev *mddev) j < bitmap->storage.file_pages && !test_bit(BITMAP_STALE, &bitmap->flags); j++) { if (test_page_attr(bitmap, j, BITMAP_PAGE_DIRTY)) /* bitmap_unplug will handle the rest */ Loading Loading @@ -1530,11 +1603,13 @@ static void bitmap_set_memory_bits(struct bitmap *bitmap, sector_t offset, int n return; } if (!*bmc) { *bmc = 2 | (needed ? NEEDED_MASK : 0); *bmc = 2; bitmap_count_page(&bitmap->counts, offset, 1); bitmap_set_pending(&bitmap->counts, offset); bitmap->allclean = 0; } if (needed) *bmc |= NEEDED_MASK; spin_unlock_irq(&bitmap->counts.lock); } Loading Loading @@ -1591,6 +1666,10 @@ static void bitmap_free(struct bitmap *bitmap) if (!bitmap) /* there was no bitmap */ return; if (mddev_is_clustered(bitmap->mddev) && bitmap->mddev->cluster_info && bitmap->cluster_slot == md_cluster_ops->slot_number(bitmap->mddev)) md_cluster_stop(bitmap->mddev); /* Shouldn't be needed - but just in case.... */ wait_event(bitmap->write_wait, atomic_read(&bitmap->pending_writes) == 0); Loading Loading @@ -1636,7 +1715,7 @@ void bitmap_destroy(struct mddev *mddev) * initialize the bitmap structure * if this returns an error, bitmap_destroy must be called to do clean up */ int bitmap_create(struct mddev *mddev) struct bitmap *bitmap_create(struct mddev *mddev, int slot) { struct bitmap *bitmap; sector_t blocks = mddev->resync_max_sectors; Loading @@ -1650,7 +1729,7 @@ int bitmap_create(struct mddev *mddev) bitmap = kzalloc(sizeof(*bitmap), GFP_KERNEL); if (!bitmap) return -ENOMEM; return ERR_PTR(-ENOMEM); spin_lock_init(&bitmap->counts.lock); atomic_set(&bitmap->pending_writes, 0); Loading @@ -1659,6 +1738,7 @@ int bitmap_create(struct mddev *mddev) init_waitqueue_head(&bitmap->behind_wait); bitmap->mddev = mddev; bitmap->cluster_slot = slot; if (mddev->kobj.sd) bm = sysfs_get_dirent(mddev->kobj.sd, "bitmap"); Loading Loading @@ -1706,12 +1786,14 @@ int bitmap_create(struct mddev *mddev) printk(KERN_INFO "created bitmap (%lu pages) for device %s\n", bitmap->counts.pages, bmname(bitmap)); mddev->bitmap = bitmap; return test_bit(BITMAP_WRITE_ERROR, &bitmap->flags) ? -EIO : 0; err = test_bit(BITMAP_WRITE_ERROR, &bitmap->flags) ? -EIO : 0; if (err) goto error; return bitmap; error: bitmap_free(bitmap); return err; return ERR_PTR(err); } int bitmap_load(struct mddev *mddev) Loading Loading @@ -1765,6 +1847,60 @@ int bitmap_load(struct mddev *mddev) } EXPORT_SYMBOL_GPL(bitmap_load); /* Loads the bitmap associated with slot and copies the resync information * to our bitmap */ int bitmap_copy_from_slot(struct mddev *mddev, int slot, sector_t *low, sector_t *high, bool clear_bits) { int rv = 0, i, j; sector_t block, lo = 0, hi = 0; struct bitmap_counts *counts; struct bitmap *bitmap = bitmap_create(mddev, slot); if (IS_ERR(bitmap)) return PTR_ERR(bitmap); rv = bitmap_read_sb(bitmap); if (rv) goto err; rv = bitmap_init_from_disk(bitmap, 0); if (rv) goto err; counts = &bitmap->counts; for (j = 0; j < counts->chunks; j++) { block = (sector_t)j << counts->chunkshift; if (bitmap_file_test_bit(bitmap, block)) { if (!lo) lo = block; hi = block; bitmap_file_clear_bit(bitmap, block); bitmap_set_memory_bits(mddev->bitmap, block, 1); bitmap_file_set_bit(mddev->bitmap, block); } } if (clear_bits) { bitmap_update_sb(bitmap); /* Setting this for the ev_page should be enough. * And we do not require both write_all and PAGE_DIRT either */ for (i = 0; i < bitmap->storage.file_pages; i++) set_page_attr(bitmap, i, BITMAP_PAGE_DIRTY); bitmap_write_all(bitmap); bitmap_unplug(bitmap); } *low = lo; *high = hi; err: bitmap_free(bitmap); return rv; } EXPORT_SYMBOL_GPL(bitmap_copy_from_slot); void bitmap_status(struct seq_file *seq, struct bitmap *bitmap) { unsigned long chunk_kb; Loading Loading @@ -1849,7 +1985,8 @@ int bitmap_resize(struct bitmap *bitmap, sector_t blocks, memset(&store, 0, sizeof(store)); if (bitmap->mddev->bitmap_info.offset || bitmap->mddev->bitmap_info.file) ret = bitmap_storage_alloc(&store, chunks, !bitmap->mddev->bitmap_info.external); !bitmap->mddev->bitmap_info.external, bitmap->cluster_slot); if (ret) goto err; Loading Loading @@ -2021,14 +2158,19 @@ location_store(struct mddev *mddev, const char *buf, size_t len) return -EINVAL; mddev->bitmap_info.offset = offset; if (mddev->pers) { struct bitmap *bitmap; mddev->pers->quiesce(mddev, 1); rv = bitmap_create(mddev); if (!rv) bitmap = bitmap_create(mddev, -1); if (IS_ERR(bitmap)) rv = PTR_ERR(bitmap); else { mddev->bitmap = bitmap; rv = bitmap_load(mddev); if (rv) { bitmap_destroy(mddev); mddev->bitmap_info.offset = 0; } } mddev->pers->quiesce(mddev, 0); if (rv) return rv; Loading Loading @@ -2186,6 +2328,8 @@ __ATTR(chunksize, S_IRUGO|S_IWUSR, chunksize_show, chunksize_store); static ssize_t metadata_show(struct mddev *mddev, char *page) { if (mddev_is_clustered(mddev)) return sprintf(page, "clustered\n"); return sprintf(page, "%s\n", (mddev->bitmap_info.external ? "external" : "internal")); } Loading @@ -2198,7 +2342,8 @@ static ssize_t metadata_store(struct mddev *mddev, const char *buf, size_t len) return -EBUSY; if (strncmp(buf, "external", 8) == 0) mddev->bitmap_info.external = 1; else if (strncmp(buf, "internal", 8) == 0) else if ((strncmp(buf, "internal", 8) == 0) || (strncmp(buf, "clustered", 9) == 0)) mddev->bitmap_info.external = 0; else return -EINVAL; Loading drivers/md/bitmap.h +7 −3 Original line number Diff line number Diff line Loading @@ -130,8 +130,9 @@ typedef struct bitmap_super_s { __le32 write_behind; /* 60 number of outstanding write-behind writes */ __le32 sectors_reserved; /* 64 number of 512-byte sectors that are * reserved for the bitmap. */ __u8 pad[256 - 68]; /* set to zero */ __le32 nodes; /* 68 the maximum number of nodes in cluster. */ __u8 cluster_name[64]; /* 72 cluster name to which this md belongs */ __u8 pad[256 - 136]; /* set to zero */ } bitmap_super_t; /* notes: Loading Loading @@ -226,12 +227,13 @@ struct bitmap { wait_queue_head_t behind_wait; struct kernfs_node *sysfs_can_clear; int cluster_slot; /* Slot offset for clustered env */ }; /* the bitmap API */ /* these are used only by md/bitmap */ int bitmap_create(struct mddev *mddev); struct bitmap *bitmap_create(struct mddev *mddev, int slot); int bitmap_load(struct mddev *mddev); void bitmap_flush(struct mddev *mddev); void bitmap_destroy(struct mddev *mddev); Loading Loading @@ -260,6 +262,8 @@ void bitmap_daemon_work(struct mddev *mddev); int bitmap_resize(struct bitmap *bitmap, sector_t blocks, int chunksize, int init); int bitmap_copy_from_slot(struct mddev *mddev, int slot, sector_t *lo, sector_t *hi, bool clear_bits); #endif #endif Loading
Documentation/md-cluster.txt 0 → 100644 +176 −0 Original line number Diff line number Diff line The cluster MD is a shared-device RAID for a cluster. 1. On-disk format Separate write-intent-bitmap are used for each cluster node. The bitmaps record all writes that may have been started on that node, and may not yet have finished. The on-disk layout is: 0 4k 8k 12k ------------------------------------------------------------------- | idle | md super | bm super [0] + bits | | bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] | | bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits | | bm bits [3, contd] | | | During "normal" functioning we assume the filesystem ensures that only one node writes to any given block at a time, so a write request will - set the appropriate bit (if not already set) - commit the write to all mirrors - schedule the bit to be cleared after a timeout. Reads are just handled normally. It is up to the filesystem to ensure one node doesn't read from a location where another node (or the same node) is writing. 2. DLM Locks for management There are two locks for managing the device: 2.1 Bitmap lock resource (bm_lockres) The bm_lockres protects individual node bitmaps. They are named in the form bitmap001 for node 1, bitmap002 for node and so on. When a node joins the cluster, it acquires the lock in PW mode and it stays so during the lifetime the node is part of the cluster. The lock resource number is based on the slot number returned by the DLM subsystem. Since DLM starts node count from one and bitmap slots start from zero, one is subtracted from the DLM slot number to arrive at the bitmap slot number. 3. Communication Each node has to communicate with other nodes when starting or ending resync, and metadata superblock updates. 3.1 Message Types There are 3 types, of messages which are passed 3.1.1 METADATA_UPDATED: informs other nodes that the metadata has been updated, and the node must re-read the md superblock. This is performed synchronously. 3.1.2 RESYNC: informs other nodes that a resync is initiated or ended so that each node may suspend or resume the region. 3.2 Communication mechanism The DLM LVB is used to communicate within nodes of the cluster. There are three resources used for the purpose: 3.2.1 Token: The resource which protects the entire communication system. The node having the token resource is allowed to communicate. 3.2.2 Message: The lock resource which carries the data to communicate. 3.2.3 Ack: The resource, acquiring which means the message has been acknowledged by all nodes in the cluster. The BAST of the resource is used to inform the receive node that a node wants to communicate. The algorithm is: 1. receive status sender receiver receiver ACK:CR ACK:CR ACK:CR 2. sender get EX of TOKEN sender get EX of MESSAGE sender receiver receiver TOKEN:EX ACK:CR ACK:CR MESSAGE:EX ACK:CR Sender checks that it still needs to send a message. Messages received or other events that happened while waiting for the TOKEN may have made this message inappropriate or redundant. 3. sender write LVB. sender down-convert MESSAGE from EX to CR sender try to get EX of ACK [ wait until all receiver has *processed* the MESSAGE ] [ triggered by bast of ACK ] receiver get CR of MESSAGE receiver read LVB receiver processes the message [ wait finish ] receiver release ACK sender receiver receiver TOKEN:EX MESSAGE:CR MESSAGE:CR MESSAGE:CR ACK:EX 4. triggered by grant of EX on ACK (indicating all receivers have processed message) sender down-convert ACK from EX to CR sender release MESSAGE sender release TOKEN receiver upconvert to EX of MESSAGE receiver get CR of ACK receiver release MESSAGE sender receiver receiver ACK:CR ACK:CR ACK:CR 4. Handling Failures 4.1 Node Failure When a node fails, the DLM informs the cluster with the slot. The node starts a cluster recovery thread. The cluster recovery thread: - acquires the bitmap<number> lock of the failed node - opens the bitmap - reads the bitmap of the failed node - copies the set bitmap to local node - cleans the bitmap of the failed node - releases bitmap<number> lock of the failed node - initiates resync of the bitmap on the current node The resync process, is the regular md resync. However, in a clustered environment when a resync is performed, it needs to tell other nodes of the areas which are suspended. Before a resync starts, the node send out RESYNC_START with the (lo,hi) range of the area which needs to be suspended. Each node maintains a suspend_list, which contains the list of ranges which are currently suspended. On receiving RESYNC_START, the node adds the range to the suspend_list. Similarly, when the node performing resync finishes, it send RESYNC_FINISHED to other nodes and other nodes remove the corresponding entry from the suspend_list. A helper function, should_suspend() can be used to check if a particular I/O range should be suspended or not. 4.2 Device Failure Device failures are handled and communicated with the metadata update routine. 5. Adding a new Device For adding a new device, it is necessary that all nodes "see" the new device to be added. For this, the following algorithm is used: 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues ioctl(ADD_NEW_DISC with disc.state set to MD_DISK_CLUSTER_ADD) 2. Node 1 sends NEWDISK with uuid and slot number 3. Other nodes issue kobject_uevent_env with uuid and slot number (Steps 4,5 could be a udev rule) 4. In userspace, the node searches for the disk, perhaps using blkid -t SUB_UUID="" 5. Other nodes issue either of the following depending on whether the disk was found: ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and disc.number set to slot number) ioctl(CLUSTERED_DISK_NACK) 6. Other nodes drop lock on no-new-devs (CR) if device is found 7. Node 1 attempts EX lock on no-new-devs 8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk as SpareLocal 9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED 10. Other nodes get the information whether a disk is added or not by the following METADATA_UPDATED.
drivers/md/Kconfig +16 −0 Original line number Diff line number Diff line Loading @@ -175,6 +175,22 @@ config MD_FAULTY In unsure, say N. config MD_CLUSTER tristate "Cluster Support for MD (EXPERIMENTAL)" depends on BLK_DEV_MD depends on DLM default n ---help--- Clustering support for MD devices. This enables locking and synchronization across multiple systems on the cluster, so all nodes in the cluster can access the MD devices simultaneously. This brings the redundancy (and uptime) of RAID levels across the nodes of the cluster. If unsure, say N. source "drivers/md/bcache/Kconfig" config BLK_DEV_DM_BUILTIN Loading
drivers/md/Makefile +1 −0 Original line number Diff line number Diff line Loading @@ -30,6 +30,7 @@ obj-$(CONFIG_MD_RAID10) += raid10.o obj-$(CONFIG_MD_RAID456) += raid456.o obj-$(CONFIG_MD_MULTIPATH) += multipath.o obj-$(CONFIG_MD_FAULTY) += faulty.o obj-$(CONFIG_MD_CLUSTER) += md-cluster.o obj-$(CONFIG_BCACHE) += bcache/ obj-$(CONFIG_BLK_DEV_MD) += md-mod.o obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o Loading
drivers/md/bitmap.c +167 −22 Original line number Diff line number Diff line Loading @@ -205,6 +205,10 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait) struct block_device *bdev; struct mddev *mddev = bitmap->mddev; struct bitmap_storage *store = &bitmap->storage; int node_offset = 0; if (mddev_is_clustered(bitmap->mddev)) node_offset = bitmap->cluster_slot * store->file_pages; while ((rdev = next_active_rdev(rdev, mddev)) != NULL) { int size = PAGE_SIZE; Loading Loading @@ -433,6 +437,7 @@ void bitmap_update_sb(struct bitmap *bitmap) /* This might have been changed by a reshape */ sb->sync_size = cpu_to_le64(bitmap->mddev->resync_max_sectors); sb->chunksize = cpu_to_le32(bitmap->mddev->bitmap_info.chunksize); sb->nodes = cpu_to_le32(bitmap->mddev->bitmap_info.nodes); sb->sectors_reserved = cpu_to_le32(bitmap->mddev-> bitmap_info.space); kunmap_atomic(sb); Loading Loading @@ -544,6 +549,7 @@ static int bitmap_read_sb(struct bitmap *bitmap) bitmap_super_t *sb; unsigned long chunksize, daemon_sleep, write_behind; unsigned long long events; int nodes = 0; unsigned long sectors_reserved = 0; int err = -EINVAL; struct page *sb_page; Loading @@ -562,6 +568,22 @@ static int bitmap_read_sb(struct bitmap *bitmap) return -ENOMEM; bitmap->storage.sb_page = sb_page; re_read: /* If cluster_slot is set, the cluster is setup */ if (bitmap->cluster_slot >= 0) { sector_t bm_blocks = bitmap->mddev->resync_max_sectors; sector_div(bm_blocks, bitmap->mddev->bitmap_info.chunksize >> 9); /* bits to bytes */ bm_blocks = ((bm_blocks+7) >> 3) + sizeof(bitmap_super_t); /* to 4k blocks */ bm_blocks = DIV_ROUND_UP_SECTOR_T(bm_blocks, 4096); bitmap->mddev->bitmap_info.offset += bitmap->cluster_slot * (bm_blocks << 3); pr_info("%s:%d bm slot: %d offset: %llu\n", __func__, __LINE__, bitmap->cluster_slot, (unsigned long long)bitmap->mddev->bitmap_info.offset); } if (bitmap->storage.file) { loff_t isize = i_size_read(bitmap->storage.file->f_mapping->host); int bytes = isize > PAGE_SIZE ? PAGE_SIZE : isize; Loading @@ -577,12 +599,15 @@ static int bitmap_read_sb(struct bitmap *bitmap) if (err) return err; err = -EINVAL; sb = kmap_atomic(sb_page); chunksize = le32_to_cpu(sb->chunksize); daemon_sleep = le32_to_cpu(sb->daemon_sleep) * HZ; write_behind = le32_to_cpu(sb->write_behind); sectors_reserved = le32_to_cpu(sb->sectors_reserved); nodes = le32_to_cpu(sb->nodes); strlcpy(bitmap->mddev->bitmap_info.cluster_name, sb->cluster_name, 64); /* verify that the bitmap-specific fields are valid */ if (sb->magic != cpu_to_le32(BITMAP_MAGIC)) Loading Loading @@ -619,7 +644,7 @@ static int bitmap_read_sb(struct bitmap *bitmap) goto out; } events = le64_to_cpu(sb->events); if (events < bitmap->mddev->events) { if (!nodes && (events < bitmap->mddev->events)) { printk(KERN_INFO "%s: bitmap file is out of date (%llu < %llu) " "-- forcing full recovery\n", Loading @@ -634,20 +659,40 @@ static int bitmap_read_sb(struct bitmap *bitmap) if (le32_to_cpu(sb->version) == BITMAP_MAJOR_HOSTENDIAN) set_bit(BITMAP_HOSTENDIAN, &bitmap->flags); bitmap->events_cleared = le64_to_cpu(sb->events_cleared); strlcpy(bitmap->mddev->bitmap_info.cluster_name, sb->cluster_name, 64); err = 0; out: kunmap_atomic(sb); /* Assiging chunksize is required for "re_read" */ bitmap->mddev->bitmap_info.chunksize = chunksize; if (nodes && (bitmap->cluster_slot < 0)) { err = md_setup_cluster(bitmap->mddev, nodes); if (err) { pr_err("%s: Could not setup cluster service (%d)\n", bmname(bitmap), err); goto out_no_sb; } bitmap->cluster_slot = md_cluster_ops->slot_number(bitmap->mddev); goto re_read; } out_no_sb: if (test_bit(BITMAP_STALE, &bitmap->flags)) bitmap->events_cleared = bitmap->mddev->events; bitmap->mddev->bitmap_info.chunksize = chunksize; bitmap->mddev->bitmap_info.daemon_sleep = daemon_sleep; bitmap->mddev->bitmap_info.max_write_behind = write_behind; bitmap->mddev->bitmap_info.nodes = nodes; if (bitmap->mddev->bitmap_info.space == 0 || bitmap->mddev->bitmap_info.space > sectors_reserved) bitmap->mddev->bitmap_info.space = sectors_reserved; if (err) if (err) { bitmap_print_sb(bitmap); if (bitmap->cluster_slot < 0) md_cluster_stop(bitmap->mddev); } return err; } Loading Loading @@ -692,9 +737,10 @@ static inline struct page *filemap_get_page(struct bitmap_storage *store, } static int bitmap_storage_alloc(struct bitmap_storage *store, unsigned long chunks, int with_super) unsigned long chunks, int with_super, int slot_number) { int pnum; int pnum, offset = 0; unsigned long num_pages; unsigned long bytes; Loading @@ -703,6 +749,7 @@ static int bitmap_storage_alloc(struct bitmap_storage *store, bytes += sizeof(bitmap_super_t); num_pages = DIV_ROUND_UP(bytes, PAGE_SIZE); offset = slot_number * (num_pages - 1); store->filemap = kmalloc(sizeof(struct page *) * num_pages, GFP_KERNEL); Loading @@ -713,20 +760,22 @@ static int bitmap_storage_alloc(struct bitmap_storage *store, store->sb_page = alloc_page(GFP_KERNEL|__GFP_ZERO); if (store->sb_page == NULL) return -ENOMEM; store->sb_page->index = 0; } pnum = 0; if (store->sb_page) { store->filemap[0] = store->sb_page; pnum = 1; store->sb_page->index = offset; } for ( ; pnum < num_pages; pnum++) { store->filemap[pnum] = alloc_page(GFP_KERNEL|__GFP_ZERO); if (!store->filemap[pnum]) { store->file_pages = pnum; return -ENOMEM; } store->filemap[pnum]->index = pnum; store->filemap[pnum]->index = pnum + offset; } store->file_pages = pnum; Loading Loading @@ -885,6 +934,28 @@ static void bitmap_file_clear_bit(struct bitmap *bitmap, sector_t block) } } static int bitmap_file_test_bit(struct bitmap *bitmap, sector_t block) { unsigned long bit; struct page *page; void *paddr; unsigned long chunk = block >> bitmap->counts.chunkshift; int set = 0; page = filemap_get_page(&bitmap->storage, chunk); if (!page) return -EINVAL; bit = file_page_offset(&bitmap->storage, chunk); paddr = kmap_atomic(page); if (test_bit(BITMAP_HOSTENDIAN, &bitmap->flags)) set = test_bit(bit, paddr); else set = test_bit_le(bit, paddr); kunmap_atomic(paddr); return set; } /* this gets called when the md device is ready to unplug its underlying * (slave) device queues -- before we let any writes go down, we need to * sync the dirty pages of the bitmap file to disk */ Loading Loading @@ -935,7 +1006,7 @@ static void bitmap_set_memory_bits(struct bitmap *bitmap, sector_t offset, int n */ static int bitmap_init_from_disk(struct bitmap *bitmap, sector_t start) { unsigned long i, chunks, index, oldindex, bit; unsigned long i, chunks, index, oldindex, bit, node_offset = 0; struct page *page = NULL; unsigned long bit_cnt = 0; struct file *file; Loading Loading @@ -981,6 +1052,9 @@ static int bitmap_init_from_disk(struct bitmap *bitmap, sector_t start) if (!bitmap->mddev->bitmap_info.external) offset = sizeof(bitmap_super_t); if (mddev_is_clustered(bitmap->mddev)) node_offset = bitmap->cluster_slot * (DIV_ROUND_UP(store->bytes, PAGE_SIZE)); for (i = 0; i < chunks; i++) { int b; index = file_page_index(&bitmap->storage, i); Loading @@ -1001,7 +1075,7 @@ static int bitmap_init_from_disk(struct bitmap *bitmap, sector_t start) bitmap->mddev, bitmap->mddev->bitmap_info.offset, page, index, count); index + node_offset, count); if (ret) goto err; Loading Loading @@ -1207,7 +1281,6 @@ void bitmap_daemon_work(struct mddev *mddev) j < bitmap->storage.file_pages && !test_bit(BITMAP_STALE, &bitmap->flags); j++) { if (test_page_attr(bitmap, j, BITMAP_PAGE_DIRTY)) /* bitmap_unplug will handle the rest */ Loading Loading @@ -1530,11 +1603,13 @@ static void bitmap_set_memory_bits(struct bitmap *bitmap, sector_t offset, int n return; } if (!*bmc) { *bmc = 2 | (needed ? NEEDED_MASK : 0); *bmc = 2; bitmap_count_page(&bitmap->counts, offset, 1); bitmap_set_pending(&bitmap->counts, offset); bitmap->allclean = 0; } if (needed) *bmc |= NEEDED_MASK; spin_unlock_irq(&bitmap->counts.lock); } Loading Loading @@ -1591,6 +1666,10 @@ static void bitmap_free(struct bitmap *bitmap) if (!bitmap) /* there was no bitmap */ return; if (mddev_is_clustered(bitmap->mddev) && bitmap->mddev->cluster_info && bitmap->cluster_slot == md_cluster_ops->slot_number(bitmap->mddev)) md_cluster_stop(bitmap->mddev); /* Shouldn't be needed - but just in case.... */ wait_event(bitmap->write_wait, atomic_read(&bitmap->pending_writes) == 0); Loading Loading @@ -1636,7 +1715,7 @@ void bitmap_destroy(struct mddev *mddev) * initialize the bitmap structure * if this returns an error, bitmap_destroy must be called to do clean up */ int bitmap_create(struct mddev *mddev) struct bitmap *bitmap_create(struct mddev *mddev, int slot) { struct bitmap *bitmap; sector_t blocks = mddev->resync_max_sectors; Loading @@ -1650,7 +1729,7 @@ int bitmap_create(struct mddev *mddev) bitmap = kzalloc(sizeof(*bitmap), GFP_KERNEL); if (!bitmap) return -ENOMEM; return ERR_PTR(-ENOMEM); spin_lock_init(&bitmap->counts.lock); atomic_set(&bitmap->pending_writes, 0); Loading @@ -1659,6 +1738,7 @@ int bitmap_create(struct mddev *mddev) init_waitqueue_head(&bitmap->behind_wait); bitmap->mddev = mddev; bitmap->cluster_slot = slot; if (mddev->kobj.sd) bm = sysfs_get_dirent(mddev->kobj.sd, "bitmap"); Loading Loading @@ -1706,12 +1786,14 @@ int bitmap_create(struct mddev *mddev) printk(KERN_INFO "created bitmap (%lu pages) for device %s\n", bitmap->counts.pages, bmname(bitmap)); mddev->bitmap = bitmap; return test_bit(BITMAP_WRITE_ERROR, &bitmap->flags) ? -EIO : 0; err = test_bit(BITMAP_WRITE_ERROR, &bitmap->flags) ? -EIO : 0; if (err) goto error; return bitmap; error: bitmap_free(bitmap); return err; return ERR_PTR(err); } int bitmap_load(struct mddev *mddev) Loading Loading @@ -1765,6 +1847,60 @@ int bitmap_load(struct mddev *mddev) } EXPORT_SYMBOL_GPL(bitmap_load); /* Loads the bitmap associated with slot and copies the resync information * to our bitmap */ int bitmap_copy_from_slot(struct mddev *mddev, int slot, sector_t *low, sector_t *high, bool clear_bits) { int rv = 0, i, j; sector_t block, lo = 0, hi = 0; struct bitmap_counts *counts; struct bitmap *bitmap = bitmap_create(mddev, slot); if (IS_ERR(bitmap)) return PTR_ERR(bitmap); rv = bitmap_read_sb(bitmap); if (rv) goto err; rv = bitmap_init_from_disk(bitmap, 0); if (rv) goto err; counts = &bitmap->counts; for (j = 0; j < counts->chunks; j++) { block = (sector_t)j << counts->chunkshift; if (bitmap_file_test_bit(bitmap, block)) { if (!lo) lo = block; hi = block; bitmap_file_clear_bit(bitmap, block); bitmap_set_memory_bits(mddev->bitmap, block, 1); bitmap_file_set_bit(mddev->bitmap, block); } } if (clear_bits) { bitmap_update_sb(bitmap); /* Setting this for the ev_page should be enough. * And we do not require both write_all and PAGE_DIRT either */ for (i = 0; i < bitmap->storage.file_pages; i++) set_page_attr(bitmap, i, BITMAP_PAGE_DIRTY); bitmap_write_all(bitmap); bitmap_unplug(bitmap); } *low = lo; *high = hi; err: bitmap_free(bitmap); return rv; } EXPORT_SYMBOL_GPL(bitmap_copy_from_slot); void bitmap_status(struct seq_file *seq, struct bitmap *bitmap) { unsigned long chunk_kb; Loading Loading @@ -1849,7 +1985,8 @@ int bitmap_resize(struct bitmap *bitmap, sector_t blocks, memset(&store, 0, sizeof(store)); if (bitmap->mddev->bitmap_info.offset || bitmap->mddev->bitmap_info.file) ret = bitmap_storage_alloc(&store, chunks, !bitmap->mddev->bitmap_info.external); !bitmap->mddev->bitmap_info.external, bitmap->cluster_slot); if (ret) goto err; Loading Loading @@ -2021,14 +2158,19 @@ location_store(struct mddev *mddev, const char *buf, size_t len) return -EINVAL; mddev->bitmap_info.offset = offset; if (mddev->pers) { struct bitmap *bitmap; mddev->pers->quiesce(mddev, 1); rv = bitmap_create(mddev); if (!rv) bitmap = bitmap_create(mddev, -1); if (IS_ERR(bitmap)) rv = PTR_ERR(bitmap); else { mddev->bitmap = bitmap; rv = bitmap_load(mddev); if (rv) { bitmap_destroy(mddev); mddev->bitmap_info.offset = 0; } } mddev->pers->quiesce(mddev, 0); if (rv) return rv; Loading Loading @@ -2186,6 +2328,8 @@ __ATTR(chunksize, S_IRUGO|S_IWUSR, chunksize_show, chunksize_store); static ssize_t metadata_show(struct mddev *mddev, char *page) { if (mddev_is_clustered(mddev)) return sprintf(page, "clustered\n"); return sprintf(page, "%s\n", (mddev->bitmap_info.external ? "external" : "internal")); } Loading @@ -2198,7 +2342,8 @@ static ssize_t metadata_store(struct mddev *mddev, const char *buf, size_t len) return -EBUSY; if (strncmp(buf, "external", 8) == 0) mddev->bitmap_info.external = 1; else if (strncmp(buf, "internal", 8) == 0) else if ((strncmp(buf, "internal", 8) == 0) || (strncmp(buf, "clustered", 9) == 0)) mddev->bitmap_info.external = 0; else return -EINVAL; Loading
drivers/md/bitmap.h +7 −3 Original line number Diff line number Diff line Loading @@ -130,8 +130,9 @@ typedef struct bitmap_super_s { __le32 write_behind; /* 60 number of outstanding write-behind writes */ __le32 sectors_reserved; /* 64 number of 512-byte sectors that are * reserved for the bitmap. */ __u8 pad[256 - 68]; /* set to zero */ __le32 nodes; /* 68 the maximum number of nodes in cluster. */ __u8 cluster_name[64]; /* 72 cluster name to which this md belongs */ __u8 pad[256 - 136]; /* set to zero */ } bitmap_super_t; /* notes: Loading Loading @@ -226,12 +227,13 @@ struct bitmap { wait_queue_head_t behind_wait; struct kernfs_node *sysfs_can_clear; int cluster_slot; /* Slot offset for clustered env */ }; /* the bitmap API */ /* these are used only by md/bitmap */ int bitmap_create(struct mddev *mddev); struct bitmap *bitmap_create(struct mddev *mddev, int slot); int bitmap_load(struct mddev *mddev); void bitmap_flush(struct mddev *mddev); void bitmap_destroy(struct mddev *mddev); Loading Loading @@ -260,6 +262,8 @@ void bitmap_daemon_work(struct mddev *mddev); int bitmap_resize(struct bitmap *bitmap, sector_t blocks, int chunksize, int init); int bitmap_copy_from_slot(struct mddev *mddev, int slot, sector_t *lo, sector_t *hi, bool clear_bits); #endif #endif