Donate to e Foundation | Murena handsets with /e/OS | Own a part of Murena! Learn more

Commit 8d370595 authored by Linus Torvalds's avatar Linus Torvalds
Browse files
Pull xfs and iomap updates from Dave Chinner:
 "The main things in this update are the iomap-based DAX infrastructure,
  an XFS delalloc rework, and a chunk of fixes to how log recovery
  schedules writeback to prevent spurious corruption detections when
  recovery of certain items was not required.

  The other main chunk of code is some preparation for the upcoming
  reflink functionality. Most of it is generic and cleanups that stand
  alone, but they were ready and reviewed so are in this pull request.

  Speaking of reflink, I'm currently planning to send you another pull
  request next week containing all the new reflink functionality. I'm
  working through a similar process to the last cycle, where I sent the
  reverse mapping code in a separate request because of how large it
  was. The reflink code merge is even bigger than reverse mapping, so
  I'll be doing the same thing again....

  Summary for this update:

   - change of XFS mailing list to linux-xfs@vger.kernel.org

   - iomap-based DAX infrastructure w/ XFS and ext2 support

   - small iomap fixes and additions

   - more efficient XFS delayed allocation infrastructure based on iomap

   - a rework of log recovery writeback scheduling to ensure we don't
     fail recovery when trying to replay items that are already on disk

   - some preparation patches for upcoming reflink support

   - configurable error handling fixes and documentation

   - aio access time update race fixes for XFS and
     generic_file_read_iter"

* tag 'xfs-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (40 commits)
  fs: update atime before I/O in generic_file_read_iter
  xfs: update atime before I/O in xfs_file_dio_aio_read
  ext2: fix possible integer truncation in ext2_iomap_begin
  xfs: log recovery tracepoints to track current lsn and buffer submission
  xfs: update metadata LSN in buffers during log recovery
  xfs: don't warn on buffers not being recovered due to LSN
  xfs: pass current lsn to log recovery buffer validation
  xfs: rework log recovery to submit buffers on LSN boundaries
  xfs: quiesce the filesystem after recovery on readonly mount
  xfs: remote attribute blocks aren't really userdata
  ext2: use iomap to implement DAX
  ext2: stop passing buffer_head to ext2_get_blocks
  xfs: use iomap to implement DAX
  xfs: refactor xfs_setfilesize
  xfs: take the ilock shared if possible in xfs_file_iomap_begin
  xfs: fix locking for DAX writes
  dax: provide an iomap based fault handler
  dax: provide an iomap based dax read/write path
  dax: don't pass buffer_head to copy_user_dax
  dax: don't pass buffer_head to dax_insert_mapping
  ...
parents d230ec72 155cd433
Loading
Loading
Loading
Loading
+123 −0
Original line number Diff line number Diff line
@@ -348,3 +348,126 @@ Removed Sysctls
  ----				-------
  fs.xfs.xfsbufd_centisec	v4.0
  fs.xfs.age_buffer_centisecs	v4.0


Error handling
==============

XFS can act differently according to the type of error found during its
operation. The implementation introduces the following concepts to the error
handler:

 -failure speed:
	Defines how fast XFS should propagate an error upwards when a specific
	error is found during the filesystem operation. It can propagate
	immediately, after a defined number of retries, after a set time period,
	or simply retry forever.

 -error classes:
	Specifies the subsystem the error configuration will apply to, such as
	metadata IO or memory allocation. Different subsystems will have
	different error handlers for which behaviour can be configured.

 -error handlers:
	Defines the behavior for a specific error.

The filesystem behavior during an error can be set via sysfs files. Each
error handler works independently - the first condition met by an error handler
for a specific class will cause the error to be propagated rather than reset and
retried.

The action taken by the filesystem when the error is propagated is context
dependent - it may cause a shut down in the case of an unrecoverable error,
it may be reported back to userspace, or it may even be ignored because
there's nothing useful we can with the error or anyone we can report it to (e.g.
during unmount).

The configuration files are organized into the following hierarchy for each
mounted filesystem:

  /sys/fs/xfs/<dev>/error/<class>/<error>/

Where:
  <dev>
	The short device name of the mounted filesystem. This is the same device
	name that shows up in XFS kernel error messages as "XFS(<dev>): ..."

  <class>
	The subsystem the error configuration belongs to. As of 4.9, the defined
	classes are:

		- "metadata": applies metadata buffer write IO

  <error>
	The individual error handler configurations.


Each filesystem has "global" error configuration options defined in their top
level directory:

  /sys/fs/xfs/<dev>/error/

  fail_at_unmount		(Min:  0  Default:  1  Max: 1)
	Defines the filesystem error behavior at unmount time.

	If set to a value of 1, XFS will override all other error configurations
	during unmount and replace them with "immediate fail" characteristics.
	i.e. no retries, no retry timeout. This will always allow unmount to
	succeed when there are persistent errors present.

	If set to 0, the configured retry behaviour will continue until all
	retries and/or timeouts have been exhausted. This will delay unmount
	completion when there are persistent errors, and it may prevent the
	filesystem from ever unmounting fully in the case of "retry forever"
	handler configurations.

	Note: there is no guarantee that fail_at_unmount can be set whilst an
	unmount is in progress. It is possible that the sysfs entries are
	removed by the unmounting filesystem before a "retry forever" error
	handler configuration causes unmount to hang, and hence the filesystem
	must be configured appropriately before unmount begins to prevent
	unmount hangs.

Each filesystem has specific error class handlers that define the error
propagation behaviour for specific errors. There is also a "default" error
handler defined, which defines the behaviour for all errors that don't have
specific handlers defined. Where multiple retry constraints are configuredi for
a single error, the first retry configuration that expires will cause the error
to be propagated. The handler configurations are found in the directory:

  /sys/fs/xfs/<dev>/error/<class>/<error>/

  max_retries			(Min: -1  Default: Varies  Max: INTMAX)
	Defines the allowed number of retries of a specific error before
	the filesystem will propagate the error. The retry count for a given
	error context (e.g. a specific metadata buffer) is reset every time
	there is a successful completion of the operation.

	Setting the value to "-1" will cause XFS to retry forever for this
	specific error.

	Setting the value to "0" will cause XFS to fail immediately when the
	specific error is reported.

	Setting the value to "N" (where 0 < N < Max) will make XFS retry the
	operation "N" times before propagating the error.

  retry_timeout_seconds		(Min:  -1  Default:  Varies  Max: 1 day)
	Define the amount of time (in seconds) that the filesystem is
	allowed to retry its operations when the specific error is
	found.

	Setting the value to "-1" will allow XFS to retry forever for this
	specific error.

	Setting the value to "0" will cause XFS to fail immediately when the
	specific error is reported.

	Setting the value to "N" (where 0 < N < Max) will allow XFS to retry the
	operation for up to "N" seconds before propagating the error.

Note: The default behaviour for a specific error handler is dependent on both
the class and error context. For example, the default values for
"metadata/ENODEV" are "0" rather than "-1" so that this error handler defaults
to "fail immediately" behaviour. This is done because ENODEV is a fatal,
unrecoverable error no matter how many times the metadata IO is retried.
+3 −4
Original line number Diff line number Diff line
@@ -13099,11 +13099,10 @@ F: arch/x86/xen/*swiotlb*
F:	drivers/xen/*swiotlb*

XFS FILESYSTEM
P:	Silicon Graphics Inc
M:	Dave Chinner <david@fromorbit.com>
M:	xfs@oss.sgi.com
L:	xfs@oss.sgi.com
W:	http://oss.sgi.com/projects/xfs
M:	linux-xfs@vger.kernel.org
L:	linux-xfs@vger.kernel.org
W:	http://xfs.org/
T:	git git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git
S:	Supported
F:	Documentation/filesystems/xfs.txt
+240 −12
Original line number Diff line number Diff line
@@ -31,6 +31,8 @@
#include <linux/vmstat.h>
#include <linux/pfn_t.h>
#include <linux/sizes.h>
#include <linux/iomap.h>
#include "internal.h"

/*
 * We use lowest available bit in exceptional entry for locking, other two
@@ -580,14 +582,13 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
	return VM_FAULT_LOCKED;
}

static int copy_user_bh(struct page *to, struct inode *inode,
		struct buffer_head *bh, unsigned long vaddr)
static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size,
		struct page *to, unsigned long vaddr)
{
	struct blk_dax_ctl dax = {
		.sector = to_sector(bh, inode),
		.size = bh->b_size,
		.sector = sector,
		.size = size,
	};
	struct block_device *bdev = bh->b_bdev;
	void *vto;

	if (dax_map_atomic(bdev, &dax) < 0)
@@ -790,14 +791,13 @@ int dax_writeback_mapping_range(struct address_space *mapping,
EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);

static int dax_insert_mapping(struct address_space *mapping,
			struct buffer_head *bh, void **entryp,
			struct vm_area_struct *vma, struct vm_fault *vmf)
		struct block_device *bdev, sector_t sector, size_t size,
		void **entryp, struct vm_area_struct *vma, struct vm_fault *vmf)
{
	unsigned long vaddr = (unsigned long)vmf->virtual_address;
	struct block_device *bdev = bh->b_bdev;
	struct blk_dax_ctl dax = {
		.sector = to_sector(bh, mapping->host),
		.size = bh->b_size,
		.sector = sector,
		.size = size,
	};
	void *ret;
	void *entry = *entryp;
@@ -868,7 +868,8 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
	if (vmf->cow_page) {
		struct page *new_page = vmf->cow_page;
		if (buffer_written(&bh))
			error = copy_user_bh(new_page, inode, &bh, vaddr);
			error = copy_user_dax(bh.b_bdev, to_sector(&bh, inode),
					bh.b_size, new_page, vaddr);
		else
			clear_user_highpage(new_page, vaddr);
		if (error)
@@ -898,7 +899,8 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,

	/* Filesystem should not return unwritten buffers to us! */
	WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
	error = dax_insert_mapping(mapping, &bh, &entry, vma, vmf);
	error = dax_insert_mapping(mapping, bh.b_bdev, to_sector(&bh, inode),
			bh.b_size, &entry, vma, vmf);
 unlock_entry:
	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
 out:
@@ -1241,3 +1243,229 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
	return dax_zero_page_range(inode, from, length, get_block);
}
EXPORT_SYMBOL_GPL(dax_truncate_page);

#ifdef CONFIG_FS_IOMAP
static loff_t
iomap_dax_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
		struct iomap *iomap)
{
	struct iov_iter *iter = data;
	loff_t end = pos + length, done = 0;
	ssize_t ret = 0;

	if (iov_iter_rw(iter) == READ) {
		end = min(end, i_size_read(inode));
		if (pos >= end)
			return 0;

		if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_UNWRITTEN)
			return iov_iter_zero(min(length, end - pos), iter);
	}

	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
		return -EIO;

	while (pos < end) {
		unsigned offset = pos & (PAGE_SIZE - 1);
		struct blk_dax_ctl dax = { 0 };
		ssize_t map_len;

		dax.sector = iomap->blkno +
			(((pos & PAGE_MASK) - iomap->offset) >> 9);
		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
		map_len = dax_map_atomic(iomap->bdev, &dax);
		if (map_len < 0) {
			ret = map_len;
			break;
		}

		dax.addr += offset;
		map_len -= offset;
		if (map_len > end - pos)
			map_len = end - pos;

		if (iov_iter_rw(iter) == WRITE)
			map_len = copy_from_iter_pmem(dax.addr, map_len, iter);
		else
			map_len = copy_to_iter(dax.addr, map_len, iter);
		dax_unmap_atomic(iomap->bdev, &dax);
		if (map_len <= 0) {
			ret = map_len ? map_len : -EFAULT;
			break;
		}

		pos += map_len;
		length -= map_len;
		done += map_len;
	}

	return done ? done : ret;
}

/**
 * iomap_dax_rw - Perform I/O to a DAX file
 * @iocb:	The control block for this I/O
 * @iter:	The addresses to do I/O from or to
 * @ops:	iomap ops passed from the file system
 *
 * This function performs read and write operations to directly mapped
 * persistent memory.  The callers needs to take care of read/write exclusion
 * and evicting any page cache pages in the region under I/O.
 */
ssize_t
iomap_dax_rw(struct kiocb *iocb, struct iov_iter *iter,
		struct iomap_ops *ops)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;
	struct inode *inode = mapping->host;
	loff_t pos = iocb->ki_pos, ret = 0, done = 0;
	unsigned flags = 0;

	if (iov_iter_rw(iter) == WRITE)
		flags |= IOMAP_WRITE;

	/*
	 * Yes, even DAX files can have page cache attached to them:  A zeroed
	 * page is inserted into the pagecache when we have to serve a write
	 * fault on a hole.  It should never be dirtied and can simply be
	 * dropped from the pagecache once we get real data for the page.
	 *
	 * XXX: This is racy against mmap, and there's nothing we can do about
	 * it. We'll eventually need to shift this down even further so that
	 * we can check if we allocated blocks over a hole first.
	 */
	if (mapping->nrpages) {
		ret = invalidate_inode_pages2_range(mapping,
				pos >> PAGE_SHIFT,
				(pos + iov_iter_count(iter) - 1) >> PAGE_SHIFT);
		WARN_ON_ONCE(ret);
	}

	while (iov_iter_count(iter)) {
		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
				iter, iomap_dax_actor);
		if (ret <= 0)
			break;
		pos += ret;
		done += ret;
	}

	iocb->ki_pos += done;
	return done ? done : ret;
}
EXPORT_SYMBOL_GPL(iomap_dax_rw);

/**
 * iomap_dax_fault - handle a page fault on a DAX file
 * @vma: The virtual memory area where the fault occurred
 * @vmf: The description of the fault
 * @ops: iomap ops passed from the file system
 *
 * When a page fault occurs, filesystems may call this helper in their fault
 * or mkwrite handler for DAX files. Assumes the caller has done all the
 * necessary locking for the page fault to proceed successfully.
 */
int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
			struct iomap_ops *ops)
{
	struct address_space *mapping = vma->vm_file->f_mapping;
	struct inode *inode = mapping->host;
	unsigned long vaddr = (unsigned long)vmf->virtual_address;
	loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
	sector_t sector;
	struct iomap iomap = { 0 };
	unsigned flags = 0;
	int error, major = 0;
	void *entry;

	/*
	 * Check whether offset isn't beyond end of file now. Caller is supposed
	 * to hold locks serializing us with truncate / punch hole so this is
	 * a reliable test.
	 */
	if (pos >= i_size_read(inode))
		return VM_FAULT_SIGBUS;

	entry = grab_mapping_entry(mapping, vmf->pgoff);
	if (IS_ERR(entry)) {
		error = PTR_ERR(entry);
		goto out;
	}

	if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
		flags |= IOMAP_WRITE;

	/*
	 * Note that we don't bother to use iomap_apply here: DAX required
	 * the file system block size to be equal the page size, which means
	 * that we never have to deal with more than a single extent here.
	 */
	error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
	if (error)
		goto unlock_entry;
	if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
		error = -EIO;		/* fs corruption? */
		goto unlock_entry;
	}

	sector = iomap.blkno + (((pos & PAGE_MASK) - iomap.offset) >> 9);

	if (vmf->cow_page) {
		switch (iomap.type) {
		case IOMAP_HOLE:
		case IOMAP_UNWRITTEN:
			clear_user_highpage(vmf->cow_page, vaddr);
			break;
		case IOMAP_MAPPED:
			error = copy_user_dax(iomap.bdev, sector, PAGE_SIZE,
					vmf->cow_page, vaddr);
			break;
		default:
			WARN_ON_ONCE(1);
			error = -EIO;
			break;
		}

		if (error)
			goto unlock_entry;
		if (!radix_tree_exceptional_entry(entry)) {
			vmf->page = entry;
			return VM_FAULT_LOCKED;
		}
		vmf->entry = entry;
		return VM_FAULT_DAX_LOCKED;
	}

	switch (iomap.type) {
	case IOMAP_MAPPED:
		if (iomap.flags & IOMAP_F_NEW) {
			count_vm_event(PGMAJFAULT);
			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
			major = VM_FAULT_MAJOR;
		}
		error = dax_insert_mapping(mapping, iomap.bdev, sector,
				PAGE_SIZE, &entry, vma, vmf);
		break;
	case IOMAP_UNWRITTEN:
	case IOMAP_HOLE:
		if (!(vmf->flags & FAULT_FLAG_WRITE))
			return dax_load_hole(mapping, entry, vmf);
		/*FALLTHRU*/
	default:
		WARN_ON_ONCE(1);
		error = -EIO;
		break;
	}

 unlock_entry:
	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
 out:
	if (error == -ENOMEM)
		return VM_FAULT_OOM | major;
	/* -EBUSY is fine, somebody else faulted on the same PTE */
	if (error < 0 && error != -EBUSY)
		return VM_FAULT_SIGBUS | major;
	return VM_FAULT_NOPAGE | major;
}
EXPORT_SYMBOL_GPL(iomap_dax_fault);
#endif /* CONFIG_FS_IOMAP */
+1 −0
Original line number Diff line number Diff line
config EXT2_FS
	tristate "Second extended fs support"
	select FS_IOMAP if FS_DAX
	help
	  Ext2 is a standard Linux file system for hard disks.

+1 −0
Original line number Diff line number Diff line
@@ -814,6 +814,7 @@ extern const struct file_operations ext2_file_operations;
/* inode.c */
extern const struct address_space_operations ext2_aops;
extern const struct address_space_operations ext2_nobh_aops;
extern struct iomap_ops ext2_iomap_ops;

/* namei.c */
extern const struct inode_operations ext2_dir_inode_operations;
Loading