
Commit 968f3e37 authored by Linus Torvalds
Pull btrfs updates from Chris Mason:
 "We have a good sized cleanup of our internal read ahead code, and the
  first series of commits from Chandan to enable PAGE_SIZE > sectorsize

  Otherwise, it's a normal series of cleanups and fixes, with many
  thanks to Dave Sterba for doing most of the patch wrangling this time"

* 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (82 commits)
  btrfs: make sure we stay inside the bvec during __btrfs_lookup_bio_sums
  btrfs: Fix misspellings in comments.
  btrfs: Print Warning only if ENOSPC_DEBUG is enabled
  btrfs: scrub: silence an uninitialized variable warning
  btrfs: move btrfs_compression_type to compression.h
  btrfs: rename btrfs_print_info to btrfs_print_mod_info
  Btrfs: Show a warning message if one of objectid reaches its highest value
  Documentation: btrfs: remove usage specific information
  btrfs: use kbasename in btrfsic_mount
  Btrfs: do not collect ordered extents when logging that inode exists
  Btrfs: fix race when checking if we can skip fsync'ing an inode
  Btrfs: fix listxattrs not listing all xattrs packed in the same item
  Btrfs: fix deadlock between direct IO reads and buffered writes
  Btrfs: fix extent_same allowing destination offset beyond i_size
  Btrfs: fix file loss on log replay after renaming a file and fsync
  Btrfs: fix unreplayable log after snapshot delete + parent dir fsync
  Btrfs: fix lockdep deadlock warning due to dev_replace
  btrfs: drop unused argument in btrfs_ioctl_get_supported_features
  btrfs: add GET_SUPPORTED_FEATURES to the control device ioctls
  btrfs: change max_inline default to 2048
  ...
parents e531cdf5 389f239c
+11 −250

BTRFS
=====

Btrfs is a copy on write filesystem for Linux aimed at
implementing advanced features while focusing on fault tolerance,
repair and easy administration. Initially developed by Oracle, Btrfs
is licensed under the GPL and open for contribution from anyone.

Linux has a wealth of filesystems to choose from, but we are facing a
number of challenges with scaling to the large storage subsystems that
are becoming common in today's data centers. Filesystems need to scale
in their ability to address and manage large storage, and also in
their ability to detect, repair and tolerate errors in the data stored
on disk.  Btrfs is under heavy development, and is not suitable for
any uses other than benchmarking and review. The Btrfs disk format is
not yet finalized.
Btrfs is a copy on write filesystem for Linux aimed at implementing advanced
features while focusing on fault tolerance, repair and easy administration.
Jointly developed by several companies, licensed under the GPL and open for
contribution from anyone.

The main Btrfs features include:

@@ -28,243 +18,14 @@ The main Btrfs features include:
    * Checksums on data and metadata (multiple algorithms available)
    * Compression
    * Integrated multiple device support, with several raid algorithms
    * Online filesystem check (not yet implemented)
    * Very fast offline filesystem check
    * Efficient incremental backup and FS mirroring (not yet implemented)
    * Offline filesystem check
    * Efficient incremental backup and FS mirroring
    * Online filesystem defragmentation

For more information please refer to the wiki

Mount Options
=============

When mounting a btrfs filesystem, the following option are accepted.
Options with (*) are default options and will not show in the mount options.

  alloc_start=<bytes>
	Debugging option to force all block allocations above a certain
	byte threshold on each block device.  The value is specified in
	bytes, optionally with a K, M, or G suffix, case insensitive.
	Default is 1MB.

  noautodefrag(*)
  autodefrag
	Disable/enable auto defragmentation.
	Auto defragmentation detects small random writes into files and queue
	them up for the defrag process.  Works best for small files;
	Not well suited for large database workloads.

  check_int
  check_int_data
  check_int_print_mask=<value>
	These debugging options control the behavior of the integrity checking
	module (the BTRFS_FS_CHECK_INTEGRITY config option required).

	check_int enables the integrity checker module, which examines all
	block write requests to ensure on-disk consistency, at a large
	memory and CPU cost.

	check_int_data includes extent data in the integrity checks, and
	implies the check_int option.

	check_int_print_mask takes a bitmask of BTRFSIC_PRINT_MASK_* values
	as defined in fs/btrfs/check-integrity.c, to control the integrity
	checker module behavior.

	See comments at the top of fs/btrfs/check-integrity.c for more info.

  commit=<seconds>
	Set the interval of periodic commit, 30 seconds by default. Higher
	values defer data being synced to permanent storage with obvious
	consequences when the system crashes. The upper bound is not forced,
	but a warning is printed if it's more than 300 seconds (5 minutes).

  compress
  compress=<type>
  compress-force
  compress-force=<type>
	Control BTRFS file data compression.  Type may be specified as "zlib"
	"lzo" or "no" (for no compression, used for remounting).  If no type
	is specified, zlib is used.  If compress-force is specified,
	all files will be compressed, whether or not they compress well.
	If compression is enabled, nodatacow and nodatasum are disabled.

  degraded
	Allow mounts to continue with missing devices.  A read-write mount may
	fail with too many devices missing, for example if a stripe member
	is completely missing.

  device=<devicepath>
	Specify a device during mount so that ioctls on the control device
	can be avoided.  Especially useful when trying to mount a multi-device
	setup as root.  May be specified multiple times for multiple devices.

  nodiscard(*)
  discard
	Disable/enable discard mount option.
	Discard issues frequent commands to let the block device reclaim space
	freed by the filesystem.
	This is useful for SSD devices, thinly provisioned
	LUNs and virtual machine images, but may have a significant
	performance impact.  (The fstrim command is also available to
	initiate batch trims from userspace).

  noenospc_debug(*)
  enospc_debug
	Disable/enable debugging option to be more verbose in some ENOSPC conditions.

  fatal_errors=<action>
	Action to take when encountering a fatal error:
	  "bug" - BUG() on a fatal error.  This is the default.
	  "panic" - panic() on a fatal error.

  noflushoncommit(*)
  flushoncommit
	The 'flushoncommit' mount option forces any data dirtied by a write in a
	prior transaction to commit as part of the current commit.  This makes
	the committed state a fully consistent view of the file system from the
	application's perspective (i.e., it includes all completed file system
	operations).  This was previously the behavior only when a snapshot is
	created.

  inode_cache
	Enable free inode number caching.   Defaults to off due to an overflow
	problem when the free space crcs don't fit inside a single page.

  max_inline=<bytes>
	Specify the maximum amount of space, in bytes, that can be inlined in
	a metadata B-tree leaf.  The value is specified in bytes, optionally
	with a K, M, or G suffix, case insensitive.  In practice, this value
	is limited by the root sector size, with some space unavailable due
	to leaf headers.  For a 4k sector size, max inline data is ~3900 bytes.

  metadata_ratio=<value>
	Specify that 1 metadata chunk should be allocated after every <value>
	data chunks.  Off by default.

  acl(*)
  noacl
	Enable/disable support for Posix Access Control Lists (ACLs).  See the
	acl(5) manual page for more information about ACLs.

  barrier(*)
  nobarrier
        Enable/disable the use of block layer write barriers.  Write barriers
	ensure that certain IOs make it through the device cache and are on
	persistent storage. If disabled on a device with a volatile
	(non-battery-backed) write-back cache, nobarrier option will lead to
	filesystem corruption on a system crash or power loss.

  datacow(*)
  nodatacow
	Enable/disable data copy-on-write for newly created files.
	Nodatacow implies nodatasum, and disables all compression.

  datasum(*)
  nodatasum
	Enable/disable data checksumming for newly created files.
	Datasum implies datacow.

  treelog(*)
  notreelog
	Enable/disable the tree logging used for fsync and O_SYNC writes.

  recovery
	Enable autorecovery attempts if a bad tree root is found at mount time.
	Currently this scans a list of several previous tree roots and tries to
	use the first readable.

  rescan_uuid_tree
	Force check and rebuild procedure of the UUID tree. This should not
	normally be needed.

  skip_balance
	Skip automatic resume of interrupted balance operation after mount.
	May be resumed with "btrfs balance resume."

  space_cache (*)
	Enable the on-disk freespace cache.
  nospace_cache
	Disable freespace cache loading without clearing the cache.
  clear_cache
	Force clearing and rebuilding of the disk space cache if something
	has gone wrong.

  ssd
  nossd
  ssd_spread
	Options to control ssd allocation schemes.  By default, BTRFS will
	enable or disable ssd allocation heuristics depending on whether a
	rotational or non-rotational disk is in use.  The ssd and nossd options
	can override this autodetection.

	The ssd_spread mount option attempts to allocate into big chunks
	of unused space, and may perform better on low-end ssds.  ssd_spread
	implies ssd, enabling all other ssd heuristics as well.

  subvol=<path>
	Mount subvolume at <path> rather than the root subvolume.  <path> is
	relative to the top level subvolume.

  subvolid=<ID>
	Mount subvolume specified by an ID number rather than the root subvolume.
	This allows mounting of subvolumes which are not in the root of the mounted
	filesystem.
	You can use "btrfs subvolume list" to see subvolume ID numbers.

  subvolrootid=<objectid> (deprecated)
	Mount subvolume specified by <objectid> rather than the root subvolume.
	This allows mounting of subvolumes which are not in the root of the mounted
	filesystem.
	You can use "btrfs subvolume show " to see the object ID for a subvolume.

  thread_pool=<number>
	The number of worker threads to allocate.  The default number is equal
	to the number of CPUs + 2, or 8, whichever is smaller.

  user_subvol_rm_allowed
	Allow subvolumes to be deleted by a non-root user. Use with caution.
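
Several of the options above (alloc_start, max_inline) take a byte count with an optional, case-insensitive K, M, or G suffix. The following is a minimal userspace sketch of that rule, not the kernel's parser; the name parse_size and its error convention are assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not a kernel function): parse "<digits>[K|M|G]".
 * Returns 0 and stores the byte count in *out, or a negative errno value.
 */
static int parse_size(const char *s, uint64_t *out)
{
	char *end;
	unsigned long long v;
	unsigned int shift = 0;

	assert(s != NULL && out != NULL);

	/* Reject empty strings, signs and leading whitespace. */
	if (!isdigit((unsigned char)*s))
		return -EINVAL;

	errno = 0;
	v = strtoull(s, &end, 10);
	if (errno)
		return -ERANGE;

	/* Optional single-letter suffix, case insensitive. */
	switch (tolower((unsigned char)*end)) {
	case 'g':
		shift = 30;
		end++;
		break;
	case 'm':
		shift = 20;
		end++;
		break;
	case 'k':
		shift = 10;
		end++;
		break;
	case '\0':
		break;
	default:
		return -EINVAL;
	}

	/* Nothing may follow the suffix. */
	if (*end != '\0')
		return -EINVAL;

	/* Refuse values that would overflow 64 bits after scaling. */
	if (shift && v > (UINT64_MAX >> shift))
		return -ERANGE;

	*out = (uint64_t)v << shift;
	return 0;
}
```

Under these assumptions, "2048" and "2k" both yield 2048 bytes, while a string such as "12x" is rejected.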

MAILING LIST
============

There is a Btrfs mailing list hosted on vger.kernel.org. You can
find details on how to subscribe here:

http://vger.kernel.org/vger-lists.html#linux-btrfs

Mailing list archives are available from gmane:

http://dir.gmane.org/gmane.comp.file-systems.btrfs



IRC
===

Discussion of Btrfs also occurs on the #btrfs channel of the Freenode
IRC network.



	UTILITIES
	=========

Userspace tools for creating and manipulating Btrfs file systems are
available from the git repository at the following location:

 http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs.git
 git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git

These include the following tools:

* mkfs.btrfs: create a filesystem

* btrfs: a single tool to manage the filesystems, refer to the manpage for more details

* 'btrfsck' or 'btrfs check': do a consistency check of the filesystem

Other tools for specific tasks:

* btrfs-convert: in-place conversion from ext2/3/4 filesystems
  https://btrfs.wiki.kernel.org

* btrfs-image: dump filesystem metadata for debugging
that maintains information about administration tasks, frequently asked
questions, use cases, mount options, comprehensible changelogs, features,
manual pages, source code repositories, contacts etc.
+4 −8
@@ -148,7 +148,6 @@ int __init btrfs_prelim_ref_init(void)

void btrfs_prelim_ref_exit(void)
{
	if (btrfs_prelim_ref_cache)
	kmem_cache_destroy(btrfs_prelim_ref_cache);
}

@@ -566,17 +565,14 @@ static void __merge_refs(struct list_head *head, int mode)
		struct __prelim_ref *pos2 = pos1, *tmp;

		list_for_each_entry_safe_continue(pos2, tmp, head, list) {
			struct __prelim_ref *xchg, *ref1 = pos1, *ref2 = pos2;
			struct __prelim_ref *ref1 = pos1, *ref2 = pos2;
			struct extent_inode_elem *eie;

			if (!ref_for_same_block(ref1, ref2))
				continue;
			if (mode == 1) {
				if (!ref1->parent && ref2->parent) {
					xchg = ref1;
					ref1 = ref2;
					ref2 = xchg;
				}
				if (!ref1->parent && ref2->parent)
					swap(ref1, ref2);
			} else {
				if (ref1->parent != ref2->parent)
					continue;
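
The hunk above replaces an open-coded three-assignment exchange through a temporary with the kernel's swap() helper. A minimal userspace sketch of the same idea follows, assuming a compiler that supports the __typeof__ extension (GCC, Clang). SWAP, struct ref and prefer_parent are hypothetical names for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's swap() helper. */
#define SWAP(a, b) \
	do { __typeof__(a) tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

/* Hypothetical, reduced reference: only the field the hunk tests. */
struct ref {
	unsigned long parent;
};

/*
 * Mirror the mode == 1 branch: if only the second ref has a parent,
 * exchange the two pointers so that *r1 is the one with the parent.
 */
static void prefer_parent(struct ref **r1, struct ref **r2)
{
	assert(r1 && r2 && *r1 && *r2);

	if (!(*r1)->parent && (*r2)->parent)
		SWAP(*r1, *r2);
}
```

Because the macro declares its own temporary, the caller no longer needs a separate variable such as the removed xchg.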
+5 −7
@@ -95,6 +95,7 @@
#include <linux/genhd.h>
#include <linux/blkdev.h>
#include <linux/vmalloc.h>
#include <linux/string.h>
#include "ctree.h"
#include "disk-io.h"
#include "hash.h"
@@ -105,6 +106,7 @@
#include "locking.h"
#include "check-integrity.h"
#include "rcu-string.h"
#include "compression.h"

#define BTRFSIC_BLOCK_HASHTABLE_SIZE 0x10000
#define BTRFSIC_BLOCK_LINK_HASHTABLE_SIZE 0x10000
@@ -176,7 +178,7 @@ struct btrfsic_block {
 * Elements of this type are allocated dynamically and required because
 * each block object can refer to and can be ref from multiple blocks.
 * The key to lookup them in the hashtable is the dev_bytenr of
 * the block ref to plus the one from the block refered from.
 * the block ref to plus the one from the block referred from.
 * The fact that they are searchable via a hashtable and that a
 * ref_cnt is maintained is not required for the btrfs integrity
 * check algorithm itself, it is only used to make the output more
@@ -3076,7 +3078,7 @@ int btrfsic_mount(struct btrfs_root *root,

	list_for_each_entry(device, dev_head, dev_list) {
		struct btrfsic_dev_state *ds;
		char *p;
		const char *p;

		if (!device->bdev || !device->name)
			continue;
@@ -3092,11 +3094,7 @@ int btrfsic_mount(struct btrfs_root *root,
		ds->state = state;
		bdevname(ds->bdev, ds->name);
		ds->name[BDEVNAME_SIZE - 1] = '\0';
		for (p = ds->name; *p != '\0'; p++);
		while (p > ds->name && *p != '/')
			p--;
		if (*p == '/')
			p++;
		p = kbasename(ds->name);
		strlcpy(ds->name, p, sizeof(ds->name));
		btrfsic_dev_state_hashtable_add(ds,
						&btrfsic_dev_state_hashtable);
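
The removed loop walked to the end of the name, stepped back to the last '/', and then moved one character past it; kbasename() returns the same position. A minimal userspace sketch of that behaviour, with base_name as a hypothetical stand-in for the kernel helper:

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical userspace equivalent of what kbasename() returns:
 * a pointer to the component after the last '/', or the whole
 * string if it contains no '/'.
 */
static const char *base_name(const char *path)
{
	const char *tail;

	assert(path != NULL);

	tail = strrchr(path, '/');
	return tail ? tail + 1 : path;
}
```

The result points into the original string, which is why the hunk also changes p to const char *.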
+9 −0
@@ -48,6 +48,15 @@ int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
void btrfs_clear_biovec_end(struct bio_vec *bvec, int vcnt,
				   unsigned long pg_index,
				   unsigned long pg_offset);

enum btrfs_compression_type {
	BTRFS_COMPRESS_NONE  = 0,
	BTRFS_COMPRESS_ZLIB  = 1,
	BTRFS_COMPRESS_LZO   = 2,
	BTRFS_COMPRESS_TYPES = 2,
	BTRFS_COMPRESS_LAST  = 3,
};

struct btrfs_compress_op {
	struct list_head *(*alloc_workspace)(void);

+18 −18
@@ -311,7 +311,7 @@ struct tree_mod_root {

struct tree_mod_elem {
	struct rb_node node;
	u64 index;		/* shifted logical */
	u64 logical;
	u64 seq;
	enum mod_log_op op;

@@ -435,11 +435,11 @@ void btrfs_put_tree_mod_seq(struct btrfs_fs_info *fs_info,

/*
 * key order of the log:
 *       index -> sequence
 *       node/leaf start address -> sequence
 *
 * the index is the shifted logical of the *new* root node for root replace
 * operations, or the shifted logical of the affected block for all other
 * operations.
 * The 'start address' is the logical address of the *new* root node
 * for root replace operations, or the logical address of the affected
 * block for all other operations.
 *
 * Note: must be called with write lock (tree_mod_log_write_lock).
 */
@@ -460,9 +460,9 @@ __tree_mod_log_insert(struct btrfs_fs_info *fs_info, struct tree_mod_elem *tm)
	while (*new) {
		cur = container_of(*new, struct tree_mod_elem, node);
		parent = *new;
		if (cur->index < tm->index)
		if (cur->logical < tm->logical)
			new = &((*new)->rb_left);
		else if (cur->index > tm->index)
		else if (cur->logical > tm->logical)
			new = &((*new)->rb_right);
		else if (cur->seq < tm->seq)
			new = &((*new)->rb_left);
@@ -523,7 +523,7 @@ alloc_tree_mod_elem(struct extent_buffer *eb, int slot,
	if (!tm)
		return NULL;

	tm->index = eb->start >> PAGE_CACHE_SHIFT;
	tm->logical = eb->start;
	if (op != MOD_LOG_KEY_ADD) {
		btrfs_node_key(eb, &tm->key, slot);
		tm->blockptr = btrfs_node_blockptr(eb, slot);
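
Storing eb->start directly instead of eb->start >> PAGE_CACHE_SHIFT matters once a single page can hold more than one tree block, which is the PAGE_SIZE > sectorsize case the pull message mentions. A small sketch of the collision, assuming a hypothetical 64K page (shift 16) and tree blocks 16K apart; shifted_index is an illustrative name, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: the old key, i.e. the block's start address
 * shifted right by the page shift.
 */
static uint64_t shifted_index(uint64_t start, unsigned int page_shift)
{
	assert(page_shift < 64);
	return start >> page_shift;
}
```

Under these assumptions, blocks starting at 65536 and 81920 both map to index 1 with a shift of 16, so the old key cannot tell them apart, while their logical addresses remain distinct.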
@@ -588,7 +588,7 @@ tree_mod_log_insert_move(struct btrfs_fs_info *fs_info,
		goto free_tms;
	}

	tm->index = eb->start >> PAGE_CACHE_SHIFT;
	tm->logical = eb->start;
	tm->slot = src_slot;
	tm->move.dst_slot = dst_slot;
	tm->move.nr_items = nr_items;
@@ -699,7 +699,7 @@ tree_mod_log_insert_root(struct btrfs_fs_info *fs_info,
		goto free_tms;
	}

	tm->index = new_root->start >> PAGE_CACHE_SHIFT;
	tm->logical = new_root->start;
	tm->old_root.logical = old_root->start;
	tm->old_root.level = btrfs_header_level(old_root);
	tm->generation = btrfs_header_generation(old_root);
@@ -739,16 +739,15 @@ __tree_mod_log_search(struct btrfs_fs_info *fs_info, u64 start, u64 min_seq,
	struct rb_node *node;
	struct tree_mod_elem *cur = NULL;
	struct tree_mod_elem *found = NULL;
	u64 index = start >> PAGE_CACHE_SHIFT;

	tree_mod_log_read_lock(fs_info);
	tm_root = &fs_info->tree_mod_log;
	node = tm_root->rb_node;
	while (node) {
		cur = container_of(node, struct tree_mod_elem, node);
		if (cur->index < index) {
		if (cur->logical < start) {
			node = node->rb_left;
		} else if (cur->index > index) {
		} else if (cur->logical > start) {
			node = node->rb_right;
		} else if (cur->seq < min_seq) {
			node = node->rb_left;
@@ -1230,9 +1229,10 @@ __tree_mod_log_oldest_root(struct btrfs_fs_info *fs_info,
		return NULL;

	/*
	 * the very last operation that's logged for a root is the replacement
	 * operation (if it is replaced at all). this has the index of the *new*
	 * root, making it the very first operation that's logged for this root.
	 * the very last operation that's logged for a root is the
	 * replacement operation (if it is replaced at all). this has
	 * the logical address of the *new* root, making it the very
	 * first operation that's logged for this root.
	 */
	while (1) {
		tm = tree_mod_log_search_oldest(fs_info, root_logical,
@@ -1336,7 +1336,7 @@ __tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct extent_buffer *eb,
		if (!next)
			break;
		tm = container_of(next, struct tree_mod_elem, node);
		if (tm->index != first_tm->index)
		if (tm->logical != first_tm->logical)
			break;
	}
	tree_mod_log_read_unlock(fs_info);
@@ -5361,7 +5361,7 @@ int btrfs_compare_trees(struct btrfs_root *left_root,
		goto out;
	}

	tmp_buf = kmalloc(left_root->nodesize, GFP_NOFS);
	tmp_buf = kmalloc(left_root->nodesize, GFP_KERNEL);
	if (!tmp_buf) {
		ret = -ENOMEM;
		goto out;