Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 (2150edc6)

Documentation/filesystems/ext4.txt

+67 −18

Original line number	Original line	Diff line number	Diff line
	@@ -58,13 +58,22 @@ Note: More extensive information for getting started with ext4 can be

	# mount -t ext4 /dev/hda1 /wherever		# mount -t ext4 /dev/hda1 /wherever

	- When comparing performance with other filesystems, remember that		- When comparing performance with other filesystems, it's always
	ext3/4 by default offers higher data integrity guarantees than most.		important to try multiple workloads; very often a subtle change in a
	So when comparing with a metadata-only journalling filesystem, such		workload parameter can completely change the ranking of which
	as ext3, use `mount -o data=writeback'. And you might as well use		filesystems do well compared to others. When comparing versus ext3,
	`mount -o nobh' too along with it. Making the journal larger than		note that ext4 enables write barriers by default, while ext3 does
	the mke2fs default often helps performance with metadata-intensive		not enable write barriers by default. So it is useful to
	workloads.		explicitly specify whether barriers are enabled or not via the
			'-o barrier=[0|1]' mount option for both ext3 and ext4 filesystems
			for a fair comparison. When tuning ext3 for best benchmark numbers,
			it is often worthwhile to try changing the data journaling mode; '-o
			data=writeback,nobh' can be faster for some workloads. (Note
			however that running mounted with data=writeback can potentially
			leave stale data exposed in recently written files in case of an
			unclean shutdown, which could be a security exposure in some
			situations.) Configuring the filesystem with a large journal can
			also be helpful for metadata-intensive workloads.

	2. Features		2. Features
	===========		===========
	@@ -74,7 +83,7 @@ Note: More extensive information for getting started with ext4 can be
	* ability to use filesystems > 16TB (e2fsprogs support not available yet)		* ability to use filesystems > 16TB (e2fsprogs support not available yet)
	* extent format reduces metadata overhead (RAM, IO for access, transactions)		* extent format reduces metadata overhead (RAM, IO for access, transactions)
	* extent format more robust in face of on-disk corruption due to magics,		* extent format more robust in face of on-disk corruption due to magics,
	* internal redunancy in tree		* internal redundancy in tree
	* improved file allocation (multi-block alloc)		* improved file allocation (multi-block alloc)
	* fix 32000 subdirectory limit		* fix 32000 subdirectory limit
	* nsec timestamps for mtime, atime, ctime, create time		* nsec timestamps for mtime, atime, ctime, create time
	@@ -116,10 +125,11 @@ grouping of bitmaps and inode tables. Some test results available here:
	When mounting an ext4 filesystem, the following options are accepted:		When mounting an ext4 filesystem, the following options are accepted:
	(*) == default		(*) == default

	extents (*) ext4 will use extents to address file data. The		ro Mount filesystem read only. Note that ext4 will
	file system will no longer be mountable by ext3.		replay the journal (and thus write to the
			partition) even when mounted "read only". The
	noextents ext4 will not use extents for newly created files		mount options "ro,noload" can be used to prevent
			writes to the filesystem.

	journal_checksum Enable checksumming of the journal transactions.		journal_checksum Enable checksumming of the journal transactions.
	This will allow the recovery code in e2fsck and the		This will allow the recovery code in e2fsck and the
	@@ -134,17 +144,17 @@ journal_async_commit Commit block can be written to disk without waiting
	journal=update Update the ext4 file system's journal to the current		journal=update Update the ext4 file system's journal to the current
	format.		format.

	journal=inum When a journal already exists, this option is ignored.
	Otherwise, it specifies the number of the inode which
	will represent the ext4 file system's journal file.

	journal_dev=devnum When the external journal device's major/minor numbers		journal_dev=devnum When the external journal device's major/minor numbers
	have changed, this option allows the user to specify		have changed, this option allows the user to specify
	the new journal location. The journal device is		the new journal location. The journal device is
	identified through its new major/minor numbers encoded		identified through its new major/minor numbers encoded
	in devnum.		in devnum.

	noload Don't load the journal on mounting.		noload Don't load the journal on mounting. Note that
			if the filesystem was not unmounted cleanly,
			skipping the journal replay will lead to the
			filesystem containing inconsistencies that can
			lead to any number of problems.

	data=journal All data are committed into the journal prior to being		data=journal All data are committed into the journal prior to being
	written into the main file system.		written into the main file system.
	@@ -219,9 +229,12 @@ minixdf Make 'df' act like Minix.

	debug Extra debugging information is sent to syslog.		debug Extra debugging information is sent to syslog.

	errors=remount-ro(*) Remount the filesystem read-only on an error.		errors=remount-ro Remount the filesystem read-only on an error.
	errors=continue Keep going on a filesystem error.		errors=continue Keep going on a filesystem error.
	errors=panic Panic and halt the machine if an error occurs.		errors=panic Panic and halt the machine if an error occurs.
			(These mount options override the errors behavior
			specified in the superblock, which can be configured
			using tune2fs)

	data_err=ignore(*) Just print an error message if an error occurs		data_err=ignore(*) Just print an error message if an error occurs
	in a file data buffer in ordered mode.		in a file data buffer in ordered mode.
	@@ -261,6 +274,42 @@ delalloc (*) Deferring block allocation until write-out time.
	nodelalloc Disable delayed allocation. Blocks are allocated		nodelalloc Disable delayed allocation. Blocks are allocated
	when data is copied from user to page cache.		when data is copied from user to page cache.

			max_batch_time=usec Maximum amount of time ext4 should wait for
			additional filesystem operations to be batched
			together with a synchronous write operation.
			Since a synchronous write operation is going to
			force a commit and then a wait for the I/O to
			complete, waiting a little longer doesn't cost
			much and can be a huge throughput win, so we wait
			for a small amount of time to see if any other
			transactions can piggyback on the synchronous
			write. The algorithm used is designed to
			automatically tune for the speed of the disk, by
			measuring the amount of time (on average) that it
			takes to finish committing a transaction. Call
			this time the "commit time". If the time that the
			transaction has been running is less than the
			commit time, ext4 will try sleeping for the
			commit time to see if other operations will join
			the transaction. The commit time is capped by
			the max_batch_time, which defaults to 15000us
			(15ms). This optimization can be turned off
			entirely by setting max_batch_time to 0.

			min_batch_time=usec This parameter sets the commit time (as
			described above) to be at least min_batch_time.
			It defaults to zero microseconds. Increasing
			this parameter may improve the throughput of
			multi-threaded, synchronous workloads on very
			fast disks, at the cost of increasing latency.

			journal_ioprio=prio The I/O priority (from 0 to 7, where 0 is the
			highest priority) which should be used for I/O
			operations submitted by kjournald2 during a
			commit operation. This defaults to 3, which is
			a slightly higher priority than the default I/O
			priority.

	Data Mode		Data Mode
	=========		=========
	There are 3 different data modes:		There are 3 different data modes:
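
Reviewer note on max_batch_time/min_batch_time: the documented behaviour can be read as a small decision rule. The following is a minimal user-space sketch of that rule, not the jbd2 code. The function name batch_wait_usec and its parameters are hypothetical, and the clamping order (raise to min_batch_time first, then cap at max_batch_time) is an assumption that the text above does not spell out.

```c
#include <stdint.h>

/*
 * Hypothetical helper: decide how long (in microseconds) a synchronous
 * writer should sleep so that other operations can join the running
 * transaction. A return value of 0 means "commit right away".
 *
 * avg_commit_usec: measured average time to commit a transaction
 * txn_age_usec:    how long the current transaction has been running
 */
static uint64_t batch_wait_usec(uint64_t avg_commit_usec,
                                uint64_t txn_age_usec,
                                uint64_t min_batch_usec,
                                uint64_t max_batch_usec)
{
    uint64_t commit_time = avg_commit_usec;

    /* max_batch_time=0 turns the optimization off entirely. */
    if (max_batch_usec == 0)
        return 0;

    /* Assumed order: apply the lower bound, then the cap. */
    if (commit_time < min_batch_usec)
        commit_time = min_batch_usec;
    if (commit_time > max_batch_usec)
        commit_time = max_batch_usec;

    /* Only wait if the transaction is younger than the commit time. */
    if (txn_age_usec < commit_time)
        return commit_time;
    return 0;
}
```

With the documented defaults (min 0, max 15000), a disk that averages 5 ms per commit would lead to a 5 ms wait for a 1 ms old transaction, while a slow disk averaging 50 ms would be capped at 15 ms.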

block/Kconfig

+6 −0

Original line number	Original line	Diff line number	Diff line
	@@ -36,6 +36,12 @@ config LBD
	This option also enables support for single files larger than		This option also enables support for single files larger than
	2TB.		2TB.

			The ext4 filesystem requires that this feature be enabled in
			order to support filesystems that have the huge_file feature
			enabled. Otherwise, it will refuse to mount any filesystems
			that use the huge_file feature, which is enabled by default
			by mke2fs.ext4. The GFS2 filesystem also requires this feature.

	If unsure, say N.		If unsure, say N.

	config BLK_DEV_IO_TRACE		config BLK_DEV_IO_TRACE

fs/block_dev.c

+15 −0

Original line number	Original line	Diff line number	Diff line
	@@ -1234,6 +1234,20 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
	return blkdev_ioctl(bdev, mode, cmd, arg);		return blkdev_ioctl(bdev, mode, cmd, arg);
	}		}

			/*
			* Try to release a page associated with block device when the system
			* is under memory pressure.
			*/
			static int blkdev_releasepage(struct page *page, gfp_t wait)
			{
			struct super_block *super = BDEV_I(page->mapping->host)->bdev.bd_super;

			if (super && super->s_op->bdev_try_to_free_page)
			return super->s_op->bdev_try_to_free_page(super, page, wait);

			return try_to_free_buffers(page);
			}

	static const struct address_space_operations def_blk_aops = {		static const struct address_space_operations def_blk_aops = {
	.readpage = blkdev_readpage,		.readpage = blkdev_readpage,
	.writepage = blkdev_writepage,		.writepage = blkdev_writepage,
	@@ -1241,6 +1255,7 @@ static const struct address_space_operations def_blk_aops = {
	.write_begin = blkdev_write_begin,		.write_begin = blkdev_write_begin,
	.write_end = blkdev_write_end,		.write_end = blkdev_write_end,
	.writepages = generic_writepages,		.writepages = generic_writepages,
			.releasepage = blkdev_releasepage,
	.direct_IO = blkdev_direct_IO,		.direct_IO = blkdev_direct_IO,
	};		};
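
Reviewer note on blkdev_releasepage: the new function follows an "optional hook with default fallback" pattern. If the owning superblock supplies bdev_try_to_free_page, that hook decides; otherwise the generic try_to_free_buffers path is used. The sketch below reproduces only that dispatch shape in user-space C. All names in it (page_stub, owner_ops, release_page, default_release, always_refuse) are hypothetical stand-ins, not kernel APIs.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct page. */
struct page_stub {
    int busy;   /* non-zero: buffers still in use */
};

/* Hypothetical stand-in for the super_operations hook table. */
struct owner_ops {
    int (*try_to_free_page)(struct page_stub *page);
};

/* Stand-in for the generic path: succeed (1) only if the page is idle. */
static int default_release(struct page_stub *page)
{
    return page->busy ? 0 : 1;
}

/* Delegate to the owner's hook when one is registered, else fall back. */
static int release_page(const struct owner_ops *ops, struct page_stub *page)
{
    if (ops && ops->try_to_free_page)
        return ops->try_to_free_page(page);

    return default_release(page);
}

/* Example hook: an owner that never lets its pages go. */
static int always_refuse(struct page_stub *page)
{
    (void)page;
    return 0;
}
```

The design lets a filesystem such as ext4 veto or assist the release of journal-related pages without changing behaviour for block devices that have no mounted filesystem.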

fs/ext3/hash.c

+67 −10

Original line number	Original line	Diff line number	Diff line
	@@ -35,23 +35,71 @@ static void TEA_transform(__u32 buf[4], __u32 const in[])


	/* The old legacy hash */		/* The old legacy hash */
	static __u32 dx_hack_hash (const char *name, int len)		static __u32 dx_hack_hash_unsigned(const char *name, int len)
	{		{
	__u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;		__u32 hash, hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
			const unsigned char *ucp = (const unsigned char *) name;

			while (len--) {
			hash = hash1 + (hash0 ^ (((int) *ucp++) * 7152373));

			if (hash & 0x80000000)
			hash -= 0x7fffffff;
			hash1 = hash0;
			hash0 = hash;
			}
			return hash0 << 1;
			}

			static __u32 dx_hack_hash_signed(const char *name, int len)
			{
			__u32 hash, hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
			const signed char *scp = (const signed char *) name;

	while (len--) {		while (len--) {
	__u32 hash = hash1 + (hash0 ^ (*name++ * 7152373));		hash = hash1 + (hash0 ^ (((int) *scp++) * 7152373));

	if (hash & 0x80000000) hash -= 0x7fffffff;		if (hash & 0x80000000)
			hash -= 0x7fffffff;
	hash1 = hash0;		hash1 = hash0;
	hash0 = hash;		hash0 = hash;
	}		}
	return (hash0 << 1);		return hash0 << 1;
			}

			static void str2hashbuf_signed(const char *msg, int len, __u32 *buf, int num)
			{
			__u32 pad, val;
			int i;
			const signed char *scp = (const signed char *) msg;

			pad = (__u32)len | ((__u32)len << 8);
			pad |= pad << 16;

			val = pad;
			if (len > num*4)
			len = num * 4;
			for (i = 0; i < len; i++) {
			if ((i % 4) == 0)
			val = pad;
			val = ((int) scp[i]) + (val << 8);
			if ((i % 4) == 3) {
			*buf++ = val;
			val = pad;
			num--;
			}
			}
			if (--num >= 0)
			*buf++ = val;
			while (--num >= 0)
			*buf++ = pad;
	}		}

	static void str2hashbuf(const char *msg, int len, __u32 *buf, int num)		static void str2hashbuf_unsigned(const char *msg, int len, __u32 *buf, int num)
	{		{
	__u32 pad, val;		__u32 pad, val;
	int i;		int i;
			const unsigned char *ucp = (const unsigned char *) msg;

	pad = (__u32)len | ((__u32)len << 8);		pad = (__u32)len | ((__u32)len << 8);
	pad |= pad << 16;		pad |= pad << 16;
	@@ -62,7 +110,7 @@ static void str2hashbuf(const char *msg, int len, __u32 *buf, int num)
	for (i=0; i < len; i++) {		for (i=0; i < len; i++) {
	if ((i % 4) == 0)		if ((i % 4) == 0)
	val = pad;		val = pad;
	val = msg[i] + (val << 8);		val = ((int) ucp[i]) + (val << 8);
	if ((i % 4) == 3) {		if ((i % 4) == 3) {
	*buf++ = val;		*buf++ = val;
	val = pad;		val = pad;
	@@ -95,6 +143,8 @@ int ext3fs_dirhash(const char *name, int len, struct dx_hash_info *hinfo)
	const char *p;		const char *p;
	int i;		int i;
	__u32 in[8], buf[4];		__u32 in[8], buf[4];
			void (*str2hashbuf)(const char *, int, __u32 *, int) =
			str2hashbuf_signed;

	/* Initialize the default seed for the hash checksum functions */		/* Initialize the default seed for the hash checksum functions */
	buf[0] = 0x67452301;		buf[0] = 0x67452301;
	@@ -113,13 +163,18 @@ int ext3fs_dirhash(const char *name, int len, struct dx_hash_info *hinfo)
	}		}

	switch (hinfo->hash_version) {		switch (hinfo->hash_version) {
			case DX_HASH_LEGACY_UNSIGNED:
			hash = dx_hack_hash_unsigned(name, len);
			break;
	case DX_HASH_LEGACY:		case DX_HASH_LEGACY:
	hash = dx_hack_hash(name, len);		hash = dx_hack_hash_signed(name, len);
	break;		break;
			case DX_HASH_HALF_MD4_UNSIGNED:
			str2hashbuf = str2hashbuf_unsigned;
	case DX_HASH_HALF_MD4:		case DX_HASH_HALF_MD4:
	p = name;		p = name;
	while (len > 0) {		while (len > 0) {
	str2hashbuf(p, len, in, 8);		(*str2hashbuf)(p, len, in, 8);
	half_md4_transform(buf, in);		half_md4_transform(buf, in);
	len -= 32;		len -= 32;
	p += 32;		p += 32;
	@@ -127,10 +182,12 @@ int ext3fs_dirhash(const char *name, int len, struct dx_hash_info *hinfo)
	minor_hash = buf[2];		minor_hash = buf[2];
	hash = buf[1];		hash = buf[1];
	break;		break;
			case DX_HASH_TEA_UNSIGNED:
			str2hashbuf = str2hashbuf_unsigned;
	case DX_HASH_TEA:		case DX_HASH_TEA:
	p = name;		p = name;
	while (len > 0) {		while (len > 0) {
	str2hashbuf(p, len, in, 4);		(*str2hashbuf)(p, len, in, 4);
	TEA_transform(buf, in);		TEA_transform(buf, in);
	len -= 16;		len -= 16;
	p += 16;		p += 16;
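
Reviewer note on the signed/unsigned split: the reason both variants are needed is that plain char is signed on some architectures and unsigned on others, so names containing bytes of 0x80 and above hash differently depending on where the directory index was built. The sketch below restates the two legacy hash functions from this diff in portable user-space C (uint32_t instead of __u32; the names legacy_hash_signed and legacy_hash_unsigned are illustrative) so the divergence can be observed directly.

```c
#include <stdint.h>

/* Legacy hash, reading each name byte as an unsigned char. */
static uint32_t legacy_hash_unsigned(const char *name, int len)
{
    uint32_t hash, hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
    const unsigned char *ucp = (const unsigned char *) name;

    while (len--) {
        hash = hash1 + (hash0 ^ (((int) *ucp++) * 7152373));

        if (hash & 0x80000000)
            hash -= 0x7fffffff;
        hash1 = hash0;
        hash0 = hash;
    }
    return hash0 << 1;
}

/* Same algorithm, but bytes >= 0x80 are sign-extended to negative ints. */
static uint32_t legacy_hash_signed(const char *name, int len)
{
    uint32_t hash, hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
    const signed char *scp = (const signed char *) name;

    while (len--) {
        hash = hash1 + (hash0 ^ (((int) *scp++) * 7152373));

        if (hash & 0x80000000)
            hash -= 0x7fffffff;
        hash1 = hash0;
        hash0 = hash;
    }
    return hash0 << 1;
}
```

For a pure ASCII name such as "hello" the two functions agree, because every byte is below 0x80. For a name containing the single byte 0xff they produce different values, which is why a directory indexed on one kind of machine could not be searched reliably on the other before this change.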

fs/ext3/namei.c

+9 −2

Original line number	Original line	Diff line number	Diff line
	@@ -364,6 +364,8 @@ dx_probe(struct qstr *entry, struct inode *dir,
	goto fail;		goto fail;
	}		}
	hinfo->hash_version = root->info.hash_version;		hinfo->hash_version = root->info.hash_version;
			if (hinfo->hash_version <= DX_HASH_TEA)
			hinfo->hash_version += EXT3_SB(dir->i_sb)->s_hash_unsigned;
	hinfo->seed = EXT3_SB(dir->i_sb)->s_hash_seed;		hinfo->seed = EXT3_SB(dir->i_sb)->s_hash_seed;
	if (entry)		if (entry)
	ext3fs_dirhash(entry->name, entry->len, hinfo);		ext3fs_dirhash(entry->name, entry->len, hinfo);
	@@ -632,6 +634,9 @@ int ext3_htree_fill_tree(struct file *dir_file, __u32 start_hash,
	dir = dir_file->f_path.dentry->d_inode;		dir = dir_file->f_path.dentry->d_inode;
	if (!(EXT3_I(dir)->i_flags & EXT3_INDEX_FL)) {		if (!(EXT3_I(dir)->i_flags & EXT3_INDEX_FL)) {
	hinfo.hash_version = EXT3_SB(dir->i_sb)->s_def_hash_version;		hinfo.hash_version = EXT3_SB(dir->i_sb)->s_def_hash_version;
			if (hinfo.hash_version <= DX_HASH_TEA)
			hinfo.hash_version +=
			EXT3_SB(dir->i_sb)->s_hash_unsigned;
	hinfo.seed = EXT3_SB(dir->i_sb)->s_hash_seed;		hinfo.seed = EXT3_SB(dir->i_sb)->s_hash_seed;
	count = htree_dirblock_to_tree(dir_file, dir, 0, &hinfo,		count = htree_dirblock_to_tree(dir_file, dir, 0, &hinfo,
	start_hash, start_minor_hash);		start_hash, start_minor_hash);
	@@ -1152,9 +1157,9 @@ static struct ext3_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
	u32 hash2;		u32 hash2;
	struct dx_map_entry *map;		struct dx_map_entry *map;
	char *data1 = (*bh)->b_data, *data2;		char *data1 = (*bh)->b_data, *data2;
	unsigned split, move, size, i;		unsigned split, move, size;
	struct ext3_dir_entry_2 *de = NULL, *de2;		struct ext3_dir_entry_2 *de = NULL, *de2;
	int err = 0;		int err = 0, i;

	bh2 = ext3_append (handle, dir, &newblock, &err);		bh2 = ext3_append (handle, dir, &newblock, &err);
	if (!(bh2)) {		if (!(bh2)) {
	@@ -1394,6 +1399,8 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,

	/* Initialize as for dx_probe */		/* Initialize as for dx_probe */
	hinfo.hash_version = root->info.hash_version;		hinfo.hash_version = root->info.hash_version;
			if (hinfo.hash_version <= DX_HASH_TEA)
			hinfo.hash_version += EXT3_SB(dir->i_sb)->s_hash_unsigned;
	hinfo.seed = EXT3_SB(dir->i_sb)->s_hash_seed;		hinfo.seed = EXT3_SB(dir->i_sb)->s_hash_seed;
	ext3fs_dirhash(name, namelen, &hinfo);		ext3fs_dirhash(name, namelen, &hinfo);
	frame = frames;		frame = frames;
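
Reviewer note on the `hash_version += s_hash_unsigned` idiom used in all three namei.c hunks: it relies on the unsigned hash variants being numbered at a fixed offset above the signed ones. The sketch below illustrates that mapping. The numeric values of the constants and the convention that s_hash_unsigned is either 0 or 3 are not shown in this diff and are assumed here; effective_hash_version is a hypothetical helper name.

```c
/* Assumed numbering: signed variants 0..2, unsigned variants 3..5. */
enum {
    DX_HASH_LEGACY            = 0,
    DX_HASH_HALF_MD4          = 1,
    DX_HASH_TEA               = 2,
    DX_HASH_LEGACY_UNSIGNED   = 3,
    DX_HASH_HALF_MD4_UNSIGNED = 4,
    DX_HASH_TEA_UNSIGNED      = 5
};

/*
 * Hypothetical helper mirroring the pattern in the diff: shift a signed
 * on-disk hash version to its unsigned twin when the superblock says the
 * filesystem uses unsigned char hashing (s_hash_unsigned assumed 0 or 3).
 */
static int effective_hash_version(int on_disk_version, int s_hash_unsigned)
{
    int version = on_disk_version;

    /* Only the three base versions are shifted; others pass through. */
    if (version <= DX_HASH_TEA)
        version += s_hash_unsigned;
    return version;
}
```

The `<= DX_HASH_TEA` guard keeps a version that is already one of the unsigned variants from being shifted a second time.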