
Commit 2150edc6 authored by Linus Torvalds
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (57 commits)
  jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs
  ext4: Remove "extents" mount option
  block: Add Kconfig help which notes that ext4 needs CONFIG_LBD
  ext4: Make printk's consistently prefixed with "EXT4-fs: "
  ext4: Add sanity checks for the superblock before mounting the filesystem
  ext4: Add mount option to set kjournald's I/O priority
  jbd2: Submit writes to the journal using WRITE_SYNC
  jbd2: Add pid and journal device name to the "kjournald2 starting" message
  ext4: Add markers for better debuggability
  ext4: Remove code to create the journal inode
  ext4: provide function to release metadata pages under memory pressure
  ext3: provide function to release metadata pages under memory pressure
  add releasepage hooks to block devices which can be used by file systems
  ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc
  ext4: Init the complete page while building buddy cache
  ext4: Don't allow new groups to be added during block allocation
  ext4: mark the blocks/inode bitmap beyond end of group as used
  ext4: Use new buffer_head flag to check uninit group bitmaps initialization
  ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()
  ext4: code cleanup
  ...
parents cd764695 4b905671
+67 −18
@@ -58,13 +58,22 @@ Note: More extensive information for getting started with ext4 can be


	# mount -t ext4 /dev/hda1 /wherever
	# mount -t ext4 /dev/hda1 /wherever


  - When comparing performance with other filesystems, remember that
    ext3/4 by default offers higher data integrity guarantees than most.
    So when comparing with a metadata-only journalling filesystem, such
    as ext3, use `mount -o data=writeback'.  And you might as well use
    `mount -o nobh' too along with it.  Making the journal larger than
    the mke2fs default often helps performance with metadata-intensive
    workloads.
  - When comparing performance with other filesystems, it's always
    important to try multiple workloads; very often a subtle change in a
    workload parameter can completely change the ranking of which
    filesystems do well compared to others.  When comparing versus ext3,
    note that ext4 enables write barriers by default, while ext3 does
    not enable write barriers by default.  So it is useful to
    explicitly specify whether barriers are enabled or not via the
    '-o barrier=[0|1]' mount option for both ext3 and ext4 filesystems
    for a fair comparison.  When tuning ext3 for best benchmark numbers,
    it is often worthwhile to try changing the data journaling mode; '-o
    data=writeback,nobh' can be faster for some workloads.  (Note
    however that running mounted with data=writeback can potentially
    leave stale data exposed in recently written files in case of an
    unclean shutdown, which could be a security exposure in some
    situations.)  Configuring the filesystem with a large journal can
    also be helpful for metadata-intensive workloads.


2. Features
2. Features
===========
===========
@@ -74,7 +83,7 @@ Note: More extensive information for getting started with ext4 can be
* ability to use filesystems > 16TB (e2fsprogs support not available yet)
* ability to use filesystems > 16TB (e2fsprogs support not available yet)
* extent format reduces metadata overhead (RAM, IO for access, transactions)
* extent format reduces metadata overhead (RAM, IO for access, transactions)
* extent format more robust in face of on-disk corruption due to magics,
* extent format more robust in face of on-disk corruption due to magics,
* internal redunancy in tree
* internal redundancy in tree
* improved file allocation (multi-block alloc)
* improved file allocation (multi-block alloc)
* fix 32000 subdirectory limit
* fix 32000 subdirectory limit
* nsec timestamps for mtime, atime, ctime, create time
* nsec timestamps for mtime, atime, ctime, create time
@@ -116,10 +125,11 @@ grouping of bitmaps and inode tables. Some test results available here:
When mounting an ext4 filesystem, the following options are accepted:
When mounting an ext4 filesystem, the following options are accepted:
(*) == default
(*) == default


extents		(*)	ext4 will use extents to address file data.  The
			file system will no longer be mountable by ext3.

noextents		ext4 will not use extents for newly created files
ro                   	Mount filesystem read only. Note that ext4 will
                     	replay the journal (and thus write to the
                     	partition) even when mounted "read only". The
                     	mount options "ro,noload" can be used to prevent
		     	writes to the filesystem.


journal_checksum	Enable checksumming of the journal transactions.
journal_checksum	Enable checksumming of the journal transactions.
			This will allow the recovery code in e2fsck and the
			This will allow the recovery code in e2fsck and the
@@ -134,17 +144,17 @@ journal_async_commit Commit block can be written to disk without waiting
journal=update		Update the ext4 file system's journal to the current
journal=update		Update the ext4 file system's journal to the current
			format.
			format.


journal=inum		When a journal already exists, this option is ignored.
			Otherwise, it specifies the number of the inode which
			will represent the ext4 file system's journal file.

journal_dev=devnum	When the external journal device's major/minor numbers
journal_dev=devnum	When the external journal device's major/minor numbers
			have changed, this option allows the user to specify
			have changed, this option allows the user to specify
			the new journal location.  The journal device is
			the new journal location.  The journal device is
			identified through its new major/minor numbers encoded
			identified through its new major/minor numbers encoded
			in devnum.
			in devnum.


noload			Don't load the journal on mounting.
noload			Don't load the journal on mounting.  Note that
                     	if the filesystem was not unmounted cleanly,
                     	skipping the journal replay will lead to the
                     	filesystem containing inconsistencies that can
                     	lead to any number of problems.


data=journal		All data are committed into the journal prior to being
data=journal		All data are committed into the journal prior to being
			written into the main file system.
			written into the main file system.
@@ -219,9 +229,12 @@ minixdf Make 'df' act like Minix.


debug			Extra debugging information is sent to syslog.
debug			Extra debugging information is sent to syslog.


errors=remount-ro(*)	Remount the filesystem read-only on an error.
errors=remount-ro	Remount the filesystem read-only on an error.
errors=continue		Keep going on a filesystem error.
errors=continue		Keep going on a filesystem error.
errors=panic		Panic and halt the machine if an error occurs.
errors=panic		Panic and halt the machine if an error occurs.
                        (These mount options override the errors behavior
                        specified in the superblock, which can be configured
                        using tune2fs)


data_err=ignore(*)	Just print an error message if an error occurs
data_err=ignore(*)	Just print an error message if an error occurs
			in a file data buffer in ordered mode.
			in a file data buffer in ordered mode.
@@ -261,6 +274,42 @@ delalloc (*) Deferring block allocation until write-out time.
nodelalloc		Disable delayed allocation. Blocks are allocated
nodelalloc		Disable delayed allocation. Blocks are allocated
			when data is copied from user to page cache.
			when data is copied from user to page cache.


max_batch_time=usec	Maximum amount of time ext4 should wait for
			additional filesystem operations to be batched
			together with a synchronous write operation.
			Since a synchronous write operation is going to
			force a commit and then wait for the I/O to
			complete, waiting a small amount of time to see
			if any other transactions can piggyback on the
			synchronous write costs little and can be a
			huge throughput win.  The
			algorithm used is designed to automatically tune
			for the speed of the disk, by measuring the
			amount of time (on average) that it takes to
			finish committing a transaction.  Call this time
			the "commit time".  If the time that the
			transaction has been running is less than the
			commit time, ext4 will try sleeping for the
			commit time to see if other operations will join
			the transaction.   The commit time is capped by
			the max_batch_time, which defaults to 15000us
			(15ms).   This optimization can be turned off
			entirely by setting max_batch_time to 0.

min_batch_time=usec	This parameter sets the commit time (as
			described above) to be at least min_batch_time.
			It defaults to zero microseconds.  Increasing
			this parameter may improve the throughput of
			multi-threaded, synchronous workloads on very
			fast disks, at the cost of increasing latency.
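
The interaction between the measured commit time, min_batch_time and max_batch_time described above can be sketched in userspace C. This is only an illustration of the documented rule, not the jbd2 code: the function names are hypothetical, and applying the minimum before the cap (so that the cap wins if the two conflict) is an assumption.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): clamp the measured
 * average commit time into [min_batch_us, max_batch_us].  The minimum
 * is applied first, so the cap wins if the two conflict (assumption).
 */
static uint32_t effective_commit_time_us(uint32_t avg_commit_us,
					 uint32_t min_batch_us,
					 uint32_t max_batch_us)
{
	uint32_t t = avg_commit_us;

	if (t < min_batch_us)
		t = min_batch_us;
	if (t > max_batch_us)
		t = max_batch_us;
	return t;
}

/*
 * Return how long to sleep, in microseconds, hoping that other
 * operations join the running transaction.  0 means "commit now".
 */
static uint32_t batch_sleep_us(uint32_t trans_age_us,
			       uint32_t avg_commit_us,
			       uint32_t min_batch_us,
			       uint32_t max_batch_us)
{
	uint32_t commit_us;

	/* max_batch_time=0 turns the optimization off entirely. */
	if (max_batch_us == 0)
		return 0;

	commit_us = effective_commit_time_us(avg_commit_us, min_batch_us,
					     max_batch_us);

	/* Only wait while the transaction is younger than the commit time. */
	if (trans_age_us < commit_us)
		return commit_us;
	return 0;
}
```

With the documented defaults (min 0, max 15000), a 100us-old transaction on a disk whose commits average 5000us would sleep for 5000us under this sketch, while one that has already run for 6000us would be committed immediately.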

journal_ioprio=prio	The I/O priority (from 0 to 7, where 0 is the
			highest priority) which should be used for I/O
			operations submitted by kjournald2 during a
			commit operation.  This defaults to 3, which is
			a slightly higher priority than the default I/O
			priority.
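
As a rough illustration of what a 0..7 priority level means at the I/O scheduler interface, the sketch below packs a level into a Linux-style ioprio value. The constants are restated locally under the assumption that they match the kernel's ioprio layout (class in the bits above bit 13, best-effort class equal to 2); that the journal thread uses the best-effort class is likewise an assumption, and journal_ioprio_value is a hypothetical name.

```c
/* Assumed to mirror the kernel's ioprio layout; restated locally. */
#define SKETCH_IOPRIO_CLASS_SHIFT	13
#define SKETCH_IOPRIO_CLASS_BE		2	/* best-effort class */

/*
 * Pack a journal_ioprio level (0 = highest, 7 = lowest) into a
 * best-effort ioprio value.  Returns -1 if the level is out of range.
 */
static int journal_ioprio_value(unsigned int level)
{
	if (level > 7)
		return -1;
	return (SKETCH_IOPRIO_CLASS_BE << SKETCH_IOPRIO_CLASS_SHIFT) |
	       (int)level;
}
```

Under these assumptions the documented default of 3 encodes to 0x4003, one step above the usual best-effort default level of 4.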

Data Mode
Data Mode
=========
=========
There are 3 different data modes:
There are 3 different data modes:
+6 −0
@@ -36,6 +36,12 @@ config LBD
	  This option also enables support for single files larger than
	  This option also enables support for single files larger than
	  2TB.
	  2TB.


	  The ext4 filesystem requires that this feature be enabled in
	  order to support filesystems that have the huge_file feature
	  enabled.    Otherwise, it will refuse to mount any filesystems
	  that use the huge_file feature, which is enabled by default
	  by mke2fs.ext4.   The GFS2 filesystem also requires this feature.

	  If unsure, say N.
	  If unsure, say N.


config BLK_DEV_IO_TRACE
config BLK_DEV_IO_TRACE
+15 −0
@@ -1234,6 +1234,20 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
	return blkdev_ioctl(bdev, mode, cmd, arg);
	return blkdev_ioctl(bdev, mode, cmd, arg);
}
}


/*
 * Try to release a page associated with block device when the system
 * is under memory pressure.
 */
static int blkdev_releasepage(struct page *page, gfp_t wait)
{
	struct super_block *super = BDEV_I(page->mapping->host)->bdev.bd_super;

	if (super && super->s_op->bdev_try_to_free_page)
		return super->s_op->bdev_try_to_free_page(super, page, wait);

	return try_to_free_buffers(page);
}
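
The pattern in blkdev_releasepage(), an optional per-owner hook with a generic fallback, can be shown in a self-contained userspace sketch. All types and names below are hypothetical stand-ins and do not correspond to the kernel structures.

```c
#include <stddef.h>

/* Hypothetical stand-in for a page; in_use means buffers are still busy. */
struct page_sketch {
	int in_use;
};

/* Hypothetical stand-in for the owner's operations table. */
struct owner_ops_sketch {
	/* Optional hook: NULL when the owner has no special handling. */
	int (*try_to_free_page)(struct page_sketch *page);
};

/* Generic fallback: the page can be released only if nothing uses it. */
static int default_try_to_free(struct page_sketch *page)
{
	return page->in_use ? 0 : 1;
}

/* Prefer the owner's hook when present, else use the fallback. */
static int release_page_sketch(const struct owner_ops_sketch *ops,
			       struct page_sketch *page)
{
	if (ops && ops->try_to_free_page)
		return ops->try_to_free_page(page);
	return default_try_to_free(page);
}

/* Example hook that always refuses to release the page. */
static int always_refuse(struct page_sketch *page)
{
	(void)page;
	return 0;
}

/* Small driver: returns the release decision with or without a hook. */
static int demo_release(int with_hook)
{
	struct page_sketch page = { 0 };
	struct owner_ops_sketch ops = { with_hook ? always_refuse : NULL };

	return release_page_sketch(&ops, &page);
}
```

Checking the hook pointer before calling it keeps owners that never set the hook on the old code path, which is presumably why the kernel version falls through to try_to_free_buffers().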

static const struct address_space_operations def_blk_aops = {
static const struct address_space_operations def_blk_aops = {
	.readpage	= blkdev_readpage,
	.readpage	= blkdev_readpage,
	.writepage	= blkdev_writepage,
	.writepage	= blkdev_writepage,
@@ -1241,6 +1255,7 @@ static const struct address_space_operations def_blk_aops = {
	.write_begin	= blkdev_write_begin,
	.write_begin	= blkdev_write_begin,
	.write_end	= blkdev_write_end,
	.write_end	= blkdev_write_end,
	.writepages	= generic_writepages,
	.writepages	= generic_writepages,
	.releasepage	= blkdev_releasepage,
	.direct_IO	= blkdev_direct_IO,
	.direct_IO	= blkdev_direct_IO,
};
};


+67 −10
@@ -35,23 +35,71 @@ static void TEA_transform(__u32 buf[4], __u32 const in[])




/* The old legacy hash */
/* The old legacy hash */
static __u32 dx_hack_hash (const char *name, int len)
static __u32 dx_hack_hash_unsigned(const char *name, int len)
{
{
	__u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
	__u32 hash, hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
	const unsigned char *ucp = (const unsigned char *) name;

	while (len--) {
		hash = hash1 + (hash0 ^ (((int) *ucp++) * 7152373));

		if (hash & 0x80000000)
			hash -= 0x7fffffff;
		hash1 = hash0;
		hash0 = hash;
	}
	return hash0 << 1;
}

static __u32 dx_hack_hash_signed(const char *name, int len)
{
	__u32 hash, hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
	const signed char *scp = (const signed char *) name;

	while (len--) {
	while (len--) {
		__u32 hash = hash1 + (hash0 ^ (*name++ * 7152373));
		hash = hash1 + (hash0 ^ (((int) *scp++) * 7152373));


		if (hash & 0x80000000) hash -= 0x7fffffff;
		if (hash & 0x80000000)
			hash -= 0x7fffffff;
		hash1 = hash0;
		hash1 = hash0;
		hash0 = hash;
		hash0 = hash;
	}
	}
	return (hash0 << 1);
	return hash0 << 1;
}

static void str2hashbuf_signed(const char *msg, int len, __u32 *buf, int num)
{
	__u32	pad, val;
	int	i;
	const signed char *scp = (const signed char *) msg;

	pad = (__u32)len | ((__u32)len << 8);
	pad |= pad << 16;

	val = pad;
	if (len > num*4)
		len = num * 4;
	for (i = 0; i < len; i++) {
		if ((i % 4) == 0)
			val = pad;
		val = ((int) scp[i]) + (val << 8);
		if ((i % 4) == 3) {
			*buf++ = val;
			val = pad;
			num--;
		}
	}
	if (--num >= 0)
		*buf++ = val;
	while (--num >= 0)
		*buf++ = pad;
}
}
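
Why separate signed and unsigned variants are needed can be seen in a userspace copy of the legacy hash. C leaves the signedness of plain char to the platform, so a name containing a byte above 0x7f presumably hashed differently depending on the architecture that created the directory. The sketch below restates the two loops with stdint types; the function names are hypothetical, and it is meant only to show the divergence, not to replace the kernel code.

```c
#include <stdint.h>

/* Userspace restatement of the legacy hash, bytes read as unsigned. */
static uint32_t legacy_hash_unsigned(const char *name, int len)
{
	uint32_t hash, hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
	const unsigned char *ucp = (const unsigned char *)name;

	while (len--) {
		hash = hash1 + (hash0 ^ (uint32_t)(((int)*ucp++) * 7152373));
		if (hash & 0x80000000)
			hash -= 0x7fffffff;
		hash1 = hash0;
		hash0 = hash;
	}
	return hash0 << 1;
}

/* Same loop, but bytes above 0x7f are read as negative values. */
static uint32_t legacy_hash_signed(const char *name, int len)
{
	uint32_t hash, hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
	const signed char *scp = (const signed char *)name;

	while (len--) {
		hash = hash1 + (hash0 ^ (uint32_t)(((int)*scp++) * 7152373));
		if (hash & 0x80000000)
			hash -= 0x7fffffff;
		hash1 = hash0;
		hash0 = hash;
	}
	return hash0 << 1;
}
```

For a pure ASCII name such as "hello" both functions agree; for a one-byte name 0xe9 they return different values, so a directory index built with one interpretation cannot be searched with the other.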


static void str2hashbuf(const char *msg, int len, __u32 *buf, int num)
static void str2hashbuf_unsigned(const char *msg, int len, __u32 *buf, int num)
{
{
	__u32	pad, val;
	__u32	pad, val;
	int	i;
	int	i;
	const unsigned char *ucp = (const unsigned char *) msg;


	pad = (__u32)len | ((__u32)len << 8);
	pad = (__u32)len | ((__u32)len << 8);
	pad |= pad << 16;
	pad |= pad << 16;
@@ -62,7 +110,7 @@ static void str2hashbuf(const char *msg, int len, __u32 *buf, int num)
	for (i=0; i < len; i++) {
	for (i=0; i < len; i++) {
		if ((i % 4) == 0)
		if ((i % 4) == 0)
			val = pad;
			val = pad;
		val = msg[i] + (val << 8);
		val = ((int) ucp[i]) + (val << 8);
		if ((i % 4) == 3) {
		if ((i % 4) == 3) {
			*buf++ = val;
			*buf++ = val;
			val = pad;
			val = pad;
@@ -95,6 +143,8 @@ int ext3fs_dirhash(const char *name, int len, struct dx_hash_info *hinfo)
	const char	*p;
	const char	*p;
	int		i;
	int		i;
	__u32		in[8], buf[4];
	__u32		in[8], buf[4];
	void		(*str2hashbuf)(const char *, int, __u32 *, int) =
				str2hashbuf_signed;


	/* Initialize the default seed for the hash checksum functions */
	/* Initialize the default seed for the hash checksum functions */
	buf[0] = 0x67452301;
	buf[0] = 0x67452301;
@@ -113,13 +163,18 @@ int ext3fs_dirhash(const char *name, int len, struct dx_hash_info *hinfo)
	}
	}


	switch (hinfo->hash_version) {
	switch (hinfo->hash_version) {
	case DX_HASH_LEGACY_UNSIGNED:
		hash = dx_hack_hash_unsigned(name, len);
		break;
	case DX_HASH_LEGACY:
	case DX_HASH_LEGACY:
		hash = dx_hack_hash(name, len);
		hash = dx_hack_hash_signed(name, len);
		break;
		break;
	case DX_HASH_HALF_MD4_UNSIGNED:
		str2hashbuf = str2hashbuf_unsigned;
	case DX_HASH_HALF_MD4:
	case DX_HASH_HALF_MD4:
		p = name;
		p = name;
		while (len > 0) {
		while (len > 0) {
			str2hashbuf(p, len, in, 8);
			(*str2hashbuf)(p, len, in, 8);
			half_md4_transform(buf, in);
			half_md4_transform(buf, in);
			len -= 32;
			len -= 32;
			p += 32;
			p += 32;
@@ -127,10 +182,12 @@ int ext3fs_dirhash(const char *name, int len, struct dx_hash_info *hinfo)
		minor_hash = buf[2];
		minor_hash = buf[2];
		hash = buf[1];
		hash = buf[1];
		break;
		break;
	case DX_HASH_TEA_UNSIGNED:
		str2hashbuf = str2hashbuf_unsigned;
	case DX_HASH_TEA:
	case DX_HASH_TEA:
		p = name;
		p = name;
		while (len > 0) {
		while (len > 0) {
			str2hashbuf(p, len, in, 4);
			(*str2hashbuf)(p, len, in, 4);
			TEA_transform(buf, in);
			TEA_transform(buf, in);
			len -= 16;
			len -= 16;
			p += 16;
			p += 16;
+9 −2
@@ -364,6 +364,8 @@ dx_probe(struct qstr *entry, struct inode *dir,
		goto fail;
		goto fail;
	}
	}
	hinfo->hash_version = root->info.hash_version;
	hinfo->hash_version = root->info.hash_version;
	if (hinfo->hash_version <= DX_HASH_TEA)
		hinfo->hash_version += EXT3_SB(dir->i_sb)->s_hash_unsigned;
	hinfo->seed = EXT3_SB(dir->i_sb)->s_hash_seed;
	hinfo->seed = EXT3_SB(dir->i_sb)->s_hash_seed;
	if (entry)
	if (entry)
		ext3fs_dirhash(entry->name, entry->len, hinfo);
		ext3fs_dirhash(entry->name, entry->len, hinfo);
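
The `hash_version += s_hash_unsigned` adjustment works only if the unsigned variants sit at a fixed offset from the signed ones. A minimal sketch of that arithmetic follows; the numeric values (signed versions 0..2, unsigned versions 3..5, offset 0 or 3) are assumptions, since the header defining them is not part of the hunks shown here, and all names are hypothetical.

```c
/* Assumed numbering: each unsigned variant = signed variant + 3. */
enum dx_hash_sketch {
	SK_DX_HASH_LEGACY		= 0,
	SK_DX_HASH_HALF_MD4		= 1,
	SK_DX_HASH_TEA			= 2,
	SK_DX_HASH_LEGACY_UNSIGNED	= 3,
	SK_DX_HASH_HALF_MD4_UNSIGNED	= 4,
	SK_DX_HASH_TEA_UNSIGNED		= 5,
};

/*
 * hash_unsigned_offset plays the role of s_hash_unsigned: 0 for a
 * filesystem whose hashes were computed with signed chars, 3 for
 * unsigned (assumption).  Versions above TEA are left untouched.
 */
static int effective_hash_version(int on_disk_version,
				  int hash_unsigned_offset)
{
	if (on_disk_version <= SK_DX_HASH_TEA)
		return on_disk_version + hash_unsigned_offset;
	return on_disk_version;
}
```

The `<= DX_HASH_TEA` guard in the kernel code serves the same purpose as the check here: a version that is already an unsigned variant is not shifted a second time.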
@@ -632,6 +634,9 @@ int ext3_htree_fill_tree(struct file *dir_file, __u32 start_hash,
	dir = dir_file->f_path.dentry->d_inode;
	dir = dir_file->f_path.dentry->d_inode;
	if (!(EXT3_I(dir)->i_flags & EXT3_INDEX_FL)) {
	if (!(EXT3_I(dir)->i_flags & EXT3_INDEX_FL)) {
		hinfo.hash_version = EXT3_SB(dir->i_sb)->s_def_hash_version;
		hinfo.hash_version = EXT3_SB(dir->i_sb)->s_def_hash_version;
		if (hinfo.hash_version <= DX_HASH_TEA)
			hinfo.hash_version +=
				EXT3_SB(dir->i_sb)->s_hash_unsigned;
		hinfo.seed = EXT3_SB(dir->i_sb)->s_hash_seed;
		hinfo.seed = EXT3_SB(dir->i_sb)->s_hash_seed;
		count = htree_dirblock_to_tree(dir_file, dir, 0, &hinfo,
		count = htree_dirblock_to_tree(dir_file, dir, 0, &hinfo,
					       start_hash, start_minor_hash);
					       start_hash, start_minor_hash);
@@ -1152,9 +1157,9 @@ static struct ext3_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
	u32 hash2;
	u32 hash2;
	struct dx_map_entry *map;
	struct dx_map_entry *map;
	char *data1 = (*bh)->b_data, *data2;
	char *data1 = (*bh)->b_data, *data2;
	unsigned split, move, size, i;
	unsigned split, move, size;
	struct ext3_dir_entry_2 *de = NULL, *de2;
	struct ext3_dir_entry_2 *de = NULL, *de2;
	int	err = 0;
	int	err = 0, i;


	bh2 = ext3_append (handle, dir, &newblock, &err);
	bh2 = ext3_append (handle, dir, &newblock, &err);
	if (!(bh2)) {
	if (!(bh2)) {
@@ -1394,6 +1399,8 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,


	/* Initialize as for dx_probe */
	/* Initialize as for dx_probe */
	hinfo.hash_version = root->info.hash_version;
	hinfo.hash_version = root->info.hash_version;
	if (hinfo.hash_version <= DX_HASH_TEA)
		hinfo.hash_version += EXT3_SB(dir->i_sb)->s_hash_unsigned;
	hinfo.seed = EXT3_SB(dir->i_sb)->s_hash_seed;
	hinfo.seed = EXT3_SB(dir->i_sb)->s_hash_seed;
	ext3fs_dirhash(name, namelen, &hinfo);
	ext3fs_dirhash(name, namelen, &hinfo);
	frame = frames;
	frame = frames;