
Commit 10f3e23f authored by Linus Torvalds
Pull ext4 updates from Ted Ts'o:

 - Convert content from the ext4 wiki to Documentation rst files so it
   is more likely to be updated as we add new features to ext4.

 - Add 64-bit timestamp support to ext4's superblock fields.

 - ... and the usual bug fixes and cleanups, including a Spectre gadget
   fixup and some hardening against maliciously corrupted file systems.

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (34 commits)
  ext4: remove unneeded variable "err" in ext4_mb_release_inode_pa()
  ext4: improve code readability in ext4_iget()
  ext4: fix spectre gadget in ext4_mb_regular_allocator()
  ext4: check for NUL characters in extended attribute's name
  ext4: use ext4_warning() for sb_getblk failure
  ext4: fix race when setting the bitmap corrupted flag
  ext4: reset error code in ext4_find_entry in fallback
  ext4: handle layout changes to pinned DAX mappings
  dax: dax_layout_busy_page() warn on !exceptional
  docs: fix up the obviously obsolete bits in the new ext4 documentation
  docs: add new ext4 superblock time extension fields
  docs: create filesystem internal section
  ext4: use swap macro in mext_page_double_lock
  ext4: check allocation failure when duplicating "data" in ext4_remount()
  ext4: fix warning message in ext4_enable_quotas()
  ext4: super: extend timestamps to 40 bits
  jbd2: replace current_kernel_time64 with ktime equivalent
  ext4: use timespec64 for all inode times
  ext4: use ktime_get_real_seconds for i_dtime
  ext4: use 64-bit timestamps for mmp_time
  ...
parents 3bb37da5 863c37fc
+1 −1
@@ -34,7 +34,7 @@ needs_sphinx = '1.3'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include', 'cdomain', 'kfigure']
extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include', 'cdomain', 'kfigure', 'sphinx.ext.ifconfig']

# The name of the math extension changed on Sphinx 1.4
if major == 1 and minor > 3:
+64 −78
.. SPDX-License-Identifier: GPL-2.0

Ext4 Filesystem
===============
========================
General Information
========================

Ext4 is an advanced level of the ext3 filesystem which incorporates
scalability and reliability enhancements for supporting large filesystems
@@ -11,31 +13,24 @@ Mailing list: linux-ext4@vger.kernel.org
Web site:	http://ext4.wiki.kernel.org


1. Quick usage instructions:
===========================
Quick usage instructions
========================

Note: More extensive information for getting started with ext4 can be
found at the ext4 wiki site at the URL:
http://ext4.wiki.kernel.org/index.php/Ext4_Howto

  - Compile and install the latest version of e2fsprogs (as of this
    writing version 1.41.3) from:
  - The latest version of e2fsprogs can be found at:

    http://sourceforge.net/project/showfiles.php?group_id=2406
    https://www.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/

	or

    https://www.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
    http://sourceforge.net/project/showfiles.php?group_id=2406

	or grab the latest git repository from:

    git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git

  - Note that it is highly important to install the mke2fs.conf file
    that comes with the e2fsprogs 1.41.x sources in /etc/mke2fs.conf. If
    you have edited the /etc/mke2fs.conf file installed on your system,
    you will need to merge your changes with the version from e2fsprogs
    1.41.x.
   https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git

  - Create a new filesystem using the ext4 filesystem type:

@@ -50,10 +45,6 @@ Note: More extensive information for getting started with ext4 can be

        # tune2fs -I 256 /dev/hda1

    (Note: we currently do not have tools to convert an ext4
    filesystem back to ext3; so please do not try this on production
    filesystems.)

  - Mounting:

	# mount -t ext4 /dev/hda1 /wherever
@@ -75,10 +66,11 @@ Note: More extensive information for getting started with ext4 can be
    the filesystem with a large journal can also be helpful for
    metadata-intensive workloads.

2. Features
===========
Features
========

2.1 Currently available
Currently Available
-------------------

* ability to use filesystems > 16TB (e2fsprogs support not available yet)
* extent format reduces metadata overhead (RAM, IO for access, transactions)
@@ -103,31 +95,15 @@ Note: More extensive information for getting started with ext4 can be
[1] Filesystems with a block size of 1k may see a limit imposed by the
directory hash tree having a maximum depth of two.

2.2 Candidate features for future inclusion

* online defrag (patches available but not well tested)
* reduced mke2fs time via lazy itable initialization in conjunction with
  the uninit_bg feature (capability to do this is available in e2fsprogs
  but a kernel thread to do lazy zeroing of unused inode table blocks
  after filesystem is first mounted is required for safety)

There are several others under discussion, whether they all make it in is
partly a function of how much time everyone has to work on them. Features like
metadata checksumming have been discussed and planned for a bit but no patches
exist yet so I'm not sure they're in the near-term roadmap.

The big performance win will come with mballoc, delalloc and flex_bg
grouping of bitmaps and inode tables.  Some test results available here:

 - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-write-2.6.27-rc1.html
 - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-readwrite-2.6.27-rc1.html

3. Options
==========
Options
=======

When mounting an ext4 filesystem, the following options are accepted:
(*) == default

======================= =======================================================
Mount Option            Description
======================= =======================================================
ro                   	Mount filesystem read only. Note that ext4 will
                     	replay the journal (and thus write to the
                     	partition) even when mounted "read only". The
@@ -387,12 +363,14 @@ i_version Enable 64-bit inode version support. This option is
dax			Use direct access (no page cache).  See
			Documentation/filesystems/dax.txt.  Note that
			this option is incompatible with data=journal.
======================= =======================================================

Data Mode
=========
There are 3 different data modes:

* writeback mode

  In data=writeback mode, ext4 does not journal data at all.  This mode provides
  a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
  mode - metadata journaling.  A crash+recovery can cause incorrect data to
@@ -400,20 +378,23 @@ appear in files which were written shortly before the crash. This mode will
  typically provide the best ext4 performance.

* ordered mode

  In data=ordered mode, ext4 only officially journals metadata, but it logically
groups metadata information related to data changes with the data blocks into a
single unit called a transaction.  When it's time to write the new metadata
out to disk, the associated data blocks are written first.  In general,
this mode performs slightly slower than writeback but significantly faster than journal mode.
  groups metadata information related to data changes with the data blocks into
  a single unit called a transaction.  When it's time to write the new metadata
  out to disk, the associated data blocks are written first.  In general, this
  mode performs slightly slower than writeback but significantly faster than
  journal mode.

* journal mode

  data=journal mode provides full data and metadata journaling.  All new data is
written to the journal first, and then to its final location.
In the event of a crash, the journal can be replayed, bringing both data and
metadata into a consistent state.  This mode is the slowest except when data
needs to be read from and written to disk at the same time where it
outperforms all other modes.  Enabling this mode will disable delayed
allocation and O_DIRECT support.
  written to the journal first, and then to its final location.  In the event of
  a crash, the journal can be replayed, bringing both data and metadata into a
  consistent state.  This mode is the slowest except when data needs to be read
  from and written to disk at the same time, where it outperforms all other
  modes.  Enabling this mode will disable delayed allocation and O_DIRECT
  support.

/proc entries
=============
@@ -425,10 +406,12 @@ Information about mounted ext4 file systems can be found in
in the table below.

Files in /proc/fs/ext4/<devname>
..............................................................................

================ =======
 File            Content
================ =======
 mb_groups       details of multiblock allocator buddy cache of free blocks
..............................................................................
================ =======

/sys entries
============
@@ -439,11 +422,13 @@ Information about mounted ext4 file systems can be found in
/sys/fs/ext4/dm-0).   The files in each per-device directory are shown
in the table below.

Files in /sys/fs/ext4/<devname>
Files in /sys/fs/ext4/<devname>:

(see also Documentation/ABI/testing/sysfs-fs-ext4)
..............................................................................
 File                         Content

============================= =================================================
File                          Content
============================= =================================================
 delayed_allocation_blocks    This file is read-only and shows the number of
                              blocks that are dirty in the page cache, but
                              which do not have their location in the
@@ -508,7 +493,7 @@ Files in /sys/fs/ext4/<devname>
                              in the file system. If there is not enough space
                              for the reserved space when mounting the file
                              system, the mount will _not_ fail.
..............................................................................
============================= =================================================
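
The per-device files listed in the table above are plain text and can be read
like any other file. The following is a minimal sketch, not part of the ext4
documentation itself: the helper names ``list_ext4_devices`` and
``read_ext4_attr`` are hypothetical, and it assumes the directory layout
described above (``/sys/fs/ext4/<devname>/<file>``).

```python
from pathlib import Path

# Base directory described in the text; each mounted ext4 file system
# has a subdirectory here named after its device (e.g. "dm-0").
SYSFS_EXT4 = Path("/sys/fs/ext4")


def list_ext4_devices(root=SYSFS_EXT4):
    """Return the subdirectory names under root, or [] if root is missing."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


def read_ext4_attr(devname, attr, root=SYSFS_EXT4):
    """Read one attribute file and return its text without the newline."""
    return (Path(root) / devname / attr).read_text().strip()


# Print delayed_allocation_blocks for every directory found. Directories
# that do not contain the file (or are unreadable) are reported, not fatal.
for dev in list_ext4_devices():
    try:
        print(dev, read_ext4_attr(dev, "delayed_allocation_blocks"))
    except OSError as exc:
        print(dev, "unreadable:", exc)
```

The ``root`` parameter exists only so the helpers can be pointed at a test
directory instead of the real sysfs tree.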

Ioctls
======
@@ -518,8 +503,10 @@ through the system call interfaces. The list of all Ext4 specific ioctls are
shown in the table below.

Table of Ext4 specific ioctls
..............................................................................

============================= =================================================
Ioctl			      Description
============================= =================================================
 EXT4_IOC_GETFLAGS	      Get additional attributes associated with inode.
			      The ioctl argument is an integer bitfield, with
			      bit values described in ext4.h. This ioctl is an
@@ -610,8 +597,7 @@ Table of Ext4 specific ioctls
			      normal user by accident.
			      The data blocks of the previous boot loader
			      will be associated with the given inode.

..............................................................................
============================= =================================================

References
==========
+17 −0
.. SPDX-License-Identifier: GPL-2.0

===============
ext4 Filesystem
===============

General usage and on-disk artifacts written by ext4.  More documentation may
be ported from the wiki as time permits.  This should be considered the
canonical source of information as the details here have been reviewed by
the ext4 community.

.. toctree::
   :maxdepth: 5
   :numbered:

   ext4
   ondisk/index
+44 −0
.. SPDX-License-Identifier: GPL-2.0

About this Book
===============

This document attempts to describe the on-disk format for ext4
filesystems. The same general ideas should apply to ext2/3 filesystems
as well, though they do not support all the features that ext4 supports,
and the fields will be shorter.

**NOTE**: This is a work in progress, based on notes that the author
(djwong) made while picking apart a filesystem by hand. The data
structure definitions should be current as of Linux 4.18 and
e2fsprogs-1.44. All comments and corrections are welcome, since there is
undoubtedly plenty of lore that might not be reflected in freshly
created demonstration filesystems.

License
-------
This book is licensed under the terms of the GNU General Public License, v2.

Terminology
-----------

ext4 divides a storage device into an array of logical blocks both to
reduce bookkeeping overhead and to increase throughput by forcing larger
transfer sizes. Generally, the block size will be 4KiB (the same size as
pages on x86 and the block layer's default block size), though the
actual size is calculated as 2 ^ (10 + ``sb.s_log_block_size``) bytes.
Throughout this document, disk locations are given in terms of these
logical blocks, not raw LBAs, and not 1024-byte blocks. For the sake of
convenience, the logical block size will be referred to as
``$block_size`` throughout the rest of the document.
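
As a rough illustration of the formula above, the sketch below computes the
logical block size from a given ``s_log_block_size`` value. The function name
``ext4_block_size`` is hypothetical and chosen only for this example; reading
the field out of a real superblock is not shown.

```python
def ext4_block_size(s_log_block_size: int) -> int:
    """Return the logical block size in bytes: 2 ^ (10 + s_log_block_size)."""
    if s_log_block_size < 0:
        raise ValueError("s_log_block_size must be non-negative")
    return 1 << (10 + s_log_block_size)


# Tabulate the first few values of the field.
for log_size in range(3):
    print(f"s_log_block_size={log_size}: {ext4_block_size(log_size)} bytes")
```

A value of 0 gives 1024-byte blocks, and a value of 2 gives the common 4KiB
block size mentioned above.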

When referenced in ``preformatted text`` blocks, ``sb`` refers to fields
in the super block, and ``inode`` refers to fields in an inode table
entry.

Other References
----------------

Also see http://www.nongnu.org/ext2-doc/ for quite a collection of
information about ext2/3. Here's another old reference:
http://wiki.osdev.org/Ext2
+56 −0
.. SPDX-License-Identifier: GPL-2.0

Block and Inode Allocation Policy
---------------------------------

ext4 recognizes (better than ext3, anyway) that data locality is
generally a desirable quality of a filesystem. On a spinning disk,
keeping related blocks near each other reduces the amount of movement
that the head actuator and disk must perform to access a data block,
thus speeding up disk IO. On an SSD there of course are no moving parts,
but locality can increase the size of each transfer request while
reducing the total number of requests. This locality may also have the
effect of concentrating writes on a single erase block, which can speed
up file rewrites significantly. Therefore, it is useful to reduce
fragmentation whenever possible.

The first tool that ext4 uses to combat fragmentation is the multi-block
allocator. When a file is first created, the block allocator
speculatively allocates 8KiB of disk space to the file on the assumption
that the space will get written soon. When the file is closed, the
unused speculative allocations are of course freed, but if the
speculation is correct (typically the case for full writes of small
files) then the file data gets written out in a single multi-block
extent. A second related trick that ext4 uses is delayed allocation.
Under this scheme, when a file needs more blocks to absorb file writes,
the filesystem defers deciding the exact placement on the disk until all
the dirty buffers are being written out to disk. By not committing to a
particular placement until it's absolutely necessary (the commit timeout
is hit, or sync() is called, or the kernel runs out of memory), the hope
is that the filesystem can make better location decisions.

The third trick that ext4 (and ext3) uses is that it tries to keep a
file's data blocks in the same block group as its inode. This cuts down
on the seek penalty when the filesystem first has to read a file's inode
to learn where the file's data blocks live and then seek over to the
file's data blocks to begin I/O operations.

The fourth trick is that all the inodes in a directory are placed in the
same block group as the directory, when feasible. The working assumption
here is that all the files in a directory might be related, therefore it
is useful to try to keep them all together.

The fifth trick is that the disk volume is cut up into 128MiB block
groups; these mini-containers are used as outlined above to try to
maintain data locality. However, there is a deliberate quirk -- when a
directory is created in the root directory, the inode allocator scans
the block groups and puts that directory into the least heavily loaded
block group that it can find. This encourages directories to spread out
over a disk; as the top-level directory/file blobs fill up one block
group, the allocators simply move on to the next block group. Allegedly
this scheme evens out the loading on the block groups, though the author
suspects that the directories which are so unlucky as to land towards
the end of a spinning drive get a raw deal performance-wise.
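
The 128MB figure can be reproduced with a little arithmetic. The sketch below
assumes the common layout in which each block group is described by a block
bitmap that occupies exactly one block, so a group holds 8 * block_size
blocks; this assumption and the function names are illustrative, and
features that change the group size (such as bigalloc) are ignored.

```python
def blocks_per_group(block_size: int) -> int:
    """One bitmap block has 8 * block_size bits, one bit per block."""
    return 8 * block_size


def block_group_bytes(block_size: int) -> int:
    """Size of one block group in bytes under the one-bitmap-block assumption."""
    return blocks_per_group(block_size) * block_size


for size in (1024, 2048, 4096):
    mib = block_group_bytes(size) // (1024 * 1024)
    print(f"block size {size}: {blocks_per_group(size)} blocks, {mib} MiB per group")
```

With 4KiB blocks this works out to 32768 blocks, or 128MiB, per group; smaller
block sizes give proportionally smaller groups.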

Of course if all of these mechanisms fail, one can always use e4defrag
to defragment files.