- File systems manage data structures that must persist on disk
- Updates to these structures involve multiple disk writes
- A crash or power loss can interrupt these writes mid-operation
- This can leave the file system in an inconsistent state
- The crash consistency problem: how to update the disk despite crashes
- Appending a block of data to a file requires three updates:
  - Inode: update the file size and add the new block pointer
  - Data bitmap: mark the new block as allocated
  - New data block: write the actual data
- Possible crash scenarios (see the sketch after this list):
  - Only the data block written: not a problem; it is as if the write never occurred
  - Only the inode written: it points to garbage data, and the inode and bitmap now disagree
  - Only the bitmap written: space leak; the block is marked used but nothing references it
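To make the crash windows concrete, here is a minimal C sketch of the three writes, assuming a toy inode/bitmap layout and a hypothetical `disk_write` helper (neither reflects a real file system's API). Because the disk may commit the writes in any order, a crash can leave any subset of them on disk:

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 4096

/* Simplified, hypothetical on-disk structures -- not a real layout. */
struct inode {
    uint32_t size;                  /* file size in bytes    */
    uint32_t block_ptrs[12];        /* direct block pointers */
};

static uint8_t bitmap[BLOCK_SIZE];  /* data-allocation bitmap */

/* Stand-in for a real block write. */
static void disk_write(uint32_t block_no, const void *buf) {
    (void)buf;
    printf("write block %u\n", block_no);
}

/* Appending one block issues three logically separate writes; the disk
   may persist them in any order, giving the crash scenarios above. */
static void append_block(struct inode *ino, uint32_t inode_blk,
                         uint32_t bitmap_blk, uint32_t new_blk,
                         const uint8_t data[BLOCK_SIZE]) {
    /* Write 1: the data block itself (alone on disk: harmless). */
    disk_write(new_blk, data);

    /* Write 2: mark the block allocated (alone on disk: space leak). */
    bitmap[new_blk / 8] |= (uint8_t)(1u << (new_blk % 8));
    disk_write(bitmap_blk, bitmap);

    /* Write 3: point the inode at the block and grow the file (alone
       on disk: inode points at garbage and disagrees with bitmap). */
    ino->block_ptrs[ino->size / BLOCK_SIZE] = new_blk;
    ino->size += BLOCK_SIZE;
    disk_write(inode_blk, ino);
}

int main(void) {
    struct inode ino = {0};
    static const uint8_t data[BLOCK_SIZE] = {0};
    append_block(&ino, 2, 3, 100, data);
    return 0;
}
```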
- The file system checker (fsck):
  - Runs at boot time after a crash
  - Scans the entire file system to find and repair inconsistencies
  - Checks performed:
    - Superblock: sanity checks; may use a backup superblock if the primary is corrupt
    - Free blocks: scans inodes to rebuild correct allocation bitmaps (sketched below)
    - Inode state: verifies each inode for corruption; clears bad inodes
    - Inode links: verifies and corrects link counts for all files
    - Duplicates: checks for and resolves duplicate block pointers
    - Bad blocks: removes pointers to blocks outside the valid range
    - Directory integrity: verifies directory structure and entries
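As a concrete example of one pass, here is a sketch of how fsck might rebuild the allocation bitmap by treating the inodes as the source of truth; the structures and sizes are hypothetical simplifications (direct pointers only):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_INODES     128
#define NUM_BLOCKS     1024
#define PTRS_PER_INODE 12

/* Hypothetical simplified inode. */
struct inode {
    uint8_t  in_use;
    uint32_t block_ptrs[PTRS_PER_INODE];
};

/* Free-block pass: rebuild the bitmap to match the inodes. A block the
   old bitmap claimed was used but no inode references (a space leak)
   is thereby freed; a referenced block the bitmap missed is re-marked. */
static void rebuild_bitmap(const struct inode inodes[NUM_INODES],
                           uint8_t bitmap[NUM_BLOCKS / 8]) {
    memset(bitmap, 0, NUM_BLOCKS / 8);
    for (int i = 0; i < NUM_INODES; i++) {
        if (!inodes[i].in_use)
            continue;
        for (int p = 0; p < PTRS_PER_INODE; p++) {
            uint32_t b = inodes[i].block_ptrs[p];
            if (b != 0 && b < NUM_BLOCKS)   /* skip unset/out-of-range */
                bitmap[b / 8] |= (uint8_t)(1u << (b % 8));
        }
    }
}

int main(void) {
    static struct inode inodes[NUM_INODES];
    static uint8_t bitmap[NUM_BLOCKS / 8];
    inodes[0].in_use = 1;
    inodes[0].block_ptrs[0] = 100;
    rebuild_bitmap(inodes, bitmap);
    printf("block 100 allocated: %d\n", (bitmap[100 / 8] >> (100 % 8)) & 1);
    return 0;
}
```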
  - Problems with fsck:
    - Very slow on large disks (a full scan can take hours)
    - Scans the entire disk even to recover from a small update
    - Complex to implement correctly for all cases
- Journaling:
  - An idea borrowed from database systems (write-ahead logging)
  - Write a note of pending updates to a journal (log) before modifying the file system proper
  - If a crash occurs, replay journal entries to finish interrupted updates
  - Much faster recovery than fsck: only the log must be examined, not the whole disk
  - Data journaling:
    - Journals both metadata and data updates
    - Protocol (see the sketch after this list):
      - Journal write: write TxB, metadata, and data to the log; wait for these writes to complete
      - Journal commit: write the transaction end block (TxE) to the log; wait for it to complete
      - Checkpoint: write the updates to their final file system locations
      - Free: mark the transaction free in the journal superblock
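The four steps can be sketched as follows; `disk_write` and `disk_barrier` are hypothetical stand-ins for real block I/O and cache-flush (write barrier) operations, and the log layout is a simplification:

```c
#include <stdint.h>
#include <stdio.h>

/* Stub I/O helpers: write one block; wait until all previously issued
   writes are durable (a cache flush / write barrier). */
static void disk_write(uint32_t blk, const void *buf) {
    (void)buf;
    printf("write block %u\n", blk);
}
static void disk_barrier(void) { printf("-- barrier --\n"); }

/* Data journaling: metadata AND data go through the log first. */
static void journal_txn(uint32_t log, const void *txb,
                        const void *inode_blk, uint32_t inode_dst,
                        const void *bmap_blk,  uint32_t bmap_dst,
                        const void *data_blk,  uint32_t data_dst,
                        const void *txe) {
    /* 1. Journal write: TxB plus the transaction's blocks. */
    disk_write(log + 0, txb);
    disk_write(log + 1, inode_blk);
    disk_write(log + 2, bmap_blk);
    disk_write(log + 3, data_blk);
    disk_barrier();                 /* wait: contents fully on disk */

    /* 2. Journal commit: TxE atomically marks the txn valid. */
    disk_write(log + 4, txe);
    disk_barrier();

    /* 3. Checkpoint: copy the updates to their final locations. */
    disk_write(inode_dst, inode_blk);
    disk_write(bmap_dst,  bmap_blk);
    disk_write(data_dst,  data_blk);
    disk_barrier();

    /* 4. Free: mark the txn free in the journal superblock (omitted). */
}

int main(void) {
    uint8_t b[4096] = {0};
    journal_txn(0, b, b, 56, b, 8, b, 100, b);
    return 0;
}
```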
  - Metadata (ordered) journaling:
    - Journals only metadata updates
    - Data is written directly to its final file system location
    - Reduces write traffic compared to data journaling
    - More commonly used in practice (e.g., ext3, NTFS, XFS)
    - Protocol (see the sketch after this list):
      - Data write: write data to its final location; wait for completion
      - Journal metadata write: write the begin block (TxB) and metadata to the log
      - Journal commit: write the transaction end block (TxE) to the log; wait for it to complete
      - Checkpoint metadata: write the metadata to its final locations
      - Free: mark the transaction free in the journal superblock
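The same hypothetical stubs give a sketch of the ordered protocol. This version conservatively serializes every step with a barrier; a real implementation may overlap the data write with the journal write, as long as both complete before the commit:

```c
#include <stdint.h>
#include <stdio.h>

static void disk_write(uint32_t blk, const void *buf) {
    (void)buf;
    printf("write block %u\n", blk);
}
static void disk_barrier(void) { printf("-- barrier --\n"); }

/* Ordered metadata journaling: the data block goes straight to its
   final location BEFORE commit, so committed metadata can never end up
   pointing at garbage; only metadata passes through the log. */
static void journal_meta_txn(uint32_t log, const void *txb,
                             const void *inode_blk, uint32_t inode_dst,
                             const void *bmap_blk,  uint32_t bmap_dst,
                             const void *data_blk,  uint32_t data_dst,
                             const void *txe) {
    /* 1. Data write: final location first. */
    disk_write(data_dst, data_blk);
    disk_barrier();

    /* 2. Journal metadata write: TxB plus metadata blocks only. */
    disk_write(log + 0, txb);
    disk_write(log + 1, inode_blk);
    disk_write(log + 2, bmap_blk);
    disk_barrier();

    /* 3. Journal commit. */
    disk_write(log + 3, txe);
    disk_barrier();

    /* 4. Checkpoint metadata, then 5. free the txn (omitted). */
    disk_write(inode_dst, inode_blk);
    disk_write(bmap_dst,  bmap_blk);
}

int main(void) {
    uint8_t b[4096] = {0};
    journal_meta_txn(0, b, b, 56, b, 8, b, 100, b);
    return 0;
}
```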
    - Block reuse issues: replaying the log could overwrite a block that was freed and reused by a later transaction; solved with revoke records, which tell recovery not to replay the stale entries
  - Further journaling details:
    - The ordering of writes (journal contents, then commit, then checkpoint) is critical for consistency
    - Batching many updates into larger transactions amortizes journaling overhead
    - Checksum-based optimization allows faster commits (sketched below)
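A sketch of the checksum idea: the commit block carries a checksum over the transaction's contents, so the journal write and commit can be issued as one batch, and recovery simply discards any transaction whose checksum does not match. The checksum function below is a toy stand-in (real systems use something stronger, such as CRC32C):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Toy checksum, purely for illustration. */
static uint32_t checksum(const void *buf, size_t len) {
    const uint8_t *p = buf;
    uint32_t sum = 0;
    while (len--)
        sum = sum * 31 + *p++;
    return sum;
}

/* Hypothetical commit block carrying a checksum of the txn contents. */
struct tx_end {
    uint32_t tid;
    uint32_t csum;
};

/* With the checksum in the commit block, the whole transaction can be
   issued without a barrier between journal write and commit: recovery
   replays a transaction only if the recomputed checksum matches, so a
   torn (partially written) transaction is discarded. */
static int txn_valid(const struct tx_end *end,
                     const void *logged, size_t len) {
    return checksum(logged, len) == end->csum;
}

int main(void) {
    uint8_t contents[64] = {1, 2, 3};
    struct tx_end end = {7, checksum(contents, sizeof contents)};
    printf("valid: %d\n", txn_valid(&end, contents, sizeof contents));
    return 0;
}
```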
- Other approaches:
  - Soft updates:
    - Carefully order all writes to maintain consistency (e.g., write a pointed-to data block before the inode that points to it)
    - Never leaves on-disk structures in an inconsistent state
    - Complex to implement; requires deep knowledge of file system structures
  - Copy-on-write (COW):
    - Never overwrites data or metadata in place
    - Writes updates to new, unused locations on disk
    - Atomically switches to the new version by updating the root structure (see the sketch below)
    - Used in ZFS and Btrfs
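A minimal in-memory sketch of the COW idea, assuming a one-level tree and a single root pointer; on disk, the final pointer switch corresponds to one atomic block write (e.g., of the ZFS uberblock):

```c
#include <stdio.h>
#include <stdlib.h>

/* Simplified one-level "tree": a root block pointing at one leaf.
   Blocks are never modified in place. */
struct block {
    char data[64];
    struct block *child;
};

static struct block *root;   /* the one pointer that gets switched */

/* Copy-on-write update: make new copies of the leaf and of every block
   on the path to the root, then switch the root pointer in one atomic
   step. A crash before the switch leaves the old, consistent version
   fully intact; error handling is omitted for brevity. */
static void cow_update(const char *new_data) {
    struct block *new_leaf = malloc(sizeof *new_leaf);
    *new_leaf = *root->child;                        /* copy old leaf */
    snprintf(new_leaf->data, sizeof new_leaf->data, "%s", new_data);

    struct block *new_root = malloc(sizeof *new_root);
    *new_root = *root;                               /* copy old root */
    new_root->child = new_leaf;

    /* On disk, this assignment is a single atomic root-block write. */
    root = new_root;
}

int main(void) {
    struct block leaf = {"v1", NULL};
    struct block top  = {"root", &leaf};
    root = &top;
    cow_update("v2");
    printf("current: %s\n", root->child->data);
    return 0;
}
```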
  - Backpointer-based consistency:
    - Adds a backpointer to every block (e.g., each data block points back to the inode that owns it)
    - Checks consistency on access by verifying that the forward and backward pointers agree (see the sketch below)
    - Allows lazy, access-time consistency checking instead of a full scan
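A sketch of the check itself, assuming a hypothetical block layout in which each data block records the inode number and pointer slot that own it:

```c
#include <stdint.h>
#include <stdio.h>

#define PTRS 12

/* Hypothetical layout: each data block stores a backpointer naming the
   inode (and pointer slot) that owns it. */
struct data_block {
    uint32_t back_inum;     /* owning inode number      */
    uint32_t back_slot;     /* which block pointer slot */
    uint8_t  payload[4088];
};

struct inode {
    uint32_t inum;
    uint32_t block_ptrs[PTRS];
};

/* On access, verify that forward and backward pointers agree: the
   inode points at the block, and the block points back at that inode
   and slot. Disagreement means the structures are inconsistent (e.g.,
   the block was reallocated), so checking can be deferred to access
   time instead of requiring a full boot-time scan. */
static int access_consistent(const struct inode *ino, uint32_t slot,
                             const struct data_block *blk) {
    return blk->back_inum == ino->inum && blk->back_slot == slot;
}

int main(void) {
    struct data_block blk = {.back_inum = 42, .back_slot = 0};
    struct inode ino = {.inum = 42, .block_ptrs = {100}};
    printf("consistent: %d\n", access_consistent(&ino, 0, &blk));
    return 0;
}
```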
  - Optimistic crash consistency:
    - Issues as many writes as possible without waiting for earlier ones to complete, reducing stalls on disk barriers
    - Uses generalized transaction checksums to detect incomplete writes at recovery time
    - Can significantly improve performance for some workloads
    - Requires changes to the disk interface to realize its full benefits
In summary, crash consistency is a fundamental challenge in file system design. While journaling (especially metadata journaling) is widely used in modern file systems due to its balance of performance and fast recovery, research continues to develop new techniques to improve both consistency guarantees and performance. The choice of consistency mechanism can significantly impact file system behavior, recovery time, and overall system performance.