Filesystem Journaling: How Write-Ahead Logging Prevents Data Loss

Understand the journaling mechanism that protects filesystems from crashes. Explore write-ahead logging, transaction states, and crash recovery through a step-by-step walkthrough.

The Problem: Crashes During Writes

Imagine you're updating a file when suddenly—power failure! Without protection, your filesystem could be left in an inconsistent state:

  • Half-written metadata: Directory entries point to freed blocks
  • Orphaned data: Allocated blocks with no file reference
  • Corrupted structures: Inconsistent inode tables, bitmaps

Traditional filesystems required full disk scans (fsck) after crashes—potentially hours on large drives. Journaling solves this with write-ahead logging.

The Journaling Solution

Core Idea: Before making any changes, write your intentions to a journal (transaction log). If a crash occurs, replay the journal to complete or undo partial operations.

Think of it like a chef's prep notes: write down what you're about to cook before you start. If interrupted, check your notes to know what state you're in.
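The core discipline can be sketched in a few lines of Python. This is a toy model, not real filesystem code; the `journal.log` file and the `wal_write` helper are invented for illustration. The key invariant is that the intent record is forced to stable storage before the in-place change happens:

```python
import json
import os

JOURNAL = "journal.log"   # hypothetical journal file for this sketch

def wal_write(fs_state, key, value):
    """Write-ahead: record the intent durably before touching fs_state."""
    record = {"op": "set", "key": key, "value": value}
    with open(JOURNAL, "a") as j:
        j.write(json.dumps(record) + "\n")
        j.flush()
        os.fsync(j.fileno())   # the intent reaches stable storage first
    fs_state[key] = value      # only now apply the real change

state = {}
wal_write(state, "doc.txt", "hello")
# If a crash hits between the fsync and the apply, the record is still
# in journal.log, so replaying the journal re-applies the change.
```

Because the apply step is always strictly after the durable intent, a crash can never leave a change half-done without a journal record describing it.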

How Journaling Works

Let's follow the journaling mechanism step by step, from transaction start to commit, and through crash recovery:

Normal Operation: Transaction Flow

Every transaction begins from a consistent baseline. In the initial state:

  • The filesystem is consistent: doc.txt (10KB, blocks 100 and 101) is intact on disk
  • The journal is empty, or contains only old, already-checkpointed transactions
  • The last checkpoint is at position 0
  • The filesystem is ready to process a new write operation

The user now wants to append data to doc.txt, which starts a new transaction.

Journal Modes: Safety vs Performance

Different journaling modes offer varying guarantees:

1. Journal Mode (Full Journaling)

  • What's journaled: Both metadata AND data
  • Process: Write data to journal → Write metadata to journal → Commit → Write to final location
  • Safety: Highest - complete consistency
  • Performance: Slowest - everything written twice
  • Use case: Critical data (financial systems)

2. Ordered Mode (Default)

  • What's journaled: Only metadata
  • Process: Write data to disk → Write metadata to journal → Commit → Write metadata to final location
  • Safety: High - metadata consistent, data may be old
  • Performance: Balanced - data written once
  • Use case: Most systems (ext4 default, XFS)

3. Writeback Mode

  • What's journaled: Only metadata
  • Process: Write metadata to journal → Commit → Write data and metadata to disk (any order)
  • Safety: Lower - metadata consistent, data may be garbage
  • Performance: Fastest - no ordering constraints
  • Use case: Non-critical data, scratch disks
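A rough way to compare the three modes is to encode each mode's write ordering and check whether file data is durable before the commit record lands. This is a deliberate simplification (real implementations batch and reorder within these constraints), and the step names are invented for the sketch:

```python
# Simplified write orderings enforced by each journaling mode.
MODE_ORDER = {
    "journal": [
        "data -> journal", "metadata -> journal", "commit",
        "data -> final", "metadata -> final",
    ],
    "ordered": [
        "data -> final", "metadata -> journal", "commit",
        "metadata -> final",
    ],
    "writeback": [
        "metadata -> journal", "commit",
        "data -> final (any time)", "metadata -> final",
    ],
}

def data_safe_after_commit(mode: str) -> bool:
    """True if file data is guaranteed durable once the commit record lands."""
    steps = MODE_ORDER[mode]
    before_commit = steps[:steps.index("commit")]
    return any(s.startswith("data") for s in before_commit)

# journal and ordered persist data before committing; writeback does not,
# so after a crash the metadata may point at stale or garbage data blocks.
```

This makes the safety ranking concrete: the modes differ only in where data writes sit relative to the commit record.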

The Transaction Lifecycle

  1. Transaction Start
  2. Write Intent to Journal (WAL)
  3. Wait for Journal Flush
  4. Commit Transaction (Commit Record)
  5. Apply Changes to Filesystem
  6. Mark Journal Entries as Completed (Checkpoint)
  7. Reuse Journal Space (Circular Buffer)
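Steps 6 and 7 are worth dwelling on: the journal is a fixed-size circular buffer, and space is only reclaimed once a transaction's changes have reached their final location. A toy model (the `Journal` class is hypothetical and sized in transactions rather than bytes) shows the pressure this creates:

```python
from collections import deque

class Journal:
    """Toy circular journal: entries occupy space until checkpointed."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = deque()   # transaction records, oldest first

    def begin(self, txid: int) -> None:
        if len(self.entries) >= self.capacity:
            raise RuntimeError("journal full: checkpoint needed first")
        self.entries.append({"txid": txid, "committed": False})

    def commit(self, txid: int) -> None:
        """Write the commit record for a transaction."""
        for e in self.entries:
            if e["txid"] == txid:
                e["committed"] = True

    def checkpoint(self) -> None:
        """Once changes reach their final location, reclaim journal space."""
        while self.entries and self.entries[0]["committed"]:
            self.entries.popleft()   # this space can now be reused

j = Journal(capacity=2)
j.begin(1); j.commit(1)
j.begin(2); j.commit(2)
j.checkpoint()   # both transactions applied, so the journal drains to empty
```

When the journal fills before a checkpoint runs, new transactions stall, which is why journal size tuning (below) matters for write-heavy workloads.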

Recovery After Crash

When a filesystem mounts after a crash:

  1. Scan Journal: Read from last checkpoint
  2. Check Commit Records: Find complete vs incomplete transactions
  3. Replay Complete: Apply committed but not-yet-applied changes
  4. Rollback Incomplete: Ignore uncommitted transactions
  5. Mount Filesystem: Now in consistent state

Recovery Time: Seconds to minutes (scanning journal only), not hours (scanning entire disk).
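The recovery scan above can be sketched as a replay function: transactions that have a commit record are applied, everything else is ignored. This is a toy model using JSON records rather than on-disk block images; the record format is invented:

```python
import json

def recover(journal_lines):
    """Replay: apply only transactions that reached a commit record."""
    txs, committed = {}, set()
    for line in journal_lines:
        rec = json.loads(line)
        if rec["type"] == "commit":
            committed.add(rec["tx"])
        else:
            txs.setdefault(rec["tx"], []).append(rec)

    state = {}
    for tx, ops in txs.items():
        if tx in committed:          # replay complete transactions
            for op in ops:
                state[op["key"]] = op["value"]
        # uncommitted transactions are simply ignored (rolled back)
    return state

# A journal as it might look after a crash: tx 2 never committed.
crash_journal = [
    '{"type": "set", "tx": 1, "key": "a.txt", "value": "v1"}',
    '{"type": "commit", "tx": 1}',
    '{"type": "set", "tx": 2, "key": "b.txt", "value": "v2"}',
]
print(recover(crash_journal))   # only tx 1's change survives
```

Note that replay is idempotent here: applying tx 1 a second time produces the same state, which is why recovery can safely re-apply transactions that may already have reached disk.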

Journaling in Different Filesystems

ext4

  • Journal: Dedicated journal inode or external device
  • Modes: journal, ordered (default), writeback
  • Journal size: Configurable (default ~128MB)
  • Command: tune2fs -o journal_data /dev/sda1

XFS

  • Journal: Metadata-only (always ordered mode)
  • Log size: Configurable with -l size=128m
  • External log: Optional separate device (mkfs.xfs -l logdev=...) for faster sync writes
  • Efficient: Only logs metadata changes

NTFS

  • $LogFile: Transaction log for metadata
  • Mode: Metadata journaling only
  • Recovery: Automatic on mount (chkdsk if needed)
  • USN Journal: Separate change journal for applications

Btrfs / ZFS

  • No traditional journal: Use Copy-on-Write instead
  • Atomic operations: CoW provides transaction semantics
  • See: Copy-on-Write mechanism

Performance Impact

Journal Placement

```shell
# Internal journal (default)
mkfs.ext4 /dev/sda1

# External journal (faster, separate device)
mkfs.ext4 -J device=/dev/sdb1 /dev/sda1
```

External journal benefits:

  • Reduced seek time (journal on SSD, data on HDD)
  • Parallel I/O
  • Better for write-heavy workloads

Journal Size Tuning

```shell
# Larger journal = more buffering, less frequent commits
tune2fs -J size=400 /dev/sda1   # 400MB journal

# For databases: larger journal reduces checkpoint frequency
```

Best Practices

  1. Use ordered mode for most workloads (default)
  2. Enable full journaling only for critical data
  3. External journal on separate SSD for performance
  4. Monitor journal wraps: dumpe2fs /dev/sda1 | grep -i journal
  5. Disable journaling only for scratch/tmp filesystems

Journaling vs Copy-on-Write

| Aspect | Journaling | CoW (Btrfs/ZFS) |
| --- | --- | --- |
| Method | Write-ahead log | Never overwrite |
| Overhead | Write twice (journal + final) | Write once (new location) |
| Recovery | Replay journal | Always consistent |
| Snapshots | Not supported | Free with CoW |
| Maturity | Very mature | Newer (Btrfs) |
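For contrast, the CoW alternative can be sketched in a few lines: new versions are written to fresh blocks, and a single pointer update acts as the commit. This is a toy in-memory model; real CoW filesystems do this with block trees and superblock updates:

```python
# Toy copy-on-write store: updates go to fresh blocks; a single
# pointer swap publishes the new version atomically.
blocks = {0: "old contents"}   # block number -> contents (in-memory stand-in)
root = 0                       # points at the current version

def cow_update(new_data: str) -> None:
    global root
    new_block = max(blocks) + 1
    blocks[new_block] = new_data   # write out of place first
    root = new_block               # the pointer update is the commit point

cow_update("new contents")
# A crash before the pointer swap leaves root -> block 0, which is
# still the complete old version: no journal replay is ever needed.
```

This is why CoW filesystems skip the journal and get snapshots cheaply: keeping an old root pointer around is a snapshot.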

When to choose:

  • Journaling (ext4, XFS): Maximum maturity, proven reliability
  • CoW (Btrfs, ZFS): Want snapshots, checksums, modern features

Key Takeaways

  • Journaling = Insurance: Protects against crashes with write-ahead logging
  • Fast Recovery: Seconds instead of hours (no full fsck needed)
  • Configurable Safety: Choose journal mode based on data criticality
  • Performance Trade-off: More safety = more writes (journal overhead)
  • Universal: Used by ext4, XFS, NTFS, and most modern filesystems

If you found this explanation helpful, consider sharing it with others.
