Deep Dive · Storage · Linux

From Spinning Platters
to Smart Partitions

A comprehensive guide to storage hardware, filesystems, and Linux partitioning strategy
Hard drives · SSDs & NVMe · FTL & NAND wear · Filesystems · COW vs Journaling · Partitioning · LVM
01 · Title

From Spinning Platters to Smart Partitions

A deep dive into storage, filesystems, and Linux partitioning

Hard drives · SSDs · Filesystems · COW vs Journaling · Partitioning Strategy
Section 1

What Is a Hard Drive?

Physical storage, magnetic platters, and solid-state cells

03 · Hardware — HDD vs SSD

HDD vs SSD — Physical Structure

HDD 3.5" annotated vs SSD 2.5" with components labelled
Left: HDD — mechanical platters, actuator arm, R/W heads floating on air bearing. Right: SSD — NAND flash chips, controller, optional DRAM cache.
04 · Section 1 · Storage Fundamentals

The Big Picture: Why Storage Matters

RAM (Volatile)

Fast but loses all data when power is cut. Running programs live here — not your files.

Storage (Persistent)

Retains data without power. Everything you save lives here — OS, files, databases.

Network / Cloud

Remote persistent storage. The same fundamentals apply underneath.

Storage Evolution

Magnetic tape → HDD → SSD (SATA) → NVMe SSD
Each generation brought orders-of-magnitude improvements — but the fundamental job stays the same: store bits reliably and retrieve them quickly. From minutes per operation (tape) to microseconds (NVMe).

📝 Notes
RAM loses data when power cuts out — storage does not. Without persistent storage there are no files, no operating system, no databases. The evolution from tape to NVMe spans roughly six orders of magnitude in random-access speed.
05 · Section 1 · HDD

HDD: Mechanical Architecture

  • Platters — spinning magnetic disks (5400 / 7200 / 10 000 RPM)
  • Read/write heads — float nanometres above the platter surface on an air cushion
  • Actuator arm — positions heads radially across tracks
  • Tracks — concentric rings on each platter surface
  • Sectors — smallest addressable unit (traditionally 512 B, now 4 K)
  • Cylinders — same track on all platter surfaces stacked vertically
  • Seek time — time to move head to correct track (~3–10 ms)
  • Rotational latency — waiting for sector to rotate under head (~4 ms avg at 7200 RPM)
  • Transfer rate — sequential reads: ~150–200 MB/s
  • Random 4K IOPS — only ~100–200 (catastrophically slow for random workloads)
  • Noise, vibration, shock sensitivity — physical medium = physical fragility
⚠ If the read head touches the platter surface (head crash), data can be permanently destroyed. The air bearing is only nanometres thick — a single dust particle can cause catastrophic damage.
📝 Notes
Think of a vinyl record player, but incredibly precise and fast. The heads never touch the surface — they float on an air cushion. The mechanical nature is the fundamental limitation: to read a random block you must physically move the arm and wait for the disk to rotate. That takes 5–10 milliseconds per random access. HDDs are fine for sequential workloads (streaming a movie) but terrible for random access (database queries).
06 · Section 1 · HDD

HDD: How Data Is Written & Why Seek Time Kills Performance

Magnetic Encoding

Each bit is stored as a tiny magnetic domain oriented North or South. Changing orientation requires passing the write head over the exact spot and pulsing an electromagnetic field.

Sequential vs Random I/O

Sequential reads are fast — the head barely moves. Random reads require a full arm seek + rotational wait for every block. Databases, OS metadata, and small files are all random.

Fragmentation

When a file's blocks scatter across the disk, each fragment requires a separate seek. Over time an HDD spends more time seeking than transferring. Defragmentation physically consolidates blocks — meaningless on SSDs.

Physical Vulnerabilities

HDDs fail from: bearing wear (gradual), head crashes (sudden), vibration interference, thermal expansion. Always assume an HDD will eventually fail — because it will.

📝 Notes
The single most important concept: random vs sequential access. An HDD reading a large video file (sequential) is perfectly adequate. An HDD running a database with thousands of tiny random reads per second is a disaster — the arm physically cannot move fast enough. An HDD gets ~200 random reads/second; a good NVMe SSD gets 1,000,000. This 5,000× difference is why SSDs transformed database performance so dramatically.
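A quick way to see this gap on real hardware is fio. A minimal sketch, assuming fio is installed and that /dev/sdX is a placeholder for a disk you can read from (run as root; both runs are read-only, but verify the device name first):

  # Sequential 1 MiB reads — plays to an HDD's strength
  fio --name=seq --filename=/dev/sdX --readonly --rw=read --bs=1M \
      --ioengine=libaio --direct=1 --runtime=30 --time_based

  # Random 4 KiB reads at queue depth 32 — the workload that exposes seek latency
  fio --name=rand --filename=/dev/sdX --readonly --rw=randread --bs=4k \
      --iodepth=32 --ioengine=libaio --direct=1 --runtime=30 --time_based

On an HDD the second run typically reports a few hundred IOPS; the same command against an NVMe device reports hundreds of thousands.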
07 · Section 1 · SSD

SSD: Solid-State Architecture

Cell Types — the fundamental speed/density/endurance trade-off

SLC — 1 bit/cell · fastest · highest endurance · 100K+ P/E cycles · most expensive
MLC — 2 bits/cell · fast · good endurance · ~10K P/E cycles · prosumer
TLC — 3 bits/cell · moderate speed · ~3K P/E cycles · consumer standard
QLC — 4 bits/cell · slowest writes · ~1K P/E cycles · budget / read-heavy

Key Components

  • NAND Flash dies → organized into blocks → pages (smallest writable unit, typically 4–16 KB)
  • Controller → the CPU of the SSD, runs FTL firmware
  • DRAM (optional) → caches the FTL mapping table; DRAM-less SSDs are 2–4× slower under random load
📝 Notes
SSDs have no moving parts — everything is electronic. The critical architectural constraint is the NAND hierarchy: data is read/written at the PAGE level, but erased at the BLOCK level. A block typically contains 512 pages. This asymmetry — fine-grained writes, coarse-grained erases — is responsible for almost all of the complexity in SSD firmware. The cell type trade-off: SLC is used in enterprise caches and the SLC pseudo-cache layer on consumer drives. Most consumer drives today use TLC or QLC, which is why sustained write performance drops dramatically once the SLC cache fills.
08 · Section 1 · SSD

SSD: How Data Is Written — and Why It's Complicated

The Write-Erase Asymmetry

Write individual PAGES (4–16 KB), but erase an entire BLOCK (256–512 pages = up to 8 MB). You cannot overwrite a used page — it must be erased first.

Write Amplification

Writing 4 KB of user data may trigger reading, modifying, erasing, and rewriting an entire 4 MB block. WAF (Write Amplification Factor) = physical writes ÷ logical writes. High WAF → faster cell wear.

P/E Cycles & Wear

Every Program/Erase cycle slightly degrades the cell's oxide layer. When a cell can no longer reliably store charge, it's marked bad. Consumer TLC SSDs: ~1000–3000 cycles per block.

Over-Provisioning

SSDs reserve extra NAND (7–28% on consumer drives) that the OS never sees. This spare area absorbs writes, enables background garbage collection, and replaces worn-out blocks transparently.

📝 Notes
This write-erase asymmetry is THE defining characteristic of NAND flash — everything about SSD design flows from it. You can't update in place, so every write is logically an append to a new location. The old location becomes 'stale' — it holds data that's no longer valid but can't be reused until an entire block is erased. Garbage collection reclaims stale blocks and runs in the background. An SSD with no free blocks must garbage-collect synchronously on every write, causing the 'sustained write cliff' seen in benchmarks.
09 · Section 1 · Comparison

HDD vs SSD — Performance Comparison (4K random reads)

Metric | HDD (7200 RPM) | SATA SSD | NVMe SSD
Sequential Read | ~150–200 MB/s | ~550 MB/s | 3–7 GB/s
Sequential Write | ~120–150 MB/s | ~520 MB/s | 2–6 GB/s
Random 4K Read | ~0.5–1 MB/s | ~50 MB/s | 300–700 MB/s
Latency | 5–10 ms | ~0.1 ms | ~0.02–0.05 ms
Random IOPS | ~100–200 | ~90,000 | up to 1,000,000
Noise / Vibration | Yes | None | None
Shock resistance | Low | High | High
Key insight: The Random 4K row is what matters for real-world workloads. A good NVMe SSD does ~5,000× more random reads per second than an HDD. For databases, web servers, and OS metadata lookups, random IOPS is the only number that matters.
📝 Notes
Sequential speed matters for backups, video, and large file transfers. But an OS is always doing random I/O — metadata lookups, library loading, log writes, database queries. An HDD at 200 IOPS vs NVMe at 1,000,000 IOPS: that is a 5,000× difference. A database that saturated an HDD server at 200 req/sec can handle 50,000 req/sec on the same hardware with an NVMe swap-in.
10 · Section 1 · Cache & FTL

Caching & Data Management: HDD vs SSD

HDD Cache (DRAM on PCB)

8–256 MB buffer caches recently read/written sectors. Write-back mode: drive reports "done" before data is on the platter — faster, but power loss before flush = data loss. Write-through: safer, slower.

SSD DRAM Cache

Caches the FTL mapping table. DRAM-less SSDs must read the map from flash for every random access — 2–4× slower under random load. Capacitor-protected DRAM = power-loss safe.

SLC Cache (Pseudo-SLC)

Consumer TLC/QLC drives emulate fast SLC on a portion of NAND. Writes land here first at SLC speed. When the SLC cache fills, write speed plummets to the native TLC/QLC speed — often a few hundred MB/s on TLC, and as low as ~100 MB/s on QLC drives.

Flash Translation Layer (FTL)

Translates LBAs (what the OS sees) to physical NAND addresses. Enables: wear levelling, garbage collection, bad block management. TRIM lets the OS inform FTL which blocks are freed — enables proactive GC.

📝 Notes
The FTL is arguably the most important piece of software in a modern SSD. It creates the illusion of a simple, rewritable block device on top of a medium that can't be overwritten in place. Wear levelling: even if the OS writes to the same logical block address a million times, the FTL spreads those writes across all physical cells. TRIM is the OS saying: 'these blocks were deleted — you don't need to preserve them during garbage collection.' Without TRIM, the SSD must preserve stale pages unnecessarily, increasing write amplification.
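To check and exercise TRIM on a running system, the following sketch works on most systemd-based distros (device output varies; fstrim.timer is the usual periodic-TRIM unit but may need installing/enabling):

  lsblk --discard                             # non-zero DISC-GRAN/DISC-MAX means the device accepts TRIM
  sudo fstrim -av                             # trim all mounted filesystems that support it, verbosely
  sudo systemctl enable --now fstrim.timer    # weekly TRIM — the common alternative to discard mount options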
11 · Section 1 · Analogy
A burning candle — SSD wear analogy
An SSD is like a candle — every write consumes a little of its remaining life. The question isn't if it will wear out, but when. Good firmware design (FTL, wear levelling, TRIM) extends that life as far as possible.
📝 Notes
Unlike HDDs which fail due to mechanical wear (bearings, head crashes), SSDs fail because their flash cells can no longer reliably store charge. The cell's oxide layer degrades with every program/erase cycle. High write workloads — log-heavy servers, database transaction logs, compilation caches — burn through SSD endurance much faster than typical desktop use. Enterprise SSDs are rated in Drive Writes Per Day (DWPD); consumer SSDs in Total Bytes Written (TBW).
12 · Section 1 · SSD Cell Wear & FTL

SSD Cell Wear & Flash Translation Layer (FTL)

NAND Cell P/E Cycle Limits

SLC: 100 000+ · MLC: ~10 000 · TLC: ~3 000 · QLC: ~1 000

What happens inside the cell

New cell — 0 cycles, all charge traps intact → Worn cell — ~80% of P/E budget used, oxide degrading → Dead cell — limit exceeded, cannot store charge.

FTL Responsibilities

⚖ Wear Levelling

Rotates writes across ALL physical blocks. Hot LBAs (e.g. filesystem journal) are remapped to fresh cells periodically.

🗑 Garbage Collection

Pages can only be written once per erase cycle. GC finds blocks with stale pages, copies live pages elsewhere, erases the block, returns it to the free pool.

✂ TRIM / UNMAP

OS tells FTL which LBAs were freed (deleted files). FTL marks those physical pages invalid immediately, enabling proactive GC.

🛡 Bad Block Management

When a cell reaches its P/E limit, FTL permanently remaps that block to a spare in the over-provisioned reserve area. Transparent to the OS.

WAF (Write Amplification Factor) = physical bytes written ÷ logical bytes written. WAF > 1 always. High WAF → faster cell death. Over-provisioning (7–28% hidden NAND) gives GC room to work and keeps WAF low.
📝 Notes
Every NAND flash cell stores data by trapping electrons in a floating gate. Each Program/Erase cycle slightly degrades the oxide layer. After enough cycles the cell can no longer reliably hold charge — it 'forgets' its value.

The FTL is the SSD's operating system. Without wear levelling, the OS journal area (written thousands of times per day) would burn out cells in weeks. Without TRIM, garbage collection would waste enormous amounts of work preserving data the OS already deleted.

WAF is the key metric for SSD health: a WAF of 1.0 is ideal (impossible in practice); a WAF of 10 means writing 100 GB logically causes 1 TB of physical NAND writes.
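Wear and written-byte counters can be read from the drive's self-reported health data. A sketch using smartmontools and nvme-cli — exact attribute names are vendor-specific, so treat them as examples:

  sudo smartctl -a /dev/sda          # SATA SSD: look for attributes such as Total_LBAs_Written / Wear_Leveling_Count
  sudo nvme smart-log /dev/nvme0     # NVMe: 'percentage_used' and 'data_units_written' (1 unit = 512,000 bytes)

Where a vendor also exposes a NAND-write counter, comparing it against host-written bytes gives an estimate of the drive's real-world WAF.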
13 · Section 1 · Interfaces

NVMe vs SATA: The Interface Layer

SATA III — designed in 2003 for spinning disks

  • Maximum bandwidth: 600 MB/s (~550 MB/s real with protocol overhead)
  • Command queue: NCQ, max 32 commands at once
  • Large latency overhead — commands traverse AHCI driver stack
  • Physical SATA cable + power — 2 connectors per drive
  • Status: legacy — still fine for a boot drive if no NVMe slot available

NVMe over PCIe — designed specifically for SSDs

  • PCIe 4.0 ×4: up to 7 GB/s; PCIe 5.0 ×4: up to 14 GB/s
  • Queue depth: 65,535 queues × 65,535 commands each
  • Minimal latency — speaks directly to CPU memory bus
  • M.2 slot or U.2 — single connector, no cables
  • Namespaces: one drive can present multiple independent block devices
The protocol matters as much as the flash chips. Even the fastest TLC NAND behind a SATA interface is capped at 550 MB/s with only 32 commands in flight. NVMe's 65K×65K queue depth means millions of concurrent I/O operations — critical for database servers handling hundreds of simultaneous queries.
📝 Notes
Queue depth is the overlooked dimension of storage performance. SATA's limit of 32 concurrent commands means that on a database server with 100 parallel queries, the other 68 are always waiting. NVMe's massively parallel queue architecture allows the SSD to execute orders from many CPU cores simultaneously without any serialisation bottleneck. This, combined with direct PCIe bus access bypassing the legacy AHCI driver stack, is why the latency gap between SATA and NVMe is even larger than the bandwidth gap.
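To see which interface each drive in a system actually uses, a quick sketch (the lsblk columns require a reasonably recent util-linux; nvme-cli is a separate package):

  lsblk -d -o NAME,TRAN,ROTA,MODEL,SIZE   # TRAN: sata vs nvme; ROTA: 1 = spinning, 0 = solid-state
  sudo nvme list                          # NVMe controllers, namespaces, firmware
  cat /sys/block/sda/queue/rotational     # 1 for HDD, 0 for SSD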
14 · Section 1 · Visual

Visual: HDD vs SSD — Physical Data Organisation

HDD — Concentric Tracks & Sectors

  • Track = one concentric ring
  • Sector = smallest addressable unit (512 B / 4 KB)
  • Cylinder = same track on all platters (3D)
  • Random reads: arm movement + rotational wait = 5–10 ms per access

SSD — NAND Flash: Blocks → Pages (no rotation)

  • Any page can be read in ~50 µs — no mechanical delay
  • But blocks with stale pages must be fully erased before reuse
  • Erase takes ~1.5 ms and permanently wears out the cells
  • Page states: ■ written   ■ stale (pending erase)   □ free
⚠ This physical difference — HDD seek latency vs SSD write-erase asymmetry — drives ALL design decisions in firmware, filesystems, and partitioning strategy that follow in this guide.
An entire block must be erased before any page in it can be rewritten. Even if only one page in a 512-page block is stale, GC must copy the other 511 live pages elsewhere before erasing. This is the source of write amplification.
📝 Notes
Internalise this slide. HDD: every random read is a physical journey — arm moves, disk rotates, head positions. Minimum ~5ms per access. SSD: any page is readable in microseconds. But the stale page problem means writes are not symmetric with reads. The erase-before-write constraint is the source of garbage collection, wear levelling, write amplification, and the 'sustained write cliff.' Understanding this makes every other SSD decision obvious.
Section 2

Partitions vs Filesystems

What they are, how they differ, and how they interact

16 · Section 2 · Partitions

What Is a Partition?

Core Definition: A partition is a contiguous range of Logical Block Addresses (LBAs) on a storage device, defined by a start address and an end address. The device itself has no awareness of what is stored inside — it is simply a defined region of blocks.
Analogy: Partitions are like rooms in a building. The building (disk) doesn't care what furniture (data) is inside each room (partition). It only knows where each room starts and ends.

What the partition table records:

  • Start LBA — first addressable block belonging to this partition
  • End LBA — last addressable block
  • Partition type GUID — hint about intended content (Linux data, EFI System, swap, etc.)
  • Flags / attributes — bootable, read-only, required, etc.
⚠ The partition type GUID is only a hint — nothing enforces what content is placed inside. Formatting a "Linux swap" partition with ext4 is perfectly legal; the kernel won't complain (though it won't auto-use it as swap).
📝 Notes
A partition is purely a geometric concept — a start address and an end address on the disk. Nothing more. The disk firmware just sees blocks; it has no idea what's inside any partition. The separation between 'where is the partition' and 'what's inside it' is a fundamental design principle of Unix storage that gives Linux its flexibility.
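You can see exactly this — start LBA, end LBA, type — with standard tools. A read-only sketch, with /dev/sda as a placeholder device:

  sudo parted /dev/sda unit s print        # partition table with start/end in 512-byte sectors (LBAs)
  sudo sgdisk --print /dev/sda             # GPT view: first/last LBA, type code, partition name
  lsblk -o NAME,SIZE,PARTTYPENAME,FSTYPE   # what the kernel thinks lives inside each partition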
17 · Section 2 · Visual

From Raw Disk to Partitions to Filesystems

1. Raw disk — one uniform address space (LBA 0 … LBA N)

Unpartitioned disk — 500 GB — all blocks equivalent   LBA 0 …………………………… LBA N

2. After partitioning — GPT records start/end LBA for each partition

/efi | /boot | /root | /home — Btrfs — 280 GB | /var — 80 GB | swap | free

GPT: each coloured region = {start_lba, end_lba, type_guid, name}

3. Filesystem inside /home — does NOT have to fill the whole partition

/home partition boundary (280 GB)

Btrfs filesystem — 200 GB
80 GB unformatted
(partition reserved)
The filesystem can be grown later to fill the full 280 GB (btrfs filesystem resize for Btrfs, resize2fs for ext4) — online, without rebooting, without changing the partition table.
📝 Notes
Read this slide top to bottom as three layers. Row 1: a brand-new disk is just a flat address space. Row 2: after partitioning, the GPT table says 'blocks X through Y belong to this partition.' The filesystem doesn't exist yet. Row 3: mkfs creates the filesystem inside the partition, and crucially — it doesn't have to fill it. The remaining space is unaddressed partition space. Run the filesystem's resize tool later (btrfs filesystem resize here, resize2fs for ext4) and the filesystem expands into it — no partition table changes, no downtime.
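The same idea as commands — a sketch assuming /dev/sda4 is the 280 GB /home partition from the diagram (mkfs.btrfs's -b flag limits the filesystem to part of the device):

  mkfs.btrfs -L home -b 200G /dev/sda4     # create a 200 GB Btrfs filesystem inside the 280 GB partition
  mount /dev/sda4 /home
  btrfs filesystem resize max /home        # later: grow online to fill the whole partition

ext4 and XFS filesystems can likewise be grown later with resize2fs and xfs_growfs respectively.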
18 · Section 2 · Partition Tables

What Is a Partition Table?

A partition table is a small data structure stored at the very beginning of a disk that tells the OS how the disk is divided into regions (partitions) — where each one starts, where it ends, and what type it is.

MBR (Master Boot Record) — 1983

Layout (512 bytes total): bootstrap code | P1 | P2 | P3 | P4 | boot signature

  • Max 4 primary partitions (or 3+1 extended workaround)
  • Max disk size: 2 TB (32-bit LBA addressing)
  • Single copy — corruption of sector 0 = unbootable disk
  • Compatible with all BIOS hardware since 1983

GPT (GUID Partition Table) — 2006 / UEFI

Layout: protective MBR | GPT header | partition entries (128 × 128 B) | …data… | backup GPT — primary copy at the start of the disk, backup at the end

  • Up to 128 partitions
  • Max disk size: 9.4 ZB (64-bit LBA addressing)
  • Redundant headers + CRC32 checksums — self-healing
  • Each partition has a 128-bit GUID — globally unique
  • Required for UEFI boot and disks > 2 TB
⚠ The partition table does NOT store data — it only stores geometry (start LBA, end LBA, type GUID, name). The filesystem lives INSIDE the partition. Use GPT for everything new.
📝 Notes
The MBR has a hard 2 TB limit due to its 32-bit LBA addresses — any disk larger than 2 TB cannot be fully addressed with MBR. GPT's backup header at the end of disk means a corrupted primary header is recoverable — with MBR, corruption of the first 512 bytes is often catastrophic. Use GPT for any disk bought after ~2010.
19 · Section 2 · Partition Tables

Partition Tables: MBR vs GPT

Feature | MBR | GPT
Introduced | 1983 | 2006 (UEFI spec)
Max partitions | 4 primary (or 3+1 extended workaround) | 128 (Linux: effectively unlimited)
Max disk size | 2 TB | 9.4 ZB (zettabytes)
Redundancy | Single copy — corruption = unbootable | Primary + backup header at end of disk
Boot standard | BIOS (legacy) | UEFI (required for >2 TB disks)
Integrity check | None | CRC32 checksum on header and table
Partition IDs | 1-byte type code | 128-bit GUID — globally unique
Use today | Old hardware only | Always use GPT on modern systems
📝 Notes
There is no good reason to use MBR on any disk bought in the last decade. GPT supports 128 partitions, has CRC32 checksums for corruption detection, and the backup header means recovery is possible. The 128-bit GUID per partition means no collision risk when mixing disks from different systems, unlike MBR's 1-byte type codes.
20 · Section 2 · GPT Register Example

GPT Partition Table — Example Register for a 500 GB SSD

Layout bar: GPT | /efi | /boot | /root | /home — 280 GB | /var — 80 GB | swap | /tmp | backup GPT
# | Mount / Name | Type GUID | Partition GUID (unique) | Start LBA | End LBA | Size
GPT Header | — metadata, not a partition — | | | 0 | 33 | 34 LBAs
1 | /boot/efi | C12A7328-…EC93B (EFI System) | 3E6D4B27-…5E71 | 2,048 | 1,050,623 | 512 MB
2 | /boot | 0FC63DAF-…77DE4 (Linux fs) | A1B2C3D4-…7890 | 1,050,624 | 3,147,775 | 1 GB
3 | / (root) | 0FC63DAF-…77DE4 (Linux fs) | B2C3D4E5-…8901 | 3,147,776 | 87,033,855 | 40 GB
4 | /home | 0FC63DAF-…77DE4 (Linux fs) | C3D4E5F6-…9012 | 87,033,856 | 673,235,967 | 280 GB
5 | /var | 0FC63DAF-…77DE4 (Linux fs) | D4E5F6A7-…0123 | 673,235,968 | 841,008,127 | 80 GB
6 | swap | 0657FD6D-…4F4F (Linux swap) | E5F6A7B8-…1234 | 841,008,128 | 857,785,343 | 8 GB
7 | /tmp | 0FC63DAF-…77DE4 (Linux fs) | F6A7B8C9-…2345 | 857,785,344 | 878,756,863 | 10 GB
GPT Backup | — metadata, not a partition — | | | 976,773,135 | 976,773,167 | 33 LBAs
Each partition entry is exactly 128 bytes. Size = (End LBA − Start LBA + 1) × 512 bytes. Partitions start at LBA 2,048 (1 MiB aligned). Tools: gdisk -l /dev/sda or sgdisk --print /dev/sda
📝 Notes
Type GUID is just a hint — it says nothing about which filesystem is inside. All three 'Linux filesystem' partitions share the same type GUID but could contain ext4, Btrfs, XFS, or anything else. The unique Partition GUID distinguishes them — each is a 128-bit globally unique identifier that will never collide with another partition anywhere in the world. The protective MBR at LBA 0 prevents legacy tools from overwriting the GPT when they see an unrecognised disk format.
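A layout like the one above could be created non-interactively with sgdisk. A sketch only — it is destructive to whatever is on the target disk, so the device name is deliberately a placeholder:

  DISK=/dev/sdX                                     # placeholder — verify with lsblk before running anything
  sgdisk --zap-all $DISK                            # wipe existing MBR/GPT structures
  sgdisk -n 1:0:+512M -t 1:ef00 -c 1:"EFI"  $DISK   # EFI System Partition
  sgdisk -n 2:0:+1G   -t 2:8300 -c 2:"boot" $DISK
  sgdisk -n 3:0:+40G  -t 3:8300 -c 3:"root" $DISK
  sgdisk -n 4:0:+280G -t 4:8300 -c 4:"home" $DISK
  sgdisk -n 5:0:+80G  -t 5:8300 -c 5:"var"  $DISK
  sgdisk -n 6:0:+8G   -t 6:8200 -c 6:"swap" $DISK   # 8200 = Linux swap type code
  sgdisk -n 7:0:+10G  -t 7:8300 -c 7:"tmp"  $DISK
  sgdisk --print $DISK                              # verify start/end LBAs before formatting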
21 · Section 2 · Filesystems

What Is a Filesystem?

Core Definition: A filesystem is a data structure layered ON TOP of a block device (partition or logical volume) that defines how files and directories are named, organised, stored, and retrieved. The partition gives raw blocks — the filesystem gives them meaning.

What a filesystem defines:

File naming

Allowed characters, max length, case sensitivity

Directory structure

How parent/child relationships are stored (tree, B-tree, hash table)

Inodes / metadata

Permissions, timestamps, owner, size, pointer to data blocks

Free space tracking

Bitmap, extent tree, or B-tree of free extents

Data block mapping

How an inode maps to actual data blocks on disk

Crash consistency

Journaling, COW, or nothing — what survives a power cut

📝 Notes
If the partition is the empty room, the filesystem is the shelving system, the catalogue, and the rules for where things go. mkfs writes the initial filesystem structures into the raw blocks of the partition. After mkfs, the partition contains: a superblock (filesystem metadata), inode tables, block bitmaps, and an empty root directory. Before mkfs, it's just raw, uninterpreted blocks.
22 · Section 2 · Key Concept

The Key Distinction: Partition ≠ Filesystem

PARTITION /dev/sda2 (e.g. 100 GB)

FILESYSTEM (ext4) — 60 GB
Unused partition space
  • A partition is a block device: /dev/sda2 — raw, unformatted blocks. No files yet.
  • A filesystem is created on top: mkfs.ext4 /dev/sda2 writes filesystem structures into those raw blocks.
  • The filesystem does NOT have to use all of the partition's space. resize2fs can shrink or grow an ext4 filesystem independently of the partition boundary.
  • Practical use: leave unallocated space at partition end → grow filesystem online later without downtime (especially powerful with LVM).
📝 Notes
This is the single most important conceptual distinction in this section. The partition is the container — a range of blocks. The filesystem is the structure inside that container. They are completely independent layers. A 100 GB partition with a 60 GB filesystem means 40 GB of the partition's blocks are just unaddressed. Grow the filesystem later without touching the partition: resize2fs /dev/sda2 100G.
23 · Section 2 · Advanced Concept

Two Filesystems on One Partition — and Why

Loop Devices

A regular file within a filesystem can be treated as a block device: losetup /dev/loop0 disk.img → mkfs.ext4 /dev/loop0 → mount. A complete second filesystem living inside a file on the first.

LVM Thin Provisioning

LVM Logical Volumes decouple filesystems from partitions entirely. Multiple LVs can live within one PV (physical volume on a partition), each with its own filesystem.

Container Images

Docker/Podman layers are filesystem images stacked via overlay mounts. Each layer is a filesystem image inside a file on the outer filesystem.

LUKS Encrypted Vault

A LUKS container is a file → opened as a loop device → filesystem inside. The entire encrypted FS lives as a single file on the outer FS.

VM Disk Images

.qcow2 / .vmdk files contain complete virtual disks with their own partition tables and filesystems, living inside the host filesystem.

The "two filesystems on one partition" idea is the foundation of containers, VMs, and encrypted storage. It gives portability, isolation, snapshots, and encryption — without needing dedicated hardware partitions.
📝 Notes
The loop device mechanism is the key: any file can become a block device, and any block device can have a filesystem. Docker images are the most visible real-world example: multiple overlayfs layers, each a filesystem image, stacked to create the container's unified view. There is also a less-discussed use case: data can be structured in a way that makes additional content non-obvious from the outside. A system can look completely normal without specific knowledge of the inner filesystem.
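A minimal loop-device sketch — a complete second filesystem living inside an ordinary file (paths are examples; run as root):

  truncate -s 1G disk.img                   # sparse 1 GB file on the outer filesystem
  LOOP=$(losetup --find --show disk.img)    # attach it as a block device, e.g. /dev/loop0
  mkfs.ext4 "$LOOP"                         # inner filesystem
  mkdir -p /mnt/inner && mount "$LOOP" /mnt/inner
  # ... use it like any other filesystem ...
  umount /mnt/inner && losetup -d "$LOOP"   # detach when done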
Section 3

Filesystem Types

FAT · Journaling · Copy-on-Write · and beyond

25 · Section 3 · Core Problem

The Problem All Filesystems Must Solve

Crash Consistency: A file operation involves multiple disk writes: data blocks, inode update, directory entry, free space bitmap. If power dies between any two of these, the filesystem is in an inconsistent state. How you handle this defines your FS architecture.

Example: appending 1 byte requires updating:

1. Data block — write the actual new byte to a disk block
2. Inode — update file size and block pointer in the inode
3. Block bitmap — mark the new block as allocated in the free space map

Power dies between steps 1 and 2: data is on disk but the inode doesn't point to it → lost data
Power dies between 2 and 3: inode is updated but bitmap says block is free → block allocated twice (corruption)
Three main strategies: (1) Do nothing — run fsck on recovery (old ext2, FAT). (2) Write-ahead journal — log intentions before acting. (3) Copy-on-Write — never overwrite existing data, swap pointers atomically.
📝 Notes
A single logical file operation requires multiple physical writes that must all appear atomic — all-or-nothing. But disks are not atomic. Power can fail after any individual write. The three main strategies each have different performance/safety trade-offs. Journaling writes intentions to a circular log before acting — on crash, replay the log. COW never overwrites existing data — always writes new data to a free location, then atomically swaps the pointer.
26 · Section 3 · FAT

Table-Based Filesystems: FAT32 & exFAT

How FAT Works

The File Allocation Table is a flat linked list of cluster entries. Each cluster entry points to the next cluster of the file, or marks EOF. The directory entry gives the first cluster. Navigation = follow the chain.

The FAT32 Limits

Max file size: 4 GB (32-bit file size field in the directory entry). Max volume: 2 TB. No permissions, no symlinks, no hard links. No journaling — power loss mid-write can corrupt the FAT itself, making files unreadable.

exFAT — Microsoft's "FAT for Flash":

  • Removes the 4 GB file size limit (64-bit file size field)
  • Improved allocation bitmap — faster free space lookup than FAT32
  • Still no journaling, no Unix permissions, no hard links
  • Supported natively by Windows, macOS, and Linux (kernel 5.4+)
  • Use case: USB drives and SD cards shared between operating systems — NOT for system partitions
FAT32 is required for /boot/efi because the UEFI specification mandates FAT32. For any Linux system partition, use ext4, XFS, or Btrfs instead — FAT has no concept of Unix permissions.
📝 Notes
FAT32 is the lingua franca of storage — every device understands it. But its 4 GB file limit is a constant source of pain. exFAT removes the file size limit but keeps all of FAT's other weaknesses. For Linux system partitions, you'd never use either — they have no concept of Unix permissions (owner, group, rwx bits), so they can't store a Unix filesystem correctly. They're only for /boot/efi which is required to be FAT32 by the UEFI specification.
27 · Section 3 · Visual

Visual: How FAT, ext4, and Btrfs Organise Data

FAT32

Linked list: FAT[3]→5→7→EOF

Flat table of cluster entries. To find block N, follow N links in the table — O(N) seeks.

Problem: O(N) seeks to find block N. No crash safety. FAT corruption = unreadable files.

ext4 (journaling)

Inode 42 → extent tree

B-tree of block ranges — O(log N) lookup. Journal records intentions before writing.

Overwrite = in-place + journal commit → safe but double-write overhead

Btrfs (COW)

Root (old) → Root (NEW)

Write new data to free space. Atomically swap the root pointer. Old root = snapshot.

Never overwrites! New version → free space. Old = snapshot / freed by GC later.
📝 Notes
FAT32: linked list — O(N) seeks to find block N. No crash safety at all. ext4: inode with extent tree — O(log N). Journal records intentions before writing; on crash, replay the journal. Updates are in-place — old data is gone once written. Btrfs COW: never touch old data. Write new version to free space, update inode and B-tree upward, atomically swap the root pointer. At the moment of the swap, the old root is still valid — freeze it and you have a snapshot for free.
28 · Section 3 · Journaling

Journaling Filesystems: ext4, XFS, NTFS

The Journal: Write-Ahead Log. Before making any change, the filesystem writes its intention to a dedicated journal area: "I am about to update inode 42 and block 1337." Only after the journal commit does it apply the changes. On crash: replay the journal to reach consistency.

ext4 Journaling Modes:

writeback

Only metadata journaled. Data can be written before or after metadata. Fastest, least safe — data may be wrong on crash but FS structure is valid.

ordered (default)

Data written to its final location before the metadata journal commit. Balance of safety and performance. This is the Linux default.

full journal

Both data AND metadata go through the journal — every write is journaled twice. Safest, but ~2× write overhead. High-availability scenarios.

ext4: extents (better large-file performance), delayed allocation, 1 EB max volume, 64-bit block numbers, online defrag.
XFS: high parallelism, excellent for large files and many concurrent writes. B+tree directory indexing. Online grow (not shrink). Default on RHEL/Rocky.
📝 Notes
The journal is typically a circular log of fixed size, often 128 MB. Journal writes are sequential, so they're fast even on HDDs. On crash recovery, the kernel reads the journal, replays any committed-but-not-applied transactions, and the filesystem is consistent in seconds — versus the old ext2 approach of running fsck over the entire disk which could take hours on a large drive. XFS is the default on Red Hat derivatives and is exceptional at handling workloads with many concurrent writers, making it ideal for /var and database directories.
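To inspect or change the journaling behaviour of an existing ext4 filesystem — a sketch, with /dev/sda2 as a placeholder device:

  dumpe2fs -h /dev/sda2 | grep -i journal        # confirm the journal exists and see its size
  mount -o data=journal /dev/sda2 /mnt           # mount once with full data journaling
  tune2fs -o journal_data_writeback /dev/sda2    # or record writeback as the default mode in the superblock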
29 · Section 3 · COW

Copy-on-Write Filesystems: Btrfs, ZFS, APFS

Core Principle: Never Overwrite In Place. Write the NEW version to a free location, update the parent metadata pointer, then release the old block. The old version remains valid until the pointer is atomically swapped.

What COW enables — for free:

Atomic commits

Either the entire transaction lands (new root pointer visible) or nothing does. No journal needed — the pointer swap IS the commit.

Instant snapshots

Just freeze the old root pointer. The snapshot is an immutable view at a point in time. Cost: ~0 ms, ~0 bytes initially.

Block-level checksums

Every data and metadata block has a hash (CRC32c or SHA256). Silent corruption — a flipped bit — is detected and (with redundancy) self-healed.

Transparent compression

Compress each block before writing. Per-file or global. Algorithms: lz4 (fast), zstd (balanced), zlib (dense).

Deduplication

Identical block content → single shared physical block. Multiple files reference the same data without extra storage.

Send / Receive

Stream the diff between two snapshots over a pipe — efficient incremental backup without rsync scanning the entire tree.

📝 Notes
COW is an elegant solution to the crash-consistency problem — instead of logging intentions (journaling), you simply never destroy old data until the new data is safely committed. The pointer swap is effectively atomic. The bonus features — snapshots, checksums, compression, deduplication — are not add-ons. They emerge naturally from the architecture: snapshots are free because old data is preserved anyway; checksums are natural because you're writing to a new location and can compute the hash as you go.
30 · Section 3 · COW Internals

COW Deep Dive: How a Write Actually Works

1. Write new data — new data written to a free block; the old block is untouched
2. Write new metadata — new inode / extent node written, pointing to the new data block
3. Write new tree node — parent B-tree node updated; the copy is written to free space
4. Atomic root swap — the super-block's root pointer is updated atomically; the old blocks are now "freed" and GC can reclaim them

Why snapshots are free: Before step 4, the old root pointer still points to a valid, consistent tree. A snapshot just keeps this old pointer alive — preventing GC from reclaiming those blocks. Initial disk cost = zero bytes.

Btrfs: btrfs subvolume snapshot / /.snapshots/snap1 — instant.
ZFS: zfs snapshot pool/dataset@snap1 — instant.
📝 Notes
Step 4 — the root pointer swap — is the only atomic operation. Everything before it writes to free space and doesn't affect the live filesystem. A snapshot is literally 'don't update this pointer — keep this version alive too.' The filesystem now has two valid root pointers: the live one and the snapshot one. They share all blocks that haven't changed. As the live filesystem changes, only the changed blocks diverge — the unchanged blocks are still shared with the snapshot. This is called 'reflinks' internally.
31 · Section 3 · COW Trade-offs

COW Trade-offs: The Cost of All That Power

Write Amplification

Even small writes trigger a cascade of metadata tree updates up to the root. Writing 4 KB of data may cause 5–6 additional metadata block writes. On SSDs usually acceptable; on HDDs under random workloads it's punishing.

Fragmentation on HDDs

Because data is never written in-place, files fragment continuously. On SSDs fragmentation is irrelevant (no seek penalty). On spinning disks, COW + heavy write workloads eventually degrades read performance significantly.

✓ Ideal for SSDs

No seek penalty for fragmentation. Checksums catch silent corruption (bit rot). Snapshots are invaluable for SSD-based systems.

⚠ Caution for Databases

Databases do fine-grained, high-frequency updates to B-tree pages. COW + database = massive write amplification. Use nodatacow or a separate non-COW partition for database data directories.

If running PostgreSQL, MySQL, or any database on Btrfs, set nodatacow on the data directory (chattr +C /var/lib/postgresql before initialising the cluster). This falls back to journaling semantics for that directory while keeping COW for everything else.
📝 Notes
COW is not a free lunch. The metadata tree amplification means a single logical write touches many physical blocks — data block, inode, extent tree node, and possibly multiple B-tree interior nodes up to the root. For random write workloads — exactly what databases do — it's significant. Production servers with Btrfs on /var have shown 10× write amplification under systemd-journald load. The same journal write that costs 1 physical I/O on XFS costs 8–12 on Btrfs COW.
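The nodatacow setup mentioned above, as a concrete sketch (the path matches the PostgreSQL example; adjust for your database):

  mkdir -p /var/lib/postgresql
  chattr +C /var/lib/postgresql     # No_COW only applies to files created after the flag is set — do this while the directory is empty
  lsattr -d /var/lib/postgresql     # the 'C' attribute confirms COW is disabled for new files here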
32 · Section 3 · Btrfs

Btrfs: Linux's Native COW Filesystem

  • Subvolumes — independent filesystem trees within one Btrfs volume. Each has its own snapshot history, can be mounted separately, and can be sent/received independently.
  • Snapshots — instant, space-efficient. snapper integrates with package managers: automatic snapshot before/after every dnf/apt operation → instant rollback.
  • Online resize — grow or shrink while mounted (shrink limited by actual data usage).
  • RAID 0/1/10 — built into the filesystem; mdadm not needed. RAID 5/6: still has known reliability issues — avoid in production.
  • Send / Receive — btrfs send | btrfs receive streams the diff between two snapshots. Efficient incremental backup over SSH.
  • Transparent compression — per-file or global; lzo (fastest), zstd (best ratio/speed balance), zlib (highest compression). Typical savings: 30–60% on code, docs, logs.
  • Checksums — CRC32c on all data and metadata blocks by default. SHA256 available. Silent corruption detected immediately on read.
  • Deduplication — inline (slow, high RAM) or offline via duperemove/bees. Best for VM images, backups, containers.
Used by: Fedora (default since F33), openSUSE (default), Steam Deck OS, Synology NAS DSM.
Status (2026): Stable for single-disk and RAID 1/10. Avoid RAID 5/6 for critical data.
📝 Notes
Btrfs is the native choice for modern Linux systems that want COW features without ZFS licensing complexity. The killer feature for desktop and workstation use is snapper integration: every time you run 'dnf upgrade' or 'apt upgrade', snapper automatically takes a before-and-after snapshot. If the update breaks something, you boot into the GRUB snapshot menu and roll back the entire system in 30 seconds.
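The snapshot and send/receive workflow, sketched with plain btrfs commands (paths and the backup host are examples, and assume / is itself a Btrfs subvolume):

  btrfs subvolume snapshot -r / /.snapshots/base      # read-only snapshot — instant, ~0 bytes initially
  # ...days later...
  btrfs subvolume snapshot -r / /.snapshots/today
  btrfs send -p /.snapshots/base /.snapshots/today | \
      ssh backup-host 'btrfs receive /srv/backups'    # stream only the blocks that changed since 'base'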
33 · Section 3 · ZFS

ZFS: The Gold Standard of Storage Integrity

Architecture: No Partitions Needed

ZFS combines volume manager + filesystem. Physical disks → zpool → datasets (filesystems) + zvols (block devices). ZFS bypasses the partition layer entirely. One zpool can span multiple disks with built-in RAID.

End-to-End Integrity

Every block carries a checksum (fletcher4 by default; SHA-256/SHA-512 optional). On read, the checksum is verified. On mismatch with RAID redundancy: ZFS reads the redundant copy, fixes the corrupt block, and logs the event — fully automatic self-healing.

Caching Architecture

ARC (Adaptive Replacement Cache): RAM-based read cache, far smarter than page cache. L2ARC: SSD-based extension of ARC. ZIL (ZFS Intent Log): write-ahead log for sync writes, can be accelerated with a dedicated NVMe device (SLOG).

Linux Trade-offs

OpenZFS is stable on Linux (6.x kernel). CDDL license is incompatible with GPL — installed as a DKMS module. RAM hungry: 1 GB RAM per TB of storage is a rough guideline for ARC. Not default on any major distro.

⚠ ZFS deduplication requires a hash table in RAM — one entry per block. A 1 TB pool with 4 KB blocks = 256 million entries ≈ 80 GB RAM just for dedup metadata. This is why ZFS dedup is disabled by default.
📝 Notes
Use ZFS when data loss is genuinely not an option — storage servers, NAS devices, production databases, scientific data. The self-healing capability is uniquely valuable: a disk can develop bit rot — individual bits flipping due to cosmic rays or cell degradation — and ZFS detects and corrects it automatically, provided there's redundancy. Btrfs has checksums too, but without redundancy it can only detect corruption, not fix it. The RAM requirement is real: the '1 GB RAM per TB' rule is a practical minimum for comfortable ARC operation.
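A minimal OpenZFS sketch showing the pool → dataset → snapshot flow (disk names and the pool name 'tank' are placeholders; requires the zfs kernel module):

  zpool create tank mirror /dev/sdb /dev/sdc     # mirrored pool — redundancy is what enables self-healing
  zfs create -o compression=lz4 -o atime=off tank/data
  zfs snapshot tank/data@before-migration        # instant, space-efficient snapshot
  zpool scrub tank                               # re-verify every checksum; repair from the mirror if needed
  zpool status -v tank                           # scrub progress and any repaired or unrecoverable errors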
34 · Section 3 · Virtual FS

Special Filesystems: tmpfs, proc, sysfs, and the VFS

tmpfs — RAM-Backed Filesystem

Looks and behaves like a regular filesystem, but lives entirely in RAM (and swap). Lost on reboot. Used for /tmp, /run, /dev/shm. Mount: mount -t tmpfs -o size=4G tmpfs /tmp. Lightning fast — zero disk I/O.

proc & sysfs — Kernel as Files

/proc exposes running processes and kernel parameters as files. /proc/cpuinfo, /proc/meminfo — all virtual files generated on-demand. /sys exposes hardware and driver configuration. Reading /proc/cpuinfo doesn't touch the disk — the kernel generates it on the fly.

devtmpfs & udev

/dev contains device nodes — special files that represent hardware. devtmpfs populates /dev automatically as devices appear. ls -la /dev/sda shows a block device node; reading/writing it reads/writes the raw disk.

VFS — Virtual Filesystem Switch

The kernel's VFS layer routes read/write calls to the appropriate driver: ext4, Btrfs, tmpfs, proc — the application doesn't know or care which. To the OS, everything is a file.

The VFS is one of Unix's most elegant ideas: presenting everything — files, devices, kernel parameters, network sockets — through a uniform interface: open, read, write, close. Using tmpfs for /tmp gives a significant speed boost for programs creating many temporary files — compilation, package building, scripting.
📝 Notes
The VFS is one of Unix's most elegant design decisions. By presenting everything through a uniform file interface, applications don't need to know what's underneath. 'cat /proc/cpuinfo' uses the exact same syscall as 'cat /home/user/document.txt'. tmpfs for /tmp is a significant performance win for programs that create many temporary files — compilation, package building, and many scripted workflows all benefit enormously from zero disk I/O for temporary data.
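A small tmpfs sketch — the ad-hoc mount from the slide plus a persistent /etc/fstab line (size and mount point are examples):

  mount -t tmpfs -o size=4G,noexec,nosuid,nodev tmpfs /mnt/scratch   # RAM-backed mount, zero disk I/O
  df -h /mnt/scratch                                                 # shows a 4 GB filesystem with no disk behind it
  # persistent equivalent in /etc/fstab:
  # tmpfs  /tmp  tmpfs  size=4G,noexec,nosuid,nodev,mode=1777  0  0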
35 · Section 3 · Visual

Filesystem Behaviour on HDD vs SSD — Why It Matters

Filesystem | HDD (spinning) | SATA/NVMe SSD | NVMe + heavy random write
FAT32 / exFAT | ✓ OK — small files/USB fine; fragmentation over time; no crash safety | ✓ OK — fine for removable media; TRIM not supported | ⚠ Poor — no TRIM; wastes SSD capability; use only for /boot/efi
ext4 (journaling) | ✓ Good — excellent; sequential I/O, minimal fragmentation | ✓ Good — works well; use noatime + discard=async | ✓ Good — solid and predictable; best for /boot and databases
XFS (journaling) | ✓ Good — excellent for large files & concurrent writes; cannot shrink | ✓ Great — ideal for /var; high IOPS, parallel metadata ops | ✓ Great — best filesystem for /var on NVMe; high write parallelism
Btrfs (COW) | ⚠ Caution — COW → fragmentation grows; slow random reads over time; avoid for /var | ✓ Good — checksums catch bit rot; snapshots + compression; ideal /home | ✓ Great — full benefit: fast GC, low fragmentation penalty, snapshots
ZFS (COW + pool) | ✓ Great — designed for large HDD arrays; self-healing with redundancy | ✓ Great — ARC cache shines; L2ARC on SSD accelerates the read cache | ⚠ RAM! — needs lots of RAM for ARC (1 GB/TB guideline); otherwise excellent
⚠ Key rule: Btrfs/ZFS COW on /var or DB data dirs = write amplification. Use ext4 or XFS there, and set nodatacow on any Btrfs mount used by databases. Production servers with Btrfs on /var have shown 10× write amplification under systemd-journald load.
📝 Notes
This matrix is the practical synthesis of everything in Section 3. Btrfs on HDD: COW writes never touch the same physical location twice, so files fragment continuously. On SSDs there's no seek penalty so fragmentation doesn't hurt. On HDDs, reading a heavily fragmented Btrfs filesystem after months of use can be shockingly slow. ZFS RAM requirement: run ZFS on a machine with 4 GB of RAM and the ARC has almost no space — you lose the main advantage while keeping all the complexity. ZFS is for servers with 16–128 GB of RAM.
36 · Section 3 · External Resource

See the Filesystems in Action

https://oos.wimic.agh.edu.pl/oos/3/ →

AGH Open Operating Systems course — practical filesystem exercises

Section 4

Partitioning Linux Disks

Strategy, tools, mount options, and LVM

38 · Section 4 · Directory Analysis

Linux Directory Structure: I/O Profiles

Directory | Content | Write Frequency | File Size / Count | Notes
/ | OS core, init, config | Very Low | Mixed | Stable after install; updates only
/home | User files, docs, media | Low–Med | Large files | Rare writes, large sequential reads
/var | Logs, databases, caches, mail | Very High | Many small | THE hot-write directory
/var/log | System logs | Continuous | Small, appended | Append-only, high volume
/var/lib | Database files (postgres, etc.) | Very High | Medium | Random reads/writes — DB I/O
/tmp | Temporary build files | High | Varies | tmpfs ideal — evaporates on reboot
/boot | Kernel, initrd, GRUB | Very Low | Small | Writes only on kernel updates
/boot/efi | EFI system partition | Rarely | Small | Must be FAT32; UEFI reads it
/usr | Programs, libraries, shared | Very Low | Many medium | Read-only in practice; updates only
/opt | Third-party software | Low | Mixed | Similar to /usr
The /var rows are the most important. On a busy server, /var does thousands of small random writes per second. This is fundamentally different from /home, which might be written to a few times an hour. This difference drives the choice to put them on separate partitions with different filesystems and mount options.
📝 Notes
/var is where the system lives at runtime — logs are written continuously, package databases are updated, and any databases (PostgreSQL, MySQL, MariaDB) live in /var/lib. This directory does thousands of small random writes per second on a busy system. It has completely different requirements from /home, which might get written to a few times an hour when a user saves a document.
39 · Section 4 · Rationale

Why Separate Partitions?

Isolation

A full /var (logs, caches) does NOT crash the OS. / still has free space. Without isolation, a rogue log file can fill the root filesystem and make the system unbootable.

Different Mount Options

/home → noexec, nosuid (users can't execute binaries or suid tricks). /tmp → noexec, nosuid, nodev. Mount options apply to a whole filesystem, so directories needing different options must live on separate partitions.

Different Filesystems

/var benefits from XFS (high IOPS, many small files). /home benefits from Btrfs (snapshots, compression). Two filesystems cannot exist on one partition normally.

Different Backup Strategies

/home: backed up nightly. /var/cache, /tmp: never backed up. /boot: backed up before kernel updates. Separate partitions → separate backup policies.

Quotas & Accounting

Per-partition disk quotas for multi-user systems. Easier capacity planning — you know exactly how much space each subsystem uses.

Independent Resize & Replace

Move /home to a new, larger disk without touching /. Add a faster NVMe just for /var. Replace / with a fresh install keeping /home intact.

The isolation argument alone justifies separate /var on any server. A log rotation script failure filling /var/log would take down a system with a single root partition. With separate /var, SSH still works and the admin can log in and fix the problem.
📝 Notes
Isolation justifies separate /var on any server. Production servers have gone down because a log rotation script failed and /var/log filled the root filesystem — the SSH daemon couldn't write its PID file, the package manager couldn't run, and the system became unresponsive. With separate /var, only the functionality that writes to /var is affected — the OS itself keeps running.
40 · Section 4 · FS Selection

Choosing the Right Filesystem for Each Mount

Mount | Filesystem | Key Options | Reason
/boot/efi | FAT32 | defaults | UEFI specification requirement. Must be FAT32 — no exceptions.
/boot | ext4 | defaults | GRUB must read it at boot — universal support. No COW complexity needed.
/ (root) | Btrfs or ext4 | compress=zstd, discard=async | Btrfs: system snapshot rollback via snapper. ext4: maximum stability and simplicity.
/home | Btrfs | noatime, compress=zstd | Large files → compression saves 30–50%. Per-user subvolumes → per-user snapshots.
/var | XFS or ext4 | noatime, discard=async | Thousands of small files, high random write frequency. XFS excels here. AVOID Btrfs COW.
/tmp | tmpfs | size=4G, noexec, nosuid | RAM-backed — zero disk I/O. Auto-cleaned on reboot. Size-limited to prevent RAM exhaustion.
⚠ The /var recommendation is critical: avoid Btrfs COW for /var. systemd-journald writes many tiny log entries continuously. Btrfs has to write data, update inode, update B-tree node, and update the root for each — that's 4× the I/O for each log write, thousands of times per second on a busy server. XFS was designed for exactly this workload.
📝 Notes
The /var recommendation deserves emphasis. When systemd-journald writes a log entry — 200 bytes — Btrfs has to: write the data to a new block, write an updated inode to a new block, write an updated B-tree node to a new block, and update the root. That's 4× the I/O for 200 bytes of log data, and it happens thousands of times per second on a busy server. XFS was designed for exactly this workload — high-throughput, many-small-files, concurrent writers.
41 · Section 4 · Practical Example

Practical Partitioning Example: 500 GB SSD

Option A: Direct Partitions

Device | Size | FS | Mount
/dev/sda1 | 512 MB | FAT32 | /boot/efi
/dev/sda2 | 1 GB | ext4 | /boot
/dev/sda3 | 40 GB | Btrfs | /
/dev/sda4 | 280 GB | Btrfs | /home
/dev/sda5 | 80 GB | XFS | /var
/dev/sda6 | rest | ext4 | /data

Option B: With LVM (recommended)

LV / Device | Size | FS | Mount
/dev/sda1 | 512 MB | FAT32 | /boot/efi
/dev/sda2 | 1 GB | ext4 | /boot
/dev/sda3 → LVM PV | Volume Group 'vg0' | |
lv_root | 40 GB | Btrfs | /
lv_home | 280 GB | Btrfs | /home
lv_var | 80 GB | XFS | /var
lv_tmp | 10 GB | ext4 | /tmp (fallback)
(free) | ~89 GB | | Grow any LV online later
The LVM advantage: if /var grows beyond expectations, run lvextend -L +20G /dev/vg0/lv_var && xfs_growfs /var — online, without unmounting, without rebooting, in ~10 seconds. The 89 GB of unallocated space acts as a flexible reserve for whichever volume needs to grow.
📝 Notes
The LVM option is the recommended approach for any serious system. The key advantage is the unallocated space at the bottom of the VG: if /var grows beyond expectations, you can extend it online while PostgreSQL is running — no downtime, no unmounting, no rebooting. Without LVM, resizing partitions requires careful coordination of adjacent partition boundaries. LVM decouples logical size from physical layout entirely.
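Option B as commands — a sketch that mirrors the table above (device, VG, and LV names are the ones from the example):

  pvcreate /dev/sda3                      # mark the partition as an LVM physical volume
  vgcreate vg0 /dev/sda3                  # volume group pooling that space
  lvcreate -L 40G  -n lv_root vg0
  lvcreate -L 280G -n lv_home vg0
  lvcreate -L 80G  -n lv_var  vg0
  lvcreate -L 10G  -n lv_tmp  vg0         # ~89 GB deliberately left unallocated in the VG
  mkfs.btrfs /dev/vg0/lv_root
  mkfs.btrfs /dev/vg0/lv_home
  mkfs.xfs   /dev/vg0/lv_var
  mkfs.ext4  /dev/vg0/lv_tmp
  # later, when /var runs low:
  lvextend -L +20G /dev/vg0/lv_var && xfs_growfs /var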
42 · Section 4 · Mount Options

Mount Options That Matter

Option | Effect | Recommended for
noatime | Disables access time (atime) updates on reads. Each "cat file" no longer triggers a metadata write. | /home, /var, all SSDs
relatime | Updates atime only if it's older than mtime. Safer default — some apps rely on atime ordering. | Default safe option on /
noexec | Prevents execution of binaries from this filesystem. Blocks direct exploit payloads. | /tmp, /var, /home (security hardening)
nosuid | Ignores setuid/setgid bits. Prevents privilege escalation via suid binaries on user-writable mounts. | /tmp, /home
compress=zstd | Btrfs: transparent zstd compression. Excellent ratio+speed balance. Saves 30–60% on text/code. | Btrfs /home, /
nodatacow | Btrfs: disable COW for specific files/directories. Falls back to journaling semantics. Essential for databases. | Btrfs + PostgreSQL/MySQL data dirs
discard=async | SSD TRIM: asynchronously informs the SSD of freed blocks, enabling proactive GC. Native to Btrfs; ext4/XFS offer a plain discard option, or use periodic fstrim.timer instead. | All SSDs — prefer async or periodic TRIM over synchronous discard
barrier=0 | Disables write barriers (cache flush ordering). DANGEROUS — only safe with battery-backed cache. | Never without battery-backed cache
noatime is the single highest-impact, zero-risk mount option. Every file read on a normal mount triggers a metadata write to update the access time. On a system reading thousands of library files during compilation, this is thousands of unnecessary writes. noatime eliminates them completely.
📝 Notes
noatime is the single highest-impact, zero-risk mount option. Every file read on a normal mount triggers a metadata write to update the access time. On a system reading thousands of library files during compilation, this is thousands of unnecessary writes. noatime eliminates them. discard=async is the modern way to enable TRIM — 'async' batches TRIM commands rather than issuing one per deleted file. barrier=0 disables write barriers and can corrupt data on power loss unless hardware-level write ordering is guaranteed by a battery-backed controller cache.
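How these options end up in /etc/fstab — a sketch following this guide's recommendations (UUIDs are placeholders; get the real ones from blkid, and note that XFS here relies on fstrim.timer for TRIM rather than a discard mount option):

  # <device>      <mount>     <fs>    <options>                             <dump> <pass>
  UUID=AAAA-BBBB  /boot/efi   vfat    defaults                              0 2
  UUID=1111-...   /           btrfs   noatime,compress=zstd,discard=async   0 0
  UUID=2222-...   /home       btrfs   noatime,nosuid,compress=zstd          0 0
  UUID=3333-...   /var        xfs     noatime                               0 0
  tmpfs           /tmp        tmpfs   size=4G,noexec,nosuid,nodev           0 0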
43 · Section 4 · Tools

Partitioning & Formatting Tools in Linux

Partitioning

  • fdisk /dev/sda — interactive text UI, now supports GPT. Classic tool, still widely used.
  • gdisk /dev/sda — GPT-focused successor. Better GPT handling, same workflow.
  • parted /dev/sda — supports MBR and GPT, scripting-friendly, used by installers.
  • gparted — GUI frontend to parted. Excellent for visual partition resizing. Run from a live USB.
  • lsblk — list block devices, sizes, mount points, and relationships in a tree. First thing to run on a new system.
  • blkid — show UUID, filesystem type, and label for each partition. Essential for writing /etc/fstab.

Formatting & Resizing

  • mkfs.ext4 /dev/sda2 — format as ext4. Options: -L label, -m 1 (reduce reserved blocks).
  • mkfs.xfs /dev/sda5 — format as XFS. Options: -L label, -f (force).
  • mkfs.btrfs /dev/sda3 — format as Btrfs. Options: -L label, -d single.
  • tune2fs -m 1 /dev/sda2 — reduce ext4 reserved block percentage (default 5% → 1% on large volumes).
  • resize2fs /dev/sda2 80G — resize ext4 filesystem (partition must be resized first).
  • xfs_growfs /var — grow XFS filesystem online to fill available space.
⚠ Always run lsblk and blkid before any partitioning work to confirm the correct device. Writing to the wrong disk is catastrophic — there is no undo.
📝 Notes
For a new installation, use either the distro installer or gdisk followed by mkfs commands. For an already-running system, gparted from a live USB is the safest way to resize partitions — you can't safely resize a mounted filesystem. Growing is usually safe while mounted: ext4 via resize2fs, XFS via xfs_growfs, Btrfs via btrfs filesystem resize. Shrinking almost always requires unmounting.
44 · Section 4 · LVM

LVM: Logical Volume Manager — Why It Changes Everything

The LVM Layer Stack

PV — Physical Volumes: /dev/sda3, /dev/sdb1, /dev/nvme0n1p3 — actual partitions
VG — Volume Group: vg0 — pool of all PV space, addressed as one
LV — Logical Volumes: lv_root, lv_home, lv_var — virtual block devices, arbitrarily sized
FS — Filesystems: ext4, XFS, Btrfs — formatted on top of each LV

Key Capabilities

  • Online resize — grow an LV and filesystem while mounted and in use. No downtime: lvextend -L +50G vg0/lv_var && xfs_growfs /var
  • Span multiple disks — add a new disk as a PV, extend the VG, then extend any LV across it. Transparent to the filesystem.
  • Snapshots — block-level COW snapshots of any LV, regardless of filesystem type. Used for consistent backups.
  • Thin provisioning — allocate more logical space than physical. Overcommit storage; allocate physical only as needed.
  • Cache volumes — attach a fast NVMe as an lvmcache to accelerate a slow HDD logical volume automatically.
  • Migration — pvmove migrates data between PVs online. Replace a failing disk without downtime.
LVM is the missing layer that makes Linux storage truly flexible. Without LVM, partition sizes are mostly fixed at install time. With LVM, you can grow /var by 50 GB while PostgreSQL is running — no downtime, no unmounting, no rebooting. Always leave some unallocated space in the VG for future growth.
📝 Notes
LVM adds abstraction between partitions and filesystems. Without LVM, partition sizes are mostly fixed at install time. With LVM, you can grow /var by 50 GB while the system is running PostgreSQL — no downtime, no unmounting, no rebooting. The 'unallocated space in the VG' pattern is the key practice: always leave some space in the VG unallocated. You don't need to know how big /var will eventually be — just give it a reasonable initial size and leave spare capacity in the VG to extend later.
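Growing the pool across a second disk, and migrating off a failing one — a sketch with placeholder device names:

  pvcreate /dev/sdb1                       # a new disk (or partition) becomes another PV
  vgextend vg0 /dev/sdb1                   # the VG now spans both disks
  lvextend -L +100G vg0/lv_home            # any LV can use the new space...
  btrfs filesystem resize max /home        # ...then grow the filesystem on top (xfs_growfs / resize2fs for XFS / ext4)
  pvmove /dev/sda3 /dev/sdb1               # optionally migrate all extents off the old PV — online
  vgreduce vg0 /dev/sda3                   # and retire it from the VG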
45 · Section 4 · Alignment

Alignment, 4K Sectors, and Why They Matter

The 512 → 4K Transition

Old disks: 512-byte physical sectors. Modern disks: 4096-byte (4K/4Kn) physical sectors. Many modern drives emulate 512-byte sectors (512e) for compatibility while having 4K physical sectors underneath.

Misalignment Penalty

If a partition starts at a non-4K-aligned offset, every 4K filesystem write spans two physical 4K sectors: Read-Modify-Write old sector 1 + Read-Modify-Write old sector 2. That's 4 physical I/Os instead of 1. Performance halved, wear doubled on SSDs.

Modern Tool Defaults (Safe)

fdisk, gdisk, and parted all default to 1 MiB alignment for new partitions. 1 MiB covers all 4K sector sizes and all SSD erase block sizes (128 KB–4 MB). You rarely need to think about this — but understand why.

Checking Alignment

parted /dev/sda align-check optimal 1
cat /sys/block/sda/queue/physical_block_size
📝 Notes
Partition alignment is handled automatically by modern tools. The 1 MiB alignment standard is generous — it aligns to any physically plausible sector or erase block size. If scripting partition creation, always specify units in MiB and let parted snap to the nearest boundary. The align-check command is good for verifying partitions created by older tools or migrated from other systems.
Section 5

Putting It All Together

Decision frameworks, summary tables, and next steps

47Section 5 · Framework

Decision Framework: Choosing Your Storage Stack

❓ SSD or HDD?

SSD → Enable TRIM (discard=async on Btrfs, or the periodic fstrim.timer); prefer a COW FS, or XFS/ext4 with noatime.
HDD → Prefer journaling FS, minimise fragmentation, separate /home.

❓ What is the workload?

Large sequential files → XFS or Btrfs+compress.
Many small random files → XFS or ext4.
System snapshots → Btrfs+snapper.
Databases → ext4/XFS; on Btrfs, set nodatacow (chattr +C) on the data directory.

❓ How many disks?

Single → Careful partition sizing + LVM for flexibility.
Multiple → Btrfs RAID 1/10 or ZFS for integrity; mdadm + LVM for maximum control.

❓ Need snapshots for rollback?

Yes → Btrfs with snapper (desktop/workstation). ZFS with zfs-auto-snapshot (server). Both integrate with package managers for automatic pre-update snapshots. (A snapshot sketch follows these questions.)

❓ Need maximum flexibility?

Always use LVM between partitions and filesystems. Leave 10–20% of the VG unallocated. Use thin provisioning for development/test VMs. (A thin-provisioning sketch follows these questions.)

❓ Need data integrity above all?

ZFS: end-to-end checksums, self-healing, ARC cache, ZIL for sync writes. Requires more RAM (minimum 8 GB, ideally 32+ GB for large pools).
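Two quick sketches for the snapshot and thin-provisioning answers above (volume-group, subvolume, and path names are illustrative):
# Read-only Btrfs snapshot before a risky change (assumes / is a Btrfs subvolume and /.snapshots exists):
btrfs subvolume snapshot -r / /.snapshots/pre-upgrade
# With snapper configured, roughly equivalent: snapper create -d "pre-upgrade"
# LVM thin provisioning: 250 GB of virtual space backed by 100 GB of physical space
lvcreate -L 100G -T vg0/thinpool
lvcreate -V 250G -T vg0/thinpool -n lv_vm1
lvs -a                                    # the Data% column shows how much is physically allocated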

📝 Notes
These six questions cover 90% of storage architecture decisions. Work through them top to bottom for any new system. The workload question is the most important — the I/O profile of applications should drive the filesystem choice. The flexibility question is why LVM is recommended by default: the cost of setting it up is 5 minutes; the benefit of being able to resize volumes online is enormous over the lifetime of a system.
48Section 5 · Summary

Quick Reference: Recommended Configuration

Mount Point | Filesystem | Key Mount Options | Reason
/boot or /boot/efi | FAT32 | defaults | UEFI requirement
/boot (for BIOS) | ext4 | defaults | GRUB compatibility
/ | Btrfs or ext4 | compress=zstd, discard=async (Btrfs) | Snapshots (Btrfs) or stability (ext4)
/home | Btrfs | noatime, compress=zstd | Large files, compression, per-user snapshots
/var | XFS or ext4 | noatime, discard (or periodic fstrim) | High-frequency small writes; avoid COW
/var/lib/postgresql | ext4 or XFS | noatime; nodatacow (chattr +C) on Btrfs | DB random I/O — COW is harmful
/tmp | tmpfs | size=4G,noexec,nosuid | RAM speed, auto-clean on reboot
(all volumes) | LVM underneath | n/a | Online resize, flexibility, snapshots
⚠ The database row is often forgotten: if you install PostgreSQL and leave its data directory on a Btrfs COW mount, you'll see elevated write amplification. Set chattr +C /var/lib/postgresql to enable nodatacow on that directory after creating it, before initialising the database cluster.
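A short sketch of that step (the path is the common default data directory; adjust for your distro):
mkdir -p /var/lib/postgresql
chattr +C /var/lib/postgresql             # new files created inside inherit the No_COW attribute
lsattr -d /var/lib/postgresql             # the 'C' flag confirms it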
📝 Notes
This is the cheat sheet — print it, save it, use it for the next install. The most important rows: /var — use XFS, not Btrfs COW; /home — use Btrfs with compression for maximum space efficiency. The database row is often forgotten: if you install PostgreSQL and leave its data directory on a Btrfs COW mount, you'll see elevated write amplification. Set chattr +C on the directory after creating it, before initialising the database cluster.
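To make the table concrete, a hedged /etc/fstab sketch (UUIDs are placeholders; options assume the filesystem choices above):
UUID=<root-uuid>   /      btrfs   noatime,compress=zstd,discard=async   0 0
UUID=<home-uuid>   /home  btrfs   noatime,compress=zstd                 0 0
UUID=<var-uuid>    /var   xfs     noatime                               0 0
tmpfs              /tmp   tmpfs   size=4G,noexec,nosuid                 0 0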
Section 6

Extra Info

Case sensitivity, encoding, permissions, compression, and RAM

50Section · Filesystem

Case Sensitivity — It's the Filesystem, Not the Kernel

How it works: The kernel passes the filename string to the filesystem driver UNCHANGED. The driver alone decides whether 'File.txt' and 'file.txt' are the same path or two different ones. The kernel has no opinion — it is completely encoding- and case-agnostic.
Filesystem | Case-sensitive? | Notes
ext4 | ✓ Yes (default) | Can be made case-insensitive per directory (kernel 5.2+, casefold option)
XFS | ✓ Yes | Always case-sensitive, no toggle
Btrfs | ✓ Yes (default) | Per-directory case-insensitive flag (+F) available
ZFS | ✓ Yes (default) | casesensitivity=insensitive available per dataset
FAT32 / exFAT | ✗ No | Case-preserving but case-insensitive — standard USB behaviour
NTFS | ✗ No (default) | Case-insensitive by default; case-sensitive mode available in Windows 10+
tmpfs | ✓ Yes | RAM filesystem — always case-sensitive
⚠ Practical consequence: copying from ext4 → FAT32 when 'readme.txt' and 'README.TXT' both exist will silently overwrite one with the other. This is a common source of cross-platform bugs in git repositories and shared filesystems.
📝 Notes
Case sensitivity lives in the FILESYSTEM DRIVER, not the kernel. The VFS just passes filenames exactly as typed to the driver. ext4, XFS, Btrfs, ZFS compare bytes exactly, so 'File.txt' != 'file.txt'. FAT/NTFS fold to uppercase before comparing, so they are equal. Practical consequence: copying from a case-sensitive Linux ext4 partition to a FAT32 USB drive may cause files named 'readme.txt' and 'README.TXT' to silently collide — one will overwrite the other. This is a real source of bugs in cross-platform projects.
51Section · Case Sensitivity Demo

How Many Different Files Can You Have Named 'file.txt'?

On a case-sensitive filesystem (ext4 / XFS / Btrfs / ZFS), the word 'file' has 4 characters, each of which can be upper or lowercase: 2⁴ = 16 distinct names, each a separate file with its own inode:

file.txt
File.txt
fIle.txt
FIle.txt
fiLe.txt
FiLe.txt
fILe.txt
FILe.txt
filE.txt
FilE.txt
fIlE.txt
FIlE.txt
fiLE.txt
FiLE.txt
fILE.txt
FILE.txt
On FAT32 or NTFS (case-insensitive): all 16 names point to the SAME file. First one created wins; subsequent creates silently overwrite it.
# Try on Linux (ext4) — all 16 create DIFFERENT files:
touch file.txt filE.txt fiLe.txt fiLE.txt fIle.txt fIlE.txt fILe.txt fILE.txt
touch File.txt FilE.txt FiLe.txt FiLE.txt FIle.txt FIlE.txt FILe.txt FILE.txt
ls -1 | wc -l   # → 16
Real-world consequence: git repositories cloned on Linux that contain files whose names differ only in case will behave incorrectly on Windows/macOS, where the second file silently replaces the first. This has caused actual incidents in large cross-platform projects.
📝 Notes
The word 'file' has 4 characters → 2⁴ = 16 distinct name combinations. On any case-sensitive Linux filesystem, all 16 are genuinely different files with different inodes. On FAT32 or NTFS, all 16 names are identical because the comparison is done after folding to uppercase — the first file created wins; all subsequent creates are overwrites. This is a known problem in cross-platform development and has caused real incidents: git repos with case-conflicting filenames fail silently on Windows.
52Section · Kernel / VFS

Users, Groups & Permissions — The Kernel's Job (VFS Layer)

Call sequence:

  1. Process calls open("file", O_RDONLY)
  2. VFS: compare process UID/GID against the inode's uid/gid/mode
  3a. Permission check PASSES → filesystem driver → disk
  3b. PERMISSION DENIED (EACCES) — filesystem driver never invoked

Where metadata is STORED vs ENFORCED

Stored: in the filesystem inode (uid, gid, mode fields on disk).
Enforced: by the kernel VFS — before the request ever reaches the driver.
The filesystem driver never sees a request it isn't allowed to serve.

root always bypasses permission checks

UID 0 (root) skips the rwx check entirely inside the kernel. This is hardcoded in the VFS layer — no filesystem can override it.

FAT32 / exFAT — no inode permissions

FAT has no uid/gid/mode fields on disk. The vfat driver synthesises fake permissions from mount options: uid=1000, gid=1000, fmask=133. These exist in RAM only — never written to disk. The enforcement mechanism is identical; only the source of the metadata differs.
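A sketch contrasting the two cases (device, mount point, and file names are illustrative):
# ext4: real uid/gid/mode live in the inode; the VFS enforces them before the driver runs
stat -c '%U %G %a' /etc/shadow            # e.g. root shadow 640; a normal user's open() returns EACCES
# FAT32: nothing on disk, so the vfat driver synthesises ownership from mount options
mount -o uid=1000,gid=1000,fmask=133,dmask=022 /dev/sdc1 /mnt/usb
stat -c '%U %G %a' /mnt/usb/file.txt      # appears as UID 1000, mode 644, stored nowhere on disk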

📝 Notes
Key distinction: STORED → filesystem (the inode on disk contains uid, gid, and the 12 permission bits). ENFORCED → kernel VFS layer, every single time, before the filesystem driver is involved. The call sequence: syscall → VFS permission_check() → if OK → filesystem driver → disk. For FAT: there is no uid/gid/mode in the FAT directory entry structure (designed in 1977 for single-user DOS). The vfat driver lies to the VFS — it returns whatever was specified in the mount options. The kernel then enforces those fake values just as strictly as it would enforce real ext4 metadata. Network filesystems (NFS, SMB) are a special case: both the client kernel AND the remote server kernel perform permission checks independently.
53Section · Encoding

File Encoding — What Is Actually Stored in a Filename?

Linux kernel rule: filenames are just bytes

The kernel stores filenames as raw byte sequences. It has no idea what encoding they use. The only forbidden bytes are 0x00 (NUL — string terminator) and 0x2F (slash — path separator). Everything else is legal.

What encoding is actually used?

By convention: UTF-8 everywhere on modern Linux (set via locale, e.g. LANG=en_US.UTF-8). But this is a userspace convention — the kernel never validates it. A file whose name is valid ISO-8859-2 is completely legal.

Encoding | Bytes for 'ą' | Bytes for '€' | Used where
ASCII | N/A (not representable) | N/A | Legacy English-only systems
ISO-8859-2 | 0xB1 (1 byte) | N/A | Old Polish/Central European systems
UTF-8 | 0xC4 0x85 (2 bytes) | 0xE2 0x82 0xAC (3 bytes) | All modern Linux/macOS/Web
UTF-16 | 0x0105 (2 bytes) | 0x20AC (2 bytes) | Windows NTFS filenames internally
A file created on a Polish Windows (UTF-16 NTFS) and copied to Linux may display as garbled characters if the terminal locale doesn't match. The bytes are the same — the interpretation differs.
NTFS stores filenames in UTF-16LE internally. The Linux ntfs3 driver automatically translates to/from UTF-8 when mounting — translation is transparent to userspace.
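A quick way to see the raw bytes the kernel actually stores (assumes a UTF-8 locale and an empty scratch directory):
mkdir /tmp/enc-demo && cd /tmp/enc-demo
touch 'ą.txt'                             # the shell passes the UTF-8 bytes 0xC4 0x85 for 'ą'
ls | od -An -tx1                          # → c4 85 2e 74 78 74 0a: raw bytes, no interpretation by the kernel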
📝 Notes
When you type 'ls' and see filenames, what encoding are those characters in? Answer: whatever was used when the file was created, because the kernel stores raw bytes. Modern Linux uses UTF-8 by convention (set in /etc/locale.conf or LANG environment variable). The practical problem: if you create a file on a system with ISO-8859-2 locale, then view it on a UTF-8 system, the bytes are unchanged but the terminal interprets them as UTF-8 — producing garbage characters. This is called 'mojibake.' Key takeaway: encoding is a USERSPACE convention; the kernel is encoding-agnostic.
54Section · COW Filesystems

Btrfs & ZFS — Transparent Compression and Why They Need Lots of RAM

How transparent compression works:

  1. Application: write(fd, buf, 4096) — a normal write syscall; the application is unaware of what follows
  2. VFS page cache (RAM) — the data lands in RAM first
  3. Btrfs/ZFS compress the block in RAM — before it reaches disk, transparently to the application
  4. The COMPRESSED block is written to NAND/disk — fewer physical sectors used

Compression algorithms:

  • lz4 — fastest, low ratio — best for /var, databases
  • zstd — fast, good ratio — default choice for /home
  • zlib — slow, best ratio — archives, rarely-read static data
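A hedged sketch of enabling and checking compression on Btrfs (device and paths are illustrative; compsize is a separate tool):
mount -o noatime,compress=zstd:3 /dev/vg0/lv_home /home    # compress new writes with zstd level 3
btrfs filesystem defragment -r -czstd /home/projects        # recompress data that already exists
compsize /home                                              # report on-disk vs logical size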

Why Btrfs & ZFS need lots of RAM:

B-tree metadata in RAM (both)

The entire filesystem B-tree (inode table, extent map, checksum tree) must be cached in RAM for fast lookups. The bigger the filesystem, the larger this tree. Evicting it forces every metadata operation to hit the disk.

ARC — Adaptive Replacement Cache (ZFS)

ZFS replaces the OS page cache with its own smarter cache called ARC. By default it can use up to 50% of available RAM. Guideline: 1 GB RAM per 1 TB of storage for comfortable ARC operation.

COW write buffers (both)

Every write first lands in RAM (page cache), gets compressed, checksummed, and only then flushed to a new free location. Under heavy write load, dirty pages accumulate in RAM before each sync.

Deduplication tables (ZFS)

Optional block-level dedup requires a hash table in RAM — one entry per block. A 1 TB pool with 4 KB blocks = 256 million entries. At ~320 bytes each = ~80 GB RAM. This is why ZFS dedup is disabled by default.
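A sketch of capping the ARC via the standard OpenZFS module parameter (the value is in bytes; 8 GiB here is just an example):
echo $((8 * 1024**3)) | sudo tee /sys/module/zfs/parameters/zfs_arc_max   # apply immediately
# persist across reboots in /etc/modprobe.d/zfs.conf:
#   options zfs zfs_arc_max=8589934592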

📝 Notes
Both Btrfs and ZFS compress data transparently — the application writes normal data, the filesystem compresses it before writing to disk, and decompresses when reading. Algorithm choice matters: lz4 is extremely fast with low CPU overhead and modest compression — good for /var. zstd offers the best balance of speed and ratio — recommended for /home. zlib is slow but squeezes the most — only worthwhile for rarely-read archival data. Typical savings: source code/text files → 50–70%. Compressed media (JPEG, MP4) → 0%.

The RAM requirement is the #1 operational concern with ZFS. The B-tree metadata must be warm in RAM for acceptable performance — a cold ZFS pool is noticeably slower than ext4. ZFS dedup is the 'gotcha' that has surprised many admins: enable it on a large pool without enough RAM and the system will swap to death.
55Further Reading & Tools

Questions?

Arch Wiki

Partitioning, Btrfs, LVM, XFS — the most comprehensive Linux storage documentation available.
wiki.archlinux.org

OpenZFS Docs

Authoritative ZFS documentation, tuning guides, and hardware recommendations.
openzfs.github.io

Btrfs Wiki

Official Btrfs documentation including gotchas and RAID status.
btrfs.readthedocs.io

fio — I/O Benchmarking

Benchmark any filesystem/partition configuration:
fio --name=randread --ioengine=libaio --rw=randread --bs=4k \
    --direct=1 --iodepth=32 --size=1G --runtime=30 --time_based

iostat / iotop

iostat -x 1 — per-device I/O stats.
iotop — per-process I/O (like top, for disk)

blktrace

Low-level block I/O tracing — see exactly what operations reach the disk. Combined with blkparse and btt for analysis.

Key takeaways from this guide:
SSDs ≠ faster HDDs — the architecture fundamentally changes how you manage data (write amplification, TRIM, FTL).
Partitions and filesystems are separate layers — treat them independently.
Match the filesystem to the workload: XFS → /var, Btrfs → /home, tmpfs → /tmp.
Use LVM for flexibility. On SSDs, enable noatime and make sure TRIM runs (discard=async on Btrfs, or the periodic fstrim.timer).