Kernel Mount (seaweed-vfs) — Architecture, Performance & Memory

Technical reference for the Kernel Mount — how the kernel module and daemon split the work, why it’s faster than FUSE for cached access, and why the mount process can’t run out of memory as file count grows.

Architecture

seaweed-vfs follows the WEKA model: a thin in-kernel module owns the VFS integration — inodes, dentries, and the Linux page cache — and a userspace daemon (sw-kd) does all SeaweedFS networking (filer gRPC for metadata, volume HTTP for data). The kernel does zero networking; the hard, fast-moving datapath stays in maintainable userspace.

The two halves talk over a /dev/seaweedvfs character device — an OrangeFS-style request/reply channel. The daemon serves that device; the kernel forwards VFS operations to it and caches the results.
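As a rough illustration of what a request/reply channel of this kind involves, the sketch below frames each request with a tag that the reply echoes, so replies can be matched to outstanding requests. The header layout, the OP_LOOKUP opcode, and the function names are all hypothetical; the real /dev/seaweedvfs wire format is not described here.

```python
import struct

# Hypothetical header: tag (u64), opcode (u32), payload length (u32), little-endian.
HEADER = struct.Struct("<QII")
OP_LOOKUP = 1  # hypothetical opcode, for illustration only


def encode(tag: int, opcode: int, payload: bytes) -> bytes:
    """Build one frame: fixed header followed by the payload bytes."""
    return HEADER.pack(tag, opcode, len(payload)) + payload


def decode(frame: bytes):
    """Split a frame back into (tag, opcode, payload)."""
    tag, opcode, length = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated frame")
    return tag, opcode, payload


def serve_one(frame: bytes, handlers) -> bytes:
    """Daemon side: decode a request, run its handler, echo the tag in the reply."""
    tag, opcode, payload = decode(frame)
    result = handlers[opcode](payload.decode("utf-8"))
    return encode(tag, opcode, result)
```

In this model the kernel side would write a frame to the device and later read back a frame carrying the same tag; the daemon never needs to remember anything between requests.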

Performance

Because the module owns the page cache and dentry cache:

Cached reads and metadata are served in-kernel — no per-operation userspace round-trip. A FUSE client bounces essentially every VFS call out to a userspace process; the kernel mount calls the daemon only on a cache miss or a write.
Kernel readahead accelerates sequential reads, and repeated reads are served straight from the page cache like any local filesystem.
An optional io_uring fast path (ublk-style) keeps many requests in flight for high-concurrency workloads; the default read()/write() channel is the fallback.
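The cache-miss-only behaviour described above can be pictured with a toy model. This is a sketch, not the module's code: CachedMount and its fetch callback are hypothetical stand-ins for the kernel-side cache and the daemon round-trip.

```python
class CachedMount:
    """Toy model: results are cached on the 'kernel' side; the 'daemon'
    (the fetch callback) is called only when the cache misses."""

    def __init__(self, fetch):
        self._fetch = fetch      # stands in for a round-trip to the daemon
        self._cache = {}         # stands in for the page/dentry cache
        self.round_trips = 0

    def read(self, path):
        if path not in self._cache:
            self.round_trips += 1
            self._cache[path] = self._fetch(path)
        return self._cache[path]


mount = CachedMount(lambda path: f"contents of {path}")
for _ in range(1000):
    mount.read("/data/report.csv")
print(mount.round_trips)
```

A thousand reads of the same path cost one round-trip in this model; in the FUSE comparison drawn above, each of those calls would instead cross into userspace.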

(Throughput and latency vary by workload and hardware — these are architectural properties, not a published benchmark.)

Memory: why it can’t OOM from scale

A FUSE client keeps a per-inode map in its own process. That map is a non-reclaimable heap that grows with the number of files the kernel has looked up; at large file counts it reaches multiple GB and can OOM the mount — ≈6.8 GB RSS was reported at ~33M files in upstream issue seaweedfs#10020.

The kernel mount keeps no per-inode map in the mount process: inode numbers are derived from the path (ino = hash(path)) rather than looked up in a table, so its metadata footprint stays flat at about 8 MB regardless of file count (measured from 500K to 33M files). Per-file metadata is still cached, but by the OS as reclaimable dentry/inode cache that the kernel evicts under memory pressure, so it does not build up into an OOM. That is a cost every filesystem client incurs, not a heap unique to this mount.
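To make ino = hash(path) concrete, here is a minimal sketch of deriving a 64-bit inode number from a path with no table behind it. The choice of BLAKE2b, the byte order, and the function name path_to_ino are assumptions for illustration; the hash the module actually uses is not stated here, and collision handling is out of scope for the sketch.

```python
import hashlib


def path_to_ino(path: str) -> int:
    """Derive a stable 64-bit inode number from a path (illustrative only)."""
    # An 8-byte digest gives a 64-bit value; BLAKE2b is an assumed choice.
    digest = hashlib.blake2b(path.encode("utf-8"), digest_size=8).digest()
    ino = int.from_bytes(digest, "little")
    # Avoid 0, which many tools treat as "no inode".
    return ino or 1


# The same path always maps to the same number, so nothing has to be stored.
print(path_to_ino("/buckets/logs/2024/app.log"))
```

Because the number is recomputed from the path each time, the process holds no per-file state no matter how many files have been looked up.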

In short: the part that grows with file count is reclaimable kernel cache, not an unbounded userspace heap — so the mount process can’t be the thing that runs you out of memory.

Daemon RSS is cache, not file-count growth

That 8 MB is the daemon’s metadata cost — the part that stays flat as the tree grows. The daemon’s total resident memory (RSS) also includes two things that track I/O activity, not file count, and both are bounded:

A whole-chunk read cache. Prefetched and recently-read immutable chunks are held in a capacity-bounded LRU so sequential and repeated reads hit memory. On a streaming mount this is the largest part of RSS. It defaults to 256 MiB and is set with --chunk-cache-mb (0 disables it) — see Per-filer tuning.
Allocator retention. glibc’s malloc holds freed memory in per-thread arenas instead of returning it to the OS, so RSS creeps up and only occasionally drops. Cap it by starting the daemon with MALLOC_ARENA_MAX=2.
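A capacity-bounded LRU of the kind described for the chunk cache can be sketched in a few lines. This is an illustration under assumptions, not the daemon's implementation: ChunkCache, its byte-based accounting, and the chunk identifiers are hypothetical.

```python
from collections import OrderedDict


class ChunkCache:
    """Byte-bounded LRU: total cached bytes never exceed capacity_bytes."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self._chunks = OrderedDict()  # oldest entry first

    def get(self, chunk_id):
        data = self._chunks.get(chunk_id)
        if data is not None:
            self._chunks.move_to_end(chunk_id)  # mark as most recently used
        return data

    def put(self, chunk_id, data: bytes):
        # A capacity of 0 disables caching; oversized chunks are skipped.
        if self.capacity == 0 or len(data) > self.capacity:
            return
        old = self._chunks.pop(chunk_id, None)
        if old is not None:
            self.used -= len(old)
        self._chunks[chunk_id] = data
        self.used += len(data)
        # Evict least recently used chunks until back under the cap.
        while self.used > self.capacity:
            _, evicted = self._chunks.popitem(last=False)
            self.used -= len(evicted)
```

The point of the sketch is the invariant: memory held by the cache is capped by configuration, so it tracks read activity up to that cap and then stops growing.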

So an active daemon can range from tens of MB to a couple of GB depending on the read cache and allocator — that’s bounded cache you can tune, not the unbounded per-file heap that OOMs a FUSE client at scale.
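The two knobs named above can be applied together when launching the daemon. The helper below only builds the argument list and environment; it is a sketch that assumes the flag takes its value as a separate argument, and it omits whatever other arguments sw-kd requires (mount point, filer address, and so on), which are not covered here.

```python
import os


def daemon_launch(chunk_cache_mb=256, arena_max=2, base_env=None):
    """Return (argv, env) for starting sw-kd with bounded cache and arenas.

    Hypothetical helper: only --chunk-cache-mb and MALLOC_ARENA_MAX come from
    the text above; any other required sw-kd arguments are left out.
    """
    if chunk_cache_mb < 0:
        raise ValueError("chunk_cache_mb must be >= 0 (0 disables the cache)")
    env = dict(os.environ if base_env is None else base_env)
    env["MALLOC_ARENA_MAX"] = str(arena_max)  # caps glibc malloc arenas
    argv = ["sw-kd", "--chunk-cache-mb", str(chunk_cache_mb)]
    return argv, env


argv, env = daemon_launch(chunk_cache_mb=128)
print(argv, env["MALLOC_ARENA_MAX"])
```

The returned pair could be handed to subprocess.Popen(argv, env=env) once the remaining arguments are filled in.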

Stateless daemon

The kernel↔daemon protocol is path-based: each request carries its target path(s), so the daemon holds no inode map and has no state to rebuild after a restart. You can restart or upgrade sw-kd without unmounting: requests in flight at that moment get -ENOTCONN, then the daemon re-attaches (a brief I/O pause, no unmount).
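An application that wants to ride through a daemon restart could retry operations that fail with ENOTCONN. The sketch below assumes the error reaches the caller as an OSError with errno ENOTCONN; whether and when it surfaces to applications is not specified above, and the attempt count and delay are arbitrary.

```python
import errno
import time


def retry_on_enotconn(op, attempts=5, delay=0.2, sleep=time.sleep):
    """Call op(); retry a few times if it fails with ENOTCONN (illustrative)."""
    for attempt in range(attempts):
        try:
            return op()
        except OSError as exc:
            # Re-raise anything that is not ENOTCONN, or the final failure.
            if exc.errno != errno.ENOTCONN or attempt == attempts - 1:
                raise
            sleep(delay)  # give the daemon time to re-attach
```

Usage would look like retry_on_enotconn(lambda: open(path, "rb").read()), where path is a file on the mount.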

Requirements & limits

Linux kernel 6.1 or newer, x86_64 or arm64 (CI validates 6.1 LTS through 7.0).
A reachable filer gRPC endpoint (the filer HTTP port plus 10000; 18888 with the default HTTP port of 8888).
Under Secure Boot, the module must be signed by an enrolled key — DKMS signs with a per-host key and prompts enrollment; precompiled modules use a key you enroll once with mokutil.
The kernel does no networking; all SeaweedFS I/O (and any future RDMA datapath) lives in the daemon.
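The first two requirements are easy to check from a script. The helpers below are a sketch with hypothetical names: kernel_ok compares the major.minor part of a kernel release string against 6.1 and accepts the architecture names Linux typically reports (arm64 shows up as aarch64 in uname), and filer_grpc_port applies the plus-10000 rule.

```python
import platform
import re

# "aarch64" is how Linux usually reports arm64; "arm64" is accepted as well.
SUPPORTED_ARCHES = {"x86_64", "aarch64", "arm64"}


def kernel_ok(release: str, machine: str, minimum=(6, 1)) -> bool:
    """True if the release is at least `minimum` and the arch is supported."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return False
    version = (int(match.group(1)), int(match.group(2)))
    return version >= minimum and machine in SUPPORTED_ARCHES


def filer_grpc_port(http_port: int = 8888) -> int:
    """The filer gRPC port is the filer HTTP port plus 10000."""
    return http_port + 10000


print(kernel_ok(platform.release(), platform.machine()), filer_grpc_port())
```

This checks only the version and architecture; Secure Boot signing and filer reachability still need to be verified separately.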

See Kernel Mount for install and mount instructions, and Kernel Mount operations for manual install, building from source, deploy, and upgrade details.