# ADR 001: Lint the Source, Not the Build

**Status:** Active (Genesis Decision)
**Decider:** Architecture Lead
**Date:** 2026-01-01 (founding principle, pre-v0.1.0)
## Context
When Zenzic was conceived, the dominant approach to documentation validation was output-based analysis: tools like `linkchecker` and `htmlproofer` fetch or parse the HTML generated by the build engine, then traverse the rendered page structure to verify link targets, image paths, and anchor IDs.
This approach has a fundamental structural flaw: the validator is downstream of the build. Validation can only run after the build succeeds. If the build fails — due to a syntax error, a missing plugin, or an engine version mismatch — no validation occurs at all. The pipeline produces silence where it should produce a diagnostic.
Three compounding problems emerge in CI environments:
- **Build coupling.** A documentation validator that requires a successful build cannot be the first gate in the pipeline. It must be placed after `mkdocs build` or `npm run build`, adding 2–10 minutes of build overhead before a single link is checked.
- **Engine fragility.** Build engines change how they generate anchor IDs, URL slugs, and asset paths between minor versions. A validator calibrated to the output of MkDocs 1.5 may silently miss broken links under MkDocs 1.6 because the ID generation scheme changed. The validator is, in effect, testing the engine's output rather than the author's intent.
- **Engine lock-in.** A validator that understands HTML from one engine cannot validate HTML from another without engine-specific adaptation. This creates a validation ecosystem that fragments along engine lines rather than converging on universal documentation quality standards.
The "MkDocs Crisis" — a period during Zenzic's early development when the reference documentation lost all link validity due to an MkDocs upgrade that changed slug generation — crystallised the cost of output-based validation. The error was not in the Markdown source; it was in the mismatch between the source and the engine's new URL convention. An output-based validator would have caught this only after the broken site was deployed.
## Decision
Zenzic analyzes raw Markdown source files and static configuration files exclusively. It never inspects, fetches, or depends on HTML build output.
The implementation vehicle for this decision is the Virtual Site Map (VSM) — a complete in-memory projection of the final site, constructed from source files alone, using engine-specific knowledge encoded in adapters (see ADR 005, ADR 007).
The VSM allows Zenzic to answer questions that previously required a live site:
- "Does this anchor `#installation` exist in the target page?" — answered by parsing the Markdown heading structure, not the rendered HTML.
- "Is this path `/docs/reference/finding-codes` a valid route?" — answered by the VSM's route graph, which models i18n fallbacks and versioned slugs without executing the build.
- "Is this asset referenced in `docusaurus.config.ts` present on disk?" — answered by static parsing of the TypeScript config file, not by starting a Node.js process.
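The idea behind these questions can be sketched in a few lines. The names below (`VirtualSiteMap`, `slugify`) and the simplified anchor rule are illustrative assumptions, not Zenzic's actual API:

```python
# Minimal VSM sketch: routes and anchors derived from Markdown source alone.
# (Hypothetical names; the slug rule is a simplification.)
import re
from dataclasses import dataclass, field

_HEADING = re.compile(r"^(#{1,6})\s+(.*)$", re.MULTILINE)

def slugify(text: str) -> str:
    """Simplified anchor slug: lowercase, punctuation stripped, spaces to hyphens."""
    text = re.sub(r"[^\w\s-]", "", text.lower())
    return re.sub(r"\s+", "-", text.strip())

@dataclass
class VirtualSiteMap:
    # route -> set of anchor slugs derived from Markdown headings
    routes: dict[str, set[str]] = field(default_factory=dict)

    def add_page(self, route: str, markdown: str) -> None:
        self.routes[route] = {slugify(m.group(2)) for m in _HEADING.finditer(markdown)}

    def has_route(self, route: str) -> bool:
        return route in self.routes

    def has_anchor(self, route: str, anchor: str) -> bool:
        return anchor in self.routes.get(route, set())

vsm = VirtualSiteMap()
vsm.add_page("/docs/install", "# Setup\n\n## Installation\n\nRun the tool.")
assert vsm.has_anchor("/docs/install", "installation")  # anchor exists in source
assert not vsm.has_anchor("/docs/install", "usage")     # would be a broken link
```

Everything the sketch needs comes from bytes on disk: no HTTP fetch, no rendered DOM, no build engine.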
## Rationale
### 1. Pre-Build Error Prevention
A broken link discovered before the build is a developer warning. A broken link discovered after a 10-minute build is a CI failure that blocks the PR queue. Zenzic's position in the pipeline is always before the build — it is the gate that certifies the source is structurally sound before any build resource is consumed.
### 2. Engine Agnosticism by Design
By analyzing source files rather than build output, Zenzic is inherently engine-agnostic. The same `check links` command validates an MkDocs project, a Docusaurus site, and a Zensical wiki — because all three share the same raw Markdown format. Engine-specific URL conventions are encoded in the adapter layer (not in the validator), making the core engine permanently portable.
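The adapter split can be illustrated as follows. The slug rules here are simplified stand-ins, not the real MkDocs or Docusaurus algorithms, and the class names are hypothetical:

```python
# Adapter-layer sketch: engine-specific anchor rules live outside the core
# validator. Slug rules below are simplified stand-ins, not real engine code.
import re

class EngineAdapter:
    def anchor_for(self, heading: str) -> str:
        raise NotImplementedError

class MkDocsAdapter(EngineAdapter):
    def anchor_for(self, heading: str) -> str:
        # stand-in rule: drop punctuation, then hyphenate whitespace
        return re.sub(r"\s+", "-", re.sub(r"[^\w\s-]", "", heading).strip().lower())

class DocusaurusAdapter(EngineAdapter):
    def anchor_for(self, heading: str) -> str:
        # stand-in rule: hyphenate every non-alphanumeric run instead
        return re.sub(r"[^a-z0-9]+", "-", heading.lower()).strip("-")

def check_link(adapter: EngineAdapter, headings: list[str], anchor: str) -> bool:
    """Core logic is engine-neutral: it only asks the adapter for anchors."""
    return anchor in {adapter.anchor_for(h) for h in headings}

# The same heading can yield different anchors under different conventions:
assert MkDocsAdapter().anchor_for("What's New?") == "whats-new"
assert DocusaurusAdapter().anchor_for("What's New?") == "what-s-new"
```

`check_link` never branches on the engine; swapping engines means swapping adapters, which is the portability property the section describes.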
### 3. Deterministic Analysis
Source files are static. A given set of Markdown files produces the same analysis results regardless of which machine runs Zenzic, which Python version is installed, or which timezone the CI runner is in. Build-output validators introduce non-determinism through engine version drift, network-fetched pages, and CDN caching. Zenzic's source-based analysis is a pure function of the repository state — identical input, identical output, always.
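One practical consequence of "pure function of the repository state" is that the scanner must impose its own ordering, since filesystem iteration order is not stable across machines. A minimal sketch (illustrative, not Zenzic's code):

```python
# Determinism sketch: fingerprint a docs tree as a pure function of its
# contents, independent of filesystem enumeration order or machine.
import hashlib
import tempfile
from pathlib import Path

def repo_fingerprint(root: Path) -> str:
    """Hash every Markdown file in sorted, platform-independent order."""
    digest = hashlib.sha256()
    for path in sorted(root.rglob("*.md")):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "index.md").write_text("# Home\n")
    (root / "guide.md").write_text("# Guide\n")
    first = repo_fingerprint(root)
    second = repo_fingerprint(root)
# identical input, identical output: first == second on any machine
```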
### 4. The Ghost Route Capability
The VSM models routes that do not exist as physical files on disk: i18n fallback routes, versioned documentation slugs, and engine-generated index pages. An output-based validator can only test routes that the build produces. Zenzic's VSM models the intent of the documentation architecture, catching structural errors in routes that the author planned but hasn't yet published.
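Ghost-route expansion can be sketched as a pure set transformation. The function name and the locale-prefix convention below are illustrative assumptions, not Zenzic's implementation:

```python
# Ghost-route sketch: expand physical routes with i18n fallback routes that
# the build would generate but that have no file on disk. (Illustrative.)
def expand_ghost_routes(physical: set[str], locales: list[str],
                        default_locale: str = "en") -> set[str]:
    """Every route under the default locale also exists, as a fallback,
    under every other configured locale."""
    routes = set(physical)
    for route in physical:
        for locale in locales:
            if locale != default_locale:
                routes.add(f"/{locale}{route}")
    return routes

site = expand_ghost_routes({"/docs/intro"}, locales=["en", "fr", "de"])
# "/fr/docs/intro" is now a valid link target even though no fr/ file exists
```

A link to a planned-but-unpublished fallback route validates against the modeled intent, exactly the property the section claims for the VSM.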
## Invariants (Non-Negotiable)
- Zenzic's validation logic (`core/validator.py`, `core/scanner.py`) must never issue an HTTP request, launch a browser, or parse HTML. All analysis operates on bytes read from the filesystem.
- The VSM (`models/vsm.py`) is the canonical source of route truth. No validator may compute a route by invoking the build engine — even as a subprocess.
- Adapters may read static configuration files (`.ts`, `.yml`, `.toml`) using pure-Python text parsing. They must not execute those files (see ADR 002).
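The third invariant, parsing a TypeScript config as text rather than executing it, can be sketched like this. The regex-based extraction is a deliberately simplified assumption; a real adapter would need a more robust parse:

```python
# Invariant in practice: read a TypeScript config as plain text, never
# execute it. Simplified sketch -- a real adapter needs sturdier parsing.
import re

CONFIG_TS = """
export default {
  title: 'Zenzic Docs',
  favicon: 'img/favicon.ico',
  themeConfig: { image: 'img/social-card.png' },
};
"""

def referenced_assets(config_text: str) -> set[str]:
    """Collect quoted string values that look like asset paths."""
    pattern = r"['\"]([\w./-]+\.(?:png|ico|svg|jpg))['\"]"
    return {m.group(1) for m in re.finditer(pattern, config_text)}

assets = referenced_assets(CONFIG_TS)
# each entry can now be checked against the filesystem, no Node.js required
```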
## Consequences
- Zenzic's analysis performance is content-dependent. Measured against the real `zenzic-doc` project (59 MDX pages with JSX, frontmatter, and tables): ~420 ms of pure analysis time on a warm Python process. Simple Markdown projects with minimal frontmatter and no JSX can scan 200 files in ~100 ms. End-to-end wall time on a cold `uvx` invocation adds ~2–8 s of Python interpreter startup on top of analysis time. Run `python scripts/benchmark.py --repo <path>` to measure your own project.
- Zenzic can be placed as the first step in any CI pipeline, before `npm install`, before `pip install`, before the build engine is even available.
- Engine-specific quirks (Docusaurus anchor generation, MkDocs nav contracts, Zensical slug conventions) are isolated in the adapter layer. The core engine is permanently engine-neutral.
- The VSM provides a testable, inspectable data structure for documentation architecture — enabling future capabilities like structural diffing, coverage metrics, and ghost route detection without modifying the analysis core.