
Advanced Features

Deep reference for the Three-Pass Pipeline, Zenzic Shield, accessibility checks, and programmatic usage from Python.


Reference integrity (v0.2.0)

zenzic check references runs the Three-Pass Reference Pipeline — the core engine behind every reference-quality and security check Zenzic performs.

Why three passes?

Markdown reference-style links separate where a link points (the definition) from where it appears (the usage). A single-pass scanner cannot resolve a reference that appears before its definition. Zenzic solves this with a deliberate three-pass structure:

| Pass | Name | What happens |
|---|---|---|
| 1 | Harvest | Stream the file line-by-line; record all [id]: url definitions into a ReferenceMap; run the Shield on every URL and line |
| 2 | Cross-Check | Re-stream the file; for every [text][id] usage, look up id in the now-complete ReferenceMap; flag missing IDs as Dangling References |
| 3 | Integrity Report | Compute the integrity score; append Dead Definitions, duplicate-ID warnings, and alt-text warnings to the findings list |

Pass 2 only begins when Pass 1 completes without security findings. If the Shield fires during harvesting, Zenzic exits immediately with code 2 — no reference resolution occurs on files that contain leaked credentials.
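To make the ordering argument concrete, here is a minimal sketch of a harvest-then-resolve scanner. It is not Zenzic's implementation: the regexes are simplified and the function names (`harvest`, `cross_check`) are illustrative assumptions. It shows that a usage appearing before its definition still resolves once the harvest has completed.

```python
import re

# Simplified patterns; real CommonMark reference syntax has more edge cases.
DEFINITION = re.compile(r"^\s{0,3}\[([^\]]+)\]:\s*(\S+)")
USAGE = re.compile(r"\[[^\]]+\]\[([^\]]+)\]")


def harvest(lines):
    """Pass 1: collect every [id]: url definition (the first definition wins)."""
    ref_map = {}
    for line in lines:
        match = DEFINITION.match(line)
        if match:
            ref_map.setdefault(match.group(1).lower(), match.group(2))
    return ref_map


def cross_check(lines, ref_map):
    """Pass 2: return (line number, id) for every usage with no definition."""
    dangling = []
    for lineno, line in enumerate(lines, start=1):
        for ref_id in USAGE.findall(line):
            if ref_id.lower() not in ref_map:
                dangling.append((lineno, ref_id))
    return dangling


doc = [
    "See the [guide][docs] and the [API][missing].",  # usage precedes definition
    "",
    "[docs]: https://example.com/docs",
]
ref_map = harvest(doc)
print(cross_check(doc, ref_map))  # [(1, 'missing')]
```

Only `missing` is reported; `docs` resolves even though its definition sits two lines below the usage.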

What the pipeline catches

| Issue | Type | Blocks exit? |
|---|---|---|
| Dangling Reference: [text][id] where id has no definition | error | Yes |
| Dead Definition: [id]: url defined but never used by any link | warning | No (yes with --strict) |
| Duplicate Definition: same id defined twice; first wins (CommonMark §4.7) | warning | No |
| Missing alt-text: ![](url) or <img> with blank/absent alt | warning | No |
| Secret detected: credential pattern found in a reference URL or line | security | Exit 2 |
| Path traversal: link resolves to an OS system directory | security | Exit 3 |

Reference Integrity Score

Each file receives a per-file score:

Reference Integrity = (resolved definitions / total definitions) × 100

A file where every defined reference is used at least once scores 100. Unused (dead) definitions pull the score down. When a file has no definitions at all, the score is 100 by convention.
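The formula can be sketched as a small function. This assumes that "resolved definitions" means definitions used by at least one link, as the paragraph above describes; the function name and set-based inputs are illustrative, not Zenzic's API.

```python
def integrity_score(defined_ids, used_ids):
    """Return the percentage of defined reference IDs used at least once."""
    defined = set(defined_ids)
    if not defined:
        return 100.0  # no definitions at all: 100 by convention
    resolved = defined & set(used_ids)
    return len(resolved) / len(defined) * 100


# Two of three definitions are used, so the dead one pulls the score down.
print(round(integrity_score({"docs", "api", "old"}, {"docs", "api"}), 1))
print(integrity_score(set(), {"anything"}))
```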

The integrity score is a per-file diagnostic: it does not feed into the overall quality score reported by zenzic score. Use it to identify files that accumulate unused reference-link boilerplate.


Zenzic Shield

The Shield runs inside Pass 1 — every URL extracted from a reference definition is scanned the moment the harvester encounters it, before any other processing continues. The Shield also applies a defence-in-depth pass to non-definition lines to catch secrets in plain prose.

Detected credential patterns

| Pattern name | Regex | What it catches |
|---|---|---|
| openai-api-key | sk-[a-zA-Z0-9]{48} | OpenAI API keys |
| github-token | gh[pousr]_[a-zA-Z0-9]{36} | GitHub personal/OAuth tokens |
| aws-access-key | AKIA[0-9A-Z]{16} | AWS IAM access key IDs |
| stripe-live-key | sk_live_[0-9a-zA-Z]{24} | Stripe live secret keys |
| slack-token | xox[baprs]-[0-9a-zA-Z]{10,48} | Slack bot/user/app tokens |
| google-api-key | AIza[0-9A-Za-z\-_]{35} | Google Cloud / Maps API keys |
| private-key | -----BEGIN [A-Z ]+ PRIVATE KEY----- | PEM private keys (RSA, EC, etc.) |
| hex-encoded-payload | (?:\\x[0-9a-fA-F]{2}){3,} | Detects obfuscation attempts that hide payloads or credentials via hex escapes. This technique is commonly used to evade naive string linters and is treated as a critical source-transparency violation. |
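As a rough illustration of how such a pattern table can be applied, the sketch below compiles the regexes listed above and reports which pattern names match a line. The `scan_line` helper is a hypothetical name, and the matching strategy (a plain `search` per pattern) is an assumption rather than a description of the Shield's internals.

```python
import re

# Regexes copied from the table above; the scanning helper is illustrative.
PATTERNS = {
    "openai-api-key": re.compile(r"sk-[a-zA-Z0-9]{48}"),
    "github-token": re.compile(r"gh[pousr]_[a-zA-Z0-9]{36}"),
    "aws-access-key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "stripe-live-key": re.compile(r"sk_live_[0-9a-zA-Z]{24}"),
    "slack-token": re.compile(r"xox[baprs]-[0-9a-zA-Z]{10,48}"),
    "google-api-key": re.compile(r"AIza[0-9A-Za-z\-_]{35}"),
    "private-key": re.compile(r"-----BEGIN [A-Z ]+ PRIVATE KEY-----"),
    "hex-encoded-payload": re.compile(r"(?:\\x[0-9a-fA-F]{2}){3,}"),
}


def scan_line(line):
    """Return the names of all patterns that match the given line."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(line)]


# Fake values are assembled at runtime so no secret-shaped literal is committed.
fake_aws = "AKIA" + "A" * 16
fake_hex = "\\x41" * 3
print(scan_line(f"[cfg]: https://example.com/?key={fake_aws}"))  # ['aws-access-key']
print(scan_line(f"payload = {fake_hex}"))  # ['hex-encoded-payload']
print(scan_line("An ordinary sentence."))  # []
```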

Shield behaviour

  • Every line is scanned — including lines inside fenced code blocks (labelled or unlabelled). A credential committed in a bash example is still a committed credential.
  • Detection is non-suppressible: --exit-zero, exit_zero = true in zenzic.toml, and --strict have no effect on Shield findings.
  • Exit code 2 is reserved exclusively for Shield events. It is never used for ordinary check failures.
  • Exit code 3 is reserved for Blood Sentinel events — links that resolve to OS system directories. Like exit code 2, it is never suppressed.
  • Files with security findings are excluded from link validation — Zenzic does not ping URLs that may contain leaked credentials.
  • Code block link isolation — while the Shield scans inside fenced blocks, the link and reference validators do not. Example URLs inside code blocks (e.g. https://api.example.com) never produce false-positive link errors.
If you receive exit code 2

Treat it as a build-blocking security incident. Rotate the exposed credential immediately, then remove or replace the offending reference URL. Do not commit the secret into history.

See the Shield in action

The repository ships examples/safety_demonstration.md — an intentional test fixture containing a circular link and a hex-encoded payload. Run zenzic check all against it to observe a live Shield breach and a CIRCULAR_LINK info finding.


Hybrid scanning logic

Zenzic applies different scanning rules to prose and code blocks because the two contexts have different risk profiles:

| Content location | Shield (secrets) | Snippet syntax | Link / ref validation |
|---|---|---|---|
| Prose and reference definitions | Yes | n/a | Yes |
| Fenced block: supported language (python, yaml, json, toml) | Yes | Yes | No |
| Fenced block: unsupported language (bash, javascript, …) | Yes | No | No |
| Fenced block: unlabelled (```) | Yes | No | No |

Why links are excluded from fenced blocks: documentation examples routinely contain illustrative URLs (https://api.example.com/v1/users) that do not exist as real endpoints. Checking them would produce hundreds of false positives with no security value.

Why secrets are included everywhere: a credential embedded in a bash example is still a committed secret. It lives in git history, is indexed by code-search tools, and can be extracted by automated scanners that do not respect Markdown formatting.

Why syntax checking is limited to known parsers: validating Bash or JavaScript would require third-party parsers or subprocesses, violating the No-Subprocess Pillar. Zenzic validates what it can validate purely in Python.
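The three rules above can be condensed into a small policy lookup. This is a sketch under stated assumptions: the `checks_for` function and its return shape are hypothetical, and only the language list and the three rules come from the text.

```python
# Languages the text lists as supported for snippet syntax checking.
SUPPORTED_SNIPPET_LANGUAGES = {"python", "yaml", "json", "toml"}


def checks_for(location, language=None):
    """Return which checks apply to a line.

    location is "prose" or "fenced"; language is the fence label, if any.
    """
    in_fence = location == "fenced"
    return {
        "shield": True,  # secrets are scanned everywhere
        "snippet_syntax": in_fence and language in SUPPORTED_SNIPPET_LANGUAGES,
        "link_validation": not in_fence,  # example URLs in code are not checked
    }


print(checks_for("prose"))
print(checks_for("fenced", "python"))
print(checks_for("fenced", "bash"))
print(checks_for("fenced"))  # unlabelled fence
```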


Alt-text accessibility

zenzic check references also flags images that lack meaningful alt text:

  • Markdown inline images: ![](url) or ![ ](url) (blank alt string)
  • HTML <img> tags: <img src="..."> with no alt attribute, or with whitespace-only alt text

An explicitly empty alt="" is treated as intentionally decorative and is not flagged. A completely absent alt attribute, or whitespace-only alt text, is flagged as a warning.

Alt-text findings are warnings — they appear in the report but do not affect the exit code unless --strict is active.
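A minimal sketch of these rules, using only the standard library, might look like the following. It follows the decorative-image rule stated above (an explicit alt="" is not flagged); the class and function names are illustrative assumptions and the Markdown regex is deliberately simplified.

```python
import re
from html.parser import HTMLParser

# Markdown inline image whose alt text is empty or whitespace-only.
BLANK_MD_IMAGE = re.compile(r"!\[\s*\]\([^)]*\)")


class ImgAltChecker(HTMLParser):
    """Collect <img> tags whose alt attribute is absent or whitespace-only."""

    def __init__(self):
        super().__init__()
        self.flagged = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        if "alt" not in attr_map:
            self.flagged.append(("missing", attr_map.get("src")))
        elif attr_map["alt"] and not attr_map["alt"].strip():
            self.flagged.append(("blank", attr_map.get("src")))
        # alt="" falls through: treated as intentionally decorative


def missing_alt(text):
    """Return Markdown and HTML image findings for a chunk of text."""
    findings = [("markdown", m.group(0)) for m in BLANK_MD_IMAGE.finditer(text)]
    checker = ImgAltChecker()
    checker.feed(text)
    findings.extend(checker.flagged)
    return findings


sample = (
    '![](a.png) ![Logo](b.png) <img src="c.png"> '
    '<img src="d.png" alt=""> <img src="e.png" alt="  ">'
)
for finding in missing_alt(sample):
    print(finding)
```

Only a.png, c.png, and e.png are reported; b.png has alt text and d.png is explicitly decorative.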


Programmatic usage

Import Zenzic's scanner functions directly into your own Python tooling.

Single-file scan

Use ReferenceScanner to run the three-pass pipeline on one file:

from pathlib import Path
from zenzic.core.scanner import ReferenceScanner

scanner = ReferenceScanner(Path("docs/guide.md"))

# Pass 1 — harvest definitions; collect Shield findings
security_findings = []
for lineno, event_type, data in scanner.harvest():
    if event_type == "SECRET":
        security_findings.append(data)
        # In production: raise SystemExit(2) or typer.Exit(2) here

# Pass 2 — resolve reference links (must be after harvest)
cross_check_findings = scanner.cross_check()

# Pass 3 — compute integrity score and consolidate all findings
report = scanner.get_integrity_report(cross_check_findings, security_findings)

print(f"Integrity score: {report.score:.1f}")
for f in report.findings:
    level = "WARN" if f.is_warning else "ERROR"
    print(f" [{level}] {f.file_path}:{f.line_no} - {f.detail}")

Multi-file scan

Use scan_docs_references to scan every .md file in a repository and optionally validate external URLs:

from pathlib import Path
from zenzic.core.scanner import scan_docs_references
from zenzic.models.config import ZenzicConfig

config, _ = ZenzicConfig.load(Path("."))

reports, link_errors = scan_docs_references(
    Path("."),
    config,
    validate_links=True,  # set False to skip HTTP validation
)

for report in reports:
    if report.security_findings:
        raise SystemExit(2)  # your code is responsible for exit-code enforcement
    for finding in report.findings:
        print(finding)

for error in link_errors:
    print(f"[LINK] {error}")

scan_docs_references deduplicates external URLs across the entire docs tree before firing HTTP requests — 50 files linking to the same URL result in exactly one HEAD request.
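The deduplication idea can be sketched without any network access by injecting the checker. The `validate_unique` helper and the fake checker are hypothetical, and this is not how scan_docs_references is implemented; it only illustrates checking each distinct URL once and fanning the result back out to every file.

```python
def validate_unique(urls_by_file, check):
    """Check each distinct URL once; return the broken URLs for each file."""
    unique_urls = {url for urls in urls_by_file.values() for url in urls}
    status = {url: check(url) for url in sorted(unique_urls)}
    return {
        path: [url for url in urls if not status[url]]
        for path, urls in urls_by_file.items()
    }


calls = []


def fake_check(url):
    """Stand-in for an HTTP request: records the call, fails URLs ending in /gone."""
    calls.append(url)
    return not url.endswith("/gone")


urls_by_file = {
    "a.md": ["https://example.com/ok", "https://example.com/gone"],
    "b.md": ["https://example.com/ok"],
}
broken = validate_unique(urls_by_file, fake_check)
print(len(calls))  # 2: one check per distinct URL, not per occurrence
print(broken)
```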

Hybrid Adaptive Engine — v0.5.0a1

scan_docs_references is the single unified entry point for all scan modes. It selects sequential or parallel execution automatically based on the number of files in the repository:

| Repo size | Engine behaviour | Reason |
|---|---|---|
| < 50 files | Sequential (always) | Process-spawn overhead (~200–400 ms) exceeds the parallelism benefit |
| ≥ 50 files, workers=1 | Sequential | Explicit serial override |
| ≥ 50 files, workers=None or workers=N | Parallel (ProcessPoolExecutor) | CPU-bound regex work dominates; linear scaling |
| 5 000+ files | Parallel with workers=cpu_count | Proven 3–6× speedup on 8-core runners |

The 50-file threshold (ADAPTIVE_PARALLEL_THRESHOLD) is the conservative break-even point where parallelism pays for its own startup cost.
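The selection rule in the table reduces to a short decision function. The sketch below is an assumption about the shape of that logic, with a hypothetical `select_mode` name; only the threshold value and the table rows come from the text.

```python
ADAPTIVE_PARALLEL_THRESHOLD = 50  # value stated in the text above


def select_mode(file_count, workers):
    """Return "sequential" or "parallel" following the table above.

    workers is 1 (serial override), an integer N, or None (automatic).
    """
    if file_count < ADAPTIVE_PARALLEL_THRESHOLD or workers == 1:
        return "sequential"
    return "parallel"


print(select_mode(10, 8))       # sequential: below the threshold
print(select_mode(50, 1))       # sequential: explicit serial override
print(select_mode(50, None))    # parallel
print(select_mode(5000, 4))     # parallel
```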

from pathlib import Path
from zenzic.core.scanner import scan_docs_references

# Default: sequential (workers=1, zero overhead)
reports, _ = scan_docs_references(Path("."))

# Explicit parallel: 4 workers, auto-activates only if ≥ 50 files
reports, _ = scan_docs_references(Path("."), workers=4)

# Fully automatic: ProcessPoolExecutor picks worker count from os.cpu_count()
reports, _ = scan_docs_references(Path("."), workers=None)

# With external link validation (works in both sequential and parallel mode)
reports, link_errors = scan_docs_references(Path("."), validate_links=True, workers=None)

Determinism guarantee: results are always sorted by file_path regardless of execution mode. The same input always produces the same ordered output.

Pickling contract for plugin rules (BaseRule subclasses):

Rules are validated for pickle-serializability at engine construction time (eager validation). A non-serialisable rule raises PluginContractError immediately — before any file is scanned.

  • Rules must be defined at module level. A class defined inside a function or lambda cannot be pickled and will be rejected at load time.
  • All instance attributes must be pickleable. Pre-compiled re.compile() patterns, strings, and numbers are always safe. File handles, database connections, and lambda closures are not.
  • No mutable global state. Workers receive independent copies of the rule engine (via pickle). A global counter mutated inside check() will be local to each worker process and discarded on completion — results will differ from sequential mode silently. Return all state as RuleFinding objects.
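To check a rule against this contract before handing it to the engine, a quick pickle probe is often enough. The sketch below uses a stand-in class rather than a real BaseRule subclass (its constructor and check() signature are not shown here), and `is_picklable` is a hypothetical helper, not part of Zenzic.

```python
import pickle
import re


class NoTodoRule:
    """Hypothetical rule: defined at module level, holds only picklable state."""

    def __init__(self):
        self.pattern = re.compile(r"\bTODO\b")  # compiled patterns pickle fine

    def check(self, line):
        return bool(self.pattern.search(line))


def is_picklable(obj):
    """Return True if obj survives pickle.dumps, mirroring an eager contract check."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


print(is_picklable(NoTodoRule()))
print(is_picklable(re.compile(r"\bTODO\b")))      # True
print(is_picklable(lambda line: "TODO" in line))  # False
```

Storing a lambda or an open file handle on the rule instance would make the first probe fail in the same way as the last one.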

See Writing Plugin Rules for the complete contract, examples, and packaging instructions.


Fenced-code and frontmatter exclusion

The harvester and cross-checker both skip content that should never trigger findings:

  • YAML frontmatter: a --- block that opens on the first line of the file is skipped in its entirety, including any reference-like syntax it might contain.
  • Fenced code blocks: lines inside ``` or ~~~ fences are ignored by reference harvesting and resolution, so URLs in code examples never produce false positives. The Shield still scans these lines for secrets.

This exclusion is applied consistently in both Pass 1 and Pass 2.
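One way to apply the same exclusion in both passes is to route them through a shared line filter. The generator below is a simplified sketch, not Zenzic's code: `scannable_lines` is a hypothetical name, and the fence handling ignores fence length and nesting subtleties from CommonMark.

```python
import re

# An opening or closing fence: up to three spaces, then 3+ backticks or tildes.
FENCE = re.compile(r"^\s{0,3}(`{3,}|~{3,})")


def scannable_lines(lines):
    """Yield (lineno, line) for lines outside frontmatter and fenced blocks."""
    in_frontmatter = False
    fence_char = None
    for lineno, line in enumerate(lines, start=1):
        stripped = line.strip()
        if lineno == 1 and stripped == "---":
            in_frontmatter = True  # frontmatter only counts on the first line
            continue
        if in_frontmatter:
            if stripped == "---":
                in_frontmatter = False
            continue
        match = FENCE.match(line)
        if match:
            char = match.group(1)[0]
            if fence_char is None:
                fence_char = char  # opening fence
            elif char == fence_char:
                fence_char = None  # matching closing fence
            continue
        if fence_char is None:
            yield lineno, line


fence = "`" * 3  # built at runtime to keep this example easy to embed
doc = [
    "---",
    "title: '[x]: y'",
    "---",
    "Intro [a][b]",
    fence + "bash",
    "[c]: https://example.com",
    fence,
    "[b]: https://example.com/b",
]
print(list(scannable_lines(doc)))
```

Only lines 4 and 8 are yielded, so the definition-like text in the frontmatter and in the bash block never reaches the reference passes.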


Zenzic does not only check whether a linked file exists on disk — it checks whether that file is reachable through the site navigation. This catches an entire class of navigation defects that file-existence checks miss entirely.

Dark pages

A dark page is a file that exists on disk and is physically served by the engine at its URL — but is missing from the site navigation. The link works. The page loads. The user who follows it arrives successfully. And then they are lost: no breadcrumb, no menu entry, no way back through the navigation tree.

Dark pages are invisible to users browsing your site. They are the documentation equivalent of a room with no door — the room exists, but no one can find it without already knowing where it is.

Zenzic flags links to dark pages as UNREACHABLE_LINK. This is not a broken link. It is a navigation defect: the link is syntactically correct, the file resolves, but the destination is unreachable through normal browsing.

How it works

When a build-engine config (mkdocs.yml) is present, Zenzic constructs a Virtual Site Map (VSM) before running link validation. The VSM maps every .md source file to:

  • its canonical URL (e.g. docs/guide/installation.md → /guide/installation/)
  • its routing status — one of REACHABLE, ORPHAN_BUT_EXISTING, IGNORED, or CONFLICT

A file is REACHABLE if it appears in the nav: section of mkdocs.yml. A file is ORPHAN_BUT_EXISTING if it lives on disk but has no nav entry — the engine copies it to site/ and serves it, but no user can find it through navigation.

When a link resolves to a dark page (ORPHAN_BUT_EXISTING or IGNORED) in the VSM, Zenzic emits:

[UNREACHABLE_LINK] index.md:22 — 'guide/secret.md' resolves to '/guide/secret/'
which exists on disk but is not listed in the site navigation (UNREACHABLE_LINK)
— add it to nav in mkdocs.yml or remove the link
│ - [Secret page](guide/secret.md)

The Visual Snippet (the line prefixed with │) shows the exact source line so you can locate and fix the link without searching through the file.

Routing collision (CONFLICT)

Two source files that map to the same canonical URL produce a CONFLICT in the VSM. The most common case is the Double Index: index.md and README.md coexisting in the same directory. Both produce the same URL (/dir/) — the build engine's behaviour is undefined. Zenzic detects this before the build runs.
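The mapping and the collision check can be sketched together. This toy version is an assumption-laden illustration, not the real VSM: it takes paths relative to the docs directory, treats the nav as a plain set of source paths, uses a hypothetical `build_site_map` function, and leaves out the IGNORED status.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def canonical_url(source):
    """Map a docs-relative .md path to a directory-style URL."""
    path = PurePosixPath(source)
    if path.stem.lower() in ("index", "readme"):
        parts = path.parent.parts  # index pages take the directory URL
    else:
        parts = path.with_suffix("").parts
    return "/" + "".join(f"{part}/" for part in parts)


def build_site_map(sources, nav):
    """Return {source: (url, status)} using the statuses named above."""
    by_url = defaultdict(list)
    for source in sources:
        by_url[canonical_url(source)].append(source)

    site_map = {}
    for url, files in by_url.items():
        for source in files:
            if len(files) > 1:
                status = "CONFLICT"  # two sources claim the same URL
            elif source in nav:
                status = "REACHABLE"
            else:
                status = "ORPHAN_BUT_EXISTING"
            site_map[source] = (url, status)
    return site_map


sources = [
    "index.md",
    "guide/installation.md",
    "guide/secret.md",
    "api/index.md",
    "api/README.md",
]
nav = {"index.md", "guide/installation.md"}
site_map = build_site_map(sources, nav)
for source, (url, status) in sorted(site_map.items()):
    print(f"{source:24} {url:22} {status}")
```

Here guide/secret.md comes out as ORPHAN_BUT_EXISTING, and the api/index.md and api/README.md pair is reported as a CONFLICT on /api/.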

Engine behaviour

| Adapter | UNREACHABLE_LINK? | Trigger |
|---|---|---|
| MkDocs (with mkdocs.yml + nav:) | Yes | File not listed in nav: (ORPHAN_BUT_EXISTING) |
| MkDocs (no nav: declared) | No | All files auto-included by MkDocs |
| Zensical | Yes | File or directory starting with _ (IGNORED) |
| Vanilla (no engine config) | No | No routing concept |

Fix an UNREACHABLE_LINK

Either add the target page to nav: in mkdocs.yml, or replace the link with one pointing to a reachable page.

Private pages (Zensical)

Files and directories whose name starts with an underscore (_) are treated as private by Zenzic when the Zensical engine is active. Links to these resources are flagged as UNREACHABLE_LINK because Zensical never serves _-prefixed paths to the public.

docs/
├── index.md
├── features.md
└── _private/        ← Zensical ignores this directory entirely
    └── notes.md     ← links to this file → UNREACHABLE_LINK

[UNREACHABLE_LINK] index.md:8 — '_private/notes.md' resolves to '/_private/notes/'
which exists on disk but is not listed in the site navigation (UNREACHABLE_LINK) —
add it to nav in mkdocs.yml or remove the link
│ - [Private Notes](_private/notes.md)

This rule applies to any path segment starting with _:

| Path | Status |
|---|---|
| _private/notes.md | IGNORED → UNREACHABLE_LINK |
| _drafts/wip.md | IGNORED → UNREACHABLE_LINK |
| public/page.md | REACHABLE (served normally) |

MkDocs does not have this rule

MkDocs does not treat underscore-prefixed directories as private. Only Zensical enforces the _-prefix convention. When switching engines, audit any _-prefixed directories in your docs tree.
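For an audit of this kind, the "any path segment starting with _" rule is easy to approximate. The helper below is a hypothetical sketch, not a Zenzic function; it assumes docs-relative POSIX-style paths.

```python
from pathlib import PurePosixPath


def is_private(path):
    """Return True if any segment of the path starts with an underscore."""
    return any(part.startswith("_") for part in PurePosixPath(path).parts)


for candidate in ("_private/notes.md", "_drafts/wip.md", "public/page.md",
                  "public/_drafts/wip.md"):
    print(candidate, is_private(candidate))
```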


Multi-language documentation

When your project uses MkDocs i18n or Zensical's locale system, Zenzic adapts automatically:

  • Locale directories suppressed from orphan detection — files under docs/it/, docs/fr/, etc. are not reported as orphans. The adapter detects locale directories from the engine's i18n configuration.
  • Cross-locale link resolution — the engine adapters resolve links that cross locale boundaries (e.g. a link from docs/it/page.md to docs/en/page.md) without false positives.
  • Vanilla mode skips orphan check entirely — when no build-engine config is present, every file would appear as an orphan. Zenzic skips the check rather than report noise.
Force Vanilla mode to suppress orphan check
zenzic check all --engine vanilla