Concept: pipeline-first documentation lifecycle

Pipeline-first documentation is a methodology where automated systems generate, validate, and publish API documentation from source contracts. Humans review and approve outputs instead of writing from scratch, reducing review cycles from 5+ rounds to 1-2 rounds.

The problem it solves

Without pipeline-first documentation, engineering teams face three critical challenges:

Documentation drift: API code changes ship to production while docs remain stuck on the previous version. The average drift window is 2-4 weeks, during which users encounter incorrect information.
Inconsistent quality: Documentation quality depends on individual writers. One team produces excellent guides while another produces minimal stubs. No enforced quality bar exists across the organization.
Multi-protocol coverage gaps: Teams that support REST, GraphQL, gRPC, AsyncAPI, and WebSocket must maintain five separate documentation sets. Without automation, at least two protocols fall behind on every release.

Traditional approaches like wiki-based documentation or manual Markdown editing fail because they rely on human memory to trigger updates. When the OpenAPI spec changes, nobody remembers to update the corresponding tutorial.

How the pipeline works

The VeriDoc pipeline follows an eight-stage execution order. Each stage reads the output of the previous stage and produces artifacts for the next.
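As an illustration of the stage chaining described above, the following minimal Python sketch runs eight stages in order and passes each stage's output to the next. The function names (`run_pipeline`, `passthrough`), the stage identifiers, and the dictionary-based artifact shape are assumptions made for this example, not VeriDoc's actual interface.

```python
from typing import Callable, Dict, List

# Stage identifiers are hypothetical; they mirror the eight-stage order.
STAGE_ORDER: List[str] = [
    "ingest", "lint", "regression", "generate",
    "quality_gate", "test_assets", "rag_optimize", "publish",
]

Stage = Callable[[Dict], Dict]


def run_pipeline(stages: Dict[str, Stage], initial: Dict) -> Dict:
    """Run every stage in order, feeding each one the previous stage's output."""
    artifacts = dict(initial)
    completed: List[str] = []
    for name in STAGE_ORDER:
        artifacts = stages[name](artifacts)
        completed.append(name)
    return {**artifacts, "completed": completed}


def passthrough(artifacts: Dict) -> Dict:
    """Placeholder stage that returns a shallow copy of its input."""
    return dict(artifacts)


# Usage: wire every stage to the placeholder and run the chain.
stages = {name: passthrough for name in STAGE_ORDER}
result = run_pipeline(stages, {"contracts": ["api/openapi.yaml"]})
```

A real stage would raise an exception or return a failure marker to stop the chain; this sketch omits error handling to keep the ordering visible.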

Stage 1: ingest

Read source contracts from the repository:

Contract type	Format	Example path
REST	OpenAPI 3.0 YAML	`api/openapi.yaml`
GraphQL	SDL schema	`contracts/graphql.schema.graphql`
gRPC	Proto3 definition	`contracts/grpc/veriops.proto`
AsyncAPI	AsyncAPI 2.6.0 YAML	`contracts/asyncapi.yaml`
WebSocket	WebSocket contract YAML	`contracts/websocket.yaml`

Stage 2: lint

Validate each contract against protocol-specific rules. REST uses Spectral with 18 rules, GraphQL uses schema validation, gRPC uses protoc compilation, AsyncAPI uses the AsyncAPI parser, and WebSocket uses custom schema validation.

Stage 3: regression

Compare the current contract against the previous snapshot to detect breaking changes. Breaking changes (removed endpoints, renamed fields, changed types) trigger warnings in the review manifest.
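The comparison can be sketched as follows. This example assumes a deliberately simplified contract shape (a mapping from endpoint to field names and types) and a hypothetical function name, `find_breaking_changes`; real contracts such as OpenAPI or Proto3 files need a protocol-specific parser first.

```python
from typing import Dict, List

# Hypothetical simplified contract shape: {endpoint: {field_name: type_name}}
Contract = Dict[str, Dict[str, str]]


def find_breaking_changes(previous: Contract, current: Contract) -> List[str]:
    """Return one warning per removed endpoint, missing field, or changed type."""
    warnings: List[str] = []
    for endpoint, old_fields in sorted(previous.items()):
        if endpoint not in current:
            warnings.append(f"removed endpoint: {endpoint}")
            continue
        new_fields = current[endpoint]
        for field, old_type in sorted(old_fields.items()):
            if field not in new_fields:
                # A rename is indistinguishable from a removal for existing clients.
                warnings.append(f"removed or renamed field: {endpoint}.{field}")
            elif new_fields[field] != old_type:
                warnings.append(
                    f"changed type: {endpoint}.{field} {old_type} -> {new_fields[field]}"
                )
    return warnings


previous = {
    "GET /users": {"id": "integer", "name": "string"},
    "GET /orders": {"id": "integer"},
}
current = {"GET /users": {"id": "string", "full_name": "string"}}
warnings = find_breaking_changes(previous, current)
```

Added endpoints and added fields are ignored on purpose, because they do not break existing clients in this simplified model.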

Stage 4: generate

Produce reference documentation from validated contracts. Each protocol generates endpoint tables, payload schemas, code examples, and interactive testers.

Stage 5: quality gate

Run 32 automated checks on every generated page:

Category	Check count	What they verify
GEO checks	8	LLM and AI search optimization: meta descriptions, first paragraph length, heading hierarchy, fact density
SEO checks	14	Traditional search optimization: title length, URL depth, internal links, structured data
Style checks	6	American English, active voice, no weasel words, no contractions, second person, present tense
Contract checks	4	Schema validation, regression detection, snippet lint, self-verification against endpoints

Stage 6: test assets

Generate API test cases for integration testing frameworks. The pipeline produces test cases in three formats:

Format	Output path	Purpose
JSON (generic)	`reports/api_test_cases.json`	Framework-agnostic test definitions
CSV (TestRail)	`reports/testrail_test_cases.csv`	Import into TestRail test management
JSON (Zephyr)	`reports/zephyr_test_cases.json`	Import into Zephyr Scale for Jira
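To illustrate the idea of deriving test cases from a contract and exporting them in more than one format, here is a minimal sketch. The field names (`title`, `method`, `path`, `expected_status`) and the helper functions are hypothetical; the actual TestRail and Zephyr import schemas are not described in this document and are not reproduced here.

```python
import csv
import io
import json
from typing import Dict, List

# Hypothetical column set, chosen for this example only.
FIELDS = ["title", "method", "path", "expected_status"]


def build_test_cases(endpoints: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Create one test case per endpoint with a readable title."""
    return [
        {
            "title": f"{e['method']} {e['path']} returns {e['expected_status']}",
            "method": e["method"],
            "path": e["path"],
            "expected_status": e["expected_status"],
        }
        for e in endpoints
    ]


def to_json(cases: List[Dict[str, str]]) -> str:
    """Serialize test cases as framework-agnostic JSON."""
    return json.dumps(cases, indent=2)


def to_csv(cases: List[Dict[str, str]]) -> str:
    """Serialize test cases as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(cases)
    return buffer.getvalue()


cases = build_test_cases(
    [{"method": "GET", "path": "/users", "expected_status": "200"}]
)
```

Keeping one in-memory representation and several serializers means a new target format only needs a new serializer function.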

Stage 7: RAG optimize

Build a knowledge retrieval index, FAISS vector store, and knowledge graph for AI-powered search. The first table lists the artifacts this stage produces; the second lists the six advanced retrieval features available:

Artifact	Description	Metrics (VeriOps demo)
Knowledge modules	Auto-extracted topic chunks	124 modules
Knowledge graph	Node and edge relationships	957 nodes, 817 edges
Retrieval index	Search-optimized vector index	Precision@3: 0.58, Recall@3: 0.93
FAISS index	`text-embedding-3-small` embeddings	Cosine similarity search

Advanced feature	Description
Token-aware chunking	Splits modules into 750-token chunks with 100-token overlap
Hybrid search (RRF)	Fuses semantic and token-overlap rankings (k=60)
HyDE query expansion	Generates hypothetical passage before embedding
Cross-encoder reranking	Rescores top 20 candidates with `ms-marco-MiniLM-L-6-v2`
Embedding cache	In-memory LRU cache (TTL: 3,600 seconds, max: 512 entries)
Multi-mode evaluation	Compares token, semantic, hybrid, and hybrid+rerank modes
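Two of the features above, overlapping chunking and reciprocal rank fusion (RRF), can be sketched in a few lines. The sketch assumes whitespace-separated words as a stand-in for real tokenizer output, and the function names are hypothetical. The sizes (750, 100) and the RRF constant (k=60) come from the table; the RRF score of a document is the sum of `1 / (k + rank)` over every ranking that contains it.

```python
from typing import Dict, List, Sequence


def chunk_tokens(tokens: Sequence[str], size: int = 750,
                 overlap: int = 100) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reached the end
    return chunks


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists; ranks are 1-based, ties break by ID."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Usage: 1,500 placeholder tokens and two small rankings.
tokens = [f"t{i}" for i in range(1500)]
chunks = chunk_tokens(tokens)
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "a"]])
```

With these inputs, `chunks` holds three windows and `fused` places `b` first because it ranks near the top of both lists.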

Stage 8: publish

Copy verified artifacts to the documentation site. Only artifacts that pass all quality gates reach the publish stage.

Quality gate breakdown

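The pipeline enforces 32 automated checks before any document reaches production. The 8 GEO checks and 14 SEO checks are detailed below; the 6 style checks and 4 contract checks are summarized in Stage 5.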

Check ID	Rule	Severity	Threshold
GEO-1	Meta description present	Error	Must exist
GEO-1b	Meta description length (minimum)	Warning	50 characters minimum
GEO-1c	Meta description length (maximum)	Warning	160 characters maximum
GEO-2	First paragraph length	Warning	60 words maximum
GEO-3	First paragraph definition pattern	Suggestion	Contains "is," "enables," "provides," or "allows"
GEO-4	Heading specificity	Warning	No generic headings (overview, setup, configuration)
GEO-5	Heading hierarchy	Error	No skipped levels (H2 to H4 is invalid)
GEO-6	Fact density	Warning	At least one fact per 200 words
SEO-01	Title length	Error/Warning	10-70 characters
SEO-02	Title keyword match	Suggestion	50% overlap with filename keywords
SEO-03	URL depth	Warning	Max 4 directory levels
SEO-04	URL naming	Warning	Kebab-case only
SEO-05	Image alt text	Warning	100% of images must have alt text
SEO-06	Internal links	Suggestion	At least 1 per page
SEO-07	Bare URLs	Warning	All URLs must use `[text](url)` format
SEO-08	Path special characters	Warning	Alphanumeric and hyphens only
SEO-09	Line length	Warning	Max 120 characters outside code blocks
SEO-10	Heading keyword overlap	Suggestion	H2 headings share keywords with title
SEO-11	Freshness signal	Suggestion	`last_reviewed` or `date` in frontmatter
SEO-12	Content depth	Warning	Minimum 100 words
SEO-13	Duplicate headings	Warning	No two headings share the same text
SEO-14	Structured data	Suggestion	At least 1 table, code block, or list
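A handful of these rules can be expressed as simple predicates. The sketch below covers GEO-1, GEO-1b, GEO-1c, GEO-2, GEO-5, and SEO-12 using the thresholds and severities from the table. The function name `run_checks`, the input shape, and the whitespace-based word count are assumptions for illustration, not the pipeline's actual checker.

```python
from typing import List, Optional, Tuple

Finding = Tuple[str, str]  # (check ID, severity)


def run_checks(meta_description: Optional[str], first_paragraph: str,
               heading_levels: List[int], body: str) -> List[Finding]:
    """Return the checks that a page fails, in table order."""
    findings: List[Finding] = []
    if not meta_description:
        findings.append(("GEO-1", "error"))
    else:
        if len(meta_description) < 50:
            findings.append(("GEO-1b", "warning"))
        if len(meta_description) > 160:
            findings.append(("GEO-1c", "warning"))
    if len(first_paragraph.split()) > 60:
        findings.append(("GEO-2", "warning"))
    # GEO-5: a heading may be at most one level deeper than the previous one.
    for previous, current in zip(heading_levels, heading_levels[1:]):
        if current > previous + 1:
            findings.append(("GEO-5", "error"))
            break
    if len(body.split()) < 100:
        findings.append(("SEO-12", "warning"))
    return findings


# Usage: a page with no meta description, an H2-to-H4 jump, and a thin body.
failures = run_checks(None, "short", [2, 4], "too short")
```

A gate built on this output would block publishing on any `error` finding and surface `warning` findings in the review manifest; that policy is an assumption here, since the document does not state how severities map to outcomes.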

Key benefits

Zero-drift guarantee

Documentation updates when contracts change, not weeks later. The pipeline detects drift by comparing the current contract hash against the last published snapshot. When drift is detected, the pipeline regenerates the affected pages automatically.
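Hash-based drift detection can be sketched as follows. The choice of SHA-256, the function names, and the in-memory dictionaries are assumptions for this example; the document does not specify which hash function or snapshot storage the pipeline uses.

```python
import hashlib
from typing import Dict, List


def contract_hash(content: bytes) -> str:
    """Hash contract bytes; SHA-256 is an assumed choice."""
    return hashlib.sha256(content).hexdigest()


def find_drifted(current: Dict[str, bytes],
                 published_hashes: Dict[str, str]) -> List[str]:
    """Return paths whose content hash differs from the published snapshot.

    A contract with no published hash counts as drifted, because its
    pages were never generated.
    """
    return sorted(
        path for path, content in current.items()
        if published_hashes.get(path) != contract_hash(content)
    )


published = {
    "api/openapi.yaml": contract_hash(b"v1"),
    "contracts/asyncapi.yaml": contract_hash(b"v1"),
}
current = {
    "api/openapi.yaml": b"v2",            # changed since last publish
    "contracts/asyncapi.yaml": b"v1",     # unchanged
    "contracts/websocket.yaml": b"v1",    # new, never published
}
drifted = find_drifted(current, published)
```

Only the pages derived from the paths in `drifted` need regeneration, which keeps each run proportional to what changed.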

Protocol parity

REST, GraphQL, gRPC, AsyncAPI, and WebSocket documentation follow the same quality bar. All five protocols pass through identical pipeline stages with protocol-specific validation at each stage.

Operator review checkpoint

The pipeline generates a review manifest before publish. The manifest lists all artifacts with their availability status and provides an approval checklist. Operators approve or reject the entire batch instead of reviewing individual pages.
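A review manifest of this kind might look like the following sketch. The manifest keys (`artifacts`, `available`, `ready_for_approval`, `approved`) and the function name are hypothetical; the document does not define the real manifest schema.

```python
from pathlib import Path
from typing import Dict, List


def build_manifest(artifact_paths: List[str], root: Path) -> Dict:
    """List each expected artifact and whether it exists under `root`."""
    artifacts = [
        {"path": path, "available": (root / path).is_file()}
        for path in artifact_paths
    ]
    return {
        "artifacts": artifacts,
        # The batch is reviewable only when nothing is missing.
        "ready_for_approval": all(a["available"] for a in artifacts),
        # Set to True or False by the operator for the whole batch.
        "approved": None,
    }
```

A single `approved` flag at the top level reflects the batch-level decision described above: the operator accepts or rejects everything at once.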

Advanced RAG pipeline

The knowledge graph, with 957 nodes and 817 edges, enables AI support agents to answer user questions from the documentation. The pipeline auto-extracts 124 knowledge modules from docs content, builds a searchable graph, and embeds modules into a FAISS vector store. Precision@3 reaches 0.58 and Recall@3 reaches 0.93, with zero hallucinations across 60 curated evaluation queries.
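For readers unfamiliar with the two metrics, the sketch below shows the standard per-query definitions of Precision@k and Recall@k. The function names and the sample module IDs are assumptions; the evaluation harness behind the reported figures is not described in this document.

```python
from typing import List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int = 3) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = retrieved[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int = 3) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# Hypothetical query: two relevant modules, both found in the top three.
retrieved = ["m1", "m7", "m3"]
relevant = {"m1", "m3"}
p = precision_at_k(retrieved, relevant)
r = recall_at_k(retrieved, relevant)
```

Under these definitions, a query with fewer than three relevant modules cannot reach a Precision@3 of 1.0 even when every relevant module is retrieved, which is one possible reason precision can sit well below recall.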

Comparison: traditional versus pipeline-first

Dimension	Traditional docs	Pipeline-first docs	Improvement
Drift window	2-4 weeks	0 days (auto-generated)	Eliminated
Quality checks	Manual review	32 automated checks	Consistent
Review cycles	5+ rounds	1-2 rounds	At least 60% reduction
Protocol coverage	1-2 protocols	5 protocols	Full parity
Time to publish	2-3 days	20 minutes	More than 95% faster
Stale page detection	Discovered by users	Weekly automated scan	Proactive
RAG readiness	Manual tagging	Auto-generated index	Automated

When to use pipeline-first documentation

Use pipeline-first documentation when you have:

More than 2 API protocols to document (REST + GraphQL + gRPC is the common starting point)
Release cadence faster than monthly (weekly or biweekly releases benefit most)
Quality requirements that exceed what manual review can sustain
AI-powered support agents that need structured knowledge for retrieval

Do not use pipeline-first documentation when:

You have a single, stable API with infrequent changes (manual docs are sufficient)
Your documentation is primarily conceptual, not API reference (the pipeline focuses on contract-driven content)

Next steps

How-to: keep docs aligned with every release for the operational workflow
Quality evidence and gate results for the latest pipeline metrics
Troubleshooting: common pipeline issues if pipeline stages fail