K² Coding Context Demo: Java R&D Case Study
K² ships scoped context for coding agents
Coding agents need source-role-separated context before they edit. This case study shows how your coding agent, whether Codex, Claude Code, or Cursor, can use K² corpora, named agents, a Knowledge Feed, and a pipeline to retrieve private guides, source, tests, architecture notes, and guardrails with citations. The benchmark is supporting evidence; the reusable architecture is the product.
Core K² primitives used in this demo:
- Collections: a collection is a separately indexed corpus. The demo keeps guide rules, versioned docs, and Java source/tests in different collections so filters can preserve source role.
- Agents: an Agent is a named retrieval/synthesis worker with bounded instructions and corpus access. The demo uses guide, docs, code, and architect agents.
- Knowledge Feed: a Knowledge Feed moves repeated findings from a source agent into a target corpus. The demo feed promotes recurring REST handler source findings into guide material.
- Pipeline: a Pipeline declares how corpora, agents, feeds, and subscriptions connect so the topology can be inspected and re-run.
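To make that topology concrete before the feature cards, the sketch below models the demo's collections, agents, feed, and pipeline as plain data. This is a minimal illustration for this write-up: the class and field shapes are assumptions, not the K² API, and only the corpus, agent, and feed names come from the demo bundle.

```python
from dataclasses import dataclass, field

# Minimal illustration of the demo topology as plain data.
# The shapes below are assumptions for this write-up, not the K² API;
# only the corpus, agent, and feed names come from the demo bundle.

@dataclass
class Collection:
    name: str           # separately indexed corpus
    source_role: str    # "guide", "docs", or "code" (source + tests)

@dataclass
class Agent:
    name: str
    instructions: str         # bounded task for this worker
    collections: list[str]    # corpora this agent may read

@dataclass
class KnowledgeFeed:
    name: str
    source_agent: str    # agent whose repeated findings are promoted
    target_corpus: str   # corpus that receives the durable guidance

@dataclass
class Pipeline:
    collections: list[Collection]
    agents: list[Agent]
    feeds: list[KnowledgeFeed] = field(default_factory=list)

demo = Pipeline(
    collections=[
        Collection("java-rd-guides", "guide"),
        Collection("flink-docs-2.2", "docs"),
        Collection("flink-code-2.2", "code"),
    ],
    agents=[
        Agent("guide", "return applicable guardrails", ["java-rd-guides"]),
        Agent("docs", "return versioned release/API docs", ["flink-docs-2.2"]),
        Agent("code", "return source and test anchors", ["flink-code-2.2"]),
        Agent("architect", "synthesize a cited plan",
              ["java-rd-guides", "flink-docs-2.2", "flink-code-2.2"]),
    ],
    feeds=[KnowledgeFeed("Flink REST Guide Feed", "code", "java-rd-guides")],
)

if __name__ == "__main__":
    for agent in demo.agents:
        print(agent.name, "->", agent.collections)
```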
K² connects your coding agent to scoped Java evidence.
Inspect the topology first: private knowledge is indexed by role, routed through bounded K² agents, exposed over MCP, and consumed by Codex, Claude Code, Cursor, or another coding agent before it edits Java. The benchmark below is supporting evidence; the architecture is the part a customer would replicate.
K² keeps guides, docs, code, and tests separated, asks bounded agents in a declared order, and lets a Knowledge Feed promote repeated source findings back into durable guidance.
Coding agents need to know whether retrieved text is a rule, an API contract, implementation precedent, or a test expectation. K² preserves that role through collections, metadata filters, named agents, and citations instead of flattening every source into one undifferentiated prompt.
Before a coding agent edits Java, K² should answer its operational questions with citations from the customer's private guides, source, tests, and architecture notes.
Each card maps a K² feature to the development value shown in this demo.
Separate corpora for guides, docs, and code
K² keeps Confluence-style guidance, versioned documentation, source, and tests queryable without mixing their roles.
- Demo assets: java-rd-guides, flink-docs-2.2, flink-code-2.2.
- Customer value: the agent knows whether evidence is a rule, API doc, implementation, or test.
Precise retrieval for legacy Java
Framework, version, source kind, module, package, class, API surface, and path metadata keep answers scoped; a filter sketch follows this card.
- Demo filter examples target Flink 2.2 REST handlers and tests.
- Customer value: fewer irrelevant snippets and less prompt waste.
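A minimal sketch of how that scoping works, assuming illustrative field names and a simple dict-based filter rather than the actual K² query syntax; only the metadata dimensions themselves come from this card, and the example paths are invented.

```python
# Illustrative metadata filter for the "Flink 2.2 REST handlers and tests" scope.
# Field names mirror the dimensions on this card; the filter syntax and the
# example paths are assumptions for this sketch, not the K² query language.

REST_HANDLER_FILTER = {
    "framework": "flink",
    "version": "2.2",
    "source_kind": {"source", "test"},
    "api_surface": "rest",
}

def matches(snippet_meta: dict, flt: dict) -> bool:
    """True only if every filter field matches the snippet's metadata."""
    for key, wanted in flt.items():
        value = snippet_meta.get(key)
        if isinstance(wanted, set):
            if value not in wanted:
                return False
        elif value != wanted:
            return False
    return True

candidates = [
    {"path": "rest/handler/job/JobVertexWatermarksHandler.java",
     "framework": "flink", "version": "2.2",
     "source_kind": "source", "api_surface": "rest"},
    {"path": "core/memory/MemorySegment.java",
     "framework": "flink", "version": "2.2",
     "source_kind": "source", "api_surface": "internal"},
]

in_scope = [c for c in candidates if matches(c, REST_HANDLER_FILTER)]
print([c["path"] for c in in_scope])  # only the REST handler survives the filter
```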
Semantic questions plus exact symbols
Dense retrieval helps with conceptual questions, while sparse matching protects exact Java class and method names; a hybrid-ranking sketch follows this card.
- Demo symbols: JobVertexWatermarksHandler, MessageQueryParameter.
- Customer value: code references survive even when names are obscure.
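The sketch below illustrates the intent: blend a dense (semantic) score with an exact-symbol match so a snippet that mentions MessageQueryParameter is not outranked by generic prose. The weights and scores are invented for this illustration and are not K² internals.

```python
# Illustrative hybrid ranking: dense semantic score blended with a sparse
# exact-symbol match. Weights and dense scores are invented for this sketch.

def exact_symbol_score(text: str, symbols: list[str]) -> float:
    """Fraction of the query's exact Java symbols that appear in the text."""
    hits = sum(1 for s in symbols if s in text)
    return hits / max(len(symbols), 1)

def hybrid_score(dense: float, text: str, symbols: list[str],
                 dense_weight: float = 0.6) -> float:
    sparse = exact_symbol_score(text, symbols)
    return dense_weight * dense + (1.0 - dense_weight) * sparse

query_symbols = ["JobVertexWatermarksHandler", "MessageQueryParameter"]

snippets = [  # (text, dense similarity score)
    ("the handler registers MessageQueryParameter for the watermark endpoint", 0.41),
    ("general discussion of REST endpoint design", 0.55),
]

ranked = sorted(snippets,
                key=lambda s: hybrid_score(s[1], s[0], query_symbols),
                reverse=True)
for text, dense in ranked:
    print(round(hybrid_score(dense, text, query_symbols), 3), text)
# The symbol-bearing snippet wins despite the lower dense score.
```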
Specialized context workers
Guide, docs, code, and architect agents each answer a bounded part of the development question.
- Demo agents return guardrails, release docs, source/test anchors, and a cited plan.
- Customer value: the coding agent receives structured context instead of a pile of text.
Turn repeated discoveries into durable guidance
The feed loop can promote recurring source findings back into the guide corpus for future sessions; a promotion sketch follows this card.
- Demo feed: Flink REST Guide Feed from code agent to java-rd-guides.
- Customer value: institutional knowledge improves instead of being rediscovered every time.
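A minimal sketch of the promotion idea behind the Flink REST Guide Feed, assuming an invented repetition threshold and finding format; the feed's actual promotion criteria are not specified here.

```python
from collections import Counter

# Illustrative feed loop: findings the code agent repeats across sessions
# become candidates for the guide corpus. The threshold and finding text are
# invented for this sketch; they are not K² feed behavior.

PROMOTION_THRESHOLD = 3

code_agent_findings = [
    "REST handlers register their MessageQueryParameter in the headers class",
    "REST handlers register their MessageQueryParameter in the headers class",
    "watermark handlers reuse the job-vertex message parameters",
    "REST handlers register their MessageQueryParameter in the headers class",
]

def repeated_findings(findings: list[str], threshold: int) -> list[str]:
    counts = Counter(findings)
    return [finding for finding, n in counts.items() if n >= threshold]

guide_candidates = repeated_findings(code_agent_findings, PROMOTION_THRESHOLD)
print(guide_candidates)  # candidates to promote into java-rd-guides
```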
Auditable route from K² to your coding agent
The Pipeline Spec exposes the topology for inspection; MCP lets the same coding-agent workflow pull K² context before it edits. A sketch of the linked audit trail follows this card.
- Demo result: answer excerpts, code diff excerpts, and verification artifacts are linked.
- Customer value: reviewers can inspect what evidence influenced the patch.
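One way to picture those links is a single per-task audit record that ties cited evidence to the diff and to the verification that ran. The record shape and example values below are illustrative, not the published artifact schema.

```python
from dataclasses import dataclass

# Illustrative audit record linking evidence to a patch. The shape and the
# example values are invented; the demo publishes these as linked artifacts.

@dataclass
class Citation:
    corpus: str      # e.g. "java-rd-guides" or "flink-code-2.2"
    location: str    # document or file the excerpt came from
    excerpt: str

@dataclass
class PatchAudit:
    task_id: str
    citations: list[Citation]   # answer excerpts that informed the plan
    diff_excerpt: str           # what the coding agent actually changed
    verification: list[str]     # focused tests and build checks that ran

audit = PatchAudit(
    task_id="example-rest-handler-task",   # hypothetical task id
    citations=[
        Citation("java-rd-guides", "REST handler guardrail page",
                 "handlers validate query parameters before use"),
        Citation("flink-code-2.2", "JobVertexWatermarksHandler.java",
                 "existing handler precedent"),
    ],
    diff_excerpt="registers MessageQueryParameter in the handler headers",
    verification=["focused handler test", "module build"],
)
print(audit.task_id, "cites", len(audit.citations), "sources")
```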
On the dimensions that exclude guide-compliance scoring, K² is narrowly ahead of the repo-only baseline and materially ahead of public-docs-only context. The full rubric then shows the additional lift from retrieving guide rules that the baseline does not have.
Full-rubric accepted patches were K² 96/100, repo-only baseline 31/100, and Context7 public-docs MCP 24/100. Read that gap as a guide-retrieval result: K² retrieved the same Confluence-style guardrails that the full scorer checks.
Token math is agent-side: retrieved snippets count once they enter the coding-agent prompt, but K² platform retrieval, ingestion, storage, and subscription costs are reported separately below. Do not quote this as a broad Context7 ranking or expected customer outcome.
The circularity risk is explicit: K² retrieves guide rules and the full rubric rewards guide compliance. The table therefore reports the full score beside a guardrail-ablated pass rate, where guide-compliance failures are removed from pass/fail attribution.
Scoring rubric
| Component | Weight |
|---|---|
| Focused tests + build verification | 40% |
| Expected files/modules touched | 25% |
| Required behavior or diff-pattern checks | 15% |
| Confluence/internal guide compliance | 10% |
| Review scope and safety | 10% |
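For readers who want the arithmetic, the sketch below scores one hypothetical patch against these weights and shows how dropping the guide-compliance component yields a guardrail-ablated view. The example values are invented; the published scorer configuration is authoritative.

```python
# Illustrative scoring of one patch against the rubric above. Component
# results are pass fractions in [0, 1]; the example patch is invented.
# The guardrail-ablated view drops guide compliance and renormalizes.

RUBRIC = {
    "focused_tests_and_build": 0.40,
    "expected_files_touched": 0.25,
    "behavior_or_diff_checks": 0.15,
    "guide_compliance": 0.10,
    "review_scope_and_safety": 0.10,
}

def weighted_score(results: dict[str, float], weights: dict[str, float]) -> float:
    total = sum(weights.values())
    return sum(weights[k] * results[k] for k in weights) / total

def ablated_score(results: dict[str, float]) -> float:
    weights = {k: w for k, w in RUBRIC.items() if k != "guide_compliance"}
    return weighted_score(results, weights)

# A patch that passes everything except the internal guide check: strong in
# the ablated frame, noticeably weaker under the full rubric.
patch = {
    "focused_tests_and_build": 1.0,
    "expected_files_touched": 1.0,
    "behavior_or_diff_checks": 1.0,
    "guide_compliance": 0.0,
    "review_scope_and_safety": 1.0,
}

print(round(weighted_score(patch, RUBRIC), 2))  # 0.90 full-rubric
print(round(ablated_score(patch), 2))           # 1.00 guardrail-ablated
```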
Guardrail-ablated versus full pass rate
| Arm | Guardrail-ablated accepted patches | Full-rubric accepted patches |
|---|---|---|
| K² MCP (project guides, source, tests, and versioned docs through K²) | 98 / 100 | 96 / 100 |
| Repo-only baseline (local checkout and model memory, no external context service) | 96 / 100 | 31 / 100 |
| Context7 public-docs MCP (public documentation through Context7, no private guide/source/test corpora) | 52 / 100 | 24 / 100 |
Authorship/freeze disclosure
The public artifact does not independently prove that task authors, guide authors, and scorer authors were blind to K² outputs before freezing. The defensible public claim is therefore narrower: K² improved this guide-retrieval-heavy benchmark, and customer-specific claims require a frozen customer replication before indexing or running either arm.
Cost model
- Agent-token numbers count prompt and completion tokens captured by the benchmark runner. Retrieved K² snippets are included once they enter the agent prompt.
- K² platform cost is not hidden in the token-savings number. It includes ingestion, retrieval queries, storage, and subscription.
- Illustrative benchmark-scale platform allocation: Pro tier at $249/month for this demo corpus and run.
- If full-rubric accepted patches are the customer-relevant outcome because guide violations create review rework, K² platform allocation is $249 / 96 = $2.59 per full-rubric accepted patch before model-token cost.
- If the guardrail-ablated frame is used as the raw code-quality denominator, K² adds 2 incremental ablated accepted patches over the repo-only baseline, or about $124.50 per incremental ablated patch before model-token effects.
- That is the point of reporting both frames: the public run shows a narrow ablated code-quality lead and a large guide-compliance/review-rework lead.
- Break-even against the repo-only baseline on the full-rubric outcome occurs when blended agent-token price exceeds roughly $0.51 per million tokens saved.
- At Claude Sonnet-style input pricing around $3 per million tokens, the 5.04M agent-token savings per full-rubric accepted patch is about $15.12, roughly 5.8x the benchmark-scale K² platform allocation.
- These numbers do not include developer time. One avoided review or re-prompt hour at a $150 loaded engineering cost dwarfs the token and platform costs combined.
Full-rubric formula: K² cost per accepted patch = 1.55M agent tokens times the model-token rate plus $2.59 platform allocation; repo-only baseline = 6.59M agent tokens times the model-token rate.
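The same arithmetic as a short script, so the per-patch platform allocation, break-even token rate, and token-savings value can be recomputed under different pricing. All inputs come from the run above; only the token rate is a knob.

```python
# Arithmetic check of the benchmark-scale cost model above. All inputs come
# from this section; change the token rate to match your own pricing.

PLATFORM_MONTHLY_USD = 249.0           # illustrative Pro-tier allocation
K2_FULL_ACCEPTED = 96                  # K² full-rubric accepted patches
K2_TOKENS_PER_PATCH_M = 1.55           # million agent tokens per accepted patch
BASELINE_TOKENS_PER_PATCH_M = 6.59     # repo-only baseline

platform_per_patch = PLATFORM_MONTHLY_USD / K2_FULL_ACCEPTED                     # ~$2.59
tokens_saved_per_patch_m = BASELINE_TOKENS_PER_PATCH_M - K2_TOKENS_PER_PATCH_M   # 5.04M
break_even_rate = platform_per_patch / tokens_saved_per_patch_m                  # ~$0.51 per M tokens

def cost_per_patch(rate_per_million_usd: float) -> tuple[float, float]:
    """(K² cost, repo-only baseline cost) per full-rubric accepted patch."""
    k2 = K2_TOKENS_PER_PATCH_M * rate_per_million_usd + platform_per_patch
    baseline = BASELINE_TOKENS_PER_PATCH_M * rate_per_million_usd
    return k2, baseline

token_rate = 3.0                                              # Sonnet-style input pricing
token_savings_value = tokens_saved_per_patch_m * token_rate   # ~$15.12 per accepted patch

print(round(platform_per_patch, 2), round(break_even_rate, 2), round(token_savings_value, 2))
print(cost_per_patch(token_rate))
```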
The economic case should also count developer time. If K² saves even one review or re-prompt hour per accepted patch, the labor savings exceed both token cost and benchmark-scale platform allocation by a large multiple.
Task definitions, scorer configurations, prompt templates, selected raw responses, patch artifacts, and demo asset manifests are published in the repository. An external reviewer can rerun the public benchmark with the same coding-agent model, a K² API key, and the published bundle; customer-specific claims still require customer-frozen tasks and customer-owned corpora.
Developers need a short reproducible setup. Enterprise buyers need a controlled pilot against their current coding-agent workflow. The same architecture supports both paths.
Developer path: load the public demo bundle, connect the MCP server to a coding agent, and run one cited retrieval query.
Enterprise pilot: run a pilot on your codebase. Freeze tasks and scoring with the customer technical lead, ingest customer docs/code, and compare against their current workflow.
The customer-facing claim should be earned on customer assets. Use this public benchmark to scope a bounded replication, not as a forecast for a financial Java application.
Run the same workflow on customer assets.
- Pick one representative Java module and one Confluence page tree.
- Freeze 5-10 real feature-development tasks before the pilot run.
- Run the same coding agent, for example Codex or Claude Code, with and without K² retrieval.
- Score accepted patches, review rework, focused tests, token use, wall-clock time, and retrieval cost.
Validate on customer code before making any customer-specific claim.
- This is a public benchmark and demo bundle, not a named customer replication.
- Design partners should freeze tasks, expected files, guide checks, and scorer logic before indexing.
- Publish customer-approved relevance findings, even if the customer name remains anonymized.
Java R&D is the inaugural case study because regulated teams often have the richest guide and compliance corpus. The same collections, agents, feed, pipeline, and MCP pattern is language-agnostic.