K² Coding Context Demo: Java R&D Case Study
K² ships scoped context for coding agents
Coding agents need source-role-separated context before they edit. This case study shows how your coding agent, whether Codex, Claude Code, or Cursor, can use K² corpora, named agents, a Knowledge Feed, and a pipeline to retrieve private guides, source, tests, architecture notes, and guardrails with citations. The benchmark is supporting evidence; the reusable architecture is the product.
Core K² primitives used in this demo:
- Collections: a collection is a separately indexed corpus. The demo keeps guide rules, versioned docs, and Java source/tests in different collections so filters can preserve source role.
- Agents: an Agent is a named retrieval/synthesis worker with bounded instructions and corpus access. The demo uses guide, docs, code, and architect agents.
- Knowledge Feed: a Knowledge Feed moves repeated findings from a source agent into a target corpus. The demo feed promotes recurring REST handler source findings into guide material.
- Pipeline: a Pipeline declares how corpora, agents, feeds, and subscriptions connect so the topology can be inspected and re-run.
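To make that topology concrete before the feature cards, the sketch below models the demo's collections, agents, feed, and pipeline as plain data. This is a minimal illustration for this write-up: the class and field shapes are assumptions, not the K² API, and only the corpus, agent, and feed names come from the demo bundle.

```python
from dataclasses import dataclass, field

# Minimal illustration of the demo topology as plain data.
# The shapes below are assumptions for this write-up, not the K² API;
# only the corpus, agent, and feed names come from the demo bundle.

@dataclass
class Collection:
    name: str           # separately indexed corpus
    source_role: str    # "guide", "docs", or "code" (source + tests)

@dataclass
class Agent:
    name: str
    instructions: str         # bounded task for this worker
    collections: list[str]    # corpora this agent may read

@dataclass
class KnowledgeFeed:
    name: str
    source_agent: str    # agent whose repeated findings are promoted
    target_corpus: str   # corpus that receives the durable guidance

@dataclass
class Pipeline:
    collections: list[Collection]
    agents: list[Agent]
    feeds: list[KnowledgeFeed] = field(default_factory=list)

demo = Pipeline(
    collections=[
        Collection("java-rd-guides", "guide"),
        Collection("flink-docs-2.2", "docs"),
        Collection("flink-code-2.2", "code"),
    ],
    agents=[
        Agent("guide", "return applicable guardrails", ["java-rd-guides"]),
        Agent("docs", "return versioned release/API docs", ["flink-docs-2.2"]),
        Agent("code", "return source and test anchors", ["flink-code-2.2"]),
        Agent("architect", "synthesize a cited plan",
              ["java-rd-guides", "flink-docs-2.2", "flink-code-2.2"]),
    ],
    feeds=[KnowledgeFeed("Flink REST Guide Feed", "code", "java-rd-guides")],
)

if __name__ == "__main__":
    for agent in demo.agents:
        print(agent.name, "->", agent.collections)
```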
K² connects your coding agent to scoped Java evidence.
Inspect the topology first: private knowledge is indexed by role, routed through bounded K² agents, exposed over MCP, and consumed by Codex, Claude Code, Cursor, or another coding agent before it edits Java. The benchmark below is supporting evidence; the architecture is the part a customer would replicate.
K² keeps guides, docs, code, and tests separated, asks bounded agents in a declared order, and lets a Knowledge Feed promote repeated source findings back into durable guidance.
Coding agents need to know whether retrieved text is a rule, an API contract, implementation precedent, or a test expectation. K² preserves that role through collections, metadata filters, named agents, and citations instead of flattening every source into one undifferentiated prompt.
Before a coding agent edits Java, K² should answer its operational questions with citations from the customer's private guides, source, tests, and architecture notes.
Each card maps a K² feature to the development value shown in this demo.
Separate corpora for guides, docs, and code
K² keeps Confluence-style guidance, versioned documentation, source, and tests queryable without mixing their roles.
- Demo assets: java-rd-guides, flink-docs-2.2, flink-code-2.2.
- Customer value: the agent knows whether evidence is a rule, API doc, implementation, or test.
Precise retrieval for legacy Java
Framework, version, source kind, module, package, class, API surface, and path metadata keep answers scoped; a filter sketch follows this card.
- Demo filter examples target Flink 2.2 REST handlers and tests.
- Customer value: fewer irrelevant snippets and less prompt waste.
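A minimal sketch of how that scoping works, assuming illustrative field names and a simple dict-based filter rather than the actual K² query syntax; only the metadata dimensions themselves come from this card, and the example paths are invented.

```python
# Illustrative metadata filter for the "Flink 2.2 REST handlers and tests" scope.
# Field names mirror the dimensions on this card; the filter syntax and the
# example paths are assumptions for this sketch, not the K² query language.

REST_HANDLER_FILTER = {
    "framework": "flink",
    "version": "2.2",
    "source_kind": {"source", "test"},
    "api_surface": "rest",
}

def matches(snippet_meta: dict, flt: dict) -> bool:
    """True only if every filter field matches the snippet's metadata."""
    for key, wanted in flt.items():
        value = snippet_meta.get(key)
        if isinstance(wanted, set):
            if value not in wanted:
                return False
        elif value != wanted:
            return False
    return True

candidates = [
    {"path": "rest/handler/job/JobVertexWatermarksHandler.java",
     "framework": "flink", "version": "2.2",
     "source_kind": "source", "api_surface": "rest"},
    {"path": "core/memory/MemorySegment.java",
     "framework": "flink", "version": "2.2",
     "source_kind": "source", "api_surface": "internal"},
]

in_scope = [c for c in candidates if matches(c, REST_HANDLER_FILTER)]
print([c["path"] for c in in_scope])  # only the REST handler survives the filter
```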
Semantic questions plus exact symbols
Dense retrieval helps with conceptual questions, while sparse matching protects exact Java class and method names; a hybrid-ranking sketch follows this card.
- Demo symbols: JobVertexWatermarksHandler, MessageQueryParameter.
- Customer value: code references survive even when names are obscure.
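The sketch below illustrates the intent: blend a dense (semantic) score with an exact-symbol match so a snippet that mentions MessageQueryParameter is not outranked by generic prose. The weights and scores are invented for this illustration and are not K² internals.

```python
# Illustrative hybrid ranking: dense semantic score blended with a sparse
# exact-symbol match. Weights and dense scores are invented for this sketch.

def exact_symbol_score(text: str, symbols: list[str]) -> float:
    """Fraction of the query's exact Java symbols that appear in the text."""
    hits = sum(1 for s in symbols if s in text)
    return hits / max(len(symbols), 1)

def hybrid_score(dense: float, text: str, symbols: list[str],
                 dense_weight: float = 0.6) -> float:
    sparse = exact_symbol_score(text, symbols)
    return dense_weight * dense + (1.0 - dense_weight) * sparse

query_symbols = ["JobVertexWatermarksHandler", "MessageQueryParameter"]

snippets = [  # (text, dense similarity score)
    ("the handler registers MessageQueryParameter for the watermark endpoint", 0.41),
    ("general discussion of REST endpoint design", 0.55),
]

ranked = sorted(snippets,
                key=lambda s: hybrid_score(s[1], s[0], query_symbols),
                reverse=True)
for text, dense in ranked:
    print(round(hybrid_score(dense, text, query_symbols), 3), text)
# The symbol-bearing snippet wins despite the lower dense score.
```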
Specialized context workers
Guide, docs, code, and architect agents each answer a bounded part of the development question.
- Demo agents return guardrails, release docs, source/test anchors, and a cited plan.
- Customer value: the coding agent receives structured context instead of a pile of text.
Turn repeated discoveries into durable guidance
The feed loop can promote recurring source findings back into the guide corpus for future sessions; a promotion sketch follows this card.
- Demo feed: Flink REST Guide Feed from code agent to java-rd-guides.
- Customer value: institutional knowledge improves instead of being rediscovered every time.
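A minimal sketch of the promotion idea behind the Flink REST Guide Feed, assuming an invented repetition threshold and finding format; the feed's actual promotion criteria are not specified here.

```python
from collections import Counter

# Illustrative feed loop: findings the code agent repeats across sessions
# become candidates for the guide corpus. The threshold and finding text are
# invented for this sketch; they are not K² feed behavior.

PROMOTION_THRESHOLD = 3

code_agent_findings = [
    "REST handlers register their MessageQueryParameter in the headers class",
    "REST handlers register their MessageQueryParameter in the headers class",
    "watermark handlers reuse the job-vertex message parameters",
    "REST handlers register their MessageQueryParameter in the headers class",
]

def repeated_findings(findings: list[str], threshold: int) -> list[str]:
    counts = Counter(findings)
    return [finding for finding, n in counts.items() if n >= threshold]

guide_candidates = repeated_findings(code_agent_findings, PROMOTION_THRESHOLD)
print(guide_candidates)  # candidates to promote into java-rd-guides
```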
Auditable route from K² to your coding agent
The Pipeline Spec exposes the topology for inspection; MCP lets the same coding-agent workflow pull K² context before it edits. A sketch of the linked audit trail follows this card.
- Demo result: answer excerpts, code diff excerpts, and verification artifacts are linked.
- Customer value: reviewers can inspect what evidence influenced the patch.
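One way to picture those links is a single per-task audit record that ties cited evidence to the diff and to the verification that ran. The record shape and example values below are illustrative, not the published artifact schema.

```python
from dataclasses import dataclass

# Illustrative audit record linking evidence to a patch. The shape and the
# example values are invented; the demo publishes these as linked artifacts.

@dataclass
class Citation:
    corpus: str      # e.g. "java-rd-guides" or "flink-code-2.2"
    location: str    # document or file the excerpt came from
    excerpt: str

@dataclass
class PatchAudit:
    task_id: str
    citations: list[Citation]   # answer excerpts that informed the plan
    diff_excerpt: str           # what the coding agent actually changed
    verification: list[str]     # focused tests and build checks that ran

audit = PatchAudit(
    task_id="example-rest-handler-task",   # hypothetical task id
    citations=[
        Citation("java-rd-guides", "REST handler guardrail page",
                 "handlers validate query parameters before use"),
        Citation("flink-code-2.2", "JobVertexWatermarksHandler.java",
                 "existing handler precedent"),
    ],
    diff_excerpt="registers MessageQueryParameter in the handler headers",
    verification=["focused handler test", "module build"],
)
print(audit.task_id, "cites", len(audit.citations), "sources")
```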
On the dimensions that exclude guide-compliance scoring, K² is narrowly ahead of the repo-only baseline and materially ahead of public-docs-only context. The full rubric then shows the additional lift from retrieving guide rules that the baseline does not have.
Full-rubric accepted patches were K² 96/100, repo-only baseline 31/100, and Context7 public-docs MCP 24/100. Read that gap as a guide-retrieval result: K² retrieved the same Confluence-style guardrails that the full scorer checks.
Token math is agent-side: retrieved snippets count once they enter the coding-agent prompt, but K² platform retrieval, ingestion, storage, and subscription costs are reported separately below. Do not quote this as a broad Context7 ranking or expected customer outcome.
The circularity risk is explicit: K² retrieves guide rules and the full rubric rewards guide compliance. The table therefore reports the full score beside a guardrail-ablated pass rate, where guide-compliance failures are removed from pass/fail attribution.
Scoring rubric
| Component | Weight |
|---|---|
| Focused tests + build verification | 40% |
| Expected files/modules touched | 25% |
| Required behavior or diff-pattern checks | 15% |
| Confluence/internal guide compliance | 10% |
| Review scope and safety | 10% |
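For readers who want the arithmetic, the sketch below scores one hypothetical patch against these weights and shows how dropping the guide-compliance component yields a guardrail-ablated view. The example values are invented; the published scorer configuration is authoritative.

```python
# Illustrative scoring of one patch against the rubric above. Component
# results are pass fractions in [0, 1]; the example patch is invented.
# The guardrail-ablated view drops guide compliance and renormalizes.

RUBRIC = {
    "focused_tests_and_build": 0.40,
    "expected_files_touched": 0.25,
    "behavior_or_diff_checks": 0.15,
    "guide_compliance": 0.10,
    "review_scope_and_safety": 0.10,
}

def weighted_score(results: dict[str, float], weights: dict[str, float]) -> float:
    total = sum(weights.values())
    return sum(weights[k] * results[k] for k in weights) / total

def ablated_score(results: dict[str, float]) -> float:
    weights = {k: w for k, w in RUBRIC.items() if k != "guide_compliance"}
    return weighted_score(results, weights)

# A patch that passes everything except the internal guide check: strong in
# the ablated frame, noticeably weaker under the full rubric.
patch = {
    "focused_tests_and_build": 1.0,
    "expected_files_touched": 1.0,
    "behavior_or_diff_checks": 1.0,
    "guide_compliance": 0.0,
    "review_scope_and_safety": 1.0,
}

print(round(weighted_score(patch, RUBRIC), 2))  # 0.90 full-rubric
print(round(ablated_score(patch), 2))           # 1.00 guardrail-ablated
```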
Guardrail-ablated versus full pass rate
| Arm | Guardrail-ablated accepted patches | Full-rubric accepted patches |
|---|---|---|
| K² MCP (project guides, source, tests, and versioned docs through K²) | 98 / 100 | 96 / 100 |
| Repo-only baseline (local checkout and model memory, no external context service) | 96 / 100 | 31 / 100 |
| Context7 public-docs MCP (public documentation through Context7, no private guide/source/test corpora) | 52 / 100 | 24 / 100 |
Authorship/freeze disclosure
The public artifact does not independently prove that task authors, guide authors, and scorer authors were blind to K² outputs before freezing. The defensible public claim is therefore narrower: K² improved this guide-retrieval-heavy benchmark, and customer-specific claims require a frozen customer replication before indexing or running either arm.
Cost model
- Agent-token numbers count prompt and completion tokens captured by the benchmark runner. Retrieved K² snippets are included once they enter the agent prompt.
- K² platform cost is not hidden in the token-savings number. It includes ingestion, retrieval queries, storage, and subscription.
- Illustrative benchmark-scale platform allocation: Pro tier at $249/month for this demo corpus and run.
- If full-rubric accepted patches are the customer-relevant outcome because guide violations create review rework, K² platform allocation is $249 / 96 = $2.59 per full-rubric accepted patch before model-token cost.
- If the guardrail-ablated frame is used as the raw code-quality denominator, K² adds 2 incremental ablated accepted patches over the repo-only baseline, or about $124.50 per incremental ablated patch before model-token effects.
- That is the point of reporting both frames: the public run shows a narrow ablated code-quality lead and a large guide-compliance/review-rework lead.
- Break-even against the repo-only baseline on the full-rubric outcome occurs when blended agent-token price exceeds roughly $0.51 per million tokens saved.
- At Claude Sonnet-style input pricing around $3 per million tokens, the 5.04M agent-token savings per full-rubric accepted patch is about $15.12, roughly 5.8x the benchmark-scale K² platform allocation.
- These numbers do not include developer time. One avoided review or re-prompt hour at a $150 loaded engineering cost dwarfs the token and platform costs combined.
Full-rubric formula: K² cost per accepted patch = 1.55M agent tokens times the model-token rate plus $2.59 platform allocation; repo-only baseline = 6.59M agent tokens times the model-token rate.
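The same arithmetic as a short script, so the per-patch platform allocation, break-even token rate, and token-savings value can be recomputed under different pricing. All inputs come from the run above; only the token rate is a knob.

```python
# Arithmetic check of the benchmark-scale cost model above. All inputs come
# from this section; change the token rate to match your own pricing.

PLATFORM_MONTHLY_USD = 249.0           # illustrative Pro-tier allocation
K2_FULL_ACCEPTED = 96                  # K² full-rubric accepted patches
K2_TOKENS_PER_PATCH_M = 1.55           # million agent tokens per accepted patch
BASELINE_TOKENS_PER_PATCH_M = 6.59     # repo-only baseline

platform_per_patch = PLATFORM_MONTHLY_USD / K2_FULL_ACCEPTED                     # ~$2.59
tokens_saved_per_patch_m = BASELINE_TOKENS_PER_PATCH_M - K2_TOKENS_PER_PATCH_M   # 5.04M
break_even_rate = platform_per_patch / tokens_saved_per_patch_m                  # ~$0.51 per M tokens

def cost_per_patch(rate_per_million_usd: float) -> tuple[float, float]:
    """(K² cost, repo-only baseline cost) per full-rubric accepted patch."""
    k2 = K2_TOKENS_PER_PATCH_M * rate_per_million_usd + platform_per_patch
    baseline = BASELINE_TOKENS_PER_PATCH_M * rate_per_million_usd
    return k2, baseline

token_rate = 3.0                                              # Sonnet-style input pricing
token_savings_value = tokens_saved_per_patch_m * token_rate   # ~$15.12 per accepted patch

print(round(platform_per_patch, 2), round(break_even_rate, 2), round(token_savings_value, 2))
print(cost_per_patch(token_rate))
```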
The economic case should also count developer time. If K² saves even one review or re-prompt hour per accepted patch, the labor savings exceed both token cost and benchmark-scale platform allocation by a large multiple.
Task definitions, scorer configurations, prompt templates, selected raw responses, patch artifacts, and demo asset manifests are published in the repository. An external reviewer can rerun the public benchmark with the same coding-agent model, a K² API key, and the published bundle; customer-specific claims still require customer-frozen tasks and customer-owned corpora.
Developers need a short reproducible setup. Enterprise buyers need a controlled pilot against their current coding-agent workflow. The same architecture supports both paths.
Developer path: load the public demo bundle, connect the MCP server to a coding agent, and run one cited retrieval query.
Enterprise pilot: run a pilot on your codebase. Freeze tasks and scoring with the customer technical lead, ingest customer docs/code, and compare against their current workflow.
The customer-facing claim should be earned on customer assets. Use this public benchmark to scope a bounded replication, not as a forecast for a financial Java application.
Run the same workflow on customer assets.
- Pick one representative Java module and one Confluence page tree.
- Freeze 5-10 real feature-development tasks before the pilot run.
- Run the same coding agent, for example Codex or Claude Code, with and without K² retrieval.
- Score accepted patches, review rework, focused tests, token use, wall-clock time, and retrieval cost.
Validate on customer code before making any customer-specific claim.
- This is a public benchmark and demo bundle, not a named customer replication.
- Design partners should freeze tasks, expected files, guide checks, and scorer logic before indexing.
- Publish customer-approved relevance findings, even if the customer name remains anonymized.
Java R&D is the inaugural case study because regulated teams often have the richest guide and compliance corpus. The same collections, agents, feed, pipeline, and MCP pattern is language-agnostic.