K² Coding Context Demo: Java R&D Case Study
K² ships scoped context for coding agents
Coding agents need source-role-separated context before they edit. This case study shows how your coding agent, such as Codex, Claude Code, or Cursor, can use K² corpora, named agents, a Knowledge Feed, and a pipeline to retrieve private guides, source, tests, architecture notes, and guardrails with citations. The benchmark is supporting evidence; the reusable architecture is the product.
Core K² primitives:
- Collection: a separately indexed corpus. The demo keeps guide rules, versioned docs, and Java source/tests in different collections so filters can preserve source role.
- Agent: a named retrieval/synthesis worker with bounded instructions and corpus access. The demo uses guide, docs, code, and architect agents.
- Knowledge Feed: moves repeated findings from a source agent into a target corpus. The demo feed promotes recurring REST handler source findings into guide material.
- Pipeline: declares how corpora, agents, feeds, and subscriptions connect so the topology can be inspected and re-run.
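To make the topology concrete, here is a minimal sketch in plain Python of how the demo's collections, agents, and feed could be declared and sanity-checked together. The field names and the validation helper are illustrative assumptions, not the actual K² pipeline spec format.

```python
# Illustrative topology for the demo: field names are assumptions, not the
# actual K² pipeline spec schema.
PIPELINE = {
    "collections": {
        "java-rd-guides": {"role": "guides"},
        "flink-docs-2.2": {"role": "versioned docs"},
        "flink-code-2.2": {"role": "source and tests"},
    },
    "agents": {
        "guide": {"reads": ["java-rd-guides"]},
        "docs": {"reads": ["flink-docs-2.2"]},
        "code": {"reads": ["flink-code-2.2"]},
        "architect": {"reads": ["java-rd-guides", "flink-docs-2.2", "flink-code-2.2"]},
    },
    "feeds": [
        # Promotes recurring REST handler findings from the code agent
        # back into the guide corpus for future sessions.
        {"name": "Flink REST Guide Feed", "source_agent": "code", "target": "java-rd-guides"},
    ],
}


def validate_topology(spec: dict) -> None:
    """Check that every agent and feed points at a declared collection."""
    collections = set(spec["collections"])
    for name, agent in spec["agents"].items():
        missing = set(agent["reads"]) - collections
        if missing:
            raise ValueError(f"agent {name!r} reads unknown collections: {sorted(missing)}")
    for feed in spec["feeds"]:
        if feed["target"] not in collections:
            raise ValueError(f"feed {feed['name']!r} targets unknown collection {feed['target']!r}")


validate_topology(PIPELINE)  # raises nothing for the demo topology above
```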
K² connects your coding agent to scoped Java evidence.
Inspect the topology first: private knowledge is indexed by role, routed through bounded K² agents, exposed over MCP, and consumed by Codex, Claude Code, Cursor, or another coding agent before it edits Java. The benchmark below is supporting evidence; the architecture is the part a customer would replicate.
Each card maps a K² feature to the development value shown in this demo.
Separate corpora for guides, docs, and code
K² keeps Confluence-style guidance, versioned documentation, source, and tests queryable without mixing their roles.
- Demo assets: java-rd-guides, flink-docs-2.2, flink-code-2.2.
- Customer value: the agent knows whether evidence is a rule, API doc, implementation, or test.
Precise retrieval for legacy Java
Framework, version, source kind, module, package, class, API surface, and path metadata keep answers scoped; a filter sketch follows this card.
- Demo filter examples target Flink 2.2 REST handlers and tests.
- Customer value: fewer irrelevant snippets and less prompt waste.
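As a rough picture of that scoping, the sketch below filters hypothetical snippet records down to Flink 2.2 REST handler source and tests. The metadata keys and sample records are assumptions for illustration, not K²'s filter syntax.

```python
# Hypothetical snippet records carrying a few of the metadata fields named above.
SNIPPETS = [
    {"framework": "flink", "version": "2.2", "source_kind": "source",
     "module": "flink-runtime", "package": "org.apache.flink.runtime.rest.handler.job",
     "path": "JobVertexWatermarksHandler.java"},
    {"framework": "flink", "version": "2.2", "source_kind": "test",
     "module": "flink-runtime", "package": "org.apache.flink.runtime.rest.handler.job",
     "path": "JobVertexWatermarksHandlerTest.java"},
    {"framework": "kafka", "version": "3.7", "source_kind": "source",
     "module": "clients", "package": "org.apache.kafka.clients.producer",
     "path": "KafkaProducer.java"},
]

# Scope the query to Flink 2.2 REST handler source and tests only.
FILTERS = {
    "framework": "flink",
    "version": "2.2",
    "source_kind": {"source", "test"},
    "package_prefix": "org.apache.flink.runtime.rest",
}


def in_scope(snippet: dict, f: dict) -> bool:
    return (snippet["framework"] == f["framework"]
            and snippet["version"] == f["version"]
            and snippet["source_kind"] in f["source_kind"]
            and snippet["package"].startswith(f["package_prefix"]))


scoped = [s["path"] for s in SNIPPETS if in_scope(s, FILTERS)]
assert scoped == ["JobVertexWatermarksHandler.java", "JobVertexWatermarksHandlerTest.java"]
```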
Semantic questions plus exact symbols
Dense retrieval helps with conceptual questions, while sparse matching protects exact Java class and method names; a small ranking sketch follows this card.
- Demo symbols: JobVertexWatermarksHandler, MessageQueryParameter.
- Customer value: code references survive even when names are obscure.
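A toy illustration of the dense-plus-sparse blend: a stubbed embedding similarity is combined with an exact-identifier bonus so a literal hit on JobVertexWatermarksHandler outranks a merely related documentation snippet. The weights, the cap, and the dense scores are assumptions, not K²'s actual ranking.

```python
import re

QUERY = "Where does JobVertexWatermarksHandler validate MessageQueryParameter values?"

CANDIDATES = [
    {"path": "JobVertexWatermarksHandler.java",
     "text": "class JobVertexWatermarksHandler extends AbstractRestHandler",
     "dense": 0.61},   # stubbed embedding similarity
    {"path": "rest-api-overview.md",
     "text": "The REST API exposes watermark metrics per job vertex.",
     "dense": 0.74},   # conceptually close, but no exact symbol match
]

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def exact_symbol_bonus(query: str, text: str) -> float:
    """Reward exact identifier overlap between the query and a snippet."""
    hits = set(IDENT.findall(text)) & set(IDENT.findall(query))
    return min(0.15 * len(hits), 0.6)  # cap the sparse contribution


def hybrid_score(query: str, cand: dict) -> float:
    return 0.6 * cand["dense"] + exact_symbol_bonus(query, cand["text"])


ranked = sorted(CANDIDATES, key=lambda c: hybrid_score(QUERY, c), reverse=True)
assert ranked[0]["path"] == "JobVertexWatermarksHandler.java"
```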
Specialized context workers
Guide, docs, code, and architect agents each answer a bounded part of the development question; the hand-off is sketched after this card.
- Demo agents return guardrails, release docs, source/test anchors, and a cited plan.
- Customer value: the coding agent receives structured context instead of a pile of text.
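A minimal sketch of that structured hand-off, assuming a simple typed container on the receiving side; the field names and the sample agent outputs are illustrative, not the K² response schema.

```python
from dataclasses import dataclass, field


@dataclass
class AssembledContext:
    """What the coding agent receives: one labeled section per bounded worker."""
    guardrails: list = field(default_factory=list)     # guide agent
    release_docs: list = field(default_factory=list)   # docs agent
    code_anchors: list = field(default_factory=list)   # code agent
    plan: str = ""                                      # architect agent


def assemble(question: str) -> AssembledContext:
    # Each argument below stands in for one bounded agent answering its slice of
    # the question; the strings are illustrative demo content, not K² output.
    return AssembledContext(
        guardrails=["REST handlers must validate MessageQueryParameter inputs."],
        release_docs=["Flink 2.2 docs: watermark endpoint response fields."],
        code_anchors=["JobVertexWatermarksHandler.java",
                      "JobVertexWatermarksHandlerTest.java"],
        plan=f"Cited implementation plan for: {question}",
    )


ctx = assemble("Add a query parameter to the watermark REST endpoint")
assert ctx.code_anchors and ctx.plan.startswith("Cited implementation plan")
```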
Turn repeated discoveries into durable guidance
The feed loop can promote recurring source findings back into the guide corpus for future sessions, as sketched after this card.
- Demo feed: Flink REST Guide Feed from code agent to java-rd-guides.
- Customer value: institutional knowledge improves instead of being rediscovered every time.
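One way to picture the promotion loop is a recurrence threshold: once the code agent reports the same finding enough times, it becomes a candidate guide entry. The threshold and record shape below are assumptions, not the configured feed policy.

```python
from collections import Counter

# Findings reported by the code agent across recent sessions (illustrative content).
CODE_AGENT_FINDINGS = [
    "REST handlers resolve MessageQueryParameter values before touching the execution graph",
    "Watermark handlers reuse the shared metric fetcher",
    "REST handlers resolve MessageQueryParameter values before touching the execution graph",
    "REST handlers resolve MessageQueryParameter values before touching the execution graph",
]

PROMOTION_THRESHOLD = 3  # assumed policy: promote after three independent recurrences


def promote_recurring(findings: list, threshold: int) -> list:
    """Return findings seen at least `threshold` times, ready to land in java-rd-guides."""
    counts = Counter(findings)
    return [finding for finding, seen in counts.items() if seen >= threshold]


new_guide_entries = promote_recurring(CODE_AGENT_FINDINGS, PROMOTION_THRESHOLD)
assert len(new_guide_entries) == 1  # only the thrice-repeated finding is promoted
```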
Auditable route from K² to your coding agent
The Pipeline Spec shows the topology; MCP lets the same coding-agent workflow pull K² context before it edits. The resulting audit trail is sketched after this card.
- Demo result: answer excerpts, code diff excerpts, and verification artifacts are linked.
- Customer value: reviewers can inspect what evidence influenced the patch.
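The audit trail can be pictured as a provenance record that ties each accepted patch back to the evidence and verification behind it; the fields and sample values below are assumptions about what a reviewer would inspect, not an exported K² format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchProvenance:
    """Links one accepted patch to the evidence and verification behind it."""
    task_id: str
    answer_excerpts: tuple   # cited retrieval snippets that reached the agent prompt
    diff_files: tuple        # files the patch actually touched
    verification: tuple      # focused tests and build checks that ran

    def reviewable(self) -> bool:
        # A reviewer needs all three artifact kinds to audit the patch.
        return bool(self.answer_excerpts and self.diff_files and self.verification)


record = PatchProvenance(
    task_id="flink-rest-042",  # hypothetical task identifier
    answer_excerpts=("java-rd-guides: REST handlers must validate query parameters",),
    diff_files=("JobVertexWatermarksHandler.java",),
    verification=("JobVertexWatermarksHandlerTest", "focused module build"),
)
assert record.reviewable()
```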
On a 100-task Apache Flink/Kafka Java benchmark, K² improved full-rubric accepted patches from 31 to 96 while using fewer agent-side tokens per accepted patch. The result supports the architecture story; it is not a customer forecast.
Token math is agent-side: retrieved snippets count once they enter the coding-agent prompt, but K² platform retrieval, ingestion, storage, and subscription costs are not included in the token-savings figure. Do not quote this as a ranking headline or expected customer outcome.
Methodology callout: guide-guardrail scoring is the circularity risk in this benchmark because K² retrieves the same guide rules the scorer checks. The reported control removes guide-compliance failures from pass/fail attribution: K² is 98/100, the repo-only baseline is 96/100, and Context7 public-docs MCP is 52/100. This means the full rubric shows the large guide-retrieval advantage, while the ablated control shows a narrow K² lead over the local baseline and a larger gap over public-docs-only context.
The circularity risk is explicit: K² retrieves guide rules and the full rubric rewards guide compliance. The table therefore reports the full score beside a guardrail-ablated pass rate, where guide-compliance failures are removed from pass/fail attribution.
Scoring rubric
| Component | Weight |
|---|---|
| Focused tests + build verification | 40% |
| Expected files/modules touched | 25% |
| Required behavior or diff-pattern checks | 15% |
| Confluence/internal guide compliance | 10% |
| Review scope and safety | 10% |
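To make the ablation concrete, the sketch below scores one task against these weights and shows how a guide-compliance failure is neutralized in the ablated view. The 0.95 acceptance threshold and the component scores are illustrative assumptions, not the published scorer.

```python
# Rubric weights from the table above.
WEIGHTS = {
    "tests_and_build": 0.40,
    "expected_files": 0.25,
    "behavior_checks": 0.15,
    "guide_compliance": 0.10,
    "review_scope": 0.10,
}


def full_score(components: dict) -> float:
    """Weighted rubric score for one task; each component is graded 0.0-1.0."""
    return sum(WEIGHTS[name] * value for name, value in components.items())


def accepted(components: dict, threshold: float = 0.95, ablate_guide: bool = False) -> bool:
    """Pass/fail for one task; the ablated view scores the guide component as if it passed."""
    if ablate_guide:
        components = {**components, "guide_compliance": 1.0}
    return full_score(components) >= threshold


# A task that fails only the internal guide rule (illustrative scores):
task = {"tests_and_build": 1.0, "expected_files": 1.0, "behavior_checks": 1.0,
        "guide_compliance": 0.0, "review_scope": 1.0}
assert not accepted(task)                 # fails under the illustrative strict threshold
assert accepted(task, ablate_guide=True)  # passes once guide failures are removed
```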
Full versus guardrail-ablated pass rate
| Arm | Full score | Guardrail-ablated score |
|---|---|---|
| K² MCP (project guides, source, tests, and versioned docs through K²) | 96 / 100 | 98 / 100 |
| Repo-only baseline (local checkout and model memory, no external context service) | 31 / 100 | 96 / 100 |
| Context7 public-docs MCP (public documentation only, no private guide/source/test corpora) | 24 / 100 | 52 / 100 |
Authorship/freeze disclosure
The public artifact does not independently prove that task authors, guide authors, and scorer authors were blind to K² outputs before freezing. The defensible public claim is therefore narrower: K² improved this guide-retrieval-heavy benchmark, and customer-specific claims require a customer replication in which tasks and scoring are frozen before either arm is indexed or run.
Cost model
- Agent-token numbers count prompt and completion tokens captured by the benchmark runner. Retrieved K² snippets are included once they enter the agent prompt.
- K² platform cost is not hidden in the token-savings number. It includes ingestion, retrieval queries, storage, and subscription.
- Illustrative benchmark-scale platform allocation: Pro tier at $249/month for this demo corpus and run.
- K² platform cost per full-rubric accepted patch under that allocation: $249 / 96 = $2.59 before model-token cost.
- Break-even against the repo-only baseline occurs when blended agent-token price exceeds roughly $0.51 per million tokens saved.
Formula: K² cost per accepted patch = 1.55M agent tokens times the model-token rate plus $2.59 platform allocation; repo-only baseline = 6.59M agent tokens times the model-token rate.
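The same arithmetic, written out so the break-even figure is easy to re-derive; it uses only the numbers stated above, with the blended model-token rate left as a free parameter.

```python
PLATFORM_PER_PATCH = 249 / 96        # $2.59 Pro-tier allocation per accepted patch
K2_TOKENS_M = 1.55                   # million agent-side tokens per accepted patch with K²
BASELINE_TOKENS_M = 6.59             # million agent-side tokens per accepted patch, repo-only


def cost_per_patch(rate_per_million_tokens: float, with_k2: bool) -> float:
    """Dollar cost per accepted patch: model tokens plus, for K², the platform allocation."""
    if with_k2:
        return K2_TOKENS_M * rate_per_million_tokens + PLATFORM_PER_PATCH
    return BASELINE_TOKENS_M * rate_per_million_tokens


# Break-even: the blended token price at which the tokens saved pay for the platform share.
break_even = PLATFORM_PER_PATCH / (BASELINE_TOKENS_M - K2_TOKENS_M)
print(f"break-even at about ${break_even:.2f} per million agent tokens saved")  # ~ $0.51
```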
Developers need a short reproducible setup. Enterprise buyers need a controlled pilot against their current coding-agent workflow. The same architecture supports both paths.
Developer quickstart: load the public demo bundle, connect the MCP server to a coding agent, and run one cited retrieval query.
Enterprise pilot on your codebase: freeze tasks and scoring with the customer technical lead, ingest customer docs/code, and compare against their current workflow.
The customer-facing claim should be earned on customer assets. Use this public benchmark to scope a bounded replication, not as a forecast for a financial Java application.
Run the same workflow on customer assets.
- Pick one representative Java module and one Confluence page tree.
- Freeze 5-10 real feature-development tasks before the pilot run.
- Run the same coding agent, for example Codex or Claude Code, with and without K² retrieval.
- Score accepted patches, review rework, focused tests, token use, wall-clock time, and retrieval cost.
Validate on customer code before making any customer-specific claim.
- This is a public benchmark and demo bundle, not a named customer replication.
- Design partners should freeze tasks, expected files, guide checks, and scorer logic before indexing.
- Publish customer-approved relevance findings, even if the customer name remains anonymized.
- Flink/Kafka are public OSS stand-ins, not proof of expected results on a financial Java application.
- The benchmark does not replace a comparison against the customer's current workflow or internal RAG/search.
- Context7 is included only as a public-docs MCP reference arm; private Context7 deployments were not tested.
- Token savings are agent-side only and must be evaluated together with K² ingestion, retrieval, storage, and subscription costs.
- Guardrail compliance is reported with an ablation control: removing guide-compliance failures narrows the comparison to K² 98/100, repo-only baseline 96/100, and Context7 public-docs MCP 52/100.
- Tasks, scoring rules, and raw task rows should be frozen and inspectable before customer pilot execution.
Java R&D is the inaugural case study because regulated teams often have the richest guide and compliance corpus. The same collections, agents, feed, pipeline, and MCP pattern is language-agnostic.
Context7 is the right public-docs MCP reference point. The fair reading is narrow: this arm checks whether public library documentation alone gives a coding agent enough project-specific context for the same Java feature tasks. The Context7 arm used more agent tokens than the repo-only baseline because the MCP tool could return additional public-doc context the local-only arm did not have; the finding is a context-scope result, not a claim that Context7 degrades agents. It does not test Context7 private deployments or act as a general ranking of Context7 behavior.
K² MCP
K² collections for generated guides, versioned docs, selected source, and neighboring tests.
96 / 100 accepted patches. Measures whether governed project context improves the coding workflow.

Repo-only baseline
Local checkout, local search, and the model baseline. No K² or public-docs MCP tool.
31 / 100 accepted patches. Measures what the same agent can do without a context service.

Context7 public-docs MCP
Public library documentation through Context7. No private guide, source, or test corpus. This arm can return more public-doc context than the repo-only baseline.
24 / 100 accepted patches. Tests whether public docs alone solve this project-specific Java task set. It is not a broad Context7 quality ranking.

Credibility bar for customer use: rerun the harness on the customer's own current workflow, include any existing internal RAG/search arm, freeze tasks before execution, and publish raw task rows plus failure analysis.