K² Coding Context Demo: Java R&D Case Study

K² ships scoped context for coding agents

Coding agents need source-role-separated context before they edit. This case study shows how your coding agent, such as Codex, Claude Code, or Cursor, can use K² corpora, named agents, a Knowledge Feed, and a pipeline to retrieve private guides, source, tests, architecture notes, and guardrails with citations. The benchmark is supporting evidence; the reusable architecture is the product.

Core K² primitives:

  • Collections: a collection is a separately indexed corpus. The demo keeps guide rules, versioned docs, and Java source/tests in different collections so filters can preserve source role.
  • Agents: an Agent is a named retrieval/synthesis worker with bounded instructions and corpus access. The demo uses guide, docs, code, and architect agents.
  • Knowledge Feed: a Knowledge Feed moves repeated findings from a source agent into a target corpus. The demo feed promotes recurring REST handler source findings into guide material.
  • Pipeline: a Pipeline declares how corpora, agents, feeds, and subscriptions connect so the topology can be inspected and re-run.

Architecture first

K² connects your coding agent to scoped Java evidence.

Inspect the topology first: private knowledge is indexed by role, routed through bounded K² agents, exposed over MCP, and consumed by Codex, Claude Code, Cursor, or another coding agent before it edits Java. The benchmark below is supporting evidence; the architecture is the part a customer would replicate.

  • Customer knowledge: guides, docs, code, tests
  • K² platform: collections, Agents, Feed, Pipeline
  • MCP server: scoped evidence with citations
  • Coding agent: plans, edits, and explains
  • Java patch: focused tests and review trail
K² platform value map

Each card maps a K² feature to the development value shown in this demo.

Collections

Separate corpora for guides, docs, and code

K² keeps Confluence-style guidance, versioned documentation, source, and tests queryable without mixing their roles.

  • Demo assets: java-rd-guides, flink-docs-2.2, flink-code-2.2.
  • Customer value: the agent knows whether evidence is a rule, API doc, implementation, or test.
Metadata filters

Precise retrieval for legacy Java

Metadata fields such as framework, version, source_kind, module, package, class_name, API surface, and path constrain retrieval before your coding agent receives context, keeping answers scoped; a minimal sketch follows the list below.

  • Demo filter examples target Flink 2.2 REST handlers and tests.
  • Customer value: fewer irrelevant snippets and less prompt waste.
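
K²'s actual filter syntax is not reproduced on this page, so the sketch below only illustrates the idea in plain Java: retrieval candidates carry role and version metadata, and a predicate narrows them to Flink 2.2 REST handler source and tests before anything reaches the coding agent. The types and field names here (ContextChunk, SourceKind) are hypothetical stand-ins, not the K² API.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical types for illustration only; not the K² API.
enum SourceKind { GUIDE, DOC, SOURCE, TEST }

record ContextChunk(String framework, String version, SourceKind kind,
                    String module, String className, String path, String text) {}

public class MetadataFilterSketch {
    public static void main(String[] args) {
        List<ContextChunk> candidates = List.of(
            new ContextChunk("flink", "2.2", SourceKind.SOURCE,
                "flink-runtime-web", "JobVertexWatermarksHandler",
                "rest/handler/job/JobVertexWatermarksHandler.java", "..."),
            new ContextChunk("flink", "1.17", SourceKind.DOC,
                "docs", null, "docs/rest_api.md", "..."));

        // Mirrors the demo filter: Flink 2.2 REST handler source and tests only.
        Predicate<ContextChunk> flink22RestHandlers = c ->
            c.framework().equals("flink")
                && c.version().equals("2.2")
                && (c.kind() == SourceKind.SOURCE || c.kind() == SourceKind.TEST)
                && c.path().contains("rest/handler");

        candidates.stream()
            .filter(flink22RestHandlers)
            .forEach(c -> System.out.println(c.kind() + " " + c.path()));
    }
}
```
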
Hybrid search

Semantic questions plus exact symbols

Dense retrieval helps with conceptual questions, while sparse matching protects exact Java class and method names.

  • Demo symbols: JobVertexWatermarksHandler, MessageQueryParameter.
  • Customer value: code references survive even when names are obscure.
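
As an illustration of why hybrid retrieval protects symbols, here is a minimal, self-contained Java sketch that blends a dense semantic score with a sparse exact-match boost. The weights and the linear blend are assumptions made for the example; K²'s actual ranking function is not published on this page.

```java
import java.util.Map;

// Illustration of hybrid scoring, not K²'s ranking function: a dense semantic
// score is blended with a sparse boost when the query symbol matches exactly.
public class HybridScoreSketch {
    // Hypothetical weights; real systems tune these or use rank fusion instead.
    static final double DENSE_WEIGHT = 0.6;
    static final double SPARSE_WEIGHT = 0.4;

    static double score(double denseSimilarity, String querySymbol, String chunkText) {
        // Exact-match term: protects identifiers like JobVertexWatermarksHandler
        // that dense embeddings alone may rank poorly.
        double sparse = chunkText.contains(querySymbol) ? 1.0 : 0.0;
        return DENSE_WEIGHT * denseSimilarity + SPARSE_WEIGHT * sparse;
    }

    public static void main(String[] args) {
        Map<String, Double> denseSims = Map.of(
            "class JobVertexWatermarksHandler extends ...", 0.41,
            "Watermarks indicate event-time progress ...", 0.72);
        // The chunk containing the exact symbol wins despite a lower dense score.
        denseSims.forEach((chunk, sim) -> System.out.printf(
            "%.2f  %s%n", score(sim, "JobVertexWatermarksHandler", chunk), chunk));
    }
}
```
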
Agents

Specialized context workers

Guide, docs, code, and architect agents each answer a bounded part of the development question.

  • Demo agents return guardrails, release docs, source/test anchors, and a cited plan.
  • Customer value: the coding agent receives structured context instead of a pile of text.
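
The page describes the agents' roles but not their wire format, so the following Java sketch models only the bounded-scope idea: each agent is a named worker with an explicit corpus list, and the architect is the only one that spans all three. The interface and record names are hypothetical, not the K² SDK.

```java
import java.util.List;

// Hypothetical shape of the four demo agents; illustrative only.
interface ContextAgent {
    String name();
    List<String> corpora();          // bounded corpus access
    String answer(String question);  // returns cited evidence, not a patch
}

record ScopedAgent(String name, List<String> corpora, String instructions)
        implements ContextAgent {
    public String answer(String question) {
        // A real agent would retrieve from its corpora and cite sources;
        // here we only demonstrate the bounded scope.
        return name + " answers from " + corpora + " only";
    }
}

public class AgentTopologySketch {
    public static void main(String[] args) {
        List<ContextAgent> agents = List.of(
            new ScopedAgent("guide", List.of("java-rd-guides"),
                "Return guardrail rules with citations."),
            new ScopedAgent("docs", List.of("flink-docs-2.2"),
                "Return versioned API documentation."),
            new ScopedAgent("code", List.of("flink-code-2.2"),
                "Return source and test anchors."),
            new ScopedAgent("architect",
                List.of("java-rd-guides", "flink-docs-2.2", "flink-code-2.2"),
                "Synthesize a cited plan from the other corpora."));
        agents.forEach(a -> System.out.println(a.answer("Add a REST handler")));
    }
}
```
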
Knowledge Feed

Turn repeated discoveries into durable guidance

The feed loop can promote recurring source findings back into the guide corpus for future sessions.

  • Demo feed: Flink REST Guide Feed from code agent to java-rd-guides.
  • Customer value: institutional knowledge improves instead of being rediscovered every time.
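
A minimal sketch of the promotion loop, under the assumption that a finding qualifies once it recurs a few times; the threshold, types, and grouping key are illustrative, not K²'s feed mechanics.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the feed idea only: recurring findings from the code agent are
// promoted into the guide corpus. Threshold and types are assumptions.
public class KnowledgeFeedSketch {
    static final int PROMOTION_THRESHOLD = 3; // hypothetical cutoff

    record Finding(String topic, String detail) {}

    public static void main(String[] args) {
        List<Finding> codeAgentFindings = List.of(
            new Finding("rest-handler-registration", "Handlers register in WebMonitorEndpoint"),
            new Finding("rest-handler-registration", "Handlers register in WebMonitorEndpoint"),
            new Finding("rest-handler-registration", "Handlers register in WebMonitorEndpoint"),
            new Finding("query-parameters", "Use MessageQueryParameter subclasses"));

        // Count how often each topic recurs across sessions.
        Map<String, Long> counts = codeAgentFindings.stream()
            .collect(Collectors.groupingBy(Finding::topic, Collectors.counting()));

        // Findings that recur enough get written back to java-rd-guides.
        counts.forEach((topic, n) -> {
            if (n >= PROMOTION_THRESHOLD) {
                System.out.println("promote to java-rd-guides: " + topic);
            }
        });
    }
}
```
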
Pipeline + MCP

Auditable route from K² to your coding agent

The Pipeline Spec exposes the topology; MCP lets the same coding-agent workflow pull K² context before editing.

  • Demo result: answer excerpts, code diff excerpts, and verification artifacts are linked.
  • Customer value: reviewers can inspect what evidence influenced the patch.
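
Because the Pipeline Spec is declarative, the topology can be modeled as plain data. The sketch below does that in Java with the demo's corpora, agents, and feed; the record shapes are assumptions, not the actual spec format.

```java
import java.util.List;

// Models a minimal, inspectable pipeline topology as plain Java data.
// Field names are assumptions, not the K² Pipeline Spec format.
public class PipelineSpecSketch {
    record Corpus(String name) {}
    record Agent(String name, List<String> corpora) {}
    record Feed(String name, String fromAgent, String toCorpus) {}
    record Pipeline(List<Corpus> corpora, List<Agent> agents, List<Feed> feeds) {}

    public static void main(String[] args) {
        Pipeline demo = new Pipeline(
            List.of(new Corpus("java-rd-guides"), new Corpus("flink-docs-2.2"),
                    new Corpus("flink-code-2.2")),
            List.of(new Agent("guide", List.of("java-rd-guides")),
                    new Agent("docs", List.of("flink-docs-2.2")),
                    new Agent("code", List.of("flink-code-2.2")),
                    new Agent("architect", List.of(
                        "java-rd-guides", "flink-docs-2.2", "flink-code-2.2"))),
            List.of(new Feed("Flink REST Guide Feed", "code", "java-rd-guides")));

        // Because the topology is data, a reviewer can audit it before a run.
        demo.feeds().forEach(f -> System.out.println(
            f.name() + ": " + f.fromAgent() + " -> " + f.toCorpus()));
    }
}
```
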
Benchmark evidence, with methodology caveats

On a 100-task Apache Flink/Kafka Java benchmark, K² improved full-rubric accepted patches from 31 to 96 while using fewer agent-side tokens per accepted patch. The result supports the architecture story; it is not a customer forecast.

  • 96 / 100: K² MCP full-rubric accepted patches with project guides, source, tests, and docs
  • 31 / 100: repo-only baseline full-rubric accepted patches in the same harness
  • 24 / 100: Context7 public-docs MCP full-rubric accepted patches
  • 76.5%: reduction in agent-side tokens per accepted patch

Token math is agent-side: retrieved snippets count once they enter the coding-agent prompt, but K² platform retrieval, ingestion, storage, and subscription costs are not included in this percentage. Do not quote this as a ranking headline or expected customer outcome.

Methodology callout: guide-guardrail scoring is the circularity risk in this benchmark because K² retrieves the same guide rules the scorer checks. The reported control removes guide-compliance failures from pass/fail attribution: K² is 98/100, the repo-only baseline is 96/100, and Context7 public-docs MCP is 52/100. This means the full rubric shows the large guide-retrieval advantage, while the ablated control shows a narrow K² lead over the local baseline and a larger gap over public-docs-only context.

Methodology and ablation

The circularity risk is explicit: K² retrieves guide rules and the full rubric rewards guide compliance. The table therefore reports the full score beside a guardrail-ablated pass rate, where guide-compliance failures are removed from pass/fail attribution.

Scoring rubric

  • Focused tests + build verification: 40%
  • Expected files/modules touched: 25%
  • Required behavior or diff-pattern checks: 15%
  • Confluence/internal guide compliance: 10%
  • Review scope and safety: 10%
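
The published scorer is not part of this page, so the sketch below is one plausible reading of the rubric: each component passes or fails, the weighted sum is the patch score, and the guardrail ablation drops guide compliance from attribution (renormalizing the remaining weight). The acceptance rule itself is an assumption, not the benchmark's scorer logic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// One plausible reading of the rubric, for illustration only: each component
// passes or fails, and the weighted sum is the patch score.
public class RubricSketch {
    public static void main(String[] args) {
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("focused tests + build verification", 0.40);
        weights.put("expected files/modules touched", 0.25);
        weights.put("behavior / diff-pattern checks", 0.15);
        weights.put("guide compliance", 0.10);
        weights.put("review scope and safety", 0.10);

        // Hypothetical patch result: everything passes except guide compliance.
        Map<String, Boolean> passed = Map.of(
            "focused tests + build verification", true,
            "expected files/modules touched", true,
            "behavior / diff-pattern checks", true,
            "guide compliance", false,
            "review scope and safety", true);

        double full = weights.entrySet().stream()
            .filter(e -> passed.get(e.getKey()))
            .mapToDouble(Map.Entry::getValue).sum();

        // Guardrail ablation: drop guide compliance from attribution entirely
        // and renormalize over the remaining 90% of the weight.
        double ablatedWeight = 1.0 - weights.get("guide compliance");
        double ablated = weights.entrySet().stream()
            .filter(e -> !e.getKey().equals("guide compliance"))
            .filter(e -> passed.get(e.getKey()))
            .mapToDouble(Map.Entry::getValue).sum() / ablatedWeight;

        // Prints full 0.90 vs ablated 1.00: a guide-only failure disappears
        // under the ablation, which is exactly the control's point.
        System.out.printf("full score: %.2f, guardrail-ablated: %.2f%n", full, ablated);
    }
}
```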

Full versus guardrail-ablated pass rate

  • K² MCP (project guides, source, tests, and versioned docs through K²): full score 96 / 100; guardrail-ablated 98 / 100
  • Repo-only baseline (local checkout and model memory without an external context service): full score 31 / 100; guardrail-ablated 96 / 100
  • Context7 public-docs MCP (public documentation through Context7, without private guide/source/test corpora): full score 24 / 100; guardrail-ablated 52 / 100

Authorship/freeze disclosure

The public artifact does not independently prove that task authors, guide authors, and scorer authors were blind to K² outputs before freezing. The defensible public claim is therefore narrower: K² improved this guide-retrieval-heavy benchmark, and customer-specific claims require a frozen customer replication before indexing or running either arm.

Cost model

  • Agent-token numbers count prompt and completion tokens captured by the benchmark runner. Retrieved K² snippets are included once they enter the agent prompt.
  • K² platform cost is not hidden in the token-savings number. It includes ingestion, retrieval queries, storage, and subscription.
  • Illustrative benchmark-scale platform allocation: Pro tier at $249/month for this demo corpus and run.
  • K² platform cost per full-rubric accepted patch under that allocation: $249 / 96 = $2.59 before model-token cost.
  • Break-even against the repo-only baseline sits at roughly $0.51 per million tokens; above that blended agent-token price, the token savings outweigh the platform allocation.

Formula: K² cost per accepted patch = 1.55M agent tokens times the model-token rate plus $2.59 platform allocation; repo-only baseline = 6.59M agent tokens times the model-token rate.
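
As a sanity check, this sketch reproduces the cost-model arithmetic with the numbers quoted above; the $3-per-million-token rate in the last step is an arbitrary illustration, not a quoted model price.

```java
// Reproduces the cost-model arithmetic above with the page's own numbers.
// The blended model-token rate is the only free variable.
public class CostModelSketch {
    public static void main(String[] args) {
        double k2TokensM = 1.55;        // M agent tokens per accepted patch, K² arm
        double baselineTokensM = 6.59;  // M agent tokens per accepted patch, repo-only
        double platformPerPatch = 249.0 / 96;  // $249 Pro tier over 96 accepted patches

        System.out.printf("platform allocation per patch: $%.2f%n", platformPerPatch);
        System.out.printf("agent-token reduction: %.1f%%%n",
            100 * (1 - k2TokensM / baselineTokensM));

        // Break-even blended rate r ($ per M tokens):
        //   k2TokensM * r + platformPerPatch = baselineTokensM * r
        double breakEvenRate = platformPerPatch / (baselineTokensM - k2TokensM);
        System.out.printf("break-even rate: $%.2f per M tokens%n", breakEvenRate);

        // Example at an assumed $3 per M blended tokens (illustrative only):
        double rate = 3.0;
        System.out.printf("K² cost/patch: $%.2f vs repo-only: $%.2f%n",
            k2TokensM * rate + platformPerPatch, baselineTokensM * rate);
    }
}
```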

Two ways to act on the demo

Developers need a short reproducible setup. Enterprise buyers need a controlled pilot against their current coding-agent workflow. The same architecture supports both paths.

Pilot ask

The customer-facing claim should be earned on customer assets. Use this public benchmark to scope a bounded replication, not as a forecast for a financial Java application.

Enterprise path

Run the same workflow on customer assets.

  • Pick one representative Java module and one Confluence page tree.
  • Freeze 5-10 real feature-development tasks before the pilot run.
  • Run the same coding agent, for example Codex or Claude Code, with and without K² retrieval.
  • Score accepted patches, review rework, focused tests, token use, wall-clock time, and retrieval cost.
Design partners wanted

Validate on customer code before making any customer-specific claim.

  • This is a public benchmark and demo bundle, not a named customer replication.
  • Design partners should freeze tasks, expected files, guide checks, and scorer logic before indexing.
  • Publish customer-approved relevance findings, even if the customer name remains anonymized.
  • Flink/Kafka are public OSS stand-ins, not proof of expected results on a financial Java application.
  • The benchmark does not replace a comparison against the customer's current workflow or internal RAG/search.
  • Context7 is included only as a public-docs MCP reference arm; private Context7 deployments were not tested.
  • Token savings are agent-side only and must be evaluated together with K² ingestion, retrieval, storage, and subscription costs.
  • Guardrail compliance is reported with an ablation control: removing guide-compliance failures narrows the comparison to K² 98/100, repo-only baseline 96/100, and Context7 public-docs MCP 52/100.
  • Tasks, scoring rules, and raw task rows should be frozen and inspectable before customer pilot execution.
K² for coding agents, not only Java

Java R&D is the inaugural case study because regulated teams often have the richest guide and compliance corpus. The same collections, agents, feed, pipeline, and MCP pattern is language-agnostic.

  • Java R&D case study: regulated enterprise services with Confluence-heavy guardrails.
  • TypeScript case study: internal component library and product-platform conventions.
  • Python case study: data/ML platform workflows with notebooks, pipelines, and model-serving code.
Market reference arm: Context7 public docs

Context7 is the right public-docs MCP reference point. The fair reading is narrow: this arm checks whether public library documentation alone gives a coding agent enough project-specific context for the same Java feature tasks. The Context7 arm used more agent tokens than the repo-only baseline because the MCP tool could return additional public-doc context the local-only arm did not have; the finding is a context-scope result, not a claim that Context7 degrades agents. It does not test Context7 private deployments or act as a general ranking of Context7 behavior.

Private/project-context arm

K² MCP

K² collections for generated guides, versioned docs, selected source, and neighboring tests.

96 / 100 accepted patches: measures whether governed project context improves the coding workflow.
No external MCP arm

Repo-only baseline

Local checkout, local search, and the model baseline. No K² or public-docs MCP tool.

31 / 100 accepted patches: measures what the same agent can do without a context service.
Public-docs reference arm

Context7 public-docs MCP

Public library documentation through Context7. No private guide, source, or test corpus. This arm can return more public-doc context than the repo-only baseline.

24 / 100 accepted patches: tests whether public docs alone solve this project-specific Java task set. It is not a broad Context7 quality ranking.

Credibility bar for customer use: rerun the harness on the customer's own current workflow, include any existing internal RAG/search arm, freeze tasks before execution, and publish raw task rows plus failure analysis.