# K2 for Java R&D Development Workflows

Audience: financial-services Java R&D team, Java technical lead, VP R&D

## Executive Summary

The customer has the exact problem K2 is designed to address: a large Java financial application, years of engineering conventions, and substantial Confluence documentation that developers manually bring into coding-agent sessions.

The public Flink/Kafka benchmark is directional evidence, not final proof, and should be validated in a controlled pilot. It shows that targeted retrieval can materially improve coding-agent output on large Java projects. The customer's decision should be based on a replication using their own Java module, Confluence pages, focused tests, and current agent workflow.

Recommended customer ask:

> Give us one representative Java service or module, one Confluence page tree that developers actually use, and 5-10 feature-development tasks with focused tests. We will run the same agent with and without K2 retrieval and report accepted patches, test results, Confluence compliance, provenance, token usage, and retrieval cost.

## Why This Matters To The Customer

The customer's likely pain is not Java syntax. Modern coding agents such as Codex, Claude Code, and Cursor can write Java. The harder problem is that the agent often lacks the customer-specific context needed to make a correct change:

- which controller/service/DTO patterns are allowed;
- which financial-domain rules or audit constraints apply;
- which Confluence page defines the implementation path;
- which tests should be copied or extended;
- which internal APIs are deprecated or mandatory;
- which module boundaries must not be crossed.

K2's value proposition:

> K2 gives coding agents controlled, cited access to the engineering knowledge buried across source code, tests, Confluence, ADRs, generated API docs, and team-specific rules.

## K2 Architecture To Demonstrate

| K2 capability | Value to the customer |
| --- | --- |
| Collections | Source code, tests, Confluence, ADRs, API docs, and compliance rules remain distinct and filterable. |
| Agents | Guide, docs, code, test, and architect workers return scoped evidence before the coding agent edits. |
| Knowledge Feed | Repeated findings from code/test analysis can become durable guide material for future sessions. |
| Pipeline | The context topology is explicit enough for platform and security teams to inspect. |

## Public Evidence, Bounded Correctly

K2 has a useful directional signal from the Apache Flink/Kafka benchmark:

| Arm | Context available | Guardrail-ablated accepted patches | Full-rubric accepted patches | Tokens per full-rubric accepted patch | Seconds per full-rubric accepted patch |
| --- | --- | ---: | ---: | ---: | ---: |
| K2 MCP | Project guides, docs, source, and tests through K2 | 98 / 100 | 96 / 100 | 1.55M | 214.2s |
| Baseline | Local checkout and model memory | 96 / 100 | 31 / 100 | 6.59M | 640.7s |
| Context7 public-docs MCP | Public library docs only, no private guide/source/test corpus | 52 / 100 | 24 / 100 | 12.11M | 935.5s |

What this proves:

- On a large Java benchmark, K2 was narrowly ahead of the local baseline on guardrail-ablated accepted patches and materially ahead of public-docs-only context.
- The full-rubric gap shows the value of retrieving project guide rules when those rules are part of the scoring and review process.
- K2 reduced coding-agent tokens per accepted patch in that benchmark (see the worked example after this list).
- The token comparison is agent-side only: retrieved snippets are counted once they are inserted into the coding-agent prompt, while K2 ingestion, retrieval, storage, and subscription costs sit outside those figures.
- Context7 is the right public-docs MCP reference arm for this question, and public docs alone did not provide enough project-specific evidence for this task set.
- The Context7 arm used more agent tokens than the repo-only baseline because the MCP tool could return additional public-doc context unavailable to the local-only arm; this should be read as a context-scope result, not as a claim that Context7 degrades agents.
- The result is consistent with K2's architecture: project guides, source, tests, and docs are retrieved as separate source roles before the agent edits.
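
As a worked example of the agent-side token comparison, the sketch below derives the reduction from the two per-patch figures in the table above; the inputs come from the benchmark table, and the percentage is simple arithmetic rather than a separately reported result.

```java
public class AgentTokenComparison {
    public static void main(String[] args) {
        // Tokens per full-rubric accepted patch, taken from the benchmark table.
        double k2 = 1.55e6;       // K2 MCP arm
        double baseline = 6.59e6; // repo-only baseline arm

        // Agent-side reduction only; K2 ingestion, retrieval, storage, and
        // subscription costs are not part of either figure.
        double reduction = 1.0 - k2 / baseline;
        System.out.printf("~%.0f%% fewer agent tokens per accepted patch%n", reduction * 100);
    }
}
```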

What this does not prove:

- It does not establish the customer's expected success rate.
- It does not compare against the customer's current internal RAG/search system.
- It does not test Context7 private deployments or act as a general ranking of Context7 behavior.
- It does not include the customer's actual Confluence quality, ACLs, stale pages, or compliance rules.
- It does not include K2 ingestion/retrieval/storage/subscription cost in the token savings.

This honest boundary is important. It makes the pilot credible to the Java TL.

## Proposed Pilot

### Scope

- One representative Java service or module.
- One Confluence page tree used by developers.
- Existing focused tests where possible.
- 5-10 real feature-development tasks.
- Customer technical lead reviews task definitions, expected files, and guardrails before execution.

### Arms

| Arm | Description |
| --- | --- |
| Current workflow | Customer's current coding-agent workflow, for example Codex, Claude Code, or Cursor. |
| K2 workflow | Same agent and task, with K2 MCP retrieval over code, tests, Confluence, docs, and guardrails. |
| Optional existing RAG | If the customer already has internal RAG/search, include it as a third arm. |

### Scoring

Use a balanced scoring rule so that K2 is not rewarded only for retrieving guides:

| Component | Weight |
| --- | ---: |
| Focused tests and build verification | 40% |
| Expected files/modules touched | 25% |
| Required behavior or diff-pattern checks | 15% |
| Confluence/internal guide compliance | 10% |
| Review scope and safety | 10% |

Also report pass/fail with the guide-compliance component removed. This directly addresses the "circular benchmark" critique.
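
A minimal per-task scoring sketch using the weights from the table above; the class and field names are illustrative rather than part of K2, and renormalizing the remaining weights in the ablated variant is an assumption about how the ablation would be reported.

```java
// Illustrative per-task scoring sketch; names are hypothetical, weights are from the rubric table.
public class PilotTaskScore {
    // Component scores in [0, 1], filled in from CI results and reviewer checks.
    double testsAndBuild;     // focused tests and build verification (40%)
    double expectedFiles;     // expected files/modules touched (25%)
    double behaviorChecks;    // required behavior or diff-pattern checks (15%)
    double guideCompliance;   // Confluence/internal guide compliance (10%)
    double reviewScopeSafety; // review scope and safety (10%)

    /** Full rubric score with all five weighted components. */
    double fullRubric() {
        return 0.40 * testsAndBuild
             + 0.25 * expectedFiles
             + 0.15 * behaviorChecks
             + 0.10 * guideCompliance
             + 0.10 * reviewScopeSafety;
    }

    /** Guide-compliance component removed; remaining weights renormalized to sum to 1. */
    double guideAblated() {
        double raw = 0.40 * testsAndBuild
                   + 0.25 * expectedFiles
                   + 0.15 * behaviorChecks
                   + 0.10 * reviewScopeSafety;
        return raw / 0.90;
    }
}
```

Reporting both scores for every task keeps the guide-compliance weight from carrying the comparison on its own.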

For the public Flink/Kafka run, the full rubric shows K2 MCP at 96 / 100 accepted patches, the repo-only baseline at 31 / 100, and the Context7 public-docs MCP arm at 24 / 100. The guide-guardrail-ablated control narrows the comparison to K2 98 / 100, repo-only baseline 96 / 100, and Context7 public-docs MCP 52 / 100. This should be disclosed wherever the benchmark is cited: the full score shows the guide-retrieval advantage, while the ablation prevents the guide-compliance component from carrying the claim by itself.

## Data To Ingest

| Source | Minimum metadata | Why it matters |
| --- | --- | --- |
| Java source | repo, module, package, class, owner team | Exact symbols, controller/service boundaries, DTO usage. |
| Tests | repo, module, package, test class, source type | Test examples and verification commands. |
| Confluence | space, page ID, title, version, owner, last updated | Internal guidance, architecture rules, onboarding docs. |
| ADRs/RFCs | doc type, component, decision status | Why patterns exist and what should not change. |
| API docs | API surface, version, generated source | API contracts and request/response structures. |
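
A minimal sketch of the minimum metadata as Java records, one per source type in the table; the record and field names mirror the table columns but are illustrative, not a K2 ingestion schema.

```java
import java.time.Instant;

// Illustrative metadata shapes mirroring the ingestion table; not a K2 schema.
record JavaSourceMeta(String repo, String module, String pkg,
                      String className, String ownerTeam) {}

record TestMeta(String repo, String module, String pkg,
                String testClass, String sourceType) {}

record ConfluenceMeta(String space, String pageId, String title,
                      int version, String owner, Instant lastUpdated) {}

record AdrMeta(String docType, String component, String decisionStatus) {}

record ApiDocMeta(String apiSurface, String version, String generatedSource) {}
```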

## Demo Flow

1. Start with the architecture: customer knowledge -> K2 collections/agents/feed/pipeline -> MCP -> coding agent -> reviewed patch.
2. Show the Flink/Kafka result as supporting evidence with caveats.
3. Show the pilot design and ask for one module plus one Confluence page tree.
4. Demonstrate what K2 retrieval would look like (a protocol-level sketch follows this list):
   - "How do we add a controller in module X?"
   - "Which DTO pattern should this endpoint use?"
   - "Which test should be extended?"
   - "Which Confluence rule applies?"
5. Close on the replication ask: 5-10 tasks, scored together, using their code and docs.
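
A protocol-level sketch of what the first retrieval question could look like as an MCP tool call from the coding agent; the tool name, argument fields, and collection labels are hypothetical (real names would come from K2's tools/list response), and only the JSON-RPC tools/call envelope follows the MCP specification.

```java
public class RetrievalCallSketch {
    public static void main(String[] args) {
        // Hypothetical K2 tool name and arguments; only the tools/call envelope
        // is standard MCP JSON-RPC. Actual tool names come from tools/list.
        String request = """
            {
              "jsonrpc": "2.0",
              "id": 1,
              "method": "tools/call",
              "params": {
                "name": "search_knowledge",
                "arguments": {
                  "query": "How do we add a controller in module X?",
                  "collections": ["java-source", "tests", "confluence-guides"]
                }
              }
            }""";
        System.out.println(request);
    }
}
```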

## Success Criteria

Minimum success:

- K2 improves the accepted patch rate by at least 20 percentage points over the current workflow.
- K2 retrieves relevant Confluence/code/test sources with citations.
- K2 reduces repeated developer context-pasting.
- Lead developer agrees retrieved sources are relevant.

Strong success:

- K2 improves the accepted patch rate and reduces tokens per accepted patch.
- K2 reduces guide/compliance-related review comments.
- K2 passes focused tests more often.
- VP R&D sees a path to lower rework and faster onboarding.
- Security reviewer accepts the deployment model.

## Buying Argument

The buying argument is not "K2 beat a benchmark." The buying argument is:

> If K2 improves accepted patches and reduces review rework on your own Java service and Confluence guidance, it becomes a development workflow layer for every team using coding agents against the financial application.

## Decision Request

Ask the customer to agree to a short pilot:

- nominate one Java module;
- export or grant access to one Confluence page tree;
- identify 5-10 representative tasks;
- assign one Java TL to validate scoring;
- choose a SaaS, single-tenant, VPC, or self-hosted deployment path for the pilot.
