# K2 Java Development Workflow Pilot Proposal

Audience: Java technical lead, VP R&D, platform engineering, security review

## One-Line Setup

Baseline means the same coding-agent workflow (for example Codex, Claude Code, or Cursor) without K2 retrieval; K2 means the same agent connected through K2 MCP to indexed source code, tests, Confluence guidance, ADRs, API docs, examples, and engineering guardrails.

## Executive Summary

Large Java R&D teams do not usually fail because an AI agent cannot write Java syntax. They fail because the agent does not know the internal implementation rules that live across Confluence, legacy modules, test examples, service boundaries, security guidance, and architecture decisions.

K2 is designed to make that internal engineering knowledge available to coding agents through controlled, cited retrieval. Its differentiated architecture is Collections for separated source roles, named Agents for bounded retrieval, Knowledge Feeds for durable learning, and Pipelines for repeatable context topology.

The public Apache Flink/Kafka benchmark is a useful directional signal when read correctly. The guardrail-ablated control is K2 98 / 100, repo-only baseline 96 / 100, and Context7 public-docs MCP 52 / 100. The full rubric is K2 96 / 100 versus 31 / 100 for the local baseline and 24 / 100 for Context7, showing the additional lift from retrieving guide rules that the scorer also checks.

That result is not proof of expected customer performance and is not a broad Context7 quality claim. The Context7 arm used public docs only and could return additional public-doc context; this is a context-scope result, not evidence that Context7 degrades agents. It is evidence that justifies a controlled pilot on the customer's own Java application and documentation.

## What We Can Claim Today

The public benchmark supports these narrow claims:

- K2 can improve coding-agent output when the task depends on project-specific Java conventions and supporting documentation.
- Hybrid retrieval over code, tests, docs, and guide material can reduce repeated context-pasting.
- Source-cited retrieval gives reviewers an audit trail for why an agent chose a controller pattern, DTO shape, test fixture, or integration path.

The benchmark does not support these claims:

- It does not establish a success rate for the customer's financial Java application.
- It does not prove that Flink/Kafka are representative of the customer's legacy codebase.
- It does not replace a comparison against the customer's current coding agent, search, or internal RAG workflow.
- It does not test Context7 private deployments; the Context7 arm used public library documentation only.
- It does not claim Context7 degrades agents; the higher agent-token use in that arm came from retrieving additional public-doc context that the repo-only baseline did not include.
- It does not prove total cost savings until K2 ingestion, retrieval, storage, and subscription costs are included.

## Why This Matters For A Legacy Java Team

The valuable customer workflow is not a generic prompt such as "write a Java controller." The valuable workflow is:

- find the internal page that says how controllers must be organized;
- retrieve nearby source files that implement the same pattern;
- identify the DTO, validation, audit, and test conventions;
- generate the patch in the correct module;
- cite the sources used so the Java TL can review the reasoning.

K2 turns those scattered sources into a retrieval layer for the customer's coding agent. Developers keep their agent, whether Codex, Claude Code, Cursor, or another MCP-capable tool, while K2 provides the context layer that knows the customer environment.
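
To make the target output concrete, the sketch below shows the kind of patch this workflow should produce. It is a hypothetical illustration only: the package, DTO shape, audit facade, and cited sources are assumptions standing in for the customer's actual conventions, which K2 would retrieve.

```java
// Hypothetical patch sketch. Package, types, and cited sources are assumed,
// not retrieved from any real customer corpus.
package com.example.payments.api;

import jakarta.validation.Valid;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Positive;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

// Sources the agent would cite for review (illustrative):
//   Confluence "Controller Conventions" page: controllers stay thin, delegate
//   to a service, and never expose persistence entities directly.
//   ADR-042 (assumed): every mutating endpoint records an audit event.
@RestController
@RequestMapping("/api/v1/transfers")
class TransferController {

    private final TransferService transferService;
    private final AuditLogger auditLogger;

    TransferController(TransferService transferService, AuditLogger auditLogger) {
        this.transferService = transferService;
        this.auditLogger = auditLogger;
    }

    @PostMapping
    ResponseEntity<TransferResponseDto> create(@Valid @RequestBody TransferRequestDto request) {
        TransferResponseDto response = transferService.create(request);
        auditLogger.record("transfer.created", response.id()); // per assumed ADR-042
        return ResponseEntity.ok(response);
    }
}

// Minimal supporting types so the sketch is self-contained.
record TransferRequestDto(@NotBlank String fromAccount,
                          @NotBlank String toAccount,
                          @Positive long amountCents) {}

record TransferResponseDto(String id, String status) {}

interface TransferService {
    TransferResponseDto create(TransferRequestDto request);
}

interface AuditLogger {
    void record(String event, String entityId);
}
```

The point is not this specific code; it is that every convention the patch follows should trace back to a retrievable, cited source the Java TL can check.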

## K2 Architecture In The Pilot

| K2 capability | Role in the customer workflow |
| --- | --- |
| Collections | Keep source code, tests, Confluence, ADRs, API docs, and guardrails separated by source role and ACL boundary. |
| Agents | Route guide, docs, code, test, and architecture questions to named workers with bounded instructions. |
| Knowledge Feeds | Promote recurring implementation findings from source/test retrieval into durable guide context. |
| Pipelines | Make the context graph inspectable, repeatable, and reviewable before it is connected to a coding agent. |

## Proposed Pilot

Run a controlled pilot of 5-10 tasks, not a production rollout. The core comparison is the same agent, same task, with K2 retrieval versus the customer's current agent workflow.

Inputs requested:

- One representative Java module or service.
- One Confluence page tree that developers actually use.
- Existing focused tests or a short test harness for each task (a minimal harness sketch follows this list).
- One Java TL to freeze task definitions, expected files, guardrails, and scoring before execution.
- A deployment path for the pilot: SaaS, single-tenant, VPC, or self-hosted.
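
For the third input, a focused test can be as small as the sketch below, which reuses the hypothetical types from the controller sketch above (same package assumed, JUnit 5 on the classpath). The expected behavior is frozen by the Java TL before any agent sees the task.

```java
// Minimal focused-test harness sketch for one pilot task (illustrative names).
// InMemoryTransferService is a trivial fake so the sketch compiles on its own;
// in the pilot, the same test runs against the agent's actual patch.
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class TransferPilotTaskTest {

    @Test
    void newTransferStartsInPendingStatus() {
        TransferService service = new InMemoryTransferService();
        TransferResponseDto response =
                service.create(new TransferRequestDto("ACC-1", "ACC-2", 10_00L));
        assertEquals("PENDING", response.status());
    }

    static final class InMemoryTransferService implements TransferService {
        @Override
        public TransferResponseDto create(TransferRequestDto request) {
            return new TransferResponseDto("T-1", "PENDING");
        }
    }
}
```

Every arm then runs the same command, for example `mvn -Dtest=TransferPilotTaskTest test`, so pass/fail is directly comparable across arms.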

Evaluation arms:

| Arm | Description |
| --- | --- |
| Current workflow | Customer's current coding-agent workflow, for example Codex, Claude Code, or Cursor. |
| K2 workflow | Same agent, same task, with K2 retrieval over customer code, tests, Confluence, API docs, ADRs, and guardrails. |
| Existing RAG/search | Optional arm if the customer already uses internal retrieval. |

## Scoring Model

The scoring rule should not reward K2 only for retrieving guide material. Use a balanced score:

| Component | Weight |
| --- | ---: |
| Focused tests and build verification | 40% |
| Expected files and modules touched | 25% |
| Required behavior or diff-pattern checks | 15% |
| Confluence/internal guide compliance | 10% |
| Review scope and safety | 10% |

Report two versions of the score:

- Full score, including Confluence/internal guide compliance.
- Ablated score, with the guide-compliance component removed.

The ablated score directly addresses the circularity critique: K2 should still improve useful engineering outcomes, not merely retrieve rules that the scorer later rewards.

For the public Flink/Kafka run, the full rubric shows K2 MCP at 96 / 100 accepted patches, the repo-only baseline at 31 / 100, and the Context7 public-docs MCP arm at 24 / 100. The guardrail-ablated control narrows the comparison to K2 98 / 100, repo-only baseline 96 / 100, and Context7 public-docs MCP 52 / 100. That is the point of reporting both numbers: the full rubric captures the guide-retrieval advantage, while the ablated control prevents the guide-compliance component from carrying the claim by itself.
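
As a sanity check on the mechanics, the sketch below computes both scores for one task from per-component pass fractions in [0, 1]. The weights come from the table above; renormalizing the remaining weights after dropping the guide component is our assumption about how the ablated score is formed.

```java
// Balanced scoring rule from the weights table; ablated variant drops the
// 10% guide-compliance component and renormalizes (an assumed convention).
public final class PilotScore {

    public static double full(double tests, double files, double behavior,
                              double guide, double safety) {
        return 100 * (0.40 * tests + 0.25 * files + 0.15 * behavior
                + 0.10 * guide + 0.10 * safety);
    }

    public static double ablated(double tests, double files, double behavior,
                                 double safety) {
        double raw = 0.40 * tests + 0.25 * files + 0.15 * behavior + 0.10 * safety;
        return 100 * raw / 0.90; // renormalize after dropping the guide weight
    }

    public static void main(String[] args) {
        // Example task: every component passes except guide compliance.
        System.out.printf("%.1f%n", full(1, 1, 1, 0, 1)); // 90.0
        System.out.printf("%.1f%n", ablated(1, 1, 1, 1)); // 100.0
    }
}
```

The example shows why both numbers matter: a patch that ignores guide material can still score 90 on the full rubric and 100 on the ablated control, so the guide component sharpens the signal without dominating it.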

## K2 Corpus For The Pilot

| Source | Required metadata | Use in agent workflow |
| --- | --- | --- |
| Java source | repo, module, package, class, owner, last commit | Exact implementation patterns and symbol references. |
| Tests | repo, module, package, test class, fixture type | Examples to extend and commands to run. |
| Confluence | space, page ID, title, owner, version, last updated, ACL group | Internal implementation rules and onboarding guidance. |
| ADRs/RFCs | component, decision status, owner, date | Why a pattern exists and whether it is current. |
| API docs | API surface, version, generated source | Request/response contracts and deprecated APIs. |
| Compliance/security rules | domain, control ID, owner, effective date | Audit, logging, data handling, and restricted flows. |
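
As an illustration of the metadata contract, the Confluence row could be carried as a record like the sketch below. The field names follow the table; the type name and shapes are assumptions, not K2's actual schema.

```java
// Sketch of one corpus metadata record (the Confluence row from the table).
import java.time.LocalDate;

public record ConfluencePageMeta(
        String space,
        String pageId,
        String title,
        String owner,
        int version,
        LocalDate lastUpdated,
        String aclGroup) {
}
```

Carrying the ACL group and last-updated date on every record is what lets retrieval respect permission boundaries and flag stale guidance.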

## Success Criteria

Minimum pilot success:

- K2 improves accepted patch rate by at least 20 percentage points over the current workflow.
- Focused tests pass more often or with less rework.
- Java TL agrees retrieved sources are relevant to the generated patch.
- K2 reduces repeated manual context-pasting.

Strong pilot success:

- K2 improves accepted patch rate while reducing agent-side tokens per accepted patch and review rework.
- K2 reduces guide-related review comments.
- K2 produces source-cited answers that developers can audit.
- Security accepts the deployment and data-handling model for broader rollout.

## Cost And Efficiency Reporting

The pilot report should include:

- agent prompt and completion tokens;
- K2 retrieval calls and retrieved-token volume;
- K2 ingestion, storage, and query-cost assumptions;
- wall-clock time per task;
- accepted patches per hour;
- cost per accepted patch.

Token savings alone are not the business case. Cost per accepted, reviewed, and test-passing patch is the business case.
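
A minimal sketch of that headline metric follows; all prices and counts are illustrative placeholders to be replaced by measured pilot values and actual contract rates.

```java
// Cost per accepted patch from pilot totals. Every input below is an
// illustrative assumption, not a quoted price.
public final class PilotCost {

    public static double costPerAcceptedPatch(
            long promptTokens, long completionTokens, long k2RetrievedTokens,
            double agentUsdPerMTok, double k2UsdPerMTok,
            double k2FixedUsdForPilot, int acceptedPatches) {
        double agentCost = (promptTokens + completionTokens) / 1e6 * agentUsdPerMTok;
        double k2Cost = k2RetrievedTokens / 1e6 * k2UsdPerMTok + k2FixedUsdForPilot;
        return (agentCost + k2Cost) / acceptedPatches;
    }

    public static void main(String[] args) {
        // Hypothetical pilot totals: 4M prompt + 1M completion agent tokens,
        // 2M K2-retrieved tokens, $10/MTok agent, $2/MTok retrieval,
        // $500 fixed ingestion/storage, 8 accepted patches.
        System.out.printf("%.2f%n", costPerAcceptedPatch(
                4_000_000L, 1_000_000L, 2_000_000L, 10.0, 2.0, 500.0, 8)); // 69.25
    }
}
```

The same calculation runs per arm, so the report can show whether K2's retrieval and fixed costs are paid back by a higher accepted-patch denominator.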

## Decision Request

Approve a short pilot using one Java module, one Confluence page tree, and 5-10 representative feature-development tasks. If K2 improves accepted patches and reduces review rework on the customer's own code and documents, expand by module. If it does not, stop.
