Enterprise path

Run a controlled pilot on your codebase

The public Flink/Kafka benchmark is directional evidence. The buying decision should come from a customer-run replication using the same coding agent, frozen tasks, customer Confluence guidance, customer Java code, focused tests, and total cost reporting.

Inputs

What the customer provides

  • One representative Java service or module.
  • One Confluence page tree that developers already use.
  • Five to ten real feature-development tasks frozen before indexing (a sketch of one frozen task record follows this list).
  • Focused tests or reviewer-accepted verification commands.
  • One customer Java technical lead who approves tasks, guardrails, expected files, and scoring.
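
To make "frozen" concrete, one lightweight convention is to capture each task as an immutable record before indexing begins. This is a minimal Python sketch; the field names, paths, and Gradle command are illustrative, not a K² schema.

    from dataclasses import dataclass

    @dataclass(frozen=True)  # immutable once created, matching the freeze-before-indexing rule
    class PilotTask:
        task_id: str
        description: str                  # the real feature request, verbatim
        expected_files: tuple[str, ...]   # files/modules the reviewer expects the patch to touch
        verify_commands: tuple[str, ...]  # focused tests or reviewer-accepted checks
        guide_pages: tuple[str, ...]      # Confluence pages the solution must comply with

    task = PilotTask(
        task_id="PILOT-003",
        description="Add retry with backoff to the payment webhook consumer.",
        expected_files=("payments/src/main/java/WebhookConsumer.java",),
        verify_commands=("./gradlew :payments:test --tests '*WebhookConsumer*'",),
        guide_pages=("ENG/Resilience Patterns",),
    )
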
Controls

How to avoid a benchmark nobody trusts

  • Freeze tasks, guide rules, and scorer logic before either arm runs (see the freeze-and-randomize sketch after this list).
  • Use the same model family and version pin in every arm.
  • Randomize arm order and keep raw task rows available for audit.
  • Include the customer's current RAG/search if they have one.
  • Report K² platform cost separately from agent-token savings.
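
Two of these controls are easy to mechanize. A sketch, assuming the frozen tasks live in a pilot_tasks.json file (a hypothetical path): record a SHA-256 digest of the task set before either arm runs so later edits are detectable, and derive arm order per task from a seeded shuffle so it is random yet reproducible for audit.

    import hashlib
    import random

    def freeze_digest(path: str) -> str:
        # Publish this digest before any arm runs; re-hashing later proves nothing changed.
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def arm_order(task_id: str, arms: list[str], seed: int = 42) -> list[str]:
        # Deterministic per-task shuffle: random order, but auditable after the fact.
        rng = random.Random(f"{seed}:{task_id}")
        order = list(arms)
        rng.shuffle(order)
        return order

    print(freeze_digest("pilot_tasks.json"))
    print(arm_order("PILOT-003", ["current", "k2_mcp", "existing_rag"]))
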
Pilot arms and scoring

The pilot should compare K² against the customer's current workflow, not against a straw man.

Evaluation arms

  • Current workflow: The customer coding-agent workflow as used today, including their normal search or context-pasting practices.
  • K² workflow: The same coding agent connected to K² MCP over customer code, tests, Confluence, docs, and guardrails.
  • Existing RAG/search: Optional control arm if the customer already has an internal retrieval system.
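
How the paired runs might be driven and logged, as a sketch: run_pilot_arm is a hypothetical wrapper for however each workflow is actually invoked (not a real K² or coding-agent CLI), and raw per-task rows are appended to a CSV so the audit requirement above is satisfied by construction.

    import csv
    import subprocess
    import time

    def run_task(arm: str, task_id: str) -> dict:
        # run_pilot_arm is a placeholder wrapper around the actual workflow invocation.
        start = time.time()
        result = subprocess.run(["./run_pilot_arm", arm, task_id],
                                capture_output=True, text=True)
        return {"arm": arm, "task_id": task_id,
                "exit_code": result.returncode,
                "wall_seconds": round(time.time() - start, 1)}

    def append_row(row: dict, path: str = "pilot_rows.csv") -> None:
        # Raw task rows stay on disk so every run can be audited later.
        with open(path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(row))
            if f.tell() == 0:  # new file: write the header once
                writer.writeheader()
            writer.writerow(row)

    for arm in ("current", "k2_mcp"):
        append_row(run_task(arm, "PILOT-003"))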

Scoring rubric

  • Focused tests + build verification: 40%
  • Expected files/modules touched: 25%
  • Required behavior or diff-pattern checks: 15%
  • Confluence/internal guide compliance: 10%
  • Review scope and safety: 10%
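
The rubric reduces to a weighted sum per task. A minimal sketch, assuming the customer's reviewer scores each dimension from 0.0 to 1.0; the dimension keys are illustrative shorthand for the rows above.

    WEIGHTS = {
        "tests_and_build":  0.40,  # focused tests + build verification
        "expected_files":   0.25,  # expected files/modules touched
        "behavior_checks":  0.15,  # required behavior or diff-pattern checks
        "guide_compliance": 0.10,  # Confluence/internal guide compliance
        "review_scope":     0.10,  # review scope and safety
    }

    def task_score(checks: dict[str, float]) -> float:
        # Weighted sum over rubric dimensions; each check is a 0.0-1.0 reviewer score.
        assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
        return sum(WEIGHTS[k] * checks[k] for k in WEIGHTS)

    # Example: everything passes except partial guide compliance -> 0.95
    print(task_score({"tests_and_build": 1.0, "expected_files": 1.0,
                      "behavior_checks": 1.0, "guide_compliance": 0.5,
                      "review_scope": 1.0}))
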
Typical pilot timeline

Keep the evaluation bounded enough to run, but strict enough that the scorecard means something.

  • Week 1: Customer technical lead freezes tasks, guide checks, expected files, and scoring rubric.
  • Week 2: K² FDE ingests the agreed Confluence tree, Java module, neighboring tests, and docs.
  • Week 3: Paired runs compare the current workflow, the K² workflow, and any existing internal RAG/search arm.
  • Week 4: K² and the customer review accepted patches, rework, tokens, retrieval cost, and failure analysis.

After the pilot

If the pilot succeeds against the customer-frozen rubric, K² and the customer agree on a production scope: additional Java modules, Confluence trees, teams, indexing policy, and retention rules. The customer owns all configurations, tasks, corpora, and scorecards produced during the pilot, regardless of whether they continue.

  • Success condition: Accepted-patch lift, reduced review rework, demonstrably relevant retrieved sources, and visible TCO.
  • Production scope: Bounded rollout by Java module, Confluence tree, coding-agent group, and retrieval policy.
  • Exit story: Customer retains task definitions, scoring artifacts, and corpus configuration exports.
  • Disclosure: Named or anonymized case-study material is published only with customer approval.

Design partner framing

K² should publish a customer-run replication only after the customer's lead engineer confirms retrieved-source relevance and the customer approves any named or anonymized disclosure.
