Coming Soon

Agent Harness Framework

Enable AI agents to work effectively across multiple context windows. A two-part solution inspired by how human engineers collaborate across shifts.

The Core Challenge

AI agents must work in discrete sessions, and each new session begins with no memory of what came before. Imagine a software project staffed by engineers working in shifts, where each new engineer arrives with no recollection of what happened on the previous shift.

Complex projects cannot be completed within a single context window

Two-Part Solution

Agent Architecture

A specialised prompt for the first run, and incremental progress thereafter

Initialiser Agent

First run only

Setup

Sets up the environment with all necessary context for future sessions:

  • Creates init.sh script for environment setup
  • Writes claude-progress.txt for tracking
  • Generates feature_list.json with all requirements
  • Makes initial git commit with scaffolding
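
As a rough illustration of the setup step, the sketch below writes the two tracking files named above. The field names (id, description, passes), the starting text of the progress file, and the function names are assumptions made for this example, not a format defined by the framework. Creating init.sh and making the git commit are left out.

```python
import json
from pathlib import Path


def build_feature_list(requirements: list[str]) -> list[dict]:
    """Turn requirement strings into feature entries, all initially failing."""
    return [
        {"id": number, "description": text, "passes": False}
        for number, text in enumerate(requirements, start=1)
    ]


def write_scaffolding(project_dir: Path, requirements: list[str]) -> None:
    """Write feature_list.json and claude-progress.txt into project_dir."""
    project_dir.mkdir(parents=True, exist_ok=True)
    features = build_feature_list(requirements)
    (project_dir / "feature_list.json").write_text(json.dumps(features, indent=2))
    # Hypothetical starting content; later sessions would append summaries.
    (project_dir / "claude-progress.txt").write_text("No sessions completed yet.\n")
```

Starting every feature as failing means a later session has to demonstrate that a feature works before its status can change.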

Coding Agent

Every subsequent run

Execute

Makes incremental progress whilst leaving clean state:

  • Works on ONE feature at a time
  • Commits progress with descriptive messages
  • Updates progress file with summaries
  • Tests features end-to-end before marking done
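
The "one feature at a time" and "test before marking done" rules above could be sketched as follows. The dictionary keys (description, passes) and both function names are hypothetical, and the end-to-end test is passed in as a plain callable because the page does not specify how tests are run.

```python
from typing import Callable, Optional


def next_failing_feature(features: list[dict]) -> Optional[dict]:
    """Return the first feature whose "passes" flag is still False, or None."""
    return next((f for f in features if not f["passes"]), None)


def complete_feature(
    feature: dict,
    end_to_end_test: Callable[[dict], bool],
    progress_log: list[str],
) -> bool:
    """Mark a feature as passing only if its end-to-end test succeeds."""
    passed = end_to_end_test(feature)
    if passed:
        feature["passes"] = True
        progress_log.append(f"Completed: {feature['description']}")
    else:
        # Leave the flag untouched so the next session picks it up again.
        progress_log.append(f"Still failing: {feature['description']}")
    return passed
```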

Session Workflow

How Each Session Starts

A structured approach to getting up to speed quickly

1. Run pwd: See working directory
2. Read git logs: Check recent work
3. Read features: Choose priority task
4. Run init.sh: Start dev server
5. Test basics: Verify functionality
6. Build feature: Make progress
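
The first four steps are simple commands, so they can be sketched as an ordered list that a harness runs before handing control to the agent. This is a minimal sketch: the commands used for reading the feature list and launching init.sh are assumptions, as are the names SESSION_START_STEPS, start_session, and shell_runner. Steps 5 and 6 are agent work rather than fixed commands, so they are not modelled here.

```python
import subprocess
from typing import Callable

# (purpose, command) pairs in the order a session would run them.
SESSION_START_STEPS = [
    ("See working directory", "pwd"),
    ("Check recent work", "git log --oneline -20"),
    ("Choose priority task", "cat feature_list.json"),  # assumed command
    ("Start dev server", "./init.sh"),                  # assumed invocation
]


def start_session(run: Callable[[str], str]) -> dict[str, str]:
    """Run each orientation command in order and collect output by purpose."""
    return {purpose: run(command) for purpose, command in SESSION_START_STEPS}


def shell_runner(command: str) -> str:
    """Run a command in the local shell and return its standard output."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout
```

Calling start_session(shell_runner) would execute the commands for real; passing a stub function instead makes the sequence easy to check without touching the file system.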

Problem Solving

Failure Modes & Solutions

How the harness addresses common agent behaviours

Problem: Agent declares victory too early

  • Initialiser Agent: Set up feature list file with all requirements marked as failing
  • Coding Agent: Read feature list at session start, choose a single feature to work on

Problem: Agent leaves environment in broken state

  • Initialiser Agent: Create initial git repo and progress notes file
  • Coding Agent: Read progress notes and git logs, test dev server before starting work

Problem: Agent marks features done prematurely

  • Initialiser Agent: Set up structured feature list with pass/fail status
  • Coding Agent: Self-verify all features with end-to-end testing before marking complete

Problem: Agent spends time figuring out how to run the app

  • Initialiser Agent: Write an init.sh script that starts the development server
  • Coding Agent: Read init.sh at session start to quickly bootstrap environment

Example Output

Typical Session Start

What the agent does when beginning a new context window

session_output.log
[Assistant] I'll start by getting my bearings and understanding the current state.
[Tool Use] <bash - pwd>
[Tool Use] <read - claude-progress.txt>
[Tool Use] <read - feature_list.json>
[Assistant] Let me check the git log to see recent work.
[Tool Use] <bash - git log --oneline -20>
[Assistant] Now let me check if there's an init.sh script.
<Starts the development server>
[Assistant] Let me navigate to the application and verify fundamental features.
<Tests basic functionality>
[Assistant] Fundamental functionality is working. Now let me review the tests.json file.
<Starts work on a new feature>

Roadmap

What's Coming

Features we're building to bring Agent Harness to this platform

Configure Your Harness

Define custom initialiser prompts and coding agent behaviours tailored to your workflow.

In Development

Feature List Builder

Visual editor for creating structured feature requirements with pass/fail tracking.

Planned

Automated Testing

Integrate browser automation and unit tests for end-to-end feature verification.

Research

Agentic Intelligence

Beyond Traditional RAG: Agentic Retrieval

Autonomous AI agents that reason, plan, and adapt. Our agentic harness transforms simple retrieval into intelligent, multi-step problem-solving.

Diagram: a user query ("Analyse Q3 sales across all regions") passes through the agent harness, which is made up of:

  • Planning: Task Decomposition, Strategy Selection, Step Sequencing
  • LLM Reasoning: Claude Opus 4.5 / GPT-5.2
  • Tool Layer: Vector Search, Graph RAG, External APIs
  • Memory: Conversation History, Working Memory, Retrieved Context
  • Verification: Output Validation, Safety Guardrails, Quality Thresholds
  • Iteration Loop
  • Refined Response: Multi-source synthesis, Verified accuracy, Cited sources
Planning & Memory
Tools & Verification
Self-Correction Loop

Multi-Step Reasoning

Autonomous planning and decomposition of complex queries into manageable sub-tasks, enabling sophisticated problem-solving across multiple retrieval cycles.

Tool & Function Calling

Dynamic integration with external tools, APIs, and databases. The agent decides which tools to invoke based on query requirements.
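
One common way to structure this is a registry that maps tool names to functions, with the model's choice arriving as a name and an argument. The sketch below assumes that shape. The registry, the decorator, and both toy tools are hypothetical stand-ins and do not call any real search or database service.

```python
from typing import Callable

# Registry of available tools, keyed by the name the model would ask for.
TOOLS: dict[str, Callable[[str], str]] = {}


def tool(name: str) -> Callable[[Callable[[str], str]], Callable[[str], str]]:
    """Decorator that registers a function under the given tool name."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOLS[name] = fn
        return fn
    return register


@tool("vector_search")
def vector_search(query: str) -> str:
    """Toy stand-in for a vector search call."""
    return f"documents similar to '{query}'"


@tool("sql")
def sql(query: str) -> str:
    """Toy stand-in for a database query."""
    return f"rows for: {query}"


def call_tool(name: str, argument: str) -> str:
    """Dispatch to the tool the model selected, rejecting unknown names."""
    if name not in TOOLS:
        raise KeyError(f"Unknown tool: {name}")
    return TOOLS[name](argument)
```

Rejecting unknown names in one place keeps a mistaken tool request from the model from silently doing nothing.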

Self-Correction

Continuous reflection and refinement of outputs. When initial results are insufficient, the agent iterates until quality thresholds are met.
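
A minimal sketch of such a loop, assuming the harness has some way to generate a draft and some way to score it. Both are passed in as callables here because the page does not say how quality is measured. The function name, the 0.8 threshold, and the cap of three iterations are illustrative choices only.

```python
from typing import Callable


def refine_until_good(
    generate: Callable[[str, str], str],
    score: Callable[[str], float],
    query: str,
    threshold: float = 0.8,
    max_iterations: int = 3,
) -> tuple[str, int]:
    """Regenerate an answer, feeding back the previous draft, until it
    scores at or above the threshold or the iteration cap is reached.

    Returns the final draft and the number of iterations used.
    """
    draft = ""
    for iteration in range(1, max_iterations + 1):
        draft = generate(query, draft)
        if score(draft) >= threshold:
            return draft, iteration
    # Cap reached: return the last draft even though it is below threshold.
    return draft, max_iterations
```

The iteration cap matters in practice: without it, a scorer that can never be satisfied would keep the agent looping indefinitely.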

Context-Aware Retrieval

Adaptive retrieval strategies that consider conversation history, user intent, and document relevance to deliver precise results.

What is an Agent Harness?

The agent harness is the complete architectural system surrounding an LLM—everything except the model itself. It manages the entire context lifecycle: planning queries, orchestrating tool calls, maintaining memory, and verifying outputs.

Think of it as the "operating system" for your AI agent. Whilst the LLM provides reasoning capabilities, the harness provides structure, control, and reliability.

Planning & Decomposition

Breaks complex queries into executable steps

Memory & State Management

Maintains context across conversation turns

Verification & Guardrails

Ensures output quality and safety constraints

Tool Integration Layer

Orchestrates connections to external systems

Traditional RAG vs Agentic RAG

See how agentic capabilities transform the retrieval experience from simple question-answering to intelligent problem-solving.

1x

Traditional RAG

  • Single retrieval pass

    One query, one search, one response

  • Static context window

    Fixed chunk retrieval without adaptation

  • No self-correction

    Cannot identify or fix retrieval failures

  • Limited to vector similarity

    Cannot leverage external tools or APIs

N+

Agentic RAG

  • Multi-step reasoning

    Iterative retrieval with query refinement

  • Dynamic context management

    Adapts retrieval strategy per query

  • Self-correction & reflection

    Detects gaps and re-queries as needed

  • Tool & function calling

    Integrates databases, APIs, calculators

Use Case Example

"Analyse our Q3 performance and recommend improvements"

STEP 1
Plan & Decompose

Agent breaks query into sub-tasks: retrieve Q3 data, compare to Q2, identify patterns

STEP 2
Multi-Source Retrieval

Queries vector DB, calls SQL database, fetches external market data

STEP 3
Verify & Iterate

Checks completeness, identifies gaps in regional data, performs additional retrieval

STEP 4
Synthesise & Respond

Combines insights into actionable recommendations with cited sources
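
The four steps can be sketched end to end with toy data. Everything in this example is an assumption made for illustration: the plan is hard-coded rather than produced by a model, each source is a plain dictionary keyed by sub-task, and the "additional retrieval" in step 3 is modelled as a single retry against a fallback set of sources.

```python
def plan(query: str) -> list[str]:
    """Step 1: split the request into sub-tasks (hard-coded for illustration)."""
    return ["retrieve Q3 data", "compare to Q2", "identify patterns"]


def retrieve(sub_task: str, sources: dict[str, dict[str, str]]) -> list[tuple[str, str]]:
    """Step 2: collect (source name, snippet) pairs from every source
    that holds data for this sub-task."""
    return [(name, data[sub_task]) for name, data in sources.items() if sub_task in data]


def run_pipeline(
    query: str,
    sources: dict[str, dict[str, str]],
    fallback: dict[str, dict[str, str]],
) -> str:
    """Run plan, retrieve, verify, and synthesise for one query."""
    findings: dict[str, list[tuple[str, str]]] = {}
    for sub_task in plan(query):
        hits = retrieve(sub_task, sources)
        if not hits:
            # Step 3: a gap was found, so try the fallback sources once.
            hits = retrieve(sub_task, fallback)
        findings[sub_task] = hits

    # Step 4: combine findings into one response, citing each source.
    lines = []
    for sub_task, hits in findings.items():
        if hits:
            cited = "; ".join(f"{text} [{name}]" for name, text in hits)
        else:
            cited = "no data found"
        lines.append(f"{sub_task}: {cited}")
    return "\n".join(lines)
```

Reporting "no data found" for an unfilled gap, rather than dropping the sub-task, keeps the final response honest about what the retrieval could not cover.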

Enterprise Applications

Real-World Use Cases

Discover how enterprise organisations leverage the Agent Harness Framework to automate complex, time-intensive workflows that previously required significant manual effort.

Quarterly Financial Report Generation

2-4 Hours

Automate the creation of comprehensive Q1-Q4 financial reports by leveraging your organisation's RAG database containing historical financial data, market analysis, and company performance metrics.

How It Works

  • Initialiser agent scans RAG database for relevant financial documents
  • Coding agent generates structured report sections incrementally
  • Progress file tracks completed sections with verification status
  • Final output includes executive summary, charts, and recommendations

Key Benefits

  • Reduces report generation time from 2 weeks to 4 hours
  • Ensures consistency across quarterly reports
  • Automatic cross-referencing of data sources
  • Git-tracked revisions for audit trail compliance

Legacy Codebase Modernisation

8-24 Hours

Transform legacy systems written in COBOL, Java 6, or older frameworks into modern, maintainable codebases. The agent systematically analyses, refactors, and tests thousands of files whilst preserving business logic.

Migration Process

  • Analyses existing codebase architecture and dependencies
  • Creates feature list mapping legacy to modern patterns
  • Incrementally converts modules with automated testing
  • Maintains backwards compatibility during transition

Transformation Capabilities

  • COBOL to modern Java/Python conversion
  • Monolith to microservices decomposition
  • Framework upgrades (AngularJS 1.x to Angular 17+)
  • Database migration (Oracle to PostgreSQL)

Compliance Audit Documentation

4-8 Hours

Generate comprehensive compliance documentation for SOX, GDPR, HIPAA, and ISO 27001 audits. The agent analyses your systems, policies, and controls to produce audit-ready documentation packages.

Documentation Workflow

  • Scans existing policies and control documentation
  • Maps controls to regulatory framework requirements
  • Identifies gaps and generates remediation recommendations
  • Produces formatted audit evidence packages

Supported Frameworks

SOX, GDPR, HIPAA, ISO 27001, SOC 2, PCI DSS, NIST CSF, FedRAMP

The agent maintains up-to-date knowledge of regulatory requirements and automatically incorporates framework updates into documentation.

Defence Safety Case Creation

6-12 Hours

Generate comprehensive safety cases compliant with Def Stan 00-056 and MOD requirements. The agent creates structured safety arguments using Goal Structuring Notation (GSN), compiles evidence, and produces documentation ready for expert review and Defence Safety Authority approval.

Safety Case Workflow

  • Identifies hazards and potential accidents across system scope
  • Generates GSN diagrams with goals, strategies, and evidence links
  • Compiles Hazard Log with ALARP risk assessments
  • Produces Safety Case Report for expert review and sign-off

Deliverables

  • Structured safety argument in GSN
  • Complete Hazard Log with risk mitigation evidence
  • Safety Case Maturity Tool (SCMT) assessment
  • DSA-ready documentation package

Supported Standards & Frameworks

Def Stan 00-056, Def Stan 00-055, GSN Community Standard, ALARP Methodology, MIL-STD-882, IEC 61508, DO-178C, ISO 26262

The agent maintains comprehensive knowledge of MOD safety requirements, Defence Safety Authority guidelines, and international safety standards to ensure compliant documentation for expert review.

85%

Average Time Reduction

99.2%

Task Completion Rate

24/7

Autonomous Operation

100%

Audit Trail Coverage

Ready to Transform Your Agent Workflows?

The Agent Harness Framework represents a significant step forward in enabling AI agents to tackle complex, long-running tasks. Stay tuned for our implementation.

Based on "Effective harnesses for long-running agents" by Anthropic Engineering