
The Self-Evolving Harness

A cognition layer bolted on top of Friday that turns "an LLM with tools" into a system that sets goals, plans, verifies, experiments and measures whether it is actually getting better.

1. Why this harness

Before this upgrade, Friday could already talk, remember, schedule tasks, read email, reflect on the day and even infer preferences from repeated corrections. That is a lot — but all of it is reactive. The system waits for a message, does what it's told, optionally learns a trivia item, and goes quiet.

A practical AGI-like system has to learn through goal-directed action:

goal → hypothesis → plan → execution → verification → reward/punishment → model update

None of that was represented in Friday. There was no persistent goal object. No plan tree. No notion of what Friday believed, with what confidence, where the belief came from, or when it should expire. No separation between "I think this" and "I verified this". No place to A/B test alternatives. No metrics that told us whether the system was actually improving or just getting more talkative.

The harness fills those gaps. It is a thin, entirely additive layer — no existing tables or endpoints were removed — that gives Friday the scaffolding to act as if it were trying to get better, and the receipts to check whether it is.

2. The blueprint

Everything lives in the same SQLite database that the Friday memory API (~/proyectos/memory-graph) was already using. The API gained 13 new tables and roughly 60 new endpoints across 8 subsystems:

| Subsystem | Tables | Purpose |
|---|---|---|
| Goal engine | goals | Persistent intentions with utility, deadline, constraints, success criteria, subgoals, progress. |
| Planner | plan_tree | Hierarchical trees: goal → sub-goal → action → tool → expected result → exit condition → rollback. |
| Self-knowledge | capabilities, autonomy_levels | What Friday can do, with what calibrated confidence, at what cost, under what supervision. |
| Three-layer memory | existing memories, entities, skills + new columns | Episodic / semantic / procedural split, with provenance, confidence and decay per row. |
| Causal world model | wm_entities, wm_relations, wm_events, wm_predictions | Structured state, subject-predicate-object facts, events with causes & effects, testable predictions with calibration gap. |
| Safety | verifications, sandbox_executions | Explicit fact/goal/hallucination checks, and dry-run/simulation before any live action. |
| Learning | experiments (plus extended skills) | A/B variants with minimum delta & minimum sample guardrails; skills have maturity (draft → beta → stable → deprecated) with promotion rules. |
| Metrics | metrics | Catalog of 11 KPIs that tell us whether the system is actually improving: hallucination rate, calibration gap, goals completed, skill success, etc. |

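Design rule. Everything is additive. No existing table was dropped. No existing endpoint was broken. Every new column ships as an ALTER TABLE … ADD COLUMN that runs only if the column does not already exist (SQLite has no ADD COLUMN IF NOT EXISTS clause, so the migration has to check first), so older Friday databases keep working and just gain the new fields.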
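One way to keep a column migration additive and re-runnable is to check the table's existing columns before altering it, since SQLite itself does not accept IF NOT EXISTS on ADD COLUMN. The sketch below is an illustration, not Friday's actual migration code: the helper name add_column_if_missing and the example column are assumptions.

```python
import sqlite3

def add_column_if_missing(conn, table, column, decl):
    """Add `column` to `table` unless it already exists. Returns True if added.

    Hypothetical helper. `table` and `column` are interpolated into SQL,
    so they must come from trusted migration code, never from user input.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    return True

# Simulate an older database that only has the original columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, content TEXT)")

# Running the migration twice is safe: the second call is a no-op.
first = add_column_if_missing(conn, "memories", "confidence", "REAL DEFAULT 0.5")
second = add_column_if_missing(conn, "memories", "confidence", "REAL DEFAULT 0.5")
columns = [row[1] for row in conn.execute("PRAGMA table_info(memories)")]
```

Because the check and the ALTER are both idempotent from the caller's point of view, the same migration script can run against old and new databases alike.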

3. Goal engine & hierarchical plans

Before this harness, "what Friday is trying to do right now" lived in the conversation buffer. Now it is a first-class row:

{
  "goal_id": "g_2041",
  "title": "Get 3 qualified leads for product X",
  "utility": 0.83,
  "deadline": "2026-05-10",
  "constraints": ["no spending", "no spam"],
  "success_criteria": ["3 positive replies", "1 meeting booked"],
  "subgoals": ["identify niches", "build contact list", "draft outreach", "follow up"],
  "risk_tier": "medium",
  "autonomy_level": 2,
  "status": "active"
}

The endpoint GET /goal/next ranks active goals by utility × urgency × (1 − progress), where urgency grows as the deadline approaches. A daily cron at 09:37 flags any goal whose deadline is less than 3 days away or that has shown no progress for more than 5 days.
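The ranking can be sketched in a few lines. The product utility × urgency × (1 − progress) comes from the text above; the shape of the urgency curve (a linear ramp over a 30-day horizon with a 0.1 floor) and the function names are assumptions made only for illustration.

```python
from datetime import date

def urgency(deadline, today, horizon_days=30):
    """Assumed curve: 0.1 when the deadline is far away, rising linearly to 1.0 at or past it."""
    days_left = (deadline - today).days
    if days_left <= 0:
        return 1.0
    return max(0.1, 1.0 - days_left / horizon_days)

def score(goal, today):
    """utility x urgency x (1 - progress), as described for GET /goal/next."""
    return goal["utility"] * urgency(goal["deadline"], today) * (1.0 - goal["progress"])

def next_goal(goals, today):
    """Return the highest-scoring active goal, or None if there is none."""
    active = [g for g in goals if g["status"] == "active"]
    return max(active, key=lambda g: score(g, today), default=None)

today = date(2026, 5, 1)
goals = [
    {"goal_id": "g_2041", "utility": 0.83, "deadline": date(2026, 5, 10),
     "progress": 0.25, "status": "active"},
    {"goal_id": "g_far", "utility": 0.90, "deadline": date(2026, 6, 30),
     "progress": 0.0, "status": "active"},
    {"goal_id": "g_done", "utility": 0.99, "deadline": date(2026, 5, 2),
     "progress": 1.0, "status": "completed"},
]
best = next_goal(goals, today)
```

In this example the nearer deadline outweighs the slightly higher utility of the distant goal, so g_2041 is suggested first.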

Each non-trivial goal gets a plan tree. A node carries node_type, tool, expected_result, exit_condition and rollback. Plans are not text — they are executable structures, so we can compare two plans for the same goal, reuse sub-trees, and observe which kinds of decomposition actually work.
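To make "plans are structures, not text" concrete, here is a minimal sketch of a plan node. The fields node_type, tool, expected_result, exit_condition and rollback are the ones named above; title, children and the two helper functions are hypothetical additions for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PlanNode:
    node_type: str                      # e.g. "goal", "subgoal", "action"
    title: str                          # assumed field, for readability
    tool: Optional[str] = None
    expected_result: Optional[str] = None
    exit_condition: Optional[str] = None
    rollback: Optional[str] = None
    children: List["PlanNode"] = field(default_factory=list)  # assumed field

def actions(node):
    """Yield action nodes in depth-first (execution) order."""
    if node.node_type == "action":
        yield node
    for child in node.children:
        yield from actions(child)

def rollback_steps(node):
    """Rollbacks for the plan's actions, in reverse execution order."""
    return [a.rollback for a in reversed(list(actions(node))) if a.rollback]

plan = PlanNode("goal", "Get 3 qualified leads", children=[
    PlanNode("subgoal", "build contact list", children=[
        PlanNode("action", "create draft list", tool="spreadsheet",
                 expected_result="list with 20 rows",
                 exit_condition="20 rows or 1 hour",
                 rollback="delete draft list"),
    ]),
    PlanNode("subgoal", "draft outreach", children=[
        PlanNode("action", "write email draft", tool="email",
                 expected_result="1 draft saved",
                 exit_condition="draft saved",
                 rollback="discard email draft"),
    ]),
])
steps = rollback_steps(plan)
```

Because sub-trees are ordinary objects, the same "build contact list" branch could be attached to a different goal, and two plans for one goal can be compared node by node.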

4. Three-layer memory

The memory API used to expose a single memories table with embeddings, FTS5 and a hybrid RRF ranker. Good for retrieval. Not enough for cognition. The layers are now episodic, semantic (memories plus identity entities) and procedural (skills with maturity).

Two cross-cutting concepts are added to every layer: provenance, and confidence with decay, both stored per row.

5. Causal world model

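The previous world_model table stored loose observations ("Bruno is more responsive at night"). The harness promotes the idea from correlation-blob to causal structure: structured state in wm_entities, subject-predicate-object facts in wm_relations, events with causes and effects in wm_events, and testable predictions with a calibration gap in wm_predictions.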

The old world_model table is not deleted — it becomes a soft observation inbox. When an observation earns structure (becomes testable, causal, or S-P-O), it is promoted with POST /worldmodel/<id>/promote. The source row stays so the audit chain is preserved (promoted_to + promoted_ref), and the promoted row carries provenance: ["worldmodel:<id>"].
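The promotion step can be sketched as a pure function over rows. The fields promoted_to, promoted_ref and the provenance format are taken from the description above; the function name, the row layout and the example values are assumptions, not the actual handler behind POST /worldmodel/<id>/promote.

```python
def promote(observation, target_table, new_id, fields):
    """Return (updated source row, promoted row). The source row is kept, not deleted."""
    promoted = dict(fields)
    promoted["id"] = new_id
    # The promoted row points back at the soft observation it came from.
    promoted["provenance"] = [f"worldmodel:{observation['id']}"]
    # The source row records where it was promoted to, preserving the audit chain.
    source = dict(observation, promoted_to=target_table, promoted_ref=new_id)
    return source, promoted

observation = {"id": 17, "text": "Bruno is more responsive at night"}
source, relation = promote(
    observation,
    target_table="wm_relations",
    new_id=203,
    fields={"subject": "Bruno", "predicate": "more_responsive_during", "object": "night"},
)
```

Both rows survive the promotion, so an auditor can walk from the structured fact back to the original observation and forward again.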

6. Self-knowledge & autonomy levels

A system that doesn't know what it's good at will over-reach. The capabilities table gives Friday a live self-portrait:

{
  "name": "coding",
  "domain": "engineering",
  "confidence": 0.58,       // Bayesian blend: prior 0.5 (weight 5) + observed rate
  "success_count": 4,
  "failure_count": 1,
  "cost_avg": 0.0020,       // rolling average
  "time_avg_sec": 14.2,
  "error_types": ["syntax_error", "import_missing"],
  "autonomy_max": 2,
  "max_risk_tier": "medium",
  "supervision_needed": true
}

After every task, POST /capability/<name>/record updates these fields. The confidence formula is a Bayesian blend so a single failure doesn't tank a long track record, and a single lucky success doesn't inflate it.
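The text gives the prior (0.5) and its weight (5) but not the exact formula, so the following is only one plausible reading: treat the prior as five pseudo-observations with a 50% success rate and blend them with the observed counts. The function name is hypothetical, and the numbers it produces need not match the sample row above.

```python
def blended_confidence(successes, failures, prior=0.5, prior_weight=5):
    """Assumed blend: the prior counts as `prior_weight` pseudo-observations."""
    total = successes + failures
    return (prior * prior_weight + successes) / (prior_weight + total)

no_history = blended_confidence(0, 0)          # falls back to the prior
one_lucky_success = blended_confidence(1, 0)   # moves up only slightly
long_record = blended_confidence(50, 0)
long_record_one_failure = blended_confidence(50, 1)
```

With this blend a single success on an empty record lifts confidence from 0.5 to roughly 0.58, while one failure after fifty successes lowers it by less than 0.02, which is the damping behaviour the paragraph above describes.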

On top sits a 6-rung autonomy ladder:

L0 · Suggest only
L1 · Sandbox
L2 · Low-risk act
L3 · Bounded act
L4 · Long chain w/ checkpoints
L5 · Self-modify (rollback required)

The gate is POST /autonomy/check. Given {capability, proposed_level, risk_tier}, it returns allowed: false with a reason whenever the proposed level exceeds the capability's own autonomy_max or the risk tier exceeds that level's max_risk_tier. No unrecorded jumps in autonomy.
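The two conditions of the gate can be sketched as follows. The response shape (allowed plus a reason) and the field autonomy_max come from the text; the risk ordering and the per-level ceilings in LEVEL_MAX_RISK are invented placeholder values, since the real ones live in the autonomy_levels table.

```python
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}

# Hypothetical ceilings per autonomy level; the real values are not given in the text.
LEVEL_MAX_RISK = {0: "high", 1: "high", 2: "low", 3: "medium", 4: "medium", 5: "medium"}

def autonomy_check(capability, proposed_level, risk_tier, level_max_risk=LEVEL_MAX_RISK):
    """Return {"allowed": bool, "reason": str} for a proposed action."""
    if proposed_level > capability["autonomy_max"]:
        return {"allowed": False,
                "reason": f"level {proposed_level} exceeds autonomy_max "
                          f"{capability['autonomy_max']} for {capability['name']}"}
    ceiling = level_max_risk[proposed_level]
    if RISK_ORDER[risk_tier] > RISK_ORDER[ceiling]:
        return {"allowed": False,
                "reason": f"risk tier {risk_tier} exceeds max_risk_tier {ceiling} "
                          f"for level {proposed_level}"}
    return {"allowed": True, "reason": "within capability and level limits"}

coding = {"name": "coding", "autonomy_max": 2}
too_high = autonomy_check(coding, proposed_level=3, risk_tier="low")
too_risky = autonomy_check(coding, proposed_level=2, risk_tier="medium")
ok = autonomy_check(coding, proposed_level=2, risk_tier="low")
```

Returning a reason on every refusal is what makes the gate auditable: the caller can store the response as-is.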

7. Verifier and sandbox

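Agents look very intelligent right up until they confidently commit a mistake. Two tables address that: verifications, which records explicit fact, goal and hallucination checks, and sandbox_executions, which records the dry-run or simulation made before any live action.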

Rule of thumb embedded in CLAUDE.md: without evidence, don't act; without verification, don't learn from that action as if it had been correct.

8. Experiments and skill compiler

Reflection and preference learning are useful but they measure yesterday's impressions. The experiment engine measures cause and effect:

POST /experiment {
  "hypothesis": "short reminders get more replies than long ones",
  "metric": "response_rate",
  "variants": [{"name":"short"}, {"name":"long"}],
  "min_delta": 0.1,
  "min_samples": 3
}
POST /experiment/<id>/observation {"variant":"short","value":0.8}
…
PATCH /experiment/<id>/conclude {}   // auto-picks winner only if delta > threshold AND samples > threshold

If the delta or sample count isn't enough, the experiment concludes as inconclusive — no winner declared. "Prove it improves, don't believe it."
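The conclude step can be sketched as a function over the recorded observations. The names min_delta and min_samples come from the request above; comparing the two best variant means, and treating both thresholds as "at least", are assumptions of this sketch rather than the documented behaviour of PATCH /experiment/<id>/conclude.

```python
from statistics import mean

INCONCLUSIVE = {"status": "inconclusive", "winner": None}

def conclude(observations, min_delta, min_samples):
    """observations maps variant name -> list of metric values."""
    # Guardrail 1: every variant needs enough samples.
    if len(observations) < 2 or any(len(v) < min_samples for v in observations.values()):
        return dict(INCONCLUSIVE)
    ranked = sorted(((mean(v), name) for name, v in observations.items()), reverse=True)
    (best, best_name), (runner_up, _) = ranked[0], ranked[1]
    # Guardrail 2: the lead over the runner-up must reach the minimum delta.
    if best - runner_up < min_delta:
        return dict(INCONCLUSIVE)
    return {"status": "concluded", "winner": best_name}

clear_win = conclude({"short": [0.8, 0.7, 0.9], "long": [0.5, 0.6, 0.4]},
                     min_delta=0.1, min_samples=3)
too_few = conclude({"short": [0.8, 0.9], "long": [0.5, 0.6, 0.4]},
                   min_delta=0.1, min_samples=3)
too_close = conclude({"short": [0.55, 0.55, 0.55], "long": [0.5, 0.5, 0.5]},
                     min_delta=0.1, min_samples=3)
```

Only the first call declares a winner; the other two fail one guardrail each and stay inconclusive.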

The skill compiler extends the existing skills table with preconditions, tools, success/failure counts, cost, time averages, failure domains, tests and a maturity field. Promotion is gated: a skill moves along draft → beta → stable → deprecated only when its promotion rules are met.

A nightly cron at 02:37 walks the skills table and applies these rules automatically.

9. Metrics that prove improvement

If nothing is measured, "it feels like Friday is getting smarter" is just that — a feeling. The catalog of 11 KPIs:

| KPI | What it tells us |
|---|---|
| tasks_solved_no_correction_pct | Did the user accept the first answer? |
| hallucination_rate | Self-reported "I was wrong" rate over total assistant messages. |
| time_to_complete_goal_sec | From goal creation to goal completion. |
| skill_reuse_rate | Are the skills we compile actually being picked up again? |
| skill_success_rate | Average success across all stable skills. |
| world_model_precision | Proxy: average capability confidence. Sanity floor. |
| calibration_gap | Average \|confidence − outcome\| on resolved predictions. |
| actions_reverted_pct | Live actions that needed a rollback. |
| cost_per_useful_task | $ per task the user accepted. |
| goals_completed_per_week | Throughput of closed goals. |
| approved_improvements_effective_pct | Of the self-improvement proposals Bruno approved, how many actually moved a KPI. |
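As a worked example of one KPI, calibration_gap is the mean absolute difference between stated confidence and the actual outcome, taken over resolved predictions only. The row layout below (outcome as 1, 0 or None for unresolved) is an assumption about how wm_predictions might be read.

```python
def calibration_gap(predictions):
    """Mean |confidence - outcome| over resolved predictions, or None if none are resolved."""
    resolved = [p for p in predictions if p["outcome"] is not None]
    if not resolved:
        return None
    return sum(abs(p["confidence"] - float(p["outcome"])) for p in resolved) / len(resolved)

predictions = [
    {"confidence": 0.9, "outcome": 1},     # confident and right: gap 0.1
    {"confidence": 0.7, "outcome": 0},     # confident and wrong: gap 0.7
    {"confidence": 0.6, "outcome": None},  # unresolved: ignored
]
gap = calibration_gap(predictions)
```

Here the gap is about 0.4. A lower value means stated confidence tracks reality more closely.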

A daily 22:23 cron computes the day's values and posts them to /metric. /metric/summary returns latest value + 7-day min / avg / max per metric.
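The summary for a single metric can be sketched as below. The output fields (latest value plus 7-day min, avg and max) follow the description above; the sample layout and the choice to count the window as the seven calendar days ending today are assumptions.

```python
from datetime import date, timedelta

def summarize(samples, today, window_days=7):
    """samples: list of (day, value) pairs for one metric."""
    if not samples:
        return None
    latest = max(samples, key=lambda s: s[0])[1]
    cutoff = today - timedelta(days=window_days)
    window = [value for day, value in samples if cutoff < day <= today]
    if not window:
        return {"latest": latest, "min": None, "avg": None, "max": None}
    return {"latest": latest, "min": min(window),
            "avg": sum(window) / len(window), "max": max(window)}

today = date(2026, 5, 8)
samples = [
    (date(2026, 4, 28), 0.30),  # older than the window, ignored for min/avg/max
    (date(2026, 5, 6), 0.10),
    (date(2026, 5, 7), 0.20),
    (date(2026, 5, 8), 0.15),
]
summary = summarize(samples, today)
```

Keeping the latest value separate from the window statistics makes it easy to see whether today is an outlier against the week.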

10. Wiring: how Friday actually uses it

Infrastructure with no caller is dead code. The harness is wired via two mechanisms: rules in CLAUDE.md and crons that exercise the tables automatically.

Ten rules in CLAUDE.md, summarized:

  1. Before any task that takes more than ~3 tool calls, create a goal + plan.
  2. Before any risky action (medium+ risk), call /autonomy/check.
  3. After any task, record outcome on the matching capability.
  4. Before closing important work, post a verification.
  5. Before irreversible actions, run a sandbox dry-run first.
  6. When claiming the future, log a prediction; resolve it later.
  7. When choosing between approaches repeatedly, run an experiment.
  8. Record the day's KPIs.
  9. Compile recurring successful patterns as skills; only promote once evidence is there.
  10. Every stable belief carries provenance; decay runs weekly.

Fifteen crons keep the lights on — ten inherited from Friday-v1 (email check, cron watchdog, daily briefing, heartbeat, monthly usage, reflection, preference learning, AI model monitor, memory API health, weekly summarization) and five new ones wired for the harness:

09:37 · Goal prioritizer
22:23 · Daily metrics
21:53 · Predictions resolver
02:37 · Skill promotion
Sun 05:17 · Memory decay

11. The Brain dashboard

Auditability is a feature. The Memory Graph dashboard collapses every subsystem above into one scrollable page — Brain — with a sticky sub-nav so the user can jump to any subsystem instantly. Eight sections:

  1. Overview — twelve count cards plus six KPI cards sourced from /metric/summary.
  2. Goals & Plans — active goals, next-best suggestion, plan trees with status-colored nodes and rollback annotations.
  3. Memory — three responsive columns: episodic / semantic (memories + identity entities) / procedural (skills with maturity).
  4. World Model — structured entities, S-P-O relations, events, predictions with calibration-gap coloring.
  5. Self-knowledge — autonomy level reference grid + capability cards with calibrated confidence.
  6. Safety — verifications and sandbox executions side by side.
  7. Learning — experiments at the top, then preferences, reflections, soft observations, insights, proposals and keywords in a responsive grid.
  8. Metrics — KPI catalog + full sample list.
[Figure: Brain dashboard showing the Overview, Goals & Plans, and three-layer Memory sections]
Brain — the single audit surface for everything the harness touches. Sub-nav pills at the top scroll to each subsystem.

12. The Crons dashboard

The harness only works if its 15 cron jobs are actually running. Because Claude Code crons are session-local (they die with the process), drift is possible: the disk has a prompt, but the runtime doesn't have the job. The Crons tab is a two-column diff: runtime jobs with live countdowns on the left, persisted prompts on the right, each marked with its sync status.
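The drift check itself reduces to a set comparison between the prompts persisted on disk and the jobs the runtime reports. The function and job names below are hypothetical; how the two lists are actually obtained is outside this sketch.

```python
def cron_drift(persisted, runtime):
    """Compare persisted cron prompts with runtime jobs by name."""
    persisted, runtime = set(persisted), set(runtime)
    return {
        "synced": sorted(persisted & runtime),
        "missing_in_runtime": sorted(persisted - runtime),   # prompt on disk, no live job
        "orphaned_in_runtime": sorted(runtime - persisted),  # live job, no prompt on disk
    }

persisted = ["goal_prioritizer", "daily_metrics", "skill_promotion"]
runtime = ["goal_prioritizer", "daily_metrics"]
drift = cron_drift(persisted, runtime)
```

In this example skill_promotion has a prompt on disk but no live job, which is exactly the drift the dashboard is meant to surface.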

[Figure: Crons dashboard showing 15 runtime-active crons with live countdowns and 15 persisted prompts, all synced]
Crons: on the left, 15 runtime jobs with live countdowns; on the right, 15 persisted prompts, all marked sincronizado (synced).

13. What this buys us

A system like Friday starts to feel AGI-like not when it talks better or integrates more APIs, but when it can set goals, plan, verify, experiment and measure whether it is actually getting better.

The harness doesn't make Friday that system overnight. But it installs the scaffolding: every decision now leaves a trace in one of these tables, and every claim carries provenance, confidence and an expiration date. That is what makes the difference between "an LLM with tools" and "a system that can be said to have learned something today".

Golden rule in CLAUDE.md: no unrecorded autonomy. Every operational decision — a goal created, a plan node executed, an action sandboxed, a prediction resolved, a skill promoted — leaves a row. The dashboard is where a human audits whether the system is earning its autonomy, one row at a time.