diff --git a/REPORT.md b/REPORT.md
index 92bf3c6..28d6548 100644
--- a/REPORT.md
+++ b/REPORT.md
@@ -4,33 +4,49 @@
 
 ---
 
-# REPORT — Hub: ingest agent pbs_snapshots (v0.7.4) (2026-06-09)
+# REPORT — Hub: restore-test "passed with warnings" visibility (v0.7.5) (2026-06-09)
 
 ## Outcome
 
-**Code committed + pushed (changelogged as `v0.7.4`); image build/deploy deferred to an
-operator decision.** The felhom-agent slice-6 Phase B work populates the host-report's
-`pbs_snapshots` (PBS offsite inventory + per-snapshot verify-state). This is the hub half:
-accept + persist them. Minimal — the authoritative offsite policy is hub-owned (slice 10).
+**Phase B (hub half) of `TASK — Restore-test must not false-fail on benign start warnings`.**
+The agent (v0.7.0, already deployed + live-validated) now treats a benign guest-start advisory
+(e.g. `WARN: Systemd 257 detected. You may need to enable nesting.`) as a PASS — verdict is
+liveness, not the start-task exitstatus — and carries the warning text on the wire. This is the
+hub half: ingest those fields and make a passed-with-warnings restore-test visible to the
+operator instead of indistinguishable from a clean pass.
 
-## What landed (`hub/internal/api/handler.go`, `host_test.go`, golden)
+## What landed (`hub/internal/api/handler.go`, golden, `host_test.go`)
 
-- `hostReportPayload` gains a `hostPBSSnapshot` mirror struct matching the agent's
-  `hub.PBSSnapshot` field-for-field, persisted via the existing `report_json` column.
-- The handler logs a **FAILED PBS verify prominently** (`[WARN]` — the loudest offsite-DR
-  signal); the host-report info line now counts pbs-snapshots too.
-- The shared `testdata/host-report.golden.json` carries a populated `pbs_snapshots[0]`,
-  **byte-identical** with felhom-agent's copy; `TestHostPBSSnapshot_GoldenContract` is the
-  hub half of the bidirectional key-set test. `go test ./internal/api/` is green.
+- **Wire mirror:** `hostRestoreTest` gains `warnings []string` + `warnings_recognized bool`
+  (`omitempty`), matching the agent's `hub.RestoreTest` field-for-field. An absent
+  `warnings_recognized` ⇒ `false` ⇒ the **louder** unrecognized path, so a missing flag can only
+  over-notice, never hide a real warning.
+- **Ingest behaviour:** a passed restore-test that carried warnings now logs
+  `[INFO] restore-test passed WITH WARNINGS (recognized)` when every warning is the known-benign
+  anchor, escalated to `[WARN] … UNRECOGNIZED WARNINGS` otherwise (as loud as a failed PBS
+  verify). A FAILED restore-test still logs the existing `[WARN] … FAILED`.
+- **Contract:** `restore_tests[0]` in the host-report golden gains the two keys; the golden stays
+  **byte-identical** with felhom-agent's copy (sha256 `e6999d77…`), and the bidirectional
+  key-set contract test round-trips the new keys through `hostRestoreTest`. `go test ./...` green.
+
+## Scope note — no dashboard widget this slice
+
+The task asked to "surface in the dashboard distinctly from a clean pass." The hub web layer
+currently renders **only controller-report data** — there is no host-domain dashboard surface
+yet (guests/storage/restore_tests/pbs_snapshots are log+persist only; the failed-PBS-verify
+signal is likewise log-only). Building one is out of scope here; distinct dashboard treatment
+should land with the host-domain dashboard (slice 10). The operator signal this slice is the log
+line, consistent with the established failed-PBS-verify precedent.
 
 ## Backward compatibility
 
-An agent that omits/empties `pbs_snapshots` is accepted unchanged. The legacy controller
-report path is untouched (frozen until the slice-10 cutover).
+An agent that omits/empties `warnings`/`warnings_recognized` is accepted unchanged (the deployed
+v0.7.4 hub already ignores them). The legacy controller report path is untouched.
 
-## Deploy
+## Deploy (GitOps)
 
-> Per the GitOps flow (`CLAUDE.md`): build+push `gitea.dooplex.hu/admin/felhom-hub:v0.7.4`,
-> bump `manifests/hub.yaml`, commit, then sync the `felhom` ArgoCD app. **Deferred** at this
-> checkpoint — the change is backward-compatible, so the live hub (v0.7.3) keeps ingesting
-> host-reports fine until then.
+Build+push `gitea.dooplex.hu/admin/felhom-hub:v0.7.5` → bump the `image:` tag in
+`manifests/hub.yaml` → commit → sync the `felhom` ArgoCD app (auto-sync off). Live-validated
+after sync: the demo host's restore-test (agent v0.7.0, which passes-with-recognized-warnings on
+the Debian-13 guest 9999) reflects on the hub as `passed WITH WARNINGS (recognized)` — not a
+plain pass and not a FAILED.
diff --git a/hub/CHANGELOG.md b/hub/CHANGELOG.md
index 64febb0..9215e4b 100644
--- a/hub/CHANGELOG.md
+++ b/hub/CHANGELOG.md
@@ -1,5 +1,35 @@
 # Felhom Hub — Changelog
 
+## v0.7.5 — restore-test "passed with warnings" visibility (2026-06-09)
+
+Hub half of `TASK — Restore-test must not false-fail on benign start warnings` (Phase B). The
+agent (v0.7.0) now treats a guest-start advisory like the systemd-nesting warning as a PASS
+(verdict is liveness, not the start exitstatus) and carries the warning text on the wire. This
+makes that visible to the operator instead of indistinguishable from a clean pass.
+
+### Added
+- `hostRestoreTest.warnings` (`[]string`) + `warnings_recognized` (`bool`) mirror fields, matching
+  the agent's `hub.RestoreTest` wire contract (`omitempty`; an absent `warnings_recognized` ⇒
+  `false` ⇒ treated as the louder unrecognized case — a missing flag can only over-notice).
+
+### Changed
+- Host-report ingest now surfaces a **passed** restore-test that carried warnings:
+  `[INFO] restore-test passed WITH WARNINGS (recognized)` when every warning is the known-benign
+  anchor, escalated to `[WARN] … UNRECOGNIZED WARNINGS` otherwise — as loud as a failed PBS
+  verify, so a real restore warning can't hide behind a green pass. A FAILED restore-test still
+  logs the existing `[WARN] … FAILED`.
+
+### Tests / contract
+- `restore_tests[0]` in the host-report golden gains `warnings` + `warnings_recognized`; the golden
+  stays **byte-identical** with felhom-agent's copy (sha256-verified) and the bidirectional
+  key-set contract test now round-trips the new keys through `hostRestoreTest`.
+
+### Not in this slice
+- No dashboard widget: the hub web layer renders only controller-report data — there is no
+  host-domain dashboard surface yet (guests/storage/restore_tests/pbs_snapshots are log+persist
+  only, same as the failed-PBS-verify signal). Distinct dashboard treatment lands when the
+  host-domain dashboard does (slice 10). The operator signal this slice is the log line.
+
 ## v0.7.4 — ingest agent pbs_snapshots (slice 6 Phase B) (2026-06-09)
 
 The agent's slice-6 Phase B work populates the host-report's `pbs_snapshots` (the PBS offsite
diff --git a/hub/internal/api/handler.go b/hub/internal/api/handler.go
index 4064c66..ece7fe9 100644
--- a/hub/internal/api/handler.go
+++ b/hub/internal/api/handler.go
@@ -310,6 +310,13 @@ type hostRestoreTest struct {
 	Error           string  `json:"error,omitempty"`
 	TestedAt        string  `json:"tested_at"`
 	DurationSeconds float64 `json:"duration_seconds"`
+	// Warnings are the guest-start task's warning line(s) on a PASS (e.g. the systemd-nesting
+	// advisory). The verdict is liveness-only, so a passed restore-test can carry warnings.
+	Warnings []string `json:"warnings,omitempty"`
+	// WarningsRecognized is true iff every warning is the known-benign anchor. Absent ⇒ false,
+	// which is the SAFE default: the hub then treats it as an unrecognized warning (the louder
+	// path), so a missing flag can only over-notice, never hide a real warning.
+	WarningsRecognized bool `json:"warnings_recognized,omitempty"`
 }
 
 // hostStorageTarget mirrors the agent's hub.StorageTarget wire contract field-for-field.
@@ -448,11 +455,24 @@ func (h *Handler) handleHostReport(w http.ResponseWriter, r *http.Request) {
 	}
 
 	// restore_tests (slice 6): a FAILED self-restore-test is the loudest DR signal there is
-	// — surface it prominently. A backup whose vzdump failed is also worth a warning.
+	// — surface it prominently. A PASS that carried start warnings (e.g. the systemd-nesting
+	// advisory) is surfaced too: INFO when every warning is recognized-benign, escalated to
+	// WARN when an UNRECOGNIZED warning stood out (as loud as a failed PBS verify is for
+	// backups), so a real restore warning can't hide behind a green pass. A backup whose
+	// vzdump failed is also worth a warning.
 	for _, rt := range rep.RestoreTests {
-		if !rt.Pass {
+		switch {
+		case !rt.Pass:
 			h.logger.Printf("[WARN] host %s restore-test FAILED: archive=%s tier=%s scratch=%d err=%q",
 				hostID, rt.SourceArchive, rt.SourceTier, rt.ScratchVMID, rt.Error)
+		case len(rt.Warnings) == 0:
+			// clean pass — nothing to surface here (counted in the summary line below).
+		case rt.WarningsRecognized:
+			h.logger.Printf("[INFO] host %s restore-test passed WITH WARNINGS (recognized): archive=%s tier=%s warnings=%v",
+				hostID, rt.SourceArchive, rt.SourceTier, rt.Warnings)
+		default:
+			h.logger.Printf("[WARN] host %s restore-test passed WITH UNRECOGNIZED WARNINGS: archive=%s tier=%s warnings=%v",
+				hostID, rt.SourceArchive, rt.SourceTier, rt.Warnings)
 		}
 	}
 	for _, bk := range rep.Backups {
diff --git a/hub/internal/api/testdata/host-report.golden.json b/hub/internal/api/testdata/host-report.golden.json
index 5f66eb9..695f894 100644
--- a/hub/internal/api/testdata/host-report.golden.json
+++ b/hub/internal/api/testdata/host-report.golden.json
@@ -108,7 +108,11 @@
       "pass": true,
       "verified": "boot+running",
       "tested_at": "2026-06-09T11:05:00Z",
-      "duration_seconds": 38.2
+      "duration_seconds": 38.2,
+      "warnings": [
+        "WARN: Systemd 257 detected. You may need to enable nesting."
+      ],
+      "warnings_recognized": true
     }
   ],
   "pbs_snapshots": [
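
Reviewer note (illustrative only, not part of the patch): the sketch below isolates the ingest decision so the "absent flag takes the louder path" claim can be checked on its own. It assumes a trimmed stand-in struct; `restoreTest`, `severity`, and `classify` are hypothetical names and do not exist in the hub code, and the returned labels are shorthand rather than the real log lines.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// restoreTest is a trimmed, hypothetical stand-in for the hub's hostRestoreTest
// mirror; only the fields that drive the log level are kept.
type restoreTest struct {
	Pass               bool     `json:"pass"`
	Warnings           []string `json:"warnings,omitempty"`
	WarningsRecognized bool     `json:"warnings_recognized,omitempty"`
}

// severity is a hypothetical helper that follows the same case order as the
// ingest switch: failure first, then clean pass, then recognized warnings,
// with unrecognized warnings as the default.
func severity(rt restoreTest) string {
	switch {
	case !rt.Pass:
		return "WARN failed"
	case len(rt.Warnings) == 0:
		return "clean"
	case rt.WarningsRecognized:
		return "INFO recognized"
	default:
		return "WARN unrecognized"
	}
}

// classify decodes one restore-test JSON object and returns its severity label.
func classify(raw string) (string, error) {
	var rt restoreTest
	if err := json.Unmarshal([]byte(raw), &rt); err != nil {
		return "", err
	}
	return severity(rt), nil
}

func main() {
	inputs := []string{
		`{"pass": true}`,
		`{"pass": true, "warnings": ["w"], "warnings_recognized": true}`,
		`{"pass": true, "warnings": ["w"]}`, // flag absent: decodes to false
		`{"pass": false, "warnings": ["w"], "warnings_recognized": true}`,
	}
	for _, in := range inputs {
		label, err := classify(in)
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Println(label)
	}
}
```

Because `encoding/json` leaves a missing boolean at its zero value, the third input falls through to the default case. The last input shows that a failed test reports as failed regardless of the warning fields, since the `!rt.Pass` case is evaluated first.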