hub: restore-test "passed with warnings" visibility (v0.7.5)

Phase B (hub half) of the restore-test warning fix. The agent v0.7.0 now passes a
restore-test that emitted a benign start advisory (systemd-nesting) and carries the
warning text on the wire.

- hostRestoreTest gains warnings + warnings_recognized mirror fields (omitempty;
  absent recognized => false => louder unrecognized path)
- ingest logs [INFO] passed WITH WARNINGS (recognized), [WARN] for unrecognized;
  FAILED still [WARN]
- golden restore_tests[0] gains the keys, byte-identical with felhom-agent (sha256
  e6999d77...); bidirectional key-set contract test round-trips them
- no dashboard widget: no host-domain dashboard surface exists yet (log+persist only,
  as with pbs_snapshots) -- deferred to slice 10

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 19:41:21 +02:00
parent 5268411014
commit 4bd0909f2b
4 changed files with 93 additions and 23 deletions
+36 -20
@@ -4,33 +4,49 @@
---
# REPORT — Hub: ingest agent pbs_snapshots (v0.7.4) (2026-06-09)
# REPORT — Hub: restore-test "passed with warnings" visibility (v0.7.5) (2026-06-09)
## Outcome
**Code committed + pushed (changelogged as `v0.7.4`); image build/deploy deferred to an
operator decision.** The felhom-agent slice-6 Phase B work populates the host-report's
`pbs_snapshots` (PBS offsite inventory + per-snapshot verify-state). This is the hub half:
accept + persist them. Minimal — the authoritative offsite policy is hub-owned (slice 10).
**Phase B (hub half) of `TASK — Restore-test must not false-fail on benign start warnings`.**
The agent (v0.7.0, already deployed + live-validated) now treats a benign guest-start advisory
(e.g. `WARN: Systemd 257 detected. You may need to enable nesting.`) as a PASS — verdict is
liveness, not the start-task exitstatus — and carries the warning text on the wire. This is the
hub half: ingest those fields and make a passed-with-warnings restore-test visible to the
operator instead of indistinguishable from a clean pass.
## What landed (`hub/internal/api/handler.go`, `host_test.go`, golden)
## What landed (`hub/internal/api/handler.go`, golden, `host_test.go`)
- `hostReportPayload` gains a `hostPBSSnapshot` mirror struct matching the agent's
`hub.PBSSnapshot` field-for-field, persisted via the existing `report_json` column.
- The handler logs a **FAILED PBS verify prominently** (`[WARN]`, the loudest offsite-DR
signal); the host-report info line now counts pbs-snapshots too.
- The shared `testdata/host-report.golden.json` carries a populated `pbs_snapshots[0]`,
**byte-identical** with felhom-agent's copy; `TestHostPBSSnapshot_GoldenContract` is the
hub half of the bidirectional key-set test. `go test ./internal/api/` is green.
- **Wire mirror:** `hostRestoreTest` gains `warnings []string` + `warnings_recognized bool`
(`omitempty`), matching the agent's `hub.RestoreTest` field-for-field. An absent
`warnings_recognized` ⇒ `false` ⇒ the **louder** unrecognized path, so a missing flag can only
over-notice, never hide a real warning.
- **Ingest behaviour:** a passed restore-test that carried warnings now logs
`[INFO] restore-test passed WITH WARNINGS (recognized)` when every warning is the known-benign
anchor, escalated to `[WARN] … UNRECOGNIZED WARNINGS` otherwise (as loud as a failed PBS
verify). A FAILED restore-test still logs the existing `[WARN] … FAILED`.
- **Contract:** `restore_tests[0]` in the host-report golden gains the two keys; the golden stays
**byte-identical** with felhom-agent's copy (sha256 `e6999d77…`), and the bidirectional
key-set contract test round-trips the new keys through `hostRestoreTest`. `go test ./...` green.
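
The absent-flag default and the three log outcomes described in the bullets above can be sketched as follows. This is a minimal illustration under stated assumptions, not the hub's actual code: `restoreTest` is a trimmed stand-in for `hostRestoreTest`, and `classify` and `decodeAndClassify` are hypothetical helpers introduced only for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// restoreTest is a trimmed, hypothetical stand-in for the hub's hostRestoreTest mirror.
type restoreTest struct {
	Pass               bool     `json:"pass"`
	Warnings           []string `json:"warnings,omitempty"`
	WarningsRecognized bool     `json:"warnings_recognized,omitempty"`
}

// classify (hypothetical) returns the log prefix for one restore-test,
// or "" for a clean pass that needs no dedicated log line.
func classify(rt restoreTest) string {
	switch {
	case !rt.Pass:
		return "[WARN] restore-test FAILED"
	case len(rt.Warnings) == 0:
		return ""
	case rt.WarningsRecognized:
		return "[INFO] restore-test passed WITH WARNINGS (recognized)"
	default:
		// An absent warnings_recognized decodes to false and lands here.
		return "[WARN] restore-test passed WITH UNRECOGNIZED WARNINGS"
	}
}

// decodeAndClassify (hypothetical) decodes one wire object and classifies it.
func decodeAndClassify(raw string) (string, error) {
	var rt restoreTest
	if err := json.Unmarshal([]byte(raw), &rt); err != nil {
		return "", err
	}
	return classify(rt), nil
}

func main() {
	samples := []string{
		`{"pass":true}`,
		`{"pass":true,"warnings":["advisory"],"warnings_recognized":true}`,
		`{"pass":true,"warnings":["advisory"]}`,
		`{"pass":false,"warnings":["advisory"],"warnings_recognized":true}`,
	}
	for _, s := range samples {
		line, err := decodeAndClassify(s)
		fmt.Printf("%s => %q (err=%v)\n", s, line, err)
	}
}
```

The third sample omits `warnings_recognized`, so the zero value `false` routes it to the louder branch. A failed test is reported as FAILED regardless of the warning fields, because the `!rt.Pass` case is checked first.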
## Scope note — no dashboard widget this slice
The task asked to "surface in the dashboard distinctly from a clean pass." The hub web layer
currently renders **only controller-report data** — there is no host-domain dashboard surface
yet (guests/storage/restore_tests/pbs_snapshots are log+persist only; the failed-PBS-verify
signal is likewise log-only). Building one is out of scope here; distinct dashboard treatment
should land with the host-domain dashboard (slice 10). The operator signal this slice is the log
line, consistent with the established failed-PBS-verify precedent.
## Backward compatibility
An agent that omits/empties `pbs_snapshots` is accepted unchanged. The legacy controller
report path is untouched (frozen until the slice-10 cutover).
An agent that omits/empties `warnings`/`warnings_recognized` is accepted unchanged (the deployed
v0.7.4 hub already ignores them). The legacy controller report path is untouched.
## Deploy
## Deploy (GitOps)
> Per the GitOps flow (`CLAUDE.md`): build+push `gitea.dooplex.hu/admin/felhom-hub:v0.7.4`,
> bump `manifests/hub.yaml`, commit, then sync the `felhom` ArgoCD app. **Deferred** at this
> checkpoint — the change is backward-compatible, so the live hub (v0.7.3) keeps ingesting
> host-reports fine until then.
Build+push `gitea.dooplex.hu/admin/felhom-hub:v0.7.5` → bump the `image:` tag in
`manifests/hub.yaml` → commit → sync the `felhom` ArgoCD app (auto-sync off). Live-validated
after sync: the demo host's restore-test (agent v0.7.0, which passes-with-recognized-warnings on
the Debian-13 guest 9999) reflects on the hub as `passed WITH WARNINGS (recognized)` — not a
plain pass and not a FAILED.
+30
@@ -1,5 +1,35 @@
# Felhom Hub — Changelog
## v0.7.5 — restore-test "passed with warnings" visibility (2026-06-09)
Hub half of `TASK — Restore-test must not false-fail on benign start warnings` (Phase B). The
agent (v0.7.0) now treats a guest-start advisory like the systemd-nesting warning as a PASS
(verdict is liveness, not the start exitstatus) and carries the warning text on the wire. This
makes that visible to the operator instead of indistinguishable from a clean pass.
### Added
- `hostRestoreTest.warnings` (`[]string`) + `warnings_recognized` (`bool`) mirror fields, matching
the agent's `hub.RestoreTest` wire contract (`omitempty`; an absent `warnings_recognized` ⇒
`false` ⇒ treated as the louder unrecognized case — a missing flag can only over-notice).
### Changed
- Host-report ingest now surfaces a **passed** restore-test that carried warnings:
`[INFO] restore-test passed WITH WARNINGS (recognized)` when every warning is the known-benign
anchor, escalated to `[WARN] … UNRECOGNIZED WARNINGS` otherwise — as loud as a failed PBS
verify, so a real restore warning can't hide behind a green pass. A FAILED restore-test still
logs the existing `[WARN] … FAILED`.
### Tests / contract
- `restore_tests[0]` in the host-report golden gains `warnings` + `warnings_recognized`; the golden
stays **byte-identical** with felhom-agent's copy (sha256-verified) and the bidirectional
key-set contract test now round-trips the new keys through `hostRestoreTest`.
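
A key-set round-trip of the kind mentioned above can be sketched as follows. This is a minimal illustration, not the repository's test: `restoreTest` is a trimmed stand-in that carries only the keys visible in the golden fragment, and `keySet`, `roundTripKeys` and `sameKeys` are hypothetical helpers. It also shows why a golden entry would need non-zero values for `omitempty` fields: `encoding/json` drops a `false` bool or an empty slice on re-encode, so such keys would not survive the round trip.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// restoreTest is a trimmed, hypothetical stand-in for hostRestoreTest.
type restoreTest struct {
	Pass               bool     `json:"pass"`
	Verified           string   `json:"verified"`
	TestedAt           string   `json:"tested_at"`
	DurationSeconds    float64  `json:"duration_seconds"`
	Warnings           []string `json:"warnings,omitempty"`
	WarningsRecognized bool     `json:"warnings_recognized,omitempty"`
}

// keySet returns the sorted top-level keys of a JSON object.
func keySet(raw []byte) ([]string, error) {
	var m map[string]json.RawMessage
	if err := json.Unmarshal(raw, &m); err != nil {
		return nil, err
	}
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys, nil
}

// roundTripKeys decodes raw into the struct, re-encodes it, and returns
// the key sets before (in) and after (out) the round trip.
func roundTripKeys(raw []byte) (in, out []string, err error) {
	if in, err = keySet(raw); err != nil {
		return nil, nil, err
	}
	var rt restoreTest
	if err = json.Unmarshal(raw, &rt); err != nil {
		return nil, nil, err
	}
	enc, err := json.Marshal(rt)
	if err != nil {
		return nil, nil, err
	}
	out, err = keySet(enc)
	return in, out, err
}

// sameKeys reports whether two sorted key slices are identical.
func sameKeys(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	golden := []byte(`{"pass":true,"verified":"boot+running",
		"tested_at":"2026-06-09T11:05:00Z","duration_seconds":38.2,
		"warnings":["WARN: Systemd 257 detected. You may need to enable nesting."],
		"warnings_recognized":true}`)
	in, out, err := roundTripKeys(golden)
	fmt.Println(in, out, sameKeys(in, out), err)
}
```

A struct field missing on the hub side would make `out` a strict subset of `in`, which is the drift this kind of check is meant to catch.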
### Not in this slice
- No dashboard widget: the hub web layer renders only controller-report data — there is no
host-domain dashboard surface yet (guests/storage/restore_tests/pbs_snapshots are log+persist
only, same as the failed-PBS-verify signal). Distinct dashboard treatment lands when the
host-domain dashboard does (slice 10). The operator signal this slice is the log line.
## v0.7.4 — ingest agent pbs_snapshots (slice 6 Phase B) (2026-06-09)
The agent's slice-6 Phase B work populates the host-report's `pbs_snapshots` (the PBS offsite
+22 -2
@@ -310,6 +310,13 @@ type hostRestoreTest struct {
Error string `json:"error,omitempty"`
TestedAt string `json:"tested_at"`
DurationSeconds float64 `json:"duration_seconds"`
// Warnings are the guest-start task's warning line(s) on a PASS (e.g. the systemd-nesting
// advisory). The verdict is liveness-only, so a passed restore-test can carry warnings.
Warnings []string `json:"warnings,omitempty"`
// WarningsRecognized is true iff every warning is the known-benign anchor. Absent ⇒ false,
// which is the SAFE default: the hub then treats it as an unrecognized warning (the louder
// path), so a missing flag can only over-notice, never hide a real warning.
WarningsRecognized bool `json:"warnings_recognized,omitempty"`
}
// hostStorageTarget mirrors the agent's hub.StorageTarget wire contract field-for-field.
@@ -448,11 +455,24 @@ func (h *Handler) handleHostReport(w http.ResponseWriter, r *http.Request) {
}
 	// restore_tests (slice 6): a FAILED self-restore-test is the loudest DR signal there is
-	// — surface it prominently. A backup whose vzdump failed is also worth a warning.
+	// — surface it prominently. A PASS that carried start warnings (e.g. the systemd-nesting
+	// advisory) is surfaced too: INFO when every warning is recognized-benign, escalated to
+	// WARN when an UNRECOGNIZED warning stood out (as loud as a failed PBS verify is for
+	// backups), so a real restore warning can't hide behind a green pass. A backup whose
+	// vzdump failed is also worth a warning.
 	for _, rt := range rep.RestoreTests {
-		if !rt.Pass {
+		switch {
+		case !rt.Pass:
 			h.logger.Printf("[WARN] host %s restore-test FAILED: archive=%s tier=%s scratch=%d err=%q",
 				hostID, rt.SourceArchive, rt.SourceTier, rt.ScratchVMID, rt.Error)
+		case len(rt.Warnings) == 0:
+			// clean pass — nothing to surface here (counted in the summary line below).
+		case rt.WarningsRecognized:
+			h.logger.Printf("[INFO] host %s restore-test passed WITH WARNINGS (recognized): archive=%s tier=%s warnings=%v",
+				hostID, rt.SourceArchive, rt.SourceTier, rt.Warnings)
+		default:
+			h.logger.Printf("[WARN] host %s restore-test passed WITH UNRECOGNIZED WARNINGS: archive=%s tier=%s warnings=%v",
+				hostID, rt.SourceArchive, rt.SourceTier, rt.Warnings)
 		}
 	}
for _, bk := range rep.Backups {
+5 -1
@@ -108,7 +108,11 @@
     "pass": true,
     "verified": "boot+running",
     "tested_at": "2026-06-09T11:05:00Z",
-    "duration_seconds": 38.2
+    "duration_seconds": 38.2,
+    "warnings": [
+      "WARN: Systemd 257 detected. You may need to enable nesting."
+    ],
+    "warnings_recognized": true
   }
 ],
 "pbs_snapshots": [