Revise disaster recovery documentation for consistency, etc. by susan-pgedge · Pull Request #277 · pgEdge/control-plane

susan-pgedge · 2026-02-24T12:22:29Z

Updated formatting and improved clarity in the disaster recovery guide.

Summary

Changes

...

Testing

...

Checklist

Tests added or updated (unit and/or e2e, as needed)
Documentation updated (if needed)
Issue is linked (branch name or URL in PR description)
Changelog entry added for user-facing behavior changes
Breaking changes (if any) are clearly called out in the PR description

Notes for Reviewers

Updated formatting and improved clarity in the disaster recovery guide.

coderabbitai · 2026-02-24T12:23:06Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

`@coderabbitai resume` to resume automatic reviews.
`@coderabbitai review` to trigger a single review.

📝 Walkthrough

Updated the Disaster Recovery guide and site navigation: the guide's prose, formatting, section order, prerequisites, variable setup, and procedure blocks (Swarm, Control Plane, etcd/data-volume) were reorganized and reworded for clarity; mkdocs.yml adds a Troubleshooting nav entry linking the guide.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Disaster Recovery doc `docs/disaster-recovery/disaster-recovery.md` | Major prose and formatting rewrite: reworded introduction and warnings, expanded prerequisites, reorganized procedure blocks (Swarm restore, control plane, etcd/data-volume), adjusted command/JSON formatting, renamed headings and anchors, tightened verification steps. No functional recovery steps changed. |
| Site navigation `mkdocs.yml` | Added a top-level "Troubleshooting" navigation item with a nested entry pointing to `disaster-recovery/disaster-recovery.md` (Recovering Control Plane and Docker Swarm). |

Poem

🐇 I hopped through lines and banished the fray,
Tidied anchors, commands, and the way.
Prereqs snug, snapshots held tight,
Swarm and control plane back to light.
A rabbit's hop — recovery done right.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The description is largely incomplete with placeholder content (ellipses) in Summary, Changes, and Testing sections, and lacks changelog entry and issue linkage despite being required. | Fill in Summary with 1-2 sentences explaining the guide revisions; list specific changes in Changes section; add Testing details; check issue linkage and changelog entry checklist items, or provide justification. |
| Title check | ❓ Inconclusive | The title references disaster recovery documentation revisions but uses vague phrasing ('etc.') that obscures the specific changes made. | Replace 'etc.' with specifics like 'formatting and clarity improvements' or follow Conventional Commits format (e.g., 'docs: revise disaster recovery guide for clarity and consistency'). |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment `@coderabbitai help` to get the list of available commands and usage tips.

Revised instructions for restoring Docker Swarm and Control Plane, including clearer steps and formatting improvements.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/disaster-recovery/disaster-recovery.md`:
- Line 435: The table row linking "Docker Swarm commands hang" uses a broken
anchor `#reinitializing-the-swarm`; update that anchor in the table entry (the
row containing "Docker Swarm commands hang") to the correct heading anchor
`#restoring-docker-swarm` which matches the "Restoring Docker Swarm" section
(the "1. Reinitialize the Swarm" heading) so the link resolves.
- Line 435: The table row referencing the broken anchor
'#reinitializing-the-swarm' (the "Docker Swarm commands hang" entry) points to a
numbered list item that generates no ID; update that link target to the existing
valid heading anchor '#restoring-docker-swarm' or alternatively convert the "1.
Reinitialize the Swarm" list item into a proper level-2 heading "Reinitializing
the Swarm" so the '#reinitializing-the-swarm' anchor becomes valid; edit the
table cell replacing '#reinitializing-the-swarm' with '#restoring-docker-swarm'
unless you prefer adding a new "## Reinitializing the Swarm" heading to match
the current link.
- Around line 109-168: The numbered step headings ("1. Reinitialize the Swarm",
"2. Join Hosts to the New Swarm", "3. Removing Old Swarm Nodes" and the
equivalent headings in "Restoring the Control Plane" and "Re-adding Hosts") are
indented four spaces, which makes them render as code blocks; outdent those list
markers to 0–3 spaces so they become proper ordered list items, and ensure the
fenced blocks use an unindented ```bash line before and after the commands
(e.g., move the opening ```bash to its own non-indented line and remove the
leading four-space indentation from the "docker swarm init ..." and subsequent
list markers). Apply this same fix to each affected section heading and its
surrounding fenced code blocks.
- Around line 109-168: The numbered list markers ("1.", "2.", "3.") under the
"Reinitialize the Swarm", "Join Hosts to the New Swarm", and "Removing Old Swarm
Nodes" headings are prefixed with 4 spaces causing Python-Markdown to render
them as code blocks; remove the leading 4-space indentation from the lines that
start with the list markers so they are real list items, but keep the following
code blocks (the fenced ```bash``` blocks and the indented command lines like
"docker swarm init --force-new-cluster ..." and "docker swarm join --token ...
${RECOVERY_HOST_IP}:2377") indented exactly as shown so they remain code blocks;
update the three sections' numbered lines accordingly (the list marker lines
shown in the diff and any other similarly-indented numbered steps in "Restoring
the Control Plane" and "Re-adding Hosts").
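
The broken-anchor finding above can be checked mechanically before merging. The following is a minimal Python sketch, not part of this PR or of MkDocs: `slugify` and `broken_anchors` are hypothetical helpers, and the slug rule only approximates how Python-Markdown's `toc` extension derives heading IDs (lowercase, punctuation dropped, whitespace hyphenated). It also does not skip fenced code blocks, so a `#` comment inside a fence would be mistaken for a heading.

```python
import re


def slugify(heading: str) -> str:
    """Approximate a heading's anchor ID (assumption: toc-style slugs)."""
    text = re.sub(r"[^\w\s-]", "", heading.strip().lower())
    return re.sub(r"[\s-]+", "-", text).strip("-")


def broken_anchors(markdown: str) -> list[str]:
    """Return in-page link targets (#...) that match no ATX heading."""
    headings = re.findall(
        r"^#{1,6}[ \t]+(.+?)[ \t]*#*[ \t]*$", markdown, flags=re.MULTILINE
    )
    valid = {slugify(h) for h in headings}
    # Only inline links of the form [text](#target) are considered.
    targets = re.findall(r"\]\(#([^)\s]+)\)", markdown)
    return [t for t in targets if t not in valid]
```

A numbered list item such as "1. Reinitialize the Swarm" is not a heading, so a link to `#reinitializing-the-swarm` is reported as broken while `#restoring-docker-swarm` passes, provided a "Restoring Docker Swarm" heading exists.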

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a7e8a31 and 6f9fedf.

📒 Files selected for processing (2)

docs/disaster-recovery/disaster-recovery.md
mkdocs.yml

docs/disaster-recovery/disaster-recovery.md

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

docs/disaster-recovery/disaster-recovery.md (1)

273-289: ⚠️ Potential issue | 🟡 Minor

markdownlint MD046: fenced blocks vs. indented style.

MD046 expects indented code blocks in these sections, so the newly fenced blocks will keep failing lint. Either convert these to indented blocks or adjust the rule for this doc.

✏️ Representative conversion (apply to all flagged blocks)

-```sh
-curl -X POST \
-    "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/databases/<DB_ID>?remove_host=<LOST_HOST_ID>" \
-    -H "Content-Type: application/json" \
-    -d '{
-        "spec": {
-            "database_name": "<DB_NAME>",
-            "database_users": [{"username": "admin", "db_owner": true,
-                                "attributes": ["SUPERUSER", "LOGIN"]}],
-            "port": 5432,
-            "nodes": [
-                { "name": "n1", "host_ids": ["host-1"] },
-                { "name": "n2", "host_ids": ["host-2"] }
-            ]
-        }
-    }'
-```
+    curl -X POST \
+        "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/databases/<DB_ID>?remove_host=<LOST_HOST_ID>" \
+        -H "Content-Type: application/json" \
+        -d '{
+            "spec": {
+                "database_name": "<DB_NAME>",
+                "database_users": [{"username": "admin", "db_owner": true,
+                                    "attributes": ["SUPERUSER", "LOGIN"]}],
+                "port": 5432,
+                "nodes": [
+                    { "name": "n1", "host_ids": ["host-1"] },
+                    { "name": "n2", "host_ids": ["host-2"] }
+                ]
+            }
+        }'

Also applies to: 301-305, 322-325, 386-403, 413-416

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/disaster-recovery.md` around lines 273 - 289, Convert
the fenced shell code blocks in disaster-recovery.md (the curl POST examples
such as the block starting with curl -X POST
"http://${RECOVERY_HOST_IP}:${API_PORT}/v1/databases/<DB_ID>?remove_host=<LOST_HOST_ID>"
...) to indented code block style to satisfy markdownlint MD046 (or
alternatively add an explicit rule disable for MD046 in this document's
frontmatter); apply the same conversion to the other flagged blocks (the
additional curl/command examples referenced in the review) so all examples use
indented code blocks or the rule is disabled consistently.
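
If the indented style is chosen, the conversion is mechanical. The sketch below is a hedged illustration in Python rather than anything from the PR: `fenced_to_indented` is a hypothetical helper that assumes fences sit at the document root, open and close with three backticks, and are already surrounded by blank lines. The language tag (`sh`) is dropped because indented blocks cannot carry one.

```python
FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def fenced_to_indented(markdown: str) -> str:
    """Rewrite root-level fenced blocks as four-space indented code blocks."""
    out, in_fence = [], False
    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence  # drop the fence line itself
            continue
        if in_fence:
            # Indent code lines; keep blank lines inside the block empty.
            out.append("    " + line if line else "")
        else:
            out.append(line)
    return "\n".join(out)
```

Fences nested inside list items would need the list's own indentation added on top, which this sketch does not attempt.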

♻️ Duplicate comments (1)

docs/disaster-recovery/disaster-recovery.md (1)

109-167: ⚠️ Potential issue | 🟡 Minor

Outdent top‑level numbered steps so they render as lists.

These numbered steps are still indented four spaces at the document root, so Markdown will render them as code blocks rather than ordered lists. Please outdent the list markers (and keep their content indented) in all three sections.

✏️ Representative fix (apply to all three sections)
-    1. Reinitialize the Swarm
+1. Reinitialize the Swarm

-    On a surviving manager, invoke the following command:
+    On a surviving manager, invoke the following command:

     ```bash
     docker swarm init --force-new-cluster \
         --advertise-addr ${RECOVERY_HOST_IP}
     ```
Also applies to: 195-265, 340-377

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/disaster-recovery.md` around lines 109 - 167, The
numbered sections "1. Reinitialize the Swarm", "2. Join Hosts to the New Swarm",
and "3. Removing Old Swarm Nodes" are indented and render as code blocks;
outdent each top-level list marker (the "1.", "2.", "3." lines) to the document
root while keeping the subsequent code fences and wrapped lines indented as
shown (e.g., the docker swarm init block, docker swarm join-token / join
examples, and docker node demote / docker service rm examples). Apply the same
outdent fix to the other occurrences noted (the ranges around lines 195-265 and
340-377) so all three sections render as proper ordered lists instead of code
blocks.
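
A quick way to find every affected marker is to scan for ordered-list markers indented four or more spaces with no shallower list item above them. This is a minimal Python sketch under stated assumptions: `indented_root_markers` is a hypothetical helper, it only tracks ordered lists (a parent bullet list is not recognized), and it treats any line beginning with three backticks as a fence toggle.

```python
import re

LIST_MARKER = re.compile(r"^ *\d+\.\s")


def indented_root_markers(markdown: str) -> list[int]:
    """Return 1-based line numbers of ordered-list markers indented 4+ spaces
    that are not nested under a shallower ordered-list item."""
    flagged, in_list, in_fence = [], False, False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith("`" * 3):
            in_fence = not in_fence
            continue
        if in_fence or not line.strip():
            continue
        is_marker = bool(LIST_MARKER.match(line))
        indent = len(line) - len(line.lstrip(" "))
        if is_marker and indent < 4:
            in_list = True          # a real top-level item opens a list
        elif is_marker and not in_list:
            flagged.append(number)  # indented marker with no parent list
        elif indent == 0:
            in_list = False         # unindented text ends the list
    return flagged
```

Run against the guide, the reported line numbers should correspond to the step markers called out in the review; nested steps under a properly outdented item are not flagged.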

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/disaster-recovery/disaster-recovery.md`:
- Around line 181-193: The doc omits an explicit step to stop/scale down Control
Plane services on the recovery host before renaming/moving the etcd directory
(Step 1), which can cause corruption/races; update the "Then on the recovery
host, perform the following steps:" sequence to add a clear pre-step that
instructs operators to stop or scale down Control Plane services (e.g., using
docker service scale control-plane_<host-id>=0 or equivalent) on the recovery
host prior to any backup/restore or moving of the etcd data directory, and
reference the related sections ("Restoring Docker Swarm" / "Data Volume
Restore") so readers know when to skip the stop if services are already offline.

📥 Commits

Reviewing files that changed from the base of the PR and between a7e8a31 and ffd5d54.

📒 Files selected for processing (2)

docs/disaster-recovery/disaster-recovery.md
mkdocs.yml

coderabbitai · 2026-02-24T13:29:09Z

docs/disaster-recovery/disaster-recovery.md

+- all server-mode hosts were offline: On a Swarm manager, stop all
+  Control Plane services with the command:
+  `docker service scale control-plane_<host-id-1>=0 control-plane_<host-id-2>=0 ...` 
+  Then, restore the data volume from your
+  backup (see [Data Volume Restore](#data-volume-restore)).
+- at least one server-mode host was still up: Use that host as the
+  recovery host. You do not need to stop services or restore the volume
+  first.
+- you already completed [Restoring Docker Swarm](#restoring-docker-swarm): 
+  Restore the data volume on the surviving host if it was lost (see
+  [Data volume restore](#data-volume-restore)); otherwise skip.
+
+Then on the recovery host, perform the following steps:


⚠️ Potential issue | 🟠 Major

Stop Control Plane services before moving live etcd data.

Step 1 renames the etcd directory. If the recovery host still has Control Plane services running, moving that directory can corrupt or race active processes. Add an explicit stop/scale‑down step before the backup/restore actions.

🛠️ Suggested clarification

 Then on the recovery host, perform the following steps:
+If any Control Plane services are running on the recovery host, stop them
+before step 1 to avoid moving live etcd data (for example:
+`docker service scale control-plane_<host-id>=0`).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/disaster-recovery.md` around lines 181 - 193, The doc omits an explicit step to stop/scale down Control Plane services on the recovery host before renaming/moving the etcd directory (Step 1), which can cause corruption/races; update the "Then on the recovery host, perform the following steps:" sequence to add a clear pre-step that instructs operators to stop or scale down Control Plane services (e.g., using docker service scale control-plane_<host-id>=0 or equivalent) on the recovery host prior to any backup/restore or moving of the etcd data directory, and reference the related sections ("Restoring Docker Swarm" / "Data Volume Restore") so readers know when to skip the stop if services are already offline.

@mmols or @moizpgedge

Does this require an update?

Yes, we should update it. For the case where at least one server-mode host was still up, add a step to stop the Control Plane service on the recovery host (e.g. scale it to 0) before they run the steps that rename/move the etcd directory, so we don’t move live data while the process is running.

@moizpgedge Can you make that change inline here?

docs/disaster-recovery/disaster-recovery.md

tsivaprasad · 2026-02-24T17:06:20Z

docs/disaster-recovery/disaster-recovery.md

+
+       You should see one host with `status: "reachable"` and 
+       `etcd_mode: "server"`. Then continue with the next section: 
+       [Updating Databases to Remove Old Hosts](#updating-databases-to-remove-old-hosts).


In the document view, the hyperlink is not displayed and appears as plain text.

tsivaprasad · 2026-02-24T17:11:27Z

docs/disaster-recovery/disaster-recovery.md

+
+       - If Control Plane is already deployed as Swarm services (if the etcd
+         quorum was lost and you did not run
+         [Restoring Docker Swarm](#restoring-docker-swarm)):


In the document view, "Restoring Docker Swarm" appears as plain text instead of a link.

tsivaprasad · 2026-02-24T17:12:42Z

docs/disaster-recovery/disaster-recovery.md

+         [Restoring Docker Swarm](#restoring-docker-swarm)):
+         `docker service scale control-plane_${RECOVERY_HOST_ID}=1`
+
+       - If you completed [Restoring Docker Swarm](#restoring-docker-swarm)


The same applies to this "Restoring Docker Swarm" link.

Reformatted...

Fixing link comment

coderabbitai

♻️ Duplicate comments (2)

docs/disaster-recovery/disaster-recovery.md (2)

109-164: ⚠️ Potential issue | 🟡 Minor

Ordered steps still render as code blocks due to 4-space indentation.

Line 109 and the equivalent step markers in Line 196+ and Line 341+ are still indented by four spaces, so they render as code instead of numbered steps in MkDocs/Python-Markdown.

Proposed fix

-    1. Reinitialize the Swarm
+1. Reinitialize the Swarm
...
-    2. Join Hosts to the New Swarm
+2. Join Hosts to the New Swarm
...
-    3. Removing Old Swarm Nodes
+3. Removing Old Swarm Nodes

Apply the same outdent pattern to the numbered markers in “Restoring the Control Plane” and “Re-adding Hosts”.

Also applies to: 196-267, 341-378

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/disaster-recovery.md` around lines 109 - 164, The
numbered list markers (e.g., the "1. Reinitialize the Swarm" heading and the
equivalent markers in the "Restoring the Control Plane" and "Re-adding Hosts"
sections) are prefixed with four spaces which causes Markdown to render them as
code blocks; remove the leading four-space indent from those numbered lines so
the lists render as normal ordered lists, ensuring fenced code blocks and their
internal indentation remain unchanged and apply the same outdent change to the
other occurrences you noted (the sections around lines 196–267 and 341–378).

187-202: ⚠️ Potential issue | 🟠 Major

Conflicting guidance: services may still be live when etcd data is moved.

Line 187-Line 189 says services do not need to be stopped when a server-mode host remains, but Line 196-Line 202 immediately renames/moves etcd data. That can race a running Control Plane service on the recovery host.

Proposed fix

-- at least one server-mode host was still up: Use that host as the
-  recovery host. You do not need to stop services or restore the volume
-  first.
+- at least one server-mode host was still up: Use that host as the
+  recovery host. On that recovery host, stop/scale down the Control Plane
+  service before step 1 so etcd data is not moved while live.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/disaster-recovery.md` around lines 187 - 202, The
current instructions move the etcd data directory (PGEDGE_DATA_DIR/etcd ->
ETCD_BACKUP_DIR) while earlier text says services need not be stopped, creating
a race with a running control-plane; update the procedure to either (a)
explicitly stop the control-plane/etcd-related services on the recovery host
before moving the directory (add a short step to stop the service(s) hosting
etcd), or (b) if downtime is unacceptable, replace the mv step with an
etcd-consistent snapshot approach (e.g., use etcdctl snapshot save) and document
restoring from that snapshot; reference the existing variables and operations
(PGEDGE_DATA_DIR, etcd directory move to ETCD_BACKUP_DIR, and the mv action) so
the change is applied where the backup/move is described.
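
Option (a) comes down to an ordering constraint: scale the service to zero first, move the directory second. The following Python sketch only builds the two commands in that order and executes nothing. `recovery_commands` is a hypothetical helper; the service name follows the `control-plane_<host-id>` pattern quoted in this review, and the paths in the usage example are placeholders standing in for `PGEDGE_DATA_DIR` and `ETCD_BACKUP_DIR`.

```python
import shlex


def recovery_commands(host_id: str, data_dir: str, backup_dir: str) -> list[list[str]]:
    """Build the stop-then-move sequence as argv lists (nothing is run here)."""
    service = f"control-plane_{host_id}"
    return [
        # 1. Scale the Control Plane service to zero so nothing holds etcd open.
        ["docker", "service", "scale", f"{service}=0"],
        # 2. Only then set the existing etcd directory aside.
        ["mv", f"{data_dir}/etcd", backup_dir],
    ]


if __name__ == "__main__":
    # Placeholder values; substitute the real host ID and directories.
    for argv in recovery_commands("host-1", "/data/pgedge", "/data/etcd.bak"):
        print(shlex.join(argv))
```

Keeping the sequence in one place makes it harder for the guide's prose and its commands to drift apart, which is the inconsistency this comment flags.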

📥 Commits

Reviewing files that changed from the base of the PR and between ffd5d54 and ecb21fa.

📒 Files selected for processing (1)

docs/disaster-recovery/disaster-recovery.md

coderabbitai

♻️ Duplicate comments (2)

docs/disaster-recovery/disaster-recovery.md (2)

181-183: ⚠️ Potential issue | 🟠 Major

Add an explicit “stop Control Plane service” pre-step before moving etcd data.

Line 181-Line 183 currently says services do not need to be stopped when at least one server-mode host remains, but Line 190+ immediately moves etcd data. That can race with live processes. Add a clear pre-step to scale/stop Control Plane on the recovery host before Step 1.

Suggested clarification

-- at least one server-mode host was still up: Use that host as the
-  recovery host. You do not need to stop services or restore the volume
-  first.
+- at least one server-mode host was still up: Use that host as the
+  recovery host. Before moving etcd data, stop/scale down Control Plane
+  service(s) on that recovery host (for example,
+  `docker service scale control-plane_<host-id>=0`). Restore the volume
+  only if needed.

Also applies to: 188-205

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/disaster-recovery.md` around lines 181 - 183, Add an
explicit pre-step to stop/scale down the Control Plane services on the chosen
recovery host before any instructions that move etcd data: insert a clear step
such as "Stop/scale Control Plane services on the recovery host (e.g.,
`docker service scale control-plane_<host-id>=0`) and verify they are not
running" immediately before the existing Step 1 where etcd data is moved in the
section currently describing recovery when "at least one server-mode host was
still up" (also apply the same pre-step to the parallel block covering lines
188-205). Mention verifying services are stopped to avoid races with live
processes.

109-157: ⚠️ Potential issue | 🟡 Minor

Fix ordered-list indentation so recovery steps render as steps, not code.

The numbered items are still indented by 4 spaces, so Markdown renders them as code blocks instead of ordered lists (e.g., Line 109, Line 190, Line 335). This makes operational steps harder to follow in docs UI.

Suggested fix (representative)

-    1. Reinitialize the Swarm; on a surviving manager, invoke the following command:
+1. Reinitialize the Swarm; on a surviving manager, invoke the following command:

-       ```bash
-       docker swarm init --force-new-cluster \
-           --advertise-addr ${RECOVERY_HOST_IP}
-       ```
+    ```bash
+    docker swarm init --force-new-cluster \
+        --advertise-addr ${RECOVERY_HOST_IP}
+    ```

-    2. Join Hosts to the New Swarm.  If you have other surviving nodes that should be
+2. Join Hosts to the New Swarm. If you have other surviving nodes that should be

Also applies to: 190-260, 335-372

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/disaster-recovery.md` around lines 109 - 157, Remove
the leading 4-space indentation that turns ordered steps into code blocks in
disaster-recovery.md: un-indent the numbered headings ("Reinitialize the Swarm",
"Join Hosts to the New Swarm", "Removing Old Swarm Nodes") so they start at
column 0 (e.g., change "    1. ..." to "1. ..."), and ensure the fenced code
blocks for commands use non-indented triple-backtick fences aligned with the
list (e.g., ```bash on its own line, then the commands, then ```), fixing the
blocks shown around the docker swarm init/join/node/service examples so they
render as proper ordered list steps rather than code blocks.

📥 Commits

Reviewing files that changed from the base of the PR and between ecb21fa and 2a0955b.

📒 Files selected for processing (1)

docs/disaster-recovery/disaster-recovery.md

Fixed formatting

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

docs/disaster-recovery/disaster-recovery.md (1)

105-105: Optional: Consider simplifying "a majority of" to "most".

For slightly more concise language, you could replace "a majority of managers" with "most managers," though the current phrasing is perfectly acceptable in technical documentation.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/disaster-recovery.md` at line 105, Replace the phrase
"a majority of managers" with the simpler wording "most managers" in the
disaster recovery text (locate the exact string "a majority of managers" in
disaster-recovery.md) so the sentence reads more concise while preserving
meaning; ensure surrounding sentence grammar and any linked references remain
correct after the substitution.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/disaster-recovery/disaster-recovery.md`:
- Around line 180-189: Update the guidance so the recovery host's Control Plane
service is stopped before moving etcd data: change the sentence that currently
says "You do not need to stop services..." to explicitly require stopping the
Control Plane service on the recovery host, and insert a clear preparatory step
before "Back up existing etcd data..." that instructs operators to stop the
Control Plane service (and confirm it's not running) on the recovery host to
avoid races with the etcd directory move.

📥 Commits

Reviewing files that changed from the base of the PR and between d3ca510 and b623f42.

📒 Files selected for processing (1)

docs/disaster-recovery/disaster-recovery.md

coderabbitai · 2026-02-27T15:42:03Z

docs/disaster-recovery/disaster-recovery.md

+- at least one server-mode host was still up: Use that host as the
+  recovery host. You do not need to stop services or restore the volume
+  first.
+- you already completed [Restoring Docker Swarm](#restoring-docker-swarm): 
+  Restore the data volume on the surviving host if it was lost (see
+  [Data volume restore](#data-volume-restore)); otherwise skip.
+
+Then on the recovery host, perform the following steps:
+
+1.  Back up existing etcd data and set aside for restore:


⚠️ Potential issue | 🟠 Major

Stop the Control Plane service on the recovery host before moving etcd data.

The current text at lines 180-182 states "You do not need to stop services or restore the volume first" for the case where at least one server-mode host is still up. However, step 1 (line 189) immediately moves the etcd directory, which can cause corruption if the Control Plane service is still running on the recovery host and actively using that directory.

Per the discussion in past reviews (moizpgedge confirmed), you must stop the service on the recovery host before moving the etcd directory to avoid racing with live processes.

🛠️ Proposed fix

 - at least one server-mode host was still up: Use that host as the
-  recovery host. You do not need to stop services or restore the volume
-  first.
+  recovery host. Stop the Control Plane service on that host with:
+  `docker service scale control-plane_<recovery-host-id>=0`
+  You do not need to restore the volume from backup.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/disaster-recovery.md` around lines 180 - 189, Update the guidance so the recovery host's Control Plane service is stopped before moving etcd data: change the sentence that currently says "You do not need to stop services..." to explicitly require stopping the Control Plane service on the recovery host, and insert a clear preparatory step before "Back up existing etcd data..." that instructs operators to stop the Control Plane service (and confirm it's not running) on the recovery host to avoid races with the etcd directory move.

Revise disaster recovery documentation for consistency, etc.

51cec8c

Updated formatting and improved clarity in the disaster recovery guide.

susan-pgedge added 2 commits February 24, 2026 08:07

Update disaster recovery documentation for clarity

3f4a9ab

Revised instructions for restoring Docker Swarm and Control Plane, including clearer steps and formatting improvements.

Add Troubleshooting section to mkdocs.yml

6f9fedf

susan-pgedge requested a review from mmols February 24, 2026 13:17

Update reference for reinitializing Docker Swarm

ffd5d54

coderabbitai bot reviewed Feb 24, 2026

docs/disaster-recovery/disaster-recovery.md Outdated

docs/disaster-recovery/disaster-recovery.md Outdated

susan-pgedge marked this pull request as ready for review February 24, 2026 13:23

coderabbitai bot reviewed Feb 24, 2026

tsivaprasad reviewed Feb 24, 2026

docs/disaster-recovery/disaster-recovery.md Outdated

docs/disaster-recovery/disaster-recovery.md Outdated

docs/disaster-recovery/disaster-recovery.md Outdated

tsivaprasad reviewed Feb 24, 2026

Update disaster-recovery.md

ecb21fa

Fixing link comment

coderabbitai bot reviewed Feb 27, 2026

Update disaster-recovery.md

2a0955b

coderabbitai bot reviewed Feb 27, 2026

susan-pgedge and others added 2 commits February 27, 2026 09:01

Update disaster-recovery.md

d3ca510

Fixed formatting

Updated formatting in disaster-recovery.md

b623f42

coderabbitai bot reviewed Feb 27, 2026

Updated formatting in disaster-recovery.md

ebfc0bf


3 participants
