Revise disaster recovery documentation for consistency, etc. #277

Open
susan-pgedge wants to merge 9 commits into main from susan-disaster-recovery

@susan-pgedge (Member) commented Feb 24, 2026

Updated formatting and improved clarity in the disaster recovery guide.

Summary

Changes

  • ...

Testing

  • ...

Checklist

  • Tests added or updated (unit and/or e2e, as needed)
  • Documentation updated (if needed)
  • Issue is linked (branch name or URL in PR description)
  • Changelog entry added for user-facing behavior changes
  • Breaking changes (if any) are clearly called out in the PR description

Notes for Reviewers

Updated formatting and improved clarity in the disaster recovery guide.
coderabbitai bot commented Feb 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Updated the Disaster Recovery guide and site navigation: the guide's prose, formatting, section order, prerequisites, variable setup, and procedure blocks (Swarm, Control Plane, etcd/data-volume) were reorganized and reworded for clarity; mkdocs.yml adds a Troubleshooting nav entry linking the guide.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Disaster Recovery doc: `docs/disaster-recovery/disaster-recovery.md` | Major prose and formatting rewrite: reworded introduction and warnings, expanded prerequisites, reorganized procedure blocks (Swarm restore, control plane, etcd/data-volume), adjusted command/JSON formatting, renamed headings and anchors, tightened verification steps. No functional recovery steps changed. |
| Site navigation: `mkdocs.yml` | Added a top-level "Troubleshooting" navigation item with a nested entry pointing to disaster-recovery/disaster-recovery.md (Recovering Control Plane and Docker Swarm). |

Poem

🐇 I hopped through lines and banished the fray,
Tidied anchors, commands, and the way.
Prereqs snug, snapshots held tight,
Swarm and control plane back to light.
A rabbit's hop — recovery done right.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The description is largely incomplete with placeholder content (ellipses) in Summary, Changes, and Testing sections, and lacks changelog entry and issue linkage despite being required. | Fill in Summary with 1-2 sentences explaining the guide revisions; list specific changes in Changes section; add Testing details; check issue linkage and changelog entry checklist items, or provide justification. |
| Title check | ❓ Inconclusive | The title references disaster recovery documentation revisions but uses vague phrasing ('etc.') that obscures the specific changes made. | Replace 'etc.' with specifics like 'formatting and clarity improvements' or follow Conventional Commits format (e.g., 'docs: revise disaster recovery guide for clarity and consistency'). |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

Revised instructions for restoring Docker Swarm and Control Plane, including clearer steps and formatting improvements.
@susan-pgedge susan-pgedge requested a review from mmols February 24, 2026 13:17

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/disaster-recovery/disaster-recovery.md`:
- Line 435: The table row linking "Docker Swarm commands hang" uses a broken
anchor `#reinitializing-the-swarm`; update that anchor in the table entry (the
row containing "Docker Swarm commands hang") to the correct heading anchor
`#restoring-docker-swarm` which matches the "Restoring Docker Swarm" section
(the "1. Reinitialize the Swarm" heading) so the link resolves.
- Line 435: The table row referencing the broken anchor
'#reinitializing-the-swarm' (the "Docker Swarm commands hang" entry) points to a
numbered list item that generates no ID; update that link target to the existing
valid heading anchor '#restoring-docker-swarm' or alternatively convert the "1.
Reinitialize the Swarm" list item into a proper level-2 heading "Reinitializing
the Swarm" so the '#reinitializing-the-swarm' anchor becomes valid; edit the
table cell replacing '#reinitializing-the-swarm' with '#restoring-docker-swarm'
unless you prefer adding a new "## Reinitializing the Swarm" heading to match
the current link.
- Around line 109-168: The numbered step headings ("1. Reinitialize the Swarm",
"2. Join Hosts to the New Swarm", "3. Removing Old Swarm Nodes" and the
equivalent headings in "Restoring the Control Plane" and "Re-adding Hosts") are
indented four spaces, which makes them render as code blocks; outdent those list
markers to 0–3 spaces so they become proper ordered list items, and ensure the
fenced blocks use an unindented ```bash line before and after the commands
(e.g., move the opening ```bash to its own non-indented line and remove the
leading four-space indentation from the "docker swarm init ..." and subsequent
list markers). Apply this same fix to each affected section heading and its
surrounding fenced code blocks.
- Around line 109-168: The numbered list markers ("1.", "2.", "3.") under the
"Reinitialize the Swarm", "Join Hosts to the New Swarm", and "Removing Old Swarm
Nodes" headings are prefixed with 4 spaces causing Python-Markdown to render
them as code blocks; remove the leading 4-space indentation from the lines that
start with the list markers so they are real list items, but keep the following
code blocks (the fenced ```bash``` blocks and the indented command lines like
"docker swarm init --force-new-cluster ..." and "docker swarm join --token ...
${RECOVERY_HOST_IP}:2377") indented exactly as shown so they remain code blocks;
update the three sections' numbered lines accordingly (the list marker lines
shown in the diff and any other similarly-indented numbered steps in "Restoring
the Control Plane" and "Re-adding Hosts").
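
The indentation findings above can be checked mechanically before pushing. The following is a minimal shell sketch, not part of the project's tooling: `find_indented_steps` is a hypothetical helper and the sample file is made up for illustration. It reports ordered-list markers that are indented four or more spaces and sit outside fenced blocks, which is the pattern the review says Python-Markdown renders as code.

```shell
#!/usr/bin/env bash
# Sketch: list ordered-list markers indented four or more spaces that sit
# outside fenced code blocks. find_indented_steps is a hypothetical helper.

find_indented_steps() {
    awk '
        BEGIN { fence = sprintf("%c%c%c", 96, 96, 96) }  # three backticks
        {
            line = $0
            sub(/^[ \t]+/, "", line)              # ignore fence indentation
            if (substr(line, 1, 3) == fence) {    # opening or closing fence
                in_fence = !in_fence
                next
            }
            # Four spaces, any further spaces, then "N. " outside a fence.
            if (!in_fence && $0 ~ /^    [ ]*[0-9]+\. /) {
                print FNR ": " $0
            }
        }
    ' "$1"
}

# Demo on a throwaway file; \140 is the octal escape for a backtick.
sample="$(mktemp)"
printf '%s\n' \
    '    1. Reinitialize the Swarm' \
    '' \
    '2. Join Hosts to the New Swarm' > "$sample"
printf '\140\140\140bash\n' >> "$sample"
printf '%s\n' '    1. inside a fence, so not reported' >> "$sample"
printf '\140\140\140\n' >> "$sample"
printf '%s\n' '    3. Removing Old Swarm Nodes' >> "$sample"

find_indented_steps "$sample"   # reports lines 1 and 7 of the sample
rm -f "$sample"
```

Intentionally nested list items are indented the same way and would also be reported, so each hit is a candidate to inspect rather than a confirmed error.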

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a7e8a31 and 6f9fedf.

📒 Files selected for processing (2)
  • docs/disaster-recovery/disaster-recovery.md
  • mkdocs.yml

@susan-pgedge susan-pgedge marked this pull request as ready for review February 24, 2026 13:23

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/disaster-recovery/disaster-recovery.md (1)

273-289: ⚠️ Potential issue | 🟡 Minor

markdownlint MD046: fenced blocks vs. indented style.

MD046 expects indented code blocks in these sections, so the newly fenced blocks will keep failing lint. Either convert these to indented blocks or adjust the rule for this doc.

✏️ Representative conversion (apply to all flagged blocks)
-```sh
-curl -X POST \
-    "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/databases/<DB_ID>?remove_host=<LOST_HOST_ID>" \
-    -H "Content-Type: application/json" \
-    -d '{
-        "spec": {
-            "database_name": "<DB_NAME>",
-            "database_users": [{"username": "admin", "db_owner": true,
-                                "attributes": ["SUPERUSER", "LOGIN"]}],
-            "port": 5432,
-            "nodes": [
-                { "name": "n1", "host_ids": ["host-1"] },
-                { "name": "n2", "host_ids": ["host-2"] }
-            ]
-        }
-    }'
-```
+    curl -X POST \
+        "http://${RECOVERY_HOST_IP}:${API_PORT}/v1/databases/<DB_ID>?remove_host=<LOST_HOST_ID>" \
+        -H "Content-Type: application/json" \
+        -d '{
+            "spec": {
+                "database_name": "<DB_NAME>",
+                "database_users": [{"username": "admin", "db_owner": true,
+                                    "attributes": ["SUPERUSER", "LOGIN"]}],
+                "port": 5432,
+                "nodes": [
+                    { "name": "n1", "host_ids": ["host-1"] },
+                    { "name": "n2", "host_ids": ["host-2"] }
+                ]
+            }
+        }'

Also applies to: 301-305, 322-325, 386-403, 413-416

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/disaster-recovery.md` around lines 273 - 289, Convert
the fenced shell code blocks in disaster-recovery.md (the curl POST examples
such as the block starting with curl -X POST
"http://${RECOVERY_HOST_IP}:${API_PORT}/v1/databases/<DB_ID>?remove_host=<LOST_HOST_ID>"
...) to indented code block style to satisfy markdownlint MD046 (or
alternatively add an explicit rule disable for MD046 in this document's
frontmatter); apply the same conversion to the other flagged blocks (the
additional curl/command examples referenced in the review) so all examples use
indented code blocks or the rule is disabled consistently.
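
One way to take the "adjust the rule" route is sketched below. It assumes the repository lints with markdownlint and reads a `.markdownlint.yaml` from its root, which this PR does not show. `write_md046_config` is a hypothetical helper; `style: fenced` is the MD046 setting that accepts fenced blocks and reports indented ones.

```shell
#!/usr/bin/env bash
# Sketch: write a markdownlint config that pins MD046 to fenced code blocks.
# The target directory is an assumption; a real setup would use the repo root.

write_md046_config() {
    cat > "$1/.markdownlint.yaml" <<'EOF'
# MD046 (code-block-style): accept fenced blocks, report indented ones.
MD046:
  style: fenced
EOF
}

# Demo in a throwaway directory so nothing in the working tree is touched.
demo_dir="$(mktemp -d)"
write_md046_config "$demo_dir"
cat "$demo_dir/.markdownlint.yaml"
rm -rf "$demo_dir"
```

To relax the rule for a single document instead, markdownlint also honors an inline `<!-- markdownlint-disable MD046 -->` comment placed in that file.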
♻️ Duplicate comments (1)
docs/disaster-recovery/disaster-recovery.md (1)

109-167: ⚠️ Potential issue | 🟡 Minor

Outdent top‑level numbered steps so they render as lists.

These numbered steps are still indented four spaces at the document root, so Markdown will render them as code blocks rather than ordered lists. Please outdent the list markers (and keep their content indented) in all three sections.

✏️ Representative fix (apply to all three sections)
-    1. Reinitialize the Swarm
+1. Reinitialize the Swarm

-    On a surviving manager, invoke the following command:
+    On a surviving manager, invoke the following command:

     ```bash
     docker swarm init --force-new-cluster \
         --advertise-addr ${RECOVERY_HOST_IP}
     ```

Also applies to: 195-265, 340-377

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/disaster-recovery.md` around lines 109 - 167, The
numbered sections "1. Reinitialize the Swarm", "2. Join Hosts to the New Swarm",
and "3. Removing Old Swarm Nodes" are indented and render as code blocks;
outdent each top-level list marker (the "1.", "2.", "3." lines) to the document
root while keeping the subsequent code fences and wrapped lines indented as
shown (e.g., the docker swarm init block, docker swarm join-token / join
examples, and docker node demote / docker service rm examples). Apply the same
outdent fix to the other occurrences noted (the ranges around lines 195-265 and
340-377) so all three sections render as proper ordered lists instead of code
blocks.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a7e8a31 and ffd5d54.

📒 Files selected for processing (2)
  • docs/disaster-recovery/disaster-recovery.md
  • mkdocs.yml

Comment on lines 181 to 193
- all server-mode hosts were offline: On a Swarm manager, stop all
Control Plane services with the command:
`docker service scale control-plane_<host-id-1>=0 control-plane_<host-id-2>=0 ...`
Then, restore the data volume from your
backup (see [Data Volume Restore](#data-volume-restore)).
- at least one server-mode host was still up: Use that host as the
recovery host. You do not need to stop services or restore the volume
first.
- you already completed [Restoring Docker Swarm](#restoring-docker-swarm):
Restore the data volume on the surviving host if it was lost (see
[Data volume restore](#data-volume-restore)); otherwise skip.

Then on the recovery host, perform the following steps:

@coderabbitai coderabbitai bot Feb 24, 2026

⚠️ Potential issue | 🟠 Major

Stop Control Plane services before moving live etcd data.

Step 1 renames the etcd directory. If the recovery host still has Control Plane services running, moving that directory can corrupt or race active processes. Add an explicit stop/scale‑down step before the backup/restore actions.

🛠️ Suggested clarification
 Then on the recovery host, perform the following steps:

+If any Control Plane services are running on the recovery host, stop them
+before step 1 to avoid moving live etcd data (for example:
+`docker service scale control-plane_<host-id>=0`).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/disaster-recovery.md` around lines 181 - 193, The doc
omits an explicit step to stop/scale down Control Plane services on the recovery
host before renaming/moving the etcd directory (Step 1), which can cause
corruption/races; update the "Then on the recovery host, perform the following
steps:" sequence to add a clear pre-step that instructs operators to stop or
scale down Control Plane services (e.g., using docker service scale
control-plane_<host-id>=0 or equivalent) on the recovery host prior to any
backup/restore or moving of the etcd data directory, and reference the related
sections ("Restoring Docker Swarm" / "Data Volume Restore") so readers know when
to skip the stop if services are already offline.

Member Author

@mmols or @moizpgedge

Does this require an update?

@coderabbitai coderabbitai bot

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Contributor

Yes, we should update it. For the case where at least one server-mode host was still up, add a step to stop the Control Plane service on the recovery host (e.g., scale it to 0) before the steps that rename or move the etcd directory, so that live data is not moved while the service is running.
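
A minimal sketch of the ordering described here: scale the recovery host's Control Plane service to zero, then move the etcd directory. It assumes the `control-plane_<host-id>` service naming quoted in this review. `run` and `stop_then_move_etcd` are hypothetical helpers, the host ID and paths are placeholders, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch: stop the Control Plane service before the etcd directory is moved.
# Defaults to a dry run that prints each command instead of executing it.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"      # show the command only
    else
        "$@"                    # execute the command for real
    fi
}

stop_then_move_etcd() {
    host_id="$1"      # recovery host ID (placeholder value in the demo)
    data_dir="$2"     # stands in for PGEDGE_DATA_DIR
    backup_dir="$3"   # stands in for ETCD_BACKUP_DIR

    # Scale the service to zero first so nothing is writing to etcd.
    run docker service scale "control-plane_${host_id}=0" || return 1
    # Only then set the existing etcd data aside.
    run mv "${data_dir}/etcd" "${backup_dir}"
}

stop_then_move_etcd "host-1" "/path/to/pgedge-data" "/path/to/etcd-backup"
```

Returning early when the scale command fails keeps the move from running against a service that may still be live.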

Member Author

@moizpgedge Can you make that change inline here?


You should see one host with `status: "reachable"` and
`etcd_mode: "server"`. Then continue with the next section:
[Updating Databases to Remove Old Hosts](#updating-databases-to-remove-old-hosts).

Contributor

In the document view, the hyperlink is not displayed and appears as plain text.


- If Control Plane is already deployed as Swarm services (if the etcd
quorum was lost and you did not run
[Restoring Docker Swarm](#restoring-docker-swarm)):

Contributor

The link text "Restoring Docker Swarm" appears as plain text in the document view.

[Restoring Docker Swarm](#restoring-docker-swarm)):
`docker service scale control-plane_${RECOVERY_HOST_ID}=1`

- If you completed [Restoring Docker Swarm](#restoring-docker-swarm)

Contributor

Member Author

Reformatted...

Fixing link comment

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (2)
docs/disaster-recovery/disaster-recovery.md (2)

109-164: ⚠️ Potential issue | 🟡 Minor

Ordered steps still render as code blocks due to 4-space indentation.

Line 109 and the equivalent step markers at lines 196+ and 341+ are still indented by four spaces, so they render as code instead of numbered steps in MkDocs/Python-Markdown.

Proposed fix
-    1. Reinitialize the Swarm
+1. Reinitialize the Swarm
...
-    2. Join Hosts to the New Swarm
+2. Join Hosts to the New Swarm
...
-    3. Removing Old Swarm Nodes
+3. Removing Old Swarm Nodes

Apply the same outdent pattern to the numbered markers in “Restoring the Control Plane” and “Re-adding Hosts”.

Also applies to: 196-267, 341-378

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/disaster-recovery.md` around lines 109 - 164, The
numbered list markers (e.g., the "1. Reinitialize the Swarm" heading and the
equivalent markers in the "Restoring the Control Plane" and "Re-adding Hosts"
sections) are prefixed with four spaces which causes Markdown to render them as
code blocks; remove the leading four-space indent from those numbered lines so
the lists render as normal ordered lists, ensuring fenced code blocks and their
internal indentation remain unchanged and apply the same outdent change to the
other occurrences you noted (the sections around lines 196–267 and 341–378).

187-202: ⚠️ Potential issue | 🟠 Major

Conflicting guidance: services may still be live when etcd data is moved.

Lines 187-189 say services do not need to be stopped when a server-mode host remains, but lines 196-202 immediately rename/move etcd data. That can race with a running Control Plane service on the recovery host.

Proposed fix
-- at least one server-mode host was still up: Use that host as the
-  recovery host. You do not need to stop services or restore the volume
-  first.
+- at least one server-mode host was still up: Use that host as the
+  recovery host. On that recovery host, stop/scale down the Control Plane
+  service before step 1 so etcd data is not moved while live.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/disaster-recovery.md` around lines 187 - 202, The
current instructions move the etcd data directory (PGEDGE_DATA_DIR/etcd ->
ETCD_BACKUP_DIR) while earlier text says services need not be stopped, creating
a race with a running control-plane; update the procedure to either (a)
explicitly stop the control-plane/etcd-related services on the recovery host
before moving the directory (add a short step to stop the service(s) hosting
etcd), or (b) if downtime is unacceptable, replace the mv step with an
etcd-consistent snapshot approach (e.g., use etcdctl snapshot save) and document
restoring from that snapshot; reference the existing variables and operations
(PGEDGE_DATA_DIR, etcd directory move to ETCD_BACKUP_DIR, and the mv action) so
the change is applied where the backup/move is described.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ffd5d54 and ecb21fa.

📒 Files selected for processing (1)
  • docs/disaster-recovery/disaster-recovery.md

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (2)
docs/disaster-recovery/disaster-recovery.md (2)

181-183: ⚠️ Potential issue | 🟠 Major

Add an explicit “stop Control Plane service” pre-step before moving etcd data.

Lines 181-183 currently say services do not need to be stopped when at least one server-mode host remains, but lines 190+ immediately move etcd data. That can race with live processes. Add a clear pre-step to scale down or stop the Control Plane service on the recovery host before Step 1.

Suggested clarification
-- at least one server-mode host was still up: Use that host as the
-  recovery host. You do not need to stop services or restore the volume
-  first.
+- at least one server-mode host was still up: Use that host as the
+  recovery host. Before moving etcd data, stop/scale down Control Plane
+  service(s) on that recovery host (for example,
+  `docker service scale control-plane_<host-id>=0`). Restore the volume
+  only if needed.

Also applies to: 188-205

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/disaster-recovery.md` around lines 181 - 183, Add an
explicit pre-step to stop/scale down the Control Plane services on the chosen
recovery host before any instructions that move etcd data: insert a clear step
such as "Stop/scale down the Control Plane service on the recovery host (e.g.,
`docker service scale control-plane_<host-id>=0`) and verify it is not running"
immediately before the existing Step 1 where etcd data is moved in the section
currently describing recovery when "at least one server-mode host was still up"
(also apply the same pre-step to the parallel block covering lines 188-205).
Mention verifying services are stopped to avoid races with live processes.

109-157: ⚠️ Potential issue | 🟡 Minor

Fix ordered-list indentation so recovery steps render as steps, not code.

The numbered items are still indented by 4 spaces, so Markdown renders them as code blocks instead of ordered lists (e.g., Line 109, Line 190, Line 335). This makes operational steps harder to follow in docs UI.

Suggested fix (representative)
-    1. Reinitialize the Swarm; on a surviving manager, invoke the following command:
+1. Reinitialize the Swarm; on a surviving manager, invoke the following command:

-       ```bash
-       docker swarm init --force-new-cluster \
-           --advertise-addr ${RECOVERY_HOST_IP}
-       ```
+    ```bash
+    docker swarm init --force-new-cluster \
+        --advertise-addr ${RECOVERY_HOST_IP}
+    ```

-    2. Join Hosts to the New Swarm.  If you have other surviving nodes that should be
+2. Join Hosts to the New Swarm. If you have other surviving nodes that should be

Also applies to: 190-260, 335-372

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/disaster-recovery.md` around lines 109 - 157, Remove
the leading 4-space indentation that turns ordered steps into code blocks in
disaster-recovery.md: un-indent the numbered headings ("Reinitialize the Swarm",
"Join Hosts to the New Swarm", "Removing Old Swarm Nodes") so they start at
column 0 (e.g., change "    1. ..." to "1. ..."), and ensure the fenced code
blocks for commands use non-indented triple-backtick fences aligned with the
list (e.g., ```bash on its own line, then the commands, then ```), fixing the
blocks shown around the docker swarm init/join/node/service examples so they
render as proper ordered list steps rather than code blocks.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ecb21fa and 2a0955b.

📒 Files selected for processing (1)
  • docs/disaster-recovery/disaster-recovery.md

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
docs/disaster-recovery/disaster-recovery.md (1)

105-105: Optional: Consider simplifying "a majority of" to "most".

For slightly more concise language, you could replace "a majority of managers" with "most managers," though the current phrasing is perfectly acceptable in technical documentation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/disaster-recovery.md` at line 105, Replace the phrase
"a majority of managers" with the simpler wording "most managers" in the
disaster recovery text (locate the exact string "a majority of managers" in
disaster-recovery.md) so the sentence reads more concise while preserving
meaning; ensure surrounding sentence grammar and any linked references remain
correct after the substitution.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d3ca510 and b623f42.

📒 Files selected for processing (1)
  • docs/disaster-recovery/disaster-recovery.md

Comment on lines 180 to 189
- at least one server-mode host was still up: Use that host as the
recovery host. You do not need to stop services or restore the volume
first.
- you already completed [Restoring Docker Swarm](#restoring-docker-swarm):
Restore the data volume on the surviving host if it was lost (see
[Data volume restore](#data-volume-restore)); otherwise skip.

Then on the recovery host, perform the following steps:

1. Back up existing etcd data and set aside for restore:

⚠️ Potential issue | 🟠 Major

Stop the Control Plane service on the recovery host before moving etcd data.

The current text at lines 180-182 states "You do not need to stop services or restore the volume first" for the case where at least one server-mode host is still up. However, step 1 (line 189) immediately moves the etcd directory, which can cause corruption if the Control Plane service is still running on the recovery host and actively using that directory.

Per the discussion in past reviews (moizpgedge confirmed), you must stop the service on the recovery host before moving the etcd directory to avoid racing with live processes.

🛠️ Proposed fix
 - at least one server-mode host was still up: Use that host as the
-  recovery host. You do not need to stop services or restore the volume
-  first.
+  recovery host. Stop the Control Plane service on that host with:
+  `docker service scale control-plane_<recovery-host-id>=0`
+  You do not need to restore the volume from backup.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/disaster-recovery.md` around lines 180 - 189, Update
the guidance so the recovery host's Control Plane service is stopped before
moving etcd data: change the sentence that currently says "You do not need to
stop services..." to explicitly require stopping the Control Plane service on
the recovery host, and insert a clear preparatory step before "Back up existing
etcd data..." that instructs operators to stop the Control Plane service (and
confirm it's not running) on the recovery host to avoid races with the etcd
directory move.
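
The "confirm it's not running" part could be checked by reading the service's replica count. The following is a rough sketch, assuming the `running/desired` form that `docker service ls` prints in its REPLICAS column. `is_scaled_to_zero` is a hypothetical helper and the sample values are made up.

```shell
#!/usr/bin/env bash
# Sketch: decide from a REPLICAS value ("running/desired") whether a Swarm
# service has been scaled to zero. is_scaled_to_zero is a hypothetical helper.

is_scaled_to_zero() {
    case "$1" in
        "0/0"|"0/0 "*) return 0 ;;   # nothing running and nothing desired
        *)             return 1 ;;   # tasks still running or still desired
    esac
}

# Demo with made-up values instead of live docker output.
for replicas in "0/0" "1/1" "0/1"; do
    if is_scaled_to_zero "$replicas"; then
        echo "$replicas: stopped"
    else
        echo "$replicas: not stopped"
    fi
done
```

On a Swarm manager the value could come from `docker service ls --filter "name=control-plane_<host-id>" --format '{{.Replicas}}'`. The `name` filter may match more than one service, so check each returned line.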
