Skip to content

Conversation

@teamconfx
Copy link
Contributor

A fix for HBASE-29806: RegionRemoteProcedureBase.afterReplay() causes NPE when parent procedure has already completed.

Root Cause

When the master restarts after a crash:

  1. Parent TransitRegionStateProcedure may have completed and is placed in the completed map
  2. Child RegionRemoteProcedureBase may still be pending and is placed in the procedures map
  3. afterReplay() is called on the child, which calls getParent(env).attachRemoteProc(this)
  4. getParent() uses getProcedure(procId) which only checks the procedures map, not the completed map
  5. This returns null, causing NPE when calling attachRemoteProc() on it

Changes Made

File: hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionRemoteProcedureBase.java

  1. Fixed afterReplay() method (lines 434-451):
    - Added null check for parent procedure
    - If parent is null, logs a warning and returns gracefully
    - The orphaned child procedure will be cleaned up by the procedure executor
  2. Fixed unattach() method (lines 289-296):
    - Added null check for parent procedure
    - If parent is null, silently skips unattach since parent has already completed

Test Added

File: hbase-server/src/test/java/org/apache/hadoop/hbase/master/assignment/TestRegionRemoteProcedureBaseOrphanAfterReplay.java

A unit test that:

  • Mocks the MasterProcedureEnv to return null when getProcedure() is called
  • Simulates the orphaned child scenario
  • Verifies that afterReplay() doesn't throw NPE

Files Changed

  • hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionRemoteProcedureBase.java (modified)
  • hbase-server/src/test/java/org/apache/hadoop/hbase/master/assignment/TestRegionRemoteProcedureBaseOrphanAfterReplay.java (new)

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 28s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
_ master Compile Tests _
+1 💚 mvninstall 3m 31s master passed
+1 💚 compile 3m 32s master passed
+1 💚 checkstyle 1m 2s master passed
+1 💚 spotbugs 1m 43s master passed
+1 💚 spotless 0m 51s branch has no errors when running spotless:check.
_ Patch Compile Tests _
+1 💚 mvninstall 3m 6s the patch passed
+1 💚 compile 3m 28s the patch passed
+1 💚 javac 3m 28s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 59s the patch passed
+1 💚 spotbugs 1m 45s the patch passed
+1 💚 hadoopcheck 12m 3s Patch does not cause any errors with Hadoop 3.3.6 3.4.1.
+1 💚 spotless 0m 46s patch has no errors when running spotless:check.
_ Other Tests _
+1 💚 asflicense 0m 11s The patch does not generate ASF License warnings.
41m 18s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7667/1/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #7667
JIRA Issue HBASE-29806
Optional Tests dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless
uname Linux d584cb8053ae 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 50399a5
Default Java Eclipse Adoptium-17.0.11+9
Max. process+thread count 84 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7667/1/console
versions git=2.34.1 maven=3.9.8 spotbugs=4.7.3
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 29s Docker mode activated.
-0 ⚠️ yetus 0m 3s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+1 💚 mvninstall 3m 25s master passed
+1 💚 compile 0m 58s master passed
+1 💚 javadoc 0m 32s master passed
+1 💚 shadedjars 5m 46s branch has no errors when building our shaded downstream artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 3m 0s the patch passed
+1 💚 compile 0m 58s the patch passed
+1 💚 javac 0m 58s the patch passed
+1 💚 javadoc 0m 27s the patch passed
+1 💚 shadedjars 5m 42s patch has no errors when building our shaded downstream artifacts.
_ Other Tests _
+1 💚 unit 231m 35s hbase-server in the patch passed.
257m 55s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7667/1/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
GITHUB PR #7667
JIRA Issue HBASE-29806
Optional Tests javac javadoc unit compile shadedjars
uname Linux 307bdb571a67 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 50399a5
Default Java Eclipse Adoptium-17.0.11+9
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7667/1/testReport/
Max. process+thread count 4361 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7667/1/console
versions git=2.34.1 maven=3.9.8
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache9
Copy link
Contributor

Apache9 commented Jan 23, 2026

A parent procedure can not be completed while a sub procedure is still pending.

@teamconfx
Copy link
Contributor Author

Thank you for the feedback. You're correct that under normal operation, a parent procedure cannot complete while a sub procedure is still pending - this is the intended invariant.

However, this NPE occurs specifically during crash recovery (afterReplay()), not during normal execution. During crash recovery, the procedure store may contain inconsistent state that violates this invariant due to:

  1. Partial state persisted before crash - The multi-step rollback process can be interrupted at various points
  2. Non-atomic store operations - The store.delete() calls for parent and child procedures are not atomic

The HBase codebase already handles this scenario in other places. For example, in ProcedureExecutor.countDownChildren() (line 1985-1989):

  Procedure<TEnvironment> parent = procedures.get(procedure.getParentProcId());                                                                                                                                                             
  if (parent == null) {                                                                                                                                                                                                                     
    assert procStack.isRollingback();                                                                                                                                                                                                       
    return;                                                                                                                                                                                                                                 
  }         

This shows that null parent scenarios are already recognized and handled during rollback.

The setKillAndToggleBeforeStoreUpdateInRollback test mechanism exists specifically to test crash recovery scenarios. The fact that this test exposes the NPE demonstrates that crash recovery can lead to states where a child exists
without its parent.

The proposed fix is a defensive null check - consistent with the pattern already used in countDownChildren(). This ensures the master can recover even from unexpected procedure store states, rather than failing startup with an NPE.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants