Add hotspot shard split design document and versioned route catalog persistence helpers #316
Conversation
- introduce RouteDescriptor, CatalogSnapshot, and RouteState
- add reserved catalog key helpers and route/version codecs
- implement CatalogStore Snapshot/Save with optimistic version checks
- add catalog tests for round-trip, validation, mismatch, and overflow cases
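To make the commit message above easier to follow, here is a rough sketch of the data model it introduces. Field names are taken from the tests quoted later in this thread; the exact types, any additional fields, and further RouteState values are assumptions, and the authoritative definitions live in distribution/catalog.go.

// Sketch only: inferred from this PR's tests; the real definitions in
// distribution/catalog.go may differ in detail.
package distribution

// RouteState describes the lifecycle phase of a route entry.
// Only RouteStateActive is visible in the tests; further states are assumptions.
type RouteState uint8

const (
	RouteStateActive RouteState = iota
)

// RouteDescriptor describes one key range and the raft group that owns it.
type RouteDescriptor struct {
	RouteID       uint64
	ParentRouteID uint64 // route this one was split from, if any
	GroupID       uint64
	State         RouteState
	Start         []byte // inclusive
	End           []byte // exclusive; nil means +inf
}

// CatalogSnapshot is a point-in-time view of all routes plus the single
// global catalog version used for optimistic saves.
type CatalogSnapshot struct {
	Version uint64
	Routes  []RouteDescriptor
}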
Summary of Changes

Hello @bootjp, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request lays the foundational groundwork for Elastickv's hotspot shard splitting feature. It introduces a durable and versioned route catalog, which is crucial for maintaining consistent routing information across the distributed system. Additionally, it provides comprehensive design and implementation plans, guiding the future development of this complex feature by detailing the architecture, data models, and a phased rollout strategy.
Code Review
This pull request introduces the foundational components for hotspot shard splitting by adding a durable, versioned route catalog. The changes include the core logic for catalog persistence with optimistic concurrency control, along with comprehensive unit tests. Additionally, it adds detailed design documents outlining the overall strategy and the plan for the first milestone.
The code is well-structured and the implementation of the catalog store is robust. The tests cover a good range of scenarios, ensuring the reliability of the new components. The design documents are thorough and provide excellent context for the feature.
I have a few suggestions to improve code clarity, enhance test coverage for an edge case, and align the design documentation with the current implementation and established design principles.
func u64ToInt(v uint64) (int, error) {
	if strconv.IntSize == 32 && v > uint64(^uint32(0)>>1) {
		return 0, errors.WithStack(ErrCatalogInvalidRouteRecord)
	}
	if strconv.IntSize == 64 && v > (^uint64(0)>>1) {
		return 0, errors.WithStack(ErrCatalogInvalidRouteRecord)
	}
	return int(v), nil
}
This function correctly checks for uint64 to int conversion overflow, but it can be simplified for better readability and maintainability by using a constant for the maximum integer value, which avoids platform-specific checks.
func u64ToInt(v uint64) (int, error) {
const maxInt = int(^uint(0) >> 1)
if v > uint64(maxInt) {
return 0, errors.WithStack(ErrCatalogInvalidRouteRecord)
}
return int(v), nil
}
func assertRouteEqual(t *testing.T, want, got RouteDescriptor) {
	t.Helper()
	if want.RouteID != got.RouteID {
		t.Fatalf("route id mismatch: want %d, got %d", want.RouteID, got.RouteID)
	}
	if want.GroupID != got.GroupID {
		t.Fatalf("group id mismatch: want %d, got %d", want.GroupID, got.GroupID)
	}
	if want.ParentRouteID != got.ParentRouteID {
		t.Fatalf("parent route id mismatch: want %d, got %d", want.ParentRouteID, got.ParentRouteID)
	}
	if want.State != got.State {
		t.Fatalf("state mismatch: want %d, got %d", want.State, got.State)
	}
	if !bytes.Equal(want.Start, got.Start) {
		t.Fatalf("start mismatch: want %q, got %q", want.Start, got.Start)
	}
	if !bytes.Equal(want.End, got.End) {
		t.Fatalf("end mismatch: want %q, got %q", want.End, got.End)
	}
}
The code correctly handles commitTS overflow in prepareSave, but this case is not covered by tests. Adding a test for this scenario would improve coverage and prevent future regressions.
func assertRouteEqual(t *testing.T, want, got RouteDescriptor) {
	t.Helper()
	if want.RouteID != got.RouteID {
		t.Fatalf("route id mismatch: want %d, got %d", want.RouteID, got.RouteID)
	}
	if want.GroupID != got.GroupID {
		t.Fatalf("group id mismatch: want %d, got %d", want.GroupID, got.GroupID)
	}
	if want.ParentRouteID != got.ParentRouteID {
		t.Fatalf("parent route id mismatch: want %d, got %d", want.ParentRouteID, got.ParentRouteID)
	}
	if want.State != got.State {
		t.Fatalf("state mismatch: want %d, got %d", want.State, got.State)
	}
	if !bytes.Equal(want.Start, got.Start) {
		t.Fatalf("start mismatch: want %q, got %q", want.Start, got.Start)
	}
	if !bytes.Equal(want.End, got.End) {
		t.Fatalf("end mismatch: want %q, got %q", want.End, got.End)
	}
}

func TestCatalogStoreSaveRejectsCommitTSOverflow(t *testing.T) {
	st := store.NewMVCCStore()
	ctx := context.Background()
	// Set catalog version to 10 at a high timestamp.
	if err := st.PutAt(ctx, CatalogVersionKey(), EncodeCatalogVersion(10), ^uint64(0)-1, 0); err != nil {
		t.Fatalf("setup: %v", err)
	}
	// Set last commit TS to max uint64. This will be the readTS.
	if err := st.PutAt(ctx, []byte("foo"), []byte("bar"), ^uint64(0), 0); err != nil {
		t.Fatalf("setup: %v", err)
	}
	cs := NewCatalogStore(st)
	// Save with expected version 10. This will read at readTS = ^uint64(0),
	// and then try to commit at commitTS = readTS + 1, which overflows to 0.
	_, err := cs.Save(ctx, 10, []RouteDescriptor{
		{RouteID: 1, Start: []byte(""), End: nil, GroupID: 1, State: RouteStateActive},
	})
	if !errors.Is(err, ErrCatalogVersionOverflow) {
		t.Fatalf("expected ErrCatalogVersionOverflow, got %v", err)
	}
}
docs/hotspot_shard_split_design.md (Outdated)
- `start []byte` (inclusive)
- `end []byte` (exclusive, nil=+inf)
- `group_id uint64`
- `version uint64` (catalog generation)
The RouteDescriptor data model defined here includes a version field. However, the Go implementation in distribution/catalog.go does not have this field on the RouteDescriptor struct. Instead, it uses a single global version for the entire catalog snapshot. This aligns with the design principle that Optimistic Concurrency Control (OCC) conflict checks are not required at the store layer (Rule 1), as the leader is responsible for issuing consistent timestamps. Please update the documentation to reflect the current implementation and this design principle, clarifying that per-record versioning for OCC is not used at the store layer.
References
- Optimistic Concurrency Control (OCC) conflict checks are not required at the store layer because the leader is responsible for issuing consistent timestamps, which is expected to prevent write conflicts.
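To illustrate the catalog-level CAS model described above, here is a minimal usage sketch. The Save shape matches the test in this thread (Save(ctx, expectedVersion, routes)), but the Snapshot return type, the ErrCatalogVersionMismatch name, and the errors import path are assumptions to be checked against distribution/catalog.go.

// Sketch only: retry loop around the catalog-level compare-and-set.
package distribution

import (
	"context"

	"github.com/cockroachdb/errors" // assumed errors package; adjust to the project's
)

// updateRoutes rebuilds the route set from the latest snapshot and saves it,
// retrying when another writer bumped the global catalog version first.
func updateRoutes(ctx context.Context, cs *CatalogStore, rebuild func([]RouteDescriptor) []RouteDescriptor) error {
	for {
		snap, err := cs.Snapshot(ctx)
		if err != nil {
			return err
		}
		// Save succeeds only if the catalog version still equals snap.Version;
		// otherwise the snapshot is stale and the loop reloads and retries.
		if _, err := cs.Save(ctx, snap.Version, rebuild(snap.Routes)); err != nil {
			if errors.Is(err, ErrCatalogVersionMismatch) { // assumed error name
				continue
			}
			return err
		}
		return nil
	}
}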
- fix gosec G115 by simplifying uint64->int overflow guard in u64ToInt
- add commit timestamp overflow test for CatalogStore.Save
- align hotspot split design doc with implementation:
  - remove per-route version from RouteDescriptor
  - document global catalog snapshot versioning and catalog-level CAS model
Pull request overview
Adds milestone/planning documentation for hotspot shard splitting and introduces a first control-plane building block: a durable, versioned route catalog persisted in MVCC state.
Changes:
- Added hotspot split design doc and Milestone 1 PR plan docs.
- Introduced distribution.CatalogStore with reserved-key encoding/decoding for catalog version + route descriptors (a rough sketch of that encoding follows below).
- Added unit tests covering catalog codecs, optimistic version checks, deletes, and edge cases (duplicate IDs, invalid ranges, overflow).
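The "reserved-key encoding/decoding" bullet above can be pictured roughly as follows. CatalogVersionKey and EncodeCatalogVersion are names that appear in the tests; the prefix value, the route-key layout, and the routeKeyFor helper are purely illustrative assumptions.

// Sketch only: the real reserved prefix and layouts live in distribution/catalog.go.
package distribution

import "encoding/binary"

// catalogPrefix keeps catalog records out of the user keyspace (assumed value).
var catalogPrefix = []byte("!catalog/")

// CatalogVersionKey is the key under which the global catalog version is stored.
func CatalogVersionKey() []byte {
	return append(append([]byte(nil), catalogPrefix...), "version"...)
}

// routeKeyFor is a hypothetical per-route key: prefix + "route/" + big-endian route ID.
func routeKeyFor(routeID uint64) []byte {
	key := append(append([]byte(nil), catalogPrefix...), "route/"...)
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, routeID)
	return append(key, buf...)
}

// EncodeCatalogVersion encodes the version as a big-endian uint64 (assumed encoding).
func EncodeCatalogVersion(v uint64) []byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, v)
	return buf
}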
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| docs/hotspot_split_milestone1_pr_plan.md | Milestone 1 implementation plan + PR breakdown and rollout checklist. |
| docs/hotspot_shard_split_design.md | End-to-end design proposal for hotspot detection, splitting, migration, and routing consistency. |
| distribution/catalog.go | New MVCC-backed route catalog persistence helpers (keys, codecs, snapshot/save). |
| distribution/catalog_test.go | Unit tests for catalog encoding/decoding and versioned save/snapshot behavior. |
distribution/catalog.go (Outdated)
commitTS := readTS + 1
if commitTS == 0 {
	return savePlan{}, errors.WithStack(ErrCatalogVersionOverflow)
}
CatalogStore.Save computes commitTS as readTS+1 in prepareSave. If the underlying MVCC store advances LastCommitTS (e.g., due to concurrent user writes with higher HLC timestamps) between prepareSave and ApplyMutations, this commitTS can become < current LastCommitTS and the catalog write will appear to have committed “in the past”, breaking snapshot/time ordering. Consider choosing commitTS immediately before ApplyMutations as max(store.LastCommitTS()+1, readTS+1) (with overflow handling), or require the caller to supply a commitTS/clock so catalog commits are always strictly increasing.
Suggested change:

- commitTS := readTS + 1
- if commitTS == 0 {
- 	return savePlan{}, errors.WithStack(ErrCatalogVersionOverflow)
- }
+ // Choose a commit timestamp that is strictly greater than both the read
+ // timestamp and the current store LastCommitTS to preserve ordering.
+ candidateFromRead := readTS + 1
+ if candidateFromRead == 0 {
+ 	return savePlan{}, errors.WithStack(ErrCatalogVersionOverflow)
+ }
+ currentLast := s.store.LastCommitTS()
+ candidateFromStore := currentLast + 1
+ if candidateFromStore == 0 {
+ 	return savePlan{}, errors.WithStack(ErrCatalogVersionOverflow)
+ }
+ commitTS := candidateFromRead
+ if candidateFromStore > commitTS {
+ 	commitTS = candidateFromStore
+ }
- recompute catalog commit TS at apply time to avoid stale planned timestamps
- ensure commit TS is greater than both planned minimum and current store LastCommitTS
- keep overflow checks for commit TS calculation
- add regression test for concurrent LastCommitTS advancement between prepare and apply
…stickv into feature/hotspot-re-allocation
/gemini review
Code Review
This pull request introduces the foundational components for hotspot shard splitting by adding a durable and versioned route catalog. The changes include the core logic for catalog persistence, serialization/deserialization of route data, and a comprehensive test suite. Additionally, two markdown files are added, providing a detailed design document and a phased implementation plan. The code is well-structured and the tests are thorough. I have identified one critical compilation issue that needs to be addressed and one medium-severity suggestion for a potential performance improvement in the future.
func (s *CatalogStore) buildSaveMutations(ctx context.Context, plan savePlan) ([]*store.KVPairMutation, error) {
	existingKeys, err := s.routeKeysAt(ctx, plan.readTS)
	if err != nil {
		return nil, err
	}

	mutations := make([]*store.KVPairMutation, 0, len(existingKeys)+len(plan.routes)+1)
	mutations = appendDeleteMutations(mutations, existingKeys)
	mutations, err = appendRouteMutations(mutations, plan.routes)
	if err != nil {
		return nil, err
	}
	mutations = append(mutations, &store.KVPairMutation{
		Op:    store.OpTypePut,
		Key:   CatalogVersionKey(),
		Value: EncodeCatalogVersion(plan.nextVersion),
	})
	return mutations, nil
}
The current implementation of buildSaveMutations uses a "delete-all, then write-all" strategy to update the route catalog. It fetches all existing keys, creates delete mutations for them, and then creates put mutations for the new set of routes. While simple, this can be inefficient if the catalog contains many routes and only a few are changing, as it leads to unnecessary writes and can increase transaction contention.
For a future improvement, consider implementing a diff-based approach:
- Read the full current set of RouteDescriptors, not just the keys.
- Compare the current routes with the new routes to identify which are added, modified, or deleted.
- Generate mutations only for the routes that have actually changed.
This would reduce write amplification and improve performance for large catalogs.
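As a rough sketch of that diff-based approach, assuming the same mutation helpers shown in the excerpt above (appendDeleteMutations, appendRouteMutations); routesEqual is defined inline and routeKeyFor stands in for the real per-route key helper:

// Sketch only: emit mutations just for routes that were added, changed, or removed.
// The caller would still append the catalog-version put mutation as today.
func buildDiffMutations(current, desired []RouteDescriptor) ([]*store.KVPairMutation, error) {
	desiredByID := make(map[uint64]struct{}, len(desired))
	for _, r := range desired {
		desiredByID[r.RouteID] = struct{}{}
	}

	// Routes present now but absent from the desired set get delete mutations.
	currentByID := make(map[uint64]RouteDescriptor, len(current))
	var staleKeys [][]byte
	for _, r := range current {
		currentByID[r.RouteID] = r
		if _, keep := desiredByID[r.RouteID]; !keep {
			staleKeys = append(staleKeys, routeKeyFor(r.RouteID))
		}
	}

	// Routes that are new or whose contents changed get put mutations.
	changed := make([]RouteDescriptor, 0, len(desired))
	for _, r := range desired {
		if old, ok := currentByID[r.RouteID]; !ok || !routesEqual(old, r) {
			changed = append(changed, r)
		}
	}

	mutations := make([]*store.KVPairMutation, 0, len(staleKeys)+len(changed))
	mutations = appendDeleteMutations(mutations, staleKeys)
	return appendRouteMutations(mutations, changed)
}

// routesEqual reports whether two descriptors are identical field-for-field.
func routesEqual(a, b RouteDescriptor) bool {
	return a.RouteID == b.RouteID && a.ParentRouteID == b.ParentRouteID &&
		a.GroupID == b.GroupID && a.State == b.State &&
		bytes.Equal(a.Start, b.Start) && bytes.Equal(a.End, b.End)
}

This trades one slightly larger read (the full descriptors instead of only the keys) for far fewer writes when most of the catalog is unchanged.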
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
	out = append(out, cloneRouteDescriptor(route))
}
sort.Slice(out, func(i, j int) bool {
	return out[i].RouteID < out[j].RouteID
Routes are normalized/sorted by RouteID here, but the in-memory routing engine assumes routes are sorted by Start (key range) for binary search. Unless all callers re-sort by Start before applying a snapshot, returning RouteDescriptor slices in RouteID order is likely to cause subtle misrouting bugs. Consider sorting by Start (then End/RouteID for tie-breaks), or explicitly documenting the ordering contract in CatalogSnapshot and enforcing/validating it in Engine snapshot apply.
Suggested change:

- return out[i].RouteID < out[j].RouteID
+ ri, rj := out[i], out[j]
+ // Primary ordering: Start key (ascending).
+ if c := bytes.Compare(ri.Start, rj.Start); c != 0 {
+ 	return c < 0
+ }
+ // Secondary ordering: End key (ascending, with nil treated as empty slice).
+ if c := bytes.Compare(ri.End, rj.End); c != 0 {
+ 	return c < 0
+ }
+ // Tertiary ordering: RouteID for deterministic tie-breaking.
+ return ri.RouteID < rj.RouteID
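Alternatively, if the ordering contract is enforced when the engine applies a snapshot (the other option mentioned above), a guard along these lines could work; the function name and the exact invariants (sorted by Start, non-overlapping, End exclusive with nil meaning +inf) are assumptions about how the routing engine consumes snapshots.

// Sketch only: reject snapshots whose routes are unsorted or overlapping
// before the engine installs them for binary search by Start.
func validateRouteOrder(routes []RouteDescriptor) error {
	for i := 1; i < len(routes); i++ {
		prev, cur := routes[i-1], routes[i]
		if bytes.Compare(prev.Start, cur.Start) > 0 {
			return fmt.Errorf("routes not sorted by start at index %d", i)
		}
		// prev.End == nil means +inf, so no route may follow it; otherwise the
		// previous range must end at or before the next range's start.
		if prev.End == nil || bytes.Compare(prev.End, cur.Start) > 0 {
			return fmt.Errorf("routes %d and %d overlap", prev.RouteID, cur.RouteID)
		}
	}
	return nil
}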