
Benchmark: Model benchmark - deterministic training support#731

Open
Aishwarya-Tonpe wants to merge 103 commits into main from aishwaryatonpe/deterministic-training

Conversation


@Aishwarya-Tonpe Aishwarya-Tonpe commented Aug 28, 2025

Adds support for deterministic training and reproducible logging to all PyTorch model benchmarks in SuperBench (BERT, GPT2, LLaMA, LSTM, CNN, Mixtral).

Deterministic mode: ensures model runs are consistent across repetitions by fixing random seeds, turning off TF32, and using deterministic math operations.
Log generation: records key metrics such as loss and activation statistics during training.
Log comparison: compares a new run against a previous one to check whether they match.
New command-line options:

--enable-determinism {enables deterministic training mode}
--generate-log {boolean flag which, when enabled, stores the metrics (loss and activation mean) in the results file}
--compare-log {path of the JSON log file against which to compare the results of the current run}
--check-frequency {how often, in steps, to run periodic checks and record metrics}

Changes:

Updated pytorch_base.py to handle deterministic settings, logging, and comparisons.
Added a new example script: pytorch_deterministic_example.py
Added a test file: test_pytorch_determinism_all.py to verify everything works as expected.

Usage:

Run with --enable-determinism --generate-log to create a reference log.
Run again with --compare-log to check if the new run matches the reference.
Make sure all parameters stay the same between runs.
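The deterministic-mode bullet points above (fixed seeds, TF32 off, stable math) can be sketched as a small setup helper. This is an illustrative sketch, not the PR's actual `_enable_deterministic_training()` implementation; the function name and the `warn_only` choice are assumptions:

```python
import os
import random

import torch


def enable_determinism(seed: int = 42) -> None:
    """Fix RNG seeds and force deterministic kernels (illustrative sketch)."""
    random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU RNG and all CUDA devices
    # Disable TF32 so float32 matmuls/convolutions are bit-stable across runs.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
    # Prefer deterministic cuDNN kernels and disable autotuning.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Warn (or with warn_only=False, error) on ops lacking a deterministic implementation.
    torch.use_deterministic_algorithms(True, warn_only=True)
    # Required by cuBLAS for deterministic GEMMs on CUDA >= 10.2.
    os.environ.setdefault('CUBLAS_WORKSPACE_CONFIG', ':4096:8')
```

With this in place, two runs that call `enable_determinism(seed)` before model creation should produce identical loss traces, provided all other parameters stay the same.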

- Add _enable_deterministic_training() method to set all necessary seeds
- Add --deterministic and --random_seed command line arguments
- Integrate deterministic training in _create_model() and _generate_dataset()
- Add comprehensive unit tests for deterministic functionality
- Tests validate parameter parsing, functionality, and regression scenarios
- All tests pass and integrate with existing SuperBench test suite
…pass check_frequency to _is_finished in train/infer; add test capturing checksum log; stabilize fp32 loss path and small-dims determinism tests
…oss BERT/GPT2/CNN/LSTM/Mixtral; per-step fp32 loss logging; checksum logs; tests updated to strict/soft determinism pattern; add strict determinism CI guidance
…rings; fix GPT-2 params; soft vs strict checks stabilized
…sum tests with BERT pattern, improve docstrings and skip logic.
…/CNN/BERT/Mixtral with periodic fingerprints, per-step loss capture, TF32 off, SDPA math kernel; add model_log_utils; update examples and tests, add env gating for cuBLAS.
@Aishwarya-Tonpe Aishwarya-Tonpe requested a review from a team as a code owner August 28, 2025 17:41

@Aishwarya-Tonpe please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree company="Microsoft"


@abuccts abuccts left a comment


The metadata and compare log functions still seem to be unnecessary.

  • For the compare log function, it just checks whether the loss etc. in each step are equal or not, which is just a special case of result analysis. I think you can re-use the current result analysis module and write some yaml configs to perform this comparison, rather than writing new code to do this during the online benchmark run. Besides, there are several scenarios that the current compare log function cannot cover:

    1. In large scale training, the all-reduce usually produces accumulated errors due to different reduction orders among runs, so tolerating a range of differences is necessary in analysis/comparison, which can be easily configured in the yaml configs of the result analysis module.
    2. In validation, the results may need to be compared to either a baseline or the results of other nodes. The current compare log only performs a 1-on-1 comparison against pre-defined results, and cannot compare loss between different nodes in one run.
  • For metadata, all settings should already be included in the benchmark config. When users compare loss results of two runs, they should guarantee the configs used are the same, which is the same as comparing performance results. You may also write the necessary metadata into metrics so that result analysis can compare it as well.

Currently, all benchmarks in superbench only record related metrics during each run in the benchmark module, then the runner collects all metrics after each run in the runner module, and analysis/comparison is performed offline after all benchmarks finish in the result analysis module.

Therefore, it would be better for determinism support in the model benchmark to follow the same process:

  1. Write necessary results (e.g., loss, metadata, etc.) into metrics for each rank in the pytorch benchmark during each run.
  2. Rely on the existing results collection process in the runner module to collect results from each rank, rather than an ad-hoc all-reduce/all-gather in the benchmark.
  3. Rely on the existing results analysis module to compare the results offline. If there's any uncovered function for comparison, it would be better to support it generally in results analysis so that determinism in micro-benchmarks can also re-use it in the future.

Besides, please fix the unit tests accordingly.
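The tolerance-based offline comparison described above could look roughly like this. A minimal sketch, assuming a hypothetical per-run JSON log mapping step number to loss; SuperBench's actual result analysis module and yaml configs would replace this in practice:

```python
import json
import math


def compare_loss_logs(baseline_path, candidate_path, rel_tol=1e-5, abs_tol=1e-8):
    """Compare per-step losses from two runs within a tolerance.

    Assumes each file holds a JSON object mapping step -> loss
    (a hypothetical format for illustration). Returns a list of
    (step, baseline_loss, candidate_loss) tuples that differ beyond
    the tolerance or are missing from the candidate run.
    """
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(candidate_path) as f:
        candidate = json.load(f)
    mismatches = []
    for step, ref in baseline.items():
        got = candidate.get(step)
        if got is None or not math.isclose(got, ref, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((step, ref, got))
    return mismatches
```

Allowing a tolerance (rather than exact equality) addresses the accumulated all-reduce error scenario the reviewer raises; an exact match is just the special case `rel_tol=abs_tol=0`.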


Return:
The step-time list of every training step.
A tuple of (step_times_ms, info) of every training step.

missing one space in indent

…to aishwaryatonpe/deterministic-training
…that saves metadata; changed the comparison logic, which now involves adding metrics to the result file and running diagnosis
Copilot AI review requested due to automatic review settings February 12, 2026 20:30

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 11 comments.



Aishwarya-Tonpe and others added 2 commits February 12, 2026 23:22
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 13, 2026 01:13

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 11 comments.



self._log_step_time(curr_step, precision, duration)
if self._is_finished(curr_step, end, check_frequency):
    return duration
if self._is_finished(curr_step, end):

Copilot AI Feb 13, 2026


The _is_finished method signature requires 3 parameters (curr_step, curr_time, check_frequency), but this call only provides 2 parameters. The third parameter check_frequency is missing. Based on the original code, this should be: self._is_finished(curr_step, end, self._args.check_frequency)

Comment on lines +199 to 212
def _setup_target(self):
    # Use a separate deterministic RNG stream for target generation by offsetting the seed.
    # This keeps dataset RNG and target/model RNG deterministic but independent.
    generator = None
    if getattr(self._args, 'enable_determinism', False) and hasattr(self._args, 'deterministic_seed'):
        generator = torch.Generator()
        generator.manual_seed(self._args.deterministic_seed + 1)
    if generator is not None:
        self._target = torch.LongTensor(self._args.batch_size).random_(self._args.num_classes, generator=generator)
    else:
        self._target = torch.LongTensor(self._args.batch_size).random_(self._args.num_classes)
    if self._gpu_available:
        self._target = self._target.cuda()


Copilot AI Feb 13, 2026


Only pytorch_mixtral_impl.py has been updated to use deterministic target generation with a separate generator, but other model benchmarks (BERT, GPT2, LLaMA, LSTM, CNN) still use the non-deterministic torch.LongTensor(...).random_() without a generator parameter. For consistent deterministic behavior across all models, all model files should use the same approach as pytorch_mixtral_impl.py's _setup_target() method when enable_determinism is enabled.
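A model-agnostic version of that pattern, which the other benchmarks could share, might look like the sketch below. The helper name `make_targets` is hypothetical, not part of the PR; it shows the core idea of sampling from a dedicated, seed-offset `torch.Generator` so target generation is reproducible and independent of the global RNG:

```python
import torch


def make_targets(batch_size: int, num_classes: int, seed=None) -> torch.Tensor:
    """Draw class-label targets; with a seed, use a dedicated torch.Generator
    so target sampling is reproducible and does not perturb the global RNG."""
    if seed is not None:
        gen = torch.Generator()
        gen.manual_seed(seed)
        return torch.randint(0, num_classes, (batch_size,), generator=gen)
    # Non-deterministic path: fall back to the global RNG.
    return torch.randint(0, num_classes, (batch_size,))
```

Each model benchmark could then call this with `seed=self._args.deterministic_seed + 1` when `enable_determinism` is set, matching the Mixtral implementation's offset convention.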


# Force Scaled Dot-Product Attention to use deterministic math kernel
try:
    sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False)

Copilot AI Feb 13, 2026


The sdp_kernel function should be used as a context manager (with statement) to temporarily set the SDP backend for a block of code, not called directly. This call will have no lasting effect. Instead, consider using torch.backends.cuda.enable_flash_sdp(False), torch.backends.cuda.enable_math_sdp(True), and torch.backends.cuda.enable_mem_efficient_sdp(False) which set global state, or wrap the training code in a 'with sdp_kernel(...)' context manager.

'--batch_size 1 --precision float32 --num_warmup 1 --num_steps 120 --sample_count 8192 '
'--pin_memory --model_action train --check_frequency 20',
'lstm':
'--batch_size 1 --num_steps 100 --num_warmup 2 --seq_len 64 --precision float16 '

Copilot AI Feb 13, 2026


The LSTM example uses precision float16, but the test file (test_pytorch_determinism_all.py line 51) states that "float16 incompatible with deterministic mode". This inconsistency could lead to issues when users try to run deterministic training with LSTM. Consider changing the LSTM example to use float32 for consistency with the determinism requirements.

Suggested change
'--batch_size 1 --num_steps 100 --num_warmup 2 --seq_len 64 --precision float16 '
'--batch_size 1 --num_steps 100 --num_warmup 2 --seq_len 64 --precision float32 '

from datetime import timedelta

import torch
import torch.distributed as dist

Copilot AI Feb 13, 2026


The import 'import torch.distributed as dist' on line 13 is unused in the file. No references to 'dist.' were found in the code. Consider removing this unused import.

Suggested change
import torch.distributed as dist

Copilot AI review requested due to automatic review settings February 13, 2026 21:53

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 20 comments.



Comment on lines +84 to +88
# Force Scaled Dot-Product Attention to use deterministic math kernel
try:
    sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False)
except Exception:
    logger.warning('SDP kernel not available')

Copilot AI Feb 13, 2026


sdp_kernel(...) is a context manager in recent PyTorch releases; calling it without a with block won’t change the SDPA backend selection. If the intent is to force deterministic math kernels globally, use the dedicated enable/disable APIs (or wrap the model forward with with sdp_kernel(...)). Otherwise determinism expectations here won’t be met.
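The global enable/disable APIs the reviewer mentions would look roughly as follows (these are `torch.backends.cuda` functions available in recent PyTorch releases; availability depends on the installed version):

```python
import torch

# Select the deterministic math SDPA backend globally, instead of the
# context-manager form. The flash and memory-efficient backends stay
# disabled until explicitly re-enabled.
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
```

Unlike a bare `sdp_kernel(...)` call, these setters change persistent global state, so every subsequent `scaled_dot_product_attention` call uses the math backend.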

# Add raw data (all values at each checkpoint)
self._result.add_raw_data(metric_name, values, self._args.log_raw_data)
# Add summarized result (mean of checkpointed values)
self._result.add_result(metric_name, statistics.mean([v for v in values if v is not None]))

Copilot AI Feb 13, 2026


statistics.mean([v for v in values if v is not None]) will raise StatisticsError if all recorded values are None (e.g., loss conversion failure or missing logits), which would fail the whole benchmark run during post-processing. Please guard for an empty filtered list (skip the metric, emit NaN, or record an explicit sentinel) before calling mean.

Suggested change
self._result.add_result(metric_name, statistics.mean([v for v in values if v is not None]))
filtered_values = [v for v in values if v is not None]
if filtered_values:
    self._result.add_result(metric_name, statistics.mean(filtered_values))
else:
    # No valid (non-None) values recorded; record NaN to avoid StatisticsError
    self._result.add_result(metric_name, float('nan'))

Comment on lines +133 to +139
self._parser.add_argument(
    '--check_frequency',
    type=int,
    default=100,
    required=False,
    help='How often (in steps) to run lightweight periodic checks/logs and evaluate early-stop conditions.',
)

Copilot AI Feb 13, 2026


check_frequency is used in a modulo operation for periodic logging; with the current parser it can be set to 0 (or negative), which will raise at runtime or behave unexpectedly. Add validation (e.g., check_frequency > 0) at argument parsing time, or handle non-positive values safely in the logging helpers.
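Validation at argument-parsing time, as suggested, could use a custom argparse type. A sketch; `positive_int` is not part of the PR:

```python
import argparse


def positive_int(value: str) -> int:
    """argparse type that accepts only integers >= 1."""
    ivalue = int(value)
    if ivalue <= 0:
        raise argparse.ArgumentTypeError(f'check_frequency must be > 0, got {value!r}')
    return ivalue


parser = argparse.ArgumentParser()
parser.add_argument('--check_frequency', type=positive_int, default=100)
```

With this, `--check_frequency 0` fails immediately with a clear parser error instead of raising ZeroDivisionError mid-run.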


import torch
import torch.distributed as dist
import transformers

Copilot AI Feb 13, 2026


dist is imported but not referenced anywhere in this module. Please remove the unused import to avoid lint noise and keep dependencies clear.

Suggested change
import transformers

Comment on lines +78 to +80
if not enable_determinism or (curr_step % check_frequency != 0):
    return


Copilot AI Feb 13, 2026


curr_step % check_frequency will raise ZeroDivisionError when check_frequency is 0 (or behave oddly for negatives). Since check_frequency is user-configurable, please validate it (>0) or add a defensive check here to avoid crashing deterministic runs.

Suggested change
if not enable_determinism or (curr_step % check_frequency != 0):
    return
if not enable_determinism:
    return
# Defensive check: avoid ZeroDivisionError and undefined behavior for non-positive or invalid frequencies.
if not isinstance(check_frequency, int) or check_frequency <= 0:
    if logger:
        logger.warning(
            f'Invalid check_frequency={check_frequency} at step {curr_step}; '
            'skipping periodic fingerprint recording.'
        )
    return
if curr_step % check_frequency != 0:
    return

Comment on lines +246 to 250
self.record_determinism_fingerprint(curr_step, loss, logits, periodic, self._args.check_frequency)
self._log_step_time(curr_step, precision, duration)
if self._is_finished(curr_step, end, check_frequency):
    return duration
if self._is_finished(curr_step, end):
    return duration, self._finalize_periodic_logging(periodic)


Copilot AI Feb 13, 2026


check_frequency is documented/used as the cadence for early-stop checks, but _is_finished is now called without passing it, so distributed duration-based early stopping will still sync only every 100 steps (the default in PytorchBase._is_finished). Pass self._args.check_frequency through so runtime behavior matches the CLI option.

Comment on lines +119 to 123
self.record_determinism_fingerprint(curr_step, loss, output, periodic, self._args.check_frequency)
self._log_step_time(curr_step, precision, duration)
if self._is_finished(curr_step, end, check_frequency):
    return duration
if self._is_finished(curr_step, end):
    return duration, self._finalize_periodic_logging(periodic)


Copilot AI Feb 13, 2026


check_frequency is documented/used as the cadence for early-stop checks, but _is_finished is now called without passing it, so distributed duration-based early stopping will still sync only every 100 steps (the default in PytorchBase._is_finished). Pass self._args.check_frequency through so runtime behavior matches the CLI option.

Comment on lines +158 to 162
self.record_determinism_fingerprint(curr_step, loss, output, periodic, self._args.check_frequency)
self._log_step_time(curr_step, precision, duration)
if self._is_finished(curr_step, end, check_frequency):
    return duration
if self._is_finished(curr_step, end):
    return duration, self._finalize_periodic_logging(periodic)


Copilot AI Feb 13, 2026


check_frequency is documented/used as the cadence for early-stop checks, but _is_finished is now called without passing it, so distributed duration-based early stopping will still sync only every 100 steps (the default in PytorchBase._is_finished). Pass self._args.check_frequency through so runtime behavior matches the CLI option.

Comment on lines +189 to 193
self.record_determinism_fingerprint(curr_step, loss, logits, periodic, self._args.check_frequency)
self._log_step_time(curr_step, precision, duration)
if self._is_finished(curr_step, end, check_frequency):
    return duration
if self._is_finished(curr_step, end):
    return duration, self._finalize_periodic_logging(periodic)


Copilot AI Feb 13, 2026


check_frequency is documented/used as the cadence for early-stop checks, but _is_finished is now called without passing it, so distributed duration-based early stopping will still sync only every 100 steps (the default in PytorchBase._is_finished). Pass self._args.check_frequency through so runtime behavior matches the CLI option.

output = self._model(sample)
loss = self._loss_fn(output[range(self._args.batch_size), -1], self._target)
logits = output[range(self._args.batch_size), -1]
loss = self._loss_fn(logits.float(), self._target)

Copilot AI Feb 13, 2026


loss = self._loss_fn(logits.float(), ...) forces FP32 loss for all precisions. That alters benchmark semantics/perf for lower-precision runs even when determinism is disabled; if the cast is only intended for deterministic mode, gate it on enable_determinism (or document the unconditional FP32 loss behavior).

Suggested change
loss = self._loss_fn(logits.float(), self._target)
# Use FP32 logits for loss only when determinism is enabled; otherwise
# keep logits in their native precision to preserve benchmark semantics.
enable_determinism = getattr(self._args, 'enable_determinism', False)
logits_for_loss = logits.float() if enable_determinism else logits
loss = self._loss_fn(logits_for_loss, self._target)


Labels

benchmarks (SuperBench Benchmarks), model-benchmarks (Model Benchmark Test for SuperBench Benchmarks)


4 participants