
Benchmark: Model benchmark - deterministic training support#731

Open
Aishwarya-Tonpe wants to merge 103 commits into main from aishwaryatonpe/deterministic-training

Conversation


@Aishwarya-Tonpe Aishwarya-Tonpe commented Aug 28, 2025

Adds support for deterministic training and reproducible logging to all PyTorch model benchmarks in SuperBench (BERT, GPT2, LLaMA, LSTM, CNN, Mixtral).

Deterministic mode: ensures model runs are consistent across repetitions by fixing random seeds, turning off TF32, and using deterministic math operations.
Log generation: records key metrics such as loss and activation statistics during training.
Log comparison: compares a new run against a previous one to check whether they match.
New command-line options:

--enable-determinism {enables deterministic training mode}
--generate-log {boolean flag which, when enabled, stores the metrics (loss and activation mean) in the results file}
--compare-log {path of the JSON log file against which to compare the results of the current run}
--check-frequency {how often, in steps, to run periodic checks and record metrics}

Changes:

Updated pytorch_base.py to handle deterministic settings, logging, and comparisons.
Added a new example script: pytorch_deterministic_example.py
Added a test file: test_pytorch_determinism_all.py to verify everything works as expected.

Usage:

Run with --enable-determinism --generate-log to create a reference log.
Run again with --compare-log to check if the new run matches the reference.
Make sure all parameters stay the same between runs.
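The deterministic-mode bullet points above (fixed seeds, TF32 off, stable math) can be sketched as a small setup helper. This is an illustrative sketch, not the PR's actual `_enable_deterministic_training()` implementation; the function name and the `warn_only` choice are assumptions:

```python
import os
import random

import torch


def enable_determinism(seed: int = 42) -> None:
    """Fix RNG seeds and force deterministic kernels (illustrative sketch)."""
    random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU RNG and all CUDA devices
    # Disable TF32 so float32 matmuls/convolutions are bit-stable across runs.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
    # Prefer deterministic cuDNN kernels and disable autotuning.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Warn (or with warn_only=False, error) on ops lacking a deterministic implementation.
    torch.use_deterministic_algorithms(True, warn_only=True)
    # Required by cuBLAS for deterministic GEMMs on CUDA >= 10.2.
    os.environ.setdefault('CUBLAS_WORKSPACE_CONFIG', ':4096:8')
```

With this in place, two runs that call `enable_determinism(seed)` before model creation should produce identical loss traces, provided all other parameters stay the same.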

- Add _enable_deterministic_training() method to set all necessary seeds
- Add --deterministic and --random_seed command line arguments
- Integrate deterministic training in _create_model() and _generate_dataset()
- Add comprehensive unit tests for deterministic functionality
- Tests validate parameter parsing, functionality, and regression scenarios
- All tests pass and integrate with existing SuperBench test suite
…pass check_frequency to _is_finished in train/infer; add test capturing checksum log; stabilize fp32 loss path and small-dims determinism tests
…oss BERT/GPT2/CNN/LSTM/Mixtral; per-step fp32 loss logging; checksum logs; tests updated to strict/soft determinism pattern; add strict determinism CI guidance
…rings; fix GPT-2 params; soft vs strict checks stabilized
…sum tests with BERT pattern, improve docstrings and skip logic.
…/CNN/BERT/Mixtral with periodic fingerprints, per-step loss capture, TF32 off, SDPA math kernel; add model_log_utils; update examples and tests, add env gating for cuBLAS.
@Aishwarya-Tonpe Aishwarya-Tonpe requested a review from a team as a code owner August 28, 2025 17:41

@Aishwarya-Tonpe please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree company="Microsoft"


@abuccts abuccts left a comment


The metadata and compare log functions still seem to be unnecessary.

  • For the compare log function, it just checks whether the loss etc. in each step are equal or not, which is just a special case of result analysis. I think you can re-use the current result analysis module and write some yaml configs to perform this comparison, rather than writing new code to do this during the online benchmark run. Besides, there are several scenarios that the current compare log function cannot cover:

    1. In large scale training, the all-reduce usually produces accumulated errors due to different reduction orders among runs, so tolerating a range of differences is necessary in analysis/comparison, which can be easily configured in the yaml configs of the result analysis module.
    2. In validation, the results may need to be compared to either a baseline or the results of other nodes. The current compare log only performs a 1-on-1 comparison against pre-defined results, and cannot compare loss between different nodes in one run.
  • For metadata, all settings should already be included in the benchmark config. When users compare loss results of two runs, they should guarantee the configs used are the same, which is the same as comparing performance results. You may also write the necessary metadata into metrics so that result analysis can compare it as well.

Currently, all benchmarks in superbench only record related metrics during each run in the benchmark module, then the runner collects all metrics after each run in the runner module, and analysis/comparison is performed offline after all benchmarks finish in the result analysis module.

Therefore, it would be better for determinism support in the model benchmark to follow the same process:

  1. Write necessary results (e.g., loss, metadata, etc.) into metrics for each rank in the pytorch benchmark during each run.
  2. Rely on the existing results collection process in the runner module to collect results from each rank, rather than an ad-hoc all-reduce/all-gather in the benchmark.
  3. Rely on the existing results analysis module to compare the results offline. If there's any uncovered function for comparison, it would be better to support it generally in results analysis so that determinism in micro-benchmarks can also re-use it in the future.

Besides, please fix the unit tests accordingly.
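The tolerance-based offline comparison described above could look roughly like this. A minimal sketch, assuming a hypothetical per-run JSON log mapping step number to loss; SuperBench's actual result analysis module and yaml configs would replace this in practice:

```python
import json
import math


def compare_loss_logs(baseline_path, candidate_path, rel_tol=1e-5, abs_tol=1e-8):
    """Compare per-step losses from two runs within a tolerance.

    Assumes each file holds a JSON object mapping step -> loss
    (a hypothetical format for illustration). Returns a list of
    (step, baseline_loss, candidate_loss) tuples that differ beyond
    the tolerance or are missing from the candidate run.
    """
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(candidate_path) as f:
        candidate = json.load(f)
    mismatches = []
    for step, ref in baseline.items():
        got = candidate.get(step)
        if got is None or not math.isclose(got, ref, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((step, ref, got))
    return mismatches
```

Allowing a tolerance (rather than exact equality) addresses the accumulated all-reduce error scenario the reviewer raises; an exact match is just the special case `rel_tol=abs_tol=0`.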


Return:
The step-time list of every training step.
A tuple of (step_times_ms, info) of every training step.

missing one space in indent

…to aishwaryatonpe/deterministic-training
…that saves metadata; changed the comparison logic, which now involves adding metrics to the result file and running diagnosis
Copilot AI review requested due to automatic review settings February 12, 2026 20:30

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 11 comments.



Aishwarya-Tonpe and others added 2 commits February 12, 2026 23:22
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 13, 2026 01:13

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 11 comments.



self._log_step_time(curr_step, precision, duration)
if self._is_finished(curr_step, end, check_frequency):
    return duration
if self._is_finished(curr_step, end):

Copilot AI Feb 13, 2026


The _is_finished method signature requires 3 parameters (curr_step, curr_time, check_frequency), but this call only provides 2 parameters. The third parameter check_frequency is missing. Based on the original code, this should be: self._is_finished(curr_step, end, self._args.check_frequency)

Comment on lines +199 to 212
def _setup_target(self):
    # Use a separate deterministic RNG stream for target generation by offsetting the seed.
    # This keeps dataset RNG and target/model RNG deterministic but independent.
    generator = None
    if getattr(self._args, 'enable_determinism', False) and hasattr(self._args, 'deterministic_seed'):
        generator = torch.Generator()
        generator.manual_seed(self._args.deterministic_seed + 1)
    if generator is not None:
        self._target = torch.LongTensor(self._args.batch_size).random_(self._args.num_classes, generator=generator)
    else:
        self._target = torch.LongTensor(self._args.batch_size).random_(self._args.num_classes)
    if self._gpu_available:
        self._target = self._target.cuda()


Copilot AI Feb 13, 2026


Only pytorch_mixtral_impl.py has been updated to use deterministic target generation with a separate generator, but other model benchmarks (BERT, GPT2, LLaMA, LSTM, CNN) still use the non-deterministic torch.LongTensor(...).random_() without a generator parameter. For consistent deterministic behavior across all models, all model files should use the same approach as pytorch_mixtral_impl.py's _setup_target() method when enable_determinism is enabled.
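A model-agnostic version of that pattern, which the other benchmarks could share, might look like the sketch below. The helper name `make_targets` is hypothetical, not part of the PR; it shows the core idea of sampling from a dedicated, seed-offset `torch.Generator` so target generation is reproducible and independent of the global RNG:

```python
import torch


def make_targets(batch_size: int, num_classes: int, seed=None) -> torch.Tensor:
    """Draw class-label targets; with a seed, use a dedicated torch.Generator
    so target sampling is reproducible and does not perturb the global RNG."""
    if seed is not None:
        gen = torch.Generator()
        gen.manual_seed(seed)
        return torch.randint(0, num_classes, (batch_size,), generator=gen)
    # Non-deterministic path: fall back to the global RNG.
    return torch.randint(0, num_classes, (batch_size,))
```

Each model benchmark could then call this with `seed=self._args.deterministic_seed + 1` when `enable_determinism` is set, matching the Mixtral implementation's offset convention.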


# Force Scaled Dot-Product Attention to use deterministic math kernel
try:
    sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False)

Copilot AI Feb 13, 2026


The sdp_kernel function should be used as a context manager (with statement) to temporarily set the SDP backend for a block of code, not called directly. This call will have no lasting effect. Instead, consider using torch.backends.cuda.enable_flash_sdp(False), torch.backends.cuda.enable_math_sdp(True), and torch.backends.cuda.enable_mem_efficient_sdp(False) which set global state, or wrap the training code in a 'with sdp_kernel(...)' context manager.

'--batch_size 1 --precision float32 --num_warmup 1 --num_steps 120 --sample_count 8192 '
'--pin_memory --model_action train --check_frequency 20',
'lstm':
'--batch_size 1 --num_steps 100 --num_warmup 2 --seq_len 64 --precision float16 '

Copilot AI Feb 13, 2026


The LSTM example uses precision float16, but the test file (test_pytorch_determinism_all.py line 51) states that "float16 incompatible with deterministic mode". This inconsistency could lead to issues when users try to run deterministic training with LSTM. Consider changing the LSTM example to use float32 for consistency with the determinism requirements.

Suggested change
'--batch_size 1 --num_steps 100 --num_warmup 2 --seq_len 64 --precision float16 '
'--batch_size 1 --num_steps 100 --num_warmup 2 --seq_len 64 --precision float32 '

from datetime import timedelta

import torch
import torch.distributed as dist

Copilot AI Feb 13, 2026


The import 'import torch.distributed as dist' on line 13 is unused in the file. No references to 'dist.' were found in the code. Consider removing this unused import.

Suggested change
import torch.distributed as dist

Copilot AI review requested due to automatic review settings February 13, 2026 21:53

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 20 comments.



Comment on lines +84 to +88
# Force Scaled Dot-Product Attention to use deterministic math kernel
try:
    sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False)
except Exception:
    logger.warning('SDP kernel not available')

Copilot AI Feb 13, 2026


sdp_kernel(...) is a context manager in recent PyTorch releases; calling it without a with block won’t change the SDPA backend selection. If the intent is to force deterministic math kernels globally, use the dedicated enable/disable APIs (or wrap the model forward with with sdp_kernel(...)). Otherwise determinism expectations here won’t be met.
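The global enable/disable APIs the reviewer mentions would look roughly as follows (these are `torch.backends.cuda` functions available in recent PyTorch releases; availability depends on the installed version):

```python
import torch

# Select the deterministic math SDPA backend globally, instead of the
# context-manager form. The flash and memory-efficient backends stay
# disabled until explicitly re-enabled.
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
```

Unlike a bare `sdp_kernel(...)` call, these setters change persistent global state, so every subsequent `scaled_dot_product_attention` call uses the math backend.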

# Add raw data (all values at each checkpoint)
self._result.add_raw_data(metric_name, values, self._args.log_raw_data)
# Add summarized result (mean of checkpointed values)
self._result.add_result(metric_name, statistics.mean([v for v in values if v is not None]))

Copilot AI Feb 13, 2026


statistics.mean([v for v in values if v is not None]) will raise StatisticsError if all recorded values are None (e.g., loss conversion failure or missing logits), which would fail the whole benchmark run during post-processing. Please guard for an empty filtered list (skip the metric, emit NaN, or record an explicit sentinel) before calling mean.

Suggested change
self._result.add_result(metric_name, statistics.mean([v for v in values if v is not None]))
filtered_values = [v for v in values if v is not None]
if filtered_values:
    self._result.add_result(metric_name, statistics.mean(filtered_values))
else:
    # No valid (non-None) values recorded; record NaN to avoid StatisticsError
    self._result.add_result(metric_name, float('nan'))

Comment on lines +133 to +139
self._parser.add_argument(
    '--check_frequency',
    type=int,
    default=100,
    required=False,
    help='How often (in steps) to run lightweight periodic checks/logs and evaluate early-stop conditions.',
)

Copilot AI Feb 13, 2026


check_frequency is used in a modulo operation for periodic logging; with the current parser it can be set to 0 (or negative), which will raise at runtime or behave unexpectedly. Add validation (e.g., check_frequency > 0) at argument parsing time, or handle non-positive values safely in the logging helpers.
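Validation at argument-parsing time, as suggested, could use a custom argparse type. A sketch; `positive_int` is not part of the PR:

```python
import argparse


def positive_int(value: str) -> int:
    """argparse type that accepts only integers >= 1."""
    ivalue = int(value)
    if ivalue <= 0:
        raise argparse.ArgumentTypeError(f'check_frequency must be > 0, got {value!r}')
    return ivalue


parser = argparse.ArgumentParser()
parser.add_argument('--check_frequency', type=positive_int, default=100)
```

With this, `--check_frequency 0` fails immediately with a clear parser error instead of raising ZeroDivisionError mid-run.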


import torch
import torch.distributed as dist
import transformers

Copilot AI Feb 13, 2026


dist is imported but not referenced anywhere in this module. Please remove the unused import to avoid lint noise and keep dependencies clear.

Suggested change
import transformers

Comment on lines +78 to +80
if not enable_determinism or (curr_step % check_frequency != 0):
    return


Copilot AI Feb 13, 2026


curr_step % check_frequency will raise ZeroDivisionError when check_frequency is 0 (or behave oddly for negatives). Since check_frequency is user-configurable, please validate it (>0) or add a defensive check here to avoid crashing deterministic runs.

Suggested change
if not enable_determinism or (curr_step % check_frequency != 0):
    return
if not enable_determinism:
    return
# Defensive check: avoid ZeroDivisionError and undefined behavior for non-positive or invalid frequencies.
if not isinstance(check_frequency, int) or check_frequency <= 0:
    if logger:
        logger.warning(
            f'Invalid check_frequency={check_frequency} at step {curr_step}; '
            'skipping periodic fingerprint recording.'
        )
    return
if curr_step % check_frequency != 0:
    return

Comment on lines +246 to 250
self.record_determinism_fingerprint(curr_step, loss, logits, periodic, self._args.check_frequency)
self._log_step_time(curr_step, precision, duration)
if self._is_finished(curr_step, end, check_frequency):
    return duration
if self._is_finished(curr_step, end):
    return duration, self._finalize_periodic_logging(periodic)


Copilot AI Feb 13, 2026


check_frequency is documented/used as the cadence for early-stop checks, but _is_finished is now called without passing it, so distributed duration-based early stopping will still sync only every 100 steps (the default in PytorchBase._is_finished). Pass self._args.check_frequency through so runtime behavior matches the CLI option.

Comment on lines +119 to 123
self.record_determinism_fingerprint(curr_step, loss, output, periodic, self._args.check_frequency)
self._log_step_time(curr_step, precision, duration)
if self._is_finished(curr_step, end, check_frequency):
    return duration
if self._is_finished(curr_step, end):
    return duration, self._finalize_periodic_logging(periodic)


Copilot AI Feb 13, 2026


check_frequency is documented/used as the cadence for early-stop checks, but _is_finished is now called without passing it, so distributed duration-based early stopping will still sync only every 100 steps (the default in PytorchBase._is_finished). Pass self._args.check_frequency through so runtime behavior matches the CLI option.

Comment on lines +158 to 162
self.record_determinism_fingerprint(curr_step, loss, output, periodic, self._args.check_frequency)
self._log_step_time(curr_step, precision, duration)
if self._is_finished(curr_step, end, check_frequency):
    return duration
if self._is_finished(curr_step, end):
    return duration, self._finalize_periodic_logging(periodic)


Copilot AI Feb 13, 2026


check_frequency is documented/used as the cadence for early-stop checks, but _is_finished is now called without passing it, so distributed duration-based early stopping will still sync only every 100 steps (the default in PytorchBase._is_finished). Pass self._args.check_frequency through so runtime behavior matches the CLI option.

Comment on lines +189 to 193
self.record_determinism_fingerprint(curr_step, loss, logits, periodic, self._args.check_frequency)
self._log_step_time(curr_step, precision, duration)
if self._is_finished(curr_step, end, check_frequency):
    return duration
if self._is_finished(curr_step, end):
    return duration, self._finalize_periodic_logging(periodic)


Copilot AI Feb 13, 2026


check_frequency is documented/used as the cadence for early-stop checks, but _is_finished is now called without passing it, so distributed duration-based early stopping will still sync only every 100 steps (the default in PytorchBase._is_finished). Pass self._args.check_frequency through so runtime behavior matches the CLI option.

output = self._model(sample)
loss = self._loss_fn(output[range(self._args.batch_size), -1], self._target)
logits = output[range(self._args.batch_size), -1]
loss = self._loss_fn(logits.float(), self._target)

Copilot AI Feb 13, 2026


loss = self._loss_fn(logits.float(), ...) forces FP32 loss for all precisions. That alters benchmark semantics/perf for lower-precision runs even when determinism is disabled; if the cast is only intended for deterministic mode, gate it on enable_determinism (or document the unconditional FP32 loss behavior).

Suggested change
loss = self._loss_fn(logits.float(), self._target)
# Use FP32 logits for loss only when determinism is enabled; otherwise
# keep logits in their native precision to preserve benchmark semantics.
enable_determinism = getattr(self._args, 'enable_determinism', False)
logits_for_loss = logits.float() if enable_determinism else logits
loss = self._loss_fn(logits_for_loss, self._target)


Labels

benchmarks (SuperBench Benchmarks), model-benchmarks (Model Benchmark Test for SuperBench Benchmarks)


4 participants