@@ -26,7 +26,7 @@ The metrics table provides the following key data points for each variant:


- **Confidence**: The statistical confidence that the observed difference between variants is real, not due to random chance.
- **Impact**: The difference in metric performance between variants, shown visually and as a percentage with a confidence interval (CI). In this example the observed impact is +1.35% with a confidence interval going from +0.38% to +2.32%.
- **Impact**: The difference in metric performance between variants, shown visually and as a percentage with a confidence interval (CI). For example, the observed impact could be +1.35% with a confidence interval going from +0.38% to +2.32%.
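A common way to read the interval in the Impact bullet is to check whether it excludes zero. The sketch below is only an illustration of that reading: the helper name `ci_excludes_zero` is hypothetical, and it does not claim to reproduce how ABsmartly derives its color-coded indicator.

```python
def ci_excludes_zero(lower_pct: float, upper_pct: float) -> bool:
    """Return True when the whole interval lies on one side of zero."""
    return lower_pct > 0 or upper_pct < 0


# Interval from the example in the Impact bullet: +0.38% to +2.32%.
print(ci_excludes_zero(0.38, 2.32))  # True
```

An interval that spans zero (a negative lower bound with a positive upper bound) would return `False`, meaning the data is still compatible with no effect.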

### Interpreting confidence intervals and results
The confidence interval (CI) and the color-coded indicator help assess the significance of the results:
@@ -66,33 +66,40 @@ Example using the example above

## Group Sequential Testing Metrics

<Image img="experiment-results/gst-inconclusive-result.png" alt="Inconclusive GST result in ABsmartly" maxWidth="45rem" />
<Image img="experiment-results/gst-efficacy-boundary-crossed.png" alt="Experiment crossing the efficacy boundary in ABsmartly" maxWidth="30rem" />

### Understanding the Metrics Table
### Understanding the GST data

The metrics table provides the following key data points for each variant at the last interim analysis:
<Image img="experiment-results/gst-data.png" alt="GST data in ABsmartly" maxWidth="45rem" />

- **Mean**: The GST adjusted average performance of the metric for the variant.

For a GST experiment, the primary metric table on the experiment overview has a GST toggle that
switches between the GST data (used for decision-making) and the non-GST data (useful for debugging).

For GST experiments, the primary metric table provides the following key data points for each variant **at the last interim analysis**:

- **Mean**: The GST-adjusted average performance of the metric for the variant.
- **Observed Mean**: The actual observed average performance of the metric during the experiment.
- **Impact**: The percentage change in the metric compared to the baseline. This is a GST adjusted value. In this example +1.74% with a confidence interval going from -1.88% to +5.49%.
- **Impact**: The percentage change in the metric compared to the baseline. This is a GST-adjusted value. In this example, +4.84% with a confidence interval going from +2.40% to +8.58%.
- **Z-Score**: A statistical measure that represents how many standard deviations the result is from the null hypothesis (no effect). Positive Z-scores indicate an improvement, while negative Z-scores suggest a decline.
- **P-Value**: The probability of observing a result at least as extreme as the one measured if the null hypothesis is true. Lower P-values (e.g., below 0.05) indicate statistical significance.
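
To make the link between the Z-Score and the P-Value concrete, here is a minimal sketch of the textbook conversion for a two-sided test under a standard normal distribution. It is an assumption that a two-sided test applies here, and the GST-adjusted P-Value shown in the table accounts for the interim analyses, so it may differ from this unadjusted figure. The function name is hypothetical.

```python
import math


def two_sided_p_value(z_score: float) -> float:
    """Unadjusted two-sided p-value for a standard normal Z-score."""
    # 2 * (1 - Phi(|z|)) is equal to erfc(|z| / sqrt(2)).
    return math.erfc(abs(z_score) / math.sqrt(2))


# A Z-score of about 1.96 corresponds to the familiar 0.05 threshold.
print(round(two_sided_p_value(1.96), 3))  # 0.05
```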

These data points provide a summary of the ongoing analysis for the selected variant at the last analysis, helping to evaluate its performance relative to the baseline.
These data points provide a summary of the ongoing analysis for the selected variant at the last interim analysis,
helping to evaluate its performance relative to the baseline.

---

### Understanding the Group Sequential Graph

<Image img="experiment-results/gst-inconclusive-result.png" alt="Inconclusive GST result in ABsmartly" maxWidth="45rem" />
<Image img="experiment-results/gst-futility-boundary-crossed.png" alt="Experiment crossing the futility boundary in ABsmartly" maxWidth="30rem" />

Group Sequential Testing makes it easy to visually interpret results. The graph displays the evolution of statistical evidence over time, allowing decisions to be made at predefined checkpoints during the experiment. It includes the following elements:

- **X-Axis (Time)**: Represents the progress of the experiment over time, with dates marking each past and future interim analysis.
- **Y-Axis (Standard Deviations, Z-Scale)**: Represents the Z-Score, showing how far the observed result is from the null hypothesis.
- **Z-Score Trajectory (Orange line)**: The path of the observed Z-Score over time. It starts at 0 and moves based on accumulating data.
- **Efficiency Boundary (Green Region)**: The upper boundary. If the Z-Score trajectory crosses this boundary, the variant shows a statistically significant improvement, and the experiment can be stopped early for success.
- **Futility Boundary (Pink Region)**: The lower boundary. If the Z-Score trajectory crosses this boundary, the variant is deemed unlikely to show meaningful improvement, and the experiment can be stopped early for futility.
- **Efficacy Boundary (Green Region)**: The upper boundary. If the Z-Score trajectory crosses this boundary, the variant shows a statistically significant improvement, and the experiment can be stopped early for success.
- **Futility Boundary (Gray Region)**: The lower boundary. If the Z-Score trajectory crosses this boundary, the variant is deemed unlikely to show meaningful improvement, and the experiment can be stopped early for futility.
- **Fixed Horizon (Vertical Dotted Line)**: Represents the moment in time where the equivalent Fixed Horizon test would have been completed. All interim analyses before that dotted line are opportunities to make an early decision.
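
The way the Z-Score trajectory is compared against the two boundaries at each interim analysis can be sketched as follows. This is an illustrative assumption, not ABsmartly's implementation: the names `Decision` and `interim_decision` are hypothetical, the boundary values are made up, and treating a value exactly on a boundary as a crossing is a choice made for this sketch. The `binding_futility` flag mirrors the binding and non-binding futility types chosen in the experiment setup.

```python
from enum import Enum


class Decision(Enum):
    STOP_FOR_EFFICACY = "stop for efficacy"
    STOP_FOR_FUTILITY = "stop for futility"
    FUTILITY_OPTIONAL = "futility crossed, may continue"
    CONTINUE = "continue"


def interim_decision(z_score, efficacy_bound, futility_bound, binding_futility):
    """Classify one interim analysis against its two boundaries."""
    if z_score >= efficacy_bound:
        return Decision.STOP_FOR_EFFICACY
    if z_score <= futility_bound:
        # A binding boundary ends the experiment; a non-binding one leaves the choice open.
        if binding_futility:
            return Decision.STOP_FOR_FUTILITY
        return Decision.FUTILITY_OPTIONAL
    return Decision.CONTINUE


# Hypothetical boundary values for a single interim analysis.
result = interim_decision(2.9, efficacy_bound=2.5, futility_bound=0.2, binding_futility=True)
print(result.value)  # stop for efficacy
```

In practice the boundaries change from one interim analysis to the next, so a real check would look up the pair of bounds for the current analysis before comparing.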

---
@@ -101,7 +108,7 @@ Group Sequential Testing, makes it easy to visually interpret results. The graph
- Crossing this boundary means there is enough evidence to conclude the variant performs significantly better than the baseline.
- The experiment can be stopped early, and the variant can be considered successful.

2. **Futility Boundary (Pink)**:
2. **Futility Boundary (Gray)**:
- Crossing this boundary indicates the variant is unlikely to show significant improvement.
- In case of a binding futility type (see experiment setup), the experiment is completed (no further interim analyses will take place) and can be stopped, as further data collection is unlikely to change the conclusion.
- In case of a non-binding futility type, you can decide to keep the experiment running to the following interim analyses.
Binary file added static/img/experiment-results/gst-data.png