
Single-Arm Sample Size Re-estimation (SSR)

Author: Zetyra

Technical documentation for adaptive single-arm Phase II designs comparing a binary response rate against a fixed historical control. Covers the Bayesian posterior/predictive framework, Mehta–Pocock promising-zone conditional power, prior specification, operating characteristics, and FDA regulatory considerations (2019 Adaptive Designs guidance, Project Optimus 2023).

1. Overview & Motivation

Single-arm Phase II trials are the dominant design in early oncology drug development. Enrolling all patients onto the experimental arm accelerates evidence generation when a randomized comparison is ethically or practically infeasible, and the observed objective response rate (ORR) is compared against a historical control rate $p_0$ drawn from prior trials of standard-of-care.

Single-arm designs can support FDA accelerated approval in selected settings — especially oncology — when the endpoint is reasonably likely to predict clinical benefit and confirmatory evidence is planned or required. Within those settings, single-arm trials powered against a well-characterized $p_0$ are a common design choice; the pathway is not a default for all single-arm trials.

Why adaptive SSR? The initial sample size depends on a minimally clinically important alternative $p_1$ that sponsors often specify with considerable uncertainty. If interim data suggest the true effect is smaller than $p_1$ but still clinically meaningful, a modest expansion can preserve power. Conversely, a very large observed effect supports an interim efficacy stop, and a very small effect supports futility termination; both spare patients and resources.

When adaptive SSR helps: (a) the clinically meaningful effect size is uncertain, (b) operational flexibility is valued (stop early for efficacy or futility), (c) the historical control rate $p_0$ is well-established, and (d) the trial is exploratory (Phase II), not confirmatory.

2. Design Framework

Let $X_1, \ldots, X_n$ be i.i.d. Bernoulli responses with unknown true rate $p$. We test:

$$H_0: p \leq p_0 \quad \text{vs.} \quad H_1: p > p_0$$

where $p_0$ is the historical control rate (null) and $p_1$ is the target alternative. Under a normal approximation to the one-sample binomial, the required fixed-design sample size is:

$$n = \left\lceil \left( \frac{z_\alpha \sqrt{p_0(1-p_0)} + z_\beta \sqrt{p_1(1-p_1)}}{p_1 - p_0} \right)^2 \right\rceil$$

with $z_\alpha = \Phi^{-1}(1-\alpha)$ and $z_\beta = \Phi^{-1}(1-\beta)$, where $1-\beta$ is the target power. The standard test at the final analysis rejects $H_0$ if the observed $\hat{p}$ exceeds a critical value derived from the binomial (or its normal approximation).
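The sample size formula above can be evaluated with the Python standard library alone. The sketch below is illustrative and is not Zetyra's engine; the function name `fixed_sample_size` is an assumption made for this example. With $p_0 = 0.20$, $p_1 = 0.40$, one-sided $\alpha = 0.025$ and 80% power it gives 36, which agrees with the `initial_n` shown in the example response in Section 9.

```python
from math import ceil, sqrt
from statistics import NormalDist


def fixed_sample_size(p0: float, p1: float, alpha: float = 0.025, power: float = 0.80) -> int:
    """Normal-approximation sample size for a one-sided one-sample binomial test.

    Hypothetical helper name; implements the formula in Section 2 directly.
    """
    if not 0.0 < p0 < p1 < 1.0:
        raise ValueError("require 0 < p0 < p1 < 1")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)  # upper-alpha normal quantile
    z_beta = NormalDist().inv_cdf(power)         # quantile matching the target power
    numerator = z_alpha * sqrt(p0 * (1.0 - p0)) + z_beta * sqrt(p1 * (1.0 - p1))
    return ceil((numerator / (p1 - p0)) ** 2)


print(fixed_sample_size(0.20, 0.40))  # 36
```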

At the interim look with $n_1$ patients enrolled and $x_1$ responses, the design chooses between (i) early efficacy stop, (ii) early futility stop, or (iii) continuation, optionally with a re-estimated target sample size $n^*$ bounded by a pre-specified cap $n_{\max}$.

3. Bayesian Mode

The Bayesian mode uses a conjugate Beta–Binomial framework. With prior $p \sim \text{Beta}(\alpha_0, \beta_0)$ and interim data $x_1$ responses in $n_1$ patients, the posterior is:

$$p \,|\, x_1, n_1 \sim \text{Beta}(\alpha_0 + x_1,\; \beta_0 + n_1 - x_1)$$

Posterior efficacy stopping. Stop early for efficacy at the interim if the posterior probability that $p$ exceeds the null rate clears the interim bar:

$$\Pr(p > p_0 \mid x_1, n_1) \geq \gamma_\text{efficacy}$$
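As an illustration of the posterior efficacy check, the sketch below evaluates $\Pr(p > p_0 \mid x_1, n_1)$ for an arbitrary Beta prior, including the non-integer Jeffreys default. It is a standalone example, not Zetyra's implementation: the regularized incomplete beta function is computed with the textbook continued-fraction expansion (modified Lentz method) so that no third-party library is needed, and the names `beta_cdf` and `posterior_prob_exceeds` are assumptions.

```python
from math import exp, lgamma, log


def _beta_cf(a: float, b: float, x: float, max_iter: int = 300, eps: float = 3e-14) -> float:
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((a - 1.0 + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (a + b + m) * x / ((a + m2) * (a + 1.0 + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            return h
    raise RuntimeError("continued fraction did not converge")


def beta_cdf(x: float, a: float, b: float) -> float:
    """Regularized incomplete beta function I_x(a, b), i.e. the Beta(a, b) CDF."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = exp(lgamma(a + b) - lgamma(a) - lgamma(b) + a * log(x) + b * log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def posterior_prob_exceeds(x1: int, n1: int, p0: float,
                           prior_alpha: float = 0.5, prior_beta: float = 0.5) -> float:
    """Pr(p > p0 | x1 responses in n1 patients) under a Beta prior."""
    return 1.0 - beta_cdf(p0, prior_alpha + x1, prior_beta + n1 - x1)


# Illustrative interim: 8 responses in 18 patients, p0 = 0.20, Jeffreys prior.
prob = posterior_prob_exceeds(8, 18, 0.20)
print(round(prob, 4), prob >= 0.95)
```

The efficacy stop fires when the returned probability is at least gamma_efficacy; the 0.95 in the last line is the API default, used here only for illustration.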

Two thresholds, not one. The design uses two distinct posterior-probability bars: gamma_efficacy at the interim (typically high, e.g., 0.97–0.99) and gamma_final at the final analysis (defaults to $1 - \alpha$, e.g., 0.975 for $\alpha = 0.025$). The interim bar is the stop-early gate; the final bar is the success criterion. Conflating the two depresses simulated power because predictive probability then projects to an inflated final bar.

Predictive futility stopping. Compute the Bayesian predictive probability (PPoS) that the trial will clear gamma_final at the final analysis given current data:

$$\text{PPoS} = \Pr\!\left[\Pr(p > p_0 \mid \text{final data}) \geq \gamma_\text{final} \,\middle|\, x_1, n_1\right]$$

Stop for futility if $\text{PPoS} \leq \delta_\text{futility}$ (typically around 0.05). Otherwise, continue, optionally recalculating the final $n^*$ up to $n_{\max}$.
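The predictive probability can be computed exactly by summing beta-binomial probabilities over the future outcomes that would clear gamma_final. The sketch below makes one simplifying assumption that differs from the Jeffreys default: it uses the flat Beta(1, 1) prior so that all Beta parameters are integers, which lets the posterior tail be written as a binomial sum, $\Pr(p > p_0) = \Pr(\text{Bin}(a+b-1, p_0) \leq a-1)$ for $p \sim \text{Beta}(a, b)$. Function names are hypothetical and the code is not Zetyra's engine.

```python
from math import comb, exp, lgamma


def post_prob_gt(p0: float, a: int, b: int) -> float:
    """Pr(p > p0) for p ~ Beta(a, b) with positive INTEGER a and b."""
    m = a + b - 1
    return sum(comb(m, j) * p0**j * (1.0 - p0) ** (m - j) for j in range(a))


def beta_binomial_pmf(k: int, m: int, a: float, b: float) -> float:
    """Pr(k responses among m future patients) when p ~ Beta(a, b)."""
    def log_beta(u: float, v: float) -> float:
        return lgamma(u) + lgamma(v) - lgamma(u + v)
    return comb(m, k) * exp(log_beta(a + k, b + m - k) - log_beta(a, b))


def ppos(x1: int, n1: int, n_final: int, p0: float, gamma_final: float,
         a0: int = 1, b0: int = 1) -> float:
    """Predictive probability that the final posterior clears gamma_final.

    Assumes an integer Beta(a0, b0) prior (flat by default) for this sketch.
    """
    a, b = a0 + x1, b0 + n1 - x1   # interim posterior parameters
    m = n_final - n1               # patients still to be enrolled
    total = 0.0
    for k in range(m + 1):         # k = future responses
        if post_prob_gt(p0, a + k, b + m - k) >= gamma_final:
            total += beta_binomial_pmf(k, m, a, b)
    return total


# Illustrative interim looks at n1 = 18 of a planned 36, p0 = 0.20.
for x1 in (3, 5, 8):
    value = ppos(x1, 18, 36, 0.20, 0.975)
    print(x1, round(value, 4), "futility" if value <= 0.05 else "continue")
```

Passing a larger `n_final` to `ppos` shows how the predictive probability would change under a re-estimated $n^*$.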

Threshold calibration. Neither gamma_efficacy nor gamma_final is analytically tied to frequentist Type I error; verify by Monte Carlo at $p = p_0$. If Type I error is inflated, raise gamma_efficacy first (interim early stops are counted as rejections); raising gamma_final also helps but costs power. If power is below target, lower gamma_final toward $1 - \alpha$ or raise the interim/final N. Zetyra's engine reports both rates in the OC table.

4. Conditional Power Mode

The conditional power (CP) mode adapts the Mehta–Pocock (2011) promising zone framework from two-arm to single-arm designs. Given interim statistic $z_1$ computed under the one-sample binomial:

$$z_1 = \frac{\hat{p}_1 - p_0}{\sqrt{p_0(1-p_0)/n_1}}$$

the conditional power under the observed current trend (or under the target alternative, per SAP) is:

$$CP(z_1) = \Phi\!\left( \frac{z_1 \sqrt{n_1} + (z_1/\sqrt{n_1})(n - n_1) - z_\alpha \sqrt{n}}{\sqrt{n - n_1}} \right)$$
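The two formulas above translate directly into code. The following is a minimal sketch of the current-trend version only, using the standard library; the function names are assumptions for illustration. A useful sanity check follows from the algebra: when $z_1 = z_\alpha \sqrt{n_1/n}$ the numerator vanishes and the conditional power is exactly 0.5.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def interim_z(x1: int, n1: int, p0: float) -> float:
    """One-sample interim z statistic with the null variance p0(1 - p0)/n1."""
    return (x1 / n1 - p0) / sqrt(p0 * (1.0 - p0) / n1)


def conditional_power(z1: float, n1: int, n: int, alpha: float = 0.025) -> float:
    """Conditional power under the observed current trend (formula above)."""
    if not 0 < n1 < n:
        raise ValueError("require 0 < n1 < n")
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha)
    drift = z1 / sqrt(n1)  # estimated per-patient drift on the z scale
    numerator = z1 * sqrt(n1) + drift * (n - n1) - z_alpha * sqrt(n)
    return _STD_NORMAL.cdf(numerator / sqrt(n - n1))


# Illustrative interim: 6 responses in 18 patients, p0 = 0.20, planned n = 36.
z1 = interim_z(6, 18, 0.20)
print(round(z1, 3), round(conditional_power(z1, 18, 36), 3))
```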

Zones are defined by CP thresholds:

•Favorable (CP > promising upper): large effect; no re-estimation needed (or consider efficacy stop).
•Promising (promising lower ≤ CP ≤ promising upper): re-estimate $n^*$ to restore planned CP, capped at $n_{\max}$.
•Unfavorable (futility ≤ CP < promising lower): continue with planned sample size; do not inflate.
•Futility (CP < futility threshold): consider stopping for futility.
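The four zones listed above amount to a simple threshold lookup. The sketch below uses the API defaults from Section 9 (cp_futility = 0.10, cp_promising_lower = 0.30, cp_promising_upper = 0.80); the function name and the returned labels are assumptions chosen for this example, not the engine's enum values.

```python
def classify_zone(cp: float,
                  cp_futility: float = 0.10,
                  cp_promising_lower: float = 0.30,
                  cp_promising_upper: float = 0.80) -> str:
    """Map a conditional power value to one of the four zones."""
    if not 0.0 <= cp <= 1.0:
        raise ValueError("cp must lie in [0, 1]")
    if not cp_futility < cp_promising_lower < cp_promising_upper:
        raise ValueError("thresholds must be strictly increasing")
    if cp < cp_futility:
        return "futility"      # consider stopping
    if cp < cp_promising_lower:
        return "unfavorable"   # continue at planned n, do not inflate
    if cp <= cp_promising_upper:
        return "promising"     # re-estimate n*, capped at n_max
    return "favorable"         # no re-estimation needed


for cp in (0.05, 0.20, 0.30, 0.80, 0.95):
    print(cp, classify_zone(cp))
```

Both promising-zone bounds are treated as inclusive, matching the inequalities in the list.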

The original Mehta–Pocock theorem (Chen, DeMets, Lan 2004; Gao, Ware, Mehta 2008) preserves Type I error in the two-arm normal/z-test setting when re-estimation is confined to the promising zone. For single-arm binomial designs this guarantee does not transfer analytically — the discrete sample space and exact-binomial final test mean Type I error must be confirmed via simulation (Tier 2 OC table) before fixing cp_promising_lower / cp_promising_upper for the protocol.

Stronger result: in our base setting, no cp_promising_lower on the grid achieves T1E ≤ α

Exact enumeration of the joint binomial distribution (Qian 2026) for the representative oncology setting $(p_0=0.23,\, p_1=0.35,\, N_{\text{init}}=84,\, n_{\text{int}}=65,\, N_{\max}=200)$ shows that exact Type I error sits uniformly above α=0.05 across every cp_promising_lower value tested (5.7–6.3% over the grid 0.30 → 0.80). The CP design in this setting is uncalibratable: no fixed cp_promising_lower choice attains nominal control.

Across five different $(p_0,\, p_1,\, n_{\text{int}},\, N_{\text{init}},\, N_{\max})$ configurations, the discrete z-test critical-count rounding bias changes sign: in some settings T1E sits above α everywhere, in others some cp_promising_lower values control T1E and others do not, with no monotone or otherwise predictable rule. A cp_promising_lower that controls T1E in one design configuration may not control it in another; calibration cannot be extrapolated across designs.

Why: discrete z-test bias plus SSR-driven boundary crossings

Because interim outcomes follow a discrete $\text{Binomial}(n_1, p_0)$ distribution, the one-sided z-test critical count $x_{\text{crit}}(N)$ is the smallest integer with $z \geq z_\alpha$, so the actual fixed-N rejection probability under $H_0$ differs from nominal α by a signed amount whose sign depends on $(p_0, N)$. SSR moves trials between finite-N tests with their own opposite-sign discreteness biases in a way that depends on cp_promising_lower and the realised interim count. The resulting T1E behaviour is design-specific and cannot be controlled by a fixed cp_promising_lower rule.
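The fixed-N discreteness effect described here can be inspected directly. The sketch below finds $x_{\text{crit}}(N)$ by its definition and sums the exact binomial tail under $H_0$; it covers only the fixed-N test, not the SSR transitions between sample sizes. The sample sizes in the loop are arbitrary illustrative choices around the base setting ($p_0 = 0.23$, $\alpha = 0.05$), and the function names are assumptions.

```python
from math import comb, sqrt
from statistics import NormalDist


def z_stat(x: int, n: int, p0: float) -> float:
    """One-sample z statistic with the null variance."""
    return (x / n - p0) / sqrt(p0 * (1.0 - p0) / n)


def critical_count(n: int, p0: float, alpha: float) -> int:
    """Smallest response count x with z >= z_alpha (n + 1 if none exists)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    for x in range(n + 1):
        if z_stat(x, n, p0) >= z_alpha:
            return x
    return n + 1


def exact_size(n: int, p0: float, alpha: float) -> float:
    """Exact rejection probability of the fixed-N z-test when p = p0."""
    x_crit = critical_count(n, p0, alpha)
    return sum(comb(n, x) * p0**x * (1.0 - p0) ** (n - x) for x in range(x_crit, n + 1))


# Signed deviation from nominal alpha at a few illustrative sample sizes.
for n in (84, 120, 160, 200):
    size = exact_size(n, 0.23, 0.05)
    print(n, critical_count(n, 0.23, 0.05), round(size - 0.05, 4))
```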

Practitioners who must use this design should report exact T1E (or simulated T1E with adequate precision) across their full cp_promising_lower grid for their specific design parameters, treat the worst-case T1E across plausible perturbations as the operating characteristic, and not extrapolate calibration from one design to another.

Recommendation: switch to Bayesian PP SSR (calibratable, monotone)

The Bayesian predictive-probability mode of this calculator decouples the interim early-stop bar (gamma_efficacy) from the final-analysis bar (gamma_final). Exact T1E is monotone in gamma_final and the calibration surface is far less irregular than the cp_promising_lower surface, so the design can be calibrated to any target T1E by enumeration: pick the smallest gamma_final that meets the budget. In our base setting, calibrated gamma_final = 0.955 yields exact T1E = 4.7% with 82.7% power at p₁, which is competitive with the best-calibrated CP design (when one exists) and dominant when CP cannot be calibrated.

A partial-symmetric comparison (Qian 2026, §4.6) confirms that switching the CP design's final analysis from a z-test to a Bayesian posterior threshold — while keeping the same CP-zone SSR logic — is sufficient to restore calibration. The discrete z-test final is therefore a sufficient cause of the inflation. Practitioners who want to keep familiar CP-zone SSR machinery can recover calibrated T1E by replacing the z-test final with a Bayesian posterior threshold; or adopt Bayesian PP SSR end-to-end (the cleaner path).

In-product calibration helper: the Single-Arm SSR calculator now ships a Calibrate to α button next to the gamma_final field (Bayesian mode). It bisects gamma_final against simulated T1E at p_true = p₀ and fills in the calibrated value, surfacing both achieved T1E and power. If the design is uncalibratable on this axis (T1E cannot be brought to α even at gamma_final ≈ 0.9999), the helper flags non-convergence so you can revisit gamma_efficacy or the prior rather than ship an inflated design.

5. Prior Specification

The choice of prior $\text{Beta}(\alpha_0, \beta_0)$ materially affects interim decisions, particularly when $n_1$ is small. Zetyra offers three presets:

•Jeffreys Beta(0.5, 0.5) — default. The Jeffreys prior is the invariant reference prior for a Bernoulli parameter, derived from the square root of the Fisher information. It is objective in the sense that it is invariant under reparameterization and has prior effective sample size (ESS) of 1.
•Flat Beta(1, 1). The uniform prior on $[0, 1]$. Often preferred by sponsors for its intuitive interpretation; ESS of 2. Slightly more informative than Jeffreys in the tails.
•Custom informative priors. Derived from prior trials via the MAP prior / bayesian-borrowing workflow or elicited from experts via prior elicitation. Use with caution: regulators scrutinize informative priors that favor efficacy claims.

Prior ESS consideration. Prior ESS = $\alpha_0 + \beta_0$. If ESS approaches $n_1$, the posterior is heavily influenced by the prior. Report prior ESS and run sensitivity analyses (Jeffreys vs. flat vs. custom) before finalizing thresholds.
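One way to make the ESS comparison concrete is to note that the Beta posterior mean is a weighted average of the prior mean and the observed rate, with prior weight ESS / (ESS + $n_1$). The sketch below computes that weight for a few priors; the helper names are hypothetical and the Beta(4, 16) prior is an arbitrary illustrative "custom" choice, not a recommendation.

```python
def prior_weight(prior_alpha: float, prior_beta: float, n1: int) -> float:
    """Share of the posterior mean contributed by the prior: ESS / (ESS + n1)."""
    ess = prior_alpha + prior_beta
    return ess / (ess + n1)


def posterior_mean(x1: int, n1: int, prior_alpha: float, prior_beta: float) -> float:
    """Mean of the Beta(prior_alpha + x1, prior_beta + n1 - x1) posterior."""
    return (prior_alpha + x1) / (prior_alpha + prior_beta + n1)


# Interim with 5 responses in 18 patients under three priors.
for label, a, b in (("Jeffreys", 0.5, 0.5), ("Flat", 1.0, 1.0), ("Custom (illustrative)", 4.0, 16.0)):
    w = prior_weight(a, b, 18)
    print(label, round(w, 3), round(posterior_mean(5, 18, a, b), 3))
```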

6. Operating Characteristics

For both modes, simulated operating characteristics are mandatory before fixing thresholds for the protocol. Bayesian stopping rules are not analytically tied to frequentist Type I error, and the two-arm Mehta–Pocock promising-zone theorem does not transfer analytically to single-arm binomial CP designs (FDA Adaptive Designs Guidance 2019, Section V).

Zetyra's OC table reports, for a grid of true rates $p \in \{p_0, \ldots, p_1, \ldots\}$:

•Type I error at $p = p_0$: must be $\leq \alpha$. If inflated in Bayesian mode, use the in-product Calibrate to α button, which bisects gamma_final via simulation, or raise gamma_efficacy (typically toward 0.97–0.99) and re-simulate. If inflated in CP mode, do not assume tighter promising-zone bounds will help: T1E behaviour is design-specific and in many configurations sits uniformly above α across every cp_promising_lower value (Qian 2026, exact-enumeration result; see the Section 4 callouts above). The recommended path is to switch to Bayesian PP mode, which is calibratable on the continuous gamma_final axis.
•Simulated power at $p = p_1$: should match the planned power target.
•Expected sample size $\mathbb{E}[N \mid p]$: shows the adaptive design's efficiency gain over fixed-N under each true rate, together with quantiles $N_{10}, N_{50}, N_{90}$.
•Stopping probabilities: Pr(efficacy stop), Pr(futility stop), Pr(N hits cap) at each true rate.

Interpret the table jointly: at a nominal α of 0.05, a design with 5% Type I error, 82% power at $p_1$, and $\mathbb{E}[N \mid p_0]$ substantially below the fixed-N is well-tuned. An 8% Type I error means the thresholds are too liberal.
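For small designs, Type I error, power and expected N can be obtained by exact enumeration rather than simulation. The sketch below does this for a deliberately simplified design and should not be read as Zetyra's OC engine: it assumes a flat Beta(1, 1) prior (so the posterior tail is a binomial sum), a single interim efficacy stop at gamma_efficacy, a final success bar at gamma_final, and no futility stop or sample size re-estimation. Because futility stopping is omitted, the expected N at $p_0$ drops only slightly here. All names are hypothetical.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def post_prob_gt(p0: float, a: int, b: int) -> float:
    """Pr(p > p0) for p ~ Beta(a, b) with positive INTEGER a and b."""
    m = a + b - 1
    return sum(binom_pmf(j, m, p0) for j in range(a))


def design_oc(p_true: float, p0: float, n1: int, n_final: int,
              gamma_efficacy: float, gamma_final: float) -> tuple[float, float]:
    """Exact (rejection probability, expected N) at true rate p_true.

    Simplified sketch: flat prior, efficacy-only interim, no futility, no SSR.
    """
    m = n_final - n1
    reject = 0.0
    expected_n = 0.0
    for x1 in range(n1 + 1):
        w1 = binom_pmf(x1, n1, p_true)
        if post_prob_gt(p0, 1 + x1, 1 + n1 - x1) >= gamma_efficacy:
            reject += w1                 # interim efficacy stop counts as a rejection
            expected_n += w1 * n1
            continue
        expected_n += w1 * n_final       # trial runs to the planned final N
        for x2 in range(m + 1):
            x = x1 + x2
            if post_prob_gt(p0, 1 + x, 1 + n_final - x) >= gamma_final:
                reject += w1 * binom_pmf(x2, m, p_true)
    return reject, expected_n


for label, p in (("p0", 0.20), ("p1", 0.40)):
    rate, e_n = design_oc(p, 0.20, 18, 36, gamma_efficacy=0.99, gamma_final=0.975)
    print(label, round(rate, 4), round(e_n, 2))
```

Evaluating `design_oc` at $p = p_0$ over a grid of gamma_final values gives an exact calibration curve for this simplified design.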

7. Regulatory Considerations

•FDA Adaptive Designs Guidance (2019), Section IV.B. Sample size re-estimation is a well-characterized adaptation provided the rule, timing, and caps are pre-specified and Type I error is verified by simulation.
•FDA Accelerated Approval. Single-arm ORR trials supporting accelerated approval must enroll a pre-specified population, use a locked analysis plan, and demonstrate a meaningful effect over historical control.
•Project Optimus (2023). FDA oncology dose-optimization initiative emphasizes adequate sample sizes for dose selection and characterization of tolerability in Phase II, which SSR directly supports by expanding cohorts under promising interim trends.
•Pre-specification requirements. The SAP must fix $p_0, p_1, \alpha, \text{power}$, the interim timing $n_1$, the prior (if Bayesian), the thresholds $(\gamma_\text{efficacy}, \gamma_\text{final}, \delta_\text{futility})$ or CP zones, the cap $n_{\max}$, and include simulation-based OC evidence.
•SAP text generation. The Zetyra report exports an SAP-ready decision rule description plus the OC table and sensitivity scenarios directly suitable for inclusion in a protocol and SAP submission.

8. Assumptions & Limitations

•Historical control stability. The entire design rests on $p_0$ being a stable, well-characterized historical rate. Drift in $p_0$ (e.g., supportive-care improvements, population shifts, selection bias in the historical source) inflates Type I error without detection.
•Binary endpoint only. The v1 engine supports binary (response/no response) endpoints. Continuous and time-to-event single-arm designs are not implemented.
•Historical control misspecification. Even modest (2–5 pp) drift in $p_0$ can materially shift achieved Type I error. Sensitivity scenarios in the report show how the recalculated N and CP change under plausible alternative values of $p_0$.
•Not for confirmatory Phase III. Single-arm designs are exploratory; efficacy claims for full approval require randomized confirmatory evidence except in narrow accelerated-approval settings.
•One interim look. The v1 engine supports a single interim analysis. Multi-look GSD-style boundaries for single-arm trials should use the group-sequential calculator instead.

9. API Reference

Endpoint: POST /api/v1/calculators/ssr-single-arm

Request parameters

Field	Type	Default	Description
ssr_method	string	—	"bayesian" or "conditional_power"
p0	float	—	Null/historical response rate (0, 1)
p1	float	—	Target alternative rate, p1 > p0
alpha	float	0.025	One-sided Type I error
power	float	0.80	Target power at p1
interim_fraction	float	0.5	Fraction of planned N at interim look
interim_n	int?	null	Absolute interim N (overrides fraction)
n_max_factor	float	1.5	Cap as multiple of initial N (must be >1, ≤5)
n_max_absolute	int?	null	Absolute N cap (overrides n_max_factor); must be ≥10
prior_alpha	float	0.5	Beta prior α (Bayesian mode)
prior_beta	float	0.5	Beta prior β (Bayesian mode)
gamma_efficacy	float	0.95	Interim early-stop threshold. Posterior P(p > p0 | data) ≥ this triggers efficacy stop at the interim look. Calibrate via simulation.
gamma_final	float?	1−α	Final-analysis success threshold. The eventual posterior must clear this for the trial to be a positive result. Predictive probability is computed under this threshold. Default is 1−α (e.g., 0.975 for α=0.025), which keeps simulated power near the design target.
delta_futility	float	0.05	Predictive probability threshold for futility
pp_promising_upper	float	0.50	Predictive-probability upper bound for the SSR promising zone (Bayesian mode). Trials with delta_futility < PP < this extend N up to N_max; PP ≥ this continues at the originally planned N. Must be greater than delta_futility. Raise to 0.70–0.80 to keep more trials in the SSR zone and push N_p90 toward the N_max budget.
cp_futility	float	0.10	CP lower bound for futility (CP mode)
cp_promising_lower	float	0.30	CP lower bound for promising zone
cp_promising_upper	float	0.80	CP upper bound for promising zone
simulate	bool	false	Run Monte Carlo OC validation
simulation_seed	int?	null	Random seed for reproducibility (auto-generated if null)
n_simulations	int	10000	Simulation replicates (1,000–100,000)

Example Request

{
  "ssr_method": "bayesian",
  "p0": 0.20,
  "p1": 0.40,
  "alpha": 0.025,
  "power": 0.80,
  "interim_fraction": 0.5,
  "n_max_factor": 1.5,
  "prior_alpha": 0.5,
  "prior_beta": 0.5,
  "gamma_efficacy": 0.95,
  "gamma_final": null,
  "delta_futility": 0.05,
  "pp_promising_upper": 0.50,
  "simulate": true,
  "simulation_seed": 42,
  "n_simulations": 10000
}

gamma_final: null defaults to 1 - alpha (e.g., 0.975 for alpha 0.025). Raise pp_promising_upper toward 0.70 to keep more trials in the SSR promising zone.
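The example request can be assembled with the Python standard library as sketched below. The host in `BASE_URL` is a hypothetical placeholder, since the documentation above gives only the path, and authentication is not shown because it is not described here; the function names are likewise assumptions. The `send` helper is defined but not called, so the snippet runs without network access.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "https://api.example.com"  # hypothetical placeholder host, not from the docs
ENDPOINT = "/api/v1/calculators/ssr-single-arm"


def build_request(payload: dict) -> Request:
    """Create a JSON POST request for the single-arm SSR endpoint."""
    return Request(
        url=BASE_URL + ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: Request) -> dict:
    """Send the request and decode the JSON response (requires a real host)."""
    with urlopen(request) as response:
        return json.load(response)


payload = {
    "ssr_method": "bayesian",
    "p0": 0.20,
    "p1": 0.40,
    "alpha": 0.025,
    "power": 0.80,
    "interim_fraction": 0.5,
    "n_max_factor": 1.5,
    "prior_alpha": 0.5,
    "prior_beta": 0.5,
    "gamma_efficacy": 0.95,
    "gamma_final": None,  # serialized as null, so the server default 1 - alpha applies
    "delta_futility": 0.05,
    "pp_promising_upper": 0.50,
    "simulate": True,
    "simulation_seed": 42,
    "n_simulations": 10000,
}

request = build_request(payload)
print(request.get_method(), request.full_url)
```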

Response Schema (abridged)

{
  "calculation_id": "...",
  "tier": "analytical+simulation",
  "analytical_results": {
    "initial_n": 36,
    "interim_n": 18,
    "interim_fraction": 0.5,
    "ssr_method": "bayesian",
    "posterior_probability": 0.97,
    "predictive_probability": 0.81,
    "conditional_power": 0.82,
    "conditional_power_planned": 0.82,
    "zone": "",
    "z1": 1.96,
    "efficacy_stop": true,
    "futility_stop": false,
    "recalculated_n": 18,
    "inflation_factor": 0.5,
    "n_capped": false,
    "n_max_used": 54,
    "gamma_final_used": 0.975,
    "prior_description": "Jeffreys Beta(0.5, 0.5)",
    "decision_rule_description": "...",
    "recalculation_scenarios": [
      {
        "label": "Planned effect",
        "assumed_nuisance": 0.40,
        "recalculated_n_per_arm": 36,
        "recalculated_n_total": 36,
        "inflation_factor": 1.0,
        "conditional_power": 0.82,
        "decision": "continue_favorable"
      }
    ],
    "regulatory_notes": [...]
  },
  "metadata": {...},
  "simulation": {...},
  "warnings": [],
  "regulatory_citations": [...]
}

decision enum values: stop_efficacy, stop_futility, continue_ssr, continue_favorable, continue_unfavorable. Five sensitivity rows are returned by default (50%, 75%, 100%, 125%, 150% of planned effect).

10. References

Simon R. Optimal two-stage designs for Phase II clinical trials. Controlled Clinical Trials. 1989;10(1):1-10.
Thall PF, Simon R. Practical Bayesian guidelines for Phase IIB clinical trials. Biometrics. 1994;50(2):337-349.
Lee JJ, Liu DD. A predictive probability design for Phase II cancer clinical trials. Clinical Trials. 2008;5(2):93-106.
Mehta CR, Pocock SJ. Adaptive increase in sample size when interim results are promising: A practical guide with examples. Statistics in Medicine. 2011;30(28):3267-3284.
Chen DT, Schell MJ, Fulp WJ, et al. Application of Bayesian predictive probability for interim futility analysis in single-arm phase II trial. Translational Cancer Research. 2019;8(Suppl 4):S404-S420.
U.S. Food and Drug Administration. Adaptive Designs for Clinical Trials of Drugs and Biologics: Guidance for Industry. November 2019.
U.S. Food and Drug Administration. Project Optimus: Optimizing the Dosage of Human Prescription Drugs and Biological Products for the Treatment of Oncologic Diseases. 2023.
Qian L. Conditional Power Promising Zone Sample Size Re-estimation Inflates Type I Error in Single-Arm Binary Trials: An Exact-Enumeration Study and Comparison with Bayesian Predictive Probability SSR. Zetyra | Evidence in the Wild; April 2026 (under peer review). github.com/evidenceinthewild/CP-SSR-Binary-Trials

Last updated: May 2026