
Single-Arm Sample Size Re-estimation (SSR)

Technical documentation for adaptive single-arm Phase II designs comparing a binary response rate against a fixed historical control. Covers the Bayesian posterior/predictive framework, Mehta–Pocock promising-zone conditional power, prior specification, operating characteristics, and FDA regulatory considerations (2019 Adaptive Designs guidance, Project Optimus 2023).

1. Overview & Motivation

Single-arm Phase II trials are the dominant design in early oncology drug development. Enrolling all patients onto the experimental arm accelerates evidence generation when a randomized comparison is ethically or practically infeasible, and the observed objective response rate (ORR) is compared against a historical control rate $p_0$ drawn from prior trials of standard-of-care.

Single-arm designs can support FDA accelerated approval in selected settings, especially oncology, when the endpoint is reasonably likely to predict clinical benefit and confirmatory evidence is planned or required. Within those settings, single-arm trials powered against a well-characterized $p_0$ are a common design choice; the pathway is not a default for all single-arm trials.

Why adaptive SSR? The initial sample size depends on a minimally-clinically-important alternative $p_1$ that sponsors often specify with significant uncertainty. If interim data suggest the true effect is smaller than $p_1$ but still clinically meaningful, a modest expansion can preserve power. Conversely, a very large observed effect supports an efficacy interim stop, and a very small effect supports futility termination; both spare patients and resources.

When adaptive SSR helps: (a) the clinically meaningful effect size is uncertain, (b) operational flexibility is valued (stop early for efficacy or futility), (c) the historical control rate $p_0$ is well-established, and (d) the trial is exploratory (Phase II), not confirmatory.

2. Design Framework

Let $X_1, \ldots, X_n$ be i.i.d. Bernoulli responses with unknown true rate $p$. We test:

$$H_0: p \leq p_0 \quad \text{vs.} \quad H_1: p > p_0$$

where $p_0$ is the historical control rate (null) and $p_1$ is the target alternative. Under a normal approximation to the one-sample binomial, the required fixed-design sample size is:

$$n = \left\lceil \left( \frac{z_\alpha \sqrt{p_0(1-p_0)} + z_\beta \sqrt{p_1(1-p_1)}}{p_1 - p_0} \right)^2 \right\rceil$$

with $z_\alpha = \Phi^{-1}(1-\alpha)$ and $z_\beta = \Phi^{-1}(\text{power})$. The standard test at the final analysis rejects $H_0$ if the observed $\hat{p}$ exceeds a critical value derived from the binomial (or its normal approximation).
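As a quick check of the formula above, the following minimal Python sketch evaluates the fixed-design sample size using only the standard library. The function name `fixed_design_n` is illustrative and is not part of the Zetyra API.

```python
import math
from statistics import NormalDist


def fixed_design_n(p0, p1, alpha=0.025, power=0.80):
    """Normal-approximation sample size for a one-sample binomial test."""
    if not 0.0 < p0 < p1 < 1.0:
        raise ValueError("require 0 < p0 < p1 < 1")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)  # one-sided critical value
    z_beta = NormalDist().inv_cdf(power)         # quantile for target power
    numerator = (z_alpha * math.sqrt(p0 * (1.0 - p0))
                 + z_beta * math.sqrt(p1 * (1.0 - p1)))
    return math.ceil((numerator / (p1 - p0)) ** 2)


# Same inputs as the example request in the API reference.
print(fixed_design_n(0.20, 0.40, alpha=0.025, power=0.80))  # 36
```

With $p_0 = 0.20$, $p_1 = 0.40$, one-sided $\alpha = 0.025$ and 80% power, the formula gives 36, which agrees with the `initial_n` shown in the abridged response example in the API reference.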

At the interim look with $n_1$ patients enrolled and $x_1$ responses, the design chooses between (i) early efficacy stop, (ii) early futility stop, or (iii) continuation, optionally with a re-estimated target sample size $n^*$ bounded by a pre-specified cap $n_{\max}$.

3. Bayesian Mode

The Bayesian mode uses a conjugate Beta–Binomial framework. With prior $p \sim \text{Beta}(\alpha_0, \beta_0)$ and interim data $x_1$ responses in $n_1$ patients, the posterior is:

$$p \mid x_1, n_1 \sim \text{Beta}(\alpha_0 + x_1,\; \beta_0 + n_1 - x_1)$$

Posterior efficacy stopping. Stop early for efficacy at the interim if the posterior probability that $p$ exceeds the null rate clears the interim bar:

$$\Pr(p > p_0 \mid x_1, n_1) \geq \gamma_\text{efficacy}$$

Two thresholds, not one. The design uses two distinct posterior-probability bars: gamma_efficacy at the interim (typically high, e.g., 0.97–0.99) and gamma_final at the final analysis (defaults to $1 - \alpha$, e.g., 0.975 for $\alpha = 0.025$). The interim bar is the stop-early gate; the final bar is the success criterion. Conflating the two depresses simulated power because predictive probability then projects to an inflated final bar.

Predictive futility stopping. Compute the Bayesian predictive probability (PPoS) that the trial will clear gamma_final at the final analysis given current data:

$$\text{PPoS} = \Pr\!\left[\Pr(p > p_0 \mid \text{final data}) \geq \gamma_\text{final} \,\middle|\, x_1, n_1\right]$$

Stop for futility if $\text{PPoS} \leq \delta_\text{futility}$ (typically around 0.05). Otherwise, continue, optionally recalculating the final $n^*$ up to $n_{\max}$.
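The interim logic of this section can be sketched in standard-library Python as follows. This is an illustration under stated assumptions, not Zetyra's engine: the Beta CDF is evaluated with a textbook continued-fraction expansion, PPoS is computed by summing the Beta-Binomial predictive distribution over the remaining patients at a fixed final N (no re-estimation), and all function names are hypothetical.

```python
import math


def betainc(a, b, x, max_iter=500, eps=1e-14):
    """Regularized incomplete beta I_x(a, b) via Lentz's continued fraction."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    if x > (a + 1.0) / (a + b + 2.0):  # symmetry gives faster convergence
        return 1.0 - betainc(b, a, 1.0 - x, max_iter, eps)
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x)) / a
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        even = m * (b - m) * x / ((a + 2 * m - 1.0) * (a + 2 * m))
        odd = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1.0))
        for num in (even, odd):
            d = 1.0 + num * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + num / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < eps:
            break
    return front * h


def post_prob(p0, a, b):
    """Pr(p > p0) when p ~ Beta(a, b)."""
    return 1.0 - betainc(a, b, p0)


def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def ppos(x1, n1, n_final, p0, gamma_final, a0=0.5, b0=0.5):
    """Predictive probability of clearing gamma_final at a fixed final N."""
    a, b = a0 + x1, b0 + n1 - x1
    m = n_final - n1  # patients still to be observed
    total = 0.0
    for y in range(m + 1):  # y = future responses
        if post_prob(p0, a + y, b + m - y) >= gamma_final:
            log_pmf = (math.log(math.comb(m, y))
                       + log_beta(a + y, b + m - y) - log_beta(a, b))
            total += math.exp(log_pmf)  # Beta-Binomial predictive mass
    return total


def interim_decision(x1, n1, n_final, p0, gamma_efficacy=0.97,
                     gamma_final=0.975, delta_futility=0.05, a0=0.5, b0=0.5):
    if post_prob(p0, a0 + x1, b0 + n1 - x1) >= gamma_efficacy:
        return "stop_efficacy"
    if ppos(x1, n1, n_final, p0, gamma_final, a0, b0) <= delta_futility:
        return "stop_futility"
    return "continue"


for x1 in (1, 5, 9):
    print(x1, interim_decision(x1, n1=18, n_final=36, p0=0.20))
```

The efficacy check uses gamma_efficacy while the predictive calculation projects to gamma_final, which mirrors the two-threshold separation described above. The re-estimation step for continuing trials is deliberately left out of this sketch.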

Threshold calibration. Neither gamma_efficacy nor gamma_final is analytically tied to frequentist Type I error; verify by Monte Carlo at $p = p_0$. If Type I error is inflated, raise gamma_efficacy first (interim early stops are counted as rejections); raising gamma_final also helps but costs power. If power is below target, lower gamma_final toward $1 - \alpha$ or raise the interim/final N. Zetyra's engine reports both rates in the OC table.

4. Conditional Power Mode

The conditional power (CP) mode adapts the Mehta–Pocock (2011) promising zone framework from two-arm to single-arm designs. Given interim statistic $z_1$ computed under the one-sample binomial:

$$z_1 = \frac{\hat{p}_1 - p_0}{\sqrt{p_0(1-p_0)/n_1}}$$

the conditional power under the observed current trend (or under the target alternative, per SAP) is:

$$CP(z_1) = \Phi\!\left( \frac{z_1 \sqrt{n_1} + (z_1/\sqrt{n_1})(n - n_1) - z_\alpha \sqrt{n}}{\sqrt{n - n_1}} \right)$$
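The two formulas above translate directly into code. The sketch below implements only the current-trend version of conditional power, where the per-patient drift is estimated as $z_1/\sqrt{n_1}$; the function names are illustrative and not taken from the product.

```python
import math
from statistics import NormalDist


def interim_z(x1, n1, p0):
    """One-sample binomial z-statistic at the interim look."""
    return (x1 / n1 - p0) / math.sqrt(p0 * (1.0 - p0) / n1)


def conditional_power(z1, n1, n, alpha=0.025):
    """Current-trend conditional power at final sample size n."""
    if not 0 < n1 < n:
        raise ValueError("require 0 < n1 < n")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    drift = z1 / math.sqrt(n1)  # estimated drift per patient
    numerator = z1 * math.sqrt(n1) + drift * (n - n1) - z_alpha * math.sqrt(n)
    return NormalDist().cdf(numerator / math.sqrt(n - n1))


z1 = interim_z(x1=6, n1=18, p0=0.20)
print(round(z1, 3), round(conditional_power(z1, n1=18, n=36), 3))
```

For 6 responses in 18 patients against $p_0 = 0.20$, $z_1 \approx 1.41$ and the current-trend CP at $n = 36$ is roughly 0.52, so the observed trend alone gives about even odds of success at the planned sample size.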

Zones are defined by CP thresholds:

  • Favorable (CP > promising upper): large effect; no re-estimation needed (or consider efficacy stop).
  • Promising (promising lower ≤ CP ≤ promising upper): re-estimate $n^*$ to restore planned CP, capped at $n_{\max}$.
  • Unfavorable (futility ≤ CP < promising lower): continue with planned sample size; do not inflate.
  • Futility (CP < futility threshold): consider stopping for futility.
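The four zones above amount to a simple threshold lookup. A minimal sketch follows, using the documented defaults for cp_futility, cp_promising_lower and cp_promising_upper; the function name and the returned labels are illustrative.

```python
def classify_zone(cp, cp_futility=0.10, cp_promising_lower=0.30,
                  cp_promising_upper=0.80):
    """Map a conditional-power value to its zone using the bullet definitions."""
    if not cp_futility < cp_promising_lower < cp_promising_upper:
        raise ValueError("thresholds must be strictly increasing")
    if cp > cp_promising_upper:
        return "favorable"    # no re-estimation needed
    if cp >= cp_promising_lower:
        return "promising"    # re-estimate n*, capped at n_max
    if cp >= cp_futility:
        return "unfavorable"  # continue at planned N, do not inflate
    return "futility"         # consider stopping


for cp in (0.05, 0.20, 0.52, 0.90):
    print(cp, classify_zone(cp))
```

Note that both promising-zone bounds are inclusive, so a CP exactly equal to cp_promising_upper still triggers re-estimation.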

The original Mehta–Pocock theorem (Chen, DeMets, Lan 2004; Gao, Ware, Mehta 2008) preserves Type I error in the two-arm normal/z-test setting when re-estimation is confined to the promising zone. For single-arm binomial designs this guarantee does not transfer analytically — the discrete sample space and exact-binomial final test mean Type I error must be confirmed via simulation (Tier 2 OC table) before fixing cp_promising_lower / cp_promising_upper for the protocol.

Stronger result: in our base setting, no cp_promising_lower on the grid achieves Type I error (T1E) ≤ α

Exact enumeration of the joint binomial distribution (Qian 2026) for the representative oncology setting $(p_0=0.23,\, p_1=0.35,\, N_{\text{init}}=84,\, n_{\text{int}}=65,\, N_{\max}=200)$ shows that exact Type I error sits uniformly above α=0.05 across every cp_promising_lower value tested (5.7–6.3% over the grid 0.30 → 0.80). The CP design in this setting is uncalibratable: no fixed cp_promising_lower choice attains nominal control.

Across five different $(p_0,\, p_1,\, n_{\text{int}},\, N_{\text{init}},\, N_{\max})$ configurations, the discrete z-test critical-count rounding bias changes sign: in some settings T1E sits above α everywhere, in others some cp_promising_lower values control T1E and others do not, with no monotone or otherwise predictable rule. A cp_promising_lower that controls T1E in one design configuration may not control it in another; calibration cannot be extrapolated across designs.

Why: discrete z-test bias plus SSR-driven boundary crossings

Because interim outcomes follow a discrete $\text{Binomial}(n_1, p_0)$ distribution, the one-sided z-test critical count $x_{\text{crit}}(N)$ is the smallest integer with $z \geq z_\alpha$, so the actual fixed-N rejection probability under $H_0$ differs from nominal α by a signed amount whose sign depends on $(p_0, N)$. SSR moves trials between finite-N tests with their own opposite-sign discreteness biases in a way that depends on cp_promising_lower and the realised interim count. The resulting T1E behaviour is design-specific and cannot be controlled by a fixed cp_promising_lower (CP_L) rule.
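The fixed-N discreteness bias described above can be inspected directly. The sketch below finds the z-test critical count for a given N and computes the exact binomial rejection probability under $H_0$; it illustrates the mechanism only and does not reproduce the joint SSR enumeration in Qian 2026. Function names are hypothetical.

```python
import math
from statistics import NormalDist


def critical_count(n, p0, alpha):
    """Smallest response count x whose z-statistic reaches z_alpha."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    se = math.sqrt(p0 * (1.0 - p0) / n)
    for x in range(n + 1):
        if (x / n - p0) / se >= z_alpha:
            return x
    return n + 1  # the test can never reject at this N


def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)
               for x in range(max(k, 0), n + 1))


def exact_size(n, p0, alpha):
    """Exact rejection probability of the fixed-N z-test under H0."""
    return binom_tail(n, p0, critical_count(n, p0, alpha))


p0, alpha = 0.23, 0.05
for n in (84, 100, 120, 150, 200):
    size = exact_size(n, p0, alpha)
    print(n, critical_count(n, p0, alpha), f"{size:.4f}", f"{size - alpha:+.4f}")
```

The last printed column is the signed gap between exact and nominal size at each N. Because re-estimation sends different interim outcomes to different final N values, these gaps are mixed rather than averaged out.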

Practitioners who must use this design should report exact T1E (or simulated T1E with adequate precision) across their full cp_promising_lower grid for their specific design parameters, treat the worst-case T1E across plausible perturbations as the operating characteristic, and not extrapolate calibration from one design to another.

Recommendation: switch to Bayesian PP SSR (calibratable, monotone)

The Bayesian predictive-probability mode of this calculator decouples the interim early-stop bar (gamma_efficacy) from the final-analysis bar (gamma_final). Exact T1E is monotone in gamma_final and the calibration surface is far less irregular than the CP_L surface, so the design can be calibrated to any target T1E by enumeration: pick the smallest gamma_final that meets the budget. In our base setting, calibrated gamma_final = 0.955 yields exact T1E = 4.7% with 82.7% power at p₁ — competitive with the best-calibrated CP design (when one exists) and dominant when CP cannot be calibrated.

A partial-symmetric comparison (Qian 2026, §4.6) confirms that switching the CP design's final analysis from a z-test to a Bayesian posterior threshold, while keeping the same CP-zone SSR logic, is sufficient to restore calibration. This points to the discrete z-test final analysis, rather than the CP-zone SSR logic itself, as the driver of the inflation. Practitioners who want to keep familiar CP-zone SSR machinery can recover calibrated T1E by replacing the z-test final with a Bayesian posterior threshold, or they can adopt Bayesian PP SSR end-to-end (the cleaner path).

In-product calibration helper: the Single-Arm SSR calculator now ships a Calibrate to α button next to the gamma_final field (Bayesian mode). It bisects gamma_final against simulated T1E at p_true = p₀ and fills in the calibrated value, surfacing both achieved T1E and power. If the design is uncalibratable on this axis (T1E cannot be brought to α even at gamma_final ≈ 0.9999), the helper flags non-convergence so you can revisit gamma_efficacy or the prior rather than ship an inflated design.

5. Prior Specification

The choice of prior $\text{Beta}(\alpha_0, \beta_0)$ materially affects interim decisions, particularly when $n_1$ is small. Zetyra offers three presets:

  • Jeffreys Beta(0.5, 0.5) — default. The Jeffreys prior is the invariant reference prior for a Bernoulli parameter, derived from the square root of the Fisher information. It is objective in the sense that it is invariant under reparameterization and has prior effective sample size (ESS) of 1.
  • Flat Beta(1, 1). The uniform prior on $[0, 1]$. Often preferred by sponsors for its intuitive interpretation; ESS of 2. Slightly more informative than Jeffreys in the tails.
  • Custom informative priors. Derived from prior trials via the MAP prior / bayesian-borrowing workflow or elicited from experts via prior elicitation. Use with caution: regulators scrutinize informative priors that favor efficacy claims.

Prior ESS consideration. Prior ESS $= \alpha_0 + \beta_0$. If ESS approaches $n_1$, the posterior is heavily influenced by the prior. Report prior ESS and run sensitivity analyses (Jeffreys vs. flat vs. custom) before finalizing thresholds.
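A small sketch makes the ESS point concrete: for a Beta prior, the posterior mean is a weighted average of the prior mean and the observed rate, with prior weight ESS / (ESS + $n_1$). The Beta(4, 6) "custom" prior and the interim data below are hypothetical values chosen only for illustration.

```python
def prior_ess(a0, b0):
    """Prior effective sample size of a Beta(a0, b0) prior."""
    return a0 + b0


def prior_weight(a0, b0, n1):
    """Share of the posterior mean contributed by the prior mean."""
    ess = prior_ess(a0, b0)
    return ess / (ess + n1)


def posterior_mean(a0, b0, x1, n1):
    return (a0 + x1) / (a0 + b0 + n1)


# Hypothetical interim data: 5 responses in 18 patients.
x1, n1 = 5, 18
priors = {
    "Jeffreys Beta(0.5, 0.5)": (0.5, 0.5),
    "Flat Beta(1, 1)": (1.0, 1.0),
    "Custom Beta(4, 6) (hypothetical)": (4.0, 6.0),
}
for label, (a0, b0) in priors.items():
    print(f"{label}: ESS={prior_ess(a0, b0):g}, "
          f"prior weight={prior_weight(a0, b0, n1):.2f}, "
          f"posterior mean={posterior_mean(a0, b0, x1, n1):.3f}")
```

With $n_1 = 18$, the two reference priors contribute at most a tenth of the posterior mean, while a prior with ESS of 10 contributes more than a third, which is the situation the sensitivity analysis is meant to expose.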

6. Operating Characteristics

For both modes, simulated operating characteristics are mandatory before fixing thresholds for the protocol. Bayesian stopping rules are not analytically tied to frequentist Type I error, and the two-arm Mehta–Pocock promising-zone theorem does not transfer analytically to single-arm binomial CP designs (FDA Adaptive Designs Guidance 2019, Section V).

Zetyra's OC table reports, for a grid of true rates $p \in \{p_0, \ldots, p_1, \ldots\}$:

  • Type I error at $p = p_0$: must be $\leq \alpha$. If inflated in Bayesian mode, the in-product Calibrate to α button bisects gamma_final via simulation; or raise gamma_efficacy (typically toward 0.97–0.99) and re-simulate. If inflated in CP mode, do not assume tighter promising-zone bounds will help: T1E behaviour is design-specific and in many configurations sits uniformly above α across every cp_promising_lower value (Qian 2026, exact-enumeration result; see the Section 4 callouts above). The recommended path is to switch to Bayesian PP mode, which is calibratable on the continuous gamma_final axis.
  • Simulated power at $p = p_1$: should match the planned power target.
  • Expected sample size $\mathbb{E}[N \mid p]$: shows the adaptive design's efficiency gain over fixed-N under each true rate, together with quantiles $N_{10}, N_{50}, N_{90}$.
  • Stopping probabilities: Pr(efficacy stop), Pr(futility stop), Pr(N hits cap) at each true rate.

Interpret the table jointly: for a target α of 0.05, a design with 5% Type I error, 82% power at $p_1$, and $\mathbb{E}[N \mid p_0]$ substantially below the fixed-N is well-tuned. An 8% Type I error means the thresholds are too liberal.
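To show how Type I error, power, and expected sample size come out of a two-stage rule, the sketch below enumerates every interim count exactly for a simplified CP-mode design. The rules are assumptions made for this illustration and may differ from both Zetyra's engine and Qian 2026: a trial in the futility zone stops without rejecting, a trial in the promising zone moves to the smallest N (up to the cap) whose current-trend CP reaches the target, every other trial finishes at the planned N, and the final analysis is the discrete z-test. Its numbers should not be read as reproducing the published results.

```python
import math
from statistics import NormalDist

ND = NormalDist()


def pmf(n, p, x):
    return math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)


def tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 if k <= 0 else sum(pmf(n, p, x) for x in range(k, n + 1))


def crit(n, p0, alpha):
    """Smallest count rejecting H0 with the one-sided z-test at size n."""
    z_alpha = ND.inv_cdf(1.0 - alpha)
    se = math.sqrt(p0 * (1.0 - p0) / n)
    return next((x for x in range(n + 1) if (x / n - p0) / se >= z_alpha), n + 1)


def cp_trend(z1, n1, n, alpha):
    num = (z1 * math.sqrt(n1) + (z1 / math.sqrt(n1)) * (n - n1)
           - ND.inv_cdf(1.0 - alpha) * math.sqrt(n))
    return ND.cdf(num / math.sqrt(n - n1))


def final_n(x1, n1, n_init, n_max, p0, alpha, target, fut, lo, hi):
    z1 = (x1 / n1 - p0) / math.sqrt(p0 * (1.0 - p0) / n1)
    cp = cp_trend(z1, n1, n_init, alpha)
    if cp < fut:
        return n1  # assumed binding futility stop
    if lo <= cp <= hi:  # promising zone: search for the re-estimated N
        for n in range(n_init, n_max + 1):
            if cp_trend(z1, n1, n, alpha) >= target:
                return n
        return n_max
    return n_init


def exact_oc(p_true, p0, n1, n_init, n_max, alpha,
             target=0.80, fut=0.10, lo=0.30, hi=0.80):
    """Return (rejection probability, expected N) at true rate p_true."""
    reject, expected_n = 0.0, 0.0
    for x1 in range(n1 + 1):
        w = pmf(n1, p_true, x1)
        n = final_n(x1, n1, n_init, n_max, p0, alpha, target, fut, lo, hi)
        expected_n += w * n
        if n > n1:  # trial reaches a final analysis
            reject += w * tail(n - n1, p_true, crit(n, p0, alpha) - x1)
    return reject, expected_n


design = dict(p0=0.23, n1=65, n_init=84, n_max=200, alpha=0.05)
for p_true in (0.23, 0.35):
    rej, en = exact_oc(p_true, **design)
    print(f"p={p_true}: Pr(reject)={rej:.4f}, E[N]={en:.1f}")
```

The row at the true rate equal to $p_0$ is the Type I error and the row at $p_1$ is the power. Repeating the call over a grid of true rates and threshold values produces the kind of table described above, computed exactly instead of by Monte Carlo.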

7. Regulatory Considerations

  • FDA Adaptive Designs Guidance (2019), Section IV.B. Sample size re-estimation is a well-characterized adaptation provided the rule, timing, and caps are pre-specified and Type I error is verified by simulation.
  • FDA Accelerated Approval. Single-arm ORR trials supporting accelerated approval must enroll a pre-specified population, use a locked analysis plan, and demonstrate a meaningful effect over historical control.
  • Project Optimus (2023). FDA oncology dose-optimization initiative emphasizes adequate sample sizes for dose selection and characterization of tolerability in Phase II, which SSR directly supports by expanding cohorts under promising interim trends.
  • Pre-specification requirements. The SAP must fix $p_0$, $p_1$, $\alpha$, and power, the interim timing $n_1$, the prior (if Bayesian), the thresholds $(\gamma, \delta)$ or CP zones, the cap $n_{\max}$, and include simulation-based OC evidence.
  • SAP text generation. The Zetyra report exports an SAP-ready decision rule description plus the OC table and sensitivity scenarios directly suitable for inclusion in a protocol and SAP submission.

8. Assumptions & Limitations

  • Historical control stability. The entire design rests on $p_0$ being a stable, well-characterized historical rate. Drift in $p_0$ (e.g., supportive-care improvements, population shifts, selection bias in the historical source) inflates Type I error without detection.
  • Binary endpoint only. The v1 engine supports binary (response/no response) endpoints. Continuous and time-to-event single-arm designs are not implemented.
  • Historical control misspecification. Even modest (2–5 pp) drift in $p_0$ can materially shift achieved Type I error. Sensitivity scenarios in the report show how the recalculated N and CP change under plausible alternative $p_0$.
  • Not for confirmatory Phase III. Single-arm designs are exploratory; efficacy claims for full approval require randomized confirmatory evidence except in narrow accelerated-approval settings.
  • One interim look. The v1 engine supports a single interim analysis. Multi-look GSD-style boundaries for single-arm trials should use the group-sequential calculator instead.

9. API Reference

Endpoint: POST /api/v1/calculators/ssr-single-arm

Request parameters

| Field | Type | Default | Description |
|---|---|---|---|
| ssr_method | string | | "bayesian" or "conditional_power" |
| p0 | float | | Null/historical response rate (0, 1) |
| p1 | float | | Target alternative rate, p1 > p0 |
| alpha | float | 0.025 | One-sided Type I error |
| power | float | 0.80 | Target power at p1 |
| interim_fraction | float | 0.5 | Fraction of planned N at interim look |
| interim_n | int? | null | Absolute interim N (overrides fraction) |
| n_max_factor | float | 1.5 | Cap as multiple of initial N (must be >1, ≤5) |
| n_max_absolute | int? | null | Absolute N cap (overrides n_max_factor); must be ≥10 |
| prior_alpha | float | 0.5 | Beta prior α (Bayesian mode) |
| prior_beta | float | 0.5 | Beta prior β (Bayesian mode) |
| gamma_efficacy | float | 0.95 | Interim early-stop threshold. Posterior P(p>p0 \| data) ≥ this triggers efficacy stop at the interim look. Calibrate via simulation. |
| gamma_final | float? | 1−α | Final-analysis success threshold. The eventual posterior must clear this for the trial to be a positive result. Predictive probability is computed under this threshold. Default is 1−α (e.g., 0.975 for α=0.025), which keeps simulated power near the design target. |
| delta_futility | float | 0.05 | Predictive probability threshold for futility |
| pp_promising_upper | float | 0.50 | Predictive-probability upper bound for the SSR promising zone (Bayesian mode). Trials with delta_futility < PP < this extend N up to N_max; PP ≥ this continues at the originally planned N. Must be greater than delta_futility. Raise to 0.70–0.80 to keep more trials in the SSR zone and push N_p90 toward the N_max budget. |
| cp_futility | float | 0.10 | CP lower bound for futility (CP mode) |
| cp_promising_lower | float | 0.30 | CP lower bound for promising zone |
| cp_promising_upper | float | 0.80 | CP upper bound for promising zone |
| simulate | bool | false | Run Monte Carlo OC validation |
| simulation_seed | int? | null | Random seed for reproducibility (auto-generated if null) |
| n_simulations | int | 10000 | Simulation replicates (1,000–100,000) |

Example Request

{
  "ssr_method": "bayesian",
  "p0": 0.20,
  "p1": 0.40,
  "alpha": 0.025,
  "power": 0.80,
  "interim_fraction": 0.5,
  "n_max_factor": 1.5,
  "prior_alpha": 0.5,
  "prior_beta": 0.5,
  "gamma_efficacy": 0.95,
  "gamma_final": null,
  "delta_futility": 0.05,
  "pp_promising_upper": 0.50,
  "simulate": true,
  "simulation_seed": 42,
  "n_simulations": 10000
}

gamma_final: null defaults to 1 - alpha (e.g., 0.975 for alpha 0.025). Raise pp_promising_upper toward 0.70 to keep more trials in the SSR promising zone.
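Before sending a request, it can help to check the payload against the constraints listed in the parameter table. The sketch below is a hypothetical client-side helper, not part of the Zetyra API: it encodes only the constraints stated in this reference and makes no network call, since the service's base URL and authentication scheme are not covered here.

```python
import json


def validate_request(req):
    """Return a list of violations of the documented parameter constraints."""
    errors = []
    if req.get("ssr_method") not in ("bayesian", "conditional_power"):
        errors.append('ssr_method must be "bayesian" or "conditional_power"')
    p0, p1 = req.get("p0"), req.get("p1")
    if not isinstance(p0, (int, float)) or not 0.0 < p0 < 1.0:
        errors.append("p0 must lie strictly between 0 and 1")
    elif not isinstance(p1, (int, float)) or not p1 > p0:
        errors.append("p1 must be greater than p0")
    if not 1.0 < req.get("n_max_factor", 1.5) <= 5.0:
        errors.append("n_max_factor must be > 1 and <= 5")
    cap = req.get("n_max_absolute")
    if cap is not None and cap < 10:
        errors.append("n_max_absolute must be >= 10")
    if not req.get("pp_promising_upper", 0.50) > req.get("delta_futility", 0.05):
        errors.append("pp_promising_upper must exceed delta_futility")
    if not 1000 <= req.get("n_simulations", 10000) <= 100000:
        errors.append("n_simulations must be between 1,000 and 100,000")
    return errors


payload = {
    "ssr_method": "bayesian",
    "p0": 0.20,
    "p1": 0.40,
    "alpha": 0.025,
    "power": 0.80,
    "n_max_factor": 1.5,
    "gamma_final": None,  # serialized as null, so the server default applies
    "delta_futility": 0.05,
    "pp_promising_upper": 0.50,
    "simulate": True,
    "simulation_seed": 42,
    "n_simulations": 10000,
}
print(validate_request(payload))       # [] when every check passes
print(json.dumps(payload, indent=2))   # body for POST /api/v1/calculators/ssr-single-arm
```

Defaults used when a field is absent are the ones listed in the parameter table. Server-side validation may enforce additional rules that are not documented here.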

Response Schema (abridged)

{
  "calculation_id": "...",
  "tier": "analytical+simulation",
  "analytical_results": {
    "initial_n": 36,
    "interim_n": 18,
    "interim_fraction": 0.5,
    "ssr_method": "bayesian",
    "posterior_probability": 0.97,
    "predictive_probability": 0.81,
    "conditional_power": 0.82,
    "conditional_power_planned": 0.82,
    "zone": "",
    "z1": 1.96,
    "efficacy_stop": true,
    "futility_stop": false,
    "recalculated_n": 18,
    "inflation_factor": 0.5,
    "n_capped": false,
    "n_max_used": 54,
    "gamma_final_used": 0.975,
    "prior_description": "Jeffreys Beta(0.5, 0.5)",
    "decision_rule_description": "...",
    "recalculation_scenarios": [
      {
        "label": "Planned effect",
        "assumed_nuisance": 0.40,
        "recalculated_n_per_arm": 36,
        "recalculated_n_total": 36,
        "inflation_factor": 1.0,
        "conditional_power": 0.82,
        "decision": "continue_favorable"
      }
    ],
    "regulatory_notes": [...]
  },
  "metadata": {...},
  "simulation": {...},
  "warnings": [],
  "regulatory_citations": [...]
}

decision enum values: stop_efficacy, stop_futility, continue_ssr, continue_favorable, continue_unfavorable. Five sensitivity rows are returned by default (50%, 75%, 100%, 125%, 150% of planned effect).

10. References

  1. Simon R. Optimal two-stage designs for Phase II clinical trials. Controlled Clinical Trials. 1989;10(1):1-10.
  2. Thall PF, Simon R. Practical Bayesian guidelines for Phase IIB clinical trials. Biometrics. 1994;50(2):337-349.
  3. Lee JJ, Liu DD. A predictive probability design for Phase II cancer clinical trials. Clinical Trials. 2008;5(2):93-106.
  4. Mehta CR, Pocock SJ. Adaptive increase in sample size when interim results are promising: A practical guide with examples. Statistics in Medicine. 2011;30(28):3267-3284.
  5. Chen DT, Schell MJ, Fulp WJ, et al. Application of Bayesian predictive probability for interim futility analysis in single-arm phase II trial. Translational Cancer Research. 2019;8(Suppl 4):S404-S420.
  6. U.S. Food and Drug Administration. Adaptive Designs for Clinical Trials of Drugs and Biologics: Guidance for Industry. November 2019.
  7. U.S. Food and Drug Administration. Project Optimus: Optimizing the Dosage of Human Prescription Drugs and Biological Products for the Treatment of Oncologic Diseases. 2023.
  8. Qian L. Conditional Power Promising Zone Sample Size Re-estimation Inflates Type I Error in Single-Arm Binary Trials: An Exact-Enumeration Study and Comparison with Bayesian Predictive Probability SSR. Zetyra | Evidence in the Wild; April 2026 (under peer review). github.com/evidenceinthewild/CP-SSR-Binary-Trials

Last updated: May 2026
