Single-Arm Sample Size Re-estimation (SSR)
Technical documentation for adaptive single-arm Phase II designs comparing a binary response rate against a fixed historical control. Covers the Bayesian posterior/predictive framework, Mehta–Pocock promising-zone conditional power, prior specification, operating characteristics, and FDA regulatory considerations (2019 Adaptive Designs guidance, Project Optimus 2023).
1. Overview & Motivation
Single-arm Phase II trials are the dominant design in early oncology drug development. Enrolling all patients onto the experimental arm accelerates evidence generation when a randomized comparison is ethically or practically infeasible, and the observed objective response rate (ORR) is compared against a historical control rate drawn from prior trials of standard-of-care.
Single-arm designs can support FDA accelerated approval in selected settings, especially oncology, when the endpoint is reasonably likely to predict clinical benefit and confirmatory evidence is planned or required. Within those settings, single-arm trials powered against a well-characterized historical control rate are a common design choice; the pathway is not a default for all single-arm trials.
Why adaptive SSR? The initial sample size depends on a minimally clinically important alternative $p_1$ that sponsors often specify with considerable uncertainty. If interim data suggest the true effect is smaller than $p_1$ but still clinically meaningful, a modest expansion can preserve power. Conversely, a very large observed effect supports an efficacy interim stop, and a very small effect supports futility termination; both spare patients and resources.
When adaptive SSR helps: (a) the clinically meaningful effect size is uncertain, (b) operational flexibility is valued (stop early for efficacy or futility), (c) the historical control rate is well-established, and (d) the trial is exploratory (Phase II), not confirmatory.
2. Design Framework
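Let $X_1, \dots, X_n$ be i.i.d. Bernoulli responses with unknown true rate $p$. We test:
$$H_0: p \le p_0 \quad \text{versus} \quad H_1: p \ge p_1$$
where $p_0$ is the historical control rate (null) and $p_1 > p_0$ is the target alternative. Under a normal approximation to the one-sample binomial, the required fixed-design sample size is:
$$n = \left\lceil \left( \frac{z_{1-\alpha}\sqrt{p_0(1-p_0)} + z_{1-\beta}\sqrt{p_1(1-p_1)}}{p_1 - p_0} \right)^2 \right\rceil$$
with $z_{1-\alpha} = \Phi^{-1}(1-\alpha)$ and $z_{1-\beta} = \Phi^{-1}(1-\beta)$, where $1-\beta$ is the target power. The standard test at the final analysis rejects $H_0$ if the observed response rate $\hat{p}$ exceeds a critical value derived from the binomial (or its normal approximation).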
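As a quick check of the fixed-design formula, the following minimal sketch evaluates the normal-approximation sample size using only the Python standard library. The function name `fixed_design_n` is illustrative and not part of the Zetyra API; the inputs are the values from the example request later in this document.

```python
import math
from statistics import NormalDist


def fixed_design_n(p0: float, p1: float, alpha: float = 0.025, power: float = 0.80) -> int:
    """One-sample binomial sample size under the normal approximation."""
    if not (0.0 < p0 < p1 < 1.0):
        raise ValueError("require 0 < p0 < p1 < 1")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)  # one-sided critical value
    z_beta = NormalDist().inv_cdf(power)         # quantile for the target power
    numerator = (z_alpha * math.sqrt(p0 * (1.0 - p0))
                 + z_beta * math.sqrt(p1 * (1.0 - p1)))
    return math.ceil((numerator / (p1 - p0)) ** 2)


# p0 = 0.20, p1 = 0.40, one-sided alpha = 0.025, power = 0.80
print(fixed_design_n(0.20, 0.40))  # 36, matching initial_n in the example response
```

Rounding up with `ceil` is a convention assumed here; an engine that then refines N against the exact binomial could return a slightly different value.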
At the interim look with $n_1$ patients enrolled and $x_1$ responses, the design chooses among (i) early efficacy stop, (ii) early futility stop, or (iii) continuation, optionally with a re-estimated target sample size bounded by a pre-specified cap $N_{\max}$.
3. Bayesian Mode
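The Bayesian mode uses a conjugate Beta–Binomial framework. With prior $p \sim \text{Beta}(a, b)$ and interim data of $x_1$ responses in $n_1$ patients, the posterior is:
$$p \mid x_1 \sim \text{Beta}(a + x_1,\; b + n_1 - x_1)$$
Posterior efficacy stopping. Stop early for efficacy at the interim if the posterior probability that $p$ exceeds the null rate $p_0$ clears the interim bar:
$$\Pr(p > p_0 \mid x_1, n_1) \ge \gamma_{\text{efficacy}}$$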
Two thresholds, not one. The design uses two distinct posterior-probability bars: gamma_efficacy at the interim (typically high, e.g., 0.97–0.99) and gamma_final at the final analysis (defaults to $1-\alpha$, e.g., 0.975 for $\alpha = 0.025$). The interim bar is the stop-early gate; the final bar is the success criterion. Conflating the two depresses simulated power because predictive probability then projects to an inflated final bar.
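The two posterior bars can be illustrated with a small standard-library sketch. It assumes a Beta(a, b) prior with conjugate updating and evaluates the Beta tail probability with a textbook continued-fraction routine for the regularized incomplete beta function. All function names are illustrative, not Zetyra's implementation, and the interim data (18 patients) are made up for the example.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < eps:
            break
    return h


def betainc(a, b, x):
    """Regularized incomplete beta I_x(a, b), i.e. the Beta(a, b) CDF at x."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def posterior_prob_exceeds(p0, x, n, a=0.5, b=0.5):
    """P(p > p0 | x responses in n patients) under a Beta(a, b) prior."""
    return 1.0 - betainc(a + x, b + n - x, p0)


def interim_efficacy_stop(p0, x1, n1, gamma_efficacy=0.95, a=0.5, b=0.5):
    """Interim gate: stop early only if the posterior clears gamma_efficacy."""
    return posterior_prob_exceeds(p0, x1, n1, a, b) >= gamma_efficacy


def final_success(p0, x, n, gamma_final=0.975, a=0.5, b=0.5):
    """Final criterion: a separate, typically different, bar gamma_final."""
    return posterior_prob_exceeds(p0, x, n, a, b) >= gamma_final


# Hypothetical interim look: n1 = 18 patients, p0 = 0.20, Jeffreys prior.
for x1 in range(4, 11):
    prob = posterior_prob_exceeds(0.20, x1, 18)
    print(x1, round(prob, 4), interim_efficacy_stop(0.20, x1, 18, gamma_efficacy=0.99))
```

Keeping `interim_efficacy_stop` and `final_success` as separate functions with separate thresholds mirrors the decoupling described above.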
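Predictive futility stopping. Compute the Bayesian predictive probability of success (PPoS) that the trial will clear gamma_final at the final analysis given current data. With $x_1$ responses in $n_1$ interim patients, $m = N - n_1$ patients still to enroll, and $Y$ future responses following the posterior predictive Beta–Binomial distribution:
$$\text{PPoS} = \sum_{y=0}^{m} \Pr(Y = y \mid x_1, n_1)\; \mathbf{1}\{\Pr(p > p_0 \mid x_1 + y, N) \ge \gamma_{\text{final}}\}$$
Stop for futility if $\text{PPoS} < \delta_{\text{futility}}$ (typically around 0.05). Otherwise, continue, optionally recalculating the final $N$ up to $N_{\max}$.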
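The predictive probability can be computed by summing Beta-Binomial weights over every possible number of future responders. The sketch below is one way to do this with the standard library, assuming conjugate Beta updating. The function names and the interim data in the usage lines are illustrative assumptions, not the engine's code.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < eps:
            break
    return h


def betainc(a, b, x):
    """Regularized incomplete beta I_x(a, b), i.e. the Beta(a, b) CDF at x."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def _lbeta(u, v):
    return math.lgamma(u) + math.lgamma(v) - math.lgamma(u + v)


def beta_binomial_pmf(y, m, a, b):
    """P(Y = y) for Y ~ Beta-Binomial(m, a, b)."""
    log_choose = math.lgamma(m + 1) - math.lgamma(y + 1) - math.lgamma(m - y + 1)
    return math.exp(log_choose + _lbeta(a + y, b + m - y) - _lbeta(a, b))


def predictive_prob_success(x1, n1, n_final, p0, gamma_final=0.975, a=0.5, b=0.5):
    """PPoS: predictive probability that the final posterior clears gamma_final."""
    m = n_final - n1
    post_a, post_b = a + x1, b + n1 - x1  # interim posterior parameters
    ppos = 0.0
    for y in range(m + 1):
        # Posterior P(p > p0) if y of the remaining m patients respond.
        final_post = 1.0 - betainc(post_a + y, post_b + m - y, p0)
        if final_post >= gamma_final:
            ppos += beta_binomial_pmf(y, m, post_a, post_b)
    return ppos


# Hypothetical interim: 18 of a planned 36 patients, p0 = 0.20, Jeffreys prior.
for x1 in (2, 4, 6, 8):
    ppos = predictive_prob_success(x1, 18, 36, 0.20)
    print(x1, round(ppos, 4), "futility stop" if ppos < 0.05 else "continue")
```

Because the final posterior probability increases with the number of future responders, the indicator switches on at a single cutoff count; the plain loop is kept here for readability.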
Threshold calibration. Neither gamma_efficacy nor gamma_final is analytically tied to frequentist Type I error; verify by Monte Carlo at $p = p_0$. If Type I error is inflated, raise gamma_efficacy first (interim early stops are counted as rejections); raising gamma_final also helps but costs power. If power is below target, lower gamma_final toward $1-\alpha$ or raise the interim/final N. Zetyra's engine reports both rates in the OC table.
4. Conditional Power Mode
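The conditional power (CP) mode adapts the Mehta–Pocock (2011) promising zone framework from two-arm to single-arm designs. Given the interim statistic $z_1$ computed under the one-sample binomial from $x_1$ responses in $n_1$ patients:
$$z_1 = \frac{\hat{p}_1 - p_0}{\sqrt{p_0(1-p_0)/n_1}}, \qquad \hat{p}_1 = \frac{x_1}{n_1}$$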
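the conditional power under the observed current trend (or under the target alternative, per SAP) is:
$$\text{CP} = 1 - \Phi\!\left( \frac{z_{1-\alpha}\sqrt{N} - z_1\sqrt{n_1}}{\sqrt{N - n_1}} - \theta\sqrt{N - n_1} \right)$$
where $z_1$ is the interim z-statistic, $n_1$ the interim sample size, $N$ the planned final sample size, and $\theta$ the assumed per-patient standardized effect: $\theta = z_1/\sqrt{n_1}$ under the current trend, or $\theta = (p_1 - p_0)/\sqrt{p_0(1-p_0)}$ under the target alternative.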
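A minimal sketch of the conditional power calculation follows. It assumes the standard Mehta–Pocock form adapted to a one-sample z-statistic with null-variance standardization; the engine's exact formula may differ. The `drift` argument and the function names are illustrative.

```python
import math
from statistics import NormalDist

_NORM = NormalDist()


def interim_z(x1: int, n1: int, p0: float) -> float:
    """One-sample z-statistic at the interim, standardized under the null."""
    return (x1 / n1 - p0) / math.sqrt(p0 * (1.0 - p0) / n1)


def conditional_power(z1, n1, n_final, alpha=0.025, drift=None):
    """Probability of rejecting at n_final given the interim z1.

    drift is the assumed per-patient standardized effect. None means
    'current trend' (z1 / sqrt(n1)); pass (p1 - p0) / sqrt(p0 * (1 - p0))
    to evaluate under the target alternative instead.
    """
    n2 = n_final - n1
    if n2 <= 0:
        raise ValueError("n_final must exceed n1")
    theta = z1 / math.sqrt(n1) if drift is None else drift
    z_crit = _NORM.inv_cdf(1.0 - alpha)
    arg = ((z_crit * math.sqrt(n_final) - z1 * math.sqrt(n1)) / math.sqrt(n2)
           - theta * math.sqrt(n2))
    return 1.0 - _NORM.cdf(arg)


# Hypothetical interim: 7 responses in 18 patients, p0 = 0.20, planned N = 36.
z1 = interim_z(7, 18, 0.20)
print(round(z1, 3), round(conditional_power(z1, 18, 36), 3))
# Same interim data, but projecting under the target alternative p1 = 0.40.
theta_alt = (0.40 - 0.20) / math.sqrt(0.20 * 0.80)
print(round(conditional_power(z1, 18, 36, drift=theta_alt), 3))
```

Under the current trend, CP equals exactly 0.5 when $z_1 = z_{1-\alpha}\sqrt{n_1/N}$, which is a convenient sanity check for any implementation.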
Zones are defined by CP thresholds:
- Favorable (CP > promising upper): large effect; no re-estimation needed (or consider efficacy stop).
- Promising (promising lower ≤ CP ≤ promising upper): re-estimate $N$ to restore planned CP, capped at $N_{\max}$.
- Unfavorable (futility ≤ CP < promising lower): continue with planned sample size; do not inflate.
- Futility (CP < futility threshold): consider stopping for futility.
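The zone boundaries above translate directly into a small classifier. In this sketch the default thresholds are the API defaults for cp_futility, cp_promising_lower, and cp_promising_upper; the returned zone labels are illustrative strings, not necessarily the values the engine emits.

```python
def classify_zone(cp: float,
                  cp_futility: float = 0.10,
                  cp_promising_lower: float = 0.30,
                  cp_promising_upper: float = 0.80) -> str:
    """Map a conditional power value to one of the four zones."""
    if not (0.0 <= cp_futility < cp_promising_lower < cp_promising_upper <= 1.0):
        raise ValueError("thresholds must satisfy futility < lower < upper")
    if not (0.0 <= cp <= 1.0):
        raise ValueError("cp must lie in [0, 1]")
    if cp < cp_futility:
        return "futility"      # consider stopping
    if cp < cp_promising_lower:
        return "unfavorable"   # continue at planned N, do not inflate
    if cp <= cp_promising_upper:
        return "promising"     # re-estimate N, capped at N_max
    return "favorable"         # no re-estimation needed


for cp in (0.05, 0.20, 0.55, 0.90):
    print(cp, classify_zone(cp))
```

Both promising-zone bounds are inclusive and the futility bound is exclusive, following the inequalities in the list.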
The original Mehta–Pocock theorem (Chen, DeMets, Lan 2004; Gao, Ware, Mehta 2008) preserves Type I error in the two-arm normal/z-test setting when re-estimation is confined to the promising zone. For single-arm binomial designs this guarantee does not transfer analytically — the discrete sample space and exact-binomial final test mean Type I error must be confirmed via simulation (Tier 2 OC table) before fixing cp_promising_lower / cp_promising_upper for the protocol.
Stronger result: in our base setting, no cp_promising_lower on the grid achieves T1E ≤ α
Exact enumeration of the joint binomial distribution (Qian 2026) for the representative oncology setting shows that exact Type I error sits uniformly above α=0.05 across every cp_promising_lower value tested (5.7–6.3% over the grid 0.30 → 0.80). The CP design in this setting is uncalibratable: no fixed cp_promising_lower choice attains nominal control.
Across five different configurations, the discrete z-test critical-count rounding bias changes sign: in some settings T1E sits above α everywhere, in others some cp_promising_lower values control T1E and others do not, with no monotone or otherwise predictable rule. A cp_promising_lower that controls T1E in one design configuration may not control it in another; calibration cannot be extrapolated across designs.
Why: discrete z-test bias plus SSR-driven boundary crossings
Because interim outcomes follow a discrete distribution, the one-sided z-test critical count $c_N$ is the smallest integer with $(c_N - N p_0)/\sqrt{N p_0 (1 - p_0)} \ge z_{1-\alpha}$, so the actual fixed-N rejection probability under $H_0$ differs from nominal α by a signed amount whose sign depends on $N$. SSR moves trials between finite-N tests with their own opposite-sign discreteness biases in a way that depends on cp_promising_lower and the realised interim count. The resulting T1E behaviour is design-specific and cannot be controlled by a fixed-CP_L rule.
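The discreteness effect for a fixed-N z-test can be inspected directly by computing the critical count and the exact binomial tail for several sample sizes. This sketch covers only the fixed-N component (no interim look, no SSR); it is not the full joint enumeration from Qian 2026. The function names and the range of N are illustrative.

```python
import math
from statistics import NormalDist


def critical_count(n: int, p0: float, alpha: float = 0.025) -> int:
    """Smallest response count whose z-statistic reaches z_{1-alpha}."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    return math.ceil(n * p0 + z * math.sqrt(n * p0 * (1.0 - p0)))


def binom_tail(n: int, c: int, p: float) -> float:
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1.0 - p)**(n - k)
               for k in range(max(c, 0), n + 1))


def exact_type1(n: int, p0: float, alpha: float = 0.025) -> float:
    """Exact rejection probability of the fixed-N z-test when p = p0."""
    return binom_tail(n, critical_count(n, p0, alpha), p0)


alpha, p0 = 0.025, 0.20
for n in range(30, 56, 4):
    t1e = exact_type1(n, p0, alpha)
    side = "above" if t1e > alpha else "at or below"
    print(n, critical_count(n, p0, alpha), round(t1e, 4), side, "nominal")
```

Printing the exact rate next to the nominal α for each N shows how the critical-count rounding shifts from one sample size to the next, which is the mechanism the paragraph above describes.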
Practitioners who must use this design should report exact T1E (or simulated T1E with adequate precision) across their full cp_promising_lower grid for their specific design parameters, treat the worst-case T1E across plausible perturbations as the operating characteristic, and not extrapolate calibration from one design to another.
Recommendation: switch to Bayesian PP SSR (calibratable, monotone)
The Bayesian predictive-probability mode of this calculator decouples the interim early-stop bar (gamma_efficacy) from the final-analysis bar (gamma_final). Exact T1E is monotone in gamma_final and the calibration surface is far less irregular than the CP_L surface, so the design can be calibrated to any target T1E by enumeration: pick the smallest gamma_final that meets the budget. In our base setting, calibrated gamma_final = 0.955 yields exact T1E = 4.7% with 82.7% power at p₁ — competitive with the best-calibrated CP design (when one exists) and dominant when CP cannot be calibrated.
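The "pick the smallest gamma_final that meets the budget" idea can be sketched on a deliberately simplified design: a single final analysis with no interim look and no SSR. The numbers this produces will therefore not match the calibrated values quoted above, which refer to the full two-stage design. Function names and the threshold grid are illustrative assumptions.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < eps:
            break
    return h


def betainc(a, b, x):
    """Regularized incomplete beta I_x(a, b), i.e. the Beta(a, b) CDF at x."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def exact_type1_single_stage(n, p0, gamma_final, a=0.5, b=0.5):
    """Exact P(reject | p = p0) for a single-look posterior-threshold test."""
    total = 0.0
    for x in range(n + 1):
        if 1.0 - betainc(a + x, b + n - x, p0) >= gamma_final:
            total += math.comb(n, x) * p0**x * (1.0 - p0)**(n - x)
    return total


def calibrate_gamma_final(n, p0, alpha, a=0.5, b=0.5):
    """Smallest grid value of gamma_final with exact T1E <= alpha, or None."""
    for i in range(100):
        gamma = round(0.900 + 0.001 * i, 3)  # grid 0.900, 0.901, ..., 0.999
        t1e = exact_type1_single_stage(n, p0, gamma, a, b)
        if t1e <= alpha:
            return gamma, t1e
    return None


print(calibrate_gamma_final(36, 0.20, 0.025))
```

Since the rejection region can only shrink as gamma_final rises, the exact Type I error is non-increasing in gamma_final, so a linear scan from the bottom of the grid is enough to find the smallest qualifying value.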
A partial-symmetric comparison (Qian 2026, §4.6) confirms that switching the CP design's final analysis from a z-test to a Bayesian posterior threshold — while keeping the same CP-zone SSR logic — is sufficient to restore calibration. The discrete z-test final is therefore a sufficient cause of the inflation. Practitioners who want to keep familiar CP-zone SSR machinery can recover calibrated T1E by replacing the z-test final with a Bayesian posterior threshold; or adopt Bayesian PP SSR end-to-end (the cleaner path).
In-product calibration helper: the Single-Arm SSR calculator now ships a Calibrate to α button next to the gamma_final field (Bayesian mode). It bisects gamma_final against simulated T1E at p_true = p₀ and fills in the calibrated value, surfacing both achieved T1E and power. If the design is uncalibratable on this axis (T1E cannot be brought to α even at gamma_final ≈ 0.9999), the helper flags non-convergence so you can revisit gamma_efficacy or the prior rather than ship an inflated design.
5. Prior Specification
The choice of prior materially affects interim decisions, particularly when $n_1$ is small. Zetyra offers three presets:
- Jeffreys Beta(0.5, 0.5) (default). The Jeffreys prior is the invariant reference prior for a Bernoulli parameter, derived from the square root of the Fisher information. It is objective in the sense that it is invariant under reparameterization and has prior effective sample size (ESS) of 1.
- Flat Beta(1, 1). The uniform prior on $[0, 1]$. Often preferred by sponsors for its intuitive interpretation; ESS of 2. Slightly more informative than Jeffreys in the tails.
- Custom informative priors. Derived from prior trials via the MAP prior / bayesian-borrowing workflow or elicited from experts via prior elicitation. Use with caution: regulators scrutinize informative priors that favor efficacy claims.
Prior ESS consideration. For a Beta(a, b) prior, prior ESS = $a + b$. If ESS approaches $n_1$, the posterior is heavily influenced by the prior. Report prior ESS and run sensitivity analyses (Jeffreys vs. flat vs. custom) before finalizing thresholds.
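The influence of the prior at the interim can be quantified with a few lines of arithmetic: the posterior mean is a weighted average of the prior mean and the observed rate, with prior weight ESS / (ESS + n1). The sketch below uses illustrative function names; the Beta(4, 6) informative prior and the interim data are made-up examples.

```python
def prior_ess(a: float, b: float) -> float:
    """Effective sample size of a Beta(a, b) prior."""
    return a + b


def prior_weight(a: float, b: float, n1: int) -> float:
    """Share of the posterior mean contributed by the prior at the interim."""
    return (a + b) / (a + b + n1)


def posterior_mean(a: float, b: float, x1: int, n1: int) -> float:
    """Posterior mean of p after x1 responses in n1 patients."""
    return (a + x1) / (a + b + n1)


n1, x1 = 18, 6
for label, a, b in (("Jeffreys", 0.5, 0.5), ("Flat", 1.0, 1.0), ("Informative", 4.0, 6.0)):
    print(label, prior_ess(a, b),
          round(prior_weight(a, b, n1), 3),
          round(posterior_mean(a, b, x1, n1), 3))
```

With 18 interim patients the Jeffreys prior carries a weight of 1/19, while the hypothetical Beta(4, 6) prior carries 10/28, which is the kind of contrast a sensitivity analysis should report.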
6. Operating Characteristics
For both modes, simulated operating characteristics are mandatory before fixing thresholds for the protocol. Bayesian stopping rules are not analytically tied to frequentist Type I error, and the two-arm Mehta–Pocock promising-zone theorem does not transfer analytically to single-arm binomial CP designs (FDA Adaptive Designs Guidance 2019, Section V).
Zetyra's OC table reports, for a grid of true rates $p$:
- Type I error at $p = p_0$: must be $\le \alpha$. If inflated in Bayesian mode, the in-product Calibrate to α button bisects gamma_final via simulation; or raise gamma_efficacy (typically toward 0.97–0.99) and re-simulate. If inflated in CP mode, do not assume tighter promising-zone bounds will help: T1E behaviour is design-specific and in many configurations sits uniformly above α across every cp_promising_lower value (Qian 2026, exact-enumeration result; see the Section 4 callouts). The recommended path is to switch to Bayesian PP mode, which is calibratable on the continuous gamma_final axis.
- Simulated power at $p = p_1$: should match the planned power target.
- Expected sample size $E[N]$: shows the adaptive design's efficiency gain over fixed-N under each true rate, together with quantiles of $N$ (such as N_p90).
- Stopping probabilities: Pr(efficacy stop), Pr(futility stop), Pr(N hits cap) at each true rate.
Interpret the table jointly: a design with 5% Type I error, 82% power at $p_1$, and $E[N]$ substantially below the fixed-N is well-tuned. An 8% Type I error means the thresholds are too liberal.
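When reading simulated rates from the OC table, it helps to know how much Monte Carlo noise a given number of replicates leaves. The following sketch uses the usual binomial standard error for a simulated proportion; the function names are illustrative and the normal-approximation interval is an assumption, not a description of how the engine reports precision.

```python
import math
from statistics import NormalDist


def mc_standard_error(rate: float, n_simulations: int) -> float:
    """Standard error of a proportion estimated from n_simulations replicates."""
    return math.sqrt(rate * (1.0 - rate) / n_simulations)


def mc_interval(rate_hat: float, n_simulations: int, level: float = 0.95):
    """Normal-approximation interval around a simulated rate."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half = z * mc_standard_error(rate_hat, n_simulations)
    return max(0.0, rate_hat - half), min(1.0, rate_hat + half)


def sims_needed(rate: float, half_width: float, level: float = 0.95) -> int:
    """Replicates needed so the interval half-width is at most half_width."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.ceil(z * z * rate * (1.0 - rate) / half_width**2)


# A simulated Type I error near 0.025 with the default 10,000 replicates.
print(round(mc_standard_error(0.025, 10_000), 5))
print(tuple(round(v, 4) for v in mc_interval(0.025, 10_000)))
# Replicates needed to pin a rate near 0.025 to within +/- 0.002.
print(sims_needed(0.025, 0.002))
```

At the default n_simulations of 10,000, a simulated Type I error of about 0.025 has a standard error of roughly 0.0016, so differences of a few tenths of a percentage point between designs may be simulation noise.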
7. Regulatory Considerations
- FDA Adaptive Designs Guidance (2019), Section IV.B. Sample size re-estimation is a well-characterized adaptation provided the rule, timing, and caps are pre-specified and Type I error is verified by simulation.
- FDA Accelerated Approval. Single-arm ORR trials supporting accelerated approval must enroll a pre-specified population, use a locked analysis plan, and demonstrate a meaningful effect over historical control.
- Project Optimus (2023). FDA oncology dose-optimization initiative emphasizes adequate sample sizes for dose selection and characterization of tolerability in Phase II, which SSR directly supports by expanding cohorts under promising interim trends.
- Pre-specification requirements. The SAP must fix the design parameters ($p_0$, $p_1$, $\alpha$, target power), the interim timing $n_1$, the prior (if Bayesian), the thresholds or CP zones, the cap $N_{\max}$, and include simulation-based OC evidence.
- SAP text generation. The Zetyra report exports an SAP-ready decision rule description plus the OC table and sensitivity scenarios directly suitable for inclusion in a protocol and SAP submission.
8. Assumptions & Limitations
- Historical control stability. The entire design rests on $p_0$ being a stable, well-characterized historical rate. Drift in $p_0$ (e.g., supportive-care improvements, population shifts, selection bias in the historical source) inflates Type I error without detection.
- Binary endpoint only. The v1 engine supports binary (response/no response) endpoints. Continuous and time-to-event single-arm designs are not implemented.
- Historical control misspecification. Even modest (2–5 pp) drift in $p_0$ can materially shift achieved Type I error. Sensitivity scenarios in the report show how the recalculated N and CP change under plausible alternative values of $p_0$.
- Not for confirmatory Phase III. Single-arm designs are exploratory; efficacy claims for full approval require randomized confirmatory evidence except in narrow accelerated-approval settings.
- One interim look. The v1 engine supports a single interim analysis. Multi-look GSD-style boundaries for single-arm trials should use the group-sequential calculator instead.
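The sensitivity of Type I error to historical-control drift can be illustrated on a fixed-N z-test: the rejection boundary is set using the assumed $p_0$, but patients respond at the drifted rate even when the drug adds nothing. This sketch ignores the interim look and SSR; the function names are illustrative, and N = 36 with $p_0$ = 0.20 is taken from the example request.

```python
import math
from statistics import NormalDist


def critical_count(n: int, p0: float, alpha: float = 0.025) -> int:
    """Smallest response count whose z-statistic (built on p0) reaches z_{1-alpha}."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    return math.ceil(n * p0 + z * math.sqrt(n * p0 * (1.0 - p0)))


def rejection_prob(n: int, c: int, p_true: float) -> float:
    """P(X >= c) when responses truly occur at rate p_true."""
    return sum(math.comb(n, k) * p_true**k * (1.0 - p_true)**(n - k)
               for k in range(c, n + 1))


n, assumed_p0 = 36, 0.20
c = critical_count(n, assumed_p0)  # boundary is fixed by the assumed p0
for drift_pp in (0, 2, 5):
    true_null_rate = assumed_p0 + drift_pp / 100.0
    print(f"drift {drift_pp} pp: false-positive rate "
          f"{rejection_prob(n, c, true_null_rate):.4f}")
```

The rejection probability rises monotonically with the true null rate, so any upward drift in the real standard-of-care response rate translates directly into undetected Type I error inflation.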
9. API Reference
Endpoint: POST /api/v1/calculators/ssr-single-arm
Request parameters
| Field | Type | Default | Description |
|---|---|---|---|
| ssr_method | string | — | "bayesian" or "conditional_power" |
| p0 | float | — | Null/historical response rate (0, 1) |
| p1 | float | — | Target alternative rate, p1 > p0 |
| alpha | float | 0.025 | One-sided Type I error |
| power | float | 0.80 | Target power at p1 |
| interim_fraction | float | 0.5 | Fraction of planned N at interim look |
| interim_n | int? | null | Absolute interim N (overrides fraction) |
| n_max_factor | float | 1.5 | Cap as multiple of initial N (must be >1, ≤5) |
| n_max_absolute | int? | null | Absolute N cap (overrides n_max_factor); must be ≥10 |
| prior_alpha | float | 0.5 | Beta prior α (Bayesian mode) |
| prior_beta | float | 0.5 | Beta prior β (Bayesian mode) |
| gamma_efficacy | float | 0.95 | Interim early-stop threshold. Posterior P(p > p0 \| data) ≥ this triggers efficacy stop at the interim look. Calibrate via simulation. |
| gamma_final | float? | 1−α | Final-analysis success threshold. The eventual posterior must clear this for the trial to be a positive result. Predictive probability is computed under this threshold. Default is 1−α (e.g., 0.975 for α=0.025), which keeps simulated power near the design target. |
| delta_futility | float | 0.05 | Predictive probability threshold for futility |
| pp_promising_upper | float | 0.50 | Predictive-probability upper bound for the SSR promising zone (Bayesian mode). Trials with delta_futility < PP < this extend N up to N_max; PP ≥ this continues at the originally planned N. Must be greater than delta_futility. Raise to 0.70–0.80 to keep more trials in the SSR zone and push N_p90 toward the N_max budget. |
| cp_futility | float | 0.10 | CP lower bound for futility (CP mode) |
| cp_promising_lower | float | 0.30 | CP lower bound for promising zone |
| cp_promising_upper | float | 0.80 | CP upper bound for promising zone |
| simulate | bool | false | Run Monte Carlo OC validation |
| simulation_seed | int? | null | Random seed for reproducibility (auto-generated if null) |
| n_simulations | int | 10000 | Simulation replicates (1,000–100,000) |
Example Request
{
"ssr_method": "bayesian",
"p0": 0.20,
"p1": 0.40,
"alpha": 0.025,
"power": 0.80,
"interim_fraction": 0.5,
"n_max_factor": 1.5,
"prior_alpha": 0.5,
"prior_beta": 0.5,
"gamma_efficacy": 0.95,
"gamma_final": null,
"delta_futility": 0.05,
"pp_promising_upper": 0.50,
"simulate": true,
"simulation_seed": 42,
"n_simulations": 10000
}
gamma_final: null defaults to 1 - alpha (e.g., 0.975 for alpha = 0.025). Raise pp_promising_upper toward 0.70 to keep more trials in the SSR promising zone.
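A client might validate a payload against the documented constraints and then build the POST request, along the lines of the sketch below. The base URL is a placeholder, no authentication header is shown because none is documented here, and the helper names are hypothetical. The request object is constructed but not sent.

```python
import json
import urllib.request

BASE_URL = "https://example.com"  # placeholder; the real host is not documented here
ENDPOINT = "/api/v1/calculators/ssr-single-arm"


def validate_payload(payload: dict) -> list:
    """Check a request against constraints stated in the parameter table."""
    errors = []
    if payload.get("ssr_method") not in ("bayesian", "conditional_power"):
        errors.append("ssr_method must be 'bayesian' or 'conditional_power'")
    p0, p1 = payload.get("p0"), payload.get("p1")
    if p0 is None or p1 is None or not (0.0 < p0 < p1 < 1.0):
        errors.append("require 0 < p0 < p1 < 1")
    if not (1.0 < payload.get("n_max_factor", 1.5) <= 5.0):
        errors.append("n_max_factor must be > 1 and <= 5")
    cap = payload.get("n_max_absolute")
    if cap is not None and cap < 10:
        errors.append("n_max_absolute must be >= 10")
    if payload.get("pp_promising_upper", 0.50) <= payload.get("delta_futility", 0.05):
        errors.append("pp_promising_upper must exceed delta_futility")
    if not (1000 <= payload.get("n_simulations", 10000) <= 100000):
        errors.append("n_simulations must be between 1,000 and 100,000")
    return errors


def build_request(payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request for the calculator endpoint."""
    problems = validate_payload(payload)
    if problems:
        raise ValueError("; ".join(problems))
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url + ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


example = {
    "ssr_method": "bayesian", "p0": 0.20, "p1": 0.40, "alpha": 0.025,
    "power": 0.80, "interim_fraction": 0.5, "n_max_factor": 1.5,
    "prior_alpha": 0.5, "prior_beta": 0.5, "gamma_efficacy": 0.95,
    "gamma_final": None, "delta_futility": 0.05, "pp_promising_upper": 0.50,
    "simulate": True, "simulation_seed": 42, "n_simulations": 10000,
}
request = build_request(example)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; that step is omitted because the host and any required credentials are assumptions. Note that Python's `None` serializes to JSON `null`, matching the gamma_final default behaviour described above.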
Response Schema (abridged)
{
"calculation_id": "...",
"tier": "analytical+simulation",
"analytical_results": {
"initial_n": 36,
"interim_n": 18,
"interim_fraction": 0.5,
"ssr_method": "bayesian",
"posterior_probability": 0.97,
"predictive_probability": 0.81,
"conditional_power": 0.82,
"conditional_power_planned": 0.82,
"zone": "",
"z1": 1.96,
"efficacy_stop": true,
"futility_stop": false,
"recalculated_n": 18,
"inflation_factor": 0.5,
"n_capped": false,
"n_max_used": 54,
"gamma_final_used": 0.975,
"prior_description": "Jeffreys Beta(0.5, 0.5)",
"decision_rule_description": "...",
"recalculation_scenarios": [
{
"label": "Planned effect",
"assumed_nuisance": 0.40,
"recalculated_n_per_arm": 36,
"recalculated_n_total": 36,
"inflation_factor": 1.0,
"conditional_power": 0.82,
"decision": "continue_favorable"
}
],
"regulatory_notes": [...]
},
"metadata": {...},
"simulation": {...},
"warnings": [],
"regulatory_citations": [...]
}
decision enum values: stop_efficacy, stop_futility, continue_ssr, continue_favorable, continue_unfavorable. Five sensitivity rows are returned by default (50%, 75%, 100%, 125%, 150% of planned effect).
10. References
- Simon R. Optimal two-stage designs for Phase II clinical trials. Controlled Clinical Trials. 1989;10(1):1-10.
- Thall PF, Simon R. Practical Bayesian guidelines for Phase IIB clinical trials. Biometrics. 1994;50(2):337-349.
- Lee JJ, Liu DD. A predictive probability design for Phase II cancer clinical trials. Clinical Trials. 2008;5(2):93-106.
- Mehta CR, Pocock SJ. Adaptive increase in sample size when interim results are promising: A practical guide with examples. Statistics in Medicine. 2011;30(28):3267-3284.
- Chen DT, Schell MJ, Fulp WJ, et al. Application of Bayesian predictive probability for interim futility analysis in single-arm phase II trial. Translational Cancer Research. 2019;8(Suppl 4):S404-S420.
- U.S. Food and Drug Administration. Adaptive Designs for Clinical Trials of Drugs and Biologics: Guidance for Industry. November 2019.
- U.S. Food and Drug Administration. Project Optimus: Optimizing the Dosage of Human Prescription Drugs and Biological Products for the Treatment of Oncologic Diseases. 2023.
- Qian L. Conditional Power Promising Zone Sample Size Re-estimation Inflates Type I Error in Single-Arm Binary Trials: An Exact-Enumeration Study and Comparison with Bayesian Predictive Probability SSR. Zetyra | Evidence in the Wild; April 2026 (under peer review). github.com/evidenceinthewild/CP-SSR-Binary-Trials
Last updated: May 2026
Related Documentation
Blinded SSR
Two-arm Kieser-Friede blinded nuisance-parameter re-estimation with Cui-Hung-Wang conditional power across continuous, binary, and survival endpoints.
Unblinded SSR
Two-arm Mehta-Pocock promising-zone with inverse-normal combination test — the comparator framework when an active control is available.
Bayesian Sample Size
Fixed-design Phase II sizing for binary and continuous endpoints with full operating characteristics — the non-adaptive counterpart to Single-Arm SSR.