Disaster Recovery for Trading Desks

August 1, 2012, 9:31 a.m. ET: Knight Capital's freshly deployed trading code begins firing erroneous orders into the market, executing over 397 million shares. By 10:15 a.m., forty-five minutes later, the firm has racked up $460 million in losses from a single botched code push, not a market event (SEC Press Release 2013-222). Nearly three months later, Hurricane Sandy forces the NYSE to close for two consecutive trading days, its first weather-related multi-day shutdown since 1888, stranding firms with tens of millions in unrealized revenue and six delayed IPOs. Neither disaster was a surprise; what collapsed was recovery infrastructure that couldn't match the speed at which modern trading systems fail. The practical antidote is the one most desks acknowledge in theory yet chronically underbuild in practice: recovery capabilities engineered to meet specific regulatory thresholds before the disruption arrives, not after.
TL;DR: Disaster recovery for trading desks is governed by explicit regulatory requirements—including 2-hour recovery objectives for CCPs under EMIR and next-business-day recovery for DCOs under CFTC rules. Firms that treat BCDR planning as a checkbox exercise discover the gaps only when it's too late to fix them.
What Disaster Recovery Means for a Trading Desk (And What It Doesn't)
A Disaster Recovery Plan (DRP) is the subset of your broader Business Continuity Plan focused specifically on restoring IT systems, data, and infrastructure after a disruption. It covers geographic dispersal of backup facilities, defined recovery time objectives, and recovery point objectives. It does not cover every business continuity scenario—staffing plans, client communication protocols, and alternate office arrangements fall under the broader Business Continuity Plan (BCP).
The point is: disaster recovery is infrastructure-focused. It answers one question—how fast can you restore the systems that process trades, calculate margin, and file regulatory reports?
Two metrics define every DRP:
- Recovery Time Objective (RTO): The maximum acceptable duration before a critical function is restored. Under EMIR Article 34 and its implementing technical standards, CCPs must achieve a 2-hour RTO for critical functions. Under CFTC Rule 39.18, DCOs must be able to resume operations by the next business day.
- Recovery Point Objective (RPO): The maximum acceptable data loss, measured in time. Industry standard for trading desks is 0 to 15 minutes for trade data, meaning you can lose no more than 15 minutes of transaction records. (A minimal measurement sketch of both metrics follows this list.)
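To make the two metrics concrete, here is a minimal sketch of how they are measured after an outage. All timestamps are hypothetical:

```python
from datetime import datetime

# Hypothetical outage timestamps, for illustration only.
failure_detected = datetime(2024, 3, 5, 10, 15)  # primary site goes dark
last_replicated = datetime(2024, 3, 5, 10, 3)    # newest record present at the secondary site
service_restored = datetime(2024, 3, 5, 11, 45)  # full trading capability back online

achieved_rto = service_restored - failure_detected  # downtime: 1:30:00
achieved_rpo = failure_detected - last_replicated   # data-loss window: 0:12:00

print(f"Achieved RTO: {achieved_rto} (objective: 2 hours for an EMIR CCP)")
print(f"Achieved RPO: {achieved_rpo} (objective: 15 minutes)")
```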
FINRA Rule 4370 requires broker-dealers to maintain a written BCP addressing 10 mandatory elements, including data backup and recovery, mission-critical systems identification, alternate communications, and regulatory reporting. A registered principal must review the plan at least once every 12 months and update it upon any material operational change.
Why this matters: if your firm trades derivatives—cleared or uncleared—you're subject to overlapping BCDR obligations from multiple regulators simultaneously (CFTC, SEC, FINRA, and potentially ESMA if you touch EU markets).
The Regulatory Framework (Who Requires What)
The regulatory landscape for trading desk disaster recovery is layered. Here's what each framework demands:
| Regulation | Applies To | Key Requirement | Recovery Standard |
|---|---|---|---|
| CFTC Rule 39.18 | DCOs | BCDR plan with geographic dispersal and testing | Next business day RTO |
| EMIR Article 34 (and RTS) | CCPs (EU) | Business continuity with defined RTO | 2-hour RTO for critical functions |
| FINRA Rule 4370 | Broker-dealers | Written BCP with 10 elements, annual review | No specific RTO; must be "reasonable" |
| SEC Rule 17a-4 | Broker-dealers | Redundant electronic recordkeeping | Records readily downloadable; 6-year retention for most records |
| CFTC Parts 43/45 | SDs, MSPs, DCOs | Swap data reporting to SDRs | T+1 for SD/MSP; T+2 for non-SD |
| CFTC ORF NPRM (2024) | FCMs, SDs, MSPs | Operational Resilience Framework | IT security + third-party risk + BCDR |
The CFTC's Operational Resilience Framework, proposed on January 24, 2024 (Federal Register), would require all registered FCMs, swap dealers, and major swap participants to maintain a formal ORF encompassing IT security programs, third-party risk management, and BCDR plans with risk-proportionate testing. This isn't final rulemaking yet, but it signals the direction of regulatory expectations.
The rule that survives: regulators are converging on the principle that operational resilience is not optional infrastructure—it's a compliance obligation with specific, testable standards.
How Disaster Recovery Works in Practice (The Mechanics)
A trading desk's DRP operates across three layers:
Layer 1: Data Replication and Backup

Trade data, position records, and margin calculations must replicate continuously to a secondary site. SEC Rule 17a-4 mandates backup electronic recordkeeping systems that serve as redundant record sets, with records readily downloadable in both human-readable and electronic format. The RPO target of 0 to 15 minutes means your replication lag cannot exceed that window.
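As an illustration, here is a minimal monitoring sketch that alerts before replication lag breaches the RPO. The `last_applied_timestamp` hook is hypothetical; real replication stacks expose lag through their own tooling (a heartbeat table, a metrics endpoint):

```python
import logging
from datetime import datetime, timezone

log = logging.getLogger("dr.replication")

RPO_SECONDS = 15 * 60      # stated recovery point objective
ESCALATE_SECONDS = 5 * 60  # escalate well before the objective is breached

def last_applied_timestamp() -> datetime:
    """Hypothetical hook: return the UTC timestamp of the newest record
    applied at the secondary site, e.g. from a replication heartbeat table."""
    raise NotImplementedError

def check_replication_lag() -> None:
    lag = (datetime.now(timezone.utc) - last_applied_timestamp()).total_seconds()
    if lag > RPO_SECONDS:
        log.critical("RPO breach: replication lag %.0fs > %ds", lag, RPO_SECONDS)
    elif lag > ESCALATE_SECONDS:
        log.warning("Replication lag %.0fs approaching RPO of %ds", lag, RPO_SECONDS)
```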
Layer 2: Infrastructure Failover

Failover is the process of switching operations from a primary site to a secondary disaster recovery site. Both CFTC and EMIR requirements mandate at least annual testing of this process. The secondary site must maintain geographic dispersal; industry practice for systemically important financial market utilities is 200+ miles between primary and secondary sites, far enough to avoid simultaneous disruption from a single event like a hurricane or regional power failure.
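The dispersal figure itself is easy to verify mechanically with a great-circle distance check. The coordinates below are hypothetical; note that distance alone says nothing about shared power grids or network backbones, which still require manual verification:

```python
from math import asin, cos, radians, sin, sqrt

def miles_between(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle (haversine) distance in statute miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3959 * asin(sqrt(a))  # Earth's mean radius is roughly 3,959 miles

primary = (40.74, -74.17)    # hypothetical primary: northern New Jersey
secondary = (41.88, -87.63)  # hypothetical secondary: Chicago
separation = miles_between(*primary, *secondary)
assert separation >= 200, f"Sites only {separation:.0f} miles apart"
print(f"Site separation: {separation:.0f} miles")
```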
Layer 3: Regulatory Reporting Continuity

Even during a disruption, your reporting obligations don't pause. Swap dealers must report trade data to a Swap Data Repository (SDR) by the end of the next business day after execution (T+1) under CFTC Part 45. Non-SD/MSP counterparties get T+2. If your primary reporting infrastructure fails and you miss these deadlines, that's a regulatory violation, not an excused absence.
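Because the reporting clock keeps running through an outage, it's worth computing the hard deadline explicitly. A minimal sketch, assuming a weekend-only business-day calendar (production code would consult a proper holiday calendar):

```python
from datetime import date, timedelta

def next_business_day(d: date) -> date:
    """Weekend-only calendar, for illustration; real code needs holidays too."""
    d += timedelta(days=1)
    while d.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        d += timedelta(days=1)
    return d

def part45_deadline(execution_date: date, is_sd_or_msp: bool) -> date:
    """CFTC Part 45 swap creation data: T+1 for SD/MSP, T+2 otherwise."""
    t1 = next_business_day(execution_date)
    return t1 if is_sd_or_msp else next_business_day(t1)

# A swap executed on Tuesday, March 5, 2024 must be reported by Wednesday:
print(part45_deadline(date(2024, 3, 5), is_sd_or_msp=True))  # 2024-03-06
```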
The point is: disaster recovery isn't just about getting your screens back. It's about maintaining three simultaneous capabilities—data integrity, system availability, and regulatory compliance—under degraded conditions.
Worked Example: Interest Rate Swap Desk Failover Scenario
Consider a mid-sized swap dealer running an interest rate derivatives book. Here's how a disaster recovery event plays out with real numbers.
Phase 1: The Setup
Your firm executes 15-25 interest rate swaps daily with a gross notional outstanding of approximately EUR 12 billion, above the EUR 8 billion AANA threshold that triggers initial margin requirements for uncleared swaps under BCBS-IOSCO Phase 6 (effective September 1, 2022). You post variation margin daily, collected or paid by end of the business day following trade date per BCBS-IOSCO standards. Your initial margin exposure at group level exceeds EUR 50 million, the threshold above which posting is required.
At 10:15 AM on a Tuesday, your primary data center experiences a complete power failure. Your trading, risk management, and regulatory reporting systems go offline.
Phase 2: The Trigger
Your DRP activates. The failover process begins switching operations to your secondary site (located 250 miles from the primary facility). Here's the timeline that matters:
| Clock | Event | Regulatory Benchmark |
|---|---|---|
| T+0 min | Primary site failure detected | — |
| T+15 min | Failover initiated; last replicated data point confirmed | RPO: 0-15 min |
| T+45 min | Secondary site systems online; read-only access restored | — |
| T+90 min | Full trading capability restored at secondary site | EMIR CCP standard: 2-hour RTO |
| T+120 min | Regulatory reporting systems confirmed operational | Swap reporting: T+1 deadline still applies |
| End of day | Variation margin calculations completed and exchanged | BCBS-IOSCO: end of next business day |
Phase 3: The Outcome
You executed 8 swaps before the outage. Under CFTC Part 45, those must be reported to the SDR by end of business Wednesday (T+1 from execution). Your secondary site's reporting systems came online at T+120 minutes—well within the deadline.
However, you lost 12 minutes of position data between the last replication and the failure (within the 0-15 minute RPO). Your operations team must reconcile those 12 minutes of activity manually against counterparty records before variation margin can be calculated accurately.
The practical point: Your variation margin payment—potentially millions of euros on a EUR 12 billion book—depends on accurate position data. A 12-minute data gap doesn't sound large, but if 3 of those 8 pre-outage swaps haven't replicated, your mark-to-market is wrong, your margin call is wrong, and your counterparty relationship is strained.
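The reconciliation itself is mechanical: diff the trades your order log says you executed against what actually reached the secondary site. A sketch with hypothetical trade IDs matching the scenario above:

```python
# Hypothetical trade IDs for the 8 pre-outage swaps.
executed_per_order_log = {f"IRS-104{i}" for i in range(1, 9)}  # IRS-1041 .. IRS-1048
replicated_to_secondary = {"IRS-1041", "IRS-1042", "IRS-1043",
                           "IRS-1044", "IRS-1045"}             # 3 trades didn't make it

missing = executed_per_order_log - replicated_to_secondary
if missing:
    # These must be rebooked from counterparty confirmations before
    # variation margin can be calculated accurately.
    print(f"{len(missing)} trades lost in the replication gap: {sorted(missing)}")
```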
Mechanical alternative: Reduce your RPO from 15 minutes to near-zero using synchronous replication (at higher infrastructure cost), or implement a trade-level confirmation protocol that independently logs each execution to both sites simultaneously.
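Here is a sketch of the second alternative, dual-site confirmation logging: an execution counts as confirmed only after both sites have durably recorded it, so an unreplicated trade can never be mistaken for a booked one. The `Site` class and function names are hypothetical:

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class Execution:
    trade_id: str
    notional: float
    executed_at: str

class Site:
    """Hypothetical append-only log endpoint at one data center."""
    def __init__(self, name: str):
        self.name, self.log = name, []

    def append(self, record: dict) -> None:
        self.log.append(json.dumps(record))  # stand-in for a durable, acknowledged write

def confirm_execution(trade: Execution, primary: Site, secondary: Site) -> None:
    """Release a confirmation downstream only after BOTH sites acknowledge.
    Closes the RPO gap for executed trades at the cost of extra write latency."""
    record = asdict(trade)
    primary.append(record)
    secondary.append(record)  # synchronous second write, before confirmation
    # ...only now hand the trade to booking, risk, and reporting.
```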
Common Failure Modes (What Goes Wrong and Why)
Disaster recovery plans fail not in their documentation but in their execution. Here are the patterns:
Failure → Impact → Regulatory consequence:
- Untested failover → Secondary site doesn't actually work under load → Missed reporting deadlines and margin payment failures (the most common failure mode—firms document a plan but test it only superficially)
- Insufficient geographic dispersal → Both sites affected by same regional event → Extended outage beyond RTO (Hurricane Sandy demonstrated this when firms with Manhattan primary and Brooklyn secondary sites lost both)
- Third-party dependency gaps → Your systems recover but your clearing firm's don't → You can't submit trades to the CCP (the CFTC's 2024 ORF proposal specifically targets this gap)
- Stale contact lists → Key personnel unreachable during activation → Delayed decision-making and escalation failures
The test: Can your desk execute, clear, report, and margin a new trade entirely from your secondary site, using only secondary-site personnel, within your stated RTO? If you haven't validated this end-to-end in the last 12 months, your DRP is a document, not a capability.
Risks, Limitations, and Tradeoffs
Every DRP involves cost-benefit decisions:
Synchronous vs. asynchronous replication. Synchronous replication (near-zero RPO) requires high-bandwidth, low-latency connections between sites—expensive when sites are 200+ miles apart. Asynchronous replication is cheaper but accepts some data loss. The right choice depends on your book's size and velocity.
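The cost of synchronous replication at distance isn't just bandwidth; physics sets a latency floor on every acknowledged write. A back-of-the-envelope sketch (the figures are approximations):

```python
# Every synchronously acknowledged write pays at least the inter-site
# round trip at the speed of light in fiber (~2/3 of c in vacuum).
SITE_SEPARATION_MILES = 250
KM_PER_MILE = 1.609
FIBER_SPEED_KM_PER_S = 200_000

one_way_ms = SITE_SEPARATION_MILES * KM_PER_MILE / FIBER_SPEED_KM_PER_S * 1000
print(f"Physical latency floor: ~{2 * one_way_ms:.1f} ms per synchronous write")
# ~4 ms before switching, serialization, and storage overhead: material on a
# high-throughput trade capture path, negligible for end-of-day batch jobs.
```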
RTO ambition vs. infrastructure cost. A 2-hour RTO (EMIR CCP standard) requires hot standby infrastructure—servers running and ready, not just provisioned. A next-business-day RTO (CFTC DCO standard) allows cold or warm standby at lower cost. Your firm's RTO should be proportionate to your market impact (and regulatory classification).
Testing frequency vs. operational disruption. Full failover tests are disruptive—they require taking the primary site offline. Annual testing is the regulatory minimum, but firms with complex derivatives books benefit from semi-annual or quarterly partial tests that validate specific components without full cutover.
Why this matters: the cost of disaster recovery infrastructure is measurable. The cost of an inadequate plan is not—until the disaster happens. Knight Capital's $460 million loss and subsequent acquisition by Getco LLC is the extreme case, but smaller operational failures that trigger missed margin payments or reporting deadlines carry real financial and regulatory consequences (Knight Capital paid a $12 million SEC penalty on top of the trading losses).
Disaster Recovery Readiness Checklist
Essential (High ROI)
- Identify all mission-critical systems (trading, clearing, margin, regulatory reporting) and document RTOs and RPOs for each
- Verify geographic dispersal—secondary site is 200+ miles from primary, on a separate power grid and network backbone
- Confirm regulatory reporting capability from secondary site—can you file to your SDR within T+1 using only backup infrastructure?
- Assign a registered principal as BCP/DRP owner with authority to activate failover (FINRA Rule 4370 requirement)
High-Impact (Workflow and Automation)
- Conduct full failover test annually (minimum)—including clearing connectivity, margin calculation, and SDR reporting
- Map third-party dependencies—clearing firms, exchanges, SDRs, market data vendors—and confirm their recovery capabilities align with yours
- Implement automated replication monitoring with alerts when RPO exceeds threshold (e.g., replication lag > 5 minutes triggers escalation)
- Maintain current contact lists for all DR activation personnel, counterparties, and regulators; test quarterly
Optional (Good for Firms with Large Uncleared Books)
- Test margin recalculation accuracy after failover—verify that variation margin computed from replicated data matches primary-site calculations within tolerance
- Establish independent trade confirmation logging to both sites simultaneously (eliminates RPO gap for executed trades)
- Conduct tabletop exercises with senior management simulating extended outages (3+ days) to test escalation and communication protocols
Your Next Step
This week, answer one question: what is your current RPO for trade data, and when was it last validated?
Pull up your DRP documentation. Find the stated RPO. Then check with your infrastructure team—when did they last measure actual replication lag under peak trading volume? If the answer is "I'm not sure" or "more than 12 months ago," that's your first action item. Schedule a replication lag measurement during the next high-volume trading session and compare the result against your stated RPO. If measured lag exceeds the stated objective, escalate to your BCP owner before the next annual review cycle.
For related operational controls, see Recordkeeping and Surveillance Obligations and Cybersecurity Considerations for Derivatives Teams.