The Day We Recovered Production in 6 Hours

Every secret in our .env disappeared at 04:41 UTC. Here is exactly how we rotated, rebuilt, and verified the entire stack.

At 4:41 AM UTC today, every secret in our production environment file disappeared.

Database credentials. Stripe API keys. Webhook signing secrets. Email credentials. Authentication tokens. The cron shared secret. All of it — gone from a single env file on our production droplet.

The site immediately started throwing 500s. Candidates couldn't log in. Employers couldn't access their dashboards. Stripe webhooks began failing silently. The Smart-Apply cron that runs every six hours went dark.

Six hours later, SignalRoster is back online — every subsystem rebuilt, every credential rotated, every surface tested end-to-end.

Here's exactly what happened, what we did, and what we learned.

The Discovery

The first sign was a spike in 500 errors. Within minutes we SSH'd into the production droplet and found the env file gutted. Everything sensitive had been stripped out.

We don't yet know what caused it. A deploy script, a misfired automation, a compromised credential — the investigation is still open. But the cause didn't matter in the first hour. The only thing that mattered was getting the site back up.

The Playbook

When your production config is in an unknown state, you have to treat every credential as compromised and rotate it. No exceptions. Here's the sequence we ran:

1. Triage, not forensics. We split the problem in half: fix the site first, investigate after. That mental switch is hard but critical. Every minute spent asking "how did this happen?" is a minute the site stays down.

2. Rotate everything. We assumed every secret that had been in the file was now exposed, whether that was actually true or not. New database password. New Stripe live secret key. New webhook signing secret. New Resend API key. New NextAuth secret. New cron shared secret. Zero trust in anything old.

3. Rebuild the config. Audit logs, CLI histories, and a fresh set of credentials were our source of truth. We rebuilt the env from scratch, one variable at a time, verifying each one as it went in.

4. Fight the real-world gotchas. DigitalOcean blocks outbound SMTP on ports 465 and 587, so we switched to port 2465 once we discovered mail was silently failing. Our Resend domain verification was tied to a different team account than the one whose API key we'd used, so the first rotated key still couldn't send mail. Each of these cost 20 to 40 minutes.

5. Verify end-to-end before declaring victory. We wrote a single script that checked PM2 cluster status, every public route, every API endpoint, Stripe webhook signature validation, database connectivity, a real SMTP send to a real inbox, and the cron endpoint authenticating and actually running a scan. Only when every check returned green did we stop.
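Steps 2 and 3 of the sequence above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling we ran: the variable names in REQUIRED_KEYS are assumptions (the real list is not reproduced here), and only secrets you mint yourself, such as a session secret or a cron shared secret, can be generated locally. Provider keys (database, Stripe, Resend) still have to be rotated in each provider's dashboard or CLI.

```python
from __future__ import annotations

import secrets

# Hypothetical variable names, used only for illustration.
REQUIRED_KEYS = [
    "DATABASE_URL",
    "STRIPE_SECRET_KEY",
    "STRIPE_WEBHOOK_SECRET",
    "RESEND_API_KEY",
    "NEXTAUTH_SECRET",
    "CRON_SECRET",
]


def new_secret(num_bytes: int = 32) -> str:
    """Return a fresh URL-safe random token for secrets we mint ourselves."""
    return secrets.token_urlsafe(num_bytes)


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of optional quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: dict[str, str], required: list[str] = REQUIRED_KEYS) -> list[str]:
    """List required keys that are absent or empty in the rebuilt env."""
    return [key for key in required if not env.get(key)]
```

Running missing_keys(parse_env(...)) against the rebuilt file after each variable goes in gives a shrinking to-do list, and an empty list is the signal that the config is at least complete, though not yet proven correct.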

What We Learned

Secrets management is not a "later" problem. If you can't recover your production config from a secrets manager in under ten minutes, you have a problem.

Write the verification script before you need it. The single most valuable tool during this incident was a shell script that tested every public surface, every API, every integration. Having that in version control means future recoveries take minutes, not hours.
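Our script was a shell script, but the shape matters more than the language. As a rough sketch in Python, assuming each check is a small function that returns on success and raises on failure, a runner only needs to execute every check without stopping early and report which ones failed. The names http_ok, run_checks, and all_green are hypothetical helpers made up for this example.

```python
from __future__ import annotations

import urllib.request
from typing import Callable

# A check passes by returning normally and fails by raising an exception.
Check = Callable[[], None]


def http_ok(url: str, headers: dict[str, str] | None = None, timeout: float = 10.0) -> Check:
    """Build a check that passes when GET url answers with a 2xx status."""
    def check() -> None:
        request = urllib.request.Request(url, headers=headers or {})
        # urlopen raises HTTPError on 4xx/5xx, which also fails the check.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            if not 200 <= response.status < 300:
                raise RuntimeError(f"unexpected status {response.status}")
    return check


def run_checks(checks: dict[str, Check]) -> dict[str, str]:
    """Run every check, never stopping early; map each name to 'ok' or the error."""
    results: dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"FAIL: {exc}"
    return results


def all_green(results: dict[str, str]) -> bool:
    """True only when every single check reported 'ok'."""
    return all(status == "ok" for status in results.values())
```

A public route becomes http_ok("https://example.com/"), and an authenticated cron endpoint becomes the same call with a headers argument carrying the shared secret (the URL and header scheme here are placeholders). Checks for PM2, the database, Stripe signatures, and SMTP follow the same pattern: one function each, raising on failure.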

Outages happen in layers. The first fix never fixes everything. We thought we had email working three separate times before we actually did. Every assumption we made needed verification.
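One assumption that can be tested directly is whether the server can open outbound connections on a given SMTP port at all. The sketch below, in Python, tries a plain TCP connection to each port and records the outcome. The function name is hypothetical, and the SMTP host you pass in is whatever your mail provider documents. An open port only shows that the network path is not blocked; it says nothing about authentication or domain verification, so a real send to a real inbox remains the final test.

```python
from __future__ import annotations

import socket


def reachable_ports(host: str, ports: list[int], timeout: float = 5.0) -> dict[int, bool]:
    """Try a TCP connection to each port; True means the connection opened."""
    results: dict[int, bool] = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts, and DNS failures.
            results[port] = False
    return results
```

Calling it with the provider's SMTP host and the ports [465, 587, 2465] from the production server shows in a few seconds which ports are worth configuring, instead of finding out through silently dropped mail.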

Forensics come last. We still don't know what wiped the file. That investigation starts tomorrow, with clear heads and a running system.

What's Next for SignalRoster

Recovery is one thing. Momentum is another.

SignalRoster exists because job search in 2026 is still broken. Ninety-nine percent of applications go unread. Candidates apply into a black hole. Recruiters drown in noise. We're building something different — AI-powered matching that actually fits, Smart-Apply that tailors and submits applications while you sleep, and real humans on the other end.

We came back stronger today. Our infrastructure is tighter, our rotation discipline is sharper, and our resolve to build the best hiring platform on the internet hasn't moved an inch.

Still shipping. Always shipping.