Senior Full-Stack Engineer, Reliability & Incident Response (Part-Time Contract)
Braintrust
About the role
We are a crypto-native consumer app that makes it easy to use your digital assets for everyday spending — paying bills, topping up cards, shopping online — directly from your crypto wallet.
Founded in 2021, we have processed over $250M in volume, are profit-generating, and are backed by top investors. We're a small, high-impact team. Every contributor plays a pivotal role in keeping our systems stable, scalable, and reliable.
About This Role
We're hiring a part-time Senior Full-Stack Engineer to own the stability, performance, and observability of our production systems — and to act as a force multiplier for our lead engineer.
You won't be a pure SRE in the traditional sense. You'll be a senior full-stack engineer with strong SRE instincts: someone who can debug a gnarly React state bug in the morning, ship a Node.js hotfix in the afternoon, tune a Sentry alert between tickets, and — when something like a Sunday-morning fraud incident hits — calmly make sound operational decisions when the playbook is incomplete, and coordinate quickly with engineering and management.
This is not a platform/Kubernetes-focused SRE role, and it is not a feature-only product engineering role. It is a hands-on production ownership role across our TypeScript stack.
You'll work closely with engineering, support, and operations to triage incidents, reduce recurrence, and build the AI-driven reliability tooling that lets our lead engineer take a real vacation.
Role Shape
- Remote, async-first
- Part-time contract, 15 hrs/week baseline
- Core coverage: 2-5pm ET weekdays
- Additional incident response by agreement
- Reports directly to our Lead Engineer
Areas of Responsibility (AORs)
Your performance will be measured against your ability to own these three distinct areas, in priority order:
Primary AOR (Reactive): First Response & Triage
This is your top priority. You monitor our incident and support channels (Linear, Slack, Intercom) for incoming bugs and user-reported issues during your core hours.
- Acknowledge: Be the first to self-assign tickets and communicate that you're investigating.
- Investigate & Fix: Dive into HyperDX, AWS CloudWatch, and Hotjar to reproduce and diagnose issues across the React frontend and Node.js backend.
- Resolve: Ship well-tested hotfixes according to ticket priority.
- Escalate: When a fix is too complex, risky, or domain-heavy, produce a detailed analysis and a clear escalation path for Laurence — don't just toss it over the wall.
Secondary AOR (Proactive): Systemic Improvement
When you're not fighting fires, your job is to make the system more resilient.
- Pattern Recognition: Identify recurring bugs or failure patterns from tickets you resolve.
- Structural Fixes: Propose and implement root-cause fixes — better API contracts, improved error handling, refactors of flaky components.
- Documentation: Keep our troubleshooting docs and runbooks current as you go. Our collective knowledge should grow with every incident you close.
Tertiary AOR (Background): Observability Enhancement
When the queue is clear, improve our monitoring and alerting.
- Reduce Noise: Audit Sentry and HyperDX alerts for signal quality — they should map to real user impact.
- Build Dashboards: Create or refine dashboards covering key user journeys and system health.
- Define Metrics: Propose and implement new metrics, alerts, or SLIs that catch issues before users report them.
What You'll Actually Do, Week To Week
- Be the first responder for production issues and user-reported bugs across both frontend and backend.
- Own observability end-to-end — HyperDX, Sentry, and AWS CloudWatch dashboards, alerts, and error tracking.
- Ship fixes, not just diagnose them. This role is hands-on-keyboard across the stack.
- Use LLM tools in real engineering workflows, and verify their outputs before shipping.
- Translate user-reported issues into reproducible bug reports so Support can confidently update customers.
- Maintain runbooks, rollback playbooks, and P1/P2/P3 incident communication templates.
- Occasionally act as an extra pair of senior hands for ad-hoc requests from management — emergency fraud response, data pulls, one-off scripts, system shutdowns — the kind of tasks that come up in a small, high-stakes fintech team.
Minimum Requirements
- Part-time availability — explicitly seeking 15 hrs/week contract work.
- Core hours overlap — reliably online during 2-5pm ET on weekdays.
- 5+ years of professional experience shipping software to production.
- Senior full-stack, not a specialist: demonstrable production experience with React + TypeScript on the frontend and Node.js on the backend.
- Production AWS experience: deploying and debugging real systems. Must be able to speak to CloudWatch (Logs & Metrics), Lambda, and SQS, or equivalents, from direct experience.
- Incident response background — verifiable experience as a first responder to live production incidents.
- Modern observability experience — hands-on with at least one of: HyperDX, Sentry, Datadog, New Relic, Honeycomb, or Grafana.
- AI-in-the-loop development — practical, current experience using LLM tools (Claude, Cursor, Copilot, or equivalent) in real engineering workflows.
- Payments, fintech, or real-money crypto experience — shipped production systems involving money movement, financial state, ledgers, payment processors, reconciliation, or on-chain transactions with real funds.
- Experience working on a small team with a startup culture.
Nice to Have
- Prior experience defining and monitoring SLOs and SLIs.
- Familiarity with session replay / UX monitoring tools (FullStory, Hotjar, LogRocket).
- Serverless architecture experience at scale (Lambda concurrency, SQS queue tuning, cold-start mitigation).
- Ledger / double-entry accounting systems (AWS QLDB or similar).
- Tech lead or deployment automation experience — people who've built the process, not just followed it.
How We Work
- Async-first and remote — teammates across US and international time zones.
- Cross-functional — you'll work directly with Engineering, Ops, and Support.
- Small team, high trust — we hire people we can give the keys to, not people we need to supervise.
- Core coverage window: weekday 2-5pm ET, plus incident response by agreement.
Reports to: Lead Engineer