Senior Full-Stack Engineer, Reliability & Incident Response (Part-Time Contract)

Braintrust

Remote (Global) | Part-time | Senior | Posted 1mo ago

About the company

We are a crypto-native consumer app that makes it easy to use your digital assets for everyday spending — paying bills, topping up cards, shopping online — directly from your crypto wallet.

Founded in 2021, we have processed over $250M in volume, are profit-generating, and are backed by top investors. We're a small, high-impact team. Every contributor plays a pivotal role in keeping our systems stable, scalable, and reliable.

About This Role

We're hiring a part-time Senior Full-Stack Engineer to own the stability, performance, and observability of our production systems — and to act as a force multiplier for our lead engineer.

You won't be a pure SRE in the traditional sense. You'll be a senior full-stack engineer with strong SRE instincts: someone who can debug a gnarly React state bug in the morning, ship a Node.js hotfix in the afternoon, tune a Sentry alert between tickets, and — when something like a Sunday-morning fraud incident hits — calmly make sound operational decisions when the playbook is incomplete, and coordinate quickly with engineering and management.

This is not a platform/Kubernetes-focused SRE role, and it is not a feature-only product engineering role. It is a hands-on production ownership role across our TypeScript stack.

You'll work closely with engineering, support, and operations to triage incidents, reduce recurrence, and build the AI-driven reliability tooling that lets our lead engineer take a real vacation.

Role Shape

Remote, async-first
Part-time contract, 15 hrs/week baseline
Core coverage: 2-5pm ET weekdays
Additional incident response by agreement
Reports directly to our Lead Engineer

Areas of Responsibility (AORs)

Your performance will be measured against your ability to own these three distinct areas, in priority order.

Primary AOR (Reactive): First Response & Triage

This is your top priority. You monitor our incident and support channels (Linear, Slack, Intercom) for incoming bugs and user-reported issues during your core hours.

Acknowledge: Be the first to self-assign tickets and communicate that you're investigating.
Investigate & Fix: Dive into HyperDX, AWS CloudWatch, and Hotjar to reproduce and diagnose issues across the React frontend and Node.js backend.
Resolve: Ship well-tested hotfixes according to ticket priority.
Escalate: When a fix is too complex, risky, or domain-heavy, produce a detailed analysis and a clear escalation path for Laurence — don't just toss it over the wall.

Secondary AOR (Proactive): Systemic Improvement

When you're not fighting fires, your job is to make the system more resilient.

Pattern Recognition: Identify recurring bugs or failure patterns from tickets you resolve.
Structural Fixes: Propose and implement root-cause fixes — better API contracts, improved error handling, refactors of flaky components.
Documentation: Keep our troubleshooting docs and runbooks current as you go. Our collective knowledge should grow with every incident you close.

Tertiary AOR (Background): Observability Enhancement

When the queue is clear, improve our monitoring and alerting.

Reduce Noise: Audit Sentry and HyperDX alerts for signal quality — they should map to real user impact.
Build Dashboards: Create or refine dashboards covering key user journeys and system health.
Define Metrics: Propose and implement new metrics, alerts, or SLIs that catch issues before users report them.

What You'll Actually Do, Week To Week

Be the first responder for production issues and user-reported bugs across both frontend and backend.
Own observability end-to-end — HyperDX, Sentry, and AWS CloudWatch dashboards, alerts, and error tracking.
Ship fixes, not just diagnose them. This role is hands-on-keyboard across the stack.
Use LLM tools in real engineering workflows, and verify their outputs before shipping.
Translate user-reported issues into reproducible bug reports so Support can confidently update customers.
Maintain runbooks, rollback playbooks, and P1/P2/P3 incident communication templates.
Occasionally act as an extra pair of senior hands for ad-hoc requests from management — emergency fraud response, data pulls, one-off scripts, system shutdowns — the kind of tasks that come up in a small, high-stakes fintech team.

Minimum Requirements

Part-time availability — explicitly seeking 15 hrs/week contract work.
Core hours overlap — reliably online during 2-5pm ET on weekdays.
5+ years of professional experience shipping software to production.
Senior full-stack, not a specialist: demonstrable production experience with React + TypeScript on the frontend and Node.js on the backend.
Production AWS experience: deploying and debugging real systems. Must be able to speak to CloudWatch (Logs & Metrics), Lambda, and SQS or equivalents from direct experience.
Incident response background — verifiable experience as a first responder to live production incidents.
Modern observability experience — hands-on with at least one of: HyperDX, Sentry, Datadog, New Relic, Honeycomb, or Grafana.
AI-in-the-loop development — practical, current experience using LLM tools (Claude, Cursor, Copilot, or equivalent) in real engineering workflows.
Payments, fintech, or real-money crypto experience — shipped production systems involving money movement, financial state, ledgers, payment processors, reconciliation, or on-chain transactions with real funds.
Experience working on a small team with a startup culture.

Nice to Have

Prior experience defining and monitoring SLOs and SLIs.
Familiarity with session replay / UX monitoring tools (FullStory, Hotjar, LogRocket).
Serverless architecture experience at scale (Lambda concurrency, SQS queue tuning, cold-start mitigation).
Ledger / double-entry accounting systems (AWS QLDB or similar).
Tech lead or deployment automation experience — people who've built the process, not just followed it.

How We Work

Async-first and remote — teammates across US and international time zones.
Cross-functional — you'll work directly with Engineering, Ops, and Support.
Small team, high trust — we hire people we can give the keys to, not people we need to supervise.
Core coverage window: weekday 2-5pm ET, plus incident response by agreement.

Reports to: Lead Engineer

Skills

AWS CloudWatch, AWS Lambda, HyperDX, Intercom, Linear, Node.js, React, Sentry, SQS, TypeScript
