Senior Full-Stack Engineer, Reliability & Incident Response (Part-Time Contract)

Braintrust

Remote (Global) · Part-time · Senior · 2w ago

About the company

We are a crypto-native consumer app that makes it easy to use your digital assets for everyday spending — paying bills, topping up cards, shopping online — directly from your crypto wallet.

Founded in 2021, we have processed over $250M in volume, are profit-generating, and are backed by top investors. We're a small, high-impact team. Every contributor plays a pivotal role in keeping our systems stable, scalable, and reliable.

About This Role

We're hiring a part-time Senior Full-Stack Engineer to own the stability, performance, and observability of our production systems — and to act as a force multiplier for our lead engineer.

You won't be a pure SRE in the traditional sense. You'll be a senior full-stack engineer with strong SRE instincts: someone who can debug a gnarly React state bug in the morning, ship a Node.js hotfix in the afternoon, tune a Sentry alert between tickets, and — when something like a Sunday-morning fraud incident hits — calmly make sound operational decisions when the playbook is incomplete, and coordinate quickly with engineering and management.

This is not a platform/Kubernetes-focused SRE role, and it is not a feature-only product engineering role. It is a hands-on production ownership role across our TypeScript stack.

You'll work closely with engineering, support, and operations to triage incidents, reduce recurrence, and build the AI-driven reliability tooling that lets our lead engineer take a real vacation.

Role Shape

  • Remote, async-first
  • Part-time contract, 15 hrs/week baseline
  • Core coverage: 2-5pm ET weekdays
  • Additional incident response by agreement
  • Reports directly to our Lead Engineer

Areas of Responsibility (AORs)

Your performance will be measured against your ability to own these three distinct areas, in priority order:

Primary AOR (Reactive): First Response & Triage

This is your top priority. You monitor our incident and support channels (Linear, Slack, Intercom) for incoming bugs and user-reported issues during your core hours.

  • Acknowledge: Be the first to self-assign tickets and communicate that you're investigating.
  • Investigate & Fix: Dive into HyperDX, AWS CloudWatch, and Hotjar to reproduce and diagnose issues across the React frontend and Node.js backend.
  • Resolve: Ship well-tested hotfixes according to ticket priority.
  • Escalate: When a fix is too complex, risky, or domain-heavy, produce a detailed analysis and a clear escalation path for Laurence — don't just toss it over the wall.

Secondary AOR (Proactive): Systemic Improvement

When you're not fighting fires, your job is to make the system more resilient.

  • Pattern Recognition: Identify recurring bugs or failure patterns from tickets you resolve.
  • Structural Fixes: Propose and implement root-cause fixes — better API contracts, improved error handling, refactors of flaky components.
  • Documentation: Keep our troubleshooting docs and runbooks current as you go. Our collective knowledge should grow with every incident you close.

Tertiary AOR (Background): Observability Enhancement

When the queue is clear, improve our monitoring and alerting.

  • Reduce Noise: Audit Sentry and HyperDX alerts for signal quality — they should map to real user impact.
  • Build Dashboards: Create or refine dashboards covering key user journeys and system health.
  • Define Metrics: Propose and implement new metrics, alerts, or SLIs that catch issues before users report them.

What You'll Actually Do, Week To Week

  • Be the first responder for production issues and user-reported bugs across both frontend and backend.
  • Own observability end-to-end — HyperDX, Sentry, and AWS CloudWatch dashboards, alerts, and error tracking.
  • Ship fixes, not just diagnose them. This role is hands-on-keyboard across the stack.
  • Use LLM tools in real engineering workflows, and verify their outputs before shipping.
  • Translate user-reported issues into reproducible bug reports so Support can confidently update customers.
  • Maintain runbooks, rollback playbooks, and P1/P2/P3 incident communication templates.
  • Occasionally act as an extra pair of senior hands for ad-hoc requests from management — emergency fraud response, data pulls, one-off scripts, system shutdowns — the kind of tasks that come up in a small, high-stakes fintech team.

Minimum Requirements

  • Part-time availability — explicitly seeking 15 hrs/week contract work.
  • Core hours overlap — reliably online during 2-5pm ET on weekdays.
  • 5+ years of professional experience shipping software to production.
  • Senior full-stack, not a specialist: demonstrable production experience with React + TypeScript on the frontend and Node.js on the backend.
  • Production AWS experience: deploying and debugging real systems. Must be able to speak to CloudWatch (Logs & Metrics), Lambda, and SQS, or equivalents, from direct experience.
  • Incident response background — verifiable experience as a first responder to live production incidents.
  • Modern observability experience — hands-on with at least one of: HyperDX, Sentry, Datadog, New Relic, Honeycomb, or Grafana.
  • AI-in-the-loop development — practical, current experience using LLM tools (Claude, Cursor, Copilot, or equivalent) in real engineering workflows.
  • Payments, fintech, or real-money crypto experience — shipped production systems involving money movement, financial state, ledgers, payment processors, reconciliation, or on-chain transactions with real funds.
  • Experience working on a small team with a startup culture.

Nice to Have

  • Prior experience defining and monitoring SLOs and SLIs.
  • Familiarity with session replay / UX monitoring tools (FullStory, Hotjar, LogRocket).
  • Serverless architecture experience at scale (Lambda concurrency, SQS queue tuning, cold-start mitigation).
  • Ledger / double-entry accounting systems (AWS QLDB or similar).
  • Tech lead or deployment automation experience — people who've built the process, not just followed it.

How We Work

  • Async-first and remote — teammates across US and international time zones.
  • Cross-functional — you'll work directly with Engineering, Ops, and Support.
  • Small team, high trust — we hire people we can give the keys to, not people we need to supervise.
  • Core coverage window: weekday 2-5pm ET, plus incident response by agreement.

Reports to: Lead Engineer

Skills

AWS CloudWatch, AWS Lambda, HyperDX, Intercom, Linear, Node.js, React, Sentry, SQS, TypeScript
