Data Engineer

Jobs via Dice

McLean · Hybrid · Contract · Posted yesterday

Job Summary

The Data Engineer works with moderate supervision across two equally weighted domains:

  • large-scale data pipeline development processing high-volume event data in a cloud environment, and
  • design and development of agentic AI systems, including LLM-powered data assistants, MCP servers, and agent harness architectures.

This position contributes to overall product quality throughout the software development lifecycle.

Responsibilities

  • Build and maintain ETL/ELT pipelines using Apache Spark, Hive, and Trino across S3-based data lake environments
  • Develop and optimize SQL for large-scale datasets, including window functions, multi-table joins, and complex aggregations (see the PySpark sketch after this list)
  • Engineer big data systems (EMR-on-EC2, EMR-on-EKS) and develop solutions on analytical platforms (SageMaker, Domino, Dataiku)
  • Participate in data quality monitoring, anomaly detection, and production incident investigation
  • Develop AI agent systems using AWS Bedrock and agent frameworks (e.g., Strands Agents SDK, LangChain/LangGraph, or equivalent)
  • Build agent harness architectures combining LLM reasoning with deterministic execution (e.g., RAG-based SQL generation and structured output validation); a minimal sketch of this pattern follows this list
  • Implement agent memory, context management, and tool integration (MCP servers, API connectors, data catalog lookups)
  • Build evaluation frameworks for agent accuracy (e.g., paraphrase robustness, routing precision, structural consistency)
  • Stay informed of advances in LLM frameworks and emerging AI capabilities
  • Write clean, well-tested code; contribute to CI/CD pipelines and infrastructure-as-code on AWS
  • Ensure secure handling of sensitive data across both data pipelines and AI agent outputs, including auditable execution traces
  • Adhere to internal standards for secure development practices and technology policies
  • Partner across teams, communicate technical information effectively, and maintain documentation
  • Actively learn from senior team members and contribute to process improvement
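
To make the pipeline and SQL bullets concrete, here is a minimal PySpark sketch of one such task: deduplicating high-volume event data with a window function, keeping the most recently ingested record per event. The bucket paths and column names are hypothetical placeholders, not the team's actual schema.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("event-dedup").getOrCreate()

# Hypothetical S3 layout; real paths would come from the data lake catalog.
events = spark.read.parquet("s3://example-bucket/raw/events/")

# Rank records per event_id by ingestion time; NULL timestamps sort last so a
# record with a real timestamp always wins over one without.
w = Window.partitionBy("event_id").orderBy(F.col("ingested_at").desc_nulls_last())

deduped = (
    events
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

deduped.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-bucket/clean/events/"
)
```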
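
For the agent-harness bullet, a minimal sketch of the "LLM reasoning plus deterministic execution" split: the model proposes a structured answer, plain code validates it against a schema, and validation failures feed back into a bounded retry loop. The schema, field names, and retry policy here are illustrative assumptions, not the actual design.

```python
import json
from typing import Any, Callable

# Hypothetical output schema for a text-to-SQL assistant.
REQUIRED_FIELDS = {"sql": str, "tables": list}

def validate_output(raw: str) -> dict[str, Any]:
    """Deterministic check: well-formed JSON with the expected fields and types."""
    payload = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(payload.get(name), expected_type):
            raise ValueError(f"bad or missing field: {name}")
    return payload

def run_harness(call_llm: Callable[[str], str], prompt: str, max_tries: int = 3) -> dict[str, Any]:
    """Harness loop: the LLM reasons, the validator decides, failures retry."""
    last_err: Exception | None = None
    for _ in range(max_tries):
        try:
            return validate_output(call_llm(prompt))
        except ValueError as err:
            last_err = err
            prompt += f"\nYour last answer failed validation ({err}). Return only valid JSON."
    raise RuntimeError(f"agent output never passed validation: {last_err}")
```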

Essential Technical Skills

Data Engineering & Big Data Technologies

  • Experience building data pipelines using Apache Spark (PySpark preferred) and SQL
  • Experience with SQL query engines (Hive, Trino/Presto, or similar) and cloud platforms (AWS S3, EMR, Lambda)
  • Understanding of data skew, large-scale data processing challenges, and debugging strategies (a salted-join sketch follows this list)
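
A common mitigation for the data-skew point above is a salted join: scatter the hot keys on the large side across N synthetic buckets and replicate the small side once per bucket so every pair can still match. A toy sketch with placeholder data and a placeholder bucket count:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salted-join").getOrCreate()

# Toy data: "k1" is a hot key that would otherwise pile into a single partition.
large = spark.createDataFrame([("k1", i) for i in range(1000)], ["join_key", "value"])
small = spark.createDataFrame([("k1", "dim_attr")], ["join_key", "attr"])

SALT_BUCKETS = 16  # placeholder; tuned in practice to the observed skew

# Scatter the skewed side randomly across the salt buckets...
salted_large = large.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# ...and replicate the small side once per bucket so every row still finds its match.
salted_small = small.withColumn(
    "salt", F.explode(F.array(*[F.lit(i) for i in range(SALT_BUCKETS)]))
)

joined = salted_large.join(salted_small, on=["join_key", "salt"]).drop("salt")
```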

Generative AI & Agentic Systems

  • Experience building LLM-powered agent systems that use tools and produce structured outputs
  • Hands-on experience with agent frameworks (LangChain, LangGraph, AWS Strands, or equivalent)
  • Knowledge of prompt engineering, RAG architectures, and memory/context management
  • Experience with foundation model APIs (e.g., Anthropic Claude, Amazon Nova, OpenAI, or similar); a minimal example call follows this list
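
As one concrete instance of the foundation-model-API bullet, a minimal call with the anthropic Python SDK that asks for structured JSON and parses it deterministically. The model id is a placeholder, and steering via a JSON-only system prompt is just one simple approach.

```python
import json
import anthropic  # assumes the SDK is installed and ANTHROPIC_API_KEY is set

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model id
    max_tokens=512,
    system='Reply with a single JSON object: {"tables": [...], "sql": "..."}',
    messages=[{"role": "user", "content": "Which tables hold daily event counts?"}],
)

# Deterministic parse of the structured reply; malformed output raises here.
payload = json.loads(message.content[0].text)
```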

Memory & Agent Design

  • Understanding of memory architectures (working, episodic, semantic memory); a toy sketch follows this list
  • Familiarity with agent harness patterns (tool routing, guardrails, verification loops, fallback handling)
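
To make the three memory tiers concrete, a toy in-process split; production systems would back these with vector stores and databases, and every name here is an illustrative assumption:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Toy model of the tiers: bounded working set, append-only episodes, keyed facts."""
    working: deque = field(default_factory=lambda: deque(maxlen=10))  # recent turns only
    episodic: list = field(default_factory=list)                      # full interaction log
    semantic: dict = field(default_factory=dict)                      # distilled, long-lived facts

    def remember_turn(self, turn: str) -> None:
        self.working.append(turn)   # oldest turns fall out of the context window
        self.episodic.append(turn)  # but the full history is retained for recall

    def learn_fact(self, key: str, value: str) -> None:
        self.semantic[key] = value  # e.g., "events_table" -> "analytics.daily_events"
```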

AI Tool Proficiency

  • Experience with AI development tools (e.g., GitHub Copilot, Q Developer, ChatGPT, Claude)
  • Experience with spec-driven development and AI-assisted coding workflows

Cloud Technologies

  • Experience with AWS services such as S3, EMR, Lambda, Bedrock, Step Functions
  • Familiarity with monitoring/logging tools (CloudWatch, CloudTrail)
  • Exposure to comparable platforms such as Google Vertex AI

Programming (Python)

  • Strong Python skills for data engineering and automation
  • Ability to write clean, modular, and performant code
  • Understanding of functional programming concepts, concurrency, and memory management (a small concurrency example follows this list)
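
For the concurrency point above, a small example of the style implied: fanning I/O-bound work out across threads with the standard library while keeping the task function modular. The probe function is a stub standing in for real S3 or API calls.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def probe_partition(path: str) -> int:
    """Stub for an I/O-bound task, e.g. listing or sampling an S3 prefix."""
    return len(path)  # placeholder work

paths = [f"s3://example-bucket/events/dt=2024-01-{day:02d}/" for day in range(1, 8)]

# Threads fit I/O-bound fan-out; CPU-bound work would use ProcessPoolExecutor instead.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(probe_partition, p): p for p in paths}
    for done in as_completed(futures):
        print(futures[done], done.result())
```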

SQL

  • Strong proficiency in SQL (window functions, joins, aggregations)
  • Ability to optimize complex queries and handle edge cases (NULLs, duplicates, ordering); a sketch of two NULL edge cases follows this list
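
Two of the NULL edge cases named above, sketched as Spark SQL against a toy table (table and column names are placeholders); the SQL itself is portable to Trino-style engines:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("null-edge-cases").getOrCreate()

spark.createDataFrame(
    [("e1", "web"), ("e2", None), ("e3", "mobile")],
    ["event_id", "channel"],
).createOrReplaceTempView("events")

# COUNT(*) counts rows; COUNT(channel) silently skips NULLs -- a classic source
# of mismatched metrics between two superficially identical queries.
spark.sql("""
    SELECT COUNT(*) AS all_rows, COUNT(channel) AS non_null_channels
    FROM events
""").show()

# Explicit NULL placement makes ordering deterministic across engines.
spark.sql("""
    SELECT event_id, channel
    FROM events
    ORDER BY channel ASC NULLS LAST
""").show()
```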

Nice to Have

  • Experience with agent frameworks and advanced patterns (evaluation harnesses, verification loops)
  • Model fine-tuning techniques (LoRA, PEFT, managed tuning platforms)
  • Vector databases (FAISS, Pinecone, OpenSearch)
  • Containerization and orchestration (Docker, Kubernetes, EKS)
  • Infrastructure as Code (Terraform, CloudFormation)
  • CI/CD tools (Jenkins, GitLab CI, GitHub Actions, ArgoCD)
  • Observability tools (Prometheus, Grafana, ELK stack)
  • Cloud or AI-related certifications

Education / Experience

  • Bachelor's degree in Computer Science, Data Science, Information Systems, or related field, with 2+ years of relevant experience (or equivalent practical experience)
  • Experience delivering enterprise-quality software solutions using object-oriented and database technologies
  • Knowledge of modern software engineering practices (test automation, build automation, configuration management)
  • Strong written and verbal communication skills
  • Ability to build effective working relationships and collaborate across teams
  • Ability to learn new technologies quickly and work in a fast-paced environment

Skills

AWS Bedrock · AWS EMR · AWS Lambda · AWS S3 · Apache Spark · CloudWatch · Docker · EMR-on-EC2 · EMR-on-EKS · FAISS · GitLab CI · GitHub Actions · GitHub Copilot · Google Vertex AI · Grafana · Hive · Infrastructure as Code · Kubernetes · LangChain · LangGraph · LoRA · OpenAI · OpenSearch · PEFT · Pinecone · Prometheus · Python · Q Developer · RAG · SageMaker · SQL · Terraform · Trino · Vector databases · AWS Strands Agents SDK
