Data Engineer

Jobs via Dice

McLean · Hybrid · Contract · Posted yesterday

Job Summary

The Data Engineer works with moderate supervision across two equally weighted domains:

  • large-scale data pipeline development processing high-volume event data in a cloud environment, and
  • design and development of agentic AI systems, including LLM-powered data assistants, MCP servers, and agent harness architectures.

This position contributes to overall product quality throughout the software development lifecycle.

Responsibilities

  • Build and maintain ETL/ELT pipelines using Apache Spark, Hive, and Trino across S3-based data lake environments
  • Develop and optimize SQL for large-scale datasets, including window functions, multi-table joins, and complex aggregations (see the PySpark sketch after this list)
  • Engineer big data systems (EMR-on-EC2, EMR-on-EKS) and develop solutions on analytical platforms (SageMaker, Domino, Dataiku)
  • Participate in data quality monitoring, anomaly detection, and production incident investigation
  • Develop AI agent systems using AWS Bedrock and agent frameworks (e.g., Strands Agents SDK, LangChain/LangGraph, or equivalent)
  • Build agent harness architectures combining LLM reasoning with deterministic execution (e.g., RAG-based SQL generation and structured output validation); a minimal sketch of this pattern follows this list
  • Implement agent memory, context management, and tool integration (MCP servers, API connectors, data catalog lookups)
  • Build evaluation frameworks for agent accuracy (e.g., paraphrase robustness, routing precision, structural consistency)
  • Stay informed of advances in LLM frameworks and emerging AI capabilities
  • Write clean, well-tested code; contribute to CI/CD pipelines and infrastructure-as-code on AWS
  • Ensure secure handling of sensitive data across both data pipelines and AI agent outputs, including auditable execution traces
  • Adhere to internal standards for secure development practices and technology policies
  • Partner across teams, communicate technical information effectively, and maintain documentation
  • Actively learn from senior team members and contribute to process improvement
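
To make the pipeline and SQL bullets concrete, here is a minimal PySpark sketch of one such task: deduplicating high-volume event data with a window function, keeping the most recently ingested record per event. The bucket paths and column names are hypothetical placeholders, not the team's actual schema.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("event-dedup").getOrCreate()

# Hypothetical S3 layout; real paths would come from the data lake catalog.
events = spark.read.parquet("s3://example-bucket/raw/events/")

# Rank records per event_id by ingestion time; NULL timestamps sort last so a
# record with a real timestamp always wins over one without.
w = Window.partitionBy("event_id").orderBy(F.col("ingested_at").desc_nulls_last())

deduped = (
    events
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

deduped.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-bucket/clean/events/"
)
```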
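
For the agent-harness bullet, a minimal sketch of the "LLM reasoning plus deterministic execution" split: the model proposes a structured answer, plain code validates it against a schema, and validation failures feed back into a bounded retry loop. The schema, field names, and retry policy here are illustrative assumptions, not the actual design.

```python
import json
from typing import Any, Callable

# Hypothetical output schema for a text-to-SQL assistant.
REQUIRED_FIELDS = {"sql": str, "tables": list}

def validate_output(raw: str) -> dict[str, Any]:
    """Deterministic check: well-formed JSON with the expected fields and types."""
    payload = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(payload.get(name), expected_type):
            raise ValueError(f"bad or missing field: {name}")
    return payload

def run_harness(call_llm: Callable[[str], str], prompt: str, max_tries: int = 3) -> dict[str, Any]:
    """Harness loop: the LLM reasons, the validator decides, failures retry."""
    last_err: Exception | None = None
    for _ in range(max_tries):
        try:
            return validate_output(call_llm(prompt))
        except ValueError as err:
            last_err = err
            prompt += f"\nYour last answer failed validation ({err}). Return only valid JSON."
    raise RuntimeError(f"agent output never passed validation: {last_err}")
```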

Essential Technical Skills

Data Engineering & Big Data Technologies

  • Experience building data pipelines using Apache Spark (PySpark preferred) and SQL
  • Experience with SQL query engines (Hive, Trino/Presto, or similar) and cloud platforms (AWS S3, EMR, Lambda)
  • Understanding of data skew, large-scale data processing challenges, and debugging strategies (a salted-join sketch follows this list)
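
A common mitigation for the data-skew point above is a salted join: scatter the hot keys on the large side across N synthetic buckets and replicate the small side once per bucket so every pair can still match. A toy sketch with placeholder data and a placeholder bucket count:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salted-join").getOrCreate()

# Toy data: "k1" is a hot key that would otherwise pile into a single partition.
large = spark.createDataFrame([("k1", i) for i in range(1000)], ["join_key", "value"])
small = spark.createDataFrame([("k1", "dim_attr")], ["join_key", "attr"])

SALT_BUCKETS = 16  # placeholder; tuned in practice to the observed skew

# Scatter the skewed side randomly across the salt buckets...
salted_large = large.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# ...and replicate the small side once per bucket so every row still finds its match.
salted_small = small.withColumn(
    "salt", F.explode(F.array(*[F.lit(i) for i in range(SALT_BUCKETS)]))
)

joined = salted_large.join(salted_small, on=["join_key", "salt"]).drop("salt")
```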

Generative AI & Agentic Systems

  • Experience building LLM-powered agent systems that use tools and produce structured outputs
  • Hands-on experience with agent frameworks (LangChain, LangGraph, AWS Strands, or equivalent)
  • Knowledge of prompt engineering, RAG architectures, and memory/context management
  • Experience with foundation model APIs (e.g., Anthropic Claude, Amazon Nova, OpenAI, or similar); a minimal example call follows this list
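
As one concrete instance of the foundation-model-API bullet, a minimal call with the anthropic Python SDK that asks for structured JSON and parses it deterministically. The model id is a placeholder, and steering via a JSON-only system prompt is just one simple approach.

```python
import json
import anthropic  # assumes the SDK is installed and ANTHROPIC_API_KEY is set

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model id
    max_tokens=512,
    system='Reply with a single JSON object: {"tables": [...], "sql": "..."}',
    messages=[{"role": "user", "content": "Which tables hold daily event counts?"}],
)

# Deterministic parse of the structured reply; malformed output raises here.
payload = json.loads(message.content[0].text)
```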

Memory & Agent Design

  • Understanding of memory architectures (working, episodic, semantic memory); a toy sketch follows this list
  • Familiarity with agent harness patterns (tool routing, guardrails, verification loops, fallback handling)
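
To make the three memory tiers concrete, a toy in-process split; production systems would back these with vector stores and databases, and every name here is an illustrative assumption:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Toy model of the tiers: bounded working set, append-only episodes, keyed facts."""
    working: deque = field(default_factory=lambda: deque(maxlen=10))  # recent turns only
    episodic: list = field(default_factory=list)                      # full interaction log
    semantic: dict = field(default_factory=dict)                      # distilled, long-lived facts

    def remember_turn(self, turn: str) -> None:
        self.working.append(turn)   # oldest turns fall out of the context window
        self.episodic.append(turn)  # but the full history is retained for recall

    def learn_fact(self, key: str, value: str) -> None:
        self.semantic[key] = value  # e.g., "events_table" -> "analytics.daily_events"
```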

AI Tool Proficiency

  • Experience with AI development tools (e.g., GitHub Copilot, Q Developer, ChatGPT, Claude)
  • Experience with spec-driven development and AI-assisted coding workflows

Cloud Technologies

  • Experience with AWS services such as S3, EMR, Lambda, Bedrock, Step Functions
  • Familiarity with monitoring/logging tools (CloudWatch, CloudTrail)
  • Exposure to comparable platforms such as Google Vertex AI

Programming (Python)

  • Strong Python skills for data engineering and automation
  • Ability to write clean, modular, and performant code
  • Understanding of functional programming concepts, concurrency, and memory management (a small concurrency example follows this list)
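
For the concurrency point above, a small example of the style implied: fanning I/O-bound work out across threads with the standard library while keeping the task function modular. The probe function is a stub standing in for real S3 or API calls.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def probe_partition(path: str) -> int:
    """Stub for an I/O-bound task, e.g. listing or sampling an S3 prefix."""
    return len(path)  # placeholder work

paths = [f"s3://example-bucket/events/dt=2024-01-{day:02d}/" for day in range(1, 8)]

# Threads fit I/O-bound fan-out; CPU-bound work would use ProcessPoolExecutor instead.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(probe_partition, p): p for p in paths}
    for done in as_completed(futures):
        print(futures[done], done.result())
```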

SQL

  • Strong proficiency in SQL (window functions, joins, aggregations)
  • Ability to optimize complex queries and handle edge cases (NULLs, duplicates, ordering); a sketch of two NULL edge cases follows this list
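
Two of the NULL edge cases named above, sketched as Spark SQL against a toy table (table and column names are placeholders); the SQL itself is portable to Trino-style engines:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("null-edge-cases").getOrCreate()

spark.createDataFrame(
    [("e1", "web"), ("e2", None), ("e3", "mobile")],
    ["event_id", "channel"],
).createOrReplaceTempView("events")

# COUNT(*) counts rows; COUNT(channel) silently skips NULLs -- a classic source
# of mismatched metrics between two superficially identical queries.
spark.sql("""
    SELECT COUNT(*) AS all_rows, COUNT(channel) AS non_null_channels
    FROM events
""").show()

# Explicit NULL placement makes ordering deterministic across engines.
spark.sql("""
    SELECT event_id, channel
    FROM events
    ORDER BY channel ASC NULLS LAST
""").show()
```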

Nice to Have

  • Experience with agent frameworks and advanced patterns (evaluation harnesses, verification loops)
  • Model fine-tuning techniques (LoRA, PEFT, managed tuning platforms)
  • Vector databases (FAISS, Pinecone, OpenSearch)
  • Containerization and orchestration (Docker, Kubernetes, EKS)
  • Infrastructure as Code (Terraform, CloudFormation)
  • CI/CD tools (Jenkins, GitLab CI, GitHub Actions, ArgoCD)
  • Observability tools (Prometheus, Grafana, ELK stack)
  • Cloud or AI-related certifications

Education / Experience

  • Bachelor's degree in Computer Science, Data Science, Information Systems, or related field, with 2+ years of relevant experience (or equivalent practical experience)
  • Experience delivering enterprise-quality software solutions using object-oriented and database technologies
  • Knowledge of modern software engineering practices (test automation, build automation, configuration management)
  • Strong written and verbal communication skills
  • Ability to build effective working relationships and collaborate across teams
  • Ability to learn new technologies quickly and work in a fast-paced environment

Skills

AWS Bedrock · AWS EMR · AWS Lambda · AWS S3 · Apache Spark · CloudWatch · Docker · EMR-on-EC2 · EMR-on-EKS · FAISS · GitLab CI · GitHub Actions · GitHub Copilot · Google Vertex AI · Grafana · Hive · Infrastructure as Code · Kubernetes · LangChain · LangGraph · LoRA · OpenAI · OpenSearch · PEFT · Pinecone · Prometheus · Python · Q Developer · RAG · SageMaker · SQL · Terraform · Trino · Vector databases · AWS Strands Agents SDK
