Data Engineer
McLean · Hybrid · Contract
Job Summary
The Data Engineer works with moderate supervision across two equally weighted domains:
- large-scale data pipeline development processing high-volume event data in a cloud environment, and
- design and development of agentic AI systems, including LLM-powered data assistants, MCP servers, and agent harness architectures.
This position contributes to overall product quality throughout the software development lifecycle.
Responsibilities
- Build and maintain ETL/ELT pipelines using Apache Spark, Hive, and Trino across S3-based data lake environments
- Develop and optimize SQL for large-scale datasets, including window functions, multi-table joins, and complex aggregations
- Engineer big data systems (EMR-on-EC2, EMR-on-EKS) and develop solutions on analytical platforms (SageMaker, Domino, Dataiku)
- Participate in data quality monitoring, anomaly detection, and production incident investigation
- Develop AI agent systems using AWS Bedrock and agent frameworks (e.g., Strands Agents SDK, LangChain/LangGraph, or equivalent)
- Build agent harness architectures combining LLM reasoning with deterministic execution (e.g., RAG-based SQL generation and structured output validation; a sketch follows this list)
- Implement agent memory, context management, and tool integration (MCP servers, API connectors, data catalog lookups)
- Build evaluation frameworks for agent accuracy (e.g., paraphrase robustness, routing precision, structural consistency)
- Stay informed of advances in LLM frameworks and emerging AI capabilities
- Write clean, well-tested code; contribute to CI/CD pipelines and infrastructure-as-code on AWS
- Ensure secure handling of sensitive data across both data pipelines and AI agent outputs, including auditable execution traces
- Adhere to internal standards for secure development practices and technology policies
- Partner across teams, communicate technical information effectively, and maintain documentation
- Actively learn from senior team members and contribute to process improvement
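For illustration, a minimal sketch of the RAG-based SQL generation pattern referenced in the responsibilities above, assuming Python with pydantic for structured-output validation; GeneratedQuery and propose_sql are hypothetical names, and the model call is stubbed rather than a prescribed implementation:

```python
# An agent-harness step: the LLM proposes SQL as structured JSON output, and
# a deterministic validator accepts or rejects it before execution.
import json
import re

from pydantic import BaseModel, field_validator

class GeneratedQuery(BaseModel):
    """Structured output schema the model's reply must satisfy."""
    sql: str
    tables_used: list[str]
    rationale: str

    @field_validator("sql")
    @classmethod
    def read_only(cls, v: str) -> str:
        # Deterministic guardrail: accept SELECT statements only.
        if not re.match(r"^\s*SELECT\b", v, flags=re.IGNORECASE):
            raise ValueError("only SELECT statements are permitted")
        return v

def propose_sql(question: str, schema_snippets: list[str]) -> str:
    """Stand-in for a foundation-model call (e.g., via Bedrock).

    A real implementation would place retrieved schema context into the
    prompt and request JSON output; here a canned response is returned.
    """
    return json.dumps({
        "sql": "SELECT user_id, COUNT(*) AS events FROM clickstream GROUP BY user_id",
        "tables_used": ["clickstream"],
        "rationale": "Aggregate event counts per user.",
    })

raw = propose_sql("How many events per user?", ["clickstream(user_id, ts, ...)"])
query = GeneratedQuery.model_validate_json(raw)  # raises on malformed or unsafe output
print(query.sql)
```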
Essential Technical Skills
Data Engineering & Big Data Technologies
- Experience building data pipelines using Apache Spark (PySpark preferred) and SQL
- Experience with SQL query engines (Hive, Trino/Presto, or similar) and cloud platforms (AWS S3, EMR, Lambda)
- Understanding of data skew, large-scale data processing challenges, and debugging strategies
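As one illustration of the data-skew point above, a minimal PySpark sketch of key salting for a skewed join; the table and column names are hypothetical:

```python
# Spread a hot join key across N salt buckets on the large side, and
# replicate the small side once per bucket, so no single task owns the key.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-salting-sketch").getOrCreate()

SALT_BUCKETS = 8
events = spark.range(100_000).select(F.lit("hot_key").alias("k"), F.col("id"))
dim = spark.createDataFrame([("hot_key", "segment_a")], ["k", "segment"])

salted_events = events.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
salts = spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
salted_dim = dim.crossJoin(salts)  # one copy of each dim row per salt value

joined = salted_events.join(salted_dim, ["k", "salt"]).drop("salt")
print(joined.count())
```

Spark 3's adaptive query execution (spark.sql.adaptive.skewJoin.enabled) automates much of this; the manual form is shown for clarity.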
Generative AI & Agentic Systems
- Experience building LLM-powered agent systems that use tools and produce structured outputs
- Hands-on experience with agent frameworks (LangChain, LangGraph, AWS Strands, or equivalent)
- Knowledge of prompt engineering, RAG architectures, and memory/context management
- Experience with foundation model APIs (e.g., Anthropic Claude, Amazon Nova, OpenAI, or similar)
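For reference, a minimal foundation-model call through the Bedrock Converse API via boto3; the region and model ID are placeholders:

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
    system=[{"text": "You answer questions about a data catalog."}],
    messages=[{"role": "user",
               "content": [{"text": "Which table holds clickstream events?"}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.0},
)
print(response["output"]["message"]["content"][0]["text"])
```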
Memory & Agent Design
- Understanding of memory architectures (working, episodic, semantic memory)
- Familiarity with agent harness patterns (tool routing, guardrails, verification loops, fallback handling)
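A toy illustration of the three memory tiers named above; the class and method names are hypothetical rather than any specific framework's API:

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    working: list[str] = field(default_factory=list)         # current-turn scratchpad
    episodic: list[str] = field(default_factory=list)        # append-only interaction log
    semantic: dict[str, str] = field(default_factory=dict)   # distilled, reusable facts

    def end_turn(self) -> None:
        # Flush working memory into the episodic log between turns.
        self.episodic.extend(self.working)
        self.working.clear()

    def remember_fact(self, key: str, fact: str) -> None:
        self.semantic[key] = fact

mem = AgentMemory()
mem.working.append("user asked about Q3 revenue tables")
mem.remember_fact("revenue_table", "finance.q3_revenue")
mem.end_turn()
print(mem.episodic, mem.semantic)
```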
AI Tool Proficiency
- Experience with AI development tools (e.g., GitHub Copilot, Q Developer, ChatGPT, Claude)
- Experience with spec-driven development and AI-assisted coding workflows
Cloud Technologies
- Experience with AWS services such as S3, EMR, Lambda, Bedrock, Step Functions
- Familiarity with monitoring/logging tools (CloudWatch, CloudTrail)
- Exposure to comparable platforms such as Google Vertex AI
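As a small example of the monitoring line above, emitting a custom data-quality metric to CloudWatch with boto3; the namespace, metric, and dimension names are illustrative only:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
cloudwatch.put_metric_data(
    Namespace="DataPipelines/Quality",           # illustrative namespace
    MetricData=[{
        "MetricName": "NullRate",
        "Dimensions": [{"Name": "Table", "Value": "clickstream"}],
        "Value": 0.002,                          # fraction of NULLs observed
        "Unit": "None",
    }],
)
```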
Programming (Python)
- Strong Python skills for data engineering and automation
- Ability to write clean, modular, and performant code
- Understanding of functional programming concepts, concurrency, and memory management
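One way the concurrency point might look in practice: an I/O-bound fan-out with a thread pool; fetch_partition is a hypothetical stand-in for an S3 read or API call:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_partition(key: str) -> tuple[str, int]:
    # Stand-in for network I/O; returns (key, rows read).
    return key, len(key) * 100

keys = [f"s3://bucket/dt=2024-01-{d:02d}/" for d in range(1, 8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(fetch_partition, k): k for k in keys}
    for fut in as_completed(futures):
        key, rows = fut.result()
        print(f"{key}: {rows} rows")
```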
SQL
- Strong proficiency in SQL (window functions, joins, aggregations)
- Ability to optimize complex queries and handle edge cases (NULLs, duplicates, ordering)
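For illustration, a window-function dedupe that also handles the NULL-ordering edge case, written as Spark SQL; the table and columns are hypothetical:

```python
# Keep the latest record per key; NULL timestamps sort last so a non-NULL
# row always wins when one exists.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedupe-sketch").getOrCreate()
spark.createDataFrame(
    [(1, "a", "2024-01-02"), (1, "b", None), (2, "c", "2024-01-01")],
    ["user_id", "payload", "updated_at"],
).createOrReplaceTempView("raw_events")

latest = spark.sql("""
    SELECT user_id, payload, updated_at
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id
                   ORDER BY updated_at DESC NULLS LAST
               ) AS rn
        FROM raw_events
    ) t
    WHERE rn = 1
""")
latest.show()
```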
Nice to Have
- Experience with agent frameworks and advanced patterns (evaluation harnesses, verification loops)
- Model fine-tuning techniques (LoRA, PEFT, managed tuning platforms)
- Vector databases (FAISS, Pinecone, OpenSearch); a FAISS sketch follows this list
- Containerization and orchestration (Docker, Kubernetes, EKS)
- Infrastructure as Code (Terraform, CloudFormation)
- CI/CD tools (Jenkins, GitLab CI, GitHub Actions, ArgoCD)
- Observability tools (Prometheus, Grafana, ELK stack)
- Cloud or AI-related certifications
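As a pointer for the vector-database item above, a minimal FAISS similarity-search sketch; the random vectors stand in for real embeddings (e.g., from an embedding model):

```python
import faiss
import numpy as np

dim = 64
rng = np.random.default_rng(0)
corpus = rng.standard_normal((1000, dim)).astype("float32")

index = faiss.IndexFlatL2(dim)   # exact L2 search; no training step required
index.add(corpus)

query = rng.standard_normal((1, dim)).astype("float32")
distances, ids = index.search(query, 5)   # top-5 nearest neighbors
print(ids[0], distances[0])
```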
Education / Experience
- Bachelor's degree in Computer Science, Data Science, Information Systems, or related field, with 2+ years of relevant experience (or equivalent practical experience)
- Experience delivering enterprise-quality software solutions using object-oriented and database technologies
- Knowledge of modern software engineering practices (test automation, build automation, configuration management)
- Strong written and verbal communication skills
- Ability to build effective working relationships and collaborate across teams
- Ability to learn new technologies quickly and work in a fast-paced environment
Skills
AWS Bedrock, AWS EMR, AWS Lambda, AWS S3, Apache Spark, CloudWatch, Docker, EMR-on-EC2, EMR-on-EKS, FAISS, GitLab CI, GitHub Actions, GitHub Copilot, Google Vertex AI, Grafana, Hive, Infrastructure as Code, Kubernetes, LangChain, LangGraph, LoRA, OpenAI, OpenSearch, PEFT, Pinecone, Prometheus, Python, Q Developer, RAG, SageMaker, SQL, Terraform, Trino, Vector databases, AWS Strands Agents SDK