Autonomous Driving Multimodal Model Algorithm Engineer

Black Sesame Technologies Inc

San Jose · On-site · Full-time · Mid Level · 2mo ago

About the role

About Black Sesame Technologies

Black Sesame Technologies is building high-performance AI algorithms and self-developed chips for intelligent driving and beyond.

Role Overview

As an Autonomous Driving Multimodal Model Algorithm Engineer, you will work on next-generation multimodal AI models for autonomous driving, including Vision-Language Models, Vision-Language-Action Models, and World Models. You will collaborate with perception, prediction, planning, data, simulation, and deployment teams to integrate multimodal models with existing BEV perception, two-stage E2E, and one-stage E2E autonomous driving systems.

We are looking for candidates with hands-on experience in one or more of the following areas: Vision-Language Models, Vision-Language-Action Models, World Models.

Responsibilities

Multimodal Model Development for Autonomous Driving

Work on one or more multimodal modeling directions for autonomous driving, including VLM-based scene understanding, VLA-style planning-oriented modeling, and World Model-based future prediction.
Develop and optimize models that reason over multi-camera images, BEV features, map elements, object/lane instances, occupancy, trajectories, ego-motion, and driving context.
Explore model architectures that connect perception, prediction, planning, and decision-making in two-stage and one-stage E2E autonomous driving systems.
Collaborate with BEV perception and planning teams to improve representation quality, temporal consistency, long-tail robustness, and planning relevance.

Vision-Language and Vision-Language-Action Modeling

Develop VLM-based methods for driving scene understanding, open-vocabulary perception, risk reasoning, corner-case analysis, and interpretable autonomy.
Adapt and extend open-source multimodal architectures such as LLaVA, Qwen-VL, InternVL, MiniCPM-V, OpenVLA, or similar models for autonomous driving scenarios.
Research VLA-style models that map multimodal driving context, navigation intent, and high-level instructions to trajectories, actions, or planning representations.
Align visual, BEV, map, object, lane, occupancy, trajectory, and language representations for driving-specific tasks.
Build supervised fine-tuning, instruction-tuning, and efficient adaptation pipelines for driving-relevant multimodal tasks.

World Model and Future Prediction

Build world-model-based approaches for future BEV, occupancy, object motion, lane evolution, traffic interaction, and ego-conditioned scene rollout.
Explore generative and predictive modeling methods such as diffusion models, autoregressive transformers, latent dynamics models, video prediction, and BEV prediction.
Use learned world models for scenario generation, counterfactual reasoning, long-tail case mining, planning evaluation, and closed-loop analysis.
Work with simulation and data teams to improve safety-critical scenario discovery and model-based evaluation.

Efficient Adaptation and Deployment

Apply efficient fine-tuning and adaptation methods such as LoRA, QLoRA, adapters, prompt tuning, prefix tuning, or other PEFT techniques.
Develop multimodal feature alignment modules, including projection heads, query adapters, cross-attention modules, tokenization strategies, and representation converters.
Optimize model architecture, latency, memory footprint, and compute cost for automotive deployment.
Apply distillation, quantization, pruning, sparse computation, and efficient attention methods where appropriate.
Collaborate with chip, compiler, runtime, and deployment teams to adapt multimodal models to in-house automotive AI hardware.

Research, Evaluation, and Iteration

Track the latest research in VLM, VLA, World Models, BEV perception, E2E driving, robotics foundation models, generative simulation, and multimodal learning.
Design evaluation metrics for reasoning quality, grounding accuracy, temporal consistency, prediction quality, planning relevance, and safety-critical scenarios.
Perform systematic failure analysis and drive data/model iteration based on real-world autonomous driving cases.
Contribute to patents, technical reports, internal research platforms, and conference or journal publications.

Qualifications

MS or PhD in Computer Science, Electrical Engineering, Robotics, Artificial Intelligence, or a related field.
Strong background in deep learning, computer vision, multimodal learning, robotics, or autonomous driving.
Hands-on experience in one or more of the following areas:
- Vision-Language Models, multimodal large models, or open-source VLM adaptation
- Vision-Language-Action models, robotics foundation models, or action-conditioned modeling
- World models, generative prediction, latent dynamics modeling, or future scene simulation
- BEV perception, multi-view 3D perception, or end-to-end autonomous driving
- Motion prediction, planning, trajectory generation, or closed-loop evaluation
Practical experience with open-source multimodal architectures such as LLaVA, Qwen-VL, InternVL, MiniCPM-V, OpenVLA, BLIP-style models, Flamingo-style models, or similar systems.
Solid understanding of multimodal feature alignment, including vision-language alignment, cross-modal attention, visual tokenization, projection layers, query-based fusion, or embedding-space alignment.
Experience with efficient fine-tuning or adaptation methods, such as LoRA, QLoRA, Adapter, Prompt Tuning, Prefix Tuning, supervised fine-tuning, or instruction tuning.
Proficient in PyTorch and capable of modifying, training, debugging, and evaluating deep learning models.
Familiar with transformer architectures, attention mechanisms, temporal modeling, and large-scale training.
Experience with multimodal data, such as camera, radar, LiDAR, IMU, map, trajectory, language, or structured driving data.
Strong engineering ability in Python; C++/CUDA/TensorRT experience is a plus.
Comfortable with Git, Docker, Linux, distributed training, and collaborative development workflows.
Strong communication skills and ability to work across perception, planning, data, simulation, and deployment teams.

Preferred Qualifications

Experience adapting or fine-tuning VLM/VLA models such as LLaVA, Qwen-VL, InternVL, MiniCPM-V, OpenVLA, or similar architectures.
Experience with Hugging Face Transformers, PEFT, DeepSpeed, FSDP, vLLM, SGLang, TensorRT-LLM, or similar training/inference frameworks.
Experience building multimodal instruction datasets, driving-scene QA datasets, grounding datasets, scene-reasoning datasets, or planner-oriented supervision signals.
Experience aligning multimodal model representations with BEV features, object queries, lane instances, occupancy grids, map vectors, trajectories, or planner inputs.
Experience with autonomous driving architectures such as BEVFormer, DETR/DINO, MapTR/MapQR, occupancy networks, diffusion planners, trajectory transformers, or similar models.
Experience with world models, generative models, video prediction, future BEV prediction, occupancy forecasting, learned simulation, or closed-loop evaluation.
Experience with efficient adaptation of large models, including LoRA/QLoRA, distillation, quantization, pruning, sparse attention, or lightweight adapter design.
Experience deploying deep learning models on automotive SoCs, ASICs, GPUs, or edge AI accelerators.
Publications at, or strong project experience relevant to, venues such as CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, CoRL, ICRA, IROS, RSS, or related autonomous driving and robotics venues.
Strong ability to convert research ideas into robust production systems.
Experience with AI agent tools and basic harness engineering, including building evaluation scripts, task runners, automated workflows, tool-use pipelines, and reproducible testing environments for model or agent development.

Skills

AWS Lambda, BLIP, BEV, C++, CUDA, Docker, DETR, DeepSpeed, DINO, Diffusion models, FSDP, Flamingo, Git, Hugging Face Transformers, ICML, ICRA, ICCV, IMU, InternVL, LLaVA, Linux, LoRA, MapTR, MiniCPM-V, NeurIPS, OpenVLA, PyTorch, QLoRA, Qwen-VL, RSS, Robotics, TensorRT, TensorRT-LLM, Transformer architectures, VLA, VLM, vLLM
