Autonomous Driving Multimodal Model Algorithm Engineer

Black Sesame Technologies Inc

San Jose · On-site · Full-time · Mid Level · Posted 3w ago

About the role

About Black Sesame Technologies

Black Sesame Technologies is building high-performance AI algorithms and self-developed chips for intelligent driving and beyond.

Role Overview

As an Autonomous Driving Multimodal Model Algorithm Engineer, you will work on next-generation multimodal AI models for autonomous driving, including Vision-Language Models, Vision-Language-Action Models, and World Models. You will collaborate with perception, prediction, planning, data, simulation, and deployment teams to integrate multimodal models with existing BEV perception, two-stage E2E, and one-stage E2E autonomous driving systems.

We are looking for candidates with hands-on experience in one or more of the following areas: Vision-Language Models, Vision-Language-Action Models, World Models.

Responsibilities

Multimodal Model Development for Autonomous Driving

  • Work on one or more multimodal modeling directions for autonomous driving, including VLM-based scene understanding, VLA-style planning-oriented modeling, and World Model-based future prediction.
  • Develop and optimize models that reason over multi-camera images, BEV features, map elements, object/lane instances, occupancy, trajectories, ego-motion, and driving context.
  • Explore model architectures that connect perception, prediction, planning, and decision-making in two-stage and one-stage E2E autonomous driving systems.
  • Collaborate with BEV perception and planning teams to improve representation quality, temporal consistency, long-tail robustness, and planning relevance.

Vision-Language and Vision-Language-Action Modeling

  • Develop VLM-based methods for driving scene understanding, open-vocabulary perception, risk reasoning, corner-case analysis, and interpretable autonomy.
  • Adapt and extend open-source multimodal architectures such as LLaVA, Qwen-VL, InternVL, MiniCPM-V, OpenVLA, or similar models for autonomous driving scenarios.
  • Research VLA-style models that map multimodal driving context, navigation intent, and high-level instructions to trajectories, actions, or planning representations.
  • Align visual, BEV, map, object, lane, occupancy, trajectory, and language representations for driving-specific tasks.
  • Build supervised fine-tuning, instruction-tuning, and efficient adaptation pipelines for driving-relevant multimodal tasks.

World Model and Future Prediction

  • Build world-model-based approaches for future BEV, occupancy, object motion, lane evolution, traffic interaction, and ego-conditioned scene rollout.
  • Explore generative and predictive modeling methods such as diffusion models, autoregressive transformers, latent dynamics models, video prediction, and BEV prediction.
  • Use learned world models for scenario generation, counterfactual reasoning, long-tail case mining, planning evaluation, and closed-loop analysis.
  • Work with simulation and data teams to improve safety-critical scenario discovery and model-based evaluation.

Efficient Adaptation and Deployment

  • Apply efficient fine-tuning and adaptation methods such as LoRA, QLoRA, adapters, prompt tuning, prefix tuning, or other PEFT techniques.
  • Develop multimodal feature alignment modules, including projection heads, query adapters, cross-attention modules, tokenization strategies, and representation converters.
  • Optimize model architecture, latency, memory footprint, and compute cost for automotive deployment.
  • Apply distillation, quantization, pruning, sparse computation, and efficient attention methods where appropriate.
  • Collaborate with chip, compiler, runtime, and deployment teams to adapt multimodal models to in-house automotive AI hardware.

Research, Evaluation, and Iteration

  • Track the latest research in VLM, VLA, World Models, BEV perception, E2E driving, robotics foundation models, generative simulation, and multimodal learning.
  • Design evaluation metrics for reasoning quality, grounding accuracy, temporal consistency, prediction quality, planning relevance, and safety-critical scenarios.
  • Perform systematic failure analysis and drive data/model iteration based on real-world autonomous driving cases.
  • Contribute to patents, technical reports, internal research platforms, and conference or journal publications.

Qualifications

  • MS or PhD in Computer Science, Electrical Engineering, Robotics, Artificial Intelligence, or a related field.
  • Strong background in deep learning, computer vision, multimodal learning, robotics, or autonomous driving.
  • Hands-on experience in one or more of the following areas:
    • Vision-Language Models, multimodal large models, or open-source VLM adaptation
    • Vision-Language-Action models, robotics foundation models, or action-conditioned modeling
    • World models, generative prediction, latent dynamics modeling, or future scene simulation
    • BEV perception, multi-view 3D perception, or end-to-end autonomous driving
    • Motion prediction, planning, trajectory generation, or closed-loop evaluation
  • Practical experience with open-source multimodal architectures such as LLaVA, Qwen-VL, InternVL, MiniCPM-V, OpenVLA, BLIP-style models, Flamingo-style models, or similar systems.
  • Solid understanding of multimodal feature alignment, including vision-language alignment, cross-modal attention, visual tokenization, projection layers, query-based fusion, or embedding-space alignment.
  • Experience with fine-tuning or efficient adaptation methods, such as LoRA, QLoRA, adapters, prompt tuning, prefix tuning, supervised fine-tuning, or instruction tuning.
  • Proficient in PyTorch and capable of modifying, training, debugging, and evaluating deep learning models.
  • Familiar with transformer architectures, attention mechanisms, temporal modeling, and large-scale training.
  • Experience with multimodal data, such as camera, radar, LiDAR, IMU, map, trajectory, language, or structured driving data.
  • Strong engineering ability in Python; C++/CUDA/TensorRT experience is a plus.
  • Comfortable with Git, Docker, Linux, distributed training, and collaborative development workflows.
  • Strong communication skills and ability to work across perception, planning, data, simulation, and deployment teams.

Preferred Qualifications

  • Experience adapting or fine-tuning VLM/VLA models such as LLaVA, Qwen-VL, InternVL, MiniCPM-V, OpenVLA, or similar architectures.
  • Experience with Hugging Face Transformers, PEFT, DeepSpeed, FSDP, vLLM, SGLang, TensorRT-LLM, or similar training/inference frameworks.
  • Experience building multimodal instruction datasets, driving-scene QA datasets, grounding datasets, scene-reasoning datasets, or planner-oriented supervision signals.
  • Experience aligning multimodal model representations with BEV features, object queries, lane instances, occupancy grids, map vectors, trajectories, or planner inputs.
  • Experience with autonomous driving architectures such as BEVFormer, DETR/DINO, MapTR/MapQR, occupancy networks, diffusion planners, trajectory transformers, or similar models.
  • Experience with world models, generative models, video prediction, future BEV prediction, occupancy forecasting, learned simulation, or closed-loop evaluation.
  • Experience with efficient adaptation of large models, including LoRA/QLoRA, distillation, quantization, pruning, sparse attention, or lightweight adapter design.
  • Experience deploying deep learning models on automotive SoCs, ASICs, GPUs, or edge AI accelerators.
  • Publications at CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, CoRL, ICRA, IROS, RSS, or related autonomous driving and robotics venues, or comparably strong project experience.
  • Strong ability to convert research ideas into robust production systems.
  • Experience with AI agent tools and basic harness engineering, including building evaluation scripts, task runners, automated workflows, tool-use pipelines, and reproducible testing environments for model or agent development.

Skills

AWS Lambda, BLIP, BEV, C++, CUDA, Docker, DETR, DeepSpeed, DINO, Diffusion models, FSDP, Flamingo, Git, Hugging Face Transformers, ICML, ICRA, ICCV, IMU, InternVL, LLaVA, Linux, LoRA, MapTR, MiniCPM-V, NeurIPS, OpenVLA, PyTorch, QLoRA, Qwen-VL, RSS, Robotics, TensorRT, TensorRT-LLM, Transformer architectures, VLA, VLM, vLLM
