TL;DR
Principal Software Engineer (AI): Advancing the core capabilities of hirify.global Advertising's ad-serving infrastructure, which powers advertising across Bing Search, MSN, hirify.global Start, and shopping experiences in the Edge browser, with an emphasis on GPU/CPU inference, real-time bidding, and intelligent ranking pipelines. The focus is on designing and optimizing high-performance serving systems and GPU inference frameworks that deliver measurable latency improvements and cost efficiency.
Location: Mountain View, United States (candidates must be located there)
Salary: USD $139,900 – $304,200 per year.
Company
hirify.global is an equal opportunity employer.
What you will do
- Design and lead the development of large-scale, distributed online serving systems, including GPU-accelerated and CPU-based ranking/inference pipelines.
- Architect and optimize end-to-end inference infrastructure, including model serving, batching/streaming, caching, scheduling, and resource orchestration across heterogeneous hardware (see the batching sketch after this list).
- Profile and optimize performance across the full stack from CUDA kernels and GPU pipelines to CPU threads and OS-level scheduling.
- Own live-site reliability as the DRI (directly responsible individual): design telemetry, alerting, and fault-tolerance mechanisms.
- Collaborate and mentor across teams: drive architecture reviews, enforce engineering excellence, and promote system-level optimization practices.
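In practice, the batching/streaming work referenced above often centers on a micro-batching loop: accumulate requests until a batch-size or deadline threshold is hit, then dispatch the batch to the accelerator. Below is a minimal C++ sketch of that loop; `Request` and `runBatch` are hypothetical placeholders, not anything named in this posting.

```cpp
// Minimal dynamic-batching loop: a sketch, not production code.
// `Request` and `runBatch` are hypothetical placeholders.
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

struct Request { int id; /* features, reply promise, ... */ };

void runBatch(const std::vector<Request>& batch) { /* hypothetical model call */ }

class DynamicBatcher {
 public:
  DynamicBatcher(size_t max_batch, std::chrono::milliseconds max_wait)
      : max_batch_(max_batch), max_wait_(max_wait) {}

  void submit(Request r) {
    std::lock_guard<std::mutex> lk(mu_);
    pending_.push(std::move(r));
    cv_.notify_one();
  }

  // Runs on a dedicated thread: dispatch when the batch is full or the
  // oldest request has waited max_wait_, whichever comes first.
  void loop() {
    for (;;) {
      std::unique_lock<std::mutex> lk(mu_);
      cv_.wait(lk, [&] { return !pending_.empty(); });
      auto deadline = std::chrono::steady_clock::now() + max_wait_;
      while (pending_.size() < max_batch_ &&
             cv_.wait_until(lk, deadline) != std::cv_status::timeout) {}
      std::vector<Request> batch;
      while (!pending_.empty() && batch.size() < max_batch_) {
        batch.push_back(std::move(pending_.front()));
        pending_.pop();
      }
      lk.unlock();
      runBatch(batch);  // the latency/throughput trade-off lives here
    }
  }

 private:
  size_t max_batch_;
  std::chrono::milliseconds max_wait_;
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<Request> pending_;
};
```

Tuning `max_batch` and `max_wait` is exactly the latency-vs-throughput trade-off called out in the requirements below.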
Requirements
- Bachelor’s Degree in Computer Science or a related technical field AND 6+ years of technical engineering experience developing high-performance, distributed systems in C++.
- Industry experience in advertising or search engine backend systems, such as large-scale ad ranking, real-time bidding (RTB), or relevance-serving infrastructure.
- Hands-on experience with real-time data streaming systems (Kafka, Flink, Spark Streaming), feature-store integration, and multi-region deployment for low-latency, globally distributed services.
- Deep expertise in GPU inference frameworks such as NVIDIA Triton Inference Server, CUDA, and TensorRT, including hands-on experience implementing custom CUDA kernels.
- Expertise in low-level system and OS internals, including multi-threading, process scheduling, NUMA-aware memory allocation, lock-free data structures, context switching, and I/O stack tuning (see the lock-free queue sketch after this list).
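To make the lock-free requirement concrete, here is a single-producer/single-consumer ring buffer built on `std::atomic` with acquire/release ordering; a minimal sketch of the textbook technique, not code from any system mentioned here.

```cpp
// Single-producer / single-consumer lock-free ring buffer.
// Sketch only: capacity must be a power of two; one producer, one consumer.
#include <atomic>
#include <cstddef>
#include <optional>
#include <vector>

template <typename T>
class SpscQueue {
 public:
  explicit SpscQueue(size_t capacity_pow2)
      : buf_(capacity_pow2), mask_(capacity_pow2 - 1) {}

  bool push(T v) {  // called only by the producer thread
    size_t head = head_.load(std::memory_order_relaxed);
    size_t tail = tail_.load(std::memory_order_acquire);
    if (head - tail == buf_.size()) return false;  // full
    buf_[head & mask_] = std::move(v);
    head_.store(head + 1, std::memory_order_release);  // publish the slot
    return true;
  }

  std::optional<T> pop() {  // called only by the consumer thread
    size_t tail = tail_.load(std::memory_order_relaxed);
    size_t head = head_.load(std::memory_order_acquire);
    if (tail == head) return std::nullopt;  // empty
    T v = std::move(buf_[tail & mask_]);
    tail_.store(tail + 1, std::memory_order_release);  // free the slot
    return v;
  }

 private:
  std::vector<T> buf_;
  size_t mask_;
  std::atomic<size_t> head_{0};  // next write slot (producer-owned)
  std::atomic<size_t> tail_{0};  // next read slot (consumer-owned)
};
```

The acquire/release pairing ensures the consumer never reads a slot before the producer's write is visible, and the producer never overwrites a slot the consumer is still reading; a production version would also pad `head_` and `tail_` to separate cache lines to avoid false sharing.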
Nice to have
- Master’s Degree in Computer Science or a related technical field AND 8+ years of technical engineering experience developing high-performance, distributed systems in C++.
- Familiarity with LLM inference optimization: model sharding, tensor/KV-cache parallelism, paged attention, continuous batching, quantization (AWQ/FP8), and hybrid CPU–GPU orchestration.
- Strong understanding of model-serving trade-offs: batching vs. streaming, latency vs. throughput, quantization (FP16/BF16/INT8), dynamic batching, continuous model rollout, and adaptive inference scheduling across CPU/GPU tiers (see the quantization sketch after this list).
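As a concrete illustration of the quantization trade-off in the last item, symmetric per-tensor INT8 quantization can be sketched in a few lines. This is the generic scheme, not the AWQ/FP8 recipes mentioned above, and `QuantizedTensor` is a hypothetical type.

```cpp
// Symmetric per-tensor INT8 quantization: a sketch of the basic idea.
// Real serving stacks add per-channel scales, calibration, AWQ/FP8, etc.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedTensor {
  std::vector<int8_t> data;
  float scale;  // real value ~ scale * int8 value
};

QuantizedTensor quantize_int8(const std::vector<float>& x) {
  float max_abs = 0.f;
  for (float v : x) max_abs = std::max(max_abs, std::fabs(v));
  float scale = max_abs > 0.f ? max_abs / 127.f : 1.f;
  QuantizedTensor q{std::vector<int8_t>(x.size()), scale};
  for (size_t i = 0; i < x.size(); ++i) {
    float r = std::round(x[i] / scale);
    q.data[i] = static_cast<int8_t>(std::clamp(r, -127.f, 127.f));
  }
  return q;
}

float dequantize(const QuantizedTensor& q, size_t i) {
  return q.scale * static_cast<float>(q.data[i]);
}
```

Per-tensor symmetric scaling is the simplest variant: it quarters the memory footprint relative to FP32 at the cost of rounding error, which is why production stacks typically move to per-channel scales and calibration data to recover accuracy.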
Culture & Benefits
- The typical base pay range for this role across the U.S. is USD $139,900 – $274,800 per year.
- The base pay range for this role in the San Francisco Bay area and New York City metropolitan area is USD $188,000 – $304,200 per year.
- Certain roles may be eligible for benefits and other compensation.