TL;DR
Lead ML Systems Engineer (Voice AI): Own the architecture, performance, and scalability of hirify.global Cloud's real-time Voice AI serving infrastructure. The role focuses on transforming state-of-the-art research models into highly optimized, reliable, and cost-efficient production systems that power latency-sensitive, mission-critical Voice AI services, and calls for deep systems thinking and long-term architectural ownership.
Location: Hybrid in Armenia
Company
hirify.global develops AI-powered voice clarity software.
What you will do
- Prototype, implement, and benchmark critical components of the serving stack.
- Architect and implement inference and serving strategies that define how models are packaged, deployed, replicated, batched, scheduled, and optimized under real-time constraints.
- Partner with Research and Platform teams to drive deep performance optimization across runtime, precision (FP16/INT8/FP8), batching strategies, and GPU execution.
- Lead root cause analysis of systemic performance regressions and implement structural improvements.
- Drive alignment between model design and production constraints, ensuring research translates into performant, scalable, cost-effective systems.
- Shape the long-term architectural direction for Voice AI serving infrastructure through both implementation and strategic design.
Requirements
- 5+ years building performance-critical backend or distributed systems.
- Hands-on experience deploying and operating ML inference systems in production environments.
- Experience working on latency-sensitive or real-time services.
- Strong systems background (distributed systems, networking, concurrency, performance engineering).
- Hands-on experience deploying and optimizing GPU-based inference systems in production (TensorRT or similar runtimes; graph optimization, precision tuning, memory optimization, CUDA-level profiling).
- Strong programming skills in Python and/or C++.
Nice to have
- Experience optimizing ASR or TTS systems for real-time production workloads.
- Experience with streaming inference and low-latency (<200ms) systems.
- Experience building cost-efficient inference infrastructure at scale.
- Familiarity with CUDA internals or custom kernel optimization.
Culture & Benefits
- We are an Equal Opportunity Employer.
- We treat each other with respect and empathy.