Мэтч & Сопровод
Покажет вашу совместимость и напишет письмо
Описание вакансии
Founding Infrastructure Engineer
Company
Metagov
Conditions
21 hours agoSeniorSalary: 50K - 50KFully remote (preference for Europe, United States, or Canada) Remote Contract Devops Jobs by Metagov
Skills
Tracing Capacity Planning Fault Tolerance Compute Incident Response Performance Load Testing Gpu Sre Distributed Systems Open Source Monitoring Orchestration Observability Aws Mcp Vllm Analytics Model Routing Fallback Design Openwebui Cscs Infomaniak Agent Tooling Routing Transparency Endpoint Provenance Hpc Multi-Region Deployment
About the Role
You will be the technical owner of the platform's operational backbone. You will harden the platform for major launches, perform load testing, and build fallback routing and per-agent monitoring. You will implement end-to-end observability and integrated trace analysis across heterogeneous infrastructure, ship downtime warnings and fallback behavior, and implement routing transparency and endpoint provenance so users can verify which backend served their inference. You will improve performance of public endpoints, integrate programmatic infrastructure interfaces such as an MCP server, and make the utility more transparent and contributable. You will set priorities autonomously, operate production inference and ML serving infrastructure, and coordinate with cloud providers, HPC centers, and other infrastructure partners. Occasional travel for team workshops may be required.
Requirements
- Significant experience operating production inference or ML serving infrastructure (vLLM, model routing, multi-region deployments, GPU-backed services)
- Strong distributed systems and SRE instincts including observability, incident response, fallback design, and capacity planning
- Comfort working across heterogeneous infrastructure partners including cloud providers and HPC centers
- Experience orchestrating many stacks and integrating open-source projects
- Maintainer and integrator experience with pride in operational excellence
- Ability to work autonomously in a small team and travel occasionally for workshops
Responsibilities
- Harden platform for launches
- Perform load testing
- Build fallback routing
- Set up per-agent monitoring
- Build end-to-end observability across stacks
- Ship downtime warnings and fallback behavior
- Implement routing transparency and endpoint provenance
- Improve production service performance
- Integrate MCP server or programmatic infrastructure interfaces
- Make infrastructure transparent and contributable
- Operate and maintain production inference and ML serving infrastructure
- Coordinate with heterogeneous infrastructure partners
- Orchestrate and integrate multiple open-source stacks
Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →
Текст вакансии взят без изменений