Site Reliability Engineer (Kubernetes)
Job description
TL;DR
Site Reliability Engineer (Kubernetes/AWS): Building and maintaining cloud infrastructure for large-scale machine learning on terabytes of biosignal data, with an emphasis on reliability, security, and observability. Focus on designing infrastructure as code, leading cluster upgrades, and enhancing CI/CD pipelines for distributed numerical workloads.
Boston, MA - Remote. In-person office hubs available in Boston, New York, and Paris.
$150,000 – $170,000
Company
Leading at-home EEG platform supporting clinical development of novel therapeutics for neurological, psychiatric, and sleep disorders with FDA-cleared hardware and AI algorithms.
What you will do
- Design and implement infrastructure as code solutions to improve reliability, security, and maintainability of cloud infrastructure.
- Lead major infrastructure initiatives including cluster upgrades, security improvements, and architectural changes.
- Develop and maintain CI/CD pipelines for safe and efficient deployments.
- Improve observability through enhanced monitoring, logging, and alerting.
- Participate in on-call rotation and lead incident response efforts.
- Collaborate with development teams to boost application reliability and performance.
- Maintain security posture through infrastructure hardening and automation.
- Create and maintain documentation for infrastructure, deployments, and incident response.
Requirements
- Strong experience with Kubernetes administration, including cluster management, security, and troubleshooting.
- Proven track record with infrastructure as code using Terraform or similar.
- Experience building and maintaining CI/CD pipelines, particularly with GitHub Actions, Azure DevOps, or ArgoCD.
- Solid understanding of container technologies and build processes, especially Docker.
- Strong cloud provider knowledge (e.g., AWS) including networking, security, and services; Azure is a plus.
- Experience with incident response and on-call in production environments.
- Deep Linux systems administration and debugging; Windows Server familiarity is a plus.
- Proficiency in at least one programming language (Python, Go, TypeScript, etc.).
- Understanding of security and networking concepts including OAuth2/OIDC, DNS, TLS, TCP/UDP.
- Bachelor's degree + 5-8 years of experience in SRE, DevOps, or similar.
Culture & Benefits
- Robust asynchronous work practices for a first-class remote experience.
- In-person office hubs in Boston, New York, and Paris.
- Total compensation includes equity, PTO, and other benefits.
- Culture emphasizes curiosity, simplicity, composability, self-service, and empathy.
- Diverse team focused on robust systems and high impact.