Site Reliability Engineer
Job description
TL;DR
Site Reliability Engineer (AWS): Design, develop, and maintain reliable, scalable AWS infrastructure using Infrastructure as Code (IaC) and automation, with an emphasis on observability, incident management, and deployment reliability. The role focuses on building monitoring, alerting, and logging systems, running postmortems and root-cause analysis, and improving CI/CD and deployment automation to reduce downtime and alert fatigue.
Location: Chengdu
Company
The company builds gamer-centric products and operates a global team across multiple continents.
What you will do
- Build and maintain Infrastructure as Code (IaC) using Terraform or AWS CloudFormation.
- Operate and troubleshoot AWS-based infrastructure (compute, containers, networking, storage, databases, messaging).
- Own monitoring, alerting, and logging (e.g., CloudWatch, Prometheus, Grafana, ELK) and apply AIOps for predictive alerting and anomaly detection.
- Handle incidents: on-call support, incident management, postmortems, root cause analysis, and continuous improvement.
- Improve CI/CD pipelines and deployment automation, including zero-downtime and blue/green or canary deployments.
- Collaborate on reliability, scalability, security, performance, and cost-efficiency; automate operations to reduce manual toil.
Requirements
- Bachelor’s degree in Computer Science, Software Engineering, Information Technology, or a related field.
- Minimum 3 years of experience in SRE, DevOps, cloud infrastructure, or system administration.
- Hands-on AWS expertise across EC2/Lambda/ECS/EKS, Auto Scaling, VPC/Route 53/Security Groups, RDS/ElastiCache/Athena/S3, and SQS/SES.
- Strong IaC experience with Terraform and/or AWS CloudFormation.
- Proficiency in at least one scripting/programming language: Python, Node.js, Bash, or Ruby.
- Experience with Linux/Windows and container-based environments, distributed systems, and monitoring/incident management processes.
Culture & Benefits
- Global mission with a team distributed across 5 continents.
- Inclusive, equal-opportunity workplace with accommodations where needed.
- Gamer-centric culture and emphasis on accelerated personal and professional growth.
Hiring process
- Application review followed by interview steps to assess SRE/AWS and reliability practices.