Senior Site Reliability Engineer (AI)
Job description
TL;DR
Senior Site Reliability Engineer (AI): Ensure the reliability, performance, and scalability of AI products, model-serving infrastructure, and backend API systems, with an emphasis on automating operations and improving observability. The focus is on building resilient systems, solving complex infrastructure problems, and supporting AI workloads in production.
Location: Based in the Singapore office; may require up to one trip per year.
Company
is on a global mission to revolutionize the way the world games.
What you will do
- Administer, monitor, and manage cloud-scale production environments for AI model APIs, backend services, and high-traffic web systems serving global users.
- Design and implement fault-tolerant, autoscaling cloud architectures tailored for AI inference workloads, including GPU-based environments and software products.
- Build automated self-recovery systems to ensure high availability, rapid failover, and cost-efficient resource usage for all software products.
- Manage and monitor AI model-serving platforms, inference engines, vector databases, data pipelines, and software applications.
- Implement and maintain comprehensive monitoring, logging, and alerting for all AI and backend services.
- Work closely with software engineering, ML engineering, and release management to enhance operational procedures, deployment processes, and incident response workflows.
Requirements
- 5+ years of relevant experience in SRE, DevOps, infrastructure engineering, or cloud operations.
- Experience operating production services with significant availability or scaling demands.
- Strong knowledge of web technologies such as HTTP, REST, SSL, load balancers, and web proxies (NGINX).
- Comfortable with Linux and Docker administration.
- Basic knowledge of AWS, CI/CD (Jenkins), IaC (Terraform), container orchestration (AWS ECS or K8s), version control (Git), and databases (MySQL, NoSQL).
- Strong ability to code and script (preferably Bash scripting and Python).
- Good analytical skills, with the ability to debug deployment problems without help from developers.
- A Bachelor's or Master's degree in computer science, AI, or a similar discipline from an accredited institution.
Culture & Benefits
- Opportunity to make a global impact while working with a team spread across 5 continents.
- Gamer-centric #LifeAt experience that will put you on a path of accelerated growth, both personally and professionally.
- Inclusive, respectful, and fair workplace for every employee across all the countries we operate in.