Site Reliability Engineer (SRE) (AI)
Job description
TL;DR
Site Reliability Engineer (SRE) (AI): Design and maintain highly reliable, observable, and secure cloud and edge infrastructure supporting AI-driven products, with an emphasis on observability, Kubernetes orchestration, and system security. The role focuses on automating operational tasks, managing SLOs/SLIs, and ensuring system resilience through proactive monitoring and incident response.
Location: On-site in Taipei City, Taiwan
Company
An innovative technology company operating large-scale cloud and edge infrastructure that powers AI-driven products and services.
What you will do
- Design and maintain monitoring, alerting, and dashboarding systems to build visibility into system health via metrics, logs, and traces.
- Deploy, manage, and optimize containerized workloads running on Kubernetes across production and edge environments.
- Implement secure access controls and monitor for cybersecurity threats and service disruptions.
- Automate repetitive operational tasks and build tooling to streamline infrastructure and CI/CD workflows.
- Participate in on-call rotations, lead troubleshooting for production incidents, and conduct root-cause analysis.
- Collaborate with AI, ML, hardware, and product teams to ensure new services are production-ready.
Requirements
- 3+ years of experience in SRE, DevOps, Platform Engineering, or Production Operations.
- Hands-on experience with AWS or other major cloud platforms.
- Proficiency with Docker, Kubernetes, and Infrastructure as Code tools like Terraform.
- Strong understanding of observability tools such as Grafana and Prometheus.
- Solid Linux administration skills and proficiency in Python or Bash.
- Must be based in, or able to work on-site in, Taipei City, Taiwan.
Nice to have
- Experience operating large-scale edge computing or IoT deployments.
- Familiarity with zero-trust access management or security operations (threat detection).
- Exposure to AI infrastructure, LLM-based applications, or AI-Ops solutions.
- Knowledge of compliance frameworks such as ISO 27001.