SRE Support Engineer (Observability)
Job description
TL;DR
SRE Support Engineer (Observability): Provide high-impact technical support for the monitoring, alerting, telemetry, and operational tooling of a large technology company's internal IaaS platform, with an emphasis on Prometheus, AlertManager, Linux, and networking. Focus on troubleshooting complex issues, onboarding customers, building documentation, and driving operational improvements through trend analysis and feedback to engineering.
Location: Remote (US, Canada, Brazil, Chile, Colombia, Mexico) | Time Zone: 8AM–5PM Pacific
Company
Global technology services company delivering large-scale cloud, data, and engineering solutions across 130+ countries, partnering with the world's largest organizations to build and operate internal platforms.
What you will do
- Manage Slack threads and tickets for observability and tooling support, from simple resolutions to end-to-end onboarding.
- Troubleshoot and resolve monitoring/alerting issues with Prometheus, AlertManager, OpenTelemetry, Linux, and networking.
- Build and maintain knowledge base articles, playbooks, and community posts to enable self-service.
- Analyze customer trends, provide feedback to engineering/SRE teams, and propose scalable improvements.
- Participate in incident response and post-mortems, and contribute to team processes and tooling.
Requirements
- Several years of experience supporting highly scalable applications and web services.
- Hands-on experience with Kubernetes, Prometheus and AlertManager troubleshooting, OpenTelemetry, and distributed tracing.
- Strong Linux OS knowledge (CLI, debugging, logs) and TCP/IP networking fundamentals.
- Experience troubleshooting ambiguous, multi-layer issues, with strong analytical skills and attention to detail.
- Excellent written/verbal communication for technical customers; service mindset.
- 3–7+ years in Technical Support Engineering, SRE/Platform support, DevOps or similar.
Nice to have
- Experience with internal support tooling, automation, and runbooks.
- Familiarity with Grafana, log aggregation, and incident tooling.
- Prior SRE or platform support at scale.
Culture & Benefits
- Remote-first environment with high autonomy and trust-based culture.
- Global team working on modern systems and meaningful technical challenges.
- Real impact through deep troubleshooting and scaling support practices.