Senior Site Reliability Engineer (Cloud)
Job description
TL;DR
Senior Site Reliability Engineer (Cloud): Designing and scaling fault-tolerant infrastructure for a global e-commerce shipping API, with an emphasis on Kubernetes, cloud orchestration, and system reliability. Focus on automating CI/CD pipelines, implementing disaster recovery solutions, and optimizing high-availability distributed systems.
Location: Remote (Canada)
Company
is the shipping layer of the internet, providing logistics technology and infrastructure to connect merchants with carriers worldwide via a single API.
What you will do
- Design, scale, and secure infrastructure through fault-tolerant architecture, performance tuning, and capacity planning.
- Build and maintain automation, monitoring, and alerting systems, including disaster recovery solutions.
- Ensure scalability and maintainability through microservices adoption and decoupling of concerns.
- Enhance and maintain CI/CD pipelines to ensure smooth and safe production releases.
- Verify system performance and correctness, including response time and throughput.
- Participate in on-call rotations and collaborate on peer design reviews for new features.
Requirements
- Experience developing and troubleshooting highly available distributed systems, specifically with Kubernetes in production.
- Extensive expertise with at least one public cloud provider (AWS, GCP, or Azure).
- Exceptional verbal, written, and interpersonal communication skills.
- Strong understanding of security practices, automation, and testing methods.
- Familiarity with Redis, Elasticsearch, and Hadoop.
- BS or MS degree in Computer Science or equivalent professional experience.
Nice to have
- Advanced knowledge of PostgreSQL server configuration and optimization.
- 3+ years of professional software development experience.
- Experience managing service meshes (e.g., Istio) and monitoring SLOs/SLAs.
- Proficiency with monitoring tools such as New Relic, Prometheus, Grafana, or Datadog.
- Knowledge of OpenTelemetry for distributed tracing and metrics collection.
- Experience managing Python and Golang applications in production.
Culture & Benefits
- Remote-first and globally distributed team environment.
- Culture based on flexibility, trust, and autonomy.
- Commitment to inclusivity and equal access to opportunities for all backgrounds.
- Modern, scalable technology stack.