Company hidden
6 days ago

Big Data/Data Platform Site Reliability Engineer (Hadoop/Kafka)

Work format: remote (USA only)
Employment type: full-time
Grade: senior
English: B2
Country: US
Vacancy from the Hirify RU Global list (companies with Eastern European roots)

Job description

TL;DR

Senior Site Reliability Engineer (Hadoop/Kafka): deploying, configuring, and maintaining big data stores and large-scale Linux infrastructure with an emphasis on reliability, scalability, and operational excellence. The focus is on debugging complex distributed system issues, advancing the technology stack, and ensuring maximum uptime and predictable performance.

Location: Willing and able to work East Coast U.S. hours (9am–6pm EST)

Company

hirify.global is a fast-growing healthcare technology company using real-time data to transform healthcare through machine learning and programmatic automation.

What you will do

  • Deploy, configure, monitor, and maintain multiple big data stores with a strong focus on reliability and scalability.
  • Manage large-scale Linux infrastructure to ensure maximum uptime and predictable performance.
  • Develop and document system configuration standards, operational procedures, and best practices.
  • Perform performance and reliability testing, including reviewing configuration and hardware specifications.
  • Participate in incident response, root cause analysis, and drive long-term reliability improvements.
  • Advance the technology stack with innovative ideas and pragmatic solutions.

Requirements

  • Strong hands-on experience operating large-scale Linux infrastructure in production (Rocky Linux or equivalent).
  • Deep practical knowledge of Apache Hadoop-based data platforms (HDFS architecture, Kerberos, operational lifecycle).
  • Experience running Apache Kafka clusters in production, including KRaft-based setups.
  • Proven ability to debug complex distributed system issues across storage, compute, and networking layers.
  • Experience designing or improving automation, deployment, or GitOps-style workflows.
  • Proficiency in scripting or automation (Python, Shell).
  • Solid understanding of networking fundamentals (TCP/IP, DNS, load balancing, basic network security).
  • Willing and able to work East Coast U.S. hours (9am–6pm EST).

Nice to have

  • Experience with Trino and Apache Iceberg in large-scale analytical query environments.
  • Experience administering Percona XtraDB Cluster or other HA databases.
  • Hands-on experience with Ceph or other distributed storage systems.
  • Strong background in observability platforms (Prometheus, Grafana, Graphite, ELK, Icinga).
  • Experience with configuration management (Puppet or similar).
  • Familiarity with Docker and Kubernetes in production environments.
  • Background in AdTech, real-time data systems, or low-latency/high-throughput environments.

Culture & Benefits

  • A collaborative environment where team success is valued.
  • Opportunities to continuously grow skills and learn new technologies.
  • Emphasis on strategic thinking and deep dives into complex systems.
  • A proactive approach to problem-solving and infrastructure reliability.

Hiring process

  • Initial Screening Call (30 mins).
  • Team Lead Screening (30 mins).
  • Technical Interview with SREs (1 hour).
  • General Discussion with VP of Data Engineering (30 mins).
  • Technical Interview with Principal Architect (1 hour).
  • Meet & Greet w/ SVP of Engineering (15 mins).
  • Final Video Call with Sr. Director of Data Management at WebMD (30 mins).

The vacancy text is reproduced without changes.
