TL;DR
Senior Site Reliability Engineer (Hadoop/Kafka): Deploying, configuring, and maintaining big data stores and large-scale Linux infrastructure with an emphasis on reliability, scalability, and operational excellence. Focus on debugging complex distributed system issues, advancing the technology stack, and ensuring maximum uptime and predictable performance.
Location: Willing and able to work East Coast U.S. hours (9am–6pm EST)
Company
hirify.global is a fast-growing healthcare technology company that uses real-time data, machine learning, and programmatic automation to transform healthcare.
What you will do
- Deploy, configure, monitor, and maintain multiple big data stores with a strong focus on reliability and scalability.
- Manage large-scale Linux infrastructure to ensure maximum uptime and predictable performance.
- Develop and document system configuration standards, operational procedures, and best practices.
- Perform performance and reliability testing, including reviewing configuration and hardware specifications.
- Participate in incident response, root cause analysis, and drive long-term reliability improvements.
- Advance the technology stack with innovative ideas and pragmatic solutions.
Requirements
- Strong hands-on experience operating large-scale Linux infrastructure in production (Rocky Linux or equivalent).
- Deep practical knowledge of Apache Hadoop-based data platforms (HDFS architecture, Kerberos, operational lifecycle).
- Experience running Apache Kafka clusters in production, including KRaft-based setups.
- Proven ability to debug complex distributed system issues across storage, compute, and networking layers.
- Experience designing or improving automation, deployment, or GitOps-style workflows.
- Proficiency in scripting or automation (Python, Shell).
- Solid understanding of networking fundamentals (TCP/IP, DNS, load balancing, basic network security).
- Willing and able to work East Coast U.S. hours (9am–6pm EST).
Nice to have
- Experience with Trino as a large-scale analytical query engine and Apache Iceberg as a table format.
- Experience administering Percona XtraDB Cluster or other HA databases.
- Hands-on experience with Ceph or other distributed storage systems.
- Strong background in observability platforms (Prometheus, Grafana, Graphite, ELK, Icinga).
- Experience with configuration management (Puppet or similar).
- Familiarity with Docker and Kubernetes in production environments.
- Background in AdTech, real-time data systems, or low-latency/high-throughput environments.
Culture & Benefits
- A collaborative environment where team success is valued.
- Opportunities to continuously grow skills and learn new technologies.
- Emphasis on strategic thinking and deep dives into complex systems.
- A proactive approach to problem-solving and infrastructure reliability.
Hiring process
- Initial Screening Call (30 mins).
- Team Lead Screening (30 mins).
- Technical Interview with SREs (1 hour).
- General Discussion with VP of Data Engineering (30 mins).
- Technical Interview with Principal Architect (1 hour).
- Meet & Greet with SVP of Engineering (15 mins).
- Final Video Call with Sr. Director of Data Management at WebMD (30 mins).