Company hidden
5 days ago

Kafka Expert

Job type
Project
English
B2
Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Kafka Expert (Kafka/ZooKeeper): Troubleshoot and modernize an older on-prem Kafka cluster carrying real-time market quote / HFT tick data, with an emphasis on incident diagnosis, root causes of storage/disk saturation, and operational hardening. The focus is on building monitoring/alerting, runbooks, and a practical upgrade roadmap (including a path away from ZooKeeper) while improving resilience to minimize RTO/RPO.

Company

hirify.global provides Kafka troubleshooting and modernization support for production streaming environments.

What you will do

  • Rapidly triage incidents by validating broker health, controller/ZK health, partition leadership/ISR, replication, rebalances, and disk saturation scope.
  • Diagnose why disk utilization jumped from ~10% to near 100% and identify root causes behind missing leaders, topic access failures, and invalid partition behavior.
  • Assess cluster configuration and harden it by reviewing broker/topic settings, partition distribution, rack awareness (if any), and failover behavior; document failure domains and bottlenecks.
  • Improve observability by proposing/implementing Kafka monitoring (broker + ZK + OS/disk) with dashboards and alerting for lag, under-replication, disk and controller events, latency, GC, and network.
  • Deliver operational enablement: produce findings + recommendations/roadmap and create runbooks for safe operations (restarts, partition reassignment, capacity checks, backups, upgrades, recovery).
  • Optionally execute remediations (storage rebalancing, retention tuning, leader imbalance fixes) and plan Kafka upgrades, including migration to KRaft (ZooKeeper removal) and resilience improvements.

Requirements

  • Proven hands-on experience operating Kafka in production, including high-throughput clusters.
  • Strong troubleshooting experience with partition leadership issues (missing leaders), ISR shrinkage, under-replicated partitions, and safe broker recovery without destructive “sledgehammer” actions.
  • Experience with ZooKeeper-based Kafka clusters and operational best practices.
  • Linux competence for disk/IO analysis, filesystem saturation, process/resource analysis, and networking basics.
  • Ability to produce clear, actionable documentation: findings, recommendations, and runbooks.
  • Strong communication skills for working with a mixed engineering + IT team unfamiliar with Kafka.

Nice to have

  • Experience with Kafka monitoring stacks (JMX metrics pipelines, Prometheus/Grafana, lag monitoring, alerting design).
  • Experience with GUI/admin tooling and governance practices (RBAC, auditing approach, safer topic/config workflows).
  • Experience planning Kafka upgrades/migrations, including evaluating KRaft readiness and risk.
  • Familiarity with market data/trading workloads and latency-sensitive pipelines.
  • Experience with VMware-based on-prem operations and capacity planning.

Culture & Benefits

  • Freelance engagement focused on pragmatic improvements to an older on-prem Kafka environment with limited observability.
  • Clear deliverables: incident diagnosis, findings + recommendations/roadmap, and operational runbooks.
  • Hands-on collaboration with engineering and IT to transfer Kafka troubleshooting and day-to-day operations knowledge.
  • Resilience goal: minimize RTO/RPO (target: as low as practical, possibly a maximum data-loss tolerance of ~1 minute).

Hiring process

  • Review the current Kafka incident symptoms and cluster context, then align on triage scope and modernization priorities.
  • Deliver a findings report and recommendations/roadmap, followed by optional execution of selected remediations and upgrade planning.
