Kafka Expert
Job description
TL;DR
Kafka Expert (Kafka/ZooKeeper): Troubleshoot and modernize an older on-prem Kafka cluster carrying real-time market quote / HFT tick data, with an emphasis on incident diagnosis, root-cause analysis of storage/disk saturation, and operational hardening. The work centers on building monitoring/alerting, runbooks, and a practical upgrade roadmap (including a path away from ZooKeeper) while improving resilience to minimize RTO/RPO.
Company
The company provides Kafka troubleshooting and modernization support for production streaming environments.
What you will do
- Rapidly triage incidents by validating broker health, controller/ZK health, partition leadership/ISR, replication, rebalances, and disk saturation scope.
- Diagnose why disk utilization jumped from ~10% to near 100% and identify root causes behind missing leaders, topic access failures, and invalid partition behavior.
- Assess cluster configuration and harden it by reviewing broker/topic settings, partition distribution, rack awareness (if any), and failover behavior; document failure domains and bottlenecks.
- Uplift observability by proposing/implementing Kafka monitoring (broker + ZK + OS/disk) with dashboards and alerting for lag, under-replication, disk/controller events, latency, GC, and network.
- Deliver operational enablement: produce findings + recommendations/roadmap and create runbooks for safe operations (restarts, partition reassignment, capacity checks, backups, upgrades, recovery).
- Optionally execute remediations (storage rebalancing, retention tuning, leader imbalance fixes) and plan Kafka upgrades, including migration to KRaft (ZooKeeper removal) and resilience improvements.
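As an illustration of the partition leadership/ISR triage described above, here is a minimal Python sketch. It assumes the partition state has already been collected into plain records, for example by parsing the output of `kafka-topics.sh --describe`. The names `PartitionState` and `triage`, and the sample topics, are hypothetical and not part of any Kafka API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PartitionState:
    """One partition's state (hypothetical record, not a Kafka API type)."""
    topic: str
    partition: int
    leader: Optional[int]  # broker id, or None when no leader is elected
    replicas: List[int]    # assigned replica broker ids
    isr: List[int]         # in-sync replica broker ids


def triage(partitions: List[PartitionState]) -> Dict[str, List[str]]:
    """Group partitions into leaderless and under-replicated buckets."""
    report: Dict[str, List[str]] = {"leaderless": [], "under_replicated": []}
    for p in partitions:
        key = f"{p.topic}-{p.partition}"
        if p.leader is None:
            report["leaderless"].append(key)
        # Under-replicated: fewer in-sync replicas than assigned replicas.
        if len(p.isr) < len(p.replicas):
            report["under_replicated"].append(key)
    return report


if __name__ == "__main__":
    # Sample data is invented for illustration only.
    snapshot = [
        PartitionState("quotes", 0, 1, [1, 2, 3], [1, 2, 3]),
        PartitionState("quotes", 1, None, [1, 2, 3], []),
        PartitionState("ticks", 0, 2, [2, 3], [2]),
    ]
    for category, keys in triage(snapshot).items():
        print(category, keys)
```

A report like this only narrows the scope of an incident; the root cause (disk saturation, controller trouble, network partitions) still has to be confirmed from broker logs and OS-level metrics.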
Requirements
- Proven hands-on experience operating Kafka in production, including high-throughput clusters.
- Strong troubleshooting experience with partition leadership issues (missing leaders), ISR shrinkage, under-replicated partitions, and safe broker recovery without destructive “sledgehammer” actions.
- Experience with ZooKeeper-based Kafka clusters and operational best practices.
- Linux competence for disk/IO analysis, filesystem saturation, process/resource analysis, and networking basics.
- Ability to produce clear, actionable documentation: findings, recommendations, and runbooks.
- Strong communication skills working with a mixed engineering + IT team unfamiliar with Kafka.
Nice to have
- Experience with Kafka monitoring stacks (JMX metrics pipelines, Prometheus/Grafana, lag monitoring, alerting design).
- Experience with GUI/admin tooling and governance practices (RBAC, auditing approach, safer topic/config workflows).
- Experience planning Kafka upgrades/migrations, including evaluating KRaft readiness and risk.
- Familiarity with market data/trading workloads and latency-sensitive pipelines.
- Experience with VMware-based on-prem operations and capacity planning.
Culture & Benefits
- Freelance engagement focused on pragmatic improvements to an older on-prem Kafka environment with limited observability.
- Clear deliverables: incident diagnosis, findings + recommendations/roadmap, and operational runbooks.
- Hands-on collaboration with engineering and IT to transfer Kafka troubleshooting and day-to-day operations knowledge.
- Resilience goal: minimize RTO/RPO (target: as low as practical, possibly a maximum data-loss tolerance of about 1 minute).
Hiring process
- Review the current Kafka incident symptoms and cluster context, then align on triage scope and modernization priorities.
- Deliver a findings report and recommendations/roadmap, followed by optional execution of selected remediations and upgrade planning.