Head of Engineering
Zocket
Job Description
The Technical Landscape

This is what you'll be working with and building on:

AI & Agent Systems

- Multi-Agent Orchestration: A fleet of specialized AI agents (creative generation, compliance checking, competitive analysis, campaign intelligence) that coordinate, share context, and produce coherent outputs. We're working with agentic frameworks and building custom orchestration layers for enterprise-scale reliability.
- Brand Knowledge Graph (Neo4j): A proprietary knowledge graph capturing brand identity, guidelines, competitive positioning, market context, and campaign performance history. This graph is the shared memory and reasoning substrate for all agents.
- Context & Memory Layers: Short-term working memory for task execution, long-term memory for brand knowledge persistence, and retrieval mechanisms (RAG + graph traversal) that give agents the right context at the right time.
- Multimodal Brand Compliance: Validating generated content (text, image, video) against brand rules encoded in the knowledge graph, with feedback loops for continuous improvement.
- Video Generation & Consistency: Working with models like Google Veo for video generation, maintaining brand-consistent visual identity, scene coherence, and style continuity.
- AI Evals & Agent Monitoring: Evaluation pipelines for measuring agent output quality, brand compliance accuracy, hallucination rates, and task completion reliability. Continuous monitoring of agent behaviour in production with alerting, drift detection, and regression tracking across prompt and model changes.

Backend & Infrastructure

- API & Services Architecture: RESTful and event-driven microservices handling authentication, authorization, rate limiting, request validation, error handling, and graceful degradation. Clean API contracts between services, versioning strategies, and backward compatibility for enterprise clients.
- Real-Time Systems: WebSocket services for live Brand Intel Dashboard updates, agent status streaming, and collaborative features. Connection management, fan-out patterns, and graceful degradation at scale.
- Event-Driven Pipeline: Kafka for agent event streaming, async task processing, cross-service communication, and audit logging. Topic design, partitioning strategies, consumer group patterns, and delivery guarantees.
- Caching & State Management: Redis for agent context caching, session state, rate limiting, pub/sub for real-time features, and distributed locking for concurrent agent operations.
- Data Layer: PostgreSQL as the relational backbone for transactional data, user management, and campaign state. Neo4j for the knowledge graph. OLAP databases (ClickHouse) for analytics workloads powering the Brand Intel Dashboard: aggregations over millions of ad performance records, competitive benchmarks, and trend analysis.
- Data Collection Infrastructure: Web scraping (Apify, Crawlee), official API integrations, data normalization, and ingestion pipelines feeding both Neo4j and the OLAP layer.

DevOps, CI/CD & Infrastructure

- Containerization & Orchestration: Docker for service packaging, Kubernetes for orchestration, scaling, and service mesh. Infrastructure-as-Code (Terraform / Pulumi) for reproducible deployments.
- CI/CD Pipelines: Automated build, test, lint, security scan, and deployment workflows. GitOps-based deployment strategies with staged rollouts, canary deployments, and automated rollback for production safety.
- Observability Stack: Centralized logging, distributed tracing (OpenTelemetry), metrics collection (Prometheus/Grafana), and alerting. Full-stack visibility, from API latency to Kafka consumer lag to LLM call performance to agent pipeline traces.
- Cloud Infrastructure: AWS-based infrastructure with cost-aware architecture. VPC design, IAM, secrets management, and a security posture appropriate for enterprise clients handling sensitive brand data.

Stack Summary

- Frontend: Next.js, React, Tailwind CSS
- Backend: Python, Node.js, WebSockets, REST APIs
- Databases: PostgreSQL, Neo4j (knowledge graph), ClickHouse (OLAP/analytics), Redis (caching/state)
- Streaming & Messaging: Kafka
- AI/LLM: Anthropic Claude, Google Gemini/Veo, RAG pipelines, custom agent orchestration
- Infrastructure: AWS, Docker, Kubernetes, Terraform
- CI/CD & Observability: GitHub Actions, ArgoCD, Prometheus, Grafana, OpenTelemetry
- Project & Communication: Linear, Slack

What We're Looking For

Must-Haves

- 8+ years of software engineering experience, with at least 2–3 years building AI/ML-powered products in production (not research, not prototyping: production systems with real users).
- Strong backend engineering fundamentals: you've designed and scaled production systems with microservices architectures, event-driven patterns (Kafka), real-time capabilities (WebSockets), caching layers (Redis), and relational databases (PostgreSQL). You have strong opinions on API design, auth patterns, error handling, and service boundaries.
- Experience working across multiple database paradigms in production: relational (PostgreSQL), graph (Neo4j or similar), and columnar/OLAP (ClickHouse, BigQuery, or similar). Strong instincts on data modeling, query optimization, indexing strategies, and when to use which database for which workload.
- Deep hands-on experience with agentic AI systems: multi-agent orchestration, agent memory and context management, tool use, or agentic frameworks (LangGraph, CrewAI, AutoGen, or custom orchestration layers).
- Working experience with graph databases, preferably Neo4j. Comfortable designing graph schemas, writing Cypher queries, and reasoning about when graph-based retrieval beats vector search or relational joins.
- Strong understanding of LLM-based systems: prompt engineering, RAG architectures, embedding pipelines, model routing, context window optimization, and the tradeoffs between different LLM providers.
- Experience building AI evaluation and monitoring systems: automated evals for output quality, regression testing for prompt/model changes, and production monitoring of AI system behaviour.
- DevOps and CI/CD experience: you've built or significantly improved deployment pipelines, worked with containerized environments (Docker/Kubernetes), infrastructure-as-code, and production observability (logging, tracing, metrics, alerting).
- A proven track record of mentoring engineers: you've changed how someone thinks about a problem, not just solved it for them.
- Systems engineering depth: you understand distributed systems, async pipelines, connection pooling, backpressure, failover strategies, and the messy reality of AI in production (latency, cost, non-determinism, failures).
- Fluency with AI-assisted and agent-driven development workflows: you should already be using coding agents (Cursor, Claude Code, or similar) in your daily work and have opinions on spec-driven autonomous development.

Strong Signals

- Experience with autonomous coding agent frameworks (OpenSpec, Speckit, BMAD, or similar spec-to-code pipelines). If you've built or evaluated these systems, we want to talk.
- Experience designing and operating knowledge graphs for AI reasoning: not just as storage, but as an active substrate for agent decision-making.
- Experience building real-time systems at scale: WebSocket architectures with thousands of concurrent connections, Kafka pipelines processing high-throughput event streams, or Redis-backed distributed state systems.
- Hands-on experience with OLAP/analytics databases (ClickHouse, BigQuery, Druid) for building customer-facing analytics dashboards over large datasets.
- Experience building AI eval frameworks: custom evaluation harnesses, prompt regression suites, or production quality monitoring for LLM-based systems.
- Familiarity with multimodal AI: vision models, video generation (Veo, Runway, Sora), or compliance/validation systems over multimodal outputs.
- Experience with the enterprise AI stack (Anthropic Claude API, Google Vertex AI / Gemini, AWS Bedrock), including an understanding of API compliance, rate limiting, and cost management at scale.
- Strong DevOps background: Kubernetes cluster management, Terraform/Pulumi at scale, GitOps workflows (ArgoCD/Flux), and production incident management for enterprise SLAs.
- Background in brand/marketing tech, ad platforms, creative generation, or competitive intelligence.
- Contributions to open-source AI tooling or infrastructure projects, or published technical writing on agent architectures, backend systems, or knowledge graphs.
- Startup experience: you've operated where you wore multiple hats and shipped fast with a small team. Bonus if you've sold to or worked with enterprise clients.