Network Reliability & Observability Engineer
ADM
Job Description
Position Summary:
The Network Reliability & Observability Engineer is an experienced, hands-on role focused on strengthening the reliability, availability, performance, and operational maturity of global Network Services. This is a platform ownership and service improvement role, not a traditional ticket-response position. The engineer will own and improve network observability capabilities, with SolarWinds as a required core competency, while turning telemetry, events, inventory, and service data into actionable operational insight.
This role partners closely with Network Engineering and Network Operations during new device, site, circuit, firewall, wireless, SD-WAN, voice, and platform deployments to ensure observability, inventory, alerting, logging, backup, documentation, and operational handoff are built into production readiness. The role also works directly with Network Operations and MSP partners to improve incident response, reduce alert noise, strengthen capacity and performance planning, and drive continuous service improvement.
Why This Role Matters:
Improves visibility into critical LAN, WAN, SD-WAN, wireless, firewall, internet, data center, cloud connectivity, and voice services.
Strengthens reliability and availability by detecting service risk earlier, improving alert quality, and reducing repeat incidents.
Ensures new infrastructure deployments are fully monitored, properly inventoried, supportable, and operationally ready before handoff.
Establishes credible service-level reporting so leaders, service owners, Engineering, and Operations can understand service health, risk, and improvement priorities.
Core Responsibilities:
Own, administer, and continuously improve SolarWinds as a core Network Services observability platform, including node lifecycle, SNMP v2/v3 polling, custom properties, pollers, groups, dependencies, maps, alerts, reports, dashboards, and platform data quality.
Develop and tune SolarWinds monitoring standards across NPM, NTA/NetFlow, NCM, IPAM/UDT, and related capabilities, including threshold design, alert logic, polling intervals, interface monitoring, capacity views, device classification, and escalation workflows.
Use Syslog, NetFlow/sFlow/IPFIX, L4-L7 visibility, SNMP, device health metrics, interface counters, and other telemetry sources to identify availability, performance, capacity, and reliability risks.
Partner with Network Engineering on new deployments to define monitoring acceptance criteria and confirm devices, circuits, interfaces, technologies, logs, backups, inventory fields, ownership, and runbooks are complete before production support begins.
Partner with Network Operations to improve incident response, availability, and performance by identifying chronic alerts, unstable devices, circuit risk, latency, packet loss, jitter, saturation, CPU/memory issues, interface errors, and recurring service degradations.
Build and maintain service health dashboards, scorecards, and defined service-level reporting for availability, utilization, saturation, packet loss, latency, jitter, device health, voice quality indicators, alert volumes, incident trends, and improvement actions.
Ensure all in-scope inventory is fully monitored, accurately classified, and aligned to asset, CMDB, site, service, ownership, lifecycle, and support-model data requirements.
Improve onboarding and offboarding so new infrastructure is efficiently added to monitoring, logging, backup, discovery, inventory, dashboards, and support workflows, while retired assets are removed from monitoring, reporting, CMDB, and alerting cleanly.
Support incident and major incident analysis with telemetry evidence, timelines, probable cause indicators, service impact context, and recommendations for monitoring, automation, runbook, Problem Management, or engineering improvements.
Create practical automation, scripts, API integrations, data checks, and repeatable reports using Python, PowerShell, REST APIs, JSON, Ansible, SWQL/SQL, Bash, or similar tools to reduce manual work and improve operational accuracy.
Drive continuous service improvement by converting incident trends, monitoring gaps, capacity forecasts, inventory exceptions, and operational feedback into measurable corrective actions, standards updates, and service-review materials.
Required Experience and Technical Skills:
6+ years of experience in networking technologies and services, including infrastructure observability, enterprise monitoring, or Network SRE roles supporting large distributed environments.
Detailed hands-on SolarWinds expertise is required, including SolarWinds Platform/Orion administration, NPM, NTA/NetFlow, NCM, custom properties, Alert Manager, reporting, dashboards, maps, SNMP credential management, UnDP/custom pollers, dependency mapping, automated discovery, node lifecycle, and platform troubleshooting.
Strong understanding of enterprise network services, including routing, switching, WAN, SD-WAN, wireless, firewalls, internet connectivity, data center connectivity, cloud connectivity, DNS/DHCP dependencies, and voice/network service dependencies.
Strong knowledge of SNMP v2/v3, MIBs/OIDs, Syslog, NetFlow/sFlow/IPFIX, ICMP, TCP/IP, QoS, latency, jitter, packet loss, interface counters, device CPU/memory, interface utilization, and performance telemetry.
Experience supporting production readiness with Infrastructure Engineering, including monitoring standards, device onboarding, operational acceptance criteria, documentation, escalation paths, configuration backup, and support handoff.
Practical scripting and automation capability with Python, PowerShell, REST APIs, JSON, Ansible, Bash, SWQL/SQL, or similar tools for reporting, discovery, reconciliation, integration, validation, and operational improvement.
Working knowledge of ITSM practices and ServiceNow workflows, including Incident, Change, Problem, Event, Configuration, Service Request, Asset Management, and continuous service improvement.
Preferred Experience:
Experience with Cisco Catalyst Center/DNAC, Cisco SD-WAN/vManage, firewall management platforms, Splunk, Grafana, ELK, ThousandEyes, Prometheus, Datadog, OpenTelemetry, cloud monitoring platforms, or similar observability ecosystems.
Experience with SolarWinds API/SWIS/SWQL, SQL reporting, alert enrichment, ServiceNow integration, CMDB reconciliation, discovery rules, and monitored-versus-deployed inventory validation.
Experience supporting voice or collaboration observability, including VoIP, SIP, SBCs, RTP, QoS, call quality, MOS, jitter, packet loss, and carrier/provider escalation.
Experience in a global enterprise, shared-services, offshore delivery, or MSP-governed operating model; CCNP-level knowledge or equivalent practical experience is preferred.
Leadership Attributes and Ways of Working:
Self-starter with the drive to identify gaps, organize ambiguous work, engage the right teams, and move improvements forward without waiting for every task to be assigned.
Operational leader who influences Infrastructure Engineering, Network Operations, MSPs, vendors, and service owners through data, credibility, ownership, and follow-through.
Continuous service improvement mindset focused on preventing repeat incidents, improving service health, reducing operational toil, and making support more predictable.
Strong analytical judgment with the ability to separate noise from meaningful risk and convert complex telemetry into clear recommendations and improvement actions.
High ownership and accountability for SolarWinds platform quality, monitoring coverage, alert accuracy, inventory integrity, onboarding/offboarding discipline, dashboard usefulness, and service-reporting credibility.
Collaborative communicator who works effectively across time zones, explains findings clearly, challenges assumptions constructively, and escalates with confidence.
Measures of Success:
Defined service-level reporting is established and used to communicate Network Services health, risk, trends, and improvement actions to Operations, Engineering, service owners, and leadership.
All in-scope network inventory is fully monitored, accurately classified, and aligned to asset, CMDB, ownership, lifecycle, support, and service-reporting requirements.
Onboarding and offboarding processes are efficient, effective, repeatable, and governed so new assets are operationally ready and retired assets are cleanly removed from monitoring, alerting, reporting, and inventory systems.
SolarWinds monitoring maturity improves through better platform quality, data accuracy, alert tuning, dashboard reliability, automated discovery, capacity views, and actionable reporting.
Reliability, availability, and performance improve through earlier risk detection, reduced false positives, improved incident triage, stronger capacity forecasting, and measurable continuous service improvement.