[00:03:55] RESOLVED: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:03:55] RESOLVED: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:04:47] PROBLEM - SSH on an-airflow1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:04:57] FIRING: [3x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:06:37] RECOVERY - SSH on an-airflow1006 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:08:12] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1076279 (owner: TrainBranchBot)
[00:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[01:26:11] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:03:13] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:38:14] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:59:29] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:04:57] FIRING: [3x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:16:01] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:16:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:17:31] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:20:21] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 10 Dec 2024 11:59:32 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:20:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:20:51] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:26:01] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:26:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:27:31] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:27:55] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 2.744 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:28:21] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 10 Dec 2024 11:59:32 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:28:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[05:40:16] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375776#10185258 (phaultfinder)
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter2006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter2006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:05:12] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[09:29:45] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 85712136 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[09:31:45] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 39160 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:10:17] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:53:23] (CR) Gergő Tisza: [C:+1] [beta-cluster] Enable cookie-based SUL3 feature flag [mediawiki-config] - https://gerrit.wikimedia.org/r/1076195 (https://phabricator.wikimedia.org/T375787) (owner: D3r1ck01)
[11:05:27] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:12:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 910.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:17:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 821.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:32:40] SRE, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, netops: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10185344 (taavi)
[11:32:47] SRE, Cloud-VPS, Infrastructure-Foundations, netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#10185345 (taavi)
[11:33:09] SRE, Cloud-VPS, Infrastructure-Foundations, netops: netbox: create IPv6 entries for Cloud VPS - https://phabricator.wikimedia.org/T374712#10185346 (taavi)
[11:33:19] SRE, cloud-services-team, Cloud-VPS: openstack: verify security groups settings for IPv6 - https://phabricator.wikimedia.org/T374714#10185347 (taavi)
[11:33:29] SRE, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, netops: openstack: work out IPv6 and designate integration - https://phabricator.wikimedia.org/T374715#10185348 (taavi)
[11:33:34] SRE, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, netops: cloudgw: add support and enable IPv6 - https://phabricator.wikimedia.org/T374716#10185349 (taavi)
[11:33:41] SRE, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, netops: cloudsw: codfw: enable IPv6 - https://phabricator.wikimedia.org/T374713#10185350 (taavi)
[11:33:47] SRE, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, netops: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10185351 (taavi)
[11:37:31] (PS3) Majavah: P:acme_chief: allow enabling http-01 spport [puppet] - https://gerrit.wikimedia.org/r/1011167 (https://phabricator.wikimedia.org/T342398)
[11:37:31] (PS3) Majavah: P:wmcs::novaproxy: proxy http-01 challenges to acme-chief [puppet] - https://gerrit.wikimedia.org/r/1011168 (https://phabricator.wikimedia.org/T342398)
[12:35:28] (PS1) Majavah: P:wmcs::metricsinfra::haproxy: migrate to HAProxy internal exporter [puppet] - https://gerrit.wikimedia.org/r/1076297 (https://phabricator.wikimedia.org/T343885)
[12:36:07] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4150/co" [puppet] - https://gerrit.wikimedia.org/r/1076297 (https://phabricator.wikimedia.org/T343885) (owner: Majavah)
[12:43:26] (CR) Majavah: [C:-1] cloud-vps dynamic proxy: prometheus stats from nginx access logs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1059125 (https://phabricator.wikimedia.org/T371382) (owner: Andrew Bogott)
[12:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[13:05:08] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375776#10185499 (phaultfinder)
[13:18:38] SRE-OnFire, cloud-services-team, Cloud-VPS, Patch-For-Review, Sustainability (Incident Followup): Add external meta-monitoring for metricsinfra - https://phabricator.wikimedia.org/T288053#10185514 (taavi)
[13:42:51] SRE, cloud-services-team, Cloud-VPS: ceph: test and decide 1 network interface setup - https://phabricator.wikimedia.org/T325531#10185615 (taavi)
[13:49:10] (PS1) Cwhite: logstash: cast airflow caller field to string [puppet] - https://gerrit.wikimedia.org/r/1076299
[14:38:14] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:59:29] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:05:27] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:24:57] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:30:03] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:50:19] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:50:31] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 215, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:20:03] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:24:57] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[17:05:18] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375776#10185674 (phaultfinder)
[18:15:12] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375776#10185705 (phaultfinder)
[20:24:51] PROBLEM - Host cr2-eqsin.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:24:52] PROBLEM - Host cr2-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[20:24:52] PROBLEM - Host cr2-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[20:25:12] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:25:53] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 68, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:26:37] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:26:49] PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:26:54] wow
[20:27:05] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 439 probes of 783 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:27:22] getting to a laptop
[20:29:36] looks like the router is down
[20:29:50] ports connecting to it on cr3-eqsin hard down, BGP/OSPF adjacency down
[20:29:53] RECOVERY - Host cr2-eqsin.mgmt is UP: PING OK - Packet loss = 0%, RTA = 222.76 ms
[20:29:55] yeah and it's not a false positive, it is actually down
[20:30:01] I'm checking now if the serial console shows any life
[20:30:03] hmm.....
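The "actually down" call above comes from checking cr2-eqsin's neighbours rather than relying on the ping alert alone. A rough sketch of that kind of confirmation, assuming access to a bastion host and to the cr3-eqsin CLI; the match patterns here are illustrative, not taken from the incident:

    # from a bastion / cumin host: is the router reachable at all?
    ping -c 3 cr2-eqsin.wikimedia.org

    # on the adjacent router: are the links and routing adjacencies towards cr2 down?
    user@cr3-eqsin> show interfaces descriptions | match cr2
    user@cr3-eqsin> show ospf neighbor
    user@cr3-eqsin> show bgp summary | match 103.102.166.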
[20:30:08] I am depooling eqsin
[20:30:21] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqsin [reason: cr2-eqsin down, no task ID specified]
[20:30:33] ulsfo is also depooled sigh
[20:30:38] this would result in terrible latency for eqsin
[20:30:51] maybe hold on then
[20:31:06] off-peak hours in east asia at least
[20:31:09] yep
[20:31:53] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:32:47] Bad week for juniper hw
[20:32:57] topranks: should we depool at least in the meantime?
[20:33:35] re-reading your comment - if the result is just poor latency I'd say go ahead for the time being
[20:33:40] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqsin [reason: cr2-eqsin down, no task ID specified]
[20:33:41] done
[20:33:49] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:33:55] "just"
[20:34:27] well I mean as opposed to us saturating something / packet loss which is worse
[20:34:29] well what choice do we have?
[20:34:49] in theory we can run off one router, but indeed best not to
[20:35:23] I wonder if T375345 and this are related
[20:35:24] T375345: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345
[20:35:27] but I think depooling buys us some time, and hopefully off peak for the region
[20:36:11] !incidents
[20:36:12] 5285 (ACKED) Host cr2-eqsin - PING - Packet loss = 100%
[20:36:15] !ack 5285
[20:36:15] 5285 (ACKED) Host cr2-eqsin - PING - Packet loss = 100%
[20:37:49] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:37:49] RECOVERY - OSPF status on mr1-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:37:56] RECOVERY - Host cr2-eqsin is UP: PING OK - Packet loss = 0%, RTA = 222.79 ms
[20:38:34] well
[20:40:09] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:40:10] device still reports being down in librenms though and I can't SSH
[20:40:17] RECOVERY - Host cr2-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 222.85 ms
[20:40:17] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[20:40:25] !incidents
[20:40:26] 5285 (ACKED) Host cr2-eqsin - PING - Packet loss = 100%
[20:40:26] 5286 (UNACKED) NELHigh sre (thanos-rule tcp.timed_out)
[20:40:31] !ack 5286
[20:40:32] 5286 (ACKED) NELHigh sre (thanos-rule tcp.timed_out)
[20:40:43] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[20:42:07] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 14 probes of 783 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:42:52] we can probably survive for some time but eqsin gets busy again and then that might get interesting
[20:43:58] on to the router now via serial, it seems healthy for the past 10 mins
[20:43:59] We could get some link saturation in codfw
[20:44:21] unsure of reboot reason, it appears to have booted off the backup system partition so perhaps some file corruption / disk issue
[20:44:40] topranks: you know better but
[20:44:46] what does show system alarms say?
[20:45:17] it just shows the problem with disk 0
[20:45:17] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[20:45:43] RESOLVED: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[20:49:02] the logs from before it reset aren't available on the local device as they are on the offline disk
[20:49:20] topranks: so the question is then I guess: do we have enough confidence to repool eqsin?
[20:53:10] I am updating the status page
[20:53:16] if someone has objections, please let me know
[20:53:44] Given that we are in low-traffic hours for eqsin we could repool and see how the router behaves
[20:53:51] ok
[20:53:59] I am fine with it too to test but not unless topranks is confident I guess
[20:54:09] because that would then depend on the severity of the error
[20:54:12] Sure
[20:54:22] if that means we are seeing something similar to what we saw with cr3-ulsfo, not sure
[20:54:24] let's hold off for a few mins if we can
[20:54:31] topranks: all good take your time please
[20:54:35] (and let us know if we can help)
[20:54:38] I am updating the status page in the meantime
[20:54:47] this has some similarities to cr3-ulsfo but is also different
[20:54:49] disk issue
[20:54:55] we had something similar before
[20:54:55] https://phabricator.wikimedia.org/T372781
[20:55:14] same symptoms here - difference is the fix there was rebooting the redundant routing engine in the MX480 in eqiad
[20:55:33] this device - MX204 - does not have a redundant RE to reboot without affecting anything
[20:55:37] this one is the 204?
[20:55:38] right
[20:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[21:10:29] SRE, Infrastructure-Foundations, netops: cr2-eqsin disk failure Sept 2024 - https://phabricator.wikimedia.org/T375961 (cmooney) NEW p:Triage→High
[21:10:35] ok we have been stable for 38 mins now
[21:10:39] I think we should try a repool
[21:11:05] given the potential resource constraints in codfw and high latency from the region
[21:11:35] ok
[21:11:38] let's do it
[21:12:00] if - as it appears on the face of things - it's just a disk failure and things are working ok from the backup partition, then no reason it won't be ok like that for a while
[21:12:09] so yeah +1 from me for repool
[21:12:13] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site eqsin [reason: testing pooling after cr2-eqsin was down and site was depooled, no task ID specified]
[21:12:16] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqsin [reason: testing pooling after cr2-eqsin was down and site was depooled, no task ID specified]
[21:12:19] pooled
[21:12:35] thanks for filing the task. incident doc is https://docs.google.com/document/d/1eBOZe9bTZ9kGJPxSmsu8ToQXlPLB0H_S_W9aM2SIeGA/edit
[21:12:38] updating as we go
[21:13:11] keeping an eye out on errors (you can leave the CDN site to me and just observe the cr)
[21:13:26] sukhe: maybe I spoke too soon
[21:13:30] leave it for now
[21:13:35] it's pooled
[21:13:38] you want me to depool?
[21:14:18] yeah - sorry
[21:14:24] ok
[21:14:26] BGP is down from it to all the LVS for some reason
[21:14:33] oh
[21:14:35] restart probably
[21:14:36] wait
[21:15:01] !log sudo cumin 'A:lvs-eqsin' 'systemctl restart pybal'
[21:15:03] check now
[21:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:55] nope
[21:15:57] it's down
[21:15:57] ok
[21:16:03] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqsin [reason: testing failed, depooling again, no task ID specified]
[21:16:16] topranks: I am depooling, it doesn't seem like we are in a good state
[21:16:19] ok?
[21:17:00] SRE, Infrastructure-Foundations, netops: cr2-eqsin disk failure Sept 2024 - https://phabricator.wikimedia.org/T375961#10185783 (cmooney) Actually looking at the output in more detail BGP to the LVS servers / PyBal is down. ` Peer AS InPkt OutPkt OutQ Flaps Last Up/Dw...
[21:17:02] yeah, errors
[21:17:34] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqsin [reason: testing failed, depooling again, no task ID specified]
[21:17:48] sukhe: my guess is that's just PyBal not dealing with things right for some reason
[21:18:23] I think we should restart the service, I guess we can try with lvs5003 anyway?
[21:18:31] topranks: I did that but it didn't help
[21:18:32] so I depooled
[21:18:58] huh
[21:19:01] but now since it's depooled, we can experiment with it again but we should check to make sure that the session is actually established
[21:19:17] Sep 28 21:15:25 lvs5004 pybal[35057]: [bgp.BGPFactory@0x7f185931c6e0] INFO: Client connection failed: User timeout caused connection failure.
[21:19:20] Sep 28 21:15:25 lvs5004 pybal[35057]: [bgp.FSM@0x7f185931f090 peer 103.102.166.130] INFO: State is now: IDLE
[21:19:23] Sep 28 21:15:55 lvs5004 pybal[35057]: [bgp.BGPFactory@0x7f185931c6e0] INFO: Client connection failed: User timeout caused connection failure.
[21:19:27] any logs on the pybal side as to what's not working?
[21:19:29] hmm
[21:20:08] has to be pybal I think, 91 working BGP sessions on the router, but all 3 pybals down
[21:20:13] hmmm
[21:21:08] checking
[21:21:23] topranks: maybe it was a question of timing
[21:21:35] nope
[21:22:28] nah
[21:22:34] the router has an outdated config
[21:22:43] https://phabricator.wikimedia.org/T321545#8341024
[21:22:45] it's set up to peer with lvs500[1-3]
[21:22:50] wow
[21:22:55] ok
[21:22:59] not lvs500[4-6]
[21:23:04] and that didn't kick in all this time because it wasn't rebooted?
[21:23:10] the uptime was like 2 years
[21:23:46] tbh I don't know
[21:23:59] I thought the snapshot was a regular thing
[21:24:07] but cr3-eqsin looks fine?
[21:24:13] 10.132.0.6 64600 17 17 0 31 2:22 Establ
[21:24:16] 10.132.0.7 64600 53 56 0 32 8:29 Establ
[21:24:19] 10.132.0.39 64600 53 56 0 33 8:25 Establ
[21:24:22] these are all the correct LVSes
[21:25:08] yeah it's working fine
[21:25:13] cr2 rebooted with an old config
[21:25:20] also this, despite ping working ok
[21:25:21] ERROR:homer:Attempt 1/3 failed: Unable to connect to cr2-eqsin.wikimedia.org
[21:25:34] topranks: fwiw I still can't SSH
[21:25:45] hmm... why can I ?
[21:26:09] anyway, aside from that, where do we go from here?
[21:26:36] what does ssh -v show ?
[21:26:49] I'll get the config sorted as a first step
[21:27:10] I think it's just missing your SSH key tbh
[21:27:23] but cr3 has it. that can happen?
[21:27:33] that's not the focus to be clear but I was curious if that points to something else
[21:29:05] these are different devices
[21:29:20] they can definitely have different configurations, different "authorized_keys" files effectively
[21:30:15] wow ok, I thought that the key was synced given there is only one place you put the key and that it works for all routers
[21:30:28] topranks: coming back, did updating cr2 for the new LVS IPs help?
[21:31:08] homer pushes the key
[21:31:16] it's the same as for our servers really
[21:31:35] em didn't quite get that far, homer push is now failing on me due to missing cert for gnmi
[21:31:58] can you share the error? maybe some rubber duck debugging might help
[21:37:18] nah it's ok I think
[21:37:37] just need some cert to be there - I've copied the one from cr3-eqsin
[21:39:09] ok
[21:40:11] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:40:14] ok
[21:40:16] that's a good sign
[21:40:22] see if you can ssh now
[21:40:39] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:40:42] I'll do a manual diff on the remaining config to catch anything else in the meantime
[21:40:55] topranks: yep works!
[21:42:19] BGP sessions still show "active".
[21:42:25] wait no, old LVS IPs.
[21:47:03] yeah.... is PyBal BGP not done by Homer??
[21:47:13] INFO:homer.transports.junos:Empty diff for cr2-eqsin.wikimedia.org, skipping device.
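The homer lines above come from Homer, the tool that renders each device's intended configuration (from Netbox data plus YAML) and pushes it to the router, so a stale on-device config would normally show up as a diff before a commit. A minimal sketch of how that is typically driven from a cumin host; the device query and commit message are illustrative assumptions, not what was run here:

    # sketch only: query and commit message are assumptions
    homer 'cr2-eqsin*' diff                                          # compare the generated config against the live device
    homer 'cr2-eqsin*' commit "restore cr2-eqsin config - T375961"   # push the generated config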
[21:48:59] I don't see why and how it would be any different for cr2-eqsin though
[21:49:30] no it shouldn't be
[21:49:42] but I hardly think homer saying no diff is any issue
[21:49:54] so I assume the PyBal group is still being done manually for some reason
[21:50:00] maybe an oversight, we don't have many LVS
[21:50:37] BGP set to false for those lvs in netbox I think is the issue
[21:51:00] which then means that it predates the BGP switch to Netbox
[21:51:09] but yeah, +1 for fixing that
[21:52:39] ok that's added the right neighbors now
[21:52:48] sukhe@cr2-eqsin> show bgp group pybal
[21:52:51] looks ok yep
[21:52:59] you got a diff for the homer changes as well?
[21:53:12] BGP session established!
[21:53:13] looks good
[21:53:17] https://www.irccloud.com/pastebin/JKFwGjec/
[21:54:37] pybal looks good fwiw
[21:54:44] if everything else also does, we can try again
[21:55:29] it's super late for you as well. you can leave this to me and worst case if we can't repool eqsin, I might need to pick up the phone to make a decision by bringing in at least one other SRE :)
[21:59:11] yeah just gimme a minute, applying the other config diffs from the rancid backup config now
[21:59:31] all good. if you want me to pick up something, happy to
[22:04:25] ok I'm fairly happy
[22:04:44] a few BGP peers I don't have a way to find the MD5 auth keys for without maybe trawling through emails
[22:04:55] but it's ok I think - most are up and all transit is up
[22:05:01] router has been stable since the reset
[22:05:09] config is now up to date minus those bgp keys
[22:05:19] ok
[22:05:55] do we have a diff of the peers that are missing?
[22:06:21] one more question:
[22:06:26] from the motd,
[22:06:27] "Note: VM Host is Currently running from alternate disk
[22:06:28] "
[22:09:18] SRE, Infrastructure-Foundations, netops: cr2-eqsin disk failure Sept 2024 - https://phabricator.wikimedia.org/T375961#10185804 (cmooney) The obvious thing I didn't at first spot was the config on the router was seriously out of date. The PyBal group had the old lvs500[1-3] configured, which have lon...
[22:09:28] no peers are missing, but some are down due to missing MD5 in the config
[22:09:35] Groups: 17 Peers: 282 Down peers: 18
[22:09:38] cr2
[22:09:41] Groups: 17 Peers: 427 Down peers: 18
[22:09:42] cr3
[22:09:52] yeah
[22:11:31] ah so the IX peers are down
[22:11:32] ok
[22:12:23] yeah just a few
[22:12:34] I listed them on the task and set the NDA policy now
[22:12:38] time to repool then? I mean things look OK but well
[22:12:38] no biggie
[22:12:42] yeah +1
[22:12:49] the only way to find out is to test it I guess
[22:12:49] ok
[22:13:07] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site eqsin [reason: repool eqsin take two, cr2-eqsin config restored, no task ID specified]
[22:13:14] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.dns.admin (exit_code=99) DNS admin: pool site eqsin [reason: repool eqsin take two, cr2-eqsin config restored, no task ID specified]
[22:13:22] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site eqsin [reason: repool eqsin take two, cr2-eqsin config restored, T375961]
[22:13:26] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqsin [reason: repool eqsin take two, cr2-eqsin config restored, T375961]
[22:13:31] done
[22:14:10] ok - let's see how it goes!
[22:14:24] !
[22:16:48] seems to be holding up OK
[22:16:54] can you verify if cr2 is holding up fine?
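For the "is cr2 holding up" question just above, the checks referenced throughout this log are standard Junos operational commands plus a look at PyBal on the load balancers. A rough sketch of what such a verification pass could look like, assuming CLI access to the router and sudo on a cumin host; the time window and the expected results in the comments are illustrative:

    user@cr2-eqsin> show system alarms          # the disk 0 failure should be the only active alarm
    user@cr2-eqsin> show system storage         # confirm which disk/partition is actually in use
    user@cr2-eqsin> show ospf neighbor          # adjacencies to cr3-eqsin and ulsfo back up
    user@cr2-eqsin> show bgp summary            # transit, IX and PyBal sessions established
    user@cr2-eqsin> show bgp group pybal        # the lvs500[4-6] neighbours specifically

    # on the load balancers, confirm PyBal is peering and advertising again
    sudo cumin 'A:lvs-eqsin' 'journalctl -u pybal --since "15 min ago" | tail -n 20'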
[22:19:52] looks good as well I think from whatever I know on how to check :)
[22:19:58] thanks topranks! <3
[22:20:42] no signs of an issue - not expecting any tbh
[22:20:56] I'll wait to see traffic levels reflected in LibreNMS though but should be fine
[22:21:31] think you should head off now, it's quite late
[22:21:45] I will step away for a bit but keep IRC open just in case and check in some time as well
[22:22:24] thanks for stepping up as always -- really doubt we could have done it without you knowing the intricacies of missing configs between crs and all :)
[22:22:59] updating status page and then I will go AFK
[22:23:07] ok thanks for the help dude
[22:23:44] yeah restoring from the rancid config's not too tricky but there are a couple of gotchas which might trip up someone less familiar with junos
[22:55:03] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.014e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[23:38:18] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1076319
[23:38:18] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1076319 (owner: TrainBranchBot)
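The depool and repool steps that bracket this incident were all driven by the sre.dns.admin cookbook, as the !log entries show. A sketch of what those invocations generally look like from a cumin host; the exact flags and argument order are an assumption inferred from the !log output, so check the cookbook's --help before relying on them:

    # assumed syntax, not a verified copy of the cookbook CLI
    sudo cookbook sre.dns.admin --reason "cr2-eqsin down" depool eqsin
    sudo cookbook sre.dns.admin --reason "cr2-eqsin config restored" --task-id T375961 pool eqsin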