[00:03:55] RESOLVED: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:03:55] RESOLVED: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:04:47] PROBLEM - SSH on an-airflow1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:04:57] FIRING: [3x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:06:37] RECOVERY - SSH on an-airflow1006 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:08:12] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1076279 (owner: TrainBranchBot)
[00:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[01:26:11] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:03:13] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:38:14] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:59:29] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:04:57] FIRING: [3x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:16:01] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:16:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:17:31] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:20:21] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 10 Dec 2024 11:59:32 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:20:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:20:51] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:26:01] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:26:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:27:31] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:27:55] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 2.744 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:28:21] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 10 Dec 2024 11:59:32 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:28:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[05:40:16] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375776#10185258 (phaultfinder)
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter2006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter2006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:05:12] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[09:29:45] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 85712136 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[09:31:45] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 39160 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:10:17] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:53:23] (CR) Gergő Tisza: [C:+1] [beta-cluster] Enable cookie-based SUL3 feature flag [mediawiki-config] - https://gerrit.wikimedia.org/r/1076195 (https://phabricator.wikimedia.org/T375787) (owner: D3r1ck01)
[11:05:27] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:12:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 910.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:17:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 821.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:32:40] SRE, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, netops: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10185344 (taavi)
[11:32:47] SRE, Cloud-VPS, Infrastructure-Foundations, netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#10185345 (taavi)
[11:33:09] SRE, Cloud-VPS, Infrastructure-Foundations, netops: netbox: create IPv6 entries for Cloud VPS - https://phabricator.wikimedia.org/T374712#10185346 (taavi)
[11:33:19] SRE, cloud-services-team, Cloud-VPS: openstack: verify security groups settings for IPv6 - https://phabricator.wikimedia.org/T374714#10185347 (taavi)
[11:33:29] SRE, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, netops: openstack: work out IPv6 and designate integration - https://phabricator.wikimedia.org/T374715#10185348 (taavi)
[11:33:34] SRE, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, netops: cloudgw: add support and enable IPv6 - https://phabricator.wikimedia.org/T374716#10185349 (taavi)
[11:33:41] SRE, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, netops: cloudsw: codfw: enable IPv6 - https://phabricator.wikimedia.org/T374713#10185350 (taavi)
[11:33:47] SRE, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, netops: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10185351 (taavi)
[11:37:31] (PS3) Majavah: P:acme_chief: allow enabling http-01 spport [puppet] - https://gerrit.wikimedia.org/r/1011167 (https://phabricator.wikimedia.org/T342398)
[11:37:31] (PS3) Majavah: P:wmcs::novaproxy: proxy http-01 challenges to acme-chief [puppet] - https://gerrit.wikimedia.org/r/1011168 (https://phabricator.wikimedia.org/T342398)
[12:35:28] (PS1) Majavah: P:wmcs::metricsinfra::haproxy: migrate to HAProxy internal exporter [puppet] - https://gerrit.wikimedia.org/r/1076297 (https://phabricator.wikimedia.org/T343885)
[12:36:07] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4150/co" [puppet] - https://gerrit.wikimedia.org/r/1076297 (https://phabricator.wikimedia.org/T343885) (owner: Majavah)
[12:43:26] (CR) Majavah: [C:-1] cloud-vps dynamic proxy: prometheus stats from nginx access logs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1059125 (https://phabricator.wikimedia.org/T371382) (owner: Andrew Bogott)
[12:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[13:05:08] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375776#10185499 (phaultfinder)
[13:18:38] SRE-OnFire, cloud-services-team, Cloud-VPS, Patch-For-Review, Sustainability (Incident Followup): Add external meta-monitoring for metricsinfra - https://phabricator.wikimedia.org/T288053#10185514 (taavi)
[13:42:51] SRE, cloud-services-team, Cloud-VPS: ceph: test and decide 1 network interface setup - https://phabricator.wikimedia.org/T325531#10185615 (taavi)
[13:49:10] (PS1) Cwhite: logstash: cast airflow caller field to string [puppet] - https://gerrit.wikimedia.org/r/1076299
[14:38:14] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:59:29] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:05:27] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:24:57] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:30:03] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:50:19] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:50:31] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 215, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:20:03] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:24:57] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[17:05:18] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375776#10185674 (phaultfinder)
[18:15:12] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375776#10185705 (phaultfinder)
[20:24:51] PROBLEM - Host cr2-eqsin.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[20:24:52] PROBLEM - Host cr2-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[20:24:52] PROBLEM - Host cr2-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[20:25:12] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:25:53] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 68, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:26:37] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:26:49] PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:26:54] wow
[20:27:05] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 439 probes of 783 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:27:22] getting to a laptop
[20:29:36] looks like the router is down
[20:29:50] ports connecting to it on cr3-eqsin hard down, BGP/OSPF adjacency down
[20:29:53] RECOVERY - Host cr2-eqsin.mgmt is UP: PING OK - Packet loss = 0%, RTA = 222.76 ms
[20:29:55] yeah and it's not a false positive, it is actually down
[20:30:01] I'm checking now if the serial console shows any life
[20:30:03] hmm.....
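The "actually down" call above comes from checking cr2-eqsin's neighbours rather than relying on the ping alert alone. A rough sketch of that kind of confirmation, assuming access to a bastion host and to the cr3-eqsin CLI; the match patterns here are illustrative, not taken from the incident:

    # from a bastion / cumin host: is the router reachable at all?
    ping -c 3 cr2-eqsin.wikimedia.org

    # on the adjacent router: are the links and routing adjacencies towards cr2 down?
    user@cr3-eqsin> show interfaces descriptions | match cr2
    user@cr3-eqsin> show ospf neighbor
    user@cr3-eqsin> show bgp summary | match 103.102.166.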
[20:30:08] I am depooling eqsin
[20:30:21] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqsin [reason: cr2-eqsin down, no task ID specified]
[20:30:33] ulsfo is also depooled sigh
[20:30:38] this would result in terrible latency for eqsin
[20:30:51] maybe hold on then
[20:31:06] off-peak hours in east asia at least
[20:31:09] yep
[20:31:53] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:32:47] Bad week for juniper hw
[20:32:57] topranks: should we depool at least in the meantime?
[20:33:35] re-reading your comment - if the result is just poor latency I'd say go ahead for the time being
[20:33:40] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqsin [reason: cr2-eqsin down, no task ID specified]
[20:33:41] done
[20:33:49] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:33:55] "just"
[20:34:27] well I mean as opposed to us saturating something / packet loss which is worse
[20:34:29] well what choice do we have?
[20:34:49] in theory we can run off one router, but indeed best not to
[20:35:23] I wonder if T375345 and this are related
[20:35:24] T375345: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345
[20:35:27] but I think depooling buys us some time, and hopefully off peak for the region
[20:36:11] !incidents
[20:36:12] 5285 (ACKED) Host cr2-eqsin - PING - Packet loss = 100%
[20:36:15] !ack 5285
[20:36:15] 5285 (ACKED) Host cr2-eqsin - PING - Packet loss = 100%
[20:37:49] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:37:49] RECOVERY - OSPF status on mr1-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:37:56] RECOVERY - Host cr2-eqsin is UP: PING OK - Packet loss = 0%, RTA = 222.79 ms
[20:38:34] well
[20:40:09] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:40:10] device still reports being down in librenms though and I can't SSH
[20:40:17] RECOVERY - Host cr2-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 222.85 ms
[20:40:17] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[20:40:25] !incidents
[20:40:26] 5285 (ACKED) Host cr2-eqsin - PING - Packet loss = 100%
[20:40:26] 5286 (UNACKED) NELHigh sre (thanos-rule tcp.timed_out)
[20:40:31] !ack 5286
[20:40:32] 5286 (ACKED) NELHigh sre (thanos-rule tcp.timed_out)
[20:40:43] FIRING: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[20:42:07] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 14 probes of 783 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:42:52] we can probably survive for some time but eqsin gets busy again and then that might get interesting
[20:43:58] on to the router now via serial, it seems healthy for the past 10 mins
[20:43:59] We could get some link saturation in codfw
[20:44:21] unsure of reboot reason, it appears to have booted off the backup system partition so perhaps some file corruption / disk issue
[20:44:40] topranks: you know better but
[20:44:46] what does show system alarms say?
[20:45:17] it just shows the problem with disk 0
[20:45:17] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[20:45:43] RESOLVED: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[20:49:02] the logs from before it reset aren't available on the local device as they are on the offline disk
[20:49:20] topranks: so the question is then I guess: do we have enough confidence to repool eqsin?
[20:53:10] I am updating the status page
[20:53:16] if someone has objections, please let me know
[20:53:44] Given that we are in low-traffic hours for eqsin we could repool and see how the router behaves
[20:53:51] ok
[20:53:59] I am fine with it too to test but not unless topranks is confident I guess
[20:54:09] because that would then depend on the severity of the error
[20:54:12] Sure
[20:54:22] if that means we are seeing something similar to what we saw with cr3-ulsfo, not sure
[20:54:24] let's hold off for a few mins if we can
[20:54:31] topranks: all good take your time please
[20:54:35] (and let us know if we can help)
[20:54:38] I am updating the status page in the meantime
[20:54:47] this has some similarities to cr3-ulsfo but is also different
[20:54:49] disk issue
[20:54:55] we had something similar before
[20:54:55] https://phabricator.wikimedia.org/T372781
[20:55:14] same symptoms here - difference is the fix there was rebooting the redundant routing engine in the MX480 in eqiad
[20:55:33] this device - MX204 - does not have a redundant RE to reboot without affecting anything
[20:55:37] this one is the 204?
[20:55:38] right
[20:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[21:10:29] SRE, Infrastructure-Foundations, netops: cr2-eqsin disk failure Sept 2024 - https://phabricator.wikimedia.org/T375961 (cmooney) NEW p:Triage→High
[21:10:35] ok we have been stable for 38 mins now
[21:10:39] I think we should try a repool
[21:11:05] given the potential resource constraints in codfw and high latency from the region
[21:11:35] ok
[21:11:38] let's do it
[21:12:00] if - as it appears on the face of things - it's just a disk failure and things are working ok from the backup partition, then no reason it won't be ok like that for a while
[21:12:09] so yeah +1 from me for repool
[21:12:13] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site eqsin [reason: testing pooling after cr2-eqsin was down and site was depooled, no task ID specified]
[21:12:16] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqsin [reason: testing pooling after cr2-eqsin was down and site was depooled, no task ID specified]
[21:12:19] pooled
[21:12:35] thanks for filing the task. incident doc is https://docs.google.com/document/d/1eBOZe9bTZ9kGJPxSmsu8ToQXlPLB0H_S_W9aM2SIeGA/edit
[21:12:38] updating as we go
[21:13:11] keeping an eye out on errors (you can leave the CDN site to me and just observe the cr)
[21:13:26] sukhe: maybe I spoke too soon
[21:13:30] leave it for now
[21:13:35] it's pooled
[21:13:38] you want me to depool?
[21:14:18] yeah - sorry
[21:14:24] ok
[21:14:26] BGP is down from it to all the LVS for some reason
[21:14:33] oh
[21:14:35] restart probably
[21:14:36] wait
[21:15:01] !log sudo cumin 'A:lvs-eqsin' 'systemctl restart pybal'
[21:15:03] check now
[21:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:55] nope
[21:15:57] it's down
[21:15:57] ok
[21:16:03] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqsin [reason: testing failed, depooling again, no task ID specified]
[21:16:16] topranks: I am depooling, it doesn't seem like we are in a good state
[21:16:19] ok?
[21:17:00] SRE, Infrastructure-Foundations, netops: cr2-eqsin disk failure Sept 2024 - https://phabricator.wikimedia.org/T375961#10185783 (cmooney) Actually looking at the output in more detail BGP to the LVS servers / PyBal is down. ` Peer AS InPkt OutPkt OutQ Flaps Last Up/Dw...
[21:17:02] yeah, errors
[21:17:34] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqsin [reason: testing failed, depooling again, no task ID specified]
[21:17:48] sukhe: my guess is that's just PyBal not dealing with things right for some reason
[21:18:23] I think we should restart the service, I guess we can try with lvs5003 anyway?
[21:18:31] topranks: I did that but it didn't help
[21:18:32] so I depooled
[21:18:58] huh
[21:19:01] but now since it's depooled, we can experiment with it again but we should check to make sure that the session is actually established
[21:19:17] Sep 28 21:15:25 lvs5004 pybal[35057]: [bgp.BGPFactory@0x7f185931c6e0] INFO: Client connection failed: User timeout caused connection failure.
[21:19:20] Sep 28 21:15:25 lvs5004 pybal[35057]: [bgp.FSM@0x7f185931f090 peer 103.102.166.130] INFO: State is now: IDLE
[21:19:23] Sep 28 21:15:55 lvs5004 pybal[35057]: [bgp.BGPFactory@0x7f185931c6e0] INFO: Client connection failed: User timeout caused connection failure.
[21:19:27] any logs on the pybal side as to what's not working?
[21:19:29] hmm
[21:20:08] has to be pybal I think, 91 working BGP sessions on the router, but all 3 pybals down
[21:20:13] hmmm
[21:21:08] checking
[21:21:23] topranks: maybe it was a question of timing
[21:21:35] nope
[21:22:28] nah
[21:22:34] the router has an outdated config
[21:22:43] https://phabricator.wikimedia.org/T321545#8341024
[21:22:45] it's set up to peer with lvs500[1-3]
[21:22:50] wow
[21:22:55] ok
[21:22:59] not lvs500[4-6]
[21:23:04] and that didn't kick in all this time because it wasn't rebooted?
[21:23:10] the uptime was like 2 years
[21:23:46] tbh I don't know
[21:23:59] I thought the snapshot was a regular thing
[21:24:07] but cr3-eqsin looks fine?
[21:24:13] 10.132.0.6 64600 17 17 0 31 2:22 Establ
[21:24:16] 10.132.0.7 64600 53 56 0 32 8:29 Establ
[21:24:19] 10.132.0.39 64600 53 56 0 33 8:25 Establ
[21:24:22] these are all the correct LVSes
[21:25:08] yeah it's working fine
[21:25:13] cr2 rebooted with an old config
[21:25:20] also this, despite ping working ok
[21:25:21] ERROR:homer:Attempt 1/3 failed: Unable to connect to cr2-eqsin.wikimedia.org
[21:25:34] topranks: fwiw I still can't SSH
[21:25:45] hmm... why can I ?
[21:26:09] anyway, aside from that, where do we go from here?
[21:26:36] what does ssh -v show ?
[21:26:49] I'll get the config sorted as a first step
[21:27:10] I think it's just missing your SSH key tbh
[21:27:23] but cr3 has it. that can happen?
[21:27:33] that's not the focus to be clear but I was curious if that points to something else
[21:29:05] these are different devices
[21:29:20] they can definitely have different configurations, different "authorized_keys" files effectively
[21:30:15] wow ok, I thought that the key was synced given there is only one place you put the key and that it works for all routers
[21:30:28] topranks: coming back, did updating cr2 for the new LVS IPs help?
[21:31:08] homer pushes the key
[21:31:16] it's the same as for our servers really
[21:31:35] em didn't quite get that far, homer push is now failing on me due to missing cert for gnmi
[21:31:58] can you share the error? maybe some rubber duck debugging might help
[21:37:18] nah it's ok I think
[21:37:37] just need some cert to be there - I've copied the one from cr3-eqsin
[21:39:09] ok
[21:40:11] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:40:14] ok
[21:40:16] that's a good sign
[21:40:22] see if you can ssh now
[21:40:39] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:40:42] I'll do a manual diff on the remaining config to catch anything else in the meantime
[21:40:55] topranks: yep works!
[21:42:19] BGP sessions still show "active".
[21:42:25] wait no, old LVS IPs.
[21:47:03] yeah.... is PyBal BGP not done by Homer??
[21:47:13] INFO:homer.transports.junos:Empty diff for cr2-eqsin.wikimedia.org, skipping device.
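The homer lines above come from Homer, the tool that renders each device's intended configuration (from Netbox data plus YAML) and pushes it to the router, so a stale on-device config would normally show up as a diff before a commit. A minimal sketch of how that is typically driven from a cumin host; the device query and commit message are illustrative assumptions, not what was run here:

    # sketch only: query and commit message are assumptions
    homer 'cr2-eqsin*' diff                                          # compare the generated config against the live device
    homer 'cr2-eqsin*' commit "restore cr2-eqsin config - T375961"   # push the generated config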
[21:48:59] I don't see why and how it would be any different for cr2-eqsin though
[21:49:30] no it shouldn't be
[21:49:42] but I hardly think homer saying no diff is any issue
[21:49:54] so I assume the PyBal group is still being done manually for some reason
[21:50:00] maybe an oversight, we don't have many LVS
[21:50:37] BGP set to false for those lvs in netbox I think is the issue
[21:51:00] which then means that it predates the BGP switch to Netbox
[21:51:09] but yeah, +1 for fixing that
[21:52:39] ok that's added the right neighbors now
[21:52:48] sukhe@cr2-eqsin> show bgp group pybal
[21:52:51] looks ok yep
[21:52:59] you got a diff for the homer changes as well?
[21:53:12] BGP session established!
[21:53:13] looks good
[21:53:17] https://www.irccloud.com/pastebin/JKFwGjec/
[21:54:37] pybal looks good fwiw
[21:54:44] if everything else also does, we can try again
[21:55:29] it's super late for you as well. you can leave this to me and worst case if we can't repool eqsin, I might need to pick up the phone to make a decision by bringing in at least one other SRE :)
[21:59:11] yeah just gimme a minute, applying the other config diffs from the rancid backup config now
[21:59:31] all good. if you want me to pick up something, happy to
[22:04:25] ok I'm fairly happy
[22:04:44] a few BGP peers I don't have a way to find the MD5 auth keys for without maybe trawling through emails
[22:04:55] but it's ok I think - most are up and all transit is up
[22:05:01] router has been stable since the reset
[22:05:09] config is now up to date minus those bgp keys
[22:05:19] ok
[22:05:55] do we have a diff of the peers that are missing?
[22:06:21] one more question:
[22:06:26] from the motd,
[22:06:27] "Note: VM Host is Currently running from alternate disk
[22:06:28] "
[22:09:18] SRE, Infrastructure-Foundations, netops: cr2-eqsin disk failure Sept 2024 - https://phabricator.wikimedia.org/T375961#10185804 (cmooney) The obvious thing I didn't at first spot was the config on the router was seriously out of date. The PyBal group had the old lvs500[1-3] configured, which have lon...
[22:09:28] no peers are missing, but some are down due to missing MD5 in the config
[22:09:35] Groups: 17 Peers: 282 Down peers: 18
[22:09:38] cr2
[22:09:41] Groups: 17 Peers: 427 Down peers: 18
[22:09:42] cr3
[22:09:52] yeah
[22:11:31] ah so the IX peers are down
[22:11:32] ok
[22:12:23] yeah just a few
[22:12:34] I listed them on the task and set the NDA policy now
[22:12:38] time to repool then? I mean things look OK but well
[22:12:38] no biggie
[22:12:42] yeah +1
[22:12:49] the only way to find out is to test it I guess
[22:12:49] ok
[22:13:07] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site eqsin [reason: repool eqsin take two, cr2-eqsin config restored, no task ID specified]
[22:13:14] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.dns.admin (exit_code=99) DNS admin: pool site eqsin [reason: repool eqsin take two, cr2-eqsin config restored, no task ID specified]
[22:13:22] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site eqsin [reason: repool eqsin take two, cr2-eqsin config restored, T375961]
[22:13:26] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqsin [reason: repool eqsin take two, cr2-eqsin config restored, T375961]
[22:13:31] done
[22:14:10] ok - let's see how it goes!
[22:14:24] !
[22:16:48] seems to be holding up OK
[22:16:54] can you verify if cr2 is holding up fine?
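For the "is cr2 holding up" question just above, the checks referenced throughout this log are standard Junos operational commands plus a look at PyBal on the load balancers. A rough sketch of what such a verification pass could look like, assuming CLI access to the router and sudo on a cumin host; the time window and the expected results in the comments are illustrative:

    user@cr2-eqsin> show system alarms          # the disk 0 failure should be the only active alarm
    user@cr2-eqsin> show system storage         # confirm which disk/partition is actually in use
    user@cr2-eqsin> show ospf neighbor          # adjacencies to cr3-eqsin and ulsfo back up
    user@cr2-eqsin> show bgp summary            # transit, IX and PyBal sessions established
    user@cr2-eqsin> show bgp group pybal        # the lvs500[4-6] neighbours specifically

    # on the load balancers, confirm PyBal is peering and advertising again
    sudo cumin 'A:lvs-eqsin' 'journalctl -u pybal --since "15 min ago" | tail -n 20'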
[22:19:52] looks good as well I think from whatever I know on how to check :)
[22:19:58] thanks topranks! <3
[22:20:42] no signs of an issue - not expecting any tbh
[22:20:56] I'll wait to see traffic levels reflected in LibreNMS though but should be fine
[22:21:31] think you should head off now, it's quite late
[22:21:45] I will step away for a bit but keep IRC open just in case and check in some time as well
[22:22:24] thanks for stepping up as always -- really doubt we could have done it without you knowing the intricacies of missing configs between crs and all :)
[22:22:59] updating status page and then I will go AFK
[22:23:07] ok thanks for the help dude
[22:23:44] yeah restoring from the rancid config's not too tricky but there are a couple of gotchas which might trip up someone less familiar with junos
[22:55:03] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.014e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[23:38:18] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1076319
[23:38:18] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1076319 (owner: TrainBranchBot)
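The depool and repool steps that bracket this incident were all driven by the sre.dns.admin cookbook, as the !log entries show. A sketch of what those invocations generally look like from a cumin host; the exact flags and argument order are an assumption inferred from the !log output, so check the cookbook's --help before relying on them:

    # assumed syntax, not a verified copy of the cookbook CLI
    sudo cookbook sre.dns.admin --reason "cr2-eqsin down" depool eqsin
    sudo cookbook sre.dns.admin --reason "cr2-eqsin config restored" --task-id T375961 pool eqsin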