[00:31:24] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:36:24] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:38:42] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1094576
[00:38:42] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1094576 (owner: TrainBranchBot)
[01:08:30] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1094579
[01:08:30] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1094579 (owner: TrainBranchBot)
[01:09:32] PROBLEM - Disk space on thanos-be1003 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdh1 181010 MB (4% inode=92%): /srv/swift-storage/sdc1 189665 MB (4% inode=91%): /srv/swift-storage/sdf1 218427 MB (5% inode=91%): /srv/swift-storage/sdg1 196653 MB (5% inode=91%): /srv/swift-storage/sdd1 181869 MB (4% inode=91%): /srv/swift-storage/sde1 195988 MB (5% inode=92%): /srv/swift-storage/sdi1 191913 MB (5% inode=91%): /srv/swift-st
[01:09:32] k1 184530 MB (4% inode=92%): /srv/swift-storage/sdj1 179715 MB (4% inode=91%): /srv/swift-storage/sdl1 178634 MB (4% inode=91%): /srv/swift-storage/sdm1 188623 MB (4% inode=91%): /srv/swift-storage/sdn1 152472 MB (3% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops
[01:11:24] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:12:55] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1094576 (owner: TrainBranchBot)
[01:41:58] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1094579 (owner: TrainBranchBot)
[01:45:24] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[02:15:27] !log decommissioning Cassandra/restbase2023-{a,b,c} — T380236
[02:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:15:31] T380236: Refresh restbase202[1-3] w/ restbase203[6-8] - https://phabricator.wikimedia.org/T380236
[02:21:24] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:36:26] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:47:05] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:51:24] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:56:24] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:41:30] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:51:24] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:07:10] (PS1) Pppery: Ncredir: Use funnel rather than rewrite for deeplinked destinations [puppet] - https://gerrit.wikimedia.org/r/1094705 (https://phabricator.wikimedia.org/T380667)
[04:09:06] (PS2) Pppery: Ncredir: Use funnel rather than rewrite for deeplinked destinations [puppet] - https://gerrit.wikimedia.org/r/1094705 (https://phabricator.wikimedia.org/T380667)
[04:21:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:26:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:41:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:45:22] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:54:14] PROBLEM - Disk space on thanos-be1001 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdf1 207583 MB (5% inode=92%): /srv/swift-storage/sdg1 196790 MB (5% inode=91%): /srv/swift-storage/sdc1 193277 MB (5% inode=91%): /srv/swift-storage/sdi1 182345 MB (4% inode=91%): /srv/swift-storage/sde1 177193 MB (4% inode=91%): /srv/swift-storage/sdh1 174643 MB (4% inode=91%): /srv/swift-storage/sdj1 207682 MB (5% inode=91%): /srv/swift-st
[04:54:14] k1 203118 MB (5% inode=92%): /srv/swift-storage/sdd1 150958 MB (3% inode=90%): /srv/swift-storage/sdm1 195281 MB (5% inode=92%): /srv/swift-storage/sdl1 177985 MB (4% inode=91%): /srv/swift-storage/sdn1 168714 MB (4% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1001&var-datasource=eqiad+prometheus/ops
[05:01:26] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:16:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:56:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:01:26] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:06:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:47:05] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:40:13] (CR) Raymond Ndibe: profile::manifests::toolforge::bastion: harbor to /etc/toolforge/common.yaml (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1090520 (https://phabricator.wikimedia.org/T358225) (owner: Raymond Ndibe)
[07:41:59] (CR) Raymond Ndibe: "We can merge it ASAP. I don't expect any problems since we don't explicitly require the s3 config to be provided in this patch. tools and " [puppet] - https://gerrit.wikimedia.org/r/1093856 (https://phabricator.wikimedia.org/T350687) (owner: Raymond Ndibe)
[07:47:05] FIRING: [13x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:22:10] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:22:10] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:22:20] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:42:10] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:47:26] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[09:11:24] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.196 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:12:10] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:26:30] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:01:32] PROBLEM - BGP status on cr2-eqiad is CRITICAL: Use of uninitialized value duration in numeric gt () at /usr/lib/nagios/plugins/check_bgp line 323. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:11:24] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:15:12] PROBLEM - Host ganeti2042 is DOWN: PING CRITICAL - Packet loss = 100%
[11:19:56] RECOVERY - Host ganeti2042 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms
[11:24:04] PROBLEM - SSH on ganeti2042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:26:38] PROBLEM - Host ganeti2042 is DOWN: PING CRITICAL - Packet loss = 100%
[11:28:56] RECOVERY - SSH on ganeti2042 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:28:58] RECOVERY - Host ganeti2042 is UP: PING OK - Packet loss = 0%, RTA = 30.38 ms
[11:47:05] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:05:26] !log btullis@cumin1002 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons.
[12:08:57] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop test cluster: Restart of jvm daemons.
[12:12:54] (PS2) Gergő Tisza: Disable more extensions when using the shared login domain [mediawiki-config] - https://gerrit.wikimedia.org/r/1094071 (https://phabricator.wikimedia.org/T373737)
[12:14:31] (CR) Gergő Tisza: "PS2: just added some comments (mainly to point out that Allow(User|Site)(Js|Css) doesn't do what one would expect it to do)." [mediawiki-config] - https://gerrit.wikimedia.org/r/1094071 (https://phabricator.wikimedia.org/T373737) (owner: Gergő Tisza)
[12:18:15] (PS1) Gergő Tisza: Allow simulating the SUL3 shared domain settings via env var [mediawiki-config] - https://gerrit.wikimedia.org/r/1095082 (https://phabricator.wikimedia.org/T380575)
[12:32:17] (CR) Gergő Tisza: [C:+1] "Looks good as far as I can tell (which is not very far). See https://wikitech.wikimedia.org/wiki/Release_Engineering/Runbook/Puppet_patche" [puppet] - https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: D3r1ck01)
[12:35:36] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:57:05] FIRING: [14x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:01:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:02:05] FIRING: [14x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:07:32] PROBLEM - Disk space on thanos-be1002 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sde1 175149 MB (4% inode=91%): /srv/swift-storage/sdc1 152451 MB (3% inode=90%): /srv/swift-storage/sdf1 178983 MB (4% inode=91%): /srv/swift-storage/sdd1 188754 MB (4% inode=91%): /srv/swift-storage/sdg1 173432 MB (4% inode=91%): /srv/swift-storage/sdh1 171727 MB (4% inode=91%): /srv/swift-storage/sdi1 202423 MB (5% inode=92%): /srv/swift-st
[13:07:32] j1 175176 MB (4% inode=92%): /srv/swift-storage/sdk1 172377 MB (4% inode=91%): /srv/swift-storage/sdm1 169920 MB (4% inode=92%): /srv/swift-storage/sdn1 174057 MB (4% inode=91%): /srv/swift-storage/sdl1 157663 MB (4% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1002&var-datasource=eqiad+prometheus/ops
[13:10:58] PROBLEM - Disk space on thanos-be2001 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdg1 192166 MB (5% inode=92%): /srv/swift-storage/sdd1 189708 MB (4% inode=91%): /srv/swift-storage/sdc1 180579 MB (4% inode=91%): /srv/swift-storage/sdf1 168174 MB (4% inode=91%): /srv/swift-storage/sdh1 158649 MB (4% inode=90%): /srv/swift-storage/sdi1 152371 MB (3% inode=90%): /srv/swift-storage/sde1 184835 MB (4% inode=92%): /srv/swift-st
[13:10:58] j1 189944 MB (4% inode=91%): /srv/swift-storage/sdk1 192566 MB (5% inode=91%): /srv/swift-storage/sdm1 156176 MB (4% inode=90%): /srv/swift-storage/sdl1 193751 MB (5% inode=92%): /srv/swift-storage/sdn1 172133 MB (4% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops
[13:34:14] PROBLEM - Disk space on thanos-be2004 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdf1 185855 MB (4% inode=92%): /srv/swift-storage/sdg1 209428 MB (5% inode=91%): /srv/swift-storage/sdc1 152242 MB (3% inode=90%): /srv/swift-storage/sdh1 174271 MB (4% inode=91%): /srv/swift-storage/sde1 182455 MB (4% inode=91%): /srv/swift-storage/sdd1 157630 MB (4% inode=91%): /srv/swift-storage/sdj1 182492 MB (4% inode=91%): /srv/swift-st
[13:34:14] k1 165658 MB (4% inode=91%): /srv/swift-storage/sdi1 174149 MB (4% inode=91%): /srv/swift-storage/sdl1 194475 MB (5% inode=91%): /srv/swift-storage/sdn1 190812 MB (5% inode=91%): /srv/swift-storage/sdm1 189478 MB (4% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2004&var-datasource=codfw+prometheus/ops
[13:35:38] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:39:14] Puppet, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, Patch-For-Review: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10350696 (Andrew) Just found another VM where this happened: liwa3-2.linkwatcher.eqiad1.wikimedia.cloud
[13:45:28] PROBLEM - Hadoop Namenode - Stand By on an-master1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[13:48:28] RECOVERY - Hadoop Namenode - Stand By on an-master1004 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[13:52:18] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:52:30] PROBLEM - Disk space on thanos-be2002 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdd1 181057 MB (4% inode=91%): /srv/swift-storage/sdc1 158609 MB (4% inode=91%): /srv/swift-storage/sdg1 181704 MB (4% inode=92%): /srv/swift-storage/sdh1 192327 MB (5% inode=92%): /srv/swift-storage/sde1 166885 MB (4% inode=91%): /srv/swift-storage/sdi1 150923 MB (3% inode=90%): /srv/swift-storage/sdj1 158005 MB (4% inode=91%): /srv/swift-st
[13:52:30] k1 173617 MB (4% inode=91%): /srv/swift-storage/sdl1 172126 MB (4% inode=91%): /srv/swift-storage/sdm1 190667 MB (5% inode=91%): /srv/swift-storage/sdn1 175385 MB (4% inode=91%): /srv/swift-storage/sdf1 157687 MB (4% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2002&var-datasource=codfw+prometheus/ops
[13:56:46] (PS1) Urbanecm: [Growth] enwiki: Deploy Add Link to 2% of new users [mediawiki-config] - https://gerrit.wikimedia.org/r/1095126 (https://phabricator.wikimedia.org/T377631)
[14:20:38] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:05:38] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:14:08] (PS1) Majavah: P:toolforge: mail: Drop support for .wmflabs VM names [puppet] - https://gerrit.wikimedia.org/r/1095189 (https://phabricator.wikimedia.org/T380679)
[15:14:10] (PS1) Majavah: puppet_compiler: Drop support for .wmflabs VM names [puppet] - https://gerrit.wikimedia.org/r/1095190 (https://phabricator.wikimedia.org/T380679)
[15:14:11] (PS1) Majavah: P:cumin: Drop support for .wmflabs VM names [puppet] - https://gerrit.wikimedia.org/r/1095191 (https://phabricator.wikimedia.org/T380679)
[15:14:13] (PS1) Majavah: openstack: admin_scripts: Remove support for .wmflabs VM names [puppet] - https://gerrit.wikimedia.org/r/1095192 (https://phabricator.wikimedia.org/T380679)
[15:14:15] (PS1) Majavah: openstack: puppet: Drop support for .wmflabs names [puppet] - https://gerrit.wikimedia.org/r/1095193 (https://phabricator.wikimedia.org/T380679)
[15:14:39] (CR) Majavah: [C:-2] "needs to wait until deployment-cumin is gone" [puppet] - https://gerrit.wikimedia.org/r/1095192 (https://phabricator.wikimedia.org/T380679) (owner: Majavah)
[15:30:44] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:30:58] PROBLEM - Disk space on thanos-be1004 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sde1 174993 MB (4% inode=91%): /srv/swift-storage/sdc1 170284 MB (4% inode=91%): /srv/swift-storage/sdh1 152317 MB (3% inode=90%): /srv/swift-storage/sdd1 155421 MB (4% inode=91%): /srv/swift-storage/sdf1 162876 MB (4% inode=91%): /srv/swift-storage/sdg1 192592 MB (5% inode=92%): /srv/swift-storage/sdi1 178086 MB (4% inode=91%): /srv/swift-st
[15:30:58] j1 159775 MB (4% inode=91%): /srv/swift-storage/sdl1 176253 MB (4% inode=92%): /srv/swift-storage/sdk1 176735 MB (4% inode=91%): /srv/swift-storage/sdm1 173191 MB (4% inode=91%): /srv/swift-storage/sdn1 166681 MB (4% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1004&var-datasource=eqiad+prometheus/ops
[15:35:38] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:38:12] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:38:26] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:39:12] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:36:37] (PS1) Lucas Werkmeister: deployment-prep: Remove leftover hhvm config [puppet] - https://gerrit.wikimedia.org/r/1095282
[16:41:28] (CR) Lucas Werkmeister: "CC Effie who removed HHVM from Puppet five years ago :)" [puppet] - https://gerrit.wikimedia.org/r/1095282 (owner: Lucas Werkmeister)
[17:02:05] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:17:38] PROBLEM - Disk space on thanos-be2003 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdf1 177827 MB (4% inode=91%): /srv/swift-storage/sdd1 172580 MB (4% inode=91%): /srv/swift-storage/sdc1 180377 MB (4% inode=92%): /srv/swift-storage/sdh1 154740 MB (4% inode=90%): /srv/swift-storage/sdi1 170768 MB (4% inode=91%): /srv/swift-storage/sdg1 164376 MB (4% inode=91%): /srv/swift-storage/sdk1 164156 MB (4% inode=91%): /srv/swift-st
[17:17:38] j1 158348 MB (4% inode=90%): /srv/swift-storage/sdl1 175128 MB (4% inode=91%): /srv/swift-storage/sde1 152002 MB (3% inode=91%): /srv/swift-storage/sdm1 154980 MB (4% inode=90%): /srv/swift-storage/sdn1 160494 MB (4% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2003&var-datasource=codfw+prometheus/ops
[17:44:25] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:16:41] (PS1) Pppery: Disable DeadEndPages and LonelyPages on Commons per community request [mediawiki-config] - https://gerrit.wikimedia.org/r/1095334
[18:17:19] (CR) CI reject: [V:-1] Disable DeadEndPages and LonelyPages on Commons per community request [mediawiki-config] - https://gerrit.wikimedia.org/r/1095334 (owner: Pppery)
[18:18:51] (PS2) Pppery: Disable DeadEndPages and LonelyPages on Commons per community request [mediawiki-config] - https://gerrit.wikimedia.org/r/1095334 (https://phabricator.wikimedia.org/T371662)
[18:19:34] (CR) CI reject: [V:-1] Disable DeadEndPages and LonelyPages on Commons per community request [mediawiki-config] - https://gerrit.wikimedia.org/r/1095334 (https://phabricator.wikimedia.org/T371662) (owner: Pppery)
[18:25:40] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:26:05] (PS3) Pppery: Disable DeadEndPages and LonelyPages on Commons per community request [mediawiki-config] - https://gerrit.wikimedia.org/r/1095334 (https://phabricator.wikimedia.org/T371662)
[18:32:23] (PS1) Pppery: Don't try to update Special:DeadEndPages on Commons [puppet] - https://gerrit.wikimedia.org/r/1095353 (https://phabricator.wikimedia.org/T371662)
[18:33:43] (PS2) Pppery: Don't try to update Special:DeadEndPages on Commons [puppet] - https://gerrit.wikimedia.org/r/1095353 (https://phabricator.wikimedia.org/T371662)
[18:35:40] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:46:42] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:01:44] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:20:46] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:25:44] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:02:06] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:44:25] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:17:28] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:33:38] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:36:46] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[22:39:16] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:40:48] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status