[00:02:13] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2006.codfw.wmnet [00:02:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:02:55] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet [00:03:33] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:07:37] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1079035 (owner: TrainBranchBot) [00:19:16] !log denisse@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host prometheus2005.codfw.wmnet [00:19:44] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet [00:19:54] FIRING: PyBalBGPUnstable: PyBal BGP sessions on instance lvs5004 with peer 103.102.166.130 are failing #page - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://grafana.wikimedia.org/d/000000488/pybal-bgp?var-datasource=eqsin%20prometheus/ops&var-server=lvs5004 - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable [00:24:29] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:24:36] !incidents [00:24:36] 5307 (ACKED) Host cr2-eqsin - PING - Packet loss = 100% [00:24:36] 5300 (RESOLVED) Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre) [00:24:37] 5302 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin) [00:24:37] 5304 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin) [00:24:37] 5306 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service codfw) [00:24:37] 5305 (RESOLVED) 
GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [00:24:37] 5303 (RESOLVED) ProbeDown sre (ip4 probes/service codfw) [00:24:38] 5301 (RESOLVED) ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw) [00:24:38] 5299 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [00:24:47] no idea why this page is not being sent to VO [00:24:54] RESOLVED: PyBalBGPUnstable: PyBal BGP sessions on instance lvs5004 with peer 103.102.166.130 are failing #page - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://grafana.wikimedia.org/d/000000488/pybal-bgp?var-datasource=eqsin%20prometheus/ops&var-server=lvs5004 - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable [00:24:55] we are still waiting to bring up cr2-eqsin [00:26:38] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet [00:27:15] !log denisse@cumin2002 START - Cookbook sre.hosts.reboot-single for host prometheus1006.eqiad.wmnet [00:27:29] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:32:37] sukhe: just noticed in the incident list that 5307 never resolved, although I vaguely recall seeing a notification for that (afk at the moment). do you think it makes sense to manually resolve it? [00:37:31] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:37:49] swfrench-wmf: thanks, I will make sure to resolve it on cr2-eqsin is actually up. [00:37:52] hopefully soon :> [00:38:17] s/on/once [00:39:14] ah, thanks! 
for some reason I thought it was up / reachable, but degraded [00:39:26] FIRING: [3x] SystemdUnitCrashLoop: logstash.service crashloop on elastic2062:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [00:39:37] it's going to go down again for the Junos upgrade [00:40:29] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:40:50] got it, thanks [00:40:58] the flapping is related to that yep. sadly I have no way of silencing this that I am aware of. so I am just going to hope that the Junos upgrade fixes it :} [00:42:35] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:44:30] FIRING: [17x] SystemdUnitCrashLoop: logstash.service crashloop on elastic2055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [00:44:35] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:45:51] luckily, the BFD flaps don’t p.age, and hopefully it’s pretty clear to folks that eqsin is still depooled for anything that slips through whichever downtimes might still exist [00:45:58] yep [00:46:11] it took a lot longer than we expected [00:46:16] !log denisse@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host prometheus1006.eqiad.wmnet [00:46:19] it's still going to be at least an hour or more before we repool [00:46:28] the original downtime window was a mere 3 hours [00:46:31] we are now at 9 :) [00:47:45] FIRING: [3x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:49:26] RESOLVED: [17x] SystemdUnitCrashLoop: 
logstash.service crashloop on elastic2055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [00:51:26] RESOLVED: [3x] ProbeDown: Service prometheus1006:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:52:35] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:53:35] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:55:23] ops-codfw, SRE, DC-Ops, fundraising-tech-ops, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10216097 (Papaul) We did phase 2 today, all the 1G nodes are now connected to the new fasw2-c8a/b. We will me moving the 10G nodes next week. Thanks to @Jgree... [00:58:35] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:00:35] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:01:46] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1198.eqiad.wmnet onto db1223.eqiad.wmnet [01:03:29] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:06:29] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:06:35] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:09:35] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:13:53] FIRING: 
KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:15:35] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:19:35] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:41:33] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 68, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:42:13] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:42:54] FIRING: [2x] PyBalBGPUnstable: PyBal BGP sessions on instance lvs5005 with peer 103.102.166.130 are failing #page - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable [01:43:43] RESOLVED: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [01:47:27] (PS1) Ssingh: team-sre: make PyBal alert paging (set severity) [alerts] - https://gerrit.wikimedia.org/r/1079049 [01:47:54] FIRING: [3x] PyBalBGPUnstable: PyBal BGP sessions on instance lvs5004 with peer 103.102.166.130 are failing #page - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable [01:48:43] RESOLVED: [2x] IPv4AnchorUnreachable: ipv4 ping to 
eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [01:49:13] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:49:31] RECOVERY - OSPF status on mr1-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:49:33] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:50:27] !log restart bird on doh5001 and dns5003 to resolve flapping BFD session after cr2-eqsin junos upgrade [01:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:52:54] RESOLVED: [3x] PyBalBGPUnstable: PyBal BGP sessions on instance lvs5004 with peer 103.102.166.130 are failing #page - https://wikitech.wikimedia.org/wiki/PyBal#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPyBalBGPUnstable [01:52:59] cool [02:02:32] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site eqsin [reason: repooling eqsin after cr2-eqsin replaced, T375961] [02:02:44] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqsin [reason: repooling eqsin after cr2-eqsin replaced, T375961] [02:07:50] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [02:11:27] FIRING: [2x] 
CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:13:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:22:27] !incidents [02:22:27] 5307 (ACKED) Host cr2-eqsin - PING - Packet loss = 100% [02:22:28] 5300 (RESOLVED) Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre) [02:22:28] 5302 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin) [02:22:28] 5304 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (kartotherian.discovery.wmnet eqsin) [02:22:28] 5306 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service codfw) [02:22:28] 5305 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [02:22:28] 5303 (RESOLVED) ProbeDown sre (ip4 probes/service codfw) [02:22:29] 5301 (RESOLVED) ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw) [02:22:29] 5299 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [02:23:12] !resolve 5307 [02:23:12] 5307 (ACKED) Host cr2-eqsin - PING - Packet loss = 100% [02:26:07] !incidents [02:26:07] 5307 (RESOLVED) Host cr2-eqsin - PING - Packet loss = 100% [02:26:07] 5300 (RESOLVED) Manual (paged) by Scott French (swfrench@wikimedia.org): need assistance - calico issues in codfw (please join #wikimedia-sre) [02:26:07] 5302 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqsin) [02:26:07] 5304 (RESOLVED) ATSBackendErrorsHigh cache_upload sre 
(kartotherian.discovery.wmnet eqsin) [02:26:08] 5306 (RESOLVED) [2x] ProbeDown sre (ip4 probes/service codfw) [02:26:08] 5305 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [02:26:08] 5303 (RESOLVED) ProbeDown sre (ip4 probes/service codfw) [02:26:09] 5301 (RESOLVED) ProbeDown sre (10.2.1.88 ip4 mw-wikifunctions:4451 probes/service http_mw-wikifunctions_ip4 codfw) [02:26:09] 5299 (RESOLVED) GatewayBackendErrorsHigh sre (page-analytics_cluster rest-gateway codfw) [02:30:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P69534 and previous config saved to /var/cache/conftool/dbconfig/20241010-023014-ladsgroup.json [02:30:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P69535 and previous config saved to /var/cache/conftool/dbconfig/20241010-023037-ladsgroup.json [02:35:35] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10216153 (phaultfinder) [02:45:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P69536 and previous config saved to /var/cache/conftool/dbconfig/20241010-024519-ladsgroup.json [02:45:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P69537 and previous config saved to /var/cache/conftool/dbconfig/20241010-024543-ladsgroup.json [02:52:25] PROBLEM - gerrit process on gerrit2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [02:52:30] FIRING: CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:53:25] RECOVERY - gerrit process on gerrit2003 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [03:00:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P69538 and previous config saved to /var/cache/conftool/dbconfig/20241010-030025-ladsgroup.json [03:00:26] (PS6) Pppery: Missing.php: Redirect Scots Wiktionary to Scots Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1078122 (https://phabricator.wikimedia.org/T249648) [03:00:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P69539 and previous config saved to /var/cache/conftool/dbconfig/20241010-030048-ladsgroup.json [03:15:03] RECOVERY - Host elastic1064 is UP: PING WARNING - Packet loss = 60%, RTA = 2.39 ms [03:15:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1198 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P69540 and previous config saved to /var/cache/conftool/dbconfig/20241010-031531-ladsgroup.json [03:15:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P69541 and previous config saved to /var/cache/conftool/dbconfig/20241010-031553-ladsgroup.json [03:21:27] PROBLEM - Host elastic1064 is DOWN: PING CRITICAL - Packet loss = 100% [03:22:34] (PS7) Pppery: Missing.php: Redirect Scots Wiktionary to Scots Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1078122 
(https://phabricator.wikimedia.org/T249648) [03:22:41] (CR) Pppery: Missing.php: Redirect Scots Wiktionary to Scots Wikipedia (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1078122 (https://phabricator.wikimedia.org/T249648) (owner: Pppery) [03:31:46] (PS8) Pppery: Missing.php: Redirect Scots Wiktionary to Scots Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1078122 (https://phabricator.wikimedia.org/T249648) [03:40:07] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:40:51] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:41:25] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:42:17] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 10 Dec 2024 11:59:32 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:42:41] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52776 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:42:57] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:02:23] (PS1) Pppery: Redirect all namespace-in-Wikipedia cases to Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1079054 [04:03:06] (CR) CI reject: [V:-1] Redirect all namespace-in-Wikipedia cases to Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1079054 (owner: Pppery) [04:03:56] (PS2) Pppery: Redirect all namespace-in-Wikipedia cases to Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1079054 [04:04:25] PROBLEM - gerrit process on gerrit2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [04:04:38] (CR) CI reject: [V:-1] Redirect all namespace-in-Wikipedia cases to Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1079054 (owner: Pppery) [04:04:44] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10216184 (phaultfinder) [04:05:25] RECOVERY - gerrit process on gerrit2003 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [04:09:11] ops-eqiad, SRE, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T376764#10216185 (phaultfinder) [04:16:18] (PS1) Pppery: Deploy missing.php redirects for Allemanic German [mediawiki-config] - 
https://gerrit.wikimedia.org/r/1079055 [04:17:01] (CR) CI reject: [V:-1] Deploy missing.php redirects for Allemanic German [mediawiki-config] - https://gerrit.wikimedia.org/r/1079055 (owner: Pppery) [04:17:42] (PS3) Pppery: Redirect all namespace-in-Wikipedia cases to Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1079054 [04:17:59] (PS2) Pppery: Deploy missing.php redirects for Allemanic German [mediawiki-config] - https://gerrit.wikimedia.org/r/1079055 [04:18:02] (PS4) Pppery: Redirect all namespace-in-Wikipedia cases to Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1079054 [04:18:06] (PS3) Pppery: Deploy missing.php redirects for Allemanic German [mediawiki-config] - https://gerrit.wikimedia.org/r/1079055 [04:18:45] (CR) CI reject: [V:-1] Redirect all namespace-in-Wikipedia cases to Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1079054 (owner: Pppery) [04:18:50] (CR) CI reject: [V:-1] Deploy missing.php redirects for Allemanic German [mediawiki-config] - https://gerrit.wikimedia.org/r/1079055 (owner: Pppery) [04:19:19] (PS5) Pppery: Redirect all namespace-in-Wikipedia cases to Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1079054 [04:19:27] (CR) CI reject: [V:-1] Deploy missing.php redirects for Allemanic German [mediawiki-config] - https://gerrit.wikimedia.org/r/1079055 (owner: Pppery) [04:22:57] (PS4) Pppery: Deploy missing.php redirects for Allemanic German [mediawiki-config] - https://gerrit.wikimedia.org/r/1079055 [04:23:40] (CR) CI reject: [V:-1] Deploy missing.php redirects for Allemanic German [mediawiki-config] - https://gerrit.wikimedia.org/r/1079055 (owner: Pppery) [04:26:38] (PS5) Pppery: Deploy missing.php redirects for Allemanic German [mediawiki-config] - https://gerrit.wikimedia.org/r/1079055 [04:27:18] (CR) CI reject: [V:-1] Deploy missing.php redirects for Allemanic German 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1079055 (owner: Pppery) [04:27:30] (PS3) Pppery: Remove als redirects [puppet] - https://gerrit.wikimedia.org/r/1079056 [04:29:29] (CR) CI reject: [V:-1] Remove als redirects [puppet] - https://gerrit.wikimedia.org/r/1079056 (owner: Pppery) [04:30:28] (PS6) Pppery: Deploy missing.php redirects for Allemanic German [mediawiki-config] - https://gerrit.wikimedia.org/r/1079055 [04:32:26] (PS4) Pppery: Remove als redirects [puppet] - https://gerrit.wikimedia.org/r/1079056 [04:41:35] (CR) Pppery: "Won't be ready for review for a while but wanted to get this on someone's radar." [puppet] - https://gerrit.wikimedia.org/r/1079056 (owner: Pppery) [04:51:05] (CR) Arnaudb: [C:+1] mariadb: Add SLAVE MONITOR to promotheus grants [puppet] - https://gerrit.wikimedia.org/r/1079006 (owner: Ladsgroup) [05:19:25] PROBLEM - gerrit process on gerrit2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [05:20:11] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:20:25] RECOVERY - gerrit process on gerrit2003 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [05:20:29] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:20:35] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP 
CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:28:11] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:28:29] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:32:29] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:33:11] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:38:57] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:39:17] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:40:01] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:50:11] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:50:29] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:54:29] PROBLEM 
- Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:55:11] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241010T0600) [06:00:05] marostegui, Amir1, and arnaudb: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241010T0600) [06:03:50] !log cr2-eqsin> request vmhost snapshot - T375961 [06:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:25] PROBLEM - gerrit process on gerrit2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [06:05:25] RECOVERY - gerrit process on gerrit2003 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [06:07:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 949.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:07:50] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing 
an abnormally high number of documents - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [06:10:01] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [06:10:20] !log jelto@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [06:11:27] FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:12:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 949.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:14:56] (PS1) Gerrit maintenance bot: mariadb: Promote db1183 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1079083 (https://phabricator.wikimedia.org/T376867) [06:15:33] (PS1) Gerrit maintenance bot: mariadb: Promote db1181 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1079094 (https://phabricator.wikimedia.org/T376868) [06:24:14] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1150.eqiad.wmnet with reason: Maintenance [06:24:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1150.eqiad.wmnet with reason: Maintenance [06:24:31] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on 
db1190.eqiad.wmnet with reason: Maintenance [06:24:44] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1190.eqiad.wmnet with reason: Maintenance [06:24:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1190 (T367781)', diff saved to https://phabricator.wikimedia.org/P69542 and previous config saved to /var/cache/conftool/dbconfig/20241010-062450-arnaudb.json [06:24:54] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [06:26:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T367781)', diff saved to https://phabricator.wikimedia.org/P69543 and previous config saved to /var/cache/conftool/dbconfig/20241010-062659-arnaudb.json [06:32:51] (CR) Gmodena: [C:+2] dse-k8s-services: content_history: version bump image. (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1078923 (https://phabricator.wikimedia.org/T368787) (owner: Gmodena) [06:34:54] (Merged) jenkins-bot: dse-k8s-services: content_history: version bump image. 
[deployment-charts] - https://gerrit.wikimedia.org/r/1078923 (https://phabricator.wikimedia.org/T368787) (owner: Gmodena) [06:35:11] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:35:29] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:37:23] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [06:37:29] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [06:41:48] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s5 T376867 [06:41:51] T376867: Switchover s5 master (db1230 -> db1183) - https://phabricator.wikimedia.org/T376867 [06:42:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P69544 and previous config saved to /var/cache/conftool/dbconfig/20241010-064206-arnaudb.json [06:42:11] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s5 T376867 [06:42:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db1183 with weight 0 T376867', diff saved to https://phabricator.wikimedia.org/P69545 and previous config saved to /var/cache/conftool/dbconfig/20241010-064219-arnaudb.json [06:43:46] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [06:43:51] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [06:46:44] (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db1183 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1079083 (https://phabricator.wikimedia.org/T376867) (owner: 10Gerrit maintenance bot) [06:47:51] !log Starting s5 eqiad failover from db1230 to db1183 - T376867 [06:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:54] T376867: Switchover s5 master (db1230 -> db1183) - https://phabricator.wikimedia.org/T376867 [06:48:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db1183 to s5 primary T376867', diff saved to https://phabricator.wikimedia.org/P69546 and previous config saved to /var/cache/conftool/dbconfig/20241010-064827-arnaudb.json [06:50:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1230 T376867', diff saved to https://phabricator.wikimedia.org/P69547 and previous config saved to /var/cache/conftool/dbconfig/20241010-065048-arnaudb.json [06:51:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 5%: T376867', diff saved to https://phabricator.wikimedia.org/P69548 and previous config saved to /var/cache/conftool/dbconfig/20241010-065145-arnaudb.json [06:52:30] FIRING: CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:56:27] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [06:57:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P69549 and previous config saved to /var/cache/conftool/dbconfig/20241010-065712-arnaudb.json [07:00:05] Amir1 and Urbanecm: How many deployers does it take 
to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241010T0700). [07:00:05] awight: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:06] I can self-serve my config patch. [07:03:44] (03PS2) 10Muehlenhoff: Point irc.w.o to irc1003 [dns] - 10https://gerrit.wikimedia.org/r/1078665 (https://phabricator.wikimedia.org/T376014) [07:06:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 10%: T376867', diff saved to https://phabricator.wikimedia.org/P69550 and previous config saved to /var/cache/conftool/dbconfig/20241010-070650-arnaudb.json [07:06:59] T376867: Switchover s5 master (db1230 -> db1183) - https://phabricator.wikimedia.org/T376867 [07:07:12] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s7 T376868 [07:07:18] T376868: Switchover s7 master (db1236 -> db1181) - https://phabricator.wikimedia.org/T376868 [07:07:38] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T376868 [07:08:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db1181 with weight 0 T376868', diff saved to https://phabricator.wikimedia.org/P69551 and previous config saved to /var/cache/conftool/dbconfig/20241010-070843-arnaudb.json [07:08:46] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [07:08:52] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [07:12:15] This is a new local install so I'm reconfiguring ssh, but the host key doesn't match my bastion nor the deployment machines listed on wikitech. 
[07:12:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T367781)', diff saved to https://phabricator.wikimedia.org/P69552 and previous config saved to /var/cache/conftool/dbconfig/20241010-071219-arnaudb.json [07:12:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1199.eqiad.wmnet with reason: Maintenance [07:12:24] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [07:12:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1199.eqiad.wmnet with reason: Maintenance [07:12:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T367781)', diff saved to https://phabricator.wikimedia.org/P69553 and previous config saved to /var/cache/conftool/dbconfig/20241010-071242-arnaudb.json [07:12:49] (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db1181 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1079094 (https://phabricator.wikimedia.org/T376868) (owner: 10Gerrit maintenance bot) [07:12:54] Anyone have an explanation? I see: ED25519 key fingerprint is SHA256:meS3gCKwHzJWtflhVLOotPQVkYEpexjddK6hna5/t/0. [07:13:47] Yeah nothing matches https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/deploy2002.codfw.wmnet [07:13:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T367781)', diff saved to https://phabricator.wikimedia.org/P69554 and previous config saved to /var/cache/conftool/dbconfig/20241010-071350-arnaudb.json [07:13:52] (03PS1) 10Slyngshede: R:idmcloud remove role, service will be moved to a different role. 
[puppet] - 10https://gerrit.wikimedia.org/r/1079151 (https://phabricator.wikimedia.org/T376871) [07:14:08] !log Starting s7 eqiad failover from db1236 to db1181 - T376868 [07:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:13] T376868: Switchover s7 master (db1236 -> db1181) - https://phabricator.wikimedia.org/T376868 [07:14:23] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [07:14:29] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [07:14:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db1181 to s7 primary T376868', diff saved to https://phabricator.wikimedia.org/P69555 and previous config saved to /var/cache/conftool/dbconfig/20241010-071453-arnaudb.json [07:15:18] !log slyngshede@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudidm2001-dev.codfw.wmnet [07:15:31] (03CR) 10Jcrespo: ""This was what was missing for pc5" -> With this, do you mean that the error went away when deployed? Or that it was what was different fr" [puppet] - 10https://gerrit.wikimedia.org/r/1079006 (owner: 10Ladsgroup) [07:15:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org [07:15:39] !log slyngshede@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts cloudidm2001-dev.codfw.wmnet [07:16:10] (03CR) 10Jcrespo: [C:03+1] mariadb: Add SLAVE MONITOR to promotheus grants [puppet] - 10https://gerrit.wikimedia.org/r/1079006 (owner: 10Ladsgroup) [07:16:29] (03PS1) 10DCausse: cirrussearch: CirrusSearchSaneitizerFixRateTooHigh break per cluster [alerts] - 10https://gerrit.wikimedia.org/r/1079152 [07:16:34] (03CR) 10Jcrespo: [C:03+1] "I just re-read the summary. All good." 
[puppet] - 10https://gerrit.wikimedia.org/r/1079006 (owner: 10Ladsgroup) [07:16:35] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [07:16:40] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [07:17:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078463 (https://phabricator.wikimedia.org/T362771) (owner: 10WMDE-Fisch) [07:17:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1236 T376868', diff saved to https://phabricator.wikimedia.org/P69556 and previous config saved to /var/cache/conftool/dbconfig/20241010-071721-arnaudb.json [07:17:57] (03Merged) 10jenkins-bot: [config] Rename moved gadget name setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078463 (https://phabricator.wikimedia.org/T362771) (owner: 10WMDE-Fisch) [07:18:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1236 (re)pooling @ 5%: T376868', diff saved to https://phabricator.wikimedia.org/P69557 and previous config saved to /var/cache/conftool/dbconfig/20241010-071820-arnaudb.json [07:18:46] !log awight@deploy2002 Started scap sync-world: Backport for [[gerrit:1078463|[config] Rename moved gadget name setting (T362771)]] [07:18:49] T362771: Move ReferencePreviews related config flags to Cite's codebase - https://phabricator.wikimedia.org/T362771 [07:19:11] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:19:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org [07:19:29] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, 
down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:20:47] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10216529 (10elukey) 05Resolved→03Open Please keep it open until we solve the firmware/BMC issue :) [07:21:07] !log awight@deploy2002 awight, wmde-fisch: Backport for [[gerrit:1078463|[config] Rename moved gadget name setting (T362771)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:21:22] (03CR) 10DCausse: [C:03+2] cirrussearch: CirrusSearchSaneitizerFixRateTooHigh break per cluster [alerts] - 10https://gerrit.wikimedia.org/r/1079152 (owner: 10DCausse) [07:21:53] RECOVERY - Host rdb1014 is UP: PING WARNING - Packet loss = 90%, RTA = 0.28 ms [07:21:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 25%: T376867', diff saved to https://phabricator.wikimedia.org/P69558 and previous config saved to /var/cache/conftool/dbconfig/20241010-072155-arnaudb.json [07:21:59] T376867: Switchover s5 master (db1230 -> db1183) - https://phabricator.wikimedia.org/T376867 [07:22:19] (03PS2) 10Slyngshede: R:idmcloud remove role, service will be moved to a different role. [puppet] - 10https://gerrit.wikimedia.org/r/1079151 (https://phabricator.wikimedia.org/T376871) [07:22:33] (03Merged) 10jenkins-bot: cirrussearch: CirrusSearchSaneitizerFixRateTooHigh break per cluster [alerts] - 10https://gerrit.wikimedia.org/r/1079152 (owner: 10DCausse) [07:22:39] PROBLEM - SSH on rdb1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:23:26] !log awight@deploy2002 awight, wmde-fisch: Continuing with sync [07:23:40] (03PS3) 10Slyngshede: R:idmcloud remove role, service will be moved to a different role. 
[puppet] - 10https://gerrit.wikimedia.org/r/1079151 (https://phabricator.wikimedia.org/T376871) [07:25:02] !log slyngshede@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudidm2001-dev.codfw.wmnet [07:28:08] !log awight@deploy2002 Finished scap sync-world: Backport for [[gerrit:1078463|[config] Rename moved gadget name setting (T362771)]] (duration: 09m 22s) [07:28:11] T362771: Move ReferencePreviews related config flags to Cite's codebase - https://phabricator.wikimedia.org/T362771 [07:28:13] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:28:17] PROBLEM - Host rdb1014 is DOWN: PING CRITICAL - Packet loss = 100% [07:28:29] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:28:42] (03PS3) 10Ilias Sarantopoulos: httpbb: add article-models namespace tests for articlequality [puppet] - 10https://gerrit.wikimedia.org/r/1063213 (https://phabricator.wikimedia.org/T360455) [07:28:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P69559 and previous config saved to /var/cache/conftool/dbconfig/20241010-072857-arnaudb.json [07:30:28] !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox [07:31:06] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:31:26] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:31:40] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART 
[07:32:00] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:32:23] !log Stopped gerrit service on gerrit2003.codfw.wmnet since it is not starting up properly | T372804 [07:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:26] T372804: setup gerrit2003 with gerrit service - https://phabricator.wikimedia.org/T372804 [07:32:27] PROBLEM - gerrit process on gerrit2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [07:33:25] !log UTC morning deployments done. [07:33:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1236 (re)pooling @ 10%: T376868', diff saved to https://phabricator.wikimedia.org/P69560 and previous config saved to /var/cache/conftool/dbconfig/20241010-073326-arnaudb.json [07:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:30] T376868: Switchover s7 master (db1236 -> db1181) - https://phabricator.wikimedia.org/T376868 [07:33:42] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudidm2001-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002" [07:34:11] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudidm2001-dev.codfw.wmnet decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1002" [07:34:11] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:34:12] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudidm2001-dev.codfw.wmnet [07:34:31] 
10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10216559 (10elukey) @Jclark-ctr would you mind passing me the BMC admin password via chat or email? [07:35:29] RECOVERY - gerrit process on gerrit2003 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [07:37:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 50%: T376867', diff saved to https://phabricator.wikimedia.org/P69561 and previous config saved to /var/cache/conftool/dbconfig/20241010-073700-arnaudb.json [07:37:04] T376867: Switchover s5 master (db1230 -> db1183) - https://phabricator.wikimedia.org/T376867 [07:37:33] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:38:03] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10216568 (10Jclark-ctr) Sent via chat [07:41:25] 06SRE, 06collaboration-services, 13Patch-For-Review: setup gerrit2003 with gerrit service - https://phabricator.wikimedia.org/T372804#10216570 (10hashar) The `gerrit` process on gerrit2003 does not start properly and is flapping: ` Oct 10 07:29:21 gerrit2003 systemd[1]: gerrit.service: Scheduled restart job,... 
[07:42:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org [07:43:28] (03CR) 10Isabelle Hurbain-Palatin: README.md: doc loading a plugin from the browser (032 comments) [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1078735 (owner: 10Hashar) [07:44:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P69562 and previous config saved to /var/cache/conftool/dbconfig/20241010-074404-arnaudb.json [07:46:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org [07:46:27] PROBLEM - gerrit process on gerrit2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [07:46:29] (03PS1) 10Hashar: Disable gerrit monitoring on gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1079206 (https://phabricator.wikimedia.org/T372804) [07:47:27] RECOVERY - gerrit process on gerrit2003 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [07:47:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1003.eqiad.wmnet [07:47:33] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:48:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1236 (re)pooling @ 25%: T376868', diff saved to https://phabricator.wikimedia.org/P69563 and previous config saved to /var/cache/conftool/dbconfig/20241010-074831-arnaudb.json [07:48:34] T376868: 
Switchover s7 master (db1236 -> db1181) - https://phabricator.wikimedia.org/T376868 [07:48:47] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10216592 (10MoritzMuehlenhoff) [07:49:05] RESOLVED: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [07:50:29] (03PS9) 10Brouberol: Define a ceph rolling restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) [07:50:36] (03CR) 10Brouberol: Define a ceph rolling restart/reboot cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) (owner: 10Brouberol) [07:52:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 75%: T376867', diff saved to https://phabricator.wikimedia.org/P69564 and previous config saved to /var/cache/conftool/dbconfig/20241010-075206-arnaudb.json [07:52:10] T376867: Switchover s5 master (db1230 -> db1183) - https://phabricator.wikimedia.org/T376867 [07:53:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1003.eqiad.wmnet [07:54:06] (03CR) 10Volans: [C:04-1] "spotted one typo" [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) (owner: 10Brouberol) [07:54:49] (03PS10) 10Brouberol: Define a ceph rolling restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) [07:54:58] (03CR) 10Brouberol: Define a ceph rolling restart/reboot cookbook (031 comment) [cookbooks] - 
10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) (owner: 10Brouberol) [07:55:18] (03CR) 10Jelto: [C:04-1] "@dzahn is working on gerrit2003 and testing gerrit on bookworm. I added a downtime in icinga as well which should prevent all alerts in th" [puppet] - 10https://gerrit.wikimedia.org/r/1079206 (https://phabricator.wikimedia.org/T372804) (owner: 10Hashar) [07:55:32] (03CR) 10Volans: [C:03+1] sre.gitlab.upgrade: also use the service name for the downtime (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [07:59:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T367781)', diff saved to https://phabricator.wikimedia.org/P69565 and previous config saved to /var/cache/conftool/dbconfig/20241010-075911-arnaudb.json [07:59:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb1002.eqiad.wmnet [07:59:13] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1221.eqiad.wmnet with reason: Maintenance [07:59:15] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [07:59:27] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1221.eqiad.wmnet with reason: Maintenance [07:59:28] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:59:45] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:59:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T367781)', diff saved to https://phabricator.wikimedia.org/P69566 and previous config saved to /var/cache/conftool/dbconfig/20241010-075951-arnaudb.json 
[08:00:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T367781)', diff saved to https://phabricator.wikimedia.org/P69567 and previous config saved to /var/cache/conftool/dbconfig/20241010-075959-arnaudb.json [08:00:05] andre and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241010T0800) [08:01:24] (03CR) 10Muehlenhoff: [C:03+2] Point irc.w.o to irc1003 [dns] - 10https://gerrit.wikimedia.org/r/1078665 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [08:02:02] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:02:22] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:02:39] !log irc.wikimedia.org now directs to the ircstream implementation on irc1003.wikimedia.org T376014 [08:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:42] T376014: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014 [08:03:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1236 (re)pooling @ 50%: T376868', diff saved to https://phabricator.wikimedia.org/P69568 and previous config saved to /var/cache/conftool/dbconfig/20241010-080336-arnaudb.json [08:03:39] T376868: Switchover s7 master (db1236 -> db1181) - https://phabricator.wikimedia.org/T376868 [08:03:59] (03CR) 10Hashar: [C:04-1] "**Thanks Isabelle**, I assumed Chromium would behave the same way as Firefox and it is different. So I will rework the text:" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1078735 (owner: 10Hashar) [08:04:00] (03CR) 10Volans: [C:03+1] "Code LGTM from a cookbook point of view. 
As I'm not familiar with the ceph cluster I'll leave the rest to your team." [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) (owner: 10Brouberol) [08:04:33] (03CR) 10Ayounsi: Add elements for WMCS IPv6 range in codfw 2a02:ec80:a100::/48 (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1078990 (https://phabricator.wikimedia.org/T245495) (owner: 10Cathal Mooney) [08:05:23] (03Abandoned) 10Hashar: Disable gerrit monitoring on gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1079206 (https://phabricator.wikimedia.org/T372804) (owner: 10Hashar) [08:05:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1002.eqiad.wmnet [08:06:14] (03CR) 10CI reject: [V:04-1] Define a ceph rolling restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) (owner: 10Brouberol) [08:06:32] (03CR) 10Brouberol: Define a ceph rolling restart/reboot cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) (owner: 10Brouberol) [08:07:07] (03PS11) 10Brouberol: Define a ceph rolling restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) [08:07:13] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10216617 (10elukey) I've updated the firmware to version `20231203_01.03.10` but I keep seeing the same problem: ` {"error":{"code":"Base.v1_10_3.Gen... 
[08:07:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 100%: T376867', diff saved to https://phabricator.wikimedia.org/P69569 and previous config saved to /var/cache/conftool/dbconfig/20241010-080711-arnaudb.json [08:07:18] T376867: Switchover s5 master (db1230 -> db1183) - https://phabricator.wikimedia.org/T376867 [08:08:11] (03PS5) 10Cathal Mooney: Delegate IPv6 ranges allocated for WMCS Openstack networks in codfw [dns] - 10https://gerrit.wikimedia.org/r/1076713 (https://phabricator.wikimedia.org/T374715) [08:09:31] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) (owner: 10Brouberol) [08:09:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10216631 (10phaultfinder) [08:15:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P69570 and previous config saved to /var/cache/conftool/dbconfig/20241010-081506-arnaudb.json [08:17:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc1002.wikimedia.org [08:18:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1236 (re)pooling @ 75%: T376868', diff saved to https://phabricator.wikimedia.org/P69571 and previous config saved to /var/cache/conftool/dbconfig/20241010-081841-arnaudb.json [08:18:44] T376868: Switchover s7 master (db1236 -> db1181) - https://phabricator.wikimedia.org/T376868 [08:20:32] (03PS12) 10Brouberol: Define a ceph rolling restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) [08:21:25] !log brouberol@cumin1002 START - Cookbook sre.ceph.roll-restart-reboot-server rolling restart_daemons on P{cephosd1001*} and (A:cephosd) [08:21:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc1002.wikimedia.org [08:21:50] !log brouberol@cumin1002 
END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling restart_daemons on P{cephosd1001*} and (A:cephosd) [08:25:03] (03PS2) 10Cathal Mooney: Add elements for WMCS IPv6 range in codfw 2a02:ec80:a100::/48 [homer/public] - 10https://gerrit.wikimedia.org/r/1078990 (https://phabricator.wikimedia.org/T245495) [08:25:24] (03CR) 10Cathal Mooney: Add elements for WMCS IPv6 range in codfw 2a02:ec80:a100::/48 (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1078990 (https://phabricator.wikimedia.org/T245495) (owner: 10Cathal Mooney) [08:26:39] (03PS1) 10Muehlenhoff: Revert "Point irc.w.o to irc1003" [dns] - 10https://gerrit.wikimedia.org/r/1079212 [08:28:05] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10216684 (10elukey) I checked an-conf1004 that should be a similar host, and this is the firmware version: `20240313_01.04.04` Can't find it on the we... 
[08:30:10] (03CR) 10Muehlenhoff: [C:03+2] Revert "Point irc.w.o to irc1003" [dns] - 10https://gerrit.wikimedia.org/r/1079212 (owner: 10Muehlenhoff) [08:30:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P69572 and previous config saved to /var/cache/conftool/dbconfig/20241010-083013-arnaudb.json [08:30:56] (03CR) 10Giuseppe Lavagetto: [C:03+2] python_deploy::venv: transform into a define [puppet] - 10https://gerrit.wikimedia.org/r/1078707 (owner: 10Giuseppe Lavagetto) [08:31:15] (03PS1) 10Jelto: sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1079213 (https://phabricator.wikimedia.org/T363564) [08:32:14] (03CR) 10Jelto: [C:03+2] sre.gitlab.upgrade: also use the service name for the downtime (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [08:33:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1236 (re)pooling @ 100%: T376868', diff saved to https://phabricator.wikimedia.org/P69573 and previous config saved to /var/cache/conftool/dbconfig/20241010-083347-arnaudb.json [08:33:50] T376868: Switchover s7 master (db1236 -> db1181) - https://phabricator.wikimedia.org/T376868 [08:37:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc1003.wikimedia.org [08:39:43] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1236.eqiad.wmnet with reason: Maintenance [08:39:57] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1236.eqiad.wmnet with reason: Maintenance [08:40:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1236 (T367781)', diff saved to https://phabricator.wikimedia.org/P69574 and previous config saved to /var/cache/conftool/dbconfig/20241010-084003-arnaudb.json [08:40:07] T367781: Drop deprecated abuse filter fields on wmf wikis - 
https://phabricator.wikimedia.org/T367781 [08:40:10] (03PS1) 10Ayounsi: Monitoring rename pfw3-codfw to pfw1 add new fasw [puppet] - 10https://gerrit.wikimedia.org/r/1079216 (https://phabricator.wikimedia.org/T374176) [08:40:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc1003.wikimedia.org [08:41:05] (03CR) 10Ayounsi: [C:04-1] "FYI I sent I30fc3d372d1d3940e46a864ab34893542a27a966 that takes care of both." [puppet] - 10https://gerrit.wikimedia.org/r/1075624 (https://phabricator.wikimedia.org/T374587) (owner: 10Papaul) [08:41:10] (03PS1) 10Kosta Harlan: QuickSurvey.vue: Support using HTML in thank you message [extensions/QuickSurveys] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079217 (https://phabricator.wikimedia.org/T376517) [08:41:12] (03PS1) 10Hashar: Revert "Use HTML markup instead of bidi control chars in wiki changes" [core] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079218 (https://phabricator.wikimedia.org/T375975) [08:41:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/QuickSurveys] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079217 (https://phabricator.wikimedia.org/T376517) (owner: 10Kosta Harlan) [08:41:48] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on cloudsw1-b1-codfw.mgmt with reason: prevent bgp alerts firing until CRs configured [08:42:02] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on cloudsw1-b1-codfw.mgmt with reason: prevent bgp alerts firing until CRs configured [08:42:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T367781)', diff saved to https://phabricator.wikimedia.org/P69575 and previous config saved to /var/cache/conftool/dbconfig/20241010-084214-arnaudb.json [08:42:25] 10ops-eqiad, 06SRE, 
06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10216712 (10elukey) Email sent to Supermicro, we'll see if they are able to provide to us the right firmware. [08:44:46] (03CR) 10CI reject: [V:04-1] sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1079213 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [08:45:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T367781)', diff saved to https://phabricator.wikimedia.org/P69576 and previous config saved to /var/cache/conftool/dbconfig/20241010-084521-arnaudb.json [08:45:23] hashar: is it possible to deploy a small patch during the train deployment window? (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/QuickSurveys/+/1079217) otherwise I will see if someone on my team can use the scheduled window this afternoon [08:45:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1238.eqiad.wmnet with reason: Maintenance [08:45:24] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [08:45:37] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1238.eqiad.wmnet with reason: Maintenance [08:45:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1238 (T367781)', diff saved to https://phabricator.wikimedia.org/P69577 and previous config saved to /var/cache/conftool/dbconfig/20241010-084543-arnaudb.json [08:46:07] (03CR) 10Ayounsi: [C:03+1] "lgtm!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1078990 (https://phabricator.wikimedia.org/T245495) (owner: 10Cathal Mooney) [08:46:58] (03CR) 10Aklapper: [V:03+2 C:03+2] "Unblock train" [core] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079218 (https://phabricator.wikimedia.org/T375975) (owner: 10Hashar) [08:47:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T367781)', diff saved to https://phabricator.wikimedia.org/P69578 and previous config saved to /var/cache/conftool/dbconfig/20241010-084752-arnaudb.json [08:48:20] (03CR) 10Alexandros Kosiaris: [C:03+1] echostore: adopt service mesh in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079012 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [08:52:29] (03PS2) 10Jelto: sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1079213 (https://phabricator.wikimedia.org/T363564) [08:55:39] !log aklapper@deploy2002 Started scap sync-world: Backport for [[gerrit:1079218|Revert "Use HTML markup instead of bidi control chars in wiki changes" (T375975 T376814)]] [08:55:45] T375975: Remove uses of $lang->getDirMark and $lang->getDirMarkEntity in non plain text output - https://phabricator.wikimedia.org/T375975 [08:55:45] T376814: Regression: mobile watchlist's text overlaps - https://phabricator.wikimedia.org/T376814 [08:57:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P69579 and previous config saved to /var/cache/conftool/dbconfig/20241010-085721-arnaudb.json [08:57:52] !log aklapper@deploy2002 hashar, aklapper: Backport for [[gerrit:1079218|Revert "Use HTML markup instead of bidi control chars in wiki changes" (T375975 T376814)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:00:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host 
apt2002.wikimedia.org [09:02:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P69580 and previous config saved to /var/cache/conftool/dbconfig/20241010-090259-arnaudb.json [09:03:15] !log aklapper@deploy2002 hashar, aklapper: Continuing with sync [09:06:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org [09:07:15] (03CR) 10JMeybohm: [C:03+2] Migrate kubestage1003 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1078677 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [09:07:18] (03CR) 10Klausman: [V:03+2 C:03+2] httpbb: add article-models namespace tests for articlequality [puppet] - 10https://gerrit.wikimedia.org/r/1063213 (https://phabricator.wikimedia.org/T360455) (owner: 10Ilias Sarantopoulos) [09:07:49] !log aklapper@deploy2002 Finished scap sync-world: Backport for [[gerrit:1079218|Revert "Use HTML markup instead of bidi control chars in wiki changes" (T375975 T376814)]] (duration: 12m 09s) [09:07:53] T375975: Remove uses of $lang->getDirMark and $lang->getDirMarkEntity in non plain text output - https://phabricator.wikimedia.org/T375975 [09:07:53] T376814: Regression: mobile watchlist's text overlaps - https://phabricator.wikimedia.org/T376814 [09:10:04] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage1003.eqiad.wmnet [09:10:50] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage1003.eqiad.wmnet [09:12:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet [09:12:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P69581 and previous config saved to /var/cache/conftool/dbconfig/20241010-091228-arnaudb.json [09:13:26] (03PS1) 10TrainBranchBot: group2 to 1.43.0-wmf.26 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079221 (https://phabricator.wikimedia.org/T375657) [09:13:28] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.43.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079221 (https://phabricator.wikimedia.org/T375657) (owner: 10TrainBranchBot) [09:13:39] kostajh: sorry I missed your ping about deploying https://gerrit.wikimedia.org/r/c/mediawiki/extensions/QuickSurveys/+/1079217 . andre is running the train so I guess sync up with him [09:14:10] kostajh, hashar: eh, deploying .26 to group2 right now in progress [09:14:11] (03Merged) 10jenkins-bot: group2 to 1.43.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079221 (https://phabricator.wikimedia.org/T375657) (owner: 10TrainBranchBot) [09:14:21] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudgw: add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1077712 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [09:14:35] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestage1003.eqiad.wmnet with OS bookworm [09:15:12] (03PS4) 10Kosta Harlan: dumps: Drop the globalblocks table dump [puppet] - 10https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) [09:16:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet [09:16:39] (03PS5) 10Kosta Harlan: dumps: Drop the globalblocks table dump [puppet] - 10https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) [09:16:43] (03CR) 10Kosta Harlan: "Good catch, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [09:17:01] (03CR) 10Kosta Harlan: [C:04-1] "I believe we still need to have one patch in front of this to mark the files as "absent"" [puppet] - 10https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [09:17:31] (03PS3) 10Arturo Borrero Gonzalez: wmcs: declare prometheus::node_kernel_panic in profile::base::cloud_production [puppet] - 10https://gerrit.wikimedia.org/r/1078954 (https://phabricator.wikimedia.org/T376719) [09:17:40] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1078954 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [09:18:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P69582 and previous config saved to /var/cache/conftool/dbconfig/20241010-091806-arnaudb.json [09:18:15] (03CR) 10Kosta Harlan: [C:04-1] "^ @Ladsgroup@gmail.com this is based on a comment you made, IIRC. I'm not sure what to update, though. 
Any pointers are welcome :)" [puppet] - 10https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [09:20:59] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.43.0-wmf.26 refs T375657 [09:21:02] T375657: 1.43.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T375657 [09:23:37] (03PS1) 10Ayounsi: sre.hosts.dhcp: add --force-dhcp-tftp [cookbooks] - 10https://gerrit.wikimedia.org/r/1079224 [09:24:32] (03CR) 10Elukey: [C:03+1] fastapi: Add define to run a fastapi application [puppet] - 10https://gerrit.wikimedia.org/r/1078708 (https://phabricator.wikimedia.org/T371782) (owner: 10Giuseppe Lavagetto) [09:26:53] (03CR) 10Cathal Mooney: [C:03+2] Add elements for WMCS IPv6 range in codfw 2a02:ec80:a100::/48 [homer/public] - 10https://gerrit.wikimedia.org/r/1078990 (https://phabricator.wikimedia.org/T245495) (owner: 10Cathal Mooney) [09:27:27] (03Merged) 10jenkins-bot: Add elements for WMCS IPv6 range in codfw 2a02:ec80:a100::/48 [homer/public] - 10https://gerrit.wikimedia.org/r/1078990 (https://phabricator.wikimedia.org/T245495) (owner: 10Cathal Mooney) [09:27:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T367781)', diff saved to https://phabricator.wikimedia.org/P69583 and previous config saved to /var/cache/conftool/dbconfig/20241010-092735-arnaudb.json [09:27:38] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [09:30:09] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1003.eqiad.wmnet with reason: host reimage [09:33:03] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage1003.eqiad.wmnet with reason: host reimage [09:33:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T367781)', diff saved to https://phabricator.wikimedia.org/P69584 and previous config saved to 
/var/cache/conftool/dbconfig/20241010-093313-arnaudb.json [09:33:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1241.eqiad.wmnet with reason: Maintenance [09:33:17] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [09:33:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1241.eqiad.wmnet with reason: Maintenance [09:33:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T367781)', diff saved to https://phabricator.wikimedia.org/P69585 and previous config saved to /var/cache/conftool/dbconfig/20241010-093335-arnaudb.json [09:35:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T367781)', diff saved to https://phabricator.wikimedia.org/P69586 and previous config saved to /var/cache/conftool/dbconfig/20241010-093544-arnaudb.json [09:36:14] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade. [09:41:19] (03CR) 10Btullis: [C:04-1] "Waiting on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1076905 before proceeding." 
[puppet] - 10https://gerrit.wikimedia.org/r/1050330 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [09:47:45] ACKNOWLEDGEMENT - Host an-presto1010 is DOWN: PING CRITICAL - Packet loss = 100% Btullis T376880 [09:47:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt1002.wikimedia.org [09:49:00] ACKNOWLEDGEMENT - Host elastic1064 is DOWN: PING CRITICAL - Packet loss = 100% Btullis T376881 [09:49:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [09:50:47] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage1003.eqiad.wmnet with OS bookworm [09:50:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P69587 and previous config saved to /var/cache/conftool/dbconfig/20241010-095050-arnaudb.json [09:52:26] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage1003.eqiad.wmnet [09:52:29] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage1003.eqiad.wmnet [09:52:38] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage1004.eqiad.wmnet [09:53:24] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:53:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt1002.wikimedia.org [09:54:44] !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host kubestage1004.eqiad.wmnet [09:55:53] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations: Create and deploy a re-reimplementation of 
irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10216940 (10elukey) We tried to move irc.wikimedia.org to irc1003 but we noticed some issues in messages relay... [09:57:05] (03PS1) 10Joal: Update webrequest raw retention period on HDFS [puppet] - 10https://gerrit.wikimedia.org/r/1079231 (https://phabricator.wikimedia.org/T376882) [09:58:24] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:58:47] 06SRE, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Timeout while retrieving the catalog from the Docker Registry - https://phabricator.wikimedia.org/T376285#10216942 (10elukey) 05Open→03Resolved a:03elukey The issue seems solved, I tested docker-report multiple times and I did... [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241010T1000) [10:00:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:02:19] (03CR) 10Btullis: [C:03+2] Update webrequest raw retention period on HDFS [puppet] - 10https://gerrit.wikimedia.org/r/1079231 (https://phabricator.wikimedia.org/T376882) (owner: 10Joal) [10:05:57] FIRING: [4x] KubernetesCalicoDown: kubestage1004.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:05:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P69589 and previous config saved to 
/var/cache/conftool/dbconfig/20241010-100557-arnaudb.json [10:08:42] 06SRE, 10SRE-swift-storage: The file "XXX" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T291137#10216986 (10Yann) https://commons.wikimedia.org/w/index.php?title=File:Reserve_Bank_of_Zimbabwe_5_Dollars_2019_obseve.jpg can't be undeleted due to this bug:... [10:09:40] (03PS1) 10Zabe: s2: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079233 (https://phabricator.wikimedia.org/T183490) [10:10:04] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: keepalived: support separate IPv6 VRRP instance [puppet] - 10https://gerrit.wikimedia.org/r/1079234 (https://phabricator.wikimedia.org/T376879) [10:10:57] RESOLVED: [4x] KubernetesCalicoDown: kubestage1004.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:11:27] FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:12:29] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079234 (https://phabricator.wikimedia.org/T376879) (owner: 10Arturo Borrero Gonzalez) [10:15:37] (03CR) 10Cathal Mooney: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1079234 (https://phabricator.wikimedia.org/T376879) (owner: 10Arturo Borrero Gonzalez) [10:16:21] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: keepalived: support separate IPv6 VRRP instance [puppet] - 10https://gerrit.wikimedia.org/r/1079234 (https://phabricator.wikimedia.org/T376879) [10:16:55] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079234 (https://phabricator.wikimedia.org/T376879) (owner: 10Arturo Borrero Gonzalez) [10:19:37] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudgw: keepalived: support separate IPv6 VRRP instance [puppet] - 10https://gerrit.wikimedia.org/r/1079234 (https://phabricator.wikimedia.org/T376879) (owner: 10Arturo Borrero Gonzalez) [10:21:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T367781)', diff saved to https://phabricator.wikimedia.org/P69590 and previous config saved to /var/cache/conftool/dbconfig/20241010-102104-arnaudb.json [10:21:07] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1242.eqiad.wmnet with reason: Maintenance [10:21:08] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [10:21:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1242.eqiad.wmnet with reason: Maintenance [10:21:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T367781)', diff saved to https://phabricator.wikimedia.org/P69591 and previous config saved to /var/cache/conftool/dbconfig/20241010-102127-arnaudb.json [10:21:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet [10:22:32] (03CR) 10FNegri: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066) (owner: 10David Caro) [10:22:36] (03CR) 10Lucas Werkmeister (WMDE): "Seems to work in production now \o/ thanks for merging!" [puppet] - 10https://gerrit.wikimedia.org/r/1078900 (https://phabricator.wikimedia.org/T341553) (owner: 10Lucas Werkmeister (WMDE)) [10:23:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T367781)', diff saved to https://phabricator.wikimedia.org/P69592 and previous config saved to /var/cache/conftool/dbconfig/20241010-102336-arnaudb.json [10:24:52] RECOVERY - Host an-presto1010 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [10:25:05] (03PS1) 10Cathal Mooney: Adjust cloudsw/cr bgp policies and include new IPv6 range for codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1079237 (https://phabricator.wikimedia.org/T245495) [10:25:24] RECOVERY - SSH on an-presto1010 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:25:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet [10:25:56] (03PS2) 10Cathal Mooney: Adjust cloudsw/cr bgp policies and include new IPv6 range for codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1079237 (https://phabricator.wikimedia.org/T245495) [10:27:12] FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:33:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testhost2001.codfw.wmnet [10:37:12] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:38:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P69593 and previous config saved to /var/cache/conftool/dbconfig/20241010-103843-arnaudb.json [10:39:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testhost2001.codfw.wmnet [10:41:22] (03PS1) 10Clément Goubert: kubernetes: Hosts refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079239 (https://phabricator.wikimedia.org/T376170) [10:42:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2004.codfw.wmnet [10:46:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2004.codfw.wmnet [10:48:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2005.codfw.wmnet [10:50:43] (03PS1) 10Clément Goubert: kubernetes: Hosts expansion [puppet] - 10https://gerrit.wikimedia.org/r/1079240 (https://phabricator.wikimedia.org/T376665) [10:51:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2005.codfw.wmnet [10:52:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2006.codfw.wmnet [10:53:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P69594 and previous config saved to /var/cache/conftool/dbconfig/20241010-105350-arnaudb.json [10:56:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2006.codfw.wmnet [10:57:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2007.codfw.wmnet [10:57:54] (03PS2) 10Clément Goubert: kubernetes: codfw refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079239 (https://phabricator.wikimedia.org/T376170) [10:57:54] (03PS2) 
10Clément Goubert: kubernetes: codfw expansion [puppet] - 10https://gerrit.wikimedia.org/r/1079240 (https://phabricator.wikimedia.org/T376665) [10:57:54] (03PS1) 10Clément Goubert: kubernetes: eqiad refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079241 (https://phabricator.wikimedia.org/T376185) [10:58:07] (03CR) 10Ayounsi: [C:03+1] "lgtm!" [homer/public] - 10https://gerrit.wikimedia.org/r/1079237 (https://phabricator.wikimedia.org/T245495) (owner: 10Cathal Mooney) [11:00:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2007.codfw.wmnet [11:03:26] (03PS1) 10Clément Goubert: kubernetes: eqiad expansion [puppet] - 10https://gerrit.wikimedia.org/r/1079242 (https://phabricator.wikimedia.org/T376307) [11:07:22] (03Abandoned) 10Cathal Mooney: Enable gNMI / gRPC on cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/1031904 (https://phabricator.wikimedia.org/T365012) (owner: 10Cathal Mooney) [11:08:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T367781)', diff saved to https://phabricator.wikimedia.org/P69595 and previous config saved to /var/cache/conftool/dbconfig/20241010-110857-arnaudb.json [11:08:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1243.eqiad.wmnet with reason: Maintenance [11:09:01] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [11:09:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1243.eqiad.wmnet with reason: Maintenance [11:09:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T367781)', diff saved to https://phabricator.wikimedia.org/P69596 and previous config saved to /var/cache/conftool/dbconfig/20241010-110920-arnaudb.json [11:10:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T367781)', diff saved to https://phabricator.wikimedia.org/P69597 and previous 
config saved to /var/cache/conftool/dbconfig/20241010-111028-arnaudb.json [11:10:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2008.wikimedia.org [11:14:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2008.wikimedia.org [11:16:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow7001.magru.wmnet [11:20:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow7001.magru.wmnet [11:21:01] (03PS1) 10Cathal Mooney: Modify LVS_Import policy to support both Liberica and PyBal [homer/public] - 10https://gerrit.wikimedia.org/r/1079244 (https://phabricator.wikimedia.org/T375464) [11:21:58] (03CR) 10Clément Goubert: "It was created when we still did a switchback, yes, and it serves less of a purpose now except in an emergency." [cookbooks] - 10https://gerrit.wikimedia.org/r/912813 (https://phabricator.wikimedia.org/T335364) (owner: 10Clément Goubert) [11:22:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow6001.drmrs.wmnet [11:24:44] (03CR) 10Btullis: Define a ceph rolling restart/reboot cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) (owner: 10Brouberol) [11:25:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P69598 and previous config saved to /var/cache/conftool/dbconfig/20241010-112535-arnaudb.json [11:26:12] jouncebot: nowandnext [11:26:12] No deployments scheduled for the next 0 hour(s) and 33 minute(s) [11:26:13] In 0 hour(s) and 33 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241010T1200) [11:26:19] (03CR) 10Zabe: [C:03+2] s2: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079233 
(https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [11:26:30] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: fix keepalived IPv6 setting [puppet] - 10https://gerrit.wikimedia.org/r/1079246 (https://phabricator.wikimedia.org/T376879) [11:26:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow6001.drmrs.wmnet [11:26:56] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: fix keepalived IPv6 setting [puppet] - 10https://gerrit.wikimedia.org/r/1079246 (https://phabricator.wikimedia.org/T376879) [11:27:00] (03Merged) 10jenkins-bot: s2: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079233 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [11:27:19] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1079233|s2: Reduce revision-slots cache expiry to 60 seconds (T183490)]] [11:27:22] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [11:27:58] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079246 (https://phabricator.wikimedia.org/T376879) (owner: 10Arturo Borrero Gonzalez) [11:28:33] (03CR) 10Volans: sre.discovery.datacenter: Add failover action (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/912813 (https://phabricator.wikimedia.org/T335364) (owner: 10Clément Goubert) [11:29:27] !log zabe@deploy2002 zabe: Backport for [[gerrit:1079233|s2: Reduce revision-slots cache expiry to 60 seconds (T183490)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:29:36] !log zabe@deploy2002 zabe: Continuing with sync [11:29:49] (03CR) 10Ayounsi: [C:03+1] "nice!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1079244 (https://phabricator.wikimedia.org/T375464) (owner: 10Cathal Mooney) [11:31:12] (03CR) 10Cathal Mooney: [C:03+2] Adjust cloudsw/cr bgp policies and include new IPv6 range for codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1079237 (https://phabricator.wikimedia.org/T245495) (owner: 10Cathal Mooney) [11:31:42] 06SRE, 06Data-Engineering, 10Observability-Logging, 10Wikimedia-Logstash, 10Event-Platform: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645#10217225 (10matmarex) I've been told that this project would let me process Logstash data with SQL queries, and I would like that v... [11:31:45] (03Merged) 10jenkins-bot: Adjust cloudsw/cr bgp policies and include new IPv6 range for codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1079237 (https://phabricator.wikimedia.org/T245495) (owner: 10Cathal Mooney) [11:32:54] (03CR) 10Cathal Mooney: [C:03+2] Modify LVS_Import policy to support both Liberica and PyBal [homer/public] - 10https://gerrit.wikimedia.org/r/1079244 (https://phabricator.wikimedia.org/T375464) (owner: 10Cathal Mooney) [11:33:26] (03Merged) 10jenkins-bot: Modify LVS_Import policy to support both Liberica and PyBal [homer/public] - 10https://gerrit.wikimedia.org/r/1079244 (https://phabricator.wikimedia.org/T375464) (owner: 10Cathal Mooney) [11:34:17] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1079233|s2: Reduce revision-slots cache expiry to 60 seconds (T183490)]] (duration: 06m 58s) [11:34:20] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [11:36:15] (03Abandoned) 10Ammarpad: enwiktionary: Enable $wgMFCollapseSectionsByDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078397 (https://phabricator.wikimedia.org/T376446) (owner: 10Ammarpad) [11:36:55] FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:38:01] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T376764#10217236 (10phaultfinder) [11:40:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P69599 and previous config saved to /var/cache/conftool/dbconfig/20241010-114042-arnaudb.json [11:42:39] (03PS1) 10Cathal Mooney: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/homer/public [homer/public] - 10https://gerrit.wikimedia.org/r/1079251 [11:42:39] (03PS1) 10Cathal Mooney: Fix typos in updated prefix-list for cloud ranges eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1079252 (https://phabricator.wikimedia.org/T245495) [11:43:25] (03PS2) 10Cathal Mooney: Fix typos in updated prefix-list for cloud ranges eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1079252 (https://phabricator.wikimedia.org/T245495) [11:43:54] (03Abandoned) 10Cathal Mooney: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/homer/public [homer/public] - 10https://gerrit.wikimedia.org/r/1079251 (owner: 10Cathal Mooney) [11:44:32] (03CR) 10Cathal Mooney: [C:03+2] Fix typos in updated prefix-list for cloud ranges eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1079252 (https://phabricator.wikimedia.org/T245495) (owner: 10Cathal Mooney) [11:45:10] (03Merged) 10jenkins-bot: Fix typos in updated prefix-list for cloud ranges eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1079252 (https://phabricator.wikimedia.org/T245495) (owner: 10Cathal Mooney) [11:46:55] RESOLVED: SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status 
- https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:47:22] (03CR) 10Cathal Mooney: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1079246 (https://phabricator.wikimedia.org/T376879) (owner: 10Arturo Borrero Gonzalez) [11:47:53] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudgw: fix keepalived IPv6 setting [puppet] - 10https://gerrit.wikimedia.org/r/1079246 (https://phabricator.wikimedia.org/T376879) (owner: 10Arturo Borrero Gonzalez) [11:51:55] FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:54:44] (03CR) 10Clément Goubert: "Done" [cookbooks] - 10https://gerrit.wikimedia.org/r/912813 (https://phabricator.wikimedia.org/T335364) (owner: 10Clément Goubert) [11:54:48] (03Merged) 10jenkins-bot: Fix similar typo in the codfw policy [homer/public] - 10https://gerrit.wikimedia.org/r/1079254 (https://phabricator.wikimedia.org/T245495) (owner: 10Cathal Mooney) [11:54:58] (03CR) 10JMeybohm: [C:03+2] Migrate kubestage1004 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1078678 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [11:55:02] (03PS2) 10JMeybohm: Migrate kubestage1004 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1078678 (https://phabricator.wikimedia.org/T362408) [11:55:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T367781)', diff saved to https://phabricator.wikimedia.org/P69601 and previous config saved to /var/cache/conftool/dbconfig/20241010-115549-arnaudb.json [11:55:52] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1244.eqiad.wmnet with reason: Maintenance [11:56:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1244.eqiad.wmnet with reason: 
Maintenance [11:56:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1244 (T367781)', diff saved to https://phabricator.wikimedia.org/P69602 and previous config saved to /var/cache/conftool/dbconfig/20241010-115612-arnaudb.json [11:57:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T367781)', diff saved to https://phabricator.wikimedia.org/P69603 and previous config saved to /var/cache/conftool/dbconfig/20241010-115720-arnaudb.json [11:57:24] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [11:58:21] (03CR) 10JMeybohm: [V:03+2 C:03+2] Migrate kubestage1004 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1078678 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [12:00:42] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestage1004.eqiad.wmnet with OS bookworm [12:04:56] (03CR) 10David Caro: [V:03+1 C:03+2] "Agree yep, I also think we should move it to `epp` both eventually." [puppet] - 10https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066) (owner: 10David Caro) [12:05:35] (03PS1) 10Clément Goubert: kubestage: codfw refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079257 (https://phabricator.wikimedia.org/T376171) [12:07:32] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ms-be1065 - https://phabricator.wikimedia.org/T376775#10217305 (10jcrespo) DC ops: Would you have an 8 TB disk spare for this host? It seems out of warranty. 
[12:09:04] 06SRE, 06Infrastructure-Foundations, 10netops: Move public-vlan host BGP peerings from CRs to top-of-rack switches in codfw - https://phabricator.wikimedia.org/T360772#10217316 (10cmooney) 05Open→03Declined [12:12:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P69604 and previous config saved to /var/cache/conftool/dbconfig/20241010-121227-arnaudb.json [12:12:36] 06SRE, 10Data-Persistence-Backup, 10media-backups: Expand media backup storage available space to 960 TB per datacenter - https://phabricator.wikimedia.org/T376892 (10jcrespo) 03NEW [12:12:50] 06SRE, 10Data-Persistence-Backup, 10media-backups: Expand media backup storage available space to 960 TB per datacenter - https://phabricator.wikimedia.org/T376892#10217337 (10jcrespo) p:05Triage→03High [12:14:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10217353 (10phaultfinder) [12:16:26] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1004.eqiad.wmnet with reason: host reimage [12:16:32] jouncebot: nowandnext [12:16:32] For the next 0 hour(s) and 43 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241010T1200) [12:16:32] In 0 hour(s) and 43 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241010T1300) [12:16:55] FIRING: [3x] SystemdUnitFailed: prometheus-debian-version-textfile.service on kubestage1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:17:25] 06SRE, 10Data-Persistence-Backup, 10media-backups: Expand media backup storage available space to 960 TB per datacenter - https://phabricator.wikimedia.org/T376892#10217354 (10jcrespo) [12:17:46] 06SRE, 
10Data-Persistence-Backup, 10media-backups: Expand media backup storage available space to 960 TB per datacenter - https://phabricator.wikimedia.org/T376892#10217355 (10jcrespo) [12:17:47] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10217356 (10jcrespo) [12:17:58] 06SRE, 10Data-Persistence-Backup, 10media-backups: Expand media backup storage available space to 960 TB per datacenter - https://phabricator.wikimedia.org/T376892#10217357 (10jcrespo) [12:18:00] 10ops-codfw, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup2012 - https://phabricator.wikimedia.org/T371984#10217358 (10jcrespo) [12:19:19] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage1004.eqiad.wmnet with reason: host reimage [12:21:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow5002.eqsin.wmnet [12:21:55] FIRING: [3x] SystemdUnitFailed: prometheus-debian-version-textfile.service on kubestage1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:22:30] (03CR) 10Lucas Werkmeister (WMDE): "Not strictly required for the backport, but you’ll probably want to add `mediawiki.jqueryMsg` to the ResourceLoader dependencies to ensure" [extensions/QuickSurveys] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079217 (https://phabricator.wikimedia.org/T376517) (owner: 10Kosta Harlan) [12:22:38] (03PS1) 10JMeybohm: Fix app and lamp module.json, adding back dropped versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079268 (https://phabricator.wikimedia.org/T356885) [12:24:53] (03CR) 10Jelto: [C:03+1] "looks good to me, thanks for the quick fix" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079268 
(https://phabricator.wikimedia.org/T356885) (owner: 10JMeybohm) [12:25:04] (03CR) 10JMeybohm: [C:03+2] Fix app and lamp module.json, adding back dropped versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079268 (https://phabricator.wikimedia.org/T356885) (owner: 10JMeybohm) [12:26:04] (03Merged) 10jenkins-bot: Fix app and lamp module.json, adding back dropped versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079268 (https://phabricator.wikimedia.org/T356885) (owner: 10JMeybohm) [12:26:38] 06SRE, 10Data-Persistence-Backup, 10media-backups: Expand media backup storage available space to 960 TB per datacenter - https://phabricator.wikimedia.org/T376892#10217370 (10jcrespo) [12:26:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow5002.eqsin.wmnet [12:27:06] (03CR) 10Kosta Harlan: "Oops. Thanks for that. Done in I9ff95e1e7850029691874740172f1bf35aed6431" [extensions/QuickSurveys] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079217 (https://phabricator.wikimedia.org/T376517) (owner: 10Kosta Harlan) [12:27:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P69605 and previous config saved to /var/cache/conftool/dbconfig/20241010-122734-arnaudb.json [12:29:33] 06SRE, 06Infrastructure-Foundations, 10netops: Move codfw dns hosts to per-rack vlans and BGP peer with top-of-rack switch - https://phabricator.wikimedia.org/T376894 (10cmooney) 03NEW p:05Triage→03Low [12:29:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow4002.ulsfo.wmnet [12:35:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow4002.ulsfo.wmnet [12:38:20] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage1004.eqiad.wmnet with OS bookworm [12:38:37] 06SRE, 10Observability-Metrics, 05Goal, 13Patch-Needs-Improvement: Fully 
migrate producers off statsd - https://phabricator.wikimedia.org/T205870#10217427 (10Aklapper) [12:39:18] 06SRE, 06Infrastructure-Foundations, 10netops: Consolidate Automation Templates for DC Switches - https://phabricator.wikimedia.org/T312635#10217448 (10Aklapper) [12:42:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T367781)', diff saved to https://phabricator.wikimedia.org/P69606 and previous config saved to /var/cache/conftool/dbconfig/20241010-124241-arnaudb.json [12:42:43] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1245.eqiad.wmnet with reason: Maintenance [12:42:46] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [12:42:57] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1245.eqiad.wmnet with reason: Maintenance [12:42:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1247.eqiad.wmnet with reason: Maintenance [12:43:10] RESOLVED: SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:43:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1247.eqiad.wmnet with reason: Maintenance [12:43:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1247 (T367781)', diff saved to https://phabricator.wikimedia.org/P69607 and previous config saved to /var/cache/conftool/dbconfig/20241010-124319-arnaudb.json [12:45:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T367781)', diff saved to https://phabricator.wikimedia.org/P69608 and previous config saved to /var/cache/conftool/dbconfig/20241010-124528-arnaudb.json [12:45:48] 10ops-codfw, 06SRE, 06DC-Ops, 
06Machine-Learning-Team: hw troubleshooting: Likely memory issue on ml-serv2001.codfw.wmnet - https://phabricator.wikimedia.org/T376706#10217459 (10klausman) 05Open→03Resolved a:03klausman Thanks! Machine is back in service. [12:45:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow3003.esams.wmnet [12:47:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075635 (https://phabricator.wikimedia.org/T269499) (owner: 10C. Scott Ananian) [12:51:55] FIRING: SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:51:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow3003.esams.wmnet [12:52:04] (03PS1) 10Alexandros Kosiaris: mediawiki-image-download: Remove pct based pulls [puppet] - 10https://gerrit.wikimedia.org/r/1079273 (https://phabricator.wikimedia.org/T366778) [12:52:10] 06SRE, 06SRE-OnFire, 13Patch-Needs-Improvement: productionize 'sremap' and 'filter_victorops_calendar' under sretools.wikimedia.org - https://phabricator.wikimedia.org/T313355#10217521 (10Aklapper) [12:52:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow2003.codfw.wmnet [12:52:57] (03PS1) 10C. Scott Ananian: Turn on Parsoid Selective Update metrics (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079274 (https://phabricator.wikimedia.org/T371713) [12:53:17] (03PS2) 10C. 
Scott Ananian: Turn on Parsoid Selective Update metrics (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079274 (https://phabricator.wikimedia.org/T371713) [12:53:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 10 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079274 (https://phabricator.wikimedia.org/T371713) (owner: 10C. Scott Ananian) [12:53:50] (03PS3) 10CDanis: WIP: Basic seeding of an oncall handoff message [software/klaxon] - 10https://gerrit.wikimedia.org/r/830259 (https://phabricator.wikimedia.org/T317159) [12:54:01] (03CR) 10CI reject: [V:04-1] mediawiki-image-download: Remove pct based pulls [puppet] - 10https://gerrit.wikimedia.org/r/1079273 (https://phabricator.wikimedia.org/T366778) (owner: 10Alexandros Kosiaris) [12:54:52] (03CR) 10CI reject: [V:04-1] WIP: Basic seeding of an oncall handoff message [software/klaxon] - 10https://gerrit.wikimedia.org/r/830259 (https://phabricator.wikimedia.org/T317159) (owner: 10CDanis) [12:55:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org [12:55:37] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2034.codfw.wmnet [12:56:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2003.codfw.wmnet [12:57:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow1002.eqiad.wmnet [12:59:49] 06SRE, 10conftool, 13Patch-Needs-Improvement: ipblocks support for other "entities" (not clouds, not abuse nets) - https://phabricator.wikimedia.org/T305581#10217561 (10Aklapper) [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241010T1300). 
[13:00:05] kostajh, Dreamy_Jazz, and cscott: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:12] \o/ [13:00:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P69609 and previous config saved to /var/cache/conftool/dbconfig/20241010-130035-arnaudb.json [13:00:39] (03CR) 10Isabelle Hurbain-Palatin: [C:03+1] Turn on Parsoid Selective Update metrics (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079274 (https://phabricator.wikimedia.org/T371713) (owner: 10C. Scott Ananian) [13:00:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2034.codfw.wmnet [13:00:51] \o [13:01:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1002.eqiad.wmnet [13:01:08] (03CR) 10Dreamy Jazz: [C:03+2] QuickSurvey.vue: Support using HTML in thank you message [extensions/QuickSurveys] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079217 (https://phabricator.wikimedia.org/T376517) (owner: 10Kosta Harlan) [13:01:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org [13:01:32] (03PS1) 10JMeybohm: cumin/aliases: Remove P{O:kubernetes::staging::worker} [puppet] - 10https://gerrit.wikimedia.org/r/1079276 (https://phabricator.wikimedia.org/T362408) [13:01:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2034.codfw.wmnet [13:01:51] as usual on Thursdays, I have a meeting that may or may not happen, so I don’t know if I can deploy in a few minutes or in 15-20 ^^ [13:02:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2034.codfw.wmnet [13:02:37] I'm deploying my change now [13:03:49] (03PS1) 10Dreamy Jazz: 
extension.json: Add mediawiki.jqueryMsg to dependencies for ext.quicksurveys.lib [extensions/QuickSurveys] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079278 (https://phabricator.wikimedia.org/T376517) [13:04:22] (03CR) 10Dreamy Jazz: [C:03+2] extension.json: Add mediawiki.jqueryMsg to dependencies for ext.quicksurveys.lib [extensions/QuickSurveys] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079278 (https://phabricator.wikimedia.org/T376517) (owner: 10Dreamy Jazz) [13:04:49] (03PS1) 10Jelto: miscweb: bump base and mesh templates to newest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079279 (https://phabricator.wikimedia.org/T350793) [13:05:01] (03Merged) 10jenkins-bot: QuickSurvey.vue: Support using HTML in thank you message [extensions/QuickSurveys] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079217 (https://phabricator.wikimedia.org/T376517) (owner: 10Kosta Harlan) [13:05:25] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [13:06:42] (03CR) 10JMeybohm: [C:03+2] cumin/aliases: Remove P{O:kubernetes::staging::worker} [puppet] - 10https://gerrit.wikimedia.org/r/1079276 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [13:06:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/QuickSurveys] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079278 (https://phabricator.wikimedia.org/T376517) (owner: 10Dreamy Jazz) [13:06:55] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:07:54] 
alright, my meeting is not happening which means I can deploy ^^ [13:08:01] (03Merged) 10jenkins-bot: extension.json: Add mediawiki.jqueryMsg to dependencies for ext.quicksurveys.lib [extensions/QuickSurveys] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079278 (https://phabricator.wikimedia.org/T376517) (owner: 10Dreamy Jazz) [13:08:19] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1079217|QuickSurvey.vue: Support using HTML in thank you message (T376517)]], [[gerrit:1079278|extension.json: Add mediawiki.jqueryMsg to dependencies for ext.quicksurveys.lib (T376517)]] [13:08:21] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Jonathan Tweed - https://phabricator.wikimedia.org/T376777#10217593 (10Aklapper) 05Resolved→03Open [13:08:24] T376517: First test, then launch the new Safety Survey - https://phabricator.wikimedia.org/T376517 [13:10:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2034.codfw.wmnet [13:10:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2034.codfw.wmnet [13:10:34] !log dreamyjazz@deploy2002 dreamyjazz, kharlan: Backport for [[gerrit:1079217|QuickSurvey.vue: Support using HTML in thank you message (T376517)]], [[gerrit:1079278|extension.json: Add mediawiki.jqueryMsg to dependencies for ext.quicksurveys.lib (T376517)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:11:04] Lucas_WMDE: I think Dreamy_Jazz is in the middle of his deploy still [13:11:43] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage1004.eqiad.wmnet [13:11:45] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage1004.eqiad.wmnet [13:12:09] (03PS2) 10Jelto: miscweb: bump base and mesh templates to newest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079279 
(https://phabricator.wikimedia.org/T350793) [13:12:14] yes, I saw that [13:12:48] !log dreamyjazz@deploy2002 dreamyjazz, kharlan: Continuing with sync [13:15:19] (03CR) 10JMeybohm: [C:03+1] miscweb: bump base and mesh templates to newest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079279 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [13:15:33] (03PS1) 10Bking: stat hosts: create/enable cgroups for memory and i/o [puppet] - 10https://gerrit.wikimedia.org/r/1079281 (https://phabricator.wikimedia.org/T376653) [13:15:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P69610 and previous config saved to /var/cache/conftool/dbconfig/20241010-131542-arnaudb.json [13:16:50] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079281 (https://phabricator.wikimedia.org/T376653) (owner: 10Bking) [13:17:25] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:ulsfo and A:dnsbox [13:17:25] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org [13:17:27] (03PS2) 10Bking: stat hosts: create/enable cgroups for memory and i/o [puppet] - 10https://gerrit.wikimedia.org/r/1079281 (https://phabricator.wikimedia.org/T376653) [13:17:31] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1079217|QuickSurvey.vue: Support using HTML in thank you message (T376517)]], [[gerrit:1079278|extension.json: Add mediawiki.jqueryMsg to dependencies for ext.quicksurveys.lib (T376517)]] (duration: 09m 12s) [13:17:35] T376517: First test, then launch the new Safety Survey - https://phabricator.wikimedia.org/T376517 [13:17:39] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079281 (https://phabricator.wikimedia.org/T376653) (owner: 10Bking) [13:17:47] (03PS2) 10Alexandros Kosiaris: mediawiki-image-download: Remove pct based pulls 
[puppet] - 10https://gerrit.wikimedia.org/r/1079273 (https://phabricator.wikimedia.org/T366778) [13:17:51] Finished with my deploys [13:17:51] 06SRE, 06Data-Engineering, 10Observability-Logging, 10Wikimedia-Logstash, 10Event-Platform: Produce ECS formatted logstash logs to Event Platform, allowing them to be queried in the WMF Data Lake with SQL - https://phabricator.wikimedia.org/T291645#10217619 (10Ottomata) [13:18:02] Over to you Lucas_WMDE [13:18:07] ok! [13:18:25] cscott: how about we deploy your two changes together? [13:18:35] (i.e., in one scap) [13:18:35] that should be fine [13:18:43] (I realized “together” is ambiguous ^^) [13:18:56] let’s see if scap will even do it [13:18:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075635 (https://phabricator.wikimedia.org/T269499) (owner: 10C. Scott Ananian) [13:18:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079274 (https://phabricator.wikimedia.org/T371713) (owner: 10C. 
Scott Ananian) [13:19:03] looks like it [13:19:10] 06SRE, 06Data-Engineering, 10Observability-Logging, 10Wikimedia-Logstash, 10Event-Platform: Produce ECS formatted logstash logs to Event Platform, allowing them to be queried in the WMF Data Lake with SQL - https://phabricator.wikimedia.org/T291645#10217616 (10Ottomata) [13:19:13] I guess the second one will be auto-rebased on merge [13:19:38] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:19:44] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:19:54] ^ expected [13:20:42] (03Merged) 10jenkins-bot: Turn on mobile support for Parsoid Read Views (but not on talk pages) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075635 (https://phabricator.wikimedia.org/T269499) (owner: 10C. Scott Ananian) [13:20:44] (03Merged) 10jenkins-bot: Turn on Parsoid Selective Update metrics (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079274 (https://phabricator.wikimedia.org/T371713) (owner: 10C. 
Scott Ananian) [13:20:59] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1075635|Turn on mobile support for Parsoid Read Views (but not on talk pages) (T269499 T376048)]], [[gerrit:1079274|Turn on Parsoid Selective Update metrics (take 2) (T371713 T376433)]] [13:21:06] T269499: [Epic] Make MobileFrontend compatible with Parsoid HTML - https://phabricator.wikimedia.org/T269499 [13:21:07] T376048: MFE still have issues with Parsoid Read Views on talk pages (Discussion Tools) - https://phabricator.wikimedia.org/T376048 [13:21:07] T371713: Instrumentation & data gathering to inform future performance & templating improvements - https://phabricator.wikimedia.org/T371713 [13:21:08] T376433: TypeError: Argument 1 passed to Wikimedia\Stats\Metrics\CounterMetric::incrementBy() must be of the type float, null given - https://phabricator.wikimedia.org/T376433 [13:21:27] (03CR) 10Alexandros Kosiaris: [C:03+2] mediawiki-image-download: Remove pct based pulls [puppet] - 10https://gerrit.wikimedia.org/r/1079273 (https://phabricator.wikimedia.org/T366778) (owner: 10Alexandros Kosiaris) [13:21:42] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:21:44] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:23:01] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, cscott: Backport for [[gerrit:1075635|Turn on mobile support for Parsoid Read Views (but not on talk pages) (T269499 T376048)]], [[gerrit:1079274|Turn on Parsoid Selective Update metrics (take 2) (T371713 T376433)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:23:17] ok, i'll go ahead and test [13:23:21] ok, thanks! 
[13:23:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7003.magru.wmnet [13:26:20] 06SRE, 06Infrastructure-Foundations, 06serviceops: Clean up the Docker Registry catalog and Swift storage from old images - https://phabricator.wikimedia.org/T375645#10217676 (10elukey) I put some thoughts on the current situation, and even if there are a lot of unknowns, I realized that garbage collection m... [13:26:20] ok, the parsoid read views on mobile one looks good, checking the other [13:26:44] (03CR) 10Muehlenhoff: [C:03+2] profile::envoy: When adding rules based on nftables check for empty ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/1076905 (owner: 10Muehlenhoff) [13:26:58] 06SRE-OnFire, 06Data-Persistence-SRE, 06DBA, 13Patch-For-Review, 07Sustainability: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication - https://phabricator.wikimedia.org/T375144#10217682 (10jcrespo) So this is my request for you @Volans, this is the best thing... [13:27:42] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:27:44] RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:27:56] RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns4003 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [13:29:16] (03PS4) 10Alexandros Kosiaris: mw-script: Add prometheus-statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078666 (https://phabricator.wikimedia.org/T376714) [13:29:16] (03PS2) 10Alexandros Kosiaris: mw-script: Remove ci_only_release_do_not_deploy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078672 [13:29:16] (03PS1) 10Alexandros Kosiaris: mw-debug: Recreate instead of RollingUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079283 (https://phabricator.wikimedia.org/T374907) [13:30:04] (03PS2) 10Alexandros Kosiaris: mw-debug: Recreate instead of RollingUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079283 (https://phabricator.wikimedia.org/T374907) [13:30:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T367781)', diff saved to https://phabricator.wikimedia.org/P69611 and previous config saved to /var/cache/conftool/dbconfig/20241010-133049-arnaudb.json [13:30:52] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1248.eqiad.wmnet with reason: Maintenance [13:31:00] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [13:31:06] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1248.eqiad.wmnet with reason: Maintenance [13:31:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1248 (T367781)', diff saved to https://phabricator.wikimedia.org/P69612 and previous config saved to /var/cache/conftool/dbconfig/20241010-133113-arnaudb.json [13:31:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T367781)', diff saved to https://phabricator.wikimedia.org/P69613 and previous config saved to /var/cache/conftool/dbconfig/20241010-133121-arnaudb.json [13:32:01] Lucas_WMDE: looks good [13:32:03] !log lucaswerkmeister-wmde@deploy2002 
lucaswerkmeister-wmde, cscott: Continuing with sync [13:32:05] ok! [13:32:17] (03PS1) 10Clément Goubert: mw-debug-repl: Support next release [puppet] - 10https://gerrit.wikimedia.org/r/1079284 (https://phabricator.wikimedia.org/T376895) [13:34:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7003.magru.wmnet [13:35:03] 06SRE, 06serviceops: low rate of mw-memcached errors - https://phabricator.wikimedia.org/T371881#10217731 (10jijiki) [13:35:36] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for dns4003.wikimedia.org [13:35:37] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns4003.wikimedia.org [13:36:13] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org [13:37:09] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1075635|Turn on mobile support for Parsoid Read Views (but not on talk pages) (T269499 T376048)]], [[gerrit:1079274|Turn on Parsoid Selective Update metrics (take 2) (T371713 T376433)]] (duration: 16m 09s) [13:37:14] (03CR) 10Clément Goubert: [C:03+1] mw-debug: Recreate instead of RollingUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079283 (https://phabricator.wikimedia.org/T374907) (owner: 10Alexandros Kosiaris) [13:37:24] T269499: [Epic] Make MobileFrontend compatible with Parsoid HTML - https://phabricator.wikimedia.org/T269499 [13:37:24] I’ll roll out a small config cleanup [13:37:24] T376048: MFE still have issues with Parsoid Read Views on talk pages (Discussion Tools) - https://phabricator.wikimedia.org/T376048 [13:37:25] T371713: Instrumentation & data gathering to inform future performance & templating improvements - https://phabricator.wikimedia.org/T371713 [13:37:25] T376433: TypeError: Argument 1 passed to Wikimedia\Stats\Metrics\CounterMetric::incrementBy() must be of the type float, null given - https://phabricator.wikimedia.org/T376433 [13:37:33] (03CR) 
10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077417 (https://phabricator.wikimedia.org/T376245) (owner: 10Fomafix) [13:37:53] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Jonathan Tweed - https://phabricator.wikimedia.org/T376777#10217741 (10MoritzMuehlenhoff) 05Open→03Resolved [13:38:16] (03Merged) 10jenkins-bot: Use ?? instead of default value in getRawVal() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077417 (https://phabricator.wikimedia.org/T376245) (owner: 10Fomafix) [13:38:32] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1077417|Use ?? instead of default value in getRawVal() (T376245)]] [13:38:33] (03CR) 10Btullis: [C:03+1] stat hosts: create/enable cgroups for memory and i/o [puppet] - 10https://gerrit.wikimedia.org/r/1079281 (https://phabricator.wikimedia.org/T376653) (owner: 10Bking) [13:38:36] T376245: Deprecate parameter $default in MediaWiki\Request\WebRequest::getRawVal - https://phabricator.wikimedia.org/T376245 [13:39:11] cscott: the two config changes should be deployed btw, let me know if something needs rolling back [13:39:43] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10217752 (10cmooney) [13:39:47] I'm seeing metrics being generated by the config change, seems ok [13:39:57] nice [13:40:53] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, fomafix: Backport for [[gerrit:1077417|Use ?? 
instead of default value in getRawVal() (T376245)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:40:56] testing… [13:41:16] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, fomafix: Continuing with sync [13:41:19] seems to work fine [13:41:36] ah, and these *do* show up in logstash, good to know ^^ [13:42:18] looks like I’m the only one who used fatal-error.php in the past 30 days, which sounds likely enough ^^ [13:42:31] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10217761 (10cmooney) [13:42:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7003.magru.wmnet [13:42:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7003.magru.wmnet [13:43:10] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:43:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7004.magru.wmnet [13:45:49] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1077417|Use ?? 
instead of default value in getRawVal() (T376245)]] (duration: 07m 16s) [13:45:52] T376245: Deprecate parameter $default in MediaWiki\Request\WebRequest::getRawVal - https://phabricator.wikimedia.org/T376245 [13:46:27] !log UTC afternoon backport+config window done [13:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P69615 and previous config saved to /var/cache/conftool/dbconfig/20241010-134628-arnaudb.json [13:46:52] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: cloudsw: codfw: enable IPv6 - https://phabricator.wikimedia.org/T374713#10217784 (10cmooney) 05Open→03Resolved This is now complete, the cloudsw is set up to route the networks are required and announcing them upst... [13:48:13] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot begin reboot of dns4004.wikimedia.org [13:49:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1230.eqiad.wmnet with reason: Maintenance [13:49:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1230.eqiad.wmnet with reason: Maintenance [13:49:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T367781)', diff saved to https://phabricator.wikimedia.org/P69616 and previous config saved to /var/cache/conftool/dbconfig/20241010-134926-arnaudb.json [13:49:30] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [13:49:34] (03CR) 10Fabfur: [C:03+1] team-sre: make PyBal alert paging (set severity) [alerts] - 10https://gerrit.wikimedia.org/r/1079049 (owner: 10Ssingh) [13:49:38] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:49:44] PROBLEM - BGP status on cr4-ulsfo is 
CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:49:51] (03CR) 10Ssingh: [C:03+2] team-sre: make PyBal alert paging (set severity) [alerts] - 10https://gerrit.wikimedia.org/r/1079049 (owner: 10Ssingh) [13:50:16] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10217807 (10cmooney) >>! In T375847#10195673, @aborrero wrote: > `lang=shell-session > root@ipv6-test-1:~# ip -br a > lo... [13:51:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7004.magru.wmnet [13:51:42] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:51:44] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:51:47] ^ expected [13:51:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T367781)', diff saved to https://phabricator.wikimedia.org/P69617 and previous config saved to /var/cache/conftool/dbconfig/20241010-135152-arnaudb.json [13:52:45] (03CR) 10Btullis: [C:03+1] stat hosts: enable zRAM-based swap [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking) [13:53:25] (03CR) 10Btullis: [C:03+1] stat hosts: enable zRAM-based swap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking) [13:54:11] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: CloudVPS: IPv6 in codfw1dev - https://phabricator.wikimedia.org/T245495#10217836 (10cmooney) The edge (cloudsw/cr) networking is now complete, elements in the range are reachable externally. ` cathal@officepc:~$ mtr -z -b... 
[13:54:30] (03PS1) 10Cathal Mooney: Add orlonger to policy on announced v6 routes from cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/1079288 (https://phabricator.wikimedia.org/T245495) [13:55:25] (03CR) 10Muehlenhoff: [C:03+2] Add members of platform-engineering to deployers [puppet] - 10https://gerrit.wikimedia.org/r/1078948 (https://phabricator.wikimedia.org/T376808) (owner: 10Muehlenhoff) [13:56:55] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:58:38] RECOVERY - Check if ntpsec.service has been restarted after /etc/ntpsec/ntp.conf was changed on dns4004 is OK: OK: ntpsec.service was restarted after /etc/ntpsec/ntp.conf was changed. https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [13:58:42] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:58:44] RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:59:47] !log sukhe@cumin1002 cookbooks.sre.dns.roll-reboot finished rebooting dns4004.wikimedia.org [13:59:47] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:ulsfo and A:dnsbox [14:00:58] PROBLEM - NTP peers and stratum check on dns4004 is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown, stratum=-1 (CRITICAL) https://wikitech.wikimedia.org/wiki/NTP [14:01:04] 06SRE-OnFire, 06Data-Persistence-SRE, 06DBA, 13Patch-For-Review, 07Sustainability: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication - https://phabricator.wikimedia.org/T375144#10217857 (10Volans) Thanks @jcrespo for the detailed request. I'll get to it. Only... 
[14:01:13] ^ should resolve soon [14:01:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P69618 and previous config saved to /var/cache/conftool/dbconfig/20241010-140135-arnaudb.json [14:01:42] sync not established yet, I need to bump the retry interval so that it doesn't keep on checking [14:01:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7004.magru.wmnet [14:01:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7004.magru.wmnet [14:02:16] RECOVERY - NTP peers and stratum check on dns4004 is OK: NTP OK: Offset -0.000876706 secs, stratum=2 https://wikitech.wikimedia.org/wiki/NTP [14:03:25] (03PS14) 10Tiziano Fogli: atlas: adding prometheus blackbox icmp checks [puppet] - 10https://gerrit.wikimedia.org/r/1079226 (https://phabricator.wikimedia.org/T370506) [14:04:22] (03CR) 10Elukey: [C:03+1] modules/admin: add ml-lab-users to render group [puppet] - 10https://gerrit.wikimedia.org/r/1078963 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [14:04:47] (03PS1) 10Ssingh: P:ntp: adjust retry_interval in light of iburst removal [puppet] - 10https://gerrit.wikimedia.org/r/1079290 [14:05:53] (03CR) 10Elukey: [C:03+1] "Thanks!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1079224 (owner: 10Ayounsi) [14:06:55] (03CR) 10Ssingh: [C:03+2] P:ntp: adjust retry_interval in light of iburst removal [puppet] - 10https://gerrit.wikimedia.org/r/1079290 (owner: 10Ssingh) [14:07:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P69619 and previous config saved to /var/cache/conftool/dbconfig/20241010-140659-arnaudb.json [14:07:17] 06SRE-OnFire, 06Data-Persistence-SRE, 06DBA, 13Patch-For-Review, 07Sustainability: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication - https://phabricator.wikimedia.org/T375144#10217866 (10jcrespo) Yes, the REPLACE is not the issue, it is ROW that translates... [14:07:20] (03CR) 10Jelto: [C:03+2] miscweb: bump base and mesh templates to newest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079279 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [14:08:33] (03CR) 10Tiziano Fogli: [V:03+1] atlas: adding prometheus blackbox icmp checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1079226 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [14:08:37] (03Merged) 10jenkins-bot: miscweb: bump base and mesh templates to newest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079279 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [14:09:31] (03PS1) 10Elukey: profile::docker::reporter: avoid OCI indexes for k8s images [puppet] - 10https://gerrit.wikimedia.org/r/1079291 [14:09:51] (03CR) 10Scott French: "Thanks for fixing this! One maybe issue, but otherwise LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1079284 (https://phabricator.wikimedia.org/T376895) (owner: 10Clément Goubert) [14:10:01] (03PS2) 10Elukey: profile::docker::reporter: avoid OCI indexes for k8s images [puppet] - 10https://gerrit.wikimedia.org/r/1079291 [14:11:27] FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:12:01] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [14:13:10] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:13:13] (03CR) 10Muehlenhoff: "We should rather use/extend the gpu_users group, though?" [puppet] - 10https://gerrit.wikimedia.org/r/1078963 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [14:15:21] (03CR) 10Lucas Werkmeister (WMDE): mw-debug-repl: Support next release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1079284 (https://phabricator.wikimedia.org/T376895) (owner: 10Clément Goubert) [14:16:15] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10217895 (10Jhancock.wm) new drive installed. looks like alert has cleared. lmk if you need any further assistance. 
[14:16:39] !log failover Ganeti masters in magru to secondary node [14:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T367781)', diff saved to https://phabricator.wikimedia.org/P69620 and previous config saved to /var/cache/conftool/dbconfig/20241010-141642-arnaudb.json [14:16:44] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1249.eqiad.wmnet with reason: Maintenance [14:16:46] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir [14:16:46] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [14:16:49] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [14:16:58] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1249.eqiad.wmnet with reason: Maintenance [14:17:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1249 (T367781)', diff saved to https://phabricator.wikimedia.org/P69621 and previous config saved to /var/cache/conftool/dbconfig/20241010-141704-arnaudb.json [14:18:19] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [14:18:23] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [14:19:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T367781)', diff saved to https://phabricator.wikimedia.org/P69622 and previous config saved to /var/cache/conftool/dbconfig/20241010-141912-arnaudb.json [14:19:30] PROBLEM - ganeti-wconfd running on ganeti7001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:19:35] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@4b69f50]: Stage Refine fixes on test cluster [airflow-dags@4b69f503] 
[14:19:48] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@4b69f50]: Stage Refine fixes on test cluster [airflow-dags@4b69f503] (duration: 00m 13s) [14:20:11] (03CR) 10Elukey: [C:03+1] "I had the same idea, but the group IIUC was added to experiment with ml-labs nodes without a broader audience involved. The group is alrea" [puppet] - 10https://gerrit.wikimedia.org/r/1078963 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [14:21:28] (03CR) 10Scott French: mw-debug-repl: Support next release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1079284 (https://phabricator.wikimedia.org/T376895) (owner: 10Clément Goubert) [14:21:55] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:22:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P69623 and previous config saved to /var/cache/conftool/dbconfig/20241010-142206-arnaudb.json [14:23:36] PROBLEM - ganeti-wconfd running on ganeti7002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:23:37] (03CR) 10Muehlenhoff: "But we can simply add "*ml-lab-users" to gpu-users and apply the access group to the ml-lab hosts? 
That group already has @ops as well, so" [puppet] - 10https://gerrit.wikimedia.org/r/1078963 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [14:25:14] (03PS1) 10Bking: idp.yaml: Add airflow service [puppet] - 10https://gerrit.wikimedia.org/r/1079292 (https://phabricator.wikimedia.org/T374948) [14:26:25] (03PS2) 10Bking: idp.yaml: Add airflow service [puppet] - 10https://gerrit.wikimedia.org/r/1079292 (https://phabricator.wikimedia.org/T374948) [14:26:59] (03CR) 10Muehlenhoff: [C:03+1] "But yeah, we can also merge this, but please check with IF before creating any POSIX groups next time." [puppet] - 10https://gerrit.wikimedia.org/r/1078963 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [14:27:07] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079292 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [14:28:52] !log jhathaway@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1002.eqiad.wmnet'] [14:29:33] (03CR) 10Lucas Werkmeister (WMDE): mw-debug-repl: Support next release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1079284 (https://phabricator.wikimedia.org/T376895) (owner: 10Clément Goubert) [14:31:46] (03CR) 10Dzahn: "I'd like to ask you to please reach out to the serviceops team for these." 
[puppet] - 10https://gerrit.wikimedia.org/r/1079056 (owner: 10Pppery) [14:34:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P69624 and previous config saved to /var/cache/conftool/dbconfig/20241010-143419-arnaudb.json [14:34:50] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['sretest1002.eqiad.wmnet'] [14:36:55] FIRING: [3x] SystemdUnitFailed: prometheus-dpkg-success-textfile.service on kubestage1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:37:12] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T367781)', diff saved to https://phabricator.wikimedia.org/P69625 and previous config saved to /var/cache/conftool/dbconfig/20241010-143713-arnaudb.json [14:37:17] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [14:37:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:37:43] (03PS1) 10Brouberol: idp: add dummy client secret for aitflow_analytics_test [labs/private] - 10https://gerrit.wikimedia.org/r/1079296 (https://phabricator.wikimedia.org/T374948) [14:37:51] gerrit pls. 
[14:37:56] TheresNoTime: +1 [14:39:25] jelto: ^ [14:39:33] (03CR) 10Bking: [C:03+2] idp: add dummy client secret for aitflow_analytics_test [labs/private] - 10https://gerrit.wikimedia.org/r/1079296 (https://phabricator.wikimedia.org/T374948) (owner: 10Brouberol) [14:39:38] (03CR) 10Bking: [V:03+2 C:03+2] idp: add dummy client secret for aitflow_analytics_test [labs/private] - 10https://gerrit.wikimedia.org/r/1079296 (https://phabricator.wikimedia.org/T374948) (owner: 10Brouberol) [14:39:51] (03CR) 10Klausman: "I am fine using a different approach here: AIUI, the render group is a Debianism, so we did not created it from scratch." [puppet] - 10https://gerrit.wikimedia.org/r/1078963 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [14:40:29] (seems alright now fwiw) [14:40:30] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079292 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [14:40:45] indeed! [14:40:49] oh let me take a look, thanks for the ping [14:41:55] FIRING: [3x] SystemdUnitFailed: prometheus-dpkg-success-textfile.service on kubestage1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:42:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:43:10] FIRING: [3x] SystemdUnitFailed: prometheus-dpkg-success-textfile.service on kubestage1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:44:05] yeah it resolved, but I'll check the latest
traffic [14:44:56] FIRING: CalicoTyphaDown: Too few (0) calico-typha replicas running - https://wikitech.wikimedia.org/wiki/Calico#Typha" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoTyphaDown [14:44:57] FIRING: [4x] KubernetesCalicoDown: kubestage2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:46:11] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079292 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [14:46:18] (03PS2) 10Clément Goubert: mw-debug-repl: Support next release [puppet] - 10https://gerrit.wikimedia.org/r/1079284 (https://phabricator.wikimedia.org/T376895) [14:46:33] (03CR) 10Clément Goubert: "Thanks for the catches, both of you :)" [puppet] - 10https://gerrit.wikimedia.org/r/1079284 (https://phabricator.wikimedia.org/T376895) (owner: 10Clément Goubert) [14:48:01] (03CR) 10Brouberol: [C:03+1] idp.yaml: Add airflow service [puppet] - 10https://gerrit.wikimedia.org/r/1079292 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [14:48:10] FIRING: [3x] SystemdUnitFailed: prometheus-dpkg-success-textfile.service on kubestage1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:48:35] (03CR) 10Bking: "PCC is failing, but we are confident in this patch, will go ahead and merge (and revert immediately if it causes problems)" [puppet] - 10https://gerrit.wikimedia.org/r/1079292 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [14:48:51] (03CR) 10Bking: [C:03+2] idp.yaml: Add airflow service [puppet] - 10https://gerrit.wikimedia.org/r/1079292 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [14:48:57] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - 
https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [14:49:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P69626 and previous config saved to /var/cache/conftool/dbconfig/20241010-144926-arnaudb.json [14:49:35] (03CR) 10Andrea Denisse: [C:03+1] "LGTM! thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1079226 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [14:49:57] FIRING: [5x] KubernetesCalicoDown: kubestage2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:51:50] (03CR) 10Elukey: [C:03+1] "No problem with the approach, I think this change is fine, we asked if possible to ping us before creating new posix groups in the future." [puppet] - 10https://gerrit.wikimedia.org/r/1078963 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [14:53:57] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [14:54:38] (03CR) 10Muehlenhoff: [C:03+1] "Ack, let's merge the patch as-is." [puppet] - 10https://gerrit.wikimedia.org/r/1078963 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [14:54:44] (03CR) 10Scott French: [C:03+1] "One last fix needed, but otherwise LGTM once that's done." 
[puppet] - 10https://gerrit.wikimedia.org/r/1079284 (https://phabricator.wikimedia.org/T376895) (owner: 10Clément Goubert) [14:54:56] RESOLVED: CalicoTyphaDown: Too few (0) calico-typha replicas running - https://wikitech.wikimedia.org/wiki/Calico#Typha" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoTyphaDown [14:54:57] FIRING: [5x] KubernetesCalicoDown: kubestage2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:56:10] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@4b69f50]: Revert previous staging of Refine fixes on test cluster [airflow-dags@4b69f503] [14:56:16] (03PS3) 10Clément Goubert: mw-debug-repl: Support next release [puppet] - 10https://gerrit.wikimedia.org/r/1079284 (https://phabricator.wikimedia.org/T376895) [14:56:23] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@4b69f50]: Revert previous staging of Refine fixes on test cluster [airflow-dags@4b69f503] (duration: 00m 13s) [14:56:36] (03CR) 10Clément Goubert: mw-debug-repl: Support next release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1079284 (https://phabricator.wikimedia.org/T376895) (owner: 10Clément Goubert) [14:58:23] (03CR) 10Ssingh: "I am also interested in moving forward on this and can also do the roll-out. Thanks for working on this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/971409 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [14:58:27] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:45] (03CR) 10Klausman: [C:03+2] modules/admin: add ml-lab-users to render group [puppet] - 10https://gerrit.wikimedia.org/r/1078963 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [14:59:57] RESOLVED: [5x] KubernetesCalicoDown: kubestage2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:00:05] andre and hashar: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241010T1500) [15:02:30] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10218077 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Low a:03Clement_Goubert Yay, thank you! 
[15:02:35] !log ongoing maintenance on mr1-drmrs [15:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T367781)', diff saved to https://phabricator.wikimedia.org/P69628 and previous config saved to /var/cache/conftool/dbconfig/20241010-150433-arnaudb.json [15:04:36] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [15:04:37] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [15:04:50] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [15:04:52] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2136.codfw.wmnet with reason: Maintenance [15:05:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2136.codfw.wmnet with reason: Maintenance [15:05:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2136 (T367781)', diff saved to https://phabricator.wikimedia.org/P69629 and previous config saved to /var/cache/conftool/dbconfig/20241010-150512-arnaudb.json [15:27:12] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:27:53] (03PS6) 10Cathal Mooney: Delegate IPv6 ranges allocated for WMCS Openstack networks in codfw [dns] - 10https://gerrit.wikimedia.org/r/1076713 (https://phabricator.wikimedia.org/T374715) [15:27:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:28:12] (03PS3) 10Klausman: modules/admin: drop gid field from ml-lab-users on lab machines [puppet] - 10https://gerrit.wikimedia.org/r/1079309 (https://phabricator.wikimedia.org/T376380) [15:28:17] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [15:29:31] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4270/co" [puppet] - 10https://gerrit.wikimedia.org/r/1079309 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [15:29:41] (03CR) 10Cathal Mooney: [C:03+2] Delegate IPv6 ranges allocated for WMCS Openstack networks in codfw [dns] - 10https://gerrit.wikimedia.org/r/1076713 (https://phabricator.wikimedia.org/T374715) (owner: 10Cathal Mooney) [15:30:29] (03PS2) 10Jforrester: wikifunctions: Enable Wikidata dereferencing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078951 (https://phabricator.wikimedia.org/T370072) [15:31:18] (03CR) 10Jforrester: [C:03+2] wikifunctions: Enable Wikidata dereferencing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078951 (https://phabricator.wikimedia.org/T370072) (owner: 10Jforrester) [15:31:31] (03CR) 10EoghanGaffney: [C:03+1] gerrit: add specific chrome headless version to bad_browser [puppet] - 10https://gerrit.wikimedia.org/r/1079308 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [15:32:14] (03Merged) 10jenkins-bot: wikifunctions: Enable Wikidata dereferencing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078951 (https://phabricator.wikimedia.org/T370072) (owner: 10Jforrester) [15:32:50] (03CR) 10Jelto: [V:03+1 C:03+2] gerrit: add specific chrome headless version to 
bad_browser [puppet] - 10https://gerrit.wikimedia.org/r/1079308 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [15:33:03] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:34:22] (03PS13) 10Brouberol: Define a ceph rolling restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) [15:35:07] !log dancy@deploy2002 Installing scap version "4.110.0" for 211 hosts [15:36:41] RECOVERY - Host ps1-b12-drmrs is UP: PING OK - Packet loss = 0%, RTA = 87.19 ms [15:36:41] RECOVERY - Host ps1-b13-drmrs is UP: PING OK - Packet loss = 0%, RTA = 86.46 ms [15:37:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:38:27] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P69632 and previous config saved to /var/cache/conftool/dbconfig/20241010-153838-arnaudb.json [15:38:58] (03CR) 10Brouberol: Define a ceph rolling restart/reboot cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) (owner: 10Brouberol) [15:39:06] (03CR) 10Pppery: "This is the only one I plan to do, but OK, noted." 
[puppet] - 10https://gerrit.wikimedia.org/r/1079056 (owner: 10Pppery) [15:39:24] !log dancy@deploy2002 Installation of scap version "4.110.0" completed for 211 hosts [15:40:56] !log mr1-drmrs maintenance complete [15:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:07] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Management routers to 23.4R2-S2 - https://phabricator.wikimedia.org/T369504#10218317 (10Papaul) [15:46:08] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10218350 (10Papaul) [15:46:50] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Jonathan Tweed - https://phabricator.wikimedia.org/T376777#10218352 (10JTweed-WMF) Thanks @MoritzMuehlenhoff and @Aklapper - confirmed as working. [15:47:22] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1002.eqiad.wmnet with OS bookworm [15:49:20] 10ops-codfw, 06DC-Ops: lsw-d[18]-codfw missing console port info in netbox - https://phabricator.wikimedia.org/T376917 (10RobH) 03NEW [15:49:54] 10ops-codfw, 06DC-Ops: lsw-d[18]-codfw missing console port info in netbox - https://phabricator.wikimedia.org/T376917#10218384 (10RobH) a:05RobH→03None [15:53:13] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:53:33] PROBLEM - Host sretest1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:53:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T367781)', diff saved to https://phabricator.wikimedia.org/P69633 and previous config saved to /var/cache/conftool/dbconfig/20241010-155345-arnaudb.json [15:53:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2139.codfw.wmnet with reason: Maintenance [15:53:49] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [15:54:01] !log arnaudb@cumin1002 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2139.codfw.wmnet with reason: Maintenance [15:54:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2140.codfw.wmnet with reason: Maintenance [15:54:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2140.codfw.wmnet with reason: Maintenance [15:54:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2140 (T367781)', diff saved to https://phabricator.wikimedia.org/P69634 and previous config saved to /var/cache/conftool/dbconfig/20241010-155426-arnaudb.json [15:56:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140 (T367781)', diff saved to https://phabricator.wikimedia.org/P69635 and previous config saved to /var/cache/conftool/dbconfig/20241010-155638-arnaudb.json [15:56:48] (03PS1) 10Cathal Mooney: Fix WMCS openstack reverse delegations in codfw [dns] - 10https://gerrit.wikimedia.org/r/1079312 (https://phabricator.wikimedia.org/T374715) [15:57:44] (03PS3) 10Btullis: Switch cephosd1001 to use the nftables based firewall [puppet] - 10https://gerrit.wikimedia.org/r/1050330 (https://phabricator.wikimedia.org/T327259) [15:58:38] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4271/co" [puppet] - 10https://gerrit.wikimedia.org/r/1050330 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [15:59:09] (03CR) 10Btullis: [C:03+1] "Looks good to me." [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) (owner: 10Brouberol) [16:00:05] jhathaway and rzl: That opportune time for a Puppet request window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241010T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:02:15] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubestage1003.eqiad.wmnet [16:02:40] (03PS4) 10Btullis: Switch cephosd1001 to use the nftables based firewall [puppet] - 10https://gerrit.wikimedia.org/r/1050330 (https://phabricator.wikimedia.org/T327259) [16:03:38] 10SRE-swift-storage, 06Data-Persistence, 10MediaWiki-Uploading, 10Thumbor, and 2 others: Some POST of thumbnails to Swift time out - https://phabricator.wikimedia.org/T374911#10218497 (10Krinkle) The given example in the task description (reqId:"83cdd2c7-ef60-4fcc-b4dc-69939b6c1e9b") is a HEAD request from... [16:03:42] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubestage1003.eqiad.wmnet [16:04:04] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4272/co" [puppet] - 10https://gerrit.wikimedia.org/r/1050330 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [16:04:05] (03CR) 10Muehlenhoff: "Best to also run PCC against a another ceph node staying with ferm for now" [puppet] - 10https://gerrit.wikimedia.org/r/1050330 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [16:04:10] 10SRE-swift-storage, 06Data-Persistence, 10MediaWiki-Uploading, 10Thumbor, and 2 others: Some POST of thumbnails to Swift time out - https://phabricator.wikimedia.org/T374911#10218502 (10Krinkle) [16:04:24] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestage1003.eqiad.wmnet with OS bookworm [16:04:53] (03CR) 10Muehlenhoff: "nvm, that referred to PS3 only." 
[puppet] - 10https://gerrit.wikimedia.org/r/1050330 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [16:05:09] FIRING: HelmReleaseBadStatus: Helm release wikifunctions/main-orchestrator on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:05:59] (03CR) 10Muehlenhoff: Switch cephosd1001 to use the nftables based firewall (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1050330 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [16:06:31] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1079309 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [16:06:33] (03PS1) 10RLazarus: deployment_server: Tweak mwscript-cleanup `helm list` pagination [puppet] - 10https://gerrit.wikimedia.org/r/1079314 (https://phabricator.wikimedia.org/T376795) [16:07:01] (03CR) 10Klausman: [V:03+1 C:03+2] modules/admin: drop gid field from ml-lab-users on lab machines [puppet] - 10https://gerrit.wikimedia.org/r/1079309 (https://phabricator.wikimedia.org/T376380) (owner: 10Klausman) [16:07:12] (03CR) 10Ssingh: Fix WMCS openstack reverse delegations in codfw (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1079312 (https://phabricator.wikimedia.org/T374715) (owner: 10Cathal Mooney) [16:10:09] RESOLVED: HelmReleaseBadStatus: Helm release wikifunctions/main-orchestrator on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:11:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db2140', diff saved to https://phabricator.wikimedia.org/P69636 and previous config saved to /var/cache/conftool/dbconfig/20241010-161145-arnaudb.json [16:13:34] !log swfrench@cumin2002 START - Cookbook sre.discovery.service-route depool echostore in codfw: Depooling echostore for migration to service mesh - T376766 [16:13:37] T376766: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766 [16:13:37] (03PS5) 10Btullis: Switch cephosd1001 to use the nftables based firewall [puppet] - 10https://gerrit.wikimedia.org/r/1050330 (https://phabricator.wikimedia.org/T327259) [16:14:28] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4273/co" [puppet] - 10https://gerrit.wikimedia.org/r/1050330 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [16:17:52] 06SRE, 06Traffic-Icebox, 10Wikimedia-Apache-configuration, 13Patch-For-Review, 10Wiki-Setup (Delete / Redirect): redirect sco.wiktionary.org/wiki/(.*?) 
-> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648#10218586 (10Pppery) [16:18:06] (03PS6) 10Pppery: Redirect all namespace-in-Wikipedia cases to Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079054 (https://phabricator.wikimedia.org/T376923) [16:18:13] (03PS7) 10Pppery: Deploy missing.php redirects for Allemanic German [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079055 [16:18:39] !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool echostore in codfw: Depooling echostore for migration to service mesh - T376766 [16:18:42] T376766: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766 [16:19:01] (03CR) 10Scott French: [C:03+2] echostore: adopt service mesh in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079012 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [16:19:50] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10218594 (10phaultfinder) [16:20:06] (03Merged) 10jenkins-bot: echostore: adopt service mesh in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079012 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [16:20:48] (03PS8) 10Pppery: Deploy missing.php redirects for Allemanic German [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079055 (https://phabricator.wikimedia.org/T376923) [16:20:48] (03CR) 10RLazarus: "mwscript-cleanup runs yesterday skipped some ranges of releases in the middle -- I don't have smoking-gun evidence that this is the cause," [puppet] - 10https://gerrit.wikimedia.org/r/1079314 (https://phabricator.wikimedia.org/T376795) (owner: 10RLazarus) [16:20:56] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1050330 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [16:20:57] (03PS5) 10Pppery: Remove als redirects [puppet] - 
10https://gerrit.wikimedia.org/r/1079056 (https://phabricator.wikimedia.org/T376923) [16:21:08] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1003.eqiad.wmnet with reason: host reimage [16:23:10] !log jhathaway@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003.eqiad.wmnet'] [16:23:37] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage1003.eqiad.wmnet with reason: host reimage [16:24:43] RECOVERY - Host sretest1002 is UP: PING WARNING - Packet loss = 80%, RTA = 482.95 ms [16:26:29] PROBLEM - Host sretest1003 is DOWN: PING CRITICAL - Packet loss = 100% [16:26:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140', diff saved to https://phabricator.wikimedia.org/P69637 and previous config saved to /var/cache/conftool/dbconfig/20241010-162652-arnaudb.json [16:27:48] FIRING: PuppetFailure: Puppet has failed on ml-lab1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:30:57] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['sretest1003.eqiad.wmnet'] [16:31:04] RECOVERY - Host sretest1003 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [16:32:48] FIRING: [2x] PuppetFailure: Puppet has failed on ml-lab1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:37:44] (03PS1) 10Scott French: echostore: null out certs.kask in env values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079320 (https://phabricator.wikimedia.org/T376766) [16:37:48] FIRING: [2x] PuppetFailure: Puppet has failed on ml-lab1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:40:39] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage1003.eqiad.wmnet with OS bookworm [16:41:10] (03CR) 10RLazarus: [C:03+1] echostore: null out certs.kask in env values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079320 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [16:41:49] (03CR) 10Cathal Mooney: Fix WMCS openstack reverse delegations in codfw (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1079312 (https://phabricator.wikimedia.org/T374715) (owner: 10Cathal Mooney) [16:41:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2140 (T367781)', diff saved to https://phabricator.wikimedia.org/P69638 and previous config saved to /var/cache/conftool/dbconfig/20241010-164159-arnaudb.json [16:42:01] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2147.codfw.wmnet with reason: Maintenance [16:42:03] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [16:42:15] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2147.codfw.wmnet with reason: Maintenance [16:42:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T367781)', diff saved to https://phabricator.wikimedia.org/P69639 and previous config saved to /var/cache/conftool/dbconfig/20241010-164221-arnaudb.json [16:42:48] RESOLVED: [2x] PuppetFailure: Puppet has failed on ml-lab1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:43:01] (03CR) 10Scott French: [C:03+2] echostore: null out certs.kask in env values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079320 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott 
French) [16:44:02] (03Merged) 10jenkins-bot: echostore: null out certs.kask in env values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079320 (https://phabricator.wikimedia.org/T376766) (owner: 10Scott French) [16:45:47] (03CR) 10Btullis: [V:03+1 C:03+2] Switch cephosd1001 to use the nftables based firewall [puppet] - 10https://gerrit.wikimedia.org/r/1050330 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [16:45:59] (03CR) 10Ssingh: [C:03+1] Fix WMCS openstack reverse delegations in codfw (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1079312 (https://phabricator.wikimedia.org/T374715) (owner: 10Cathal Mooney) [16:46:03] (03CR) 10Btullis: [V:03+1 C:03+2] Switch cephosd1001 to use the nftables based firewall (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1050330 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [16:47:34] !log removing echostore codfw deployment (depooled) to unblock breaking change - T376766 [16:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:38] T376766: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766 [16:47:52] (03PS1) 10BryanDavis: developer-portal: Bump container to 2024-10-03-122026-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079322 [16:49:37] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host cephosd1001.eqiad.wmnet [16:50:15] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/echostore: apply [16:50:47] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/echostore: apply [16:51:17] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host kubestage1003.eqiad.wmnet [16:51:19] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host kubestage1003.eqiad.wmnet [16:51:27] FIRING: [2x] CertAlmostExpired: Certificate for service echostore:8082 is about to expire - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:51:40] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:52:32] PROBLEM - BGP status on lsw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:52:38] PROBLEM - BFD status on lsw1-e1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:53:05] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:53:11] (03CR) 10Cathal Mooney: [C:03+2] Fix WMCS openstack reverse delegations in codfw [dns] - 10https://gerrit.wikimedia.org/r/1079312 (https://phabricator.wikimedia.org/T374715) (owner: 10Cathal Mooney) [16:53:47] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:53:51] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:55:27] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2024-10-03-122026-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079322 (owner: 10BryanDavis) [16:55:38] RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:58:32] RECOVERY - BGP status on lsw1-e1-eqiad.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:58:57] !log swfrench@cumin2002 START - Cookbook sre.discovery.service-route pool echostore in codfw: Repooling echostore after migration to service mesh - T376766 [16:59:04] T376766: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766 [16:59:45] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host cephosd1001.eqiad.wmnet [17:00:05] bd808: Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241010T1700). Please do the needful. [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241010T1700) [17:01:18] (03PS1) 10Jforrester: wikifunctions: Actually enable Wikidata use [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079325 [17:01:34] (03CR) 10Jforrester: [C:03+2] wikifunctions: Actually enable Wikidata use [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079325 (owner: 10Jforrester) [17:03:22] I have a developer portal built to roll out today. I'm in a meeting now though so it will probably be a bit. [17:03:32] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2024-10-03-122026-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079322 (owner: 10BryanDavis) [17:04:00] (03Merged) 10jenkins-bot: wikifunctions: Actually enable Wikidata use [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079325 (owner: 10Jforrester) [17:04:02] !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool echostore in codfw: Repooling echostore after migration to service mesh - T376766 [17:04:06] T376766: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766 [17:04:17] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [17:04:35] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [17:05:25] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [17:07:58] (03PS1) 
10Jforrester: wikifunctions: Add mw-api-int-async-ro route for Wikidata fetches [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079326 [17:12:57] (03CR) 10Bking: [C:03+2] stat hosts: create/enable cgroups for memory and i/o [puppet] - 10https://gerrit.wikimedia.org/r/1079281 (https://phabricator.wikimedia.org/T376653) (owner: 10Bking) [17:15:55] (03CR) 10Jforrester: "Not sure if this is the correct mechanism, or if there's more to do? There's the services_proxy block in values.yaml but it's talking abou" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079326 (owner: 10Jforrester) [17:16:54] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, and 2 others: openstack: work out IPv6 and designate integration - https://phabricator.wikimedia.org/T374715#10218766 (10cmooney) Reverse delegation is now working for the ranges we've assigned to OpenStack. I've not gotten an answer... [17:20:16] !log swfrench@cumin2002 START - Cookbook sre.discovery.service-route depool echostore in eqiad: Depooling echostore for migration to service mesh - T376766 [17:20:28] T376766: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766 [17:25:21] !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool echostore in eqiad: Depooling echostore for migration to service mesh - T376766 [17:29:30] FIRING: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:30:06] (03CR) 10Dzahn: aphlict: limit envoy srange to CACHES [puppet] - 10https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [17:30:12] (03PS3) 10Dzahn: aphlict: limit envoy srange to CACHES [puppet] - 
10https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) [17:32:12] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:33:05] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:33:41] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:34:04] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:34:15] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:34:35] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:37:36] (03PS6) 10Bking: stat hosts: enable zRAM-based swap [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) [17:38:10] !log removing echostore eqiad deployment (depooled) to unblock breaking change - T376766 [17:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:13] T376766: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766 [17:38:55] (03CR) 10Fabfur: "Added the $schema modification" [puppet] - 10https://gerrit.wikimedia.org/r/1074414 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [17:39:30] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/echostore: apply [17:39:47] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/echostore: apply [17:41:27] RESOLVED: CertAlmostExpired: Certificate for service echostore:8082 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#echostore:8082 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:41:34] (03PS3) 10Jdlrobson: DONOTMERGE: Remove legacy UI actions tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077504 (https://phabricator.wikimedia.org/T376065) (owner: 10Kimberly Sarabia) [17:41:53] (03PS4) 10Kimberly Sarabia: 
Remove legacy UI actions tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077504 (https://phabricator.wikimedia.org/T376065) [17:41:56] \o/ [17:42:38] !log swfrench@cumin2002 START - Cookbook sre.discovery.service-route pool echostore in eqiad: Repooling echostore after migration to service mesh - T376766 [17:42:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T367781)', diff saved to https://phabricator.wikimedia.org/P69640 and previous config saved to /var/cache/conftool/dbconfig/20241010-174247-arnaudb.json [17:42:51] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [17:47:43] !log swfrench@cumin2002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool echostore in eqiad: Repooling echostore after migration to service mesh - T376766 [17:47:47] T376766: echostore's TLS certificate expires on 2024-10-13 - https://phabricator.wikimedia.org/T376766 [17:49:46] (03PS1) 10Ebernhardson: Migrate package to opensearch [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/1079333 (https://phabricator.wikimedia.org/T372769) [17:49:53] (03CR) 10Pppery: "(figured you'd be interested in this since you reviewed my other missing.php patches)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078122 (https://phabricator.wikimedia.org/T249648) (owner: 10Pppery) [17:50:43] (03Abandoned) 10Ebernhardson: Migrate package to opensearch [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/1079333 (https://phabricator.wikimedia.org/T372769) (owner: 10Ebernhardson) [17:53:01] (03CR) 10Bking: [C:03+2] stat hosts: enable zRAM-based swap [puppet] - 10https://gerrit.wikimedia.org/r/1078973 (https://phabricator.wikimedia.org/T376813) (owner: 10Bking) [17:54:53] !log root@cumin1002 START - Cookbook sre.puppet.renew-cert for dbprov1001.eqiad.wmnet: Renew puppet certificate - root@cumin1002 [17:57:17] PROBLEM - PyBal backends health check on 
lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:57:17] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:57:42] !log root@cumin1002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for dbprov1001.eqiad.wmnet: Renew puppet certificate - root@cumin1002 [17:57:48] is that just a reboot? [17:57:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P69641 and previous config saved to /var/cache/conftool/dbconfig/20241010-175754-arnaudb.json [17:57:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:58:17] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:58:42] despite that, metris seems down for mw [17:58:45] *metrics [17:58:56] !incidents [17:58:56] 5309 (UNACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [17:59:01] hmm no not a reboot afaik [17:59:08] !ack 5309 [17:59:08] 5309 (ACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [17:59:17] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:59:22] ok, metrics back for me 
[17:59:29] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:59:52] please have a look if not expected, even if things are back to working [17:59:53] interesting ... ah, and yeah if we can't query, then makes sense statograph might break [18:00:05] !log ongoing maintenance on mr1-eqiad [18:00:06] jynus: yes, indeed, doing so [18:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:18] leaving for the day [18:00:50] there was a spike in latenc [18:00:54] *latency [18:01:24] https://grafana.wikimedia.org/goto/vLas13zNg?orgId=1 [18:02:11] Not a reboot, I rebooted the alert hosts yesterday. [18:02:25] yeah, and a correlated jump in load / network: https://grafana.wikimedia.org/goto/PPlU1qzHg?orgId=1 [18:02:41] then could be traffic-related [18:02:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:03:42] yeah, this _sounds_ like expensive queries? [18:04:18] errors on the store too: https://grafana.wikimedia.org/goto/2NC9JqzNR?orgId=1 [18:04:50] seems to be on eqiad only: https://grafana.wikimedia.org/goto/lWYCJqkNg?orgId=1 [18:05:32] since the load has died down and the service recovered, I'll go reset statograph so the status page is not stale. then start looking at queries to back out what this was. [18:05:45] things are under control so I am going to go away, I don't think you need me :-) [18:05:54] thanks, swfrench-wmf [18:06:11] ah, yeah - thanks for surfacing the swift graphs, jynus! 
[18:06:15] and have a good night [18:08:45] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [18:08:59] ah, statograph recovered on a subsequent run - it'll presumably take a minute for the alert to clear [18:09:29] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:09:44] there it is :) [18:13:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P69642 and previous config saved to /var/cache/conftool/dbconfig/20241010-181301-arnaudb.json [18:14:38] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [18:17:33] swfrench-wmf: yeah statograph should clear automatically, the only time manual intervention is required is when there's a gap in the available timeseries [18:18:08] cdanis: ah, that totally makes sense. thanks! 
[18:18:43] (also, for visibility - moving discussion about queries to -observability) [18:18:51] some more notes at the top of https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/statograph/+/refs/heads/master/statograph/uploader.py [18:21:56] (PS1) Jforrester: Update VE core submodule to master (c98f3a542) [extensions/VisualEditor] (wmf/1.43.0-wmf.26) - https://gerrit.wikimedia.org/r/1079335 (https://phabricator.wikimedia.org/T376901) [18:24:45] PROBLEM - BGP status on ssw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - No response from remote host 10.65.2.143 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:26:15] PROBLEM - Host asw2-b-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [18:26:15] PROBLEM - Host asw2-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [18:26:17] PROBLEM - Host asw2-d-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [18:26:17] PROBLEM - Host fasw-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [18:26:17] PROBLEM - Host ps1-a5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [18:26:17] PROBLEM - Host ps1-a1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [18:26:17] PROBLEM - Host ps1-b4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [18:28:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T367781)', diff saved to https://phabricator.wikimedia.org/P69643 and previous config saved to /var/cache/conftool/dbconfig/20241010-182808-arnaudb.json [18:28:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance [18:28:12] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [18:28:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance [18:28:26] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance 
[18:28:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance [18:28:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P69644 and previous config saved to /var/cache/conftool/dbconfig/20241010-182846-arnaudb.json [18:32:12] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:40:10] SRE, SRE-OnFire, observability: Automated uploads of minimal & comprehensible timeseries metrics for statuspage display - https://phabricator.wikimedia.org/T285569#10219128 (CDanis) Open→Resolved a:CDanis In practice the very basic alerting from systemd unit failures has been enough f... [18:42:12] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:42:12] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/VisualEditor] (wmf/1.43.0-wmf.26) - https://gerrit.wikimedia.org/r/1079335 (https://phabricator.wikimedia.org/T376901) (owner: Jforrester) [18:49:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10219162 (phaultfinder) [18:50:01] !log maintenance on mr1-eqiad complete [18:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:23] SRE, Infrastructure-Foundations, netops: Upgrade Management routers to 23.4R2-S2 - 
https://phabricator.wikimedia.org/T369504#10219169 (Papaul) [18:57:56] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1079342 [19:12:05] (CR) RLazarus: [C:+2] mediawiki: Allow setting mwscript job activeDeadlineSeconds [deployment-charts] - https://gerrit.wikimedia.org/r/1078720 (https://phabricator.wikimedia.org/T376099) (owner: RLazarus) [19:13:56] (Merged) jenkins-bot: mediawiki: Allow setting mwscript job activeDeadlineSeconds [deployment-charts] - https://gerrit.wikimedia.org/r/1078720 (https://phabricator.wikimedia.org/T376099) (owner: RLazarus) [19:14:42] jouncebot: nowandnext [19:14:43] No deployments scheduled for the next 0 hour(s) and 45 minute(s) [19:14:43] In 0 hour(s) and 45 minute(s): UTC late backport window. Note: Tyler is hoping to run this window as a demo/training/refresher on backports. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241010T2000) [19:15:24] starting a helmfile-only deploy to clear the diff from a chart version bump, no-op for everything except mw-script [19:19:15] (CR) Dzahn: [C:-1] "Error: Could not find resource 'File[/etc/ferm/conf.d]' in parameter 'require'" [puppet] - https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn) [19:19:48] (PS4) Dzahn: peopleweb: limit envoy srange to CACHES and DEPLOYMENT servers [puppet] - https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) [19:21:52] !log rzl@deploy2002 Started scap sync-world: chart version bump for 1078720 [19:23:25] !log rzl@deploy2002 Finished scap sync-world: chart version bump for 1078720 (duration: 02m 09s) [19:24:13] SRE, Charts, Service-deployment-requests: New Service Request: chart-renderer - https://phabricator.wikimedia.org/T376939 (CDanis) NEW [19:25:13] (CR) RLazarus: [C:+2] deployment_server: Add --timeout flag to mwscript-k8s [puppet] - 
https://gerrit.wikimedia.org/r/1078721 (https://phabricator.wikimedia.org/T376099) (owner: RLazarus) [19:26:18] (PS13) JHathaway: sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] - https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) [19:28:14] (CR) Dzahn: [C:-1] "Could not find resource 'File[/etc/ferm/conf.d]' in parameter 'require'" [puppet] - https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn) [19:29:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P69645 and previous config saved to /var/cache/conftool/dbconfig/20241010-192912-arnaudb.json [19:29:16] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [19:30:17] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1076058 (https://phabricator.wikimedia.org/T375512) (owner: BPirkle) [19:30:32] SRE-OnFire, Incident Tooling: Corto: Scrutinize/finalize template text - https://phabricator.wikimedia.org/T376941 (Eevans) NEW [19:33:27] (CR) Scott French: [C:+1] mw-debug-repl: Support next release [puppet] - https://gerrit.wikimedia.org/r/1079284 (https://phabricator.wikimedia.org/T376895) (owner: Clément Goubert) [19:37:37] (CR) CI reject: [V:-1] sre.hosts.reimage: add UEFI HTTP Boot support [cookbooks] - https://gerrit.wikimedia.org/r/1077497 (https://phabricator.wikimedia.org/T373519) (owner: JHathaway) [19:38:16] (PS1) CDanis: Add chart-renderer deployment server profile [puppet] - https://gerrit.wikimedia.org/r/1079345 (https://phabricator.wikimedia.org/T376939) [19:39:57] (PS1) CDanis: Add chart-renderer namespace [deployment-charts] - 
https://gerrit.wikimedia.org/r/1079350 (https://phabricator.wikimedia.org/T376939) [19:41:47] (PS2) CDanis: Add chart-renderer namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1079350 (https://phabricator.wikimedia.org/T372081) [19:41:58] (PS2) CDanis: Add chart-renderer deployment server profile [puppet] - https://gerrit.wikimedia.org/r/1079345 (https://phabricator.wikimedia.org/T372081) [19:42:24] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:43:17] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@4b69f50]: Stage Webrequest-Refine fix on test cluster [airflow-dags@4b69f503] [19:43:30] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@4b69f50]: Stage Webrequest-Refine fix on test cluster [airflow-dags@4b69f503] (duration: 00m 13s) [19:44:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P69646 and previous config saved to /var/cache/conftool/dbconfig/20241010-194419-arnaudb.json [19:59:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P69647 and previous config saved to /var/cache/conftool/dbconfig/20241010-195926-arnaudb.json [20:00:04] thcipriani, RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. Note: Tyler is hoping to run this window as a demo/training/refresher on backports.. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241010T2000) [20:00:05] kemayo and bpirkle: A patch you scheduled for UTC late backport window. Note: Tyler is hoping to run this window as a demo/training/refresher on backports. is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:12] o/ [20:00:19] I'm here [20:01:43] o/ [20:05:22] howdy I can deploy :) [20:05:44] we're doing a little hangout for deployment today, so please bear with me. [20:05:59] I'm in no hurry [20:06:18] No worries [20:09:22] Kemayo: I'm going to do bpirkle 's patch first since it looks like you've got some localization updates that are going to take a minute [20:09:30] RESOLVED: [2x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:10:06] thcipriani: Sure thing [20:12:30] (CR) CDanis: [C:+1] thanos-query: set OTEL_SERVICE_NAME env variable [puppet] - https://gerrit.wikimedia.org/r/1077068 (https://phabricator.wikimedia.org/T376179) (owner: Herron) [20:12:56] (CR) CDanis: [C:+1] opentelemetry::collector: set default port and update template [puppet] - https://gerrit.wikimedia.org/r/1076006 (https://phabricator.wikimedia.org/T376179) (owner: Herron) [20:12:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [20:13:43] (CR) TrainBranchBot: [C:+2] "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1076058 (https://phabricator.wikimedia.org/T375512) (owner: BPirkle) [20:13:57] (CR) CDanis: [C:+1] "I don't know anything about this new deployment mechanism but sure :D" 
[software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1078435 (https://phabricator.wikimedia.org/T371782) (owner: Giuseppe Lavagetto) [20:14:25] (Merged) jenkins-bot: REST: Make experimental endpoints available on beta and testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1076058 (https://phabricator.wikimedia.org/T375512) (owner: BPirkle) [20:14:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P69648 and previous config saved to /var/cache/conftool/dbconfig/20241010-201433-arnaudb.json [20:14:36] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2172.codfw.wmnet with reason: Maintenance [20:14:37] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [20:14:44] !log thcipriani@deploy2002 Started scap sync-world: Backport for [[gerrit:1076058|REST: Make experimental endpoints available on beta and testwiki (T375512)]] [20:14:47] T375512: REST API Sandbox throwing 404 on test wiki - https://phabricator.wikimedia.org/T375512 [20:14:50] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2172.codfw.wmnet with reason: Maintenance [20:14:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T367781)', diff saved to https://phabricator.wikimedia.org/P69649 and previous config saved to /var/cache/conftool/dbconfig/20241010-201456-arnaudb.json [20:15:53] (CR) CDanis: [C:+1] profile::conftool: add web interface for requestctl [puppet] - https://gerrit.wikimedia.org/r/1078709 (https://phabricator.wikimedia.org/T371782) (owner: Giuseppe Lavagetto) [20:15:59] (CR) CDanis: [C:+1] hiddenparma: add to deployment server [puppet] - https://gerrit.wikimedia.org/r/1078983 (https://phabricator.wikimedia.org/T371782) (owner: Giuseppe Lavagetto) [20:16:06] (CR) CDanis: [C:+1] acme_chief: add SAN for 
requestctl.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1078984 (https://phabricator.wikimedia.org/T371782) (owner: Giuseppe Lavagetto) [20:16:39] (PS2) CDanis: role::alerting_host: add web interface for requestctl [puppet] - https://gerrit.wikimedia.org/r/1078985 (https://phabricator.wikimedia.org/T371782) (owner: Giuseppe Lavagetto) [20:16:40] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1078985 (https://phabricator.wikimedia.org/T371782) (owner: Giuseppe Lavagetto) [20:16:57] !log thcipriani@deploy2002 bpirkle, thcipriani: Backport for [[gerrit:1076058|REST: Make experimental endpoints available on beta and testwiki (T375512)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:17:21] (CR) CI reject: [V:-1] role::alerting_host: add web interface for requestctl [puppet] - https://gerrit.wikimedia.org/r/1078985 (https://phabricator.wikimedia.org/T371782) (owner: Giuseppe Lavagetto) [20:17:43] (PS3) CDanis: role::alerting_host: add web interface for requestctl [puppet] - https://gerrit.wikimedia.org/r/1078985 (https://phabricator.wikimedia.org/T371782) (owner: Giuseppe Lavagetto) [20:17:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [20:18:09] bpirkle: should be on test servers, check please! 
[20:18:13] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1078985 (https://phabricator.wikimedia.org/T371782) (owner: Giuseppe Lavagetto) [20:18:21] Looks good [20:18:36] bpirkle: thanks going live everywhere [20:18:39] !log thcipriani@deploy2002 bpirkle, thcipriani: Continuing with sync [20:19:58] (PS4) CDanis: role::alerting_host: add web interface for requestctl [puppet] - https://gerrit.wikimedia.org/r/1078985 (https://phabricator.wikimedia.org/T371782) (owner: Giuseppe Lavagetto) [20:20:00] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1078985 (https://phabricator.wikimedia.org/T371782) (owner: Giuseppe Lavagetto) [20:21:59] (PS5) CDanis: role::alerting_host: add web interface for requestctl [puppet] - https://gerrit.wikimedia.org/r/1078985 (https://phabricator.wikimedia.org/T371782) (owner: Giuseppe Lavagetto) [20:22:00] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1078985 (https://phabricator.wikimedia.org/T371782) (owner: Giuseppe Lavagetto) [20:23:18] !log thcipriani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1076058|REST: Make experimental endpoints available on beta and testwiki (T375512)]] (duration: 08m 34s) [20:23:22] T375512: REST API Sandbox throwing 404 on test wiki - https://phabricator.wikimedia.org/T375512 [20:23:31] Thanks Tyler [20:24:09] Kemayo: we'll get yours going now. fair warning, l10n updates mean this might take ~40 minutes ish(?) [20:24:28] thcipriani: Sure, just ping me when you want me to test it. 
[20:24:59] (CR) TrainBranchBot: [C:+2] "Approved by thcipriani@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.43.0-wmf.26) - https://gerrit.wikimedia.org/r/1079335 (https://phabricator.wikimedia.org/T376901) (owner: Jforrester) [20:25:59] SRE-OnFire, Incident Tooling: Corto: configuration improvements - https://phabricator.wikimedia.org/T375309#10219586 (Eevans) p:Triage→Medium [20:27:03] (Abandoned) RLazarus: deployment_server: Tweak mwscript-cleanup `helm list` pagination [puppet] - https://gerrit.wikimedia.org/r/1079314 (https://phabricator.wikimedia.org/T376795) (owner: RLazarus) [20:34:37] Kemayo: revised estimate: 40 minutes + time for testing :D [20:35:02] 🎉 [20:38:50] (Restored) RLazarus: deployment_server: Tweak mwscript-cleanup `helm list` pagination [puppet] - https://gerrit.wikimedia.org/r/1079314 (https://phabricator.wikimedia.org/T376795) (owner: RLazarus) [20:39:07] (PS2) RLazarus: deployment_server: Tweak mwscript-cleanup `helm list` pagination [puppet] - https://gerrit.wikimedia.org/r/1079314 (https://phabricator.wikimedia.org/T376795) [20:41:47] (PS3) RLazarus: deployment_server: Tweak mwscript-cleanup `helm list` pagination [puppet] - https://gerrit.wikimedia.org/r/1079314 (https://phabricator.wikimedia.org/T376795) [20:42:02] (Restored) Dzahn: Disable gerrit monitoring on gerrit2003 [puppet] - https://gerrit.wikimedia.org/r/1079206 (https://phabricator.wikimedia.org/T372804) (owner: Hashar) [20:42:49] (CR) Dzahn: "Thanks! After looking at it I actually do want to mask it, at least for today. So let me use it after all." 
[puppet] - https://gerrit.wikimedia.org/r/1079206 (https://phabricator.wikimedia.org/T372804) (owner: Hashar) [20:42:58] (CR) Dzahn: [C:+2] Disable gerrit monitoring on gerrit2003 [puppet] - https://gerrit.wikimedia.org/r/1079206 (https://phabricator.wikimedia.org/T372804) (owner: Hashar) [20:43:41] (PS4) RLazarus: deployment_server: Tweak mwscript-cleanup `helm list` pagination [puppet] - https://gerrit.wikimedia.org/r/1079314 (https://phabricator.wikimedia.org/T376795) [20:54:51] (Merged) jenkins-bot: Update VE core submodule to master (c98f3a542) [extensions/VisualEditor] (wmf/1.43.0-wmf.26) - https://gerrit.wikimedia.org/r/1079335 (https://phabricator.wikimedia.org/T376901) (owner: Jforrester) [20:55:07] !log thcipriani@deploy2002 Started scap sync-world: Backport for [[gerrit:1079335|Update VE core submodule to master (c98f3a542) (T376901)]] [20:55:11] T376901: Copy and paste within tables does not work anymore in the Visual Editor - https://phabricator.wikimedia.org/T376901 [20:57:57] !log thcipriani@deploy2002 jforrester, thcipriani: Backport for [[gerrit:1079335|Update VE core submodule to master (c98f3a542) (T376901)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:58:10] ops-eqiad, SRE, DC-Ops: Repurposing 2x Decommissioned Servers for Phasing Out Puppet 5 - https://phabricator.wikimedia.org/T375000#10219691 (VRiley-WMF) a:VRiley-WMF [20:58:22] Kemayo: good news, we only rebuilt a few languages, so it's on mwdebug, check please! [20:59:00] thcipriani: It looks good. [20:59:18] thanks for checking, going live everywhere. 
[20:59:25] !log thcipriani@deploy2002 jforrester, thcipriani: Continuing with sync [21:02:43] (PS1) Dzahn: gerrit: set Hiera keys for nist_keys, nftables [puppet] - https://gerrit.wikimedia.org/r/1079358 (https://phabricator.wikimedia.org/T372804) [21:04:04] !log thcipriani@deploy2002 Finished scap sync-world: Backport for [[gerrit:1079335|Update VE core submodule to master (c98f3a542) (T376901)]] (duration: 08m 56s) [21:04:08] T376901: Copy and paste within tables does not work anymore in the Visual Editor - https://phabricator.wikimedia.org/T376901 [21:04:15] ^ Kemayo live everywhere! [21:04:21] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@c9a2532]: Webrequest-Refine fix [airflow-dags@c9a2532e] [21:04:31] thanks for flying scap deployment, we know you have a choice in deployers and we appreciate you choosing us <3 [21:05:13] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@c9a2532]: Webrequest-Refine fix [airflow-dags@c9a2532e] (duration: 00m 51s) [21:05:25] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [21:08:55] (CR) Jdlrobson: [C:+1] Remove legacy UI actions tracking [mediawiki-config] - https://gerrit.wikimedia.org/r/1077504 (https://phabricator.wikimedia.org/T376065) (owner: Kimberly Sarabia) [21:09:26] thcipriani: o7 [21:15:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T367781)', diff saved to https://phabricator.wikimedia.org/P69650 and previous config saved to /var/cache/conftool/dbconfig/20241010-211522-arnaudb.json [21:15:34] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [21:30:01] (PS1) Bking: ATS: add mapping for airflow-analytics-test [puppet] - 
https://gerrit.wikimedia.org/r/1079361 (https://phabricator.wikimedia.org/T374948) [21:30:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P69651 and previous config saved to /var/cache/conftool/dbconfig/20241010-213029-arnaudb.json [21:45:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P69652 and previous config saved to /var/cache/conftool/dbconfig/20241010-214536-arnaudb.json [21:50:26] (CR) Dzahn: [C:+2] gerrit: set Hiera keys for nist_keys, nftables [puppet] - https://gerrit.wikimedia.org/r/1079358 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn) [21:52:41] !log jforrester@deploy2002 Started deploy [integration/docroot@ff9e25a]: Add Codex PHP doc and source code link, for T375939 [21:52:46] T375939: CodexPHP: Publish API documentation to doc.wikimedia.org - https://phabricator.wikimedia.org/T375939 [21:52:49] !log jforrester@deploy2002 Finished deploy [integration/docroot@ff9e25a]: Add Codex PHP doc and source code link, for T375939 (duration: 00m 08s) [21:54:30] SRE: host elastic1064 is down - https://phabricator.wikimedia.org/T376960 (Dzahn) NEW [21:55:00] SRE: host rdb1014 is down - https://phabricator.wikimedia.org/T376961 (Dzahn) NEW [22:00:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T367781)', diff saved to https://phabricator.wikimedia.org/P69653 and previous config saved to /var/cache/conftool/dbconfig/20241010-220043-arnaudb.json [22:00:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2199.codfw.wmnet with reason: Maintenance [22:00:59] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [22:01:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2199.codfw.wmnet with reason: 
Maintenance [22:01:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2206.codfw.wmnet with reason: Maintenance [22:01:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2206.codfw.wmnet with reason: Maintenance [22:01:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2206 (T367781)', diff saved to https://phabricator.wikimedia.org/P69654 and previous config saved to /var/cache/conftool/dbconfig/20241010-220125-arnaudb.json [22:01:26] (PS1) Dzahn: gerrit2003: move bind_serviceIP Hiera key host name level [puppet] - https://gerrit.wikimedia.org/r/1079363 (https://phabricator.wikimedia.org/T372804) [22:02:04] (PS2) Dzahn: gerrit2003: move bind_service_ip Hiera key host name level [puppet] - https://gerrit.wikimedia.org/r/1079363 (https://phabricator.wikimedia.org/T372804) [22:04:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T367781)', diff saved to https://phabricator.wikimedia.org/P69655 and previous config saved to /var/cache/conftool/dbconfig/20241010-220437-arnaudb.json [22:04:52] (PS1) Papaul: Remove pfw3 and add pf1 [puppet] - https://gerrit.wikimedia.org/r/1079364 (https://phabricator.wikimedia.org/T374176) [22:06:59] (PS2) Papaul: Remove pfw3 and add pfw1 [puppet] - https://gerrit.wikimedia.org/r/1079364 (https://phabricator.wikimedia.org/T374176) [22:10:44] (CR) Dzahn: [C:+2] gerrit2003: move bind_service_ip Hiera key host name level [puppet] - https://gerrit.wikimedia.org/r/1079363 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn) [22:15:44] (PS3) Dzahn: site: (WIP) try applying gerrit role on gerrit2003 [puppet] - https://gerrit.wikimedia.org/r/1063893 (https://phabricator.wikimedia.org/T372804) [22:15:49] (PS4) Dzahn: site: (WIP) try applying gerrit role on gerrit2003 [puppet] - https://gerrit.wikimedia.org/r/1063893 (https://phabricator.wikimedia.org/T372804) [22:17:22] 
(PS5) Dzahn: site: apply gerrit role on gerrit2003 [puppet] - https://gerrit.wikimedia.org/r/1063893 (https://phabricator.wikimedia.org/T372804) [22:17:46] (CR) Dzahn: [C:+2] gerrit: sync lfs data also to new machine [puppet] - https://gerrit.wikimedia.org/r/1078752 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn) [22:18:55] (CR) Scott French: [C:+1] "Thanks, Reuven!" [puppet] - https://gerrit.wikimedia.org/r/1079314 (https://phabricator.wikimedia.org/T376795) (owner: RLazarus) [22:19:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P69656 and previous config saved to /var/cache/conftool/dbconfig/20241010-221943-arnaudb.json [22:34:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P69657 and previous config saved to /var/cache/conftool/dbconfig/20241010-223450-arnaudb.json [22:35:08] (PS7) Scott French: hieradata: convert remaining mw_releases entries [puppet] - https://gerrit.wikimedia.org/r/1077482 (https://phabricator.wikimedia.org/T370934) [22:35:08] (CR) Scott French: "Hey Ahmon - Although this should be low-risk given that we've used the new format for mwdebug since last week, I don't plan to take any ac" [puppet] - https://gerrit.wikimedia.org/r/1077482 (https://phabricator.wikimedia.org/T370934) (owner: Scott French) [22:35:35] (CR) Dzahn: [C:+2] "yea, timer/service/script was added on gerrit2003 and no-op on the prod servers." 
[puppet] - https://gerrit.wikimedia.org/r/1078752 (https://phabricator.wikimedia.org/T372804) (owner: Dzahn) [22:35:41] (PS7) Scott French: types: remove older Mediawiki_deployment variant [puppet] - https://gerrit.wikimedia.org/r/1077483 (https://phabricator.wikimedia.org/T370934) [22:36:14] (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1077482 (https://phabricator.wikimedia.org/T370934) (owner: Scott French) [22:49:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T367781)', diff saved to https://phabricator.wikimedia.org/P69658 and previous config saved to /var/cache/conftool/dbconfig/20241010-224957-arnaudb.json [22:49:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2210.codfw.wmnet with reason: Maintenance [22:50:01] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [22:50:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2210.codfw.wmnet with reason: Maintenance [22:50:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2210 (T367781)', diff saved to https://phabricator.wikimedia.org/P69659 and previous config saved to /var/cache/conftool/dbconfig/20241010-225019-arnaudb.json [22:52:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T367781)', diff saved to https://phabricator.wikimedia.org/P69660 and previous config saved to /var/cache/conftool/dbconfig/20241010-225231-arnaudb.json [22:54:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10220083 (phaultfinder) [22:55:21] ops-codfw, DC-Ops, serviceops: Q#:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965 (RobH) NEW [22:55:42] ops-codfw, DC-Ops, serviceops: Q#:rack/setup/install wikikube-worker21[56-70] - 
https://phabricator.wikimedia.org/T376965#10220101 (RobH) [22:55:58] ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10220102 (RobH) [23:05:04] ops-codfw, DC-Ops, serviceops: Q#:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968 (RobH) NEW [23:05:08] ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10220158 (RobH) [23:05:55] ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10220166 (RobH) [23:07:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P69661 and previous config saved to /var/cache/conftool/dbconfig/20241010-230738-arnaudb.json [23:22:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P69662 and previous config saved to /var/cache/conftool/dbconfig/20241010-232245-arnaudb.json [23:37:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T367781)', diff saved to https://phabricator.wikimedia.org/P69663 and previous config saved to /var/cache/conftool/dbconfig/20241010-233752-arnaudb.json [23:37:54] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2219.codfw.wmnet with reason: Maintenance [23:37:55] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [23:38:08] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2219.codfw.wmnet with reason: Maintenance [23:38:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2219 (T367781)', diff saved to https://phabricator.wikimedia.org/P69664 and previous config saved to /var/cache/conftool/dbconfig/20241010-233814-arnaudb.json 
[23:38:36] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1079368 [23:38:36] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1079368 (owner: TrainBranchBot) [23:42:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed