[00:04:23] PROBLEM - Check systemd state on urldownloader1001 is CRITICAL: CRITICAL - degraded: The following units failed: man-db.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:12:15] (PS1) Andrew Bogott: cinder-backups: patch to support multiple backends [puppet] - https://gerrit.wikimedia.org/r/909002
[00:12:42] (CR) CI reject: [V: -1] cinder-backups: patch to support multiple backends [puppet] - https://gerrit.wikimedia.org/r/909002 (owner: Andrew Bogott)
[00:13:52] (PS2) Andrew Bogott: cinder-backups: patch to support multiple backends [puppet] - https://gerrit.wikimedia.org/r/909002
[00:16:51] (CR) Andrew Bogott: [C: +2] cinder-backups: patch to support multiple backends [puppet] - https://gerrit.wikimedia.org/r/909002 (owner: Andrew Bogott)
[00:20:48] (PS1) Andrew Bogott: cinder-backup: fix path to manager.py patch file [puppet] - https://gerrit.wikimedia.org/r/909003
[00:20:53] RECOVERY - Disk space on urldownloader1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=urldownloader1001&var-datasource=eqiad+prometheus/ops
[00:22:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:22:23] (CR) Andrew Bogott: [C: +2] cinder-backup: fix path to manager.py patch file [puppet] - https://gerrit.wikimedia.org/r/909003 (owner: Andrew Bogott)
[00:22:31] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[00:24:31] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[00:24:47] (PS1) Andrew Bogott: cinder-backup: fix path to manager.py patch file, again [puppet] - https://gerrit.wikimedia.org/r/909004
[00:26:14] (CR) Andrew Bogott: [C: +2] cinder-backup: fix path to manager.py patch file, again [puppet] - https://gerrit.wikimedia.org/r/909004 (owner: Andrew Bogott)
[00:30:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:34] (PS1) Andrew Bogott: Add cloudbackup2001 as a second cinder-backup host [puppet] - https://gerrit.wikimedia.org/r/909006
[00:38:40] (CR) Andrew Bogott: [C: +2] Add cloudbackup2001 as a second cinder-backup host [puppet] - https://gerrit.wikimedia.org/r/909006 (owner: Andrew Bogott)
[00:39:09] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/908875
[00:39:11] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/908875 (owner: TrainBranchBot)
[00:42:31] ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (phaultfinder)
[00:44:45] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 0 days) https://phabricator.wikimedia.org/tag/toolforge/
[00:45:31] ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (phaultfinder)
[00:47:57] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 60 days) https://phabricator.wikimedia.org/tag/toolforge/
[00:50:08] ACKNOWLEDGEMENT - Check systemd state on cloudbackup2002 is CRITICAL: CRITICAL - degraded: The following units failed: block_sync-misc-project.timer Andrew Bogott rebooting https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:57:17] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/908875 (owner: TrainBranchBot)
[00:58:42] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 0 days) https://phabricator.wikimedia.org/tag/toolforge/
[01:01:56] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 60 days) https://phabricator.wikimedia.org/tag/toolforge/
[01:02:31] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[01:03:30] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[01:27:30] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 0 days) https://phabricator.wikimedia.org/tag/toolforge/
[01:29:02] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 60 days) https://phabricator.wikimedia.org/tag/toolforge/
[01:49:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[02:00:48] (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:10:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:25:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:27:32] RECOVERY - Check systemd state on cloudbackup2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:38:28] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2023-04-17 10:55:19 +0000 (expires in 0 days) https://phabricator.wikimedia.org/tag/toolforge/
[02:39:58] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-06-16 10:47:52 +0000 (expires in 60 days) https://phabricator.wikimedia.org/tag/toolforge/
[02:46:42] RECOVERY - Check systemd state on cloudbackup2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:07:53] (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:27:16] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[04:29:16] ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (phaultfinder)
[04:47:16] ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (phaultfinder)
[04:48:19] !log phedenskog@deploy2002 Started deploy [performance/navtiming@e21f08f]: (no justification provided)
[04:48:26] !log phedenskog@deploy2002 Finished deploy [performance/navtiming@e21f08f]: (no justification provided) (duration: 00m 06s)
[04:50:16] ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (phaultfinder)
[04:51:46] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:51:48] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:53:22] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:53:24] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:07:17] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[05:08:16] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[05:15:42] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:17:16] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:17:52] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:19:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112 T326669', diff saved to https://phabricator.wikimedia.org/P46931 and previous config saved to /var/cache/conftool/dbconfig/20230417-051903-marostegui.json
[05:19:08] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[05:20:58] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:21:02] (PS1) Marostegui: db1212: Move it to s3 [puppet] - https://gerrit.wikimedia.org/r/909008 (https://phabricator.wikimedia.org/T326669)
[05:21:28] (CR) Marostegui: [C: +2] db1212: Move it to s3 [puppet] - https://gerrit.wikimedia.org/r/909008 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[05:21:52] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:23:17] (PS1) Marostegui: instances.yaml: Remove db1100 from dbctl [puppet] - https://gerrit.wikimedia.org/r/909009 (https://phabricator.wikimedia.org/T329352)
[05:26:28] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:31:06] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:32:35] !log Stop MariaDB on db1112 to clone db1212 - this will generate lag on s3 wiki replicas
[05:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:48] (CR) Marostegui: [C: +2] instances.yaml: Remove db1100 from dbctl [puppet] - https://gerrit.wikimedia.org/r/909009 (https://phabricator.wikimedia.org/T329352) (owner: Marostegui)
[05:33:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1100 from dbctl T329352', diff saved to https://phabricator.wikimedia.org/P46933 and previous config saved to /var/cache/conftool/dbconfig/20230417-053310-marostegui.json
[05:33:16] T329352: decommission db1100.eqiad.wmnet - https://phabricator.wikimedia.org/T329352
[05:34:29] (PS1) Marostegui: db1214: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/909010 (https://phabricator.wikimedia.org/T326669)
[05:35:34] (CR) Marostegui: [C: +2] db1214: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/909010 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[05:37:18] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:37:54] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:38:14] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:39:46] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:39:50] (PS1) Marostegui: instances.yaml: Add db1214 to dbctl [puppet] - https://gerrit.wikimedia.org/r/909011 (https://phabricator.wikimedia.org/T326669)
[05:40:47] (CR) Marostegui: [C: +2] instances.yaml: Add db1214 to dbctl [puppet] - https://gerrit.wikimedia.org/r/909011 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[05:41:00] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:41:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1214 to dbctl T326669', diff saved to https://phabricator.wikimedia.org/P46934 and previous config saved to /var/cache/conftool/dbconfig/20230417-054154-marostegui.json
[05:42:00] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[05:42:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1214 (re)pooling @ 1%: Pooling db1214 T326669', diff saved to https://phabricator.wikimedia.org/P46935 and previous config saved to /var/cache/conftool/dbconfig/20230417-054235-root.json
[05:47:15] (PS1) Marostegui: db1152: Promote it to x2 master [puppet] - https://gerrit.wikimedia.org/r/909012 (https://phabricator.wikimedia.org/T334663)
[05:49:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:50:30] (CR) Marostegui: [C: +2] db1152: Promote it to x2 master [puppet] - https://gerrit.wikimedia.org/r/909012 (https://phabricator.wikimedia.org/T334663) (owner: Marostegui)
[05:51:57] (PS1) Marostegui: db1151: No longer master [puppet] - https://gerrit.wikimedia.org/r/909013
[05:52:27] (CR) Marostegui: [C: +2] db1151: No longer master [puppet] - https://gerrit.wikimedia.org/r/909013 (owner: Marostegui)
[05:52:33] (PS3) Clément Goubert: linkrecommendation: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905941 (https://phabricator.wikimedia.org/T334060)
[05:56:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1152 to x2 primary T334663', diff saved to https://phabricator.wikimedia.org/P46936 and previous config saved to /var/cache/conftool/dbconfig/20230417-055644-root.json
[05:56:50] T334663: Upgrade + reboot x2 DB hosts - https://phabricator.wikimedia.org/T334663
[05:57:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change db1152 weight', diff saved to https://phabricator.wikimedia.org/P46937 and previous config saved to /var/cache/conftool/dbconfig/20230417-055721-root.json
[05:57:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1214 (re)pooling @ 2%: Pooling db1214 T326669', diff saved to https://phabricator.wikimedia.org/P46938 and previous config saved to /var/cache/conftool/dbconfig/20230417-055739-root.json
[05:57:45] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[06:06:07] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:07:21] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:08:56] (PS3) Clément Goubert: push-notifications: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905942 (https://phabricator.wikimedia.org/T334061)
[06:09:21] (PS1) Marostegui: db1151: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/909014
[06:10:00] (CR) Marostegui: [C: +2] db1151: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/909014 (owner: Marostegui)
[06:12:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1214 (re)pooling @ 3%: Pooling db1214 T326669', diff saved to https://phabricator.wikimedia.org/P46939 and previous config saved to /var/cache/conftool/dbconfig/20230417-061244-root.json
[06:12:50] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[06:14:19] (PS3) Clément Goubert: recommendation-api: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905943 (https://phabricator.wikimedia.org/T334062)
[06:21:03] (PS1) Marostegui: dbproxy1014: Add db1217 to m1 non active proxy [puppet] - https://gerrit.wikimedia.org/r/909024 (https://phabricator.wikimedia.org/T326669)
[06:21:28] (CR) Marostegui: [C: +2] dbproxy1014: Add db1217 to m1 non active proxy [puppet] - https://gerrit.wikimedia.org/r/909024 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[06:21:49] qchris: o/ thanks a lot for the new repo :)
[06:25:03] (PS1) Marostegui: mariadb: Add db1217 to non active proxies [puppet] - https://gerrit.wikimedia.org/r/909038 (https://phabricator.wikimedia.org/T326669)
[06:25:39] (CR) Marostegui: [C: +2] mariadb: Add db1217 to non active proxies [puppet] - https://gerrit.wikimedia.org/r/909038 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[06:25:48] (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:27:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1214 (re)pooling @ 4%: Pooling db1214 T326669', diff saved to https://phabricator.wikimedia.org/P46940 and previous config saved to /var/cache/conftool/dbconfig/20230417-062749-root.json
[06:27:55] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[06:29:12] (PS1) Marostegui: mariadb: Add db1217 to active proxies [puppet] - https://gerrit.wikimedia.org/r/909040 (https://phabricator.wikimedia.org/T326669)
[06:29:38] (CR) Marostegui: [C: +2] mariadb: Add db1217 to active proxies [puppet] - https://gerrit.wikimedia.org/r/909040 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[06:32:47] (PS1) Marostegui: dbprov1002,dbprov1003: Replace db1117 with db1217 [puppet] - https://gerrit.wikimedia.org/r/909041 (https://phabricator.wikimedia.org/T326669)
[06:33:09] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:33:31] (CR) Marostegui: "jcrespo can you merge this whenever you are ready? If possible, also run a backup to make sure everything works as expected. db1217 was cl" [puppet] - https://gerrit.wikimedia.org/r/909041 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[06:34:15] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:34:39] (PS1) Marostegui: install_server: Do not reimage db1217 [puppet] - https://gerrit.wikimedia.org/r/909042 (https://phabricator.wikimedia.org/T326669)
[06:35:12] (CR) Marostegui: [C: +2] install_server: Do not reimage db1217 [puppet] - https://gerrit.wikimedia.org/r/909042 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[06:42:52] (CR) Clément Goubert: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/908955 (https://phabricator.wikimedia.org/T334064) (owner: Clément Goubert)
[06:42:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1214 (re)pooling @ 5%: Pooling db1214 T326669', diff saved to https://phabricator.wikimedia.org/P46941 and previous config saved to /var/cache/conftool/dbconfig/20230417-064254-root.json
[06:43:00] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[06:45:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1109 T334820', diff saved to https://phabricator.wikimedia.org/P46942 and previous config saved to /var/cache/conftool/dbconfig/20230417-064525-marostegui.json
[06:45:31] T334820: decommission db1109.eqiad.wmnet - https://phabricator.wikimedia.org/T334820
[06:46:41] (PS1) Marostegui: db1109: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/909043 (https://phabricator.wikimedia.org/T326683)
[06:47:14] (CR) Marostegui: [C: +2] db1109: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/909043 (https://phabricator.wikimedia.org/T326683) (owner: Marostegui)
[06:51:05] (PS3) Clément Goubert: api-gateway: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065)
[06:52:10] (CR) CI reject: [V: -1] api-gateway: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: Clément Goubert)
[06:58:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1214 (re)pooling @ 10%: Pooling db1214 T326669', diff saved to https://phabricator.wikimedia.org/P46943 and previous config saved to /var/cache/conftool/dbconfig/20230417-065759-root.json
[06:58:05] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:00:04] Amir1, Urbanecm, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230417T0700).
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:00:07] PROBLEM - DPKG on dse-k8s-worker1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[07:03:19] (CR) Muehlenhoff: [C: +2] Install ruby-sorted-set on Bookworm [puppet] - https://gerrit.wikimedia.org/r/908833 (https://phabricator.wikimedia.org/T330495) (owner: Muehlenhoff)
[07:06:57] (PS1) Slyngshede: Requirements: Add django-simple-captcha [software/bitu] - https://gerrit.wikimedia.org/r/909045
[07:07:53] (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:10:55] (PS4) Clément Goubert: api-gateway: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065)
[07:11:47] (CR) CI reject: [V: -1] api-gateway: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: Clément Goubert)
[07:13:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1214 (re)pooling @ 25%: Pooling db1214 T326669', diff saved to https://phabricator.wikimedia.org/P46944 and previous config saved to /var/cache/conftool/dbconfig/20230417-071304-root.json
[07:13:10] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:13:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[07:13:40] SRE, Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm
[07:13:53] (PS5) Clément Goubert: api-gateway: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065)
[07:18:59] (PS8) Slyngshede: SSH Keymanagement, allow user to manage ssh keys. [software/bitu] - https://gerrit.wikimedia.org/r/899519
[07:23:00] (PS6) Clément Goubert: api-gateway: Switch to mw-api-int-async on k8s [deployment-charts] - https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065)
[07:25:14] (PS1) Elukey: Add initial debianizazion [debs/amd-k8s-device-plugin] - https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009)
[07:26:59] (PS2) Elukey: Add initial debianizazion [debs/amd-k8s-device-plugin] - https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009)
[07:28:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1214 (re)pooling @ 50%: Pooling db1214 T326669', diff saved to https://phabricator.wikimedia.org/P46945 and previous config saved to /var/cache/conftool/dbconfig/20230417-072809-root.json
[07:28:15] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:30:54] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage
[07:30:55] (Abandoned) Slyngshede: Requirements: Add django-simple-captcha [software/bitu] - https://gerrit.wikimedia.org/r/909045 (owner: Slyngshede)
[07:33:54] (CR) Elukey: Add initial debianizazion (1 comment) [debs/amd-k8s-device-plugin] - https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) (owner: Elukey)
[07:34:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage
[07:35:36] (PS4) Slyngshede: Password reset - Allow users to request a password reset. [software/bitu] - https://gerrit.wikimedia.org/r/900277
[07:36:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host netflow6002.drmrs.wmnet
[07:36:48] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:43:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1214 (re)pooling @ 75%: Pooling db1214 T326669', diff saved to https://phabricator.wikimedia.org/P46946 and previous config saved to /var/cache/conftool/dbconfig/20230417-074313-root.json
[07:43:19] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow6002.drmrs.wmnet - jmm@cumin2002"
[07:43:19] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:43:23] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm
[07:43:29] SRE, Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002 (**FAIL**) - Re...
[07:44:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow6002.drmrs.wmnet - jmm@cumin2002"
[07:44:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:44:24] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netflow6002.drmrs.wmnet on all recursors
[07:44:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow6002.drmrs.wmnet on all recursors
[07:47:57] (PS1) Marostegui: db1122: No more candidate master [puppet] - https://gerrit.wikimedia.org/r/909178 (https://phabricator.wikimedia.org/T326669)
[07:48:38] (CR) Marostegui: [C: +2] db1122: No more candidate master [puppet] - https://gerrit.wikimedia.org/r/909178 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[07:49:33] !log restart haproxy on cp3054 - T334448
[07:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:37] T334448: HAProxy 2.6.12 segfaults - https://phabricator.wikimedia.org/T334448
[07:49:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1100.eqiad.wmnet
[07:50:36] (PS1) Marostegui: mariadb: Decommission db1100 [puppet] - https://gerrit.wikimedia.org/r/909179 (https://phabricator.wikimedia.org/T329352)
[07:54:06] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow6002.drmrs.wmnet - jmm@cumin2002"
[07:54:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:55:09] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[07:55:31] SRE, Phabricator: Remove phabricator Multi-factor Auth for Atieno - https://phabricator.wikimedia.org/T334480 (Clement_Goubert) I can do that as part of Clinic Duty, please schedule some time between 08:00 and 16:00UTC.
[07:55:45] SRE, Phabricator: Remove phabricator Multi-factor Auth for Atieno - https://phabricator.wikimedia.org/T334480 (Clement_Goubert) a:Clement_Goubert
[07:56:18] (CR) Marostegui: [C: +2] mariadb: Decommission db1100 [puppet] - https://gerrit.wikimedia.org/r/909179 (https://phabricator.wikimedia.org/T329352) (owner: Marostegui)
[07:57:19] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1100.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[07:58:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1214 (re)pooling @ 100%: Pooling db1214 T326669', diff saved to https://phabricator.wikimedia.org/P46948 and previous config saved to /var/cache/conftool/dbconfig/20230417-075818-root.json
[07:58:24] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669
[07:58:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1100.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[07:58:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:58:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1100.eqiad.wmnet
[07:59:17] ops-eqiad, DBA, decommission-hardware: decommission db1100.eqiad.wmnet - https://phabricator.wikimedia.org/T329352 (Marostegui) This is ready for DC-Ops
[07:59:29] ops-eqiad, decommission-hardware: decommission db1100.eqiad.wmnet - https://phabricator.wikimedia.org/T329352 (Marostegui)
[08:02:13] (PS1) Marostegui: mariadb: Productionize db1212 [puppet] - https://gerrit.wikimedia.org/r/909180 (https://phabricator.wikimedia.org/T326669)
[08:04:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:07:09] SRE-Access-Requests, Infrastructure Security, Infrastructure-Foundations, fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (Clement_Goubert)
[08:07:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:09:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1207.eqiad.wmnet with reason: Maintenance
[08:10:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1207.eqiad.wmnet with reason: Maintenance
[08:11:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1207 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P46950 and previous config saved to /var/cache/conftool/dbconfig/20230417-081108-ladsgroup.json
[08:11:11] SRE, Infrastructure-Foundations: Updated java security policy in OpenJDK 8 u265 - https://phabricator.wikimedia.org/T261196 (MoritzMuehlenhoff)
[08:11:13] SRE, LDAP-Access-Requests, Patch-For-Review, User-MarcoAurelio: add MarcoAurelio to LDAP nda group - https://phabricator.wikimedia.org/T333884 (Clement_Goubert) a:Clement_Goubert
[08:14:50] SRE, SRE-Access-Requests, Infrastructure Security, Infrastructure-Foundations, fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (Clement_Goubert)
[08:16:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[08:16:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[08:17:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[08:17:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[08:17:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T333332)', diff saved to https://phabricator.wikimedia.org/P46951 and previous config saved to /var/cache/conftool/dbconfig/20230417-081732-ladsgroup.json
[08:17:37] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[08:17:40] (CR) Marostegui: [C: +2] mariadb: Productionize db1212 [puppet] - https://gerrit.wikimedia.org/r/909180 (https://phabricator.wikimedia.org/T326669) (owner: Marostegui)
[08:18:52] (CR) JMeybohm: "I'm missing the node labeler binary (package) here" [debs/amd-k8s-device-plugin] - https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) (owner: Elukey)
[08:19:23] (CR) Clément Goubert: [C: +2] admin: add Marco Aurelio to LDAP-only admins (nda) [puppet] - https://gerrit.wikimedia.org/r/908622 (https://phabricator.wikimedia.org/T333884) (owner: Dzahn)
[08:19:29] (CR) Clément Goubert: [C: +2] "NDA confirmed."
[puppet] - 10https://gerrit.wikimedia.org/r/908622 (https://phabricator.wikimedia.org/T333884) (owner: 10Dzahn) [08:19:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T333332)', diff saved to https://phabricator.wikimedia.org/P46952 and previous config saved to /var/cache/conftool/dbconfig/20230417-081953-ladsgroup.json [08:20:40] (03CR) 10Elukey: Add initial debianizazion (032 comments) [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [08:21:04] 10SRE, 10SRE-swift-storage: Memory exhaustion when uploading large TIFF files by URL - https://phabricator.wikimedia.org/T334814 (10Peachey88) [08:22:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:22:56] (03PS1) 10Marostegui: db1123: No longer candidate master [puppet] - 10https://gerrit.wikimedia.org/r/909181 (https://phabricator.wikimedia.org/T326669) [08:23:28] (03CR) 10Marostegui: [C: 03+2] db1123: No longer candidate master [puppet] - 10https://gerrit.wikimedia.org/r/909181 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [08:24:33] RECOVERY - Check systemd state on ms-be2069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:40] (03PS5) 10Ayounsi: Manage drmrs LVS/bird BGP with Homer [homer/public] - 10https://gerrit.wikimedia.org/r/905257 [08:25:04] (03CR) 10JMeybohm: Add initial debianizazion (031 comment) [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [08:25:20] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review, 
10User-MarcoAurelio: add MarcoAurelio to LDAP nda group - https://phabricator.wikimedia.org/T333884 (10Clement_Goubert) 05In progress→03Resolved ` cgoubert@mwmaint2002:~$ ldapsearch -x cn=nda | grep maur member: uid=maurelio,ou=people,dc=wikimedia,dc=o... [08:26:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1207 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P46953 and previous config saved to /var/cache/conftool/dbconfig/20230417-082613-ladsgroup.json [08:26:43] (03CR) 10Ayounsi: [C: 03+2] Manage drmrs LVS/bird BGP with Homer [homer/public] - 10https://gerrit.wikimedia.org/r/905257 (owner: 10Ayounsi) [08:27:17] (03Merged) 10jenkins-bot: Manage drmrs LVS/bird BGP with Homer [homer/public] - 10https://gerrit.wikimedia.org/r/905257 (owner: 10Ayounsi) [08:27:31] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [08:29:30] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [08:29:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P46954 and previous config saved to /var/cache/conftool/dbconfig/20230417-082934-root.json [08:30:04] (03CR) 10Elukey: Add initial debianizazion (031 comment) [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [08:31:28] (03PS1) 10Hnowlan: rest-gateway: Extend timeout to 150s [deployment-charts] - 10https://gerrit.wikimedia.org/r/909183 (https://phabricator.wikimedia.org/T334611) [08:31:32] (03CR) 10Muehlenhoff: Add initial debianizazion (031 comment) [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [08:32:39] (03CR) 10Elukey: Add initial debianizazion (031 comment) [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 
(https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [08:33:59] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow6002.drmrs.wmnet - jmm@cumin2002" [08:34:00] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:35:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P46955 and previous config saved to /var/cache/conftool/dbconfig/20230417-083459-ladsgroup.json [08:36:08] 10SRE, 10SRE-Access-Requests, 10API Platform (Sprint 07): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (10Clement_Goubert) @Atieno Please read and sign the L3 "Acknowledgement of Wikimedia Server Access Responsibilities" document. Thank you. https://... [08:37:37] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Clement_Goubert) 05Open→03Stalled [08:38:45] (03PS2) 10Hnowlan: rest-gateway: Extend proton timeout to 150s [deployment-charts] - 10https://gerrit.wikimedia.org/r/909183 (https://phabricator.wikimedia.org/T334611) [08:39:38] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [08:39:42] 10SRE, 10SRE-Access-Requests, 10API Platform (Sprint 07): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (10Clement_Goubert) 05In progress→03Stalled [08:39:54] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:40:03] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: support for proton (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/908559 (https://phabricator.wikimedia.org/T334611) (owner: 10Hnowlan) [08:40:25] (03CR) 10Elukey: Add initial debianizazion (031 comment) [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 
(https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [08:41:17] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for KMorgan - https://phabricator.wikimedia.org/T334432 (10Clement_Goubert) 05In progress→03Stalled @KMorgan-WMF Please read and sign the L3 "Acknowledgement of Wikimedia Server Access Responsibilities" documen... [08:41:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1207 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P46956 and previous config saved to /var/cache/conftool/dbconfig/20230417-084118-ladsgroup.json [08:42:48] 10SRE, 10SRE-swift-storage: Memory exhaustion when uploading large TIFF files by URL - https://phabricator.wikimedia.org/T334814 (10MatthewVernon) @Peachey88 not sure why you adjusted the projects thus: these errors aren't coming from the Swift layer, I don't think, but from a bit of the upload tooling? [08:44:10] (03CR) 10Elukey: Add initial debianizazion (031 comment) [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [08:44:19] (03PS3) 10Elukey: Add initial debianizazion [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) [08:44:23] (03CR) 10Kamila Součková: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/909183 (https://phabricator.wikimedia.org/T334611) (owner: 10Hnowlan) [08:44:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46957 and previous config saved to /var/cache/conftool/dbconfig/20230417-084439-root.json [08:45:14] (03CR) 10Ayounsi: [C: 03+1] "Would be nice but not a blocker that @volans reviews it, otherwise let's give it a try!" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [08:45:58] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10Clement_Goubert) [08:47:31] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [08:48:18] !log kamila@deploy2002 conftool action : set/pooled=yes:weight=5; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet [08:50:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P46958 and previous config saved to /var/cache/conftool/dbconfig/20230417-085005-ladsgroup.json [08:50:30] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334786 (10phaultfinder) [08:51:09] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM netflow6002.drmrs.wmnet - jmm@cumin2002" [08:52:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM netflow6002.drmrs.wmnet - jmm@cumin2002" [08:52:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:52:07] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netflow6002.drmrs.wmnet on all recursors [08:52:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow6002.drmrs.wmnet on all recursors [08:52:10] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host netflow6002.drmrs.wmnet [08:54:11] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=5; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet [08:54:31] !log hnowlan@puppetmaster1001 conftool action : 
set/pooled=yes; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [08:55:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host netflow6002.drmrs.wmnet [08:55:15] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:55:48] !log kamila@deploy2002 conftool action : set/pooled=yes:weight=5; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [08:56:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1207 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P46959 and previous config saved to /var/cache/conftool/dbconfig/20230417-085623-ladsgroup.json [08:57:25] (03PS1) 10Muehlenhoff: Update access date [puppet] - 10https://gerrit.wikimedia.org/r/909185 [08:57:56] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow6002.drmrs.wmnet - jmm@cumin2002" [08:58:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow6002.drmrs.wmnet - jmm@cumin2002" [08:58:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:58:55] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netflow6002.drmrs.wmnet on all recursors [08:58:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow6002.drmrs.wmnet on all recursors [08:59:03] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:59:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P46960 and previous config saved to /var/cache/conftool/dbconfig/20230417-085944-root.json [09:00:47] RECOVERY - DPKG on dse-k8s-worker1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:01:54] (03CR) 10Muehlenhoff: [C: 03+2] Update access date [puppet] - 
10https://gerrit.wikimedia.org/r/909185 (owner: 10Muehlenhoff) [09:02:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905626 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [09:03:09] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM netflow6002.drmrs.wmnet - jmm@cumin2002" [09:03:43] (03Merged) 10jenkins-bot: Also broadcast RCFeed/IRC events to irc1002/irc2002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905626 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [09:03:53] (03CR) 10JMeybohm: Add initial debianizazion (032 comments) [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [09:04:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM netflow6002.drmrs.wmnet - jmm@cumin2002" [09:04:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:04:12] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netflow6002.drmrs.wmnet on all recursors [09:04:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow6002.drmrs.wmnet on all recursors [09:04:19] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host netflow6002.drmrs.wmnet [09:04:24] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:905626|Also broadcast RCFeed/IRC events to irc1002/irc2002 (T331702)]] [09:04:31] T331702: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702 [09:05:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T333332)', diff saved to https://phabricator.wikimedia.org/P46961 and 
previous config saved to /var/cache/conftool/dbconfig/20230417-090512-ladsgroup.json [09:05:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance [09:05:17] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [09:05:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1129.eqiad.wmnet with reason: Maintenance [09:05:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T333332)', diff saved to https://phabricator.wikimedia.org/P46962 and previous config saved to /var/cache/conftool/dbconfig/20230417-090535-ladsgroup.json [09:07:33] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [09:07:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T333332)', diff saved to https://phabricator.wikimedia.org/P46963 and previous config saved to /var/cache/conftool/dbconfig/20230417-090751-ladsgroup.json [09:08:31] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [09:12:54] !log ladsgroup@deploy2002 jmm and ladsgroup: Backport for [[gerrit:905626|Also broadcast RCFeed/IRC events to irc1002/irc2002 (T331702)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [09:12:59] T331702: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702 [09:14:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46964 and previous config saved to /var/cache/conftool/dbconfig/20230417-091449-root.json [09:18:46] 10SRE, 10Wikimedia-Mailing-lists: Archive wikifr-l Mailing list - https://phabricator.wikimedia.org/T320312 (10Ladsgroup) 05Open→03Resolved I properly archived it now. 
It's possible to revert this if there is a need later. [09:22:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P46965 and previous config saved to /var/cache/conftool/dbconfig/20230417-092258-ladsgroup.json [09:23:11] (03PS4) 10Elukey: Add initial debianizazion [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) [09:23:40] (03CR) 10Elukey: Add initial debianizazion (031 comment) [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [09:24:10] (03CR) 10David Caro: [C: 03+1] "Neat, this works well for me :)" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/908369 (https://phabricator.wikimedia.org/T334586) (owner: 10Raymond Ndibe) [09:24:56] (03PS5) 10Elukey: Add initial debianizazion [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) [09:26:28] (03CR) 10Jcrespo: [C: 03+2] dbprov1002,dbprov1003: Replace db1117 with db1217 [puppet] - 10https://gerrit.wikimedia.org/r/909041 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [09:29:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/908927 (https://phabricator.wikimedia.org/T334736) (owner: 10EoghanGaffney) [09:29:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46966 and previous config saved to /var/cache/conftool/dbconfig/20230417-092954-root.json [09:32:50] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Move s2 and s3 backups from db1102 to db1225 [puppet] - 10https://gerrit.wikimedia.org/r/908798 (https://phabricator.wikimedia.org/T334057) (owner: 10Jcrespo) [09:34:10] (03CR) 10Jbond: [C: 03+1] opensearch_dashboards: add package provider [puppet] - 
10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [09:34:41] PROBLEM - Check systemd state on dse-k8s-worker1001 is CRITICAL: CRITICAL - degraded: The following units failed: amd-k8s-device-plugin.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:35:25] PROBLEM - Check systemd state on sretest1002 is CRITICAL: CRITICAL - degraded: The following units failed: confd_prometheus_metrics.service,ferm.service,prometheus-nic-firmware-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:37:09] PROBLEM - Host an-worker1132 is DOWN: PING CRITICAL - Packet loss = 100% [09:37:10] (03PS6) 10Elukey: Add initial debianizazion [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) [09:38:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P46967 and previous config saved to /var/cache/conftool/dbconfig/20230417-093804-ladsgroup.json [09:38:27] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1116.eqiad.wmnet with reason: T334066 [09:38:31] T334066: Replace db1116 with db1216 - https://phabricator.wikimedia.org/T334066 [09:38:53] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1116.eqiad.wmnet with reason: T334066 [09:39:03] (03PS2) 10Hokwelum: make dumpsdata1006 the xmlfallback host [puppet] - 10https://gerrit.wikimedia.org/r/908995 [09:39:05] RECOVERY - Check systemd state on urldownloader1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:40:21] (03CR) 10Jbond: [C: 03+1] Fix bug where connection timeout is read as tuple. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/908845 (owner: 10Slyngshede) [09:41:08] (03Abandoned) 10Jbond: ssl_ssl_ciphersuite: Add AES256-SHA256 to list of mid cipher [puppet] - 10https://gerrit.wikimedia.org/r/908902 (owner: 10Jbond) [09:42:15] (03PS1) 10Jcrespo: dbbackups: Setup db1216 as the new backup source instead of db1116 [puppet] - 10https://gerrit.wikimedia.org/r/909195 (https://phabricator.wikimedia.org/T334066) [09:42:33] (03CR) 10Hokwelum: make dumpsdata1006 the xmlfallback host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908995 (owner: 10Hokwelum) [09:42:41] RECOVERY - Check systemd state on dse-k8s-worker1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:43:03] (03CR) 10Elukey: "built and tested on dse-k8s-worker1001, seems working as expected :)" [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [09:44:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46968 and previous config saved to /var/cache/conftool/dbconfig/20230417-094459-root.json [09:46:13] (03PS1) 10Ladsgroup: filebackend: Find thumbnails from all backends in FileBackendMultiWrite [core] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/908959 (https://phabricator.wikimedia.org/T331138) [09:46:39] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/908891 (https://phabricator.wikimedia.org/T334611) (owner: 10Hnowlan) [09:48:45] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:905626|Also broadcast RCFeed/IRC events to irc1002/irc2002 (T331702)]] (duration: 44m 21s) [09:48:50] T331702: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702 [09:49:00] (03PS1) 10Elukey: Add a simple Docker image to test AMD GPUs 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909196 (https://phabricator.wikimedia.org/T333009) [09:49:10] (03CR) 10Slyngshede: [C: 03+2] Fix bug where connection timeout is read as tuple. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/908845 (owner: 10Slyngshede) [09:49:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:51:50] (03Merged) 10jenkins-bot: rest-gateway: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/908891 (https://phabricator.wikimedia.org/T334611) (owner: 10Hnowlan) [09:52:01] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Setup db1216 as the new backup source instead of db1116 [puppet] - 10https://gerrit.wikimedia.org/r/909195 (https://phabricator.wikimedia.org/T334066) (owner: 10Jcrespo) [09:52:15] PROBLEM - Check whether ferm is active by checking the default input chain on sretest1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:53:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T333332)', diff saved to https://phabricator.wikimedia.org/P46969 and previous config saved to /var/cache/conftool/dbconfig/20230417-095311-ladsgroup.json [09:53:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [09:53:16] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [09:53:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [09:53:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with 
reason: Maintenance [09:53:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [09:54:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T333332)', diff saved to https://phabricator.wikimedia.org/P46970 and previous config saved to /var/cache/conftool/dbconfig/20230417-095404-ladsgroup.json [09:55:19] (03PS1) 10David Caro: replica_cnf: skip robots.txt when listing paws users [puppet] - 10https://gerrit.wikimedia.org/r/909198 (https://phabricator.wikimedia.org/T334828) [09:56:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T333332)', diff saved to https://phabricator.wikimedia.org/P46971 and previous config saved to /var/cache/conftool/dbconfig/20230417-095625-ladsgroup.json [09:57:04] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10BTullis) I'm looking into this now. Having rebooted the host and gone into the RAID controller setup, I can confirm that we see all 12 data disks and both O/S physical d... [09:57:46] (03CR) 10JMeybohm: [C: 03+1] Add initial debianizazion (031 comment) [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [09:58:44] (03CR) 10JMeybohm: [C: 03+1] termbox: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/908955 (https://phabricator.wikimedia.org/T334064) (owner: 10Clément Goubert) [09:59:51] (03CR) 10JMeybohm: Add a simple Docker image to test AMD GPUs (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909196 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [09:59:58] (03CR) 10Marostegui: "jcrespo let me know if all went well! 
Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/909041 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [10:00:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P46972 and previous config saved to /var/cache/conftool/dbconfig/20230417-100003-root.json [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230417T1000) [10:05:28] 10SRE, 10SRE-Access-Requests, 10User-MarcoAurelio: Add MarcoAurelio to #mediawiki_security - https://phabricator.wikimedia.org/T333870 (10jbond) @MarcoAurelio can you confirm your irc nick, either here or you can email me at jbond@wikimedia.org [10:06:55] (03CR) 10Klausman: [C: 03+2] "The bare-metal vs. in-a-privileged pod thing is a bit quirky, but likely the best approach. I presume this daemonset still wouldn't be abl" [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [10:07:44] (03CR) 10Muehlenhoff: "Looks good, new nits inline." 
[debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [10:11:28] (03CR) 10Muehlenhoff: [C: 03+2] "This needs more work still, on irc1002 after pointing edit events to it, the following failure is logged to syslog:" [puppet] - 10https://gerrit.wikimedia.org/r/902077 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [10:11:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P46973 and previous config saved to /var/cache/conftool/dbconfig/20230417-101131-ladsgroup.json [10:11:59] (03PS1) 10Btullis: Place an-worker1132 back into the insetup role [puppet] - 10https://gerrit.wikimedia.org/r/909202 (https://phabricator.wikimedia.org/T333091) [10:12:05] (03PS7) 10Elukey: Add initial debianizazion [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) [10:12:16] (03CR) 10Jbond: "updated thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond) [10:12:29] (03CR) 10Elukey: Add initial debianizazion (034 comments) [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [10:13:00] (03PS1) 10Giuseppe Lavagetto: Allow setting values for jsonschema entities [software/conftool] - 10https://gerrit.wikimedia.org/r/909203 [10:13:02] (03PS1) 10Giuseppe Lavagetto: Re-vamp integration testing [software/conftool] - 10https://gerrit.wikimedia.org/r/909204 [10:13:04] (03PS1) 10Giuseppe Lavagetto: New backend interface [software/conftool] - 10https://gerrit.wikimedia.org/r/909205 [10:13:41] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40685/console" [puppet] - 10https://gerrit.wikimedia.org/r/909202 (https://phabricator.wikimedia.org/T333091) (owner: 10Btullis) [10:13:46] (03CR) 
10Klausman: [C: 03+2] Add initial debianizazion [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [10:15:42] (03CR) 10Jcrespo: [C: 03+2] dbprov1002,dbprov1003: Replace db1117 with db1217 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909041 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [10:19:23] (03PS2) 10Elukey: Add a simple Docker image to test AMD GPUs [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909196 (https://phabricator.wikimedia.org/T333009) [10:19:40] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40686/console" [puppet] - 10https://gerrit.wikimedia.org/r/909202 (https://phabricator.wikimedia.org/T333091) (owner: 10Btullis) [10:19:55] (03CR) 10Elukey: Add a simple Docker image to test AMD GPUs (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909196 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [10:22:49] PROBLEM - Check systemd state on ms-be2069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:24:46] (03CR) 10Btullis: [V: 03+1 C: 03+2] Place an-worker1132 back into the insetup role [puppet] - 10https://gerrit.wikimedia.org/r/909202 (https://phabricator.wikimedia.org/T333091) (owner: 10Btullis) [10:24:58] (03CR) 10Elukey: [C: 03+1] Place an-worker1132 back into the insetup role [puppet] - 10https://gerrit.wikimedia.org/r/909202 (https://phabricator.wikimedia.org/T333091) (owner: 10Btullis) [10:25:48] (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:26:38] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P46974 and previous config saved to /var/cache/conftool/dbconfig/20230417-102637-ladsgroup.json [10:27:58] (03CR) 10EoghanGaffney: [C: 03+2] [gitlab/ssh] Only recurse if the directory is to be removed [puppet] - 10https://gerrit.wikimedia.org/r/908927 (https://phabricator.wikimedia.org/T334736) (owner: 10EoghanGaffney) [10:29:22] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:29:27] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10Volans) [10:32:52] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1132.eqiad.wmnet with OS buster [10:33:02] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster [10:34:09] (03PS1) 10MVernon: Swift: start rclone job earlier [puppet] - 10https://gerrit.wikimedia.org/r/909207 [10:34:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:34:32] (03CR) 10CI reject: [V: 04-1] Swift: start rclone job earlier [puppet] - 10https://gerrit.wikimedia.org/r/909207 (owner: 10MVernon) [10:34:35] jouncebot: nowandnext [10:34:35] For the next 0 hour(s) and 25 minute(s): MediaWiki infrastucture (UTC 
mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230417T1000) [10:34:35] In 2 hour(s) and 25 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230417T1300) [10:34:53] (03CR) 10Ladsgroup: [C: 03+2] filebackend: Find thumbnails from all backends in FileBackendMultiWrite [core] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/908959 (https://phabricator.wikimedia.org/T331138) (owner: 10Ladsgroup) [10:35:26] (03PS2) 10MVernon: Swift: start rclone job earlier [puppet] - 10https://gerrit.wikimedia.org/r/909207 [10:36:01] (03PS1) 10Vgutierrez: cache::haproxy: Relax hardening when coredumps are enabled [puppet] - 10https://gerrit.wikimedia.org/r/909209 (https://phabricator.wikimedia.org/T334448) [10:37:30] RECOVERY - Host an-worker1132 is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [10:37:59] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks elukey." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909196 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [10:40:00] (03CR) 10Jcrespo: [C: 03+1] Swift: start rclone job earlier [puppet] - 10https://gerrit.wikimedia.org/r/909207 (owner: 10MVernon) [10:40:36] 10SRE, 10SRE-Access-Requests, 10User-MarcoAurelio: Add MarcoAurelio to #mediawiki_security - https://phabricator.wikimedia.org/T333870 (10MarcoAurelio) 05Stalled→03Open >>! In T333870#8784890, @jbond wrote: > @MarcoAurelio can you confirm your irc nick, either here or you can email me at jbond@wikimedia.... 
[10:40:46] (03CR) 10MVernon: [C: 03+2] Swift: start rclone job earlier [puppet] - 10https://gerrit.wikimedia.org/r/909207 (owner: 10MVernon) [10:41:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T333332)', diff saved to https://phabricator.wikimedia.org/P46976 and previous config saved to /var/cache/conftool/dbconfig/20230417-104144-ladsgroup.json [10:41:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1156.eqiad.wmnet with reason: Maintenance [10:41:50] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [10:42:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1156.eqiad.wmnet with reason: Maintenance [10:42:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:42:07] (03CR) 10Volans: "[disclaimer] I didn't do a pass, just left one comment" [software/conftool] - 10https://gerrit.wikimedia.org/r/909204 (owner: 10Giuseppe Lavagetto) [10:42:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:42:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T333332)', diff saved to https://phabricator.wikimedia.org/P46977 and previous config saved to /var/cache/conftool/dbconfig/20230417-104229-ladsgroup.json [10:42:53] !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-en-local-public.1a in eqiad [10:43:12] (03CR) 10Cathal Mooney: "LGTM! Covers exactly what we wanted as far as I can tell." 
[alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: 10Jbond) [10:43:16] (03CR) 10Cathal Mooney: [C: 03+1] dcops: add netdev duplex and speed checks [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: 10Jbond) [10:44:27] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40687/console" [puppet] - 10https://gerrit.wikimedia.org/r/909209 (https://phabricator.wikimedia.org/T334448) (owner: 10Vgutierrez) [10:44:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T333332)', diff saved to https://phabricator.wikimedia.org/P46978 and previous config saved to /var/cache/conftool/dbconfig/20230417-104449-ladsgroup.json [10:45:18] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10BTullis) [10:45:30] (03CR) 10Vgutierrez: cache::haproxy: Relax hardening when coredumps are enabled [puppet] - 10https://gerrit.wikimedia.org/r/909209 (https://phabricator.wikimedia.org/T334448) (owner: 10Vgutierrez) [10:45:35] (03PS1) 10Hnowlan: Minor formatting changes [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/909212 [10:45:37] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.remove-ghost-objects (exit_code=0) from container wikipedia-en-local-public.1a in eqiad [10:46:24] 10SRE, 10SRE-Access-Requests, 10User-MarcoAurelio: Add MarcoAurelio to #mediawiki_security - https://phabricator.wikimedia.org/T333870 (10jbond) 05Open→03Resolved >>! In T333870#8784984, @MarcoAurelio wrote: >>>! In T333870#8784890, @jbond wrote: >> @MarcoAurelio can you confirm your irc nick, either her... 
[10:46:50] !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-commons-local-public.98 in eqiad [10:49:27] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.remove-ghost-objects (exit_code=0) from container wikipedia-commons-local-public.98 in eqiad [10:50:40] !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-commons-local-public.98 in codfw [10:51:17] (03Merged) 10jenkins-bot: filebackend: Find thumbnails from all backends in FileBackendMultiWrite [core] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/908959 (https://phabricator.wikimedia.org/T331138) (owner: 10Ladsgroup) [10:51:28] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1132.eqiad.wmnet with reason: host reimage [10:51:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/908959 (https://phabricator.wikimedia.org/T331138) (owner: 10Ladsgroup) [10:51:49] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:908959|filebackend: Find thumbnails from all backends in FileBackendMultiWrite (T331138)]] [10:51:53] T331138: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 [10:52:52] (03CR) 10CI reject: [V: 04-1] Minor formatting changes [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/909212 (owner: 10Hnowlan) [10:53:06] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:908959|filebackend: Find thumbnails from all backends in FileBackendMultiWrite (T331138)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [10:53:17] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.remove-ghost-objects (exit_code=0) from container wikipedia-commons-local-public.98 in codfw [10:54:33] (03Abandoned) 10EoghanGaffney: Adds flag to start 
after unmask, starts logrotate [puppet] - 10https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869) (owner: 10EoghanGaffney) [10:54:37] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1132.eqiad.wmnet with reason: host reimage [10:59:06] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:908959|filebackend: Find thumbnails from all backends in FileBackendMultiWrite (T331138)]] (duration: 07m 16s) [10:59:10] T331138: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 [10:59:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P46979 and previous config saved to /var/cache/conftool/dbconfig/20230417-105955-ladsgroup.json [11:06:15] (03CR) 10Muehlenhoff: [C: 03+1] "It's worth a shot, if haproxy actually attempts to write the core file to /tmp and PrivateTmp is enabled, it would attempt to the username" [puppet] - 10https://gerrit.wikimedia.org/r/909209 (https://phabricator.wikimedia.org/T334448) (owner: 10Vgutierrez) [11:06:54] Not sure if T334829 is Wikistories only or RecentChanges (I tend to think it's Wikistories since it's limited to idwiki only) [11:06:55] T334829: Error: Call to a member function getComment() on null - https://phabricator.wikimedia.org/T334829 [11:07:53] (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:10:04] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1132.eqiad.wmnet with OS buster [11:10:15] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - 
https://phabricator.wikimedia.org/T333091 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster completed: - an-worker1132 (**WAR... [11:10:46] (03CR) 10Vgutierrez: cache::haproxy: Relax hardening when coredumps are enabled (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909209 (https://phabricator.wikimedia.org/T334448) (owner: 10Vgutierrez) [11:11:23] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [11:15:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P46980 and previous config saved to /var/cache/conftool/dbconfig/20230417-111501-ladsgroup.json [11:15:59] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [11:16:10] (03PS1) 10Marostegui: instances.yaml: Remove db1109 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/909223 (https://phabricator.wikimedia.org/T334820) [11:16:53] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1109 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/909223 (https://phabricator.wikimedia.org/T334820) (owner: 10Marostegui) [11:17:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1109 from dbctl T334820', diff saved to https://phabricator.wikimedia.org/P46981 and previous config saved to /var/cache/conftool/dbconfig/20230417-111724-marostegui.json [11:17:30] T334820: decommission db1109.eqiad.wmnet - https://phabricator.wikimedia.org/T334820 [11:22:13] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [11:23:43] !log kamila@deploy2002 conftool action : set/pooled=yes:weight=10; selector: 
service=thumbor,name=kubernetes201[0123].codfw.wmnet [11:27:03] (03PS1) 10Slyngshede: P:url_downloader move squid access to separate file. [puppet] - 10https://gerrit.wikimedia.org/r/909237 [11:28:49] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10BTullis) I recreated the logical drives with: `for i in $(seq 0 11); do sudo megacli -CfgLdAdd -r0 [32:$i] -a0; done` [11:30:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T333332)', diff saved to https://phabricator.wikimedia.org/P46983 and previous config saved to /var/cache/conftool/dbconfig/20230417-113008-ladsgroup.json [11:30:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:30:14] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [11:30:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:30:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T333332)', diff saved to https://phabricator.wikimedia.org/P46984 and previous config saved to /var/cache/conftool/dbconfig/20230417-113031-ladsgroup.json [11:30:43] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40688/console" [puppet] - 10https://gerrit.wikimedia.org/r/909237 (owner: 10Slyngshede) [11:30:47] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10BTullis) The block devices are now preset, but it seems that there has been a partition incorrectly created on the `/dev/sdb` device. ` btullis@an-worker1132:~$ lsblk NA... 
[11:30:59] !log btullis@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1132.eqiad.wmnet [11:31:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T333332)', diff saved to https://phabricator.wikimedia.org/P46985 and previous config saved to /var/cache/conftool/dbconfig/20230417-113152-ladsgroup.json [11:33:03] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1132.eqiad.wmnet [11:34:26] (03CR) 10Slyngshede: "Retention is a bit high for the buster urldownloader hosts, but we don't want to change the retention on syslog fleet wide. The easiest op" [puppet] - 10https://gerrit.wikimedia.org/r/909237 (owner: 10Slyngshede) [11:36:18] (03PS1) 10Func: CommonSettings-labs: Add use statement for MediaWikiServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908877 (https://phabricator.wikimedia.org/T333926) [11:37:13] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] CommonSettings-labs: Add use statement for MediaWikiServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908877 (https://phabricator.wikimedia.org/T333926) (owner: 10Func) [11:37:34] (03CR) 10MarcoAurelio: [C: 04-1] "Per T334344#8785096." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/908973 (https://phabricator.wikimedia.org/T334344) (owner: 10Superpes15) [11:46:01] (03Abandoned) 10David Caro: replica_cnf: skip robots.txt when listing paws users [puppet] - 10https://gerrit.wikimedia.org/r/909198 (https://phabricator.wikimedia.org/T334828) (owner: 10David Caro) [11:46:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P46986 and previous config saved to /var/cache/conftool/dbconfig/20230417-114658-ladsgroup.json [11:49:10] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM request for Product Analytics Airflow - https://phabricator.wikimedia.org/T334836 (10Stevemunene) [11:53:20] 10SRE, 10Commons, 10Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (10Ladsgroup) [11:54:10] 10SRE-swift-storage, 10MediaWiki-File-management, 10MW-1.41-notes (1.41.0-wmf.4; 2023-04-10), 10User-notice: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup This should be fixed now for new reuploads. Old... [11:54:59] 10SRE, 10Infrastructure-Foundations: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 (10MoritzMuehlenhoff) late-setup.sh has been modified to integrate the Puppet 5.5 forward port during the Debian installer and that makes the installation work! There's still the caveat t... [11:57:51] Lucas_WMDE: You around ? [11:57:58] o/ [11:58:07] but I should go make lunch soonish ^^ [11:58:08] I'd like to test migrating termbox again :) [11:58:10] Ah [11:58:15] it can wait then [11:58:33] I would have more time from 14:00 UTC for example [11:58:36] if that’s okay? 
[11:58:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119 T326669', diff saved to https://phabricator.wikimedia.org/P46987 and previous config saved to /var/cache/conftool/dbconfig/20230417-115847-marostegui.json [11:58:53] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [11:59:15] Lucas_WMDE: I don't know if I'll be there, I have to taxi my SO :) We'll see, if not we'll do it some other time this weel [11:59:16] week [11:59:21] ok [12:00:25] (03PS1) 10Marostegui: db1119: Move it to s1 [puppet] - 10https://gerrit.wikimedia.org/r/909240 (https://phabricator.wikimedia.org/T326669) [12:00:57] (03CR) 10Marostegui: [C: 03+2] db1119: Move it to s1 [puppet] - 10https://gerrit.wikimedia.org/r/909240 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [12:02:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P46989 and previous config saved to /var/cache/conftool/dbconfig/20230417-120204-ladsgroup.json [12:05:01] (03CR) 10Jbond: "thanks for the work on this looks really promising" [puppet] - 10https://gerrit.wikimedia.org/r/907940 (https://phabricator.wikimedia.org/T268344) (owner: 10Majavah) [12:06:07] (03CR) 10Jbond: [C: 03+2] "thanks" [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: 10Jbond) [12:13:15] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/909237 (owner: 10Slyngshede) [12:16:03] (03PS2) 10Slyngshede: P:url_downloader move squid access to separate file. [puppet] - 10https://gerrit.wikimedia.org/r/909237 [12:16:44] (03PS1) 10Marostegui: db1119: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/909243 [12:17:00] (03PS3) 10Slyngshede: P:url_downloader move squid access to separate file. 
[puppet] - 10https://gerrit.wikimedia.org/r/909237 [12:17:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T333332)', diff saved to https://phabricator.wikimedia.org/P46990 and previous config saved to /var/cache/conftool/dbconfig/20230417-121710-ladsgroup.json [12:17:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1182.eqiad.wmnet with reason: Maintenance [12:17:17] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [12:17:17] (03CR) 10Slyngshede: P:url_downloader move squid access to separate file. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909237 (owner: 10Slyngshede) [12:17:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1182.eqiad.wmnet with reason: Maintenance [12:17:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T333332)', diff saved to https://phabricator.wikimedia.org/P46991 and previous config saved to /var/cache/conftool/dbconfig/20230417-121734-ladsgroup.json [12:18:01] (03CR) 10Marostegui: [C: 03+2] db1119: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/909243 (owner: 10Marostegui) [12:19:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T333332)', diff saved to https://phabricator.wikimedia.org/P46992 and previous config saved to /var/cache/conftool/dbconfig/20230417-121953-ladsgroup.json [12:20:04] (03PS1) 10EoghanGaffney: Switch gitlab-replica and gitlab-replica-old hosts [puppet] - 10https://gerrit.wikimedia.org/r/909244 (https://phabricator.wikimedia.org/T334838) [12:20:54] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40689/console" [puppet] - 10https://gerrit.wikimedia.org/r/909237 (owner: 10Slyngshede) [12:21:24] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:url_downloader move squid access to separate 
file. [puppet] - 10https://gerrit.wikimedia.org/r/909237 (owner: 10Slyngshede) [12:26:48] PROBLEM - Disk space on urldownloader1001 is CRITICAL: DISK CRITICAL - free space: / 311 MB (3% inode=89%): /tmp 311 MB (3% inode=89%): /var/tmp 311 MB (3% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=urldownloader1001&var-datasource=eqiad+prometheus/ops [12:28:59] (03PS1) 10Majavah: openstack: encapi: fix role handling in new endpoints [puppet] - 10https://gerrit.wikimedia.org/r/909246 [12:32:04] (03PS1) 10Clément Goubert: P:lists:monitoring: Raise process count for uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/909247 [12:32:29] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/909247 (owner: 10Clément Goubert) [12:33:07] (03PS1) 10Btullis: Revert "Place an-worker1132 back into the insetup role" [puppet] - 10https://gerrit.wikimedia.org/r/908963 [12:33:38] (03PS1) 10EoghanGaffney: Move DNS names for gitlab-replica{,-old} [dns] - 10https://gerrit.wikimedia.org/r/909248 (https://phabricator.wikimedia.org/T334838) [12:34:32] (03CR) 10CI reject: [V: 04-1] Move DNS names for gitlab-replica{,-old} [dns] - 10https://gerrit.wikimedia.org/r/909248 (https://phabricator.wikimedia.org/T334838) (owner: 10EoghanGaffney) [12:35:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P46993 and previous config saved to /var/cache/conftool/dbconfig/20230417-123500-ladsgroup.json [12:35:42] (03CR) 10Jelto: Switch gitlab-replica and gitlab-replica-old hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909244 (https://phabricator.wikimedia.org/T334838) (owner: 10EoghanGaffney) [12:36:40] (03PS2) 10EoghanGaffney: Switch gitlab-replica and gitlab-replica-old hosts [puppet] - 10https://gerrit.wikimedia.org/r/909244 (https://phabricator.wikimedia.org/T334838) [12:36:43] 
(03CR) 10Btullis: [C: 03+2] Revert "Place an-worker1132 back into the insetup role" [puppet] - 10https://gerrit.wikimedia.org/r/908963 (owner: 10Btullis) [12:37:09] (03PS2) 10EoghanGaffney: Move DNS names for gitlab-replica{,-old} [dns] - 10https://gerrit.wikimedia.org/r/909248 (https://phabricator.wikimedia.org/T334838) [12:38:24] RECOVERY - Disk space on urldownloader1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=urldownloader1001&var-datasource=eqiad+prometheus/ops [12:40:57] Hey, just a heads up we need to deploy mobileapps because of this: https://phabricator.wikimedia.org/T334827 [12:41:43] (03PS2) 10Majavah: openstack: encapi: fix role handling in new endpoints [puppet] - 10https://gerrit.wikimedia.org/r/909246 [12:43:27] (03CR) 10EoghanGaffney: Switch gitlab-replica and gitlab-replica-old hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909244 (https://phabricator.wikimedia.org/T334838) (owner: 10EoghanGaffney) [12:44:10] (03PS1) 10Jgiannelos: mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/909249 [12:44:19] nemo-yiannis: ack [12:44:37] (03CR) 10EoghanGaffney: [C: 03+2] Cookbook for switchover of Gitlab to a new host [cookbooks] - 10https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [12:45:00] nemo-yiannis: Do you need us or do you self serve ? [12:45:09] No, thanks i can do it. 
[12:45:11] (03CR) 10EoghanGaffney: [C: 03+2] Cookbook for switchover of Gitlab to a new host (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [12:45:14] (03CR) 10EoghanGaffney: [V: 03+2 C: 03+2] Cookbook for switchover of Gitlab to a new host [cookbooks] - 10https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [12:45:37] All right :) [12:46:47] (03PS2) 10Jgiannelos: mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/909249 [12:47:54] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/909244 (https://phabricator.wikimedia.org/T334838) (owner: 10EoghanGaffney) [12:50:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P46994 and previous config saved to /var/cache/conftool/dbconfig/20230417-125006-ladsgroup.json [12:50:25] (03CR) 10Giuseppe Lavagetto: [C: 03+1] linkrecommendation: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905941 (https://phabricator.wikimedia.org/T334060) (owner: 10Clément Goubert) [12:50:53] (03PS1) 10Btullis: Revert "Decommission an-worker1132 from the Hadoop cluster" [puppet] - 10https://gerrit.wikimedia.org/r/908964 [12:51:03] (03CR) 10Clément Goubert: [C: 03+2] linkrecommendation: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905941 (https://phabricator.wikimedia.org/T334060) (owner: 10Clément Goubert) [12:52:52] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/909249 (owner: 10Jgiannelos) [12:55:46] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "In this case, you don't need the mesh networkpolicy but rather to add the IPs for the api to the generic one using values.yaml in the prod" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: 10Clément Goubert) [12:55:52] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service,lvm2-pvscan@8:18.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:36] (03Merged) 10jenkins-bot: linkrecommendation: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905941 (https://phabricator.wikimedia.org/T334060) (owner: 10Clément Goubert) [12:58:14] (03PS1) 10Slyngshede: P:url_downloader move syslog info [puppet] - 10https://gerrit.wikimedia.org/r/909250 (https://phabricator.wikimedia.org/T333676) [12:58:18] (03Merged) 10jenkins-bot: mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/909249 (owner: 10Jgiannelos) [12:58:20] (03CR) 10Btullis: [C: 03+2] Revert "Decommission an-worker1132 from the Hadoop cluster" [puppet] - 10https://gerrit.wikimedia.org/r/908964 (owner: 10Btullis) [12:59:19] !log Migrating linkrecommandation staging to mw-api-int - T334060 [12:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:24] T334060: Migrate linkrecommendations to mw-api-int - https://phabricator.wikimedia.org/T334060 [12:59:28] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [12:59:40] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [12:59:51] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40690/console" [puppet] - 10https://gerrit.wikimedia.org/r/909250 (https://phabricator.wikimedia.org/T333676) (owner: 10Slyngshede) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: Your horoscope predicts another unfortunate UTC afternoon backport window 
deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230417T1300). [13:00:05] Func and MichaelG_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:28] Hey there 👋 [13:00:39] o/ [13:00:53] o/ [13:01:04] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] beta wikidata: Enable new EntitySchema datatype (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908509 (https://phabricator.wikimedia.org/T332725) (owner: 10Michael Große) [13:01:12] I’m in a meeting and can’t deploy I’m afraid [13:01:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908877 (https://phabricator.wikimedia.org/T333926) (owner: 10Func) [13:01:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908509 (https://phabricator.wikimedia.org/T332725) (owner: 10Michael Große) [13:01:51] (03CR) 10Jelto: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/909248 (https://phabricator.wikimedia.org/T334838) (owner: 10EoghanGaffney) [13:02:04] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:url_downloader move syslog info [puppet] - 10https://gerrit.wikimedia.org/r/909250 (https://phabricator.wikimedia.org/T333676) (owner: 10Slyngshede) [13:02:38] (03Merged) 10jenkins-bot: CommonSettings-labs: Add use statement for MediaWikiServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908877 (https://phabricator.wikimedia.org/T333926) (owner: 10Func) [13:02:41] (03Merged) 10jenkins-bot: beta wikidata: Enable new EntitySchema datatype [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908509 (https://phabricator.wikimedia.org/T332725) (owner: 10Michael Große) [13:03:02] linkrecommandation staging looks good, 
migrating production [13:03:16] Func: MichaelG_WMDE: both of your patches will be automatically deployed to beta within the next half an hour or so [13:03:45] Ah, deployment [13:03:47] I'll wait [13:04:05] taavi: this no longer involves manual testing for beta? [13:04:41] MichaelG_WMDE: for patches which only touch -labs files, never has required [13:04:55] ah, gotcha, thanks! [13:05:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T333332)', diff saved to https://phabricator.wikimedia.org/P46995 and previous config saved to /var/cache/conftool/dbconfig/20230417-130512-ladsgroup.json [13:05:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1188.eqiad.wmnet with reason: Maintenance [13:05:18] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [13:05:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1188.eqiad.wmnet with reason: Maintenance [13:05:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T333332)', diff saved to https://phabricator.wikimedia.org/P46996 and previous config saved to /var/cache/conftool/dbconfig/20230417-130535-ladsgroup.json [13:06:37] * MichaelG_WMDE will try it out on beta on and off for the next ~40 minutes and will speak up if it is not what I expect by 13:45 UTC [13:07:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T333332)', diff saved to https://phabricator.wikimedia.org/P46997 and previous config saved to /var/cache/conftool/dbconfig/20230417-130751-ladsgroup.json [13:08:01] wait, is it live on beta already? 
o.O [13:08:02] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [13:08:32] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [13:09:04] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [13:10:08] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [13:10:51] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [13:12:16] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [13:13:18] (03Abandoned) 10Ssingh: cache::haproxy: enable systemd-coredump [puppet] - 10https://gerrit.wikimedia.org/r/908934 (https://phabricator.wikimedia.org/T334448) (owner: 10Ssingh) [13:15:05] (03CR) 10Ssingh: [C: 03+1] "PCC and idea looks good." [puppet] - 10https://gerrit.wikimedia.org/r/909209 (https://phabricator.wikimedia.org/T334448) (owner: 10Vgutierrez) [13:15:08] (03PS1) 10EoghanGaffney: Fixes incorrect arg name [cookbooks] - 10https://gerrit.wikimedia.org/r/909253 [13:15:35] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:15:59] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add initial debianizazion [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/909177 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [13:16:43] (03PS2) 10EoghanGaffney: Fixes incorrect arg name [cookbooks] - 10https://gerrit.wikimedia.org/r/909253 (https://phabricator.wikimedia.org/T330771) [13:16:59] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add a simple Docker image to test AMD GPUs [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909196 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [13:18:29] (03PS2) 10Ssingh: hiera: 
lvs/balancer: unify hiera post bullseye upgrade (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/908909 (https://phabricator.wikimedia.org/T321309) [13:19:03] (03CR) 10Jelto: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/909253 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [13:19:21] (03CR) 10EoghanGaffney: [C: 03+2] Fixes incorrect arg name [cookbooks] - 10https://gerrit.wikimedia.org/r/909253 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [13:20:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:21:30] (03Merged) 10jenkins-bot: Fixes incorrect arg name [cookbooks] - 10https://gerrit.wikimedia.org/r/909253 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [13:22:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P46998 and previous config saved to /var/cache/conftool/dbconfig/20230417-132258-ladsgroup.json [13:23:27] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1132.eqiad.wmnet [13:24:20] jouncebot: nowandnext [13:24:20] For the next 0 hour(s) and 35 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230417T1300) [13:24:20] In 2 hour(s) and 5 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230417T1530) [13:24:32] Anyone around who can deploy (I'm not on a machine with SSH keys atm) [13:25:36] PROBLEM - Host wdqs2011 is DOWN: PING CRITICAL - Packet loss = 100% [13:25:58] (03PS1) 10Reedy: Generate readable PHP diffs [extensions/TrustedXFF] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/908965 [13:26:20] (03PS1) 10Reedy: Add WikiMirror [extensions/TrustedXFF] (wmf/1.41.0-wmf.4) - 
10https://gerrit.wikimedia.org/r/909266 [13:26:26] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:26:28] RECOVERY - Host wdqs2011 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [13:26:50] ACKNOWLEDGEMENT - WDQS SPARQL on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brian_King hung server, just rebooted https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:26:50] ACKNOWLEDGEMENT - Query Service HTTP Port on wdqs2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time Brian_King hung server, just rebooted https://wikitech.wikimedia.org/wiki/Wikidata_query_service [13:26:50] ACKNOWLEDGEMENT - Check systemd state on wdqs2011 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. Brian_King hung server, just rebooted https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:27:22] RECOVERY - WDQS SPARQL on wdqs2011 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.075 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:27:58] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:30:36] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1132.eqiad.wmnet [13:32:51] (03PS1) 10Elukey: amd-gpu-tester: set tensorflow-rocm version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909256 (https://phabricator.wikimedia.org/T333009) [13:33:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs2011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:34:45] (03CR) 10AikoChou: [C: 03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909256 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [13:35:06] (03CR) 10Ssingh: "PCC is failing for this, I opened T334680, which are also some other PCC-related issues with the LVS hosts." [puppet] - 10https://gerrit.wikimedia.org/r/908909 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:37:05] !log mvernon@cumin2002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-commons-local-public.e4 in codfw [13:37:39] (03CR) 10MarcoAurelio: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909267 (owner: 10MarcoAurelio) [13:38:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P46999 and previous config saved to /var/cache/conftool/dbconfig/20230417-133804-ladsgroup.json [13:39:22] (03CR) 10Vgutierrez: [C: 03+2] cache::haproxy: Relax hardening when coredumps are enabled [puppet] - 10https://gerrit.wikimedia.org/r/909209 (https://phabricator.wikimedia.org/T334448) (owner: 10Vgutierrez) [13:39:42] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.remove-ghost-objects (exit_code=0) from container wikipedia-commons-local-public.e4 in codfw [13:41:44] (03CR) 10Jbond: Expose additional link information to Homer templates in wmf-netbox.py (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/884908 (https://phabricator.wikimedia.org/T328313) (owner: 10Cathal Mooney) [13:42:12] (03PS1) 10EoghanGaffney: Fix dry-run mode for gitlab failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/909257 (https://phabricator.wikimedia.org/T330771) [13:43:26] jouncebot: nowandnext [13:43:26] For the next 0 hour(s) and 
16 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230417T1300) [13:43:26] In 1 hour(s) and 46 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230417T1530) [13:43:29] (03PS2) 10EoghanGaffney: Fix dry-run mode for gitlab failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/909257 (https://phabricator.wikimedia.org/T330771) [13:44:30] (03PS2) 10Elukey: amd-gpu-tester: set tensorflow-rocm version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909256 (https://phabricator.wikimedia.org/T333009) [13:44:49] (03CR) 10Elukey: [V: 03+2 C: 03+2] amd-gpu-tester: set tensorflow-rocm version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909256 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [13:45:50] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/909257 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [13:46:04] (03CR) 10CI reject: [V: 04-1] Fix dry-run mode for gitlab failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/909257 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [13:46:09] (03CR) 10JHathaway: [C: 03+2] Add an in place Debian upgrade script [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [13:46:56] claime: if there's still time, there's an unblocking simple patch that I'd like to deploy [13:47:14] *request to be deployed [13:47:39] !log installing mariadb-10.3 security updates (Debian packaged version, not the wmf-mariadb packages) [13:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:19] (03PS3) 10EoghanGaffney: Fix dry-run mode for gitlab failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/909257 (https://phabricator.wikimedia.org/T330771) [13:49:54] (03PS4) 10EoghanGaffney: Fix 
dry-run mode for gitlab failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/909257 (https://phabricator.wikimedia.org/T330771) [13:50:11] FTR: my beta patch seems to work as intended, as far as I can tell [13:52:17] herzog: I was checking if the window was closed for a migration that I don't want interrupted by a mw-on-k8s deployment, sorry, can I suggest pinging one of the deployers in the calendar? [13:52:43] (03CR) 10Raymond Ndibe: [C: 03+2] tools-webservice: set default for buildservice-image [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/908369 (https://phabricator.wikimedia.org/T334586) (owner: 10Raymond Ndibe) [13:52:48] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:52:53] claime: sure, thanks [13:52:58] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:53:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T333332)', diff saved to https://phabricator.wikimedia.org/P47000 and previous config saved to /var/cache/conftool/dbconfig/20230417-135311-ladsgroup.json [13:53:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1197.eqiad.wmnet with reason: Maintenance [13:53:17] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [13:53:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1197.eqiad.wmnet with reason: Maintenance [13:53:30] (03Merged) 10jenkins-bot: tools-webservice: set default for buildservice-image [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/908369 (https://phabricator.wikimedia.org/T334586) (owner: 10Raymond Ndibe) 
[13:53:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T333332)', diff saved to https://phabricator.wikimedia.org/P47001 and previous config saved to /var/cache/conftool/dbconfig/20230417-135334-ladsgroup.json [13:53:44] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:53:48] (03CR) 10Jelto: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/909257 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [13:53:54] (03CR) 10EoghanGaffney: [C: 03+2] Fix dry-run mode for gitlab failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/909257 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [13:54:24] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.383 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:55:12] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.252 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:55:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T333332)', diff saved to https://phabricator.wikimedia.org/P47002 and previous config saved to /var/cache/conftool/dbconfig/20230417-135550-ladsgroup.json [13:56:08] (03Merged) 10jenkins-bot: Fix dry-run mode for gitlab failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/909257 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [13:56:29] (03CR) 10David Caro: [C: 03+2] openstack: encapi: fix role handling in new endpoints [puppet] - 10https://gerrit.wikimedia.org/r/909246 (owner: 10Majavah) [13:56:52] (03CR) 10SBassett: [C: 03+1] "Happy to try this out via a config deploy." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/909267 (owner: 10MarcoAurelio) [13:59:17] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40692/console" [puppet] - 10https://gerrit.wikimedia.org/r/874813 (https://phabricator.wikimedia.org/T317478) (owner: 10Majavah) [14:00:39] (03CR) 10David Caro: [V: 03+1 C: 03+2] openstack: encapi: open up write access [puppet] - 10https://gerrit.wikimedia.org/r/874813 (https://phabricator.wikimedia.org/T317478) (owner: 10Majavah) [14:01:31] (03CR) 10David Caro: [V: 03+1 C: 03+2] "interesting, the pcc was a failure but still veriewed as ok :/" [puppet] - 10https://gerrit.wikimedia.org/r/874813 (https://phabricator.wikimedia.org/T317478) (owner: 10Majavah) [14:02:26] (03PS1) 10EoghanGaffney: Change file timestamp to an int, from a string [cookbooks] - 10https://gerrit.wikimedia.org/r/909263 (https://phabricator.wikimedia.org/T330771) [14:02:59] (03CR) 10Jelto: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/909263 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [14:04:23] (03PS3) 10MarcoAurelio: Expose the 'sfsblock-bypass' right so it can be assigned to global groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909267 (https://phabricator.wikimedia.org/T334856) [14:05:09] (03CR) 10EoghanGaffney: [C: 03+2] Change file timestamp to an int, from a string [cookbooks] - 10https://gerrit.wikimedia.org/r/909263 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [14:06:12] BGP alerts in eqiad expected, lvs reimaging [14:06:27] jouncebot: nowandnext [14:06:27] No deployments scheduled for the next 1 hour(s) and 23 minute(s) [14:06:27] In 1 hour(s) and 23 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230417T1530) [14:06:44] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [14:07:06] !log Migrating 
linkrecommandation to mw-api-int - T334060 [14:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:11] T334060: Migrate linkrecommendations to mw-api-int - https://phabricator.wikimedia.org/T334060 [14:07:38] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1020.eqiad.wmnet with OS bullseye [14:08:01] (03PS1) 10Vgutierrez: cache::haproxy: use set-dumpable if coredumps are enabled [puppet] - 10https://gerrit.wikimedia.org/r/909287 (https://phabricator.wikimedia.org/T334448) [14:08:15] (03Merged) 10jenkins-bot: Change file timestamp to an int, from a string [cookbooks] - 10https://gerrit.wikimedia.org/r/909263 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [14:08:40] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [14:09:26] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [14:10:06] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [14:10:14] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40693/console" [puppet] - 10https://gerrit.wikimedia.org/r/909287 (https://phabricator.wikimedia.org/T334448) (owner: 10Vgutierrez) [14:10:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P47003 and previous config saved to /var/cache/conftool/dbconfig/20230417-141056-ladsgroup.json [14:11:13] (03PS1) 10Elukey: Fix changelog warnings [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909288 [14:11:53] (03PS1) 10EoghanGaffney: Split path to extract filename from ls output [cookbooks] - 10https://gerrit.wikimedia.org/r/909289 (https://phabricator.wikimedia.org/T330771) [14:12:08] !log Migrated linkrecommandation to mw-api-int - T334060 [14:12:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:13] T334060: Migrate linkrecommendations to mw-api-int - https://phabricator.wikimedia.org/T334060 [14:12:27] (03PS2) 10Vgutierrez: cache::haproxy: Add set-dumpable to haproxy global options [puppet] - 10https://gerrit.wikimedia.org/r/909287 (https://phabricator.wikimedia.org/T334448) [14:14:30] !log upload amd-k8s-device-plugin deb (1.25.2.3-1) to bullseye-wikimedia - T333009 [14:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:35] T333009: Review and test the AMD GPU kubernetes plugin - https://phabricator.wikimedia.org/T333009 [14:16:02] (03PS4) 10Urbanecm: Expose the 'sfsblock-bypass' right so it can be assigned to global groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909267 (https://phabricator.wikimedia.org/T334856) (owner: 10MarcoAurelio) [14:16:33] jouncebot: nowandnext [14:16:34] No deployments scheduled for the next 1 hour(s) and 13 minute(s) [14:16:34] In 1 hour(s) and 13 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230417T1530) [14:16:59] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909267 (https://phabricator.wikimedia.org/T334856) (owner: 10MarcoAurelio) [14:17:43] (03Merged) 10jenkins-bot: Expose the 'sfsblock-bypass' right so it can be assigned to global groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909267 (https://phabricator.wikimedia.org/T334856) (owner: 10MarcoAurelio) [14:17:47] (03CR) 10Jelto: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/909289 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [14:17:57] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:909267|Expose the 'sfsblock-bypass' right so it can be assigned to global groups (T334856)]] [14:18:02] T334856: Expose 'sfsblock-bypass' to global groups - 
https://phabricator.wikimedia.org/T334856 [14:18:10] (03CR) 10EoghanGaffney: [C: 03+2] Split path to extract filename from ls output [cookbooks] - 10https://gerrit.wikimedia.org/r/909289 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [14:19:08] !log urbanecm@deploy2002 urbanecm and maurelio: Backport for [[gerrit:909267|Expose the 'sfsblock-bypass' right so it can be assigned to global groups (T334856)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:19:12] (03CR) 10Ssingh: [C: 03+1] cache::haproxy: Add set-dumpable to haproxy global options [puppet] - 10https://gerrit.wikimedia.org/r/909287 (https://phabricator.wikimedia.org/T334448) (owner: 10Vgutierrez) [14:19:46] patch works, syncing. [14:19:47] (03CR) 10Vgutierrez: [C: 03+2] cache::haproxy: Add set-dumpable to haproxy global options [puppet] - 10https://gerrit.wikimedia.org/r/909287 (https://phabricator.wikimedia.org/T334448) (owner: 10Vgutierrez) [14:20:27] (03Merged) 10jenkins-bot: Split path to extract filename from ls output [cookbooks] - 10https://gerrit.wikimedia.org/r/909289 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney) [14:20:51] (03CR) 10Jforrester: "Argh, thanks, sorry was not around when this was deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908877 (https://phabricator.wikimedia.org/T333926) (owner: 10Func) [14:21:33] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1020.eqiad.wmnet with reason: host reimage [14:23:40] scap seems to be having some issues https://usercontent.irccloud-cdn.com/file/opnRca7u/image.png [14:24:24] yeah sigh [14:24:37] anything i should do with that sukhe? 
[14:24:39] let me find the ticket cdanis filed [14:24:39] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1020.eqiad.wmnet with reason: host reimage [14:24:46] urbanecm: known issue sadly [14:24:54] :-( [14:25:12] and I'm getting a 404 when reloading a wiki page because 'overflow' [14:25:17] https://phabricator.wikimedia.org/T334703 [14:25:34] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:909267|Expose the 'sfsblock-bypass' right so it can be assigned to global groups (T334856)]] (duration: 07m 36s) [14:25:35] urbanecm: happens because we are reimaging lvs1020 [14:25:38] wiki's very slow for me [14:25:38] T334856: Expose 'sfsblock-bypass' to global groups - https://phabricator.wikimedia.org/T334856 [14:25:42] PROBLEM - Host lvs1019 is DOWN: PING CRITICAL - Packet loss = 100% [14:25:43] same here [14:25:43] did i accidentally depool many appservers? [14:25:48] (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:25:52] er what happened to lvs1019? [14:26:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P47004 and previous config saved to /var/cache/conftool/dbconfig/20230417-142603-ladsgroup.json [14:26:16] PROBLEM - Host eventgate-analytics.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:26:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P47005 and previous config saved to /var/cache/conftool/dbconfig/20230417-142623-root.json [14:26:27] I'm getting Error: 502, Broken pipe at 2023-04-17 14:25:30 GMT now. 
[14:26:52] reproduced ^ [14:27:01] sigh [14:27:02] I can load some cached pages but not others and no pages I wouldn’t expect to be cached [14:27:06] RECOVERY - Host lvs1019 is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms [14:27:07] (ProbeDown) firing: (15) Service api-https:443 has failed probes (http_api-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:27:18] PROBLEM - Host sessionstore.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:27:24] I acked the page [14:27:25] so [14:27:26] PROBLEM - Host eventgate-analytics-external.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:27:34] * Emperor here [14:27:38] (ProbeDown) firing: (17) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:27:39] Hey [14:27:40] lvs1019 crashed? [14:27:45] yeah most likely [14:27:50] 1020 is being reimaged [14:27:51] depooling eqiad [14:27:56] thanks [14:27:58] PROBLEM - Host termbox.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:28:00] (in meetings) [14:28:09] * Emperor just left the meeting they were in [14:28:11] o/ [14:28:18] PROBLEM - Host proton.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:28:28] RECOVERY - Host termbox.svc.eqiad.wmnet is UP: PING WARNING - Packet loss = 90%, RTA = 33.04 ms [14:28:32] PROBLEM - Host wikifeeds.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:28:34] PROBLEM - Host kartotherian.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:28:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [14:28:46] PROBLEM - Host 
restbase.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:28:53] (03PS1) 10Ssingh: depool eqiad [dns] - 10https://gerrit.wikimedia.org/r/909292 [14:28:58] depooling eqiad from traffci won't cut it [14:29:07] we need to depool it from discovery [14:29:07] sukhe: should we restart pybal on 1019 [14:29:11] (03CR) 10BBlack: [C: 03+1] depool eqiad [dns] - 10https://gerrit.wikimedia.org/r/909292 (owner: 10Ssingh) [14:29:14] just in case it can help? [14:29:15] It's depooled all things again [14:29:20] Like https://phabricator.wikimedia.org/T334703 [14:29:20] PROBLEM - Host eventgate-logging-external.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:29:22] the host is up [14:29:24] (03CR) 10Ssingh: [V: 03+2 C: 03+2] depool eqiad [dns] - 10https://gerrit.wikimedia.org/r/909292 (owner: 10Ssingh) [14:29:34] PROBLEM - PyBal backends health check on lvs3007 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_443: Servers cp3064.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:29:46] it won't fix everything, but we should start minimizing damage now, and that's the easiest first step (traffic depool) [14:29:48] RECOVERY - Host kartotherian.svc.eqiad.wmnet is UP: PING WARNING - Packet loss = 90%, RTA = 33.08 ms [14:30:02] All api_appservers in eqiad are pooled=no for instance [14:30:06] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1498.eqiad.wmnet, mw1418.eqiad.wmnet, mw1353.eqiad.wmnet are marked down but pooled: eventgate-analytics-external_4692: Servers kubernetes1022.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1021.eqiad.wmnet, kubernetes1020.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wm [14:30:06] rnetes1017.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled: api-https_443: Servers mw1374.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal 
[14:30:09] !log running auth-dns update to depool eqiad [14:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:22] RECOVERY - Host restbase.svc.eqiad.wmnet is UP: PING WARNING - Packet loss = 90%, RTA = 33.10 ms [14:30:29] who is IC? [14:30:44] !log repooling api_appserver in eqiad [14:30:44] I'll take IC [14:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:50] getting a google doc going now [14:30:59] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=api_appserver [14:31:00] Emperor: updating status page [14:31:09] so the trigger was excessive applayer depool? [14:31:12] PROBLEM - Host search.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:31:15] that's what I'm getting from scrollback [14:31:24] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [14:31:26] RECOVERY - Host search.svc.eqiad.wmnet is UP: PING WARNING - Packet loss = 50%, RTA = 33.06 ms [14:31:28] !log repooling appserver in eqiad [14:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:33] https://docs.google.com/document/d/1Wx2KESoj8hRETWwH4_r2RPBWmGn0xIL4GyhoWiJcF_c/edit [14:31:40] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=appserver [14:31:42] RECOVERY - Host wikifeeds.svc.eqiad.wmnet is UP: PING WARNING - Packet loss = 75%, RTA = 33.12 ms [14:31:52] !log repooling parsoid in eqiad [14:31:53] ipvs screaming on /var/log/kern.log [14:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:57] Apr 17 14:24:44 lvs1019 kernel: [2406479.492138] IPVS: wrr: TCP 10.2.2.1:443 - no destination available [14:31:57] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=parsoid [14:31:58] RECOVERY - Host sessionstore.svc.eqiad.wmnet 
is UP: PING OK - Packet loss = 0%, RTA = 33.06 ms [14:32:06] RECOVERY - Host eventgate-analytics-external.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms [14:32:07] (ProbeDown) firing: (15) Service api-https:443 has failed probes (http_api-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:32:09] Apr 17 14:24:56 lvs1019 kernel: [2406492.090785] IPVS: wrr: TCP 10.2.2.22:443 - no destination available [14:32:22] ok so someone depooled all appservers? [14:32:28] apparently [14:32:29] joe: someone or something [14:32:31] yes [14:32:36] RECOVERY - Host eventgate-analytics.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 33.67 ms [14:32:38] (ProbeDown) firing: (21) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:32:39] All main appserver clusters repooled [14:32:47] don't we have a threshold? 
[14:32:50] RECOVERY - PyBal backends health check on lvs3007 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:33:00] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [14:33:00] RECOVERY - Host proton.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 33.12 ms [14:33:09] Amir1: I think it's because this is scap + lvs depool [14:33:12] (03CR) 10Jbond: "joe just noticed this also need to be added to spicerack" [puppet] - 10https://gerrit.wikimedia.org/r/907814 (https://phabricator.wikimedia.org/T334456) (owner: 10Clément Goubert) [14:33:13] I think the problem is that we've stated multiple times that we can't do lvs maintenance while there is a mediawiki deploy and vice versa [14:33:22] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:33:25] there's known issues, we can't do a deploy while an LVS is being reimaged, or the automation breaks and doesn't repool things [14:33:28] and yet [14:33:30] joe: yes, this is https://phabricator.wikimedia.org/T334703 [14:33:35] again [14:33:39] this is the second time in a week? [14:33:43] yes [14:33:52] claime: is api ok too? [14:33:55] and jobrunner? [14:33:56] the first one was a bit different but my bad I guess for noticing that this is happening [14:34:16] (03PS1) 10Ssingh: hiera: lvs1020: update iface names for bullseye (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/909294 (https://phabricator.wikimedia.org/T321309) [14:34:34] thanks [14:34:47] yeah joe [14:34:48] Emperor: should we change to identified/monitoring or not yet? [14:34:58] ok so, let's pause scap deployments until sukhe is done? 
[14:35:00] RECOVERY - Host eventgate-logging-external.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 33.08 ms [14:35:00] Need to check kubernetes probably [14:35:12] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [14:35:15] joe: do we have a mechanism to enforce that? [14:35:22] yes [14:35:23] jynus: yes, I think so [14:35:29] you can create a scap lock [14:35:33] (JobUnavailable) firing: (2) Reduced availability for job swagger_check_eventgate_logging_external_cluster_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:35:33] jouncebot: nowandnext [14:35:33] No deployments scheduled for the next 0 hour(s) and 54 minute(s) [14:35:33] In 0 hour(s) and 54 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230417T1530) [14:35:35] joe: can you do that, please? 
[14:35:54] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-01-24 19:33:10 +0000 (expires in 1743 days) https://wikitech.wikimedia.org/wiki/Logs [14:36:05] lvs1020 is going to be back up soon fwiw [14:37:07] (ProbeDown) resolved: (15) Service api-https:443 has failed probes (http_api-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:37:16] (MediaWikiLatencyExceeded) firing: (4) Average latency high: eqiad appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExc [14:37:16] (MediaWikiLatencyExceeded) firing: Average latency high: ... 
[14:37:16] eqiad api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:37:32] throughput is back to normal, but metrics show high latency and low edit rate [14:37:38] (ProbeDown) firing: (21) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:38:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [14:38:46] PROBLEM - eventgate-logging-external LVS eqiad on eventgate-logging-external.svc.eqiad.wmnet is CRITICAL: /robots.txt (robots.txt check) timed out before a response was received https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate [14:38:48] I will create a scap lock [14:39:07] bblack: did you depool eqiad from discovery? 
[14:39:48] Amir1: ack, thanks [14:40:05] !log ladsgroup@deploy2002 Locking from deployment [ALL REPOSITORIES]: LVS Maint - Outage (T334703) [14:40:12] T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 [14:40:18] RECOVERY - eventgate-logging-external LVS eqiad on eventgate-logging-external.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate [14:40:39] I'm doing scap lock --all "LVS Maint - Outage (T334703)" [14:40:42] thanks [14:41:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T333332)', diff saved to https://phabricator.wikimedia.org/P47006 and previous config saved to /var/cache/conftool/dbconfig/20230417-144109-ladsgroup.json [14:41:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1222.eqiad.wmnet with reason: Maintenance [14:41:12] Emperor: no, sukhe just depooled the user-facing front edge traffic away from eqiad in general [14:41:16] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [14:41:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1222.eqiad.wmnet with reason: Maintenance [14:41:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P47007 and previous config saved to /var/cache/conftool/dbconfig/20230417-144128-root.json [14:41:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1222 (T333332)', diff saved to https://phabricator.wikimedia.org/P47008 and previous config saved to /var/cache/conftool/dbconfig/20230417-144133-ladsgroup.json [14:41:57] bblack: thanks for clarifying [14:42:16] (MediaWikiLatencyExceeded) resolved: (4) Average latency high: eqiad appserver GET/200 - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyE [14:42:38] (ProbeDown) firing: (17) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:42:45] I am counting around 11 minutes of disruption [14:43:05] in terms of missing traffic [14:43:42] let me know when I can unlock it [14:43:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T333332)', diff saved to https://phabricator.wikimedia.org/P47009 and previous config saved to /var/cache/conftool/dbconfig/20230417-144356-ladsgroup.json [14:44:01] Amir1: I will, waiting on a review for https://gerrit.wikimedia.org/r/c/operations/puppet/+/909294 and then lvs reimaging should finish [14:44:35] In "least critical bug ever" news, the popup help ("?") UI is broken on www.wikimediastatus.net. Should I file a phab? Or to what wiki page should I post about it? 
[14:44:47] rest-gateway is still depooled joe [14:44:57] sorry no [14:45:01] my bad [14:45:21] No I'm right, it's depooled in both codfw and eqiad [14:45:32] (03PS1) 10Slyngshede: C:squid Squid logs are managed by rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/909295 (https://phabricator.wikimedia.org/T333676) [14:45:33] (JobUnavailable) firing: (2) Reduced availability for job swagger_check_eventgate_logging_external_cluster_eqiad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:45:55] claime: That's okay, it's still in lvs_setup [14:46:00] hnowlan: ack [14:46:12] dmacks: ongoing incident, sorry [14:46:18] I think everything's repooled [14:46:35] (03PS1) 10Elukey: amd-gpu-tester: fix script permissions [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909296 [14:46:38] np, will wait until things settle [14:46:56] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40694/console" [puppet] - 10https://gerrit.wikimedia.org/r/909295 (https://phabricator.wikimedia.org/T333676) (owner: 10Slyngshede) [14:47:38] PROBLEM - PHP7 rendering on mw1375 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:47:58] (03CR) 10JMeybohm: [C: 03+1] Fix changelog warnings [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909288 (owner: 10Elukey) [14:48:47] our headline metrics look to be back to about-normal now...? 
[14:48:56] (03CR) 10Elukey: [V: 03+2 C: 03+2] Fix changelog warnings [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909288 (owner: 10Elukey) [14:48:58] (03CR) 10Ssingh: [C: 03+2] hiera: lvs1020: update iface names for bullseye (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/909294 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:49:34] I'm so happy I was oncall last week instead of this week [14:49:45] haha [14:49:54] (03CR) 10Elukey: [V: 03+2 C: 03+2] amd-gpu-tester: fix script permissions [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909296 (owner: 10Elukey) [14:50:40] jouncebot: next [14:50:40] In 0 hour(s) and 39 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230417T1530) [14:51:47] jynus: I think we can mark the incident itself resolved? We need to clear up still, but I think we are serving traffic normally again. [14:52:03] Emperor: thanks, doing [14:52:15] sukhe: are you done with LVS work now? [i.e. can we release the scap lock?] [14:52:19] you do it on the doc, or I do? [14:52:23] Emperor: finishing the reimaging [14:52:26] I'll do it. [14:52:30] sukhe: OK, LMK when? 
[14:52:30] I will do statuspage [14:52:31] so please wait, thank you for checking [14:52:33] yes please [14:53:45] !log ladsgroup@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: LVS Maint - Outage (T334703) (duration: 13m 39s) [14:53:51] I released it [14:53:51] T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 [14:54:07] Amir1: err, I think sukhe said to wait until they were done [14:54:17] Amir1: lvs1020 is still being reimaged fwiw [14:54:24] it's at the Puppet run stage, so another 20 mins [14:54:25] Amir1: put it back please [14:54:25] or so [14:54:27] (03CR) 10Jelto: [C: 03+2] install_server: configure root raid only on gitlab-raid1 [puppet] - 10https://gerrit.wikimedia.org/r/908832 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [14:54:30] (03PS1) 10Cwhite: sre: reword description to remove double-negative [alerts] - 10https://gerrit.wikimedia.org/r/908879 [14:54:31] we also need to restart php-fpm for MW (or revert the change I tried to make); it didn't take effect because of the issue. [14:54:35] Amir1: yeah, I think that's a "please put the lock back" [14:54:39] +1 [14:54:42] RECOVERY - PHP7 rendering on mw1375 is OK: HTTP OK: HTTP/1.1 302 Found - 517 bytes in 3.106 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:54:43] !log depooling mw1375.eqiad.wmnet [14:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:50] of course [14:54:55] it recovers when I depool it [14:54:58] * claime shakes fist [14:55:44] !log repooled mw1375.eqiad.wmnet [14:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:06] this is just for documenting- I am not sure if cached requests to eqiad failed or not, someone can clarify? 
[14:56:10] (03CR) 10CI reject: [V: 04-1] sre: reword description to remove double-negative [alerts] - 10https://gerrit.wikimedia.org/r/908879 (owner: 10Cwhite) [14:56:28] (03PS2) 10Slyngshede: C:squid Squid logs are managed by rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/909295 (https://phabricator.wikimedia.org/T333676) [14:56:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P47010 and previous config saved to /var/cache/conftool/dbconfig/20230417-145633-root.json [14:56:39] we had a spike of 502s and 503s [14:56:49] can someone else put in the scap lock please? [14:57:12] !log urbanecm@deploy2002 Locking from deployment [ALL REPOSITORIES]: LVS Maint - Outage [14:57:14] !log urbanecm@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: LVS Maint - Outage (duration: 00m 01s) [14:57:15] !log urbanecm@deploy2002 Locking from deployment [ALL REPOSITORIES]: LVS Maint - Outage [14:57:16] (MediaWikiLatencyExceeded) resolved: Average latency high: ... 
[14:57:16] eqiad api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:57:20] sukhe: done [14:57:25] thank you [14:57:35] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40695/console" [puppet] - 10https://gerrit.wikimedia.org/r/909295 (https://phabricator.wikimedia.org/T333676) (owner: 10Slyngshede) [14:58:07] I will add that to the doc for a followup checklist [14:58:42] (03PS2) 10Cwhite: sre: reword description to remove double-negative [alerts] - 10https://gerrit.wikimedia.org/r/908879 [14:59:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P47011 and previous config saved to /var/cache/conftool/dbconfig/20230417-145902-ladsgroup.json [14:59:11] (03PS3) 10Slyngshede: C:squid Squid logs are managed by rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/909295 (https://phabricator.wikimedia.org/T333676) [14:59:22] sukhe: now [14:59:43] already done [15:00:00] still waiting for the reimaging to finish, of course it's moving a lot slower now than it usually is :P [15:00:10] Amir1: <3 [15:00:11] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40696/console" [puppet] - 10https://gerrit.wikimedia.org/r/909295 (https://phabricator.wikimedia.org/T333676) (owner: 10Slyngshede) [15:00:25] annoyingly, searching for "scap lock" on wikitech returns nothing useful [15:00:38] Emperor: scap lock --help is much more helpful, fortunately. 
[15:00:40] (from deploy2002) [15:00:48] (03PS4) 10Slyngshede: C:squid Squid logs are managed by rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/909295 (https://phabricator.wikimedia.org/T333676) [15:01:02] Emperor: I got it from the dc switchover checklist: scap lock --all "LVS Maint - Outage (T334703), take two" [15:01:03] T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 [15:01:35] I'm inclined to suggest that https://wikitech.wikimedia.org/wiki/Scap should have it, though [15:01:47] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40697/console" [puppet] - 10https://gerrit.wikimedia.org/r/909295 (https://phabricator.wikimedia.org/T333676) (owner: 10Slyngshede) [15:02:16] (MediaWikiLatencyExceeded) firing: Average latency high: ... [15:02:16] eqiad api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:02:52] ? 
[15:03:09] (03PS5) 10Slyngshede: C:squid Squid logs are managed by rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/909295 (https://phabricator.wikimedia.org/T333676) [15:03:26] it's just a very delayed alert [15:03:32] not sure why it's firing again [15:03:49] latency is going down actually for api [15:03:59] yeah [15:04:02] joe my guess is it needs to adjust thresholds [15:04:08] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40698/console" [puppet] - 10https://gerrit.wikimedia.org/r/909295 (https://phabricator.wikimedia.org/T333676) (owner: 10Slyngshede) [15:04:26] should I add it as followup, joe, claime? [15:04:30] (03CR) 10Slyngshede: C:squid Squid logs are managed by rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/909295 (https://phabricator.wikimedia.org/T333676) (owner: 10Slyngshede) [15:04:34] "check the alert"? [15:04:34] yes please [15:07:19] !log rolling restart of HAProxy in the text cluster - T334448 [15:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:26] T334448: HAProxy 2.6.12 segfaults - https://phabricator.wikimedia.org/T334448 [15:07:31] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host lvs1020.eqiad.wmnet with OS bullseye [15:07:45] fail ^_^ [15:07:46] sukhe: problems with LVS still? 
[15:08:06] the failure is: [15:08:08] > First Puppet run failed, asking the operator what to do [15:08:13] because we had to update the interface names [15:08:39] (not urgent) I added a question to the incident draft, if there are Traffic people not busy right now, please help me respond to it [15:08:41] everything else looks good, usually we reimage it twice after we update the iface names (since the names change in bullseye) [15:08:57] (03PS1) 10Volans: service: add httpbb_dir field [software/spicerack] - 10https://gerrit.wikimedia.org/r/909299 [15:09:04] we can do that later, for now, just checking if everything is fine and we should be ready to go ahead [15:09:40] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs1020 [15:09:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] service: add httpbb_dir field [software/spicerack] - 10https://gerrit.wikimedia.org/r/909299 (owner: 10Volans) [15:09:48] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host lvs1020 [15:09:55] NOOP on switch reconfigure, that's good [15:11:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P47012 and previous config saved to /var/cache/conftool/dbconfig/20230417-151138-root.json [15:11:51] (03PS1) 10Ssingh: Revert "hiera: lvs1020: update iface names for bullseye (eqiad)" [puppet] - 10https://gerrit.wikimedia.org/r/909269 [15:11:57] er wrong [15:12:05] (03Abandoned) 10Ssingh: Revert "hiera: lvs1020: update iface names for bullseye (eqiad)" [puppet] - 10https://gerrit.wikimedia.org/r/909269 (owner: 10Ssingh) [15:12:21] (03PS1) 10Ssingh: Revert "depool eqiad" [dns] - 10https://gerrit.wikimedia.org/r/909270 [15:13:10] I am going to repool DNS [15:13:24] is that fine or I need to wait on someone? 
[15:13:28] then we can release the lock [15:13:29] 🍿 [15:14:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P47013 and previous config saved to /var/cache/conftool/dbconfig/20230417-151409-ladsgroup.json [15:14:22] (03PS1) 10Jbond: logrotate::rule: allow post_rotate to accept multiple lines [puppet] - 10https://gerrit.wikimedia.org/r/909301 (https://phabricator.wikimedia.org/T333676) [15:14:41] sukhe: lgtm for appservers/parsoid/api [15:15:04] ok [15:15:10] going ahead [15:15:14] Amir1: make sure it's with extra butter! [15:16:32] (03CR) 10Volans: [C: 03+2] service: add httpbb_dir field [software/spicerack] - 10https://gerrit.wikimedia.org/r/909299 (owner: 10Volans) [15:16:55] (03CR) 10CI reject: [V: 04-1] logrotate::rule: allow post_rotate to accept multiple lines [puppet] - 10https://gerrit.wikimedia.org/r/909301 (https://phabricator.wikimedia.org/T333676) (owner: 10Jbond) [15:17:44] :D [15:17:47] sukhe: is that done? [15:17:56] Emperor: was just checking the interfaces [15:17:59] now doing it and will log here :) [15:18:09] sukhe: cool, no worries, LMK [15:18:18] (03CR) 10Ssingh: [C: 03+2] Revert "depool eqiad" [dns] - 10https://gerrit.wikimedia.org/r/909270 (owner: 10Ssingh) [15:18:40] !log run authdns-update and repool eqiad [15:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:42] all done [15:19:45] you can release the lock [15:20:14] ack Amir1 you OK to release that scap lock now, please? Then I can go and get some tea :) [15:20:19] !log urbanecm@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: LVS Maint - Outage (duration: 23m 03s) [15:20:26] Martin is holding it [15:20:27] I'm not Amir, but I put the lock :). Released. [15:20:32] thanks urbanecm! [15:20:43] sukhe: okay to do a test deployment (to ensure my change propagated where it should be)? 
[15:20:53] (03Merged) 10jenkins-bot: service: add httpbb_dir field [software/spicerack] - 10https://gerrit.wikimedia.org/r/909299 (owner: 10Volans) [15:20:56] urbanecm: let's do it :) [15:20:57] urbanecm: TY [15:21:18] !log urbanecm@deploy2002 Started scap: Expose the sfsblock-bypass right so it can be assigned to global groups (T334856; second try) [15:21:21] doing [15:21:23] T334856: Expose 'sfsblock-bypass' to global groups - https://phabricator.wikimedia.org/T334856 [15:22:16] (MediaWikiLatencyExceeded) resolved: Average latency high: ... [15:22:16] eqiad api_appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:22:16] * sukhe holds breath [15:23:20] looks ok for now [15:23:25] claime: don't... [15:23:29] :P [15:23:36] >_> [15:23:39] it didn't reach the restart stage yet :D [15:24:23] urbanecm: I'm not worried about the deployment if the lvs is back tbh [15:24:26] * Emperor going to stop ignoring the RSI break the computer has been yelling at them about for the last 35 minutes [15:24:30] I was more worried about the repool [15:24:47] :) [15:26:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P47014 and previous config saved to /var/cache/conftool/dbconfig/20230417-152644-root.json [15:27:40] !log urbanecm@deploy2002 Finished scap: Expose the sfsblock-bypass right so it can be assigned to global groups (T334856; second try) (duration: 06m 22s) [15:27:45] T334856: Expose 'sfsblock-bypass' to global groups - https://phabricator.wikimedia.org/T334856 [15:27:56] urbanecm: all done? 
[15:27:57] change successfully deployed [15:27:59] yup [15:28:00] phew [15:28:05] thanks and sorry for the incident today [15:28:10] (03PS2) 10Jbond: logrotate::rule: allow post_rotate to accept multiple lines [puppet] - 10https://gerrit.wikimedia.org/r/909301 (https://phabricator.wikimedia.org/T333676) [15:28:13] thanks for the quick fix :) [15:28:14] I definitely didn't notice your scap [15:28:27] that's on me so sorry [15:29:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T333332)', diff saved to https://phabricator.wikimedia.org/P47015 and previous config saved to /var/cache/conftool/dbconfig/20230417-152916-ladsgroup.json [15:29:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1225.eqiad.wmnet with reason: Maintenance [15:29:22] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [15:29:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1225.eqiad.wmnet with reason: Maintenance [15:29:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [15:30:04] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230417T1530). 
[15:30:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [15:30:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance [15:30:26] (03PS3) 10Jbond: logrotate::rule: allow post_rotate to accept multiple lines [puppet] - 10https://gerrit.wikimedia.org/r/909301 (https://phabricator.wikimedia.org/T333676) [15:30:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance [15:31:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2104.codfw.wmnet with reason: Maintenance [15:31:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2104.codfw.wmnet with reason: Maintenance [15:31:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T333332)', diff saved to https://phabricator.wikimedia.org/P47016 and previous config saved to /var/cache/conftool/dbconfig/20230417-153134-ladsgroup.json [15:34:03] (03PS1) 10Ahmon Dancy: Enable /srv/mediawiki symlink on prod deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/909302 (https://phabricator.wikimedia.org/T329857) [15:34:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T333332)', diff saved to https://phabricator.wikimedia.org/P47017 and previous config saved to /var/cache/conftool/dbconfig/20230417-153412-ladsgroup.json [15:34:26] (03CR) 10CI reject: [V: 04-1] Enable /srv/mediawiki symlink on prod deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/909302 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy) [15:35:16] (03PS4) 10Jbond: logrotate::rule: allow post_rotate to accept multiple lines [puppet] - 10https://gerrit.wikimedia.org/r/909301 (https://phabricator.wikimedia.org/T333676) [15:35:18] (03PS6) 10Jbond: C:squid 
Squid logs are managed by rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/909295 (https://phabricator.wikimedia.org/T333676) (owner: 10Slyngshede) [15:35:24] (03PS2) 10Ahmon Dancy: Enable /srv/mediawiki symlink on prod deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/909302 (https://phabricator.wikimedia.org/T329857) [15:36:08] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909303 (https://phabricator.wikimedia.org/T128546) [15:38:17] jan_drewniak: It doesn't need rebase anymore? https://gerrit.wikimedia.org/r/c/wikimedia/portals/deploy/+/909190 NICE [15:38:49] Amir1: yeah! whatever you did there works 😅 [15:39:29] I couldn't believe what I did worked, it was so messy [15:39:55] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909303 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:40:43] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/909303 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:41:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 35): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40702/console" [puppet] - 10https://gerrit.wikimedia.org/r/909301 (https://phabricator.wikimedia.org/T333676) (owner: 10Jbond) [15:41:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P47018 and previous config saved to /var/cache/conftool/dbconfig/20230417-154149-root.json [15:42:47] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:44:53] (03CR) 10Jbond: [V: 03+1 C: 03+2] logrotate::rule: allow post_rotate to accept multiple lines [puppet] - 10https://gerrit.wikimedia.org/r/909301 (https://phabricator.wikimedia.org/T333676) (owner: 10Jbond) [15:45:19] (03CR) 10Jbond: [C: 03+1] C:squid Squid logs are managed by rsyslog 
[puppet] - 10https://gerrit.wikimedia.org/r/909295 (https://phabricator.wikimedia.org/T333676) (owner: 10Slyngshede) [15:45:38] (03CR) 10Slyngshede: [C: 03+2] C:squid Squid logs are managed by rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/909295 (https://phabricator.wikimedia.org/T333676) (owner: 10Slyngshede) [15:47:52] (03PS1) 10Elukey: amd-gpu-tester: add ROCm suite packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909304 (https://phabricator.wikimedia.org/T333009) [15:48:18] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:909303| Bumping portals to master (T128546)]] (duration: 05m 59s) [15:48:23] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:49:15] (03PS2) 10Elukey: amd-gpu-tester: add ROCm suite packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909304 (https://phabricator.wikimedia.org/T333009) [15:49:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P47019 and previous config saved to /var/cache/conftool/dbconfig/20230417-154918-ladsgroup.json [15:49:40] (03CR) 10Elukey: [V: 03+2 C: 03+2] amd-gpu-tester: add ROCm suite packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909304 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [15:50:00] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add newly racked backup2010 hosts in codfw - jhancock@cumin2002" [15:50:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add newly racked backup2010 hosts in codfw - jhancock@cumin2002" [15:50:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [15:53:50] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:909303| Bumping portals to master (T128546)]] (duration: 05m 30s) [15:53:54] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:56:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P47020 and previous config saved to /var/cache/conftool/dbconfig/20230417-155654-root.json [15:59:09] I scratched my own itch by documenting scap lock on wikitech :) [16:02:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host backup2010.mgmt.codfw.wmnet with reboot policy FORCED [16:04:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P47021 and previous config saved to /var/cache/conftool/dbconfig/20230417-160425-ladsgroup.json [16:06:04] (03PS2) 10SBassett: Disable DoubleWiki extension everywhere, at least for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902388 (https://phabricator.wikimedia.org/T332850) (owner: 10Jforrester) [16:08:51] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/908882 [16:14:02] (03PS1) 10Volans: CHANGELOG: add changelogs for release v6.4.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/909308 [16:19:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T333332)', diff saved to https://phabricator.wikimedia.org/P47022 and previous config saved to /var/cache/conftool/dbconfig/20230417-161931-ladsgroup.json [16:19:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2125.codfw.wmnet with reason: Maintenance [16:19:37] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [16:19:49] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2125.codfw.wmnet with reason: Maintenance [16:19:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T333332)', diff saved to https://phabricator.wikimedia.org/P47023 and previous config saved to /var/cache/conftool/dbconfig/20230417-161955-ladsgroup.json [16:22:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T333332)', diff saved to https://phabricator.wikimedia.org/P47024 and previous config saved to /var/cache/conftool/dbconfig/20230417-162238-ladsgroup.json [16:23:21] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v6.4.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/909308 (owner: 10Volans) [16:27:55] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v6.4.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/909308 (owner: 10Volans) [16:31:53] (03PS1) 10Volans: Upstream release v6.4.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/909310 [16:36:08] (03PS1) 10Elukey: amd-gpu-tester: reduce the ROCm packages installed [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909313 (https://phabricator.wikimedia.org/T333009) [16:37:08] (03CR) 10Volans: [C: 03+2] Upstream release v6.4.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/909310 (owner: 10Volans) [16:37:26] (03PS2) 10Elukey: amd-gpu-tester: reduce the ROCm packages installed [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/909313 (https://phabricator.wikimedia.org/T333009) [16:37:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P47025 and previous config saved to /var/cache/conftool/dbconfig/20230417-163744-ladsgroup.json [16:37:53] (03CR) 10Elukey: [V: 03+2 C: 03+2] amd-gpu-tester: reduce the ROCm packages installed [docker-images/production-images] - 
10https://gerrit.wikimedia.org/r/909313 (https://phabricator.wikimedia.org/T333009) (owner: 10Elukey) [16:41:13] (03Merged) 10jenkins-bot: Upstream release v6.4.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/909310 (owner: 10Volans) [16:44:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup2010.mgmt.codfw.wmnet with reboot policy FORCED [16:46:10] !log uploaded spicerack_6.4.2 to apt.wikimedia.org bullseye-wikimedia [16:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:26] !log installed spicerack_6.4.2 on cumin2002 [16:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P47026 and previous config saved to /var/cache/conftool/dbconfig/20230417-165251-ladsgroup.json [16:56:29] !log restarted oozie page view-druid-hourly job 0174449-220913162928808-oozie-oozi-C [16:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:42] !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@f8dad05]: analytics: deploy Airflow ArchiveOperator should have a number of retries of 0. T332216 [16:59:47] T332216: Airflow ArchiveOperator should have a number of retries of 0 - https://phabricator.wikimedia.org/T332216 [16:59:54] !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@f8dad05]: analytics: deploy Airflow ArchiveOperator should have a number of retries of 0. T332216 (duration: 00m 12s) [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230417T1700) [17:00:05] ryankemper: My dear minions, it's time we take the moon! Just kidding. Time for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230417T1700). 
[17:01:33] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2010'] [17:03:28] !log installed spicerack_6.4.2 on cumin1001 [17:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['backup2010'] [17:07:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T333332)', diff saved to https://phabricator.wikimedia.org/P47027 and previous config saved to /var/cache/conftool/dbconfig/20230417-170757-ladsgroup.json [17:08:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2126.codfw.wmnet with reason: Maintenance [17:08:03] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [17:08:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2126.codfw.wmnet with reason: Maintenance [17:08:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [17:08:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [17:08:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T333332)', diff saved to https://phabricator.wikimedia.org/P47028 and previous config saved to /var/cache/conftool/dbconfig/20230417-170838-ladsgroup.json [17:09:28] jouncebot: next [17:09:28] In 2 hour(s) and 50 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230417T2000) [17:11:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T333332)', diff saved to https://phabricator.wikimedia.org/P47029 and previous config saved to /var/cache/conftool/dbconfig/20230417-171117-ladsgroup.json [17:13:30] 
!log restarted Oozie page view-druid-daily job 0174450-220913162928808-oozie-oozi-C [17:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:52] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2010'] [17:14:54] !log jhancock@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['backup2010'] [17:16:21] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2010'] [17:17:38] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['backup2010'] [17:18:40] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2010'] [17:25:33] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['backup2010'] [17:26:24] !log restarted turnilo with ‘sudo systemctl restart turnilo’ [17:26:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P47030 and previous config saved to /var/cache/conftool/dbconfig/20230417-172623-ladsgroup.json [17:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P47031 and previous config saved to /var/cache/conftool/dbconfig/20230417-174130-ladsgroup.json [17:49:10] (03PS1) 10Jcrespo: dbbackups: Migrate db1116 sections to db1216 [puppet] - 10https://gerrit.wikimedia.org/r/909323 (https://phabricator.wikimedia.org/T334066) [17:54:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:56:25] (03CR) 10Aaron Schulz: Set "s3" as the default section name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893834 (owner: 10Aaron Schulz) [17:56:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T333332)', diff saved to https://phabricator.wikimedia.org/P47032 and previous config saved to /var/cache/conftool/dbconfig/20230417-175636-ladsgroup.json [17:56:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance [17:56:42] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [17:56:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance [17:57:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T333332)', diff saved to https://phabricator.wikimedia.org/P47033 and previous config saved to /var/cache/conftool/dbconfig/20230417-175700-ladsgroup.json [17:57:21] (03PS1) 10Jcrespo: .bashrc: Change alias location [puppet] - 10https://gerrit.wikimedia.org/r/909324 (https://phabricator.wikimedia.org/T334455) [17:59:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T333332)', diff saved to https://phabricator.wikimedia.org/P47034 and previous config saved to /var/cache/conftool/dbconfig/20230417-175943-ladsgroup.json [18:00:26] (03CR) 10Jcrespo: [C: 04-1] "Not yet, switchover not done." 
[puppet] - 10https://gerrit.wikimedia.org/r/909324 (https://phabricator.wikimedia.org/T334455) (owner: 10Jcrespo) [18:02:53] (03PS2) 10Majavah: ssh: extract enabled key types to a parameter [puppet] - 10https://gerrit.wikimedia.org/r/907939 [18:02:55] (03PS3) 10Majavah: ssh: add support for using a CA for host keys [puppet] - 10https://gerrit.wikimedia.org/r/907940 (https://phabricator.wikimedia.org/T268344) [18:06:07] (03CR) 10Majavah: ssh: add support for using a CA for host keys (0317 comments) [puppet] - 10https://gerrit.wikimedia.org/r/907940 (https://phabricator.wikimedia.org/T268344) (owner: 10Majavah) [18:07:00] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Migrate db1116 sections to db1216 [puppet] - 10https://gerrit.wikimedia.org/r/909323 (https://phabricator.wikimedia.org/T334066) (owner: 10Jcrespo) [18:07:58] (03CR) 10Kamila Součková: [C: 03+1] "LGTM" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/909212 (owner: 10Hnowlan) [18:11:57] (03PS1) 10Ssingh: hiera: lvs1019: update iface names for bullseye (eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/909325 (https://phabricator.wikimedia.org/T321309) [18:14:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P47035 and previous config saved to /var/cache/conftool/dbconfig/20230417-181449-ladsgroup.json [18:17:00] (03PS8) 10Jbond: environment: add environment.conf file and remove environments dir [puppet] - 10https://gerrit.wikimedia.org/r/907991 [18:17:02] (03PS10) 10Jbond: wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) [18:17:04] (03PS7) 10Jbond: core_modules: add core modules [puppet] - 10https://gerrit.wikimedia.org/r/908326 [18:17:06] (03PS50) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [18:17:08] (03PS1) 10Jbond: puppet::agent: Pass through 
the enable_puppet7 flag [puppet] - 10https://gerrit.wikimedia.org/r/909326 [18:19:04] (03PS1) 10Jcrespo: dbbackups: Reenable notifications for db1216 and db1225 [puppet] - 10https://gerrit.wikimedia.org/r/909327 (https://phabricator.wikimedia.org/T334057) [18:19:17] (03PS2) 10Jcrespo: dbbackups: Reenable notifications for db1216 and db1225 [puppet] - 10https://gerrit.wikimedia.org/r/909327 (https://phabricator.wikimedia.org/T334057) [18:20:39] (03PS2) 10Jbond: puppet::agent: Pass through the enable_puppet7 flag [puppet] - 10https://gerrit.wikimedia.org/r/909326 [18:20:41] (03PS9) 10Jbond: environment: add environment.conf file and remove environments dir [puppet] - 10https://gerrit.wikimedia.org/r/907991 [18:20:43] (03PS11) 10Jbond: wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) [18:20:45] (03PS8) 10Jbond: core_modules: add core modules [puppet] - 10https://gerrit.wikimedia.org/r/908326 [18:20:47] (03PS51) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [18:21:06] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Reenable notifications for db1216 and db1225 [puppet] - 10https://gerrit.wikimedia.org/r/909327 (https://phabricator.wikimedia.org/T334057) (owner: 10Jcrespo) [18:25:31] (03CR) 10CI reject: [V: 04-1] wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) (owner: 10Jbond) [18:29:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P47036 and previous config saved to /var/cache/conftool/dbconfig/20230417-182956-ladsgroup.json [18:34:35] (03PS1) 10Andrew Bogott: cinder: duplicate [rbd] config in cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/909332 [18:37:52] (03CR) 10Andrew Bogott: [C: 03+2] cinder: duplicate [rbd] config in 
cinder.conf [puppet] - 10https://gerrit.wikimedia.org/r/909332 (owner: 10Andrew Bogott) [18:42:53] (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:45:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T333332)', diff saved to https://phabricator.wikimedia.org/P47037 and previous config saved to /var/cache/conftool/dbconfig/20230417-184502-ladsgroup.json [18:45:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2148.codfw.wmnet with reason: Maintenance [18:45:08] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [18:45:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2148.codfw.wmnet with reason: Maintenance [18:45:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T333332)', diff saved to https://phabricator.wikimedia.org/P47038 and previous config saved to /var/cache/conftool/dbconfig/20230417-184525-ladsgroup.json [18:45:48] (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:57:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T333332)', diff saved to https://phabricator.wikimedia.org/P47039 and previous config saved to /var/cache/conftool/dbconfig/20230417-185710-ladsgroup.json [18:57:17] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [19:00:45] !log jelto@cumin2002 START - Cookbook sre.hosts.reimage 
for host gitlab2003.wikimedia.org with OS bullseye [19:10:51] (03PS1) 10Bartosz Dziewoński: Mobile editor: Don't try to take over if the form has already been submitted [extensions/MobileFrontend] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/909274 (https://phabricator.wikimedia.org/T334794) [19:11:16] (03PS1) 10Bartosz Dziewoński: Mobile editor: Don't try to take over on non-wikitext content [extensions/MobileFrontend] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/909275 (https://phabricator.wikimedia.org/T334799) [19:12:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P47040 and previous config saved to /var/cache/conftool/dbconfig/20230417-191217-ladsgroup.json [19:13:36] !log jelto@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [19:15:49] (03PS4) 10Bartosz Dziewoński: Stop using redundant $wmg variables for VisualEditor extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905711 (https://phabricator.wikimedia.org/T119117) [19:16:49] (03CR) 10Andrew Bogott: [C: 03+2] Openstack envscripts: allow unsetting environment variables [puppet] - 10https://gerrit.wikimedia.org/r/906776 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [19:16:55] !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [19:17:13] (03CR) 10Andrew Bogott: [C: 03+2] openstack::util::envscript: allow caller to specify domain_ids [puppet] - 10https://gerrit.wikimedia.org/r/906777 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [19:17:53] (03PS3) 10Andrew Bogott: Openstack envscripts.pp: create additional scripts for system and domain scope [puppet] - 10https://gerrit.wikimedia.org/r/906778 (https://phabricator.wikimedia.org/T330759) [19:18:16] (03CR) 10CI reject: [V: 04-1] Openstack 
envscripts.pp: create additional scripts for system and domain scope [puppet] - 10https://gerrit.wikimedia.org/r/906778 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [19:18:30] (03PS1) 10Joal: Update AQS druid datasource to current month [puppet] - 10https://gerrit.wikimedia.org/r/909335 [19:23:04] (03PS4) 10Andrew Bogott: Openstack envscripts.pp: create additional scripts for system and domain scope [puppet] - 10https://gerrit.wikimedia.org/r/906778 (https://phabricator.wikimedia.org/T330759) [19:25:20] (03CR) 10Andrew Bogott: [C: 03+2] Openstack envscripts.pp: create additional scripts for system and domain scope [puppet] - 10https://gerrit.wikimedia.org/r/906778 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [19:27:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P47041 and previous config saved to /var/cache/conftool/dbconfig/20230417-192723-ladsgroup.json [19:28:43] (03PS1) 10Andrew Bogott: OpenStack env scripts: different script names for new env scripts [puppet] - 10https://gerrit.wikimedia.org/r/909337 (https://phabricator.wikimedia.org/T330759) [19:31:00] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack env scripts: different script names for new env scripts [puppet] - 10https://gerrit.wikimedia.org/r/909337 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [19:32:01] !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2003.wikimedia.org with OS bullseye [19:42:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T333332)', diff saved to https://phabricator.wikimedia.org/P47042 and previous config saved to /var/cache/conftool/dbconfig/20230417-194229-ladsgroup.json [19:42:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [19:42:35] T333332: Add af_actor/afh_actor 
fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [19:42:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [19:42:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T333332)', diff saved to https://phabricator.wikimedia.org/P47043 and previous config saved to /var/cache/conftool/dbconfig/20230417-194253-ladsgroup.json [19:45:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T333332)', diff saved to https://phabricator.wikimedia.org/P47044 and previous config saved to /var/cache/conftool/dbconfig/20230417-194537-ladsgroup.json [19:47:58] (03PS1) 10Dzahn: gerrit::migration: ensure gerrit review_site dir exists [puppet] - 10https://gerrit.wikimedia.org/r/909338 (https://phabricator.wikimedia.org/T326368) [19:48:22] (03CR) 10CI reject: [V: 04-1] gerrit::migration: ensure gerrit review_site dir exists [puppet] - 10https://gerrit.wikimedia.org/r/909338 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [19:50:32] (03PS2) 10Dzahn: gerrit::migration: ensure gerrit review_site dir exists [puppet] - 10https://gerrit.wikimedia.org/r/909338 (https://phabricator.wikimedia.org/T326368) [19:52:25] (03PS1) 10Andrew Bogott: OpenStack envscripts: replace UNDEF with UNSET [puppet] - 10https://gerrit.wikimedia.org/r/909339 [19:54:38] (03CR) 10Dzahn: [C: 04-1] "ok on new and prod host but duplicate declaration on prod replica: https://puppet-compiler.wmflabs.org/output/909338/40705/gerrit2002.wiki" [puppet] - 10https://gerrit.wikimedia.org/r/909338 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [19:55:22] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack envscripts: replace UNDEF with UNSET [puppet] - 10https://gerrit.wikimedia.org/r/909339 (owner: 10Andrew Bogott) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: May I have your attention please! 
UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230417T2000) [20:00:05] koi and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:14] hello [20:00:21] hi [20:00:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P47045 and previous config saved to /var/cache/conftool/dbconfig/20230417-200043-ladsgroup.json [20:01:05] (unavailable, sorry) [20:03:02] I can deploy today [20:03:08] and hi TheresNoTime :) [20:03:43] (03CR) 10Urbanecm: [C: 03+2] Mobile editor: Don't try to take over if the form has already been submitted [extensions/MobileFrontend] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/909274 (https://phabricator.wikimedia.org/T334794) (owner: 10Bartosz Dziewoński) [20:03:45] (03CR) 10Urbanecm: [C: 03+2] Mobile editor: Don't try to take over on non-wikitext content [extensions/MobileFrontend] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/909275 (https://phabricator.wikimedia.org/T334799) (owner: 10Bartosz Dziewoński) [20:05:42] (03PS2) 10Urbanecm: ruwiki: Allow sysop to add/remove confirmed group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908945 (https://phabricator.wikimedia.org/T334780) (owner: 10Stang) [20:05:48] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908945 (https://phabricator.wikimedia.org/T334780) (owner: 10Stang) [20:06:31] (03Merged) 10jenkins-bot: ruwiki: Allow sysop to add/remove confirmed group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908945 (https://phabricator.wikimedia.org/T334780) (owner: 10Stang) [20:06:47] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:908945|ruwiki: Allow sysop to add/remove confirmed 
group (T334780)]] [20:06:52] T334780: Enable ruwiki sysops to add users to the `confirmed` group - https://phabricator.wikimedia.org/T334780 [20:08:00] !log urbanecm@deploy2002 urbanecm and stang: Backport for [[gerrit:908945|ruwiki: Allow sysop to add/remove confirmed group (T334780)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [20:08:12] koi: hi, can you test at mwdebug2001 please? [20:08:39] urbanecm, I checked Special:Listgrouprights and it works fine [20:08:43] syncing [20:08:46] ty [20:14:19] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:908945|ruwiki: Allow sysop to add/remove confirmed group (T334780)]] (duration: 07m 31s) [20:14:24] T334780: Enable ruwiki sysops to add users to the `confirmed` group - https://phabricator.wikimedia.org/T334780 [20:14:24] koi: should be live [20:14:48] MatmaRex: double-checking, your config is ok to go before the backports, right? [20:15:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P47046 and previous config saved to /var/cache/conftool/dbconfig/20230417-201549-ladsgroup.json [20:15:52] urbanecm: hi, yes, they're unrelated [20:16:01] thanks (and hi). proceeding then. 
[20:16:13] (03PS5) 10Urbanecm: Stop using redundant $wmg variables for VisualEditor extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905711 (https://phabricator.wikimedia.org/T119117) (owner: 10Bartosz Dziewoński) [20:16:17] (03CR) 10Urbanecm: [C: 03+2] Stop using redundant $wmg variables for VisualEditor extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905711 (https://phabricator.wikimedia.org/T119117) (owner: 10Bartosz Dziewoński) [20:16:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905711 (https://phabricator.wikimedia.org/T119117) (owner: 10Bartosz Dziewoński) [20:17:16] (03Merged) 10jenkins-bot: Stop using redundant $wmg variables for VisualEditor extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905711 (https://phabricator.wikimedia.org/T119117) (owner: 10Bartosz Dziewoński) [20:17:31] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:905711|Stop using redundant $wmg variables for VisualEditor extension (T119117)]] [20:17:36] T119117: Get rid of $wg = $wmg hack for extensions that have been converted to using extension.json - https://phabricator.wikimedia.org/T119117 [20:18:34] (03Merged) 10jenkins-bot: Mobile editor: Don't try to take over if the form has already been submitted [extensions/MobileFrontend] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/909274 (https://phabricator.wikimedia.org/T334794) (owner: 10Bartosz Dziewoński) [20:18:39] (03Merged) 10jenkins-bot: Mobile editor: Don't try to take over on non-wikitext content [extensions/MobileFrontend] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/909275 (https://phabricator.wikimedia.org/T334799) (owner: 10Bartosz Dziewoński) [20:18:49] !log urbanecm@deploy2002 urbanecm and matmarex: Backport for [[gerrit:905711|Stop using redundant $wmg variables for VisualEditor extension (T119117)]] synced to the testservers: mwdebug1002.eqiad.wmnet, 
mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [20:18:58] MatmaRex: please test your patch at mwdebug2001, thanks! [20:20:12] urbanecm: seems good [20:20:17] thanks, proceeding [20:20:43] the patch should be a no-op, i just tested that editor still works as before [20:21:23] (03CR) 10Ottomata: [C: 03+2] Update AQS druid datasource to current month [puppet] - 10https://gerrit.wikimedia.org/r/909335 (owner: 10Joal) [20:22:05] yeah, makes sense [20:25:50] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:905711|Stop using redundant $wmg variables for VisualEditor extension (T119117)]] (duration: 08m 19s) [20:25:55] T119117: Get rid of $wg = $wmg hack for extensions that have been converted to using extension.json - https://phabricator.wikimedia.org/T119117 [20:26:12] should be synced. going on backports [20:26:37] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:909274|Mobile editor: Don't try to take over if the form has already been submitted (T334794 T334797 T334877)]], [[gerrit:909275|Mobile editor: Don't try to take over on non-wikitext content (T334799)]] [20:26:46] T334794: Section=new (commentbox inputbox) is broken in MobileFrontEnd - https://phabricator.wikimedia.org/T334794 [20:26:46] T334799: Editor doesn't load when editing CSS/JS/Lua on mobile - https://phabricator.wikimedia.org/T334799 [20:26:47] T334877: Unable to publish changes to fr.m.wikipedia.org - https://phabricator.wikimedia.org/T334877 [20:26:47] T334797: Impossible to edit without JS or undo an edit on mobile site (414 URI Too Long) - https://phabricator.wikimedia.org/T334797 [20:27:49] !log urbanecm@deploy2002 urbanecm and matmarex: Backport for [[gerrit:909274|Mobile editor: Don't try to take over if the form has already been submitted (T334794 T334797 T334877)]], [[gerrit:909275|Mobile editor: Don't try to take over on non-wikitext content (T334799)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, 
mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:28:01] MatmaRex: backports are now at mwdebug2001. can you test please? [20:28:08] yeah [20:30:21] urbanecm: looking good [20:30:25] thanks, proceeding [20:30:26] (03PS3) 10Dzahn: gerrit::migration: ensure gerrit review_site dir exists [puppet] - 10https://gerrit.wikimedia.org/r/909338 (https://phabricator.wikimedia.org/T326368) [20:30:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T333332)', diff saved to https://phabricator.wikimedia.org/P47047 and previous config saved to /var/cache/conftool/dbconfig/20230417-203056-ladsgroup.json [20:30:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2175.codfw.wmnet with reason: Maintenance [20:31:01] T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332 [20:31:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2175.codfw.wmnet with reason: Maintenance [20:31:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T333332)', diff saved to https://phabricator.wikimedia.org/P47048 and previous config saved to /var/cache/conftool/dbconfig/20230417-203108-ladsgroup.json [20:32:25] hm, scap says "20:31:12 Check 'Check endpoints for mw2376.codfw.wmnet' failed: /wiki/{title} (Special Version) is CRITICAL: Test Special Version returned the unexpected status 500 (expecting: 200)" [20:33:19] looks like a temp failure though, Special:Version works now at that appserver [20:33:36] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:33:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T333332)', diff saved to https://phabricator.wikimedia.org/P47049 and previous config saved to 
/var/cache/conftool/dbconfig/20230417-203350-ladsgroup.json [20:35:26] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service,httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:35:30] !log otto@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [20:35:52] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:909274|Mobile editor: Don't try to take over if the form has already been submitted (T334794 T334797 T334877)]], [[gerrit:909275|Mobile editor: Don't try to take over on non-wikitext content (T334799)]] (duration: 09m 14s) [20:35:56] and, deployed [20:35:59] T334794: Section=new (commentbox inputbox) is broken in MobileFrontEnd - https://phabricator.wikimedia.org/T334794 [20:36:00] T334799: Editor doesn't load when editing CSS/JS/Lua on mobile - https://phabricator.wikimedia.org/T334799 [20:36:00] T334877: Unable to publish changes to fr.m.wikipedia.org - https://phabricator.wikimedia.org/T334877 [20:36:00] T334797: Impossible to edit without JS or undo an edit on mobile site (414 URI Too Long) - https://phabricator.wikimedia.org/T334797 [20:36:04] MatmaRex: should be all done. anything else? [20:36:29] thanks! [20:37:51] no problem [20:39:21] (PS4) Dzahn: gerrit::migration: ensure gerrit review_site dir exists [puppet] - https://gerrit.wikimedia.org/r/909338 (https://phabricator.wikimedia.org/T326368) [20:41:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:42:37] urbanecm Do you have some time to deploy another patch? [20:42:49] I was about to leave, but sure, let's do it :) [20:42:51] which one? 
[20:43:10] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/908972 [20:43:46] Sorry I just returned home (that's why I didn't schedule it) [20:43:47] (PS3) Urbanecm: [trwikiquote] Add a HD logo for Vector legacy [mediawiki-config] - https://gerrit.wikimedia.org/r/908972 (https://phabricator.wikimedia.org/T334732) (owner: Superpes15) [20:43:51] (CR) Urbanecm: [C: +2] [trwikiquote] Add a HD logo for Vector legacy [mediawiki-config] - https://gerrit.wikimedia.org/r/908972 (https://phabricator.wikimedia.org/T334732) (owner: Superpes15) [20:43:57] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/908972 (https://phabricator.wikimedia.org/T334732) (owner: Superpes15) [20:44:01] no worries [20:44:39] (Merged) jenkins-bot: [trwikiquote] Add a HD logo for Vector legacy [mediawiki-config] - https://gerrit.wikimedia.org/r/908972 (https://phabricator.wikimedia.org/T334732) (owner: Superpes15) [20:44:55] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:908972|[trwikiquote] Add a HD logo for Vector legacy (T334732)]] [20:45:01] T334732: Legacy Turkish Wikiquote PNG logo has white borders - https://phabricator.wikimedia.org/T334732 [20:46:08] !log urbanecm@deploy2002 urbanecm and superpes: Backport for [[gerrit:908972|[trwikiquote] Add a HD logo for Vector legacy (T334732)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [20:46:13] Superpes: can you test? 
[20:46:19] Yep looking [20:46:34] (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/output/909338/40707/" [puppet] - https://gerrit.wikimedia.org/r/909338 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn) [20:46:41] urbanecm Looks fine :) [20:46:46] syncing :) [20:47:07] Thanks and sorry for the late [20:47:24] no worries [20:48:37] !log joal@deploy2002 Started restart [analytics/aqs/deploy@d273fde]: Restarting AQS to pick up new druid datasource [20:48:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P47051 and previous config saved to /var/cache/conftool/dbconfig/20230417-204856-ladsgroup.json [20:49:18] PROBLEM - SSH on stat1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:50:51] !log otto@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. 
[20:51:58] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:908972|[trwikiquote] Add a HD logo for Vector legacy (T334732)]] (duration: 07m 02s) [20:52:03] T334732: Legacy Turkish Wikiquote PNG logo has white borders - https://phabricator.wikimedia.org/T334732 [20:52:05] Superpes: it's live :) [20:52:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:52:35] urbanecm Thanks :) [20:52:40] no problem :) [20:53:20] Actually UTC+1 was better for me lol [20:57:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:00:04] Reedy, sbassett, Maryum, and manfredi: #bothumor I � Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230417T2100). 
[21:04:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P47052 and previous config saved to /var/cache/conftool/dbconfig/20230417-210403-ladsgroup.json [21:06:31] (CR) Cwhite: [C: +2] logstash: ulogd remove copy network.transport to network.protocol [puppet] - https://gerrit.wikimedia.org/r/886857 (https://phabricator.wikimedia.org/T329195) (owner: Cwhite) [21:08:32] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:08:52] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:12:52] Hey all - I’d like to deploy one last mitigation update for T333140 to PrivateSettings.php during the security window [21:17:57] !log bking@cumin1001 ban cloudelastic1004 for upcoming switch maintenance T333377 [21:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:03] T333377: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 [21:18:04] RECOVERY - SSH on stat1006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:18:41] (PS1) Zabe: RC: Handle deleted story [extensions/Wikistories] (wmf/1.41.0-wmf.4) - https://gerrit.wikimedia.org/r/909276 (https://phabricator.wikimedia.org/T334829) [21:19:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T333332)', diff saved to https://phabricator.wikimedia.org/P47053 and previous config saved to /var/cache/conftool/dbconfig/20230417-211909-ladsgroup.json [21:19:15] T333332: Add af_actor/afh_actor fields to wmf wikis - 
https://phabricator.wikimedia.org/T333332 [21:20:09] !log Deployed updated mitigation for T333140 [21:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:24:01] sbassett: you're done? [21:26:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:27:14] (CR) Cwhite: [C: +2] opensearch_dashboards: add package provider [puppet] - https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: Cwhite) [21:29:59] (CR) Zabe: [C: +2] RC: Handle deleted story [extensions/Wikistories] (wmf/1.41.0-wmf.4) - https://gerrit.wikimedia.org/r/909276 (https://phabricator.wikimedia.org/T334829) (owner: Zabe) [21:31:32] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:34:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:34:58] (PS1) Zabe: Fix infinite loop for self-redirects with variants conversion [core] (wmf/1.41.0-wmf.4) - https://gerrit.wikimedia.org/r/909277 (https://phabricator.wikimedia.org/T333050) [21:35:05] (Merged) jenkins-bot: RC: Handle deleted story [extensions/Wikistories] (wmf/1.41.0-wmf.4) - https://gerrit.wikimedia.org/r/909276 (https://phabricator.wikimedia.org/T334829) (owner: 
Zabe) [21:35:21] (CR) Zabe: [C: +2] Fix infinite loop for self-redirects with variants conversion [core] (wmf/1.41.0-wmf.4) - https://gerrit.wikimedia.org/r/909277 (https://phabricator.wikimedia.org/T333050) (owner: Zabe) [21:37:06] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:38:03] !log zabe@deploy2002 Started scap: Backport for [[gerrit:909276|RC: Handle deleted story (T334829)]] [21:38:09] T334829: Error: Call to a member function getComment() on null - https://phabricator.wikimedia.org/T334829 [21:39:17] !log zabe@deploy2002 zabe: Backport for [[gerrit:909276|RC: Handle deleted story (T334829)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:44:59] (PS1) Cwhite: profile: ensure several dashboards plugins are absent [puppet] - https://gerrit.wikimedia.org/r/908884 (https://phabricator.wikimedia.org/T333732) [21:45:05] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:909276|RC: Handle deleted story (T334829)]] (duration: 07m 01s) [21:45:10] T334829: Error: Call to a member function getComment() on null - https://phabricator.wikimedia.org/T334829 [21:53:16] (Merged) jenkins-bot: Fix infinite loop for self-redirects with variants conversion [core] (wmf/1.41.0-wmf.4) - https://gerrit.wikimedia.org/r/909277 (https://phabricator.wikimedia.org/T333050) (owner: Zabe) [21:53:36] !log zabe@deploy2002 Started scap: Backport for [[gerrit:909277|Fix infinite loop for self-redirects with variants conversion (T333050)]] [21:54:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:54:49] !log zabe@deploy2002 zabe: Backport for [[gerrit:909277|Fix infinite loop for self-redirects with variants conversion (T333050)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:56:50] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:57:12] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:59:38] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 13 hosts with reason: T333377 maint [21:59:43] T333377: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 [21:59:59] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 13 hosts with reason: T333377 maint [22:00:29] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:909277|Fix infinite loop for self-redirects with variants conversion (T333050)]] (duration: 06m 52s) [22:42:53] (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:45:48] (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:02:57] (PS1) Jdlrobson: [beta cluster] 
Enable indicators on page load [mediawiki-config] - https://gerrit.wikimedia.org/r/909396 (https://phabricator.wikimedia.org/T333601) [23:09:35] (PS1) BryanDavis: toolforge: Use shard name 'toolsdb' in profile::wmcs::services::toolsdb_* [puppet] - https://gerrit.wikimedia.org/r/909397 [23:11:31] (CR) BryanDavis: "I have also attempted to set this in instance-specific hiera via Horizon for the current instances. Putting it in the profiles seems like " [puppet] - https://gerrit.wikimedia.org/r/909397 (owner: BryanDavis)