[00:01:09] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:52] (PS3) Xcollazo: Add missing airflow service users to yarn's production queue [puppet] - https://gerrit.wikimedia.org/r/824241 (https://phabricator.wikimedia.org/T312858)
[00:05:20] SRE, Infrastructure-Foundations, Observability-Metrics: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. - https://phabricator.wikimedia.org/T315388 (Dzahn) fixed example code at https://wikitech.wikimedia.org/w/index.php?title=UID&type=revision&diff=20...
[00:05:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:07:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2023.mgmt.codfw.wmnet with reboot policy FORCED
[00:24:41] (CR) Paladox: [C: +1] gerrit: update style for Gerrit 3.5 [puppet] - https://gerrit.wikimedia.org/r/824221 (https://phabricator.wikimedia.org/T315445) (owner: Hashar)
[00:24:57] (CR) Paladox: [C: +1] gerrit: remove Gerrit 3.5 obsolete @apply css statement [puppet] - https://gerrit.wikimedia.org/r/824222 (https://phabricator.wikimedia.org/T315445) (owner: Hashar)
[00:29:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2023.mgmt.codfw.wmnet with reboot policy FORCED
[00:31:50] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2023.codfw.wmnet']
[00:35:15] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes2023.codfw.wmnet']
[00:38:29] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:39:59] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2023.codfw.wmnet']
[00:40:57] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:41:23] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes2023.codfw.wmnet']
[00:45:37] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:39] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:48] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2023.codfw.wmnet']
[00:47:47] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes2023.codfw.wmnet']
[00:48:41] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2023.codfw.wmnet']
[00:49:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2023.codfw.wmnet']
[00:50:05] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (Papaul)
[00:52:41] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:54] SRE-OnFire, Cassandra, Sustainability (Incident Followup): Document best-practice for hinted-handoff - https://phabricator.wikimedia.org/T315517 (Eevans)
[01:02:03] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:25:31] PROBLEM - SSH on ms-be1041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:26:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:35:07] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:44:47] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:52:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:09:35] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (35) node(s) change every puppet run: an-test-client1001, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudcephmon1001, cloudcephmon1002, cloudcephmon1003, ms-be1071, ms-be2028, ms-be2067, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, st
[02:09:35] stat1005, stat1006, stat1007, stat1008, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[02:09:54] SRE, Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (tstarling) cpuinfo on eqiad appservers: `lang=none $ sudo cumin 'A:mw-eqiad' 'grep '\''model name'\'' /proc/cpuinfo | head -n1' ===== NODE GROUP =====...
[02:15:56] !log on mw1411, mw1413, mw1419, mw1429, mw1431, mw1433: set scaling_governor to performance T315398
[02:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:16:00] T315398: Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398
[02:17:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:30:27] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (35) node(s) change every puppet run: an-test-client1001, aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudcephmon1001, cloudcephmon1002, cloudcephmon1003, ms-be1071, ms-be2028, ms-be2067, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, st
[02:30:27] stat1005, stat1006, stat1007, stat1008, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[02:53:33] SRE, Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (tstarling) We're seeing no effect. {F35470849} From [[https://grafana.wikimedia.org/d/5QOxR_m4k/scaling_governor-t315398?orgId=1|custom dashboard]] Note that this is the same mo...
[03:15:45] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: adds-changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:26:45] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[03:29:03] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[03:40:57] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:43:27] PROBLEM - SSH on mw1315.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:45:35] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:24:56] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:29:08] SRE, Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (tstarling) I didn't apply the policy correctly. I only managed to set it on cpu0. On the codfw benchmark I did it correctly.
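Editorial note on the 04:29:08 comment: the log does not show how the governor was applied on these hosts, so the following is only a minimal sketch of why writing to cpu0 alone leaves the other cores untouched. It assumes the standard Linux cpufreq sysfs layout (`/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor`, one file per CPU); the function names `set_governor` and `read_governors` are hypothetical, and the `sysfs_root` parameter exists only so the sketch can be tried against a scratch directory instead of a live host.

```python
from __future__ import annotations

from pathlib import Path

# Assumption: standard Linux cpufreq sysfs location; not taken from the log.
DEFAULT_ROOT = Path("/sys/devices/system/cpu")
# "cpu[0-9]*" matches cpu0, cpu1, ... but skips siblings such as cpufreq/ or cpuidle/.
GOVERNOR_GLOB = "cpu[0-9]*/cpufreq/scaling_governor"


def set_governor(governor: str, sysfs_root: Path = DEFAULT_ROOT) -> list[str]:
    """Write `governor` to every per-CPU scaling_governor file.

    Returns the names of the CPU directories that were updated, so the
    caller can check that more than just cpu0 was touched.
    """
    updated = []
    for gov_file in sorted(sysfs_root.glob(GOVERNOR_GLOB)):
        gov_file.write_text(governor + "\n")
        updated.append(gov_file.parent.parent.name)
    return updated


def read_governors(sysfs_root: Path = DEFAULT_ROOT) -> dict[str, str]:
    """Return a mapping of CPU directory name to its current governor."""
    return {
        gov_file.parent.parent.name: gov_file.read_text().strip()
        for gov_file in sorted(sysfs_root.glob(GOVERNOR_GLOB))
    }
```

On a real host, writing these files needs root, and `read_governors()` afterwards is a cheap way to confirm that every core reports the intended governor before looking at latency graphs.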
[04:30:02] !log on mw1411, mw1413, mw1419, mw1429, mw1431, mw1433: set scaling_governor to performance, attempt 2, T315398
[04:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:30:06] T315398: Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398
[04:43:47] RECOVERY - SSH on mw1315.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:49:05] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:50:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 31 hosts with reason: Primary switchover s8 T314369
[04:50:53] T314369: Switchover s8 master (db1104 -> db1109) - https://phabricator.wikimedia.org/T314369
[04:51:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 31 hosts with reason: Primary switchover s8 T314369
[04:52:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1109 with weight 0 T314369', diff saved to https://phabricator.wikimedia.org/P32471 and previous config saved to /var/cache/conftool/dbconfig/20220818-045218-ladsgroup.json
[04:54:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 31 hosts with reason: Primary switchover s8 T314369
[04:54:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 31 hosts with reason: Primary switchover s8 T314369
[04:56:05] RECOVERY - SSH on ms-be1041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:56:49] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:00:27] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:09:07] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:10:21] SRE, Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (tstarling) That did the trick. I applied a 20 minute moving average so the ramp rate seen here is not real. {F35470912} Mean latency fell from 220ms to 185ms, a 16% drop. Power c...
[05:14:43] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:16:05] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:16:11] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:17:16] (PS2) Ladsgroup: mariadb: Promote db1109 to s8 master [puppet] - https://gerrit.wikimedia.org/r/819548 (https://phabricator.wikimedia.org/T314369) (owner: Gerrit maintenance bot)
[05:17:21] (CR) Ladsgroup: [V: +2 C: +2] mariadb: Promote db1109 to s8 master [puppet] - https://gerrit.wikimedia.org/r/819548 (https://phabricator.wikimedia.org/T314369) (owner: Gerrit maintenance bot)
[05:36:01] (PS1) Marostegui: mariadb: Productionize db2111 [puppet] - https://gerrit.wikimedia.org/r/824331 (https://phabricator.wikimedia.org/T311494)
[05:36:20] (PS2) Marostegui: mariadb: Productionize db2178 [puppet] - https://gerrit.wikimedia.org/r/824331 (https://phabricator.wikimedia.org/T311494)
[05:36:58] SRE, Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (tstarling) The amount of time cores spend with a clock speed over 2GHz increased from 31% to 87%. That seems excessive given that CPU utilization is only ~16%. But it's hard to arg...
[05:37:27] (CR) Marostegui: [C: +2] mariadb: Productionize db2178 [puppet] - https://gerrit.wikimedia.org/r/824331 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[05:38:49] (CR) Marostegui: "This needs a manual rebase apparently?" [puppet] - https://gerrit.wikimedia.org/r/768652 (https://phabricator.wikimedia.org/T302950) (owner: Gerrit maintenance bot)
[05:39:03] (CR) Marostegui: "This needs a manual rebase apparently?" [puppet] - https://gerrit.wikimedia.org/r/768653 (https://phabricator.wikimedia.org/T302950) (owner: Gerrit maintenance bot)
[05:41:05] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:44:00] (PS1) Marostegui: site.pp: Remove insetup from db2178 [puppet] - https://gerrit.wikimedia.org/r/824333 (https://phabricator.wikimedia.org/T311494)
[05:44:46] (CR) Marostegui: [C: +2] site.pp: Remove insetup from db2178 [puppet] - https://gerrit.wikimedia.org/r/824333 (https://phabricator.wikimedia.org/T311494) (owner: Marostegui)
[05:45:43] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:56:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T314041)', diff saved to https://phabricator.wikimedia.org/P32473 and previous config saved to /var/cache/conftool/dbconfig/20220818-055606-ladsgroup.json
[05:56:11] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[06:00:04] kormat, marostegui, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220818T0600).
[06:00:10] o/
[06:00:13] o/
[06:01:10] I lost the ticket of switchover, one sec
[06:01:23] !log Starting s8 eqiad failover from db1104 to db1109 - T314369
[06:01:24] https://phabricator.wikimedia.org/T314369
[06:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:27] T314369: Switchover s8 master (db1104 -> db1109) - https://phabricator.wikimedia.org/T314369
[06:01:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s8 eqiad as read-only for maintenance - T314369', diff saved to https://phabricator.wikimedia.org/P32474 and previous config saved to /var/cache/conftool/dbconfig/20220818-060137-ladsgroup.json
[06:02:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1109 to s8 primary and set section read-write T314369', diff saved to https://phabricator.wikimedia.org/P32475 and previous config saved to /var/cache/conftool/dbconfig/20220818-060213-ladsgroup.json
[06:02:34] can edit
[06:02:39] edits are flowing back
[06:04:04] recentchanges increasing
[06:04:57] (PS2) Ladsgroup: wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/819549 (https://phabricator.wikimedia.org/T314369) (owner: Gerrit maintenance bot)
[06:05:07] (CR) Ladsgroup: [V: +2 C: +2] wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/819549 (https://phabricator.wikimedia.org/T314369) (owner: Gerrit maintenance bot)
[06:07:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1104 T314369', diff saved to https://phabricator.wikimedia.org/P32476 and previous config saved to /var/cache/conftool/dbconfig/20220818-060707-ladsgroup.json
[06:07:11] T314369: Switchover s8 master (db1104 -> db1109) - https://phabricator.wikimedia.org/T314369
[06:08:29] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:08:41] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:11:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P32477 and previous config saved to /var/cache/conftool/dbconfig/20220818-061112-ladsgroup.json
[06:11:37] !log dbmaint@s8 eqiad (T314369 T312863 T309311 T60674 T303603 T310485)
[06:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:44] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[06:11:45] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674
[06:11:45] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311
[06:11:45] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[06:12:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maint
[06:12:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maint
[06:15:23] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:15:33] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:16:45] (PS13) Giuseppe Lavagetto: docker_registry_ha: Authorize GitLab trusted runners using JWT [puppet] - https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) (owner: Dduvall)
[06:17:29] (CR) Giuseppe Lavagetto: [C: +2] "Let's try this. I'll release the change on one host, run as many tests as I can there, then deploy everywhere" [puppet] - https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) (owner: Dduvall)
[06:25:50] (PS2) Giuseppe Lavagetto: deployment-prep: serve php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/824216 (https://phabricator.wikimedia.org/T306042)
[06:25:52] (PS2) Giuseppe Lavagetto: mediawiki::jobrunner: allow picking a default php version [puppet] - https://gerrit.wikimedia.org/r/824217 (https://phabricator.wikimedia.org/T306042)
[06:25:54] (PS2) Giuseppe Lavagetto: deployment-prep: convert jobrunner to use php 7.4 by default [puppet] - https://gerrit.wikimedia.org/r/824218 (https://phabricator.wikimedia.org/T306042)
[06:25:56] (PS1) Giuseppe Lavagetto: jwt_authorizer::service: fix ensure parameter [puppet] - https://gerrit.wikimedia.org/r/824341
[06:26:15] (PS2) Giuseppe Lavagetto: jwt_authorizer::service: fix ensure parameter [puppet] - https://gerrit.wikimedia.org/r/824341
[06:26:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1104.eqiad.wmnet with reason: Maintenance
[06:26:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P32478 and previous config saved to /var/cache/conftool/dbconfig/20220818-062618-ladsgroup.json
[06:26:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1104.eqiad.wmnet with reason: Maintenance
[06:26:37] (CR) Giuseppe Lavagetto: [V: +2 C: +2] jwt_authorizer::service: fix ensure parameter [puppet] - https://gerrit.wikimedia.org/r/824341 (owner: Giuseppe Lavagetto)
[06:29:26] (PS1) Giuseppe Lavagetto: Revert "docker_registry_ha: Authorize GitLab trusted runners using JWT" [puppet] - https://gerrit.wikimedia.org/r/824174
[06:29:37] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Revert "docker_registry_ha: Authorize GitLab trusted runners using JWT" [puppet] - https://gerrit.wikimedia.org/r/824174 (owner: Giuseppe Lavagetto)
[06:35:53] (Abandoned) Ladsgroup: db1124: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/768653 (https://phabricator.wikimedia.org/T302950) (owner: Gerrit maintenance bot)
[06:35:58] (CR) Ladsgroup: db1124: Disable notifications (1 comment) [puppet] - https://gerrit.wikimedia.org/r/768653 (https://phabricator.wikimedia.org/T302950) (owner: Gerrit maintenance bot)
[06:36:04] (Abandoned) Ladsgroup: db1125: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/768652 (https://phabricator.wikimedia.org/T302950) (owner: Gerrit maintenance bot)
[06:41:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T314041)', diff saved to https://phabricator.wikimedia.org/P32479 and previous config saved to /var/cache/conftool/dbconfig/20220818-064124-ladsgroup.json
[06:41:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[06:41:28] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[06:41:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[06:46:44] (CR) JMeybohm: [C: +2] sre.discovery.service-route: Fix argument parser [cookbooks] - https://gerrit.wikimedia.org/r/824234 (owner: JMeybohm)
[06:50:24] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:50:41] (Merged) jenkins-bot: sre.discovery.service-route: Fix argument parser [cookbooks] - https://gerrit.wikimedia.org/r/824234 (owner: JMeybohm)
[06:52:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1104.eqiad.wmnet with reason: Maintenance
[06:52:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1104.eqiad.wmnet with reason: Maintenance
[06:53:50] jouncebot: nowandnext
[06:53:50] No deployments scheduled for the next 0 hour(s) and 6 minute(s)
[06:53:50] In 0 hour(s) and 6 minute(s): UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220818T0700)
[06:54:02] no patch there, awesome
[06:54:36] (PS1) Ladsgroup: SpecialRecentChangesLinked: Use array_uniqe on fields and tables [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824175
[06:55:05] (PS1) Ladsgroup: Revert "Revert "SpecialRecentChangesLinked: Use rdbms code for building the main query"" [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824176
[06:55:10] (CR) Ladsgroup: [C: +2] SpecialRecentChangesLinked: Use array_uniqe on fields and tables [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824175 (owner: Ladsgroup)
[06:55:13] (CR) Ladsgroup: [C: +2] Revert "Revert "SpecialRecentChangesLinked: Use rdbms code for building the main query"" [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824176 (owner: Ladsgroup)
[06:56:07] (PS7) Abijeet Patro: Enable message bundle on MetaWiki for WikiLearn [mediawiki-config] - https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587)
[06:57:10] SRE, SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, and 3 others: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (JMeybohm) >>! In T260663#6924279, @JMeybohm wrote: > The cookbook does not seem to work (tried du...
[07:00:05] Amir1, apergos, jnuche, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220818T0700).
[07:00:17] hello! there are no trainees signed up today and no patches in the window.
[07:08:49] (CR) Tim Starling: "I would like to cherry-pick the cancelAtomic() and modtoken fixes into production, then re-enable multi-DC, ideally in the next 24 hours." [puppet] - https://gerrit.wikimedia.org/r/824039 (https://phabricator.wikimedia.org/T315271) (owner: Tim Starling)
[07:14:01] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/824299 (https://phabricator.wikimedia.org/T314936) (owner: Andrea Denisse)
[07:14:28] (CR) Filippo Giunchedi: [C: +1] netmon: Set correct username/groupname mappings for Rancid [puppet] - https://gerrit.wikimedia.org/r/824286 (https://phabricator.wikimedia.org/T314972) (owner: Andrea Denisse)
[07:14:32] (Merged) jenkins-bot: SpecialRecentChangesLinked: Use array_uniqe on fields and tables [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824175 (owner: Ladsgroup)
[07:14:36] (CR) Filippo Giunchedi: [C: +1] netmon: Set correct username/groupname mappings for LibreNMS [puppet] - https://gerrit.wikimedia.org/r/824284 (https://phabricator.wikimedia.org/T314972) (owner: Andrea Denisse)
[07:14:39] (Merged) jenkins-bot: Revert "Revert "SpecialRecentChangesLinked: Use rdbms code for building the main query"" [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824176 (owner: Ladsgroup)
[07:18:14] (CR) Filippo Giunchedi: [C: +2] swift: bump proxy memcache max connections [puppet] - https://gerrit.wikimedia.org/r/822040 (https://phabricator.wikimedia.org/T314914) (owner: Filippo Giunchedi)
[07:19:03] (PS1) JMeybohm: httpbb docker-registry: Fix catalog check [puppet] - https://gerrit.wikimedia.org/r/824394
[07:20:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:22:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1104.eqiad.wmnet with reason: Maintenance
[07:22:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1104.eqiad.wmnet with reason: Maintenance
[07:24:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:24:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:24:17] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.25/includes/specials/SpecialRecentChangesLinked.php: Backport: [[gerrit:824176|Revert "Revert "SpecialRecentChangesLinked: Use rdbms code for building the main query""]] (duration: 03m 31s)
[07:25:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:26:27] !log roll-restart swift-proxy to apply bumbed memcached limits T314914
[07:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:31] T314914: Bump memcache connections and swift-proxy limits - https://phabricator.wikimedia.org/T314914
[07:28:23] jouncebot: nowandnext
[07:28:23] For the next 0 hour(s) and 31 minute(s): UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220818T0700)
[07:28:23] In 2 hour(s) and 31 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220818T1000)
[07:28:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance
[07:28:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance
[07:28:54] seems like no patches were scheduled, and no trainees signed up. I'll sneak something in.
[07:30:10] (PS1) Urbanecm: Fix structured task restriction check [extensions/GrowthExperiments] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824183 (https://phabricator.wikimedia.org/T315516)
[07:30:20] (CR) Urbanecm: [C: +2] Fix structured task restriction check [extensions/GrowthExperiments] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824183 (https://phabricator.wikimedia.org/T315516) (owner: Urbanecm)
[07:31:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1104 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P32480 and previous config saved to /var/cache/conftool/dbconfig/20220818-073113-ladsgroup.json
[07:31:54] (CR) Urbanecm: [V: +2 C: +2] "workarounding CI breakage (T315489) to fix an urgent issue in GE" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824183 (https://phabricator.wikimedia.org/T315516) (owner: Urbanecm)
[07:34:29] (CR) Nikerabbit: [C: +1] "If we need more testing to increase confidence, we could enable it just on labs first." [mediawiki-config] - https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587) (owner: Abijeet Patro)
[07:35:04] SRE, ops-codfw, DC-Ops, SRE Observability, observability: Q1:rack/setup/install netmon2002 - https://phabricator.wikimedia.org/T313867 (fgiunchedi) Thank you @papaul!
[07:36:09] SRE, ops-codfw, DC-Ops, observability: Q1:rack/setup/install graphite2004 - https://phabricator.wikimedia.org/T313851 (fgiunchedi) Thank you @Papaul !
[07:39:05] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.25/extensions/GrowthExperiments/modules/ext.growthExperiments.HelpPanel/SuggestedEditsGuidance.js: 520cd7b78631f993681a77e1baa7a77f9b5d0961: Fix structured task restriction check (T315516) (duration: 03m 17s) [07:39:10] T315516: [regression-wmf.25] mobile: Structured Add link task displays "Page is protected, abandoning structured task" for non-protected pages - https://phabricator.wikimedia.org/T315516 [07:39:14] (03PS1) 10STran: Deploy partial action blocks to cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824395 (https://phabricator.wikimedia.org/T315525) [07:40:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:41:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:41:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:41:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:42:04] (03PS1) 10Marostegui: mariadb: Productionize db2179 [puppet] - 10https://gerrit.wikimedia.org/r/824396 (https://phabricator.wikimedia.org/T311494) [07:43:14] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2179 [puppet] - 10https://gerrit.wikimedia.org/r/824396 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui) [07:46:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1104 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P32482 and previous config saved to /var/cache/conftool/dbconfig/20220818-074618-ladsgroup.json [07:46:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [07:46:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [07:46:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 
on db1141.eqiad.wmnet with reason: Maintenance [07:46:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [07:46:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T312972)', diff saved to https://phabricator.wikimedia.org/P32483 and previous config saved to /var/cache/conftool/dbconfig/20220818-074652-marostegui.json [07:46:55] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [07:49:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T312972)', diff saved to https://phabricator.wikimedia.org/P32484 and previous config saved to /var/cache/conftool/dbconfig/20220818-074859-marostegui.json [07:52:52] (03PS1) 10Marostegui: ProductionServices.php: Promote pc1013 as pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824398 (https://phabricator.wikimedia.org/T315526) [07:53:57] (03PS1) 10Marostegui: pc1013: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/824399 (https://phabricator.wikimedia.org/T315526) [07:55:11] (03CR) 10Marostegui: [C: 03+2] pc1013: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/824399 (https://phabricator.wikimedia.org/T315526) (owner: 10Marostegui) [07:55:44] 10SRE, 10Data-Engineering, 10Foundational Technology Requests: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (10fgiunchedi) Thank you @Ottomata for the example and extensive explanation! 
I'll take a closer look and play with it a little bit too [07:58:53] (03CR) 10Ladsgroup: [C: 03+1] ProductionServices.php: Promote pc1013 as pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824398 (https://phabricator.wikimedia.org/T315526) (owner: 10Marostegui) [08:01:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1104 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P32485 and previous config saved to /var/cache/conftool/dbconfig/20220818-080122-ladsgroup.json [08:02:36] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc1013 as pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824398 (https://phabricator.wikimedia.org/T315526) (owner: 10Marostegui) [08:03:19] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc1013 as pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824398 (https://phabricator.wikimedia.org/T315526) (owner: 10Marostegui) [08:04:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32486 and previous config saved to /var/cache/conftool/dbconfig/20220818-080405-marostegui.json [08:05:59] 10SRE, 10SRE-swift-storage: Bump memcache connections and swift-proxy limits - https://phabricator.wikimedia.org/T314914 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This has been deployed! 
[08:06:09] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 4 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10fgiunchedi) [08:07:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:07:43] !log marostegui@deploy1002 Synchronized wmf-config/ProductionServices.php: Promote pc1013 to pc3 master T315526 (duration: 03m 11s) [08:07:46] T315526: Promote pc1014 to pc2 master - https://phabricator.wikimedia.org/T315526 [08:08:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:08:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:09:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:09:26] (03PS1) 10Marostegui: pc1014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/824403 (https://phabricator.wikimedia.org/T315526) [08:09:43] !log dbmaint Promote pc1013 as pc3 master T315526 [08:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:01] (03PS1) 10Ladsgroup: Stop writing to the old templatelinks fields in wikidata and new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824404 (https://phabricator.wikimedia.org/T312865) [08:10:26] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_purge_parsercache_pc3.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:12:15] (03CR) 10Marostegui: [C: 03+2] pc1014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/824403 (https://phabricator.wikimedia.org/T315526) (owner: 10Marostegui) [08:12:30] PROBLEM - SSH on wdqs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:12:46] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:13:06] PROBLEM - SSH on wdqs1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:14:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:15:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:15:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:15:56] (03PS1) 10Marostegui: pc1014: Move it to pc2 [puppet] - 10https://gerrit.wikimedia.org/r/824405 (https://phabricator.wikimedia.org/T315526) [08:16:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:16:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1104 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P32487 and previous config saved to /var/cache/conftool/dbconfig/20220818-081627-ladsgroup.json [08:16:54] (03CR) 10Marostegui: [C: 03+2] pc1014: Move it to pc2 [puppet] - 10https://gerrit.wikimedia.org/r/824405 (https://phabricator.wikimedia.org/T315526) (owner: 10Marostegui) [08:18:33] (03PS1) 10Marostegui: site.pp: Remove insetup from db2179 [puppet] - 10https://gerrit.wikimedia.org/r/824406 (https://phabricator.wikimedia.org/T311494) [08:19:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P32488 and previous config saved to /var/cache/conftool/dbconfig/20220818-081911-marostegui.json [08:19:12] (03CR) 10Ladsgroup: [C: 03+2] Stop writing to the old templatelinks fields in wikidata and new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824404 (https://phabricator.wikimedia.org/T312865) (owner: 10Ladsgroup) [08:20:01] (03Merged) 10jenkins-bot: Stop writing to the old templatelinks fields in wikidata and new wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/824404 (https://phabricator.wikimedia.org/T312865) (owner: 10Ladsgroup) [08:20:21] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove insetup from db2179 [puppet] - 10https://gerrit.wikimedia.org/r/824406 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui) [08:20:56] PROBLEM - Check systemd state on wdqs1015 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:21:07] (03CR) 10Marostegui: [C: 03+1] auto_schema: Change replica_set to be all replicas in all dcs [software] - 10https://gerrit.wikimedia.org/r/820165 (https://phabricator.wikimedia.org/T314486) (owner: 10Ladsgroup) [08:26:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:26:43] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:824404|Stop writing to the old templatelinks fields in wikidata and new wikis (T312865)]] (duration: 03m 20s) [08:26:46] T312865: Turn off writing to the old columns of templatelinks in beta and production - https://phabricator.wikimedia.org/T312865 [08:27:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:27:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:27:56] RECOVERY - Check systemd state on wdqs1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:27:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:29:17] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Change replica_set to be all replicas in all dcs [software] - 10https://gerrit.wikimedia.org/r/820165 (https://phabricator.wikimedia.org/T314486) (owner: 10Ladsgroup) [08:29:58] (03Merged) 10jenkins-bot: auto_schema: Change replica_set to be all replicas in all dcs [software] - 
10https://gerrit.wikimedia.org/r/820165 (https://phabricator.wikimedia.org/T314486) (owner: 10Ladsgroup) [08:30:07] (03PS1) 10Vgutierrez: trafficserver: Upgrade cp5014 and cp5016 to ATS 9 [puppet] - 10https://gerrit.wikimedia.org/r/824408 (https://phabricator.wikimedia.org/T309651) [08:32:34] (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Upgrade cp5014 and cp5016 to ATS 9 [puppet] - 10https://gerrit.wikimedia.org/r/824408 (https://phabricator.wikimedia.org/T309651) (owner: 10Vgutierrez) [08:33:28] !log upgrade to ATS 9.1.3 in cp5014 and cp5016 - T309651 [08:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:32] T309651: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 [08:34:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T312972)', diff saved to https://phabricator.wikimedia.org/P32489 and previous config saved to /var/cache/conftool/dbconfig/20220818-083417-marostegui.json [08:34:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [08:34:21] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [08:34:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [08:34:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [08:34:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [08:35:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T312972)', diff saved to https://phabricator.wikimedia.org/P32490 and previous config saved to /var/cache/conftool/dbconfig/20220818-083505-marostegui.json [08:35:59] (03CR) 10Giuseppe Lavagetto: [C: 03+2] doc: Remove old travis and 
coveralls badge from readme [software/conftool] - 10https://gerrit.wikimedia.org/r/813981 (owner: 10Krinkle) [08:36:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T312972)', diff saved to https://phabricator.wikimedia.org/P32491 and previous config saved to /var/cache/conftool/dbconfig/20220818-083612-marostegui.json [08:37:10] PROBLEM - SSH on wdqs1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:39:18] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1002.eqiad.wmnet [08:39:39] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I don't think this is needed, I just ignored that there is an sreadmins ldap group too." [puppet] - 10https://gerrit.wikimedia.org/r/818061 (owner: 10Volans) [08:39:47] (03Merged) 10jenkins-bot: doc: Remove old travis and coveralls badge from readme [software/conftool] - 10https://gerrit.wikimedia.org/r/813981 (owner: 10Krinkle) [08:41:17] the next thing on my todo list for today is deploying the change for config reload in maint scripts. cc apergos and marostegui [08:41:47] T298485 [08:41:47] T298485: MW scripts should reload the database config - https://phabricator.wikimedia.org/T298485 [08:41:50] 👍 [08:42:03] sounds good! 
[08:42:08] PROBLEM - Check systemd state on wdqs1014 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:48:10] (03CR) 10FNegri: [C: 03+2] ceph: rename CephOSDController to CephOSDNodeController [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823168 (owner: 10David Caro) [08:49:06] PROBLEM - Check systemd state on wdqs1016 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:49:13] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1002.eqiad.wmnet [08:49:22] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1003.eqiad.wmnet [08:50:34] RECOVERY - SSH on wdqs1016 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:50:54] RECOVERY - Check systemd state on wdqs1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:51:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P32492 and previous config saved to /var/cache/conftool/dbconfig/20220818-085118-marostegui.json [08:52:30] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:54:01] (03CR) 10Jbond: "thanks" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [08:54:03] (03PS1) 10Ladsgroup: Simplify wmfEtcdApplyDBConfig() a bit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824411 (https://phabricator.wikimedia.org/T298485) [08:55:04] (03Merged) 10jenkins-bot: ceph: rename 
CephOSDController to CephOSDNodeController [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823168 (owner: 10David Caro) [08:55:30] (03CR) 10CI reject: [V: 04-1] Simplify wmfEtcdApplyDBConfig() a bit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824411 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [08:56:26] PROBLEM - Check systemd state on wdqs1014 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:57:27] (03PS2) 10Ladsgroup: Simplify wmfEtcdApplyDBConfig() a bit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824411 (https://phabricator.wikimedia.org/T298485) [08:58:33] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Degraded RAID on ms-be1054 - https://phabricator.wikimedia.org/T315480 (10MatthewVernon) [08:59:21] (03CR) 10Ladsgroup: [C: 03+2] "Terrible diff... It's just removing an indention" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824411 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [08:59:38] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1003.eqiad.wmnet [08:59:48] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1004.eqiad.wmnet [09:00:13] (03PS1) 10Jbond: O:phabricator: move common settings to role hiera [puppet] - 10https://gerrit.wikimedia.org/r/824412 [09:00:18] RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:00:30] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:00:46] (03Merged) 10jenkins-bot: Simplify wmfEtcdApplyDBConfig() a bit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824411 
(https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [09:01:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36805/console" [puppet] - 10https://gerrit.wikimedia.org/r/824412 (owner: 10Jbond) [09:02:18] (03CR) 10Jbond: "thanks, see comment" [puppet] - 10https://gerrit.wikimedia.org/r/823755 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [09:03:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:04:27] (03CR) 10JMeybohm: [C: 03+2] httpbb docker-registry: Fix catalog check [puppet] - 10https://gerrit.wikimedia.org/r/824394 (owner: 10JMeybohm) [09:04:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:04:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:04:46] (03CR) 10Jbond: phabricator::migration: add phd with systemd::sysuser, reserve UID 920 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/823765 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [09:05:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:06:03] !log ladsgroup@deploy1002 Synchronized wmf-config/etcd.php: Config: [[gerrit:824411|Simplify wmfEtcdApplyDBConfig() a bit (T298485)]], Part I (duration: 03m 02s) [09:06:07] T298485: MW scripts should reload the database config - https://phabricator.wikimedia.org/T298485 [09:06:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P32493 and previous config saved to /var/cache/conftool/dbconfig/20220818-090624-marostegui.json [09:06:46] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Degraded RAID on ms-be1054 - https://phabricator.wikimedia.org/T315480 (10MatthewVernon) Hi, there is indeed a failed drive: ` Array F Logical Drive: 6 Size: 3.64 TB Fault Tolerance: 0 
Strip Size: 256 KB Full S... [09:06:50] RECOVERY - SSH on wdqs1014 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:07:42] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Degraded RAID on ms-be1054 - https://phabricator.wikimedia.org/T315480 (10MatthewVernon) The failed slot number is in the original message, but also from `show config`: ` Array F (SATA, logicaldrive 6 (3.64 TB, RAID 0, Failed) physicaldriv... [09:09:07] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1004.eqiad.wmnet [09:09:45] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner2002.codfw.wmnet [09:10:05] !log ladsgroup@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:824411|Simplify wmfEtcdApplyDBConfig() a bit (T298485)]], Part II (duration: 03m 11s) [09:10:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:11:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:11:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:12:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:17:06] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:18:31] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner2002.codfw.wmnet [09:19:31] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner2003.codfw.wmnet [09:21:07] (03PS4) 10Matthias Mullie: Schedule image suggestions notifications [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024) [09:21:31] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1146:3314 (T312972)', diff saved to https://phabricator.wikimedia.org/P32494 and previous config saved to /var/cache/conftool/dbconfig/20220818-092130-marostegui.json [09:21:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [09:21:34] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [09:21:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [09:21:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:22:02] RECOVERY - Check systemd state on wdqs1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:22:14] o/ I'm looking for someone who can CR this patch to schedule a script to be run on a weekly basis: https://gerrit.wikimedia.org/r/c/operations/puppet/+/811312 [09:22:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:22:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T312972)', diff saved to https://phabricator.wikimedia.org/P32495 and previous config saved to /var/cache/conftool/dbconfig/20220818-092219-marostegui.json [09:22:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T312972)', diff saved to https://phabricator.wikimedia.org/P32496 and previous config saved to /var/cache/conftool/dbconfig/20220818-092226-marostegui.json [09:28:54] (03PS1) 10Ladsgroup: Allow passing arguments to wmfEtcdApplyDBConfig() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824413 (https://phabricator.wikimedia.org/T298485) [09:29:44] (03CR) 
10CI reject: [V: 04-1] Allow passing arguments to wmfEtcdApplyDBConfig() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824413 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [09:29:51] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner2003.codfw.wmnet [09:30:10] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:51] (03PS2) 10Ladsgroup: Allow passing arguments to wmfEtcdApplyDBConfig() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824413 (https://phabricator.wikimedia.org/T298485) [09:31:45] (03CR) 10CI reject: [V: 04-1] Allow passing arguments to wmfEtcdApplyDBConfig() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824413 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [09:32:34] (03CR) 10Majavah: [C: 04-1] Schedule image suggestions notifications (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024) (owner: 10Matthias Mullie) [09:33:03] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner2004.codfw.wmnet [09:33:07] (03PS3) 10Ladsgroup: Allow passing arguments to wmfEtcdApplyDBConfig() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824413 (https://phabricator.wikimedia.org/T298485) [09:34:03] <_joe_> !log updating vopsbot to 0.3.0 [09:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:49] 10SRE-tools, 10Spicerack: IcingaHosts.wait_for_downtimed() does not honor dry_run - https://phabricator.wikimedia.org/T315537 (10JMeybohm) p:05Triage→03Medium [09:37:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P32497 and previous config saved to /var/cache/conftool/dbconfig/20220818-093732-marostegui.json [09:43:34] !log 
jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner2004.codfw.wmnet [09:44:44] !log dnsdisc depooling codfw for services running in kubernetes cluster (for 30-60min due to T310483, T260661) [09:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:48] T260661: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 [09:44:53] (03CR) 10FNegri: "Nice work! 👍🏻" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823169 (owner: 10David Caro) [09:46:56] (03CR) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [09:47:43] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:49:30] (03CR) 10Ladsgroup: [C: 03+2] Allow passing arguments to wmfEtcdApplyDBConfig() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824413 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [09:49:32] (03PS1) 10Jbond: C:rancid: Drop unneeded dependencies [puppet] - 10https://gerrit.wikimedia.org/r/824417 (https://phabricator.wikimedia.org/T314936) [09:49:36] (03CR) 10Marostegui: "Can we get https://phabricator.wikimedia.org/T315427 done first?" 
[puppet] - 10https://gerrit.wikimedia.org/r/824039 (https://phabricator.wikimedia.org/T315271) (owner: 10Tim Starling) [09:50:02] (03PS4) 10Cathal Mooney: Add SSH key for user mikeraish [puppet] - 10https://gerrit.wikimedia.org/r/824245 (https://phabricator.wikimedia.org/T313429) [09:50:15] (03Merged) 10jenkins-bot: Allow passing arguments to wmfEtcdApplyDBConfig() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824413 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [09:50:52] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 9 NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36806/console" [puppet] - 10https://gerrit.wikimedia.org/r/800219 (https://phabricator.wikimedia.org/T308639) (owner: 10Jbond) [09:51:14] (03CR) 10FNegri: [C: 03+1] "LGTM!" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823666 (owner: 10David Caro) [09:52:13] (03PS2) 10Jbond: C:rancid: Drop unneeded dependencies [puppet] - 10https://gerrit.wikimedia.org/r/824417 (https://phabricator.wikimedia.org/T314936) [09:52:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:52:38] 10SRE, 10SRE-Access-Requests, 10serviceops: Move Clement Goubert to ops - https://phabricator.wikimedia.org/T315538 (10Joe) [09:52:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P32498 and previous config saved to /var/cache/conftool/dbconfig/20220818-095238-marostegui.json [09:52:51] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:52:53] 10SRE, 10SRE-Access-Requests, 10serviceops: Move Clement Goubert to ops - https://phabricator.wikimedia.org/T315538 (10Joe) [09:53:00] (03CR) 10Cathal Mooney: [C: 03+1] "Looks good thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/824299 (https://phabricator.wikimedia.org/T314936) (owner: 10Andrea Denisse) [09:53:02] 10SRE, 10SRE-Access-Requests, 10serviceops: Move Clement Goubert to ops - https://phabricator.wikimedia.org/T315538 (10Joe) p:05Triage→03Medium [09:53:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36808/console" [puppet] - 10https://gerrit.wikimedia.org/r/824417 (https://phabricator.wikimedia.org/T314936) (owner: 10Jbond) [09:53:26] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:53:26] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [09:53:28] (03CR) 10Jbond: [C: 03+1] "LGTM, no blocking comments, see follow up patch" [puppet] - 10https://gerrit.wikimedia.org/r/824299 (https://phabricator.wikimedia.org/T314936) (owner: 10Andrea Denisse) [09:53:43] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [09:54:02] 10SRE, 10SRE-Access-Requests, 10serviceops: Move Clement Goubert to ops - https://phabricator.wikimedia.org/T315538 (10LSobanski) Approved. [09:56:46] (03CR) 10Cathal Mooney: [C: 03+1] "Makes sense to my untrained eyes! 
Really good example too very educational :)" [puppet] - 10https://gerrit.wikimedia.org/r/824417 (https://phabricator.wikimedia.org/T314936) (owner: 10Jbond) [09:56:50] !log ladsgroup@deploy1002 Synchronized wmf-config/etcd.php: Config: [[gerrit:824413|Allow passing arguments to wmfEtcdApplyDBConfig() (T298485)]] (duration: 03m 40s) [09:56:54] T298485: MW scripts should reload the database config - https://phabricator.wikimedia.org/T298485 [09:58:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:58:50] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [09:59:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:59:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:00:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:00:05] mvolz: gettimeofday() says it's time for Services – Citoid / Zotero. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220818T1000) [10:00:35] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [10:01:54] (03CR) 10David Caro: global: add inventory module (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823169 (owner: 10David Caro) [10:03:22] !log jayme@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-codfw [10:07:17] (03PS1) 10Ladsgroup: Call wmfApplyEtcdDBConfig() directly in CS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824418 (https://phabricator.wikimedia.org/T298485) [10:07:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T312972)', diff saved to https://phabricator.wikimedia.org/P32499 and previous config saved to /var/cache/conftool/dbconfig/20220818-100744-marostegui.json [10:07:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [10:07:49] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [10:08:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [10:08:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T312972)', diff saved to https://phabricator.wikimedia.org/P32500 and previous config saved to /var/cache/conftool/dbconfig/20220818-100806-marostegui.json [10:08:14] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [10:09:03] (03CR) 10Ladsgroup: [C: 03+2] Call wmfApplyEtcdDBConfig() directly in CS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824418 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [10:09:54] (03Merged) 10jenkins-bot: Call wmfApplyEtcdDBConfig() directly in CS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824418 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [10:10:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T312972)', diff saved to https://phabricator.wikimedia.org/P32501 and previous config saved to /var/cache/conftool/dbconfig/20220818-101013-marostegui.json [10:11:24] (03CR) 10Mvolz: [C: 03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/824233 (owner: 10PipelineBot) [10:13:39] (03CR) 10FNegri: "Left a couple of minor comments" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823667 (owner: 10David Caro) [10:14:30] (03CR) 10FNegri: [C: 03+1] "👍🏻" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823668 (owner: 10David Caro) [10:14:59] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/824233 (owner: 10PipelineBot) [10:15:03] (03CR) 10FNegri: [C: 03+1] "👍🏻" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823669 (owner: 10David Caro) [10:15:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:15:32] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [10:15:59] !log ladsgroup@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:824418|Call wmfApplyEtcdDBConfig() directly in CS.php (T298485)]] (duration: 03m 46s) [10:16:02] T298485: MW scripts should reload the database config - https://phabricator.wikimedia.org/T298485 [10:16:11] (03CR) 10FNegri: [C: 03+1] "👍🏻" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823670 (owner: 10David Caro) [10:16:25] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [10:16:49] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [10:16:56] (03PS1) 10Ladsgroup: Drop now-unused wmfEtcdApplyDBConfig() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824419 (https://phabricator.wikimedia.org/T298485) [10:17:26] (03CR) 10FNegri: [C: 03+1] "👍🏻" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823671 (owner: 10David Caro) [10:18:51] (03CR) 10FNegri: [C: 03+2] ceph.bootstrap_and_add: add --force option [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824149 (owner: 10David Caro) [10:19:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:19:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:19:49] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [10:20:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:20:38] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [10:22:10] !log kubernetes2005:~$ sudo systemctl status ifup@ens13.service - T273026 [10:22:12] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [10:22:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:16] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [10:22:35] (03CR) 10Ladsgroup: [C: 03+2] Drop now-unused wmfEtcdApplyDBConfig() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824419 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [10:22:37] (03CR) 10FNegri: global: add inventory module (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823169 (owner: 10David Caro) [10:22:59] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [10:23:21] (03Merged) 10jenkins-bot: Drop now-unused wmfEtcdApplyDBConfig() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824419 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [10:23:32] (03PS3) 10Hnowlan: Basic blubber file for thumbor [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/813613 (https://phabricator.wikimedia.org/T312104) [10:24:02] (03PS4) 10Hnowlan: Basic blubber file for thumbor [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/813613 (https://phabricator.wikimedia.org/T312104) [10:24:38] jouncebot: nowandnext [10:24:39] For the next 0 hour(s) and 35 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220818T1000) [10:24:39] In 2 hour(s) and 35 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220818T1300) [10:24:39] In 2 hour(s) and 35 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220818T1300) [10:25:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P32503 and previous config saved to /var/cache/conftool/dbconfig/20220818-102519-marostegui.json [10:25:21] (03PS1) 10Kevin Bazira: ml-services: Add kowiki, srwiki, ukwiki & viwiki drafttopic isvcs 
to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/824420 (https://phabricator.wikimedia.org/T314456) [10:25:44] (03PS1) 10David Caro: cloud: reformat cloud.yaml with prettier [puppet] - 10https://gerrit.wikimedia.org/r/824421 [10:25:46] (03PS1) 10David Caro: p:ceph::osd: get the os disks by size [puppet] - 10https://gerrit.wikimedia.org/r/824422 [10:25:48] (03PS1) 10David Caro: ceph::osd: add new disks model to disable write caches for [puppet] - 10https://gerrit.wikimedia.org/r/824423 [10:27:29] (03CR) 10FNegri: [C: 03+2] cloud: reformat cloud.yaml with prettier [puppet] - 10https://gerrit.wikimedia.org/r/824421 (owner: 10David Caro) [10:27:44] !log ladsgroup@deploy1002 Synchronized wmf-config/etcd.php: Config: [[gerrit:824419|Drop now-unused wmfEtcdApplyDBConfig() (T298485)]] (duration: 03m 36s) [10:27:51] T298485: MW scripts should reload the database config - https://phabricator.wikimedia.org/T298485 [10:29:39] (03PS2) 10David Caro: cloud: reformat cloud.yaml with prettier [puppet] - 10https://gerrit.wikimedia.org/r/824421 (https://phabricator.wikimedia.org/T314870) [10:29:41] (03PS2) 10David Caro: p:ceph::osd: get the os disks by size [puppet] - 10https://gerrit.wikimedia.org/r/824422 (https://phabricator.wikimedia.org/T314870) [10:29:43] (03PS2) 10David Caro: ceph::osd: add new disks model to disable write caches for [puppet] - 10https://gerrit.wikimedia.org/r/824423 (https://phabricator.wikimedia.org/T314870) [10:30:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:31:02] (03PS1) 10Reedy: Fix DenyListManager singleton [extensions/StopForumSpam] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824426 (https://phabricator.wikimedia.org/T315447) [10:31:18] (03CR) 10Addshore: [C: 03+1] Fix DenyListManager singleton [extensions/StopForumSpam] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824426 (https://phabricator.wikimedia.org/T315447) (owner: 10Reedy) [10:31:30] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:31:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:31:52] (03CR) 10Reedy: [C: 03+2] Fix DenyListManager singleton [extensions/StopForumSpam] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824426 (https://phabricator.wikimedia.org/T315447) (owner: 10Reedy) [10:32:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:32:34] (03PS1) 10Filippo Giunchedi: pontoon: generate en_US.UTF-8 locale [puppet] - 10https://gerrit.wikimedia.org/r/824446 (https://phabricator.wikimedia.org/T313229) [10:32:34] gerrit spam incoming, sorry folks [10:32:37] (03PS1) 10Filippo Giunchedi: postgresql: resync_replica improvements and fixes [puppet] - 10https://gerrit.wikimedia.org/r/824447 (https://phabricator.wikimedia.org/T313229) [10:32:39] (03PS1) 10Filippo Giunchedi: WIP dispatch: add database role [puppet] - 10https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229) [10:32:42] (03PS1) 10Filippo Giunchedi: WIP: add profile::dispatch [puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) [10:32:43] (03PS1) 10Filippo Giunchedi: docker: use ExecStartPre to implement --pull=always [puppet] - 10https://gerrit.wikimedia.org/r/824450 (https://phabricator.wikimedia.org/T313229) [10:32:46] (03PS1) 10Filippo Giunchedi: service: use --env-file for docker [puppet] - 10https://gerrit.wikimedia.org/r/824451 (https://phabricator.wikimedia.org/T313229) [10:33:04] standing by for -1s from CI [10:34:44] such confidence :) [10:34:50] (03PS1) 10MMandere: utils: Add latency measurement program [dns] - 10https://gerrit.wikimedia.org/r/824452 (https://phabricator.wikimedia.org/T315536) [10:34:55] much wow! 
[10:35:45] Reedy: I have tested the patches already in cloud vps/pontoon so I know they work "in the real world" and it'll be probably sth silly like arrow alignment [10:37:38] (03CR) 10CI reject: [V: 04-1] WIP: add profile::dispatch [puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [10:37:43] !log kubernetes2006:~$ sudo systemctl reset-failed ifup@ens13.service - T273026 [10:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:47] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [10:37:57] (03PS4) 10Ladsgroup: Allow DB config reload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801721 (https://phabricator.wikimedia.org/T298485) (owner: 10Daniel Kinzler) [10:39:18] (03PS19) 10Jbond: elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: 10Ebernhardson) [10:39:53] (03CR) 10CI reject: [V: 04-1] WIP dispatch: add database role [puppet] - 10https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [10:40:01] (03Merged) 10jenkins-bot: Fix DenyListManager singleton [extensions/StopForumSpam] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824426 (https://phabricator.wikimedia.org/T315447) (owner: 10Reedy) [10:40:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P32504 and previous config saved to /var/cache/conftool/dbconfig/20220818-104025-marostegui.json [10:40:30] (03CR) 10CI reject: [V: 04-1] service: use --env-file for docker [puppet] - 10https://gerrit.wikimedia.org/r/824451 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [10:41:12] (03CR) 10CI reject: [V: 04-1] utils: Add latency measurement program [dns] 
- 10https://gerrit.wikimedia.org/r/824452 (https://phabricator.wikimedia.org/T315536) (owner: 10MMandere) [10:41:41] (03CR) 10Ladsgroup: [C: 03+2] Allow DB config reload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801721 (https://phabricator.wikimedia.org/T298485) (owner: 10Daniel Kinzler) [10:41:43] (03CR) 10Jbond: [C: 03+2] elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: 10Ebernhardson) [10:43:07] (03Merged) 10jenkins-bot: Allow DB config reload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801721 (https://phabricator.wikimedia.org/T298485) (owner: 10Daniel Kinzler) [10:45:03] (03PS2) 10Filippo Giunchedi: pontoon: generate en_US.UTF-8 locale [puppet] - 10https://gerrit.wikimedia.org/r/824446 (https://phabricator.wikimedia.org/T313229) [10:45:05] (03PS2) 10Filippo Giunchedi: postgresql: resync_replica improvements and fixes [puppet] - 10https://gerrit.wikimedia.org/r/824447 (https://phabricator.wikimedia.org/T313229) [10:45:07] (03PS2) 10Filippo Giunchedi: WIP dispatch: add database role [puppet] - 10https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229) [10:45:09] (03PS2) 10Filippo Giunchedi: WIP: add profile::dispatch [puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) [10:45:11] (03PS2) 10Filippo Giunchedi: docker: use ExecStartPre to implement --pull=always [puppet] - 10https://gerrit.wikimedia.org/r/824450 (https://phabricator.wikimedia.org/T313229) [10:45:13] (03PS2) 10Filippo Giunchedi: service: use --env-file for docker [puppet] - 10https://gerrit.wikimedia.org/r/824451 (https://phabricator.wikimedia.org/T313229) [10:45:15] (03PS1) 10Filippo Giunchedi: pontoon: initialize new stack o11y-dispatch [puppet] - 10https://gerrit.wikimedia.org/r/824455 [10:45:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1166', diff saved to 
https://phabricator.wikimedia.org/P32505 and previous config saved to /var/cache/conftool/dbconfig/20220818-104552-ladsgroup.json [10:45:53] !log reedy@deploy1002 Synchronized php-1.39.0-wmf.25/extensions/StopForumSpam/includes/: T315447 (duration: 03m 36s) [10:45:58] T315447: Investigate caching of serialisation of StopForumSpam IPSet - https://phabricator.wikimedia.org/T315447 [10:46:19] (03Abandoned) 10Filippo Giunchedi: pontoon: initialize new stack o11y-dispatch [puppet] - 10https://gerrit.wikimedia.org/r/824455 (owner: 10Filippo Giunchedi) [10:47:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repool db1166', diff saved to https://phabricator.wikimedia.org/P32506 and previous config saved to /var/cache/conftool/dbconfig/20220818-104731-ladsgroup.json [10:47:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:48:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:48:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:48:53] PROBLEM - puppet last run on netboxdb2002 is CRITICAL: CRITICAL: Puppet has been disabled for 606241 seconds, message: test postgress config - jbond, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:49:33] (03Merged) 10jenkins-bot: elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) (owner: 10Ebernhardson) [10:49:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:51:27] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: generate en_US.UTF-8 locale [puppet] - 10https://gerrit.wikimedia.org/r/824446 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [10:51:32] (03PS3) 10Filippo Giunchedi: pontoon: generate en_US.UTF-8 locale [puppet] - 10https://gerrit.wikimedia.org/r/824446 
(https://phabricator.wikimedia.org/T313229) [10:52:37] go jayme [10:52:43] lolno.mkv [10:53:28] * jayme feels pretty encouraged [10:53:57] (03PS3) 10David Caro: p:ceph::osd: get the os disks by size [puppet] - 10https://gerrit.wikimedia.org/r/824422 (https://phabricator.wikimedia.org/T314870) [10:53:59] (03PS3) 10David Caro: ceph::osd: add new disks model to disable write caches for [puppet] - 10https://gerrit.wikimedia.org/r/824423 (https://phabricator.wikimedia.org/T314870) [10:54:19] (03PS4) 10David Caro: global: add inventory module [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823169 (https://phabricator.wikimedia.org/T314870) [10:54:20] haha! [10:54:21] (03CR) 10David Caro: global: add inventory module (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823169 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [10:54:23] (03PS3) 10David Caro: Openstack: use cluster_name instead of control node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823666 (https://phabricator.wikimedia.org/T314870) [10:54:25] (03PS3) 10David Caro: ceph: use cluster_name instead of control node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823667 (https://phabricator.wikimedia.org/T314870) [10:54:27] (03CR) 10David Caro: ceph: use cluster_name instead of control node (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823667 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [10:54:29] (03PS3) 10David Caro: ceph: use human-readable names for ceph clusters [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823668 (https://phabricator.wikimedia.org/T314870) [10:54:31] (03PS3) 10David Caro: ceph: use the correct codfw ceph mon hosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823669 (https://phabricator.wikimedia.org/T314870) [10:54:33] (03PS3) 10David Caro: ceph,opensatck: use the inventory to get the nodes domain [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823670 
(https://phabricator.wikimedia.org/T314870) [10:54:35] (03PS3) 10David Caro: ceph: add roll_restart_osd_daemons cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823671 (https://phabricator.wikimedia.org/T314870) [10:54:38] (03PS5) 10David Caro: ceph.bootstrap_and_add: add --force option [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824149 (https://phabricator.wikimedia.org/T314870) [10:54:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:54:40] (03PS4) 10David Caro: WIP: adding support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) [10:54:49] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36810/console" [puppet] - 10https://gerrit.wikimedia.org/r/824423 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [10:55:07] (03CR) 10Filippo Giunchedi: [V: 03+2] pontoon: generate en_US.UTF-8 locale [puppet] - 10https://gerrit.wikimedia.org/r/824446 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [10:55:14] !log kubernetes2016:~$ sudo systemctl reset-failed ifup@ens13.service - T273026 [10:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:17] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [10:55:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T312972)', diff saved to https://phabricator.wikimedia.org/P32508 and previous config saved to /var/cache/conftool/dbconfig/20220818-105531-marostegui.json [10:55:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [10:55:35] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [10:55:37] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:55:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:55:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [10:55:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T312972)', diff saved to https://phabricator.wikimedia.org/P32509 and previous config saved to /var/cache/conftool/dbconfig/20220818-105552-marostegui.json [10:56:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:56:35] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [11:00:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T312972)', diff saved to https://phabricator.wikimedia.org/P32510 and previous config saved to /var/cache/conftool/dbconfig/20220818-110000-marostegui.json [11:00:09] PROBLEM - Check systemd state on kubernetes2015 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:22] (03CR) 10Filippo Giunchedi: "This is WIP though basically functional now, I wanted to get your input/feedback on it" [puppet] - 10https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [11:00:34] (03CR) 10Stang: "Hi Amir, it has been more than one month and I do get some response from the community, so I would like to re-schedule this patch. 
Could y" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816239 (https://phabricator.wikimedia.org/T313657) (owner: 10Stang) [11:00:54] !log kubernetes2015:~$ sudo systemctl reset-failed ifup@ens13.service - T273026 [11:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:58] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [11:02:23] RECOVERY - Check systemd state on kubernetes2015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:17] (03CR) 10Hnowlan: [C: 03+2] Add the instruction to debug the tests using PyCharm IDE [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/823705 (https://phabricator.wikimedia.org/T314657) (owner: 10Vlad.shapik) [11:07:25] (03Merged) 10jenkins-bot: Add the instruction to debug the tests using PyCharm IDE [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/823705 (https://phabricator.wikimedia.org/T314657) (owner: 10Vlad.shapik) [11:07:55] (03PS5) 10David Caro: WIP: adding support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) [11:07:57] (03PS1) 10David Caro: ceph.bootstrapp_and_add: don't rely on sda/sdb being the os disks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824457 (https://phabricator.wikimedia.org/T314870) [11:08:53] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [11:09:37] (03CR) 10David Caro: p:ceph::osd: get the os disks by size (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824422 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [11:12:54] (03PS1) 10Ladsgroup: Fix call to 
wmfApplyEtcdDBConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824460 (https://phabricator.wikimedia.org/T298485) [11:14:12] (03CR) 10Ladsgroup: [C: 03+2] Fix call to wmfApplyEtcdDBConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824460 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [11:14:21] jouncebot: nowandnext [11:14:21] No deployments scheduled for the next 1 hour(s) and 45 minute(s) [11:14:21] In 1 hour(s) and 45 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220818T1300) [11:14:21] In 1 hour(s) and 45 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220818T1300) [11:15:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P32511 and previous config saved to /var/cache/conftool/dbconfig/20220818-111506-marostegui.json [11:15:08] (03Merged) 10jenkins-bot: Fix call to wmfApplyEtcdDBConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824460 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [11:18:53] (03PS1) 10Marostegui: mariadb: Productionize db2180 [puppet] - 10https://gerrit.wikimedia.org/r/824468 (https://phabricator.wikimedia.org/T311494) [11:19:45] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2180 [puppet] - 10https://gerrit.wikimedia.org/r/824468 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui) [11:20:25] (03CR) 10CI reject: [V: 04-1] ceph: add roll_restart_osd_daemons cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823671 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [11:20:38] (03CR) 10CI reject: [V: 04-1] ceph: use the correct codfw ceph mon hosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823669 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [11:20:53] (03CR) 10CI reject: [V: 04-1] ceph: use 
human-readable names for ceph clusters [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823668 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [11:21:19] (03CR) 10CI reject: [V: 04-1] ceph,opensatck: use the inventory to get the nodes domain [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823670 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [11:21:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:21:49] (03CR) 10CI reject: [V: 04-1] ceph.bootstrap_and_add: add --force option [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824149 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [11:21:55] (03CR) 10CI reject: [V: 04-1] WIP: adding support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [11:22:17] (03CR) 10CI reject: [V: 04-1] ceph: use cluster_name instead of control node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823667 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [11:22:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:22:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:24:19] (03CR) 10CI reject: [V: 04-1] WIP: adding support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [11:24:58] (03CR) 10CI reject: [V: 04-1] ceph.bootstrapp_and_add: don't rely on sda/sdb being the os disks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824457 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [11:29:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:30:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to 
https://phabricator.wikimedia.org/P32513 and previous config saved to /var/cache/conftool/dbconfig/20220818-113012-marostegui.json [11:35:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'depool db1112', diff saved to https://phabricator.wikimedia.org/P32514 and previous config saved to /var/cache/conftool/dbconfig/20220818-113556-ladsgroup.json [11:36:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repool db1112', diff saved to https://phabricator.wikimedia.org/P32515 and previous config saved to /var/cache/conftool/dbconfig/20220818-113655-ladsgroup.json [11:37:52] (03PS1) 10Jbond: sre.hardware.firmware-upgrade: power on server for firmware updates [cookbooks] - 10https://gerrit.wikimedia.org/r/824472 [11:41:39] (03PS1) 10Hnowlan: admin: add thumbor namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/824473 (https://phabricator.wikimedia.org/T233196) [11:43:40] (03PS27) 10Jbond: PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [11:43:49] (03CR) 10Jbond: [C: 03+2] PeeringDB API: initial commit [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [11:44:21] (03CR) 10CI reject: [V: 04-1] sre.hardware.firmware-upgrade: power on server for firmware updates [cookbooks] - 10https://gerrit.wikimedia.org/r/824472 (owner: 10Jbond) [11:45:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T312972)', diff saved to https://phabricator.wikimedia.org/P32516 and previous config saved to /var/cache/conftool/dbconfig/20220818-114518-marostegui.json [11:45:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:45:22] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [11:45:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet 
with reason: Maintenance [11:45:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [11:45:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [11:45:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T312972)', diff saved to https://phabricator.wikimedia.org/P32517 and previous config saved to /var/cache/conftool/dbconfig/20220818-114555-marostegui.json [11:46:47] (03CR) 10Jbond: "before merging this we should try installing the most recent pynetbox on one of the cumin hosts and testing." [software/spicerack] - 10https://gerrit.wikimedia.org/r/820806 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi) [11:47:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T312972)', diff saved to https://phabricator.wikimedia.org/P32518 and previous config saved to /var/cache/conftool/dbconfig/20220818-114702-marostegui.json [11:47:51] Is there a guide I'm failing to find on how/when to add things to `PrivateSettings.php`? 
ref T315491 [11:47:52] T315491: Add $wgPhonosApiKeyGoogle to PrivateSettings - https://phabricator.wikimedia.org/T315491 [11:52:02] (03PS1) 10Marostegui: site.pp: Remove insetup from db2180 [puppet] - 10https://gerrit.wikimedia.org/r/824475 (https://phabricator.wikimedia.org/T311494) [11:53:28] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove insetup from db2180 [puppet] - 10https://gerrit.wikimedia.org/r/824475 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui) [11:54:05] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:54] (03PS1) 10Ladsgroup: Disable LBFactory config callback for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824476 (https://phabricator.wikimedia.org/T298485) [11:55:21] (03PS1) 10Marostegui: Revert "control-mariadb-10.6-bullseye: Drop libaio1" [software] - 10https://gerrit.wikimedia.org/r/824428 [11:55:28] (03CR) 10Hnowlan: [C: 03+1] "Nice, lgtm." 
[puppet] - 10https://gerrit.wikimedia.org/r/824447 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [11:55:32] !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [11:55:52] (03CR) 10CI reject: [V: 04-1] Disable LBFactory config callback for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824476 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [11:56:53] (03PS2) 10Ladsgroup: Disable LBFactory config callback for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824476 (https://phabricator.wikimedia.org/T298485) [11:57:46] (03CR) 10CI reject: [V: 04-1] Disable LBFactory config callback for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824476 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [11:59:19] (03CR) 10Marostegui: [C: 03+2] Revert "control-mariadb-10.6-bullseye: Drop libaio1" [software] - 10https://gerrit.wikimedia.org/r/824428 (owner: 10Marostegui) [11:59:30] (03PS3) 10Ladsgroup: Disable LBFactory config callback for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824476 (https://phabricator.wikimedia.org/T298485) [12:00:03] (03Merged) 10jenkins-bot: Revert "control-mariadb-10.6-bullseye: Drop libaio1" [software] - 10https://gerrit.wikimedia.org/r/824428 (owner: 10Marostegui) [12:00:10] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:46] (03CR) 10Giuseppe Lavagetto: [C: 03+1] improvement: display update-known-host-productions zsh hint on macOS [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/821770 (owner: 10Clément Goubert) [12:02:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P32519 and previous config saved to /var/cache/conftool/dbconfig/20220818-120208-marostegui.json [12:03:13] (03PS2) 10Jbond: 
sre.hardware.firmware-upgrade: power on server for firmware updates [cookbooks] - 10https://gerrit.wikimedia.org/r/824472 [12:04:56] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [12:05:08] (03PS5) 10Matthias Mullie: Schedule image suggestions notifications [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024) [12:05:15] (03CR) 10Matthias Mullie: Schedule image suggestions notifications (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024) (owner: 10Matthias Mullie) [12:05:56] (03CR) 10Ladsgroup: [C: 03+2] Disable LBFactory config callback for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824476 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [12:06:14] (03CR) 10CI reject: [V: 04-1] sre.hardware.firmware-upgrade: power on server for firmware updates [cookbooks] - 10https://gerrit.wikimedia.org/r/824472 (owner: 10Jbond) [12:06:20] (03PS4) 10David Caro: ceph: use cluster_name instead of control node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823667 (https://phabricator.wikimedia.org/T314870) [12:06:22] (03PS4) 10David Caro: ceph: use human-readable names for ceph clusters [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823668 (https://phabricator.wikimedia.org/T314870) [12:06:24] (03PS4) 10David Caro: ceph: use the correct codfw ceph mon hosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823669 (https://phabricator.wikimedia.org/T314870) [12:06:26] (03PS4) 10David Caro: ceph,opensatck: use the inventory to get the nodes domain [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823670 (https://phabricator.wikimedia.org/T314870) [12:06:28] (03PS4) 10David Caro: ceph: add roll_restart_osd_daemons cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823671 (https://phabricator.wikimedia.org/T314870) [12:06:30] (03PS6) 10David Caro: 
ceph.bootstrap_and_add: add --force option [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824149 (https://phabricator.wikimedia.org/T314870) [12:06:32] (03PS2) 10David Caro: ceph.bootstrapp_and_add: don't rely on sda/sdb being the os disks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824457 (https://phabricator.wikimedia.org/T314870) [12:06:34] (03PS6) 10David Caro: WIP: adding support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) [12:06:44] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:06:52] (03Merged) 10jenkins-bot: Disable LBFactory config callback for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824476 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [12:08:02] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:09:18] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:10:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:11:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:11:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:12:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:14:33] (03PS3) 10Jbond: sre.hardware.firmware-upgrade: power on server for firmware updates [cookbooks] - 10https://gerrit.wikimedia.org/r/824472 [12:15:04] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P32520 and previous config saved to /var/cache/conftool/dbconfig/20220818-121714-marostegui.json [12:17:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:18:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:18:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:18:38] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:18:47] this is me [12:19:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:19:42] (03CR) 10CI reject: [V: 04-1] sre.hardware.firmware-upgrade: power on server for firmware updates [cookbooks] - 10https://gerrit.wikimedia.org/r/824472 (owner: 10Jbond) [12:19:56] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Link from lsw1-e1-eqiad to lsw1-f2-eqiad down - https://phabricator.wikimedia.org/T315052 (10cmooney) @Cmjohnson @Jclark-ctr can we get someone to visit the DC urgently to look at this? I'm concerned we have a link down between core devices for o... 
[12:20:06] (03PS1) 10Marostegui: control-mariadb-10.6-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/824481 (https://phabricator.wikimedia.org/T315411) [12:20:18] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:20:24] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:21:37] (03PS3) 10Reedy: labs: Update StopFormSpam config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824168 [12:21:39] (03PS1) 10Reedy: wmf-config: Move wgSFSRepoyOnly to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824482 [12:21:42] jouncebot: nowandnext [12:21:42] No deployments scheduled for the next 0 hour(s) and 38 minute(s) [12:21:42] In 0 hour(s) and 38 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220818T1300) [12:21:43] In 0 hour(s) and 38 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220818T1300) [12:21:54] great typo there [12:22:21] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (10cmooney) @MRaishWMF this should be working for you now if you want to give it a try and report back. Thanks! 
[12:22:32] (03PS2) 10Reedy: wmf-config: Move wgSFSReportOnly to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824482 [12:22:34] (03PS4) 10Reedy: labs: Update StopFormSpam config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824168 [12:23:00] (03CR) 10Reedy: [C: 03+2] wmf-config: Move wgSFSReportOnly to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824482 (owner: 10Reedy) [12:24:00] (03Merged) 10jenkins-bot: wmf-config: Move wgSFSReportOnly to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824482 (owner: 10Reedy) [12:24:17] (03PS5) 10Reedy: labs: Update StopForumSpam config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824168 [12:24:47] (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.6-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/824481 (https://phabricator.wikimedia.org/T315411) (owner: 10Marostegui) [12:24:53] (03CR) 10Reedy: [C: 03+2] labs: Update StopForumSpam config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824168 (owner: 10Reedy) [12:25:08] !log Install 10.6.9 on pc1014 [12:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:28] (03Merged) 10jenkins-bot: control-mariadb-10.6-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/824481 (https://phabricator.wikimedia.org/T315411) (owner: 10Marostegui) [12:25:41] (03Merged) 10jenkins-bot: labs: Update StopForumSpam config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824168 (owner: 10Reedy) [12:27:02] (03PS1) 10Vgutierrez: trafficserver: Set transaction_active_timeout_out on cp4026 and cp4032 [puppet] - 10https://gerrit.wikimedia.org/r/824484 (https://phabricator.wikimedia.org/T315533) [12:27:18] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10cmooney) [12:27:36] RECOVERY - Check systemd state on mw2373 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:52] !log reedy@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Set wgSFSReportOnly in here (duration: 03m 27s) [12:29:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:29:30] (03PS2) 10Vgutierrez: trafficserver: Set transaction_active_timeout_out on cp4026 and cp4032 [puppet] - 10https://gerrit.wikimedia.org/r/824484 (https://phabricator.wikimedia.org/T315533) [12:30:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:30:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:30:30] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36812/console" [puppet] - 10https://gerrit.wikimedia.org/r/824484 (https://phabricator.wikimedia.org/T315533) (owner: 10Vgutierrez) [12:30:52] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T312972)', diff saved to https://phabricator.wikimedia.org/P32521 and previous config saved to /var/cache/conftool/dbconfig/20220818-123220-marostegui.json [12:32:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [12:32:25] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [12:32:32] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:32:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [12:32:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T312972)', diff saved to https://phabricator.wikimedia.org/P32522 and previous config saved to /var/cache/conftool/dbconfig/20220818-123241-marostegui.json [12:33:27] !log reedy@deploy1002 Synchronized wmf-config/: SFS config updates (duration: 03m 25s) [12:34:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:39:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:40:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:40:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:40:25] (03PS1) 10JMeybohm: sre.k8s.pool-depool-cluster: Add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) [12:41:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:42:24] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:44:54] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:46:34] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:46:55] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you for the quick review!" [puppet] - 10https://gerrit.wikimedia.org/r/824447 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [12:47:02] (03PS3) 10Filippo Giunchedi: postgresql: resync_replica improvements and fixes [puppet] - 10https://gerrit.wikimedia.org/r/824447 (https://phabricator.wikimedia.org/T313229) [12:47:06] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:14] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:51:44] (03PS1) 10Filippo Giunchedi: postgresql: default to autodetecting pg version [cookbooks] - 10https://gerrit.wikimedia.org/r/824486 (https://phabricator.wikimedia.org/T313229) [12:58:06] (03CR) 10Jforrester: [C: 03+1] "Thank you." 
[puppet] - 10https://gerrit.wikimedia.org/r/822657 (https://phabricator.wikimedia.org/T275904) (owner: 10BBlack) [12:58:48] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:03] (03PS1) 10FNegri: Add cloudcephosd1026 to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/824489 [12:59:33] (03PS2) 10FNegri: Add cloudcephosd1026 to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/824489 (https://phabricator.wikimedia.org/T314870) [13:00:04] (03CR) 10Ssingh: [C: 03+1] trafficserver: Set transaction_active_timeout_out on cp4026 and cp4032 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824484 (https://phabricator.wikimedia.org/T315533) (owner: 10Vgutierrez) [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220818T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220818T1300). [13:00:05] MatmaRex, TheresNoTime, Jdlrobson, and zabe: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:13] * TheresNoTime is here [13:00:15] o/ [13:00:22] (03CR) 10CI reject: [V: 04-1] Add cloudcephosd1026 to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/824489 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [13:00:48] hi [13:00:55] mine are just no-ops [13:00:56] hi [13:01:08] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:31] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1033.eqiad.wmnet with OS bullseye [13:01:43] (03PS3) 10FNegri: Add cloudcephosd1026 to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/824489 (https://phabricator.wikimedia.org/T314870) [13:01:56] (03CR) 10David Caro: [C: 03+1] "LGTM! (after fixing the lints)" [puppet] - 10https://gerrit.wikimedia.org/r/824489 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [13:02:15] I can deploy if needed, will give urbanecm et al a few minutes to appear [13:02:28] jfdi :) [13:02:33] (03CR) 10jenkins-bot: Add cloudcephosd1026 to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/824489 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [13:03:02] wamp, going to start with my own patch then :P [13:03:32] dogfooding is the best way [13:03:34] (03CR) 10Samtar: [C: 03+2] InitialiseSettings-labs: Enable Phonos on beta enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824291 (https://phabricator.wikimedia.org/T314294) (owner: 10Samtar) [13:04:05] ah that had a conflict ^ [13:04:10] (03CR) 10FNegri: [C: 03+2] Add cloudcephosd1026 to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/824489 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [13:07:02] (03PS2) 10Samtar: InitialiseSettings-labs: Enable Phonos on beta enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824291 (https://phabricator.wikimedia.org/T314294) [13:08:02] PROBLEM - SSH on 
wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:08:12] (03CR) 10Samtar: [C: 03+2] InitialiseSettings-labs: Enable Phonos on beta enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824291 (https://phabricator.wikimedia.org/T314294) (owner: 10Samtar) [13:09:00] (03PS3) 10Jdlrobson: Enable new Vector skin on select pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823587 (https://phabricator.wikimedia.org/T314286) [13:09:08] (03Merged) 10jenkins-bot: InitialiseSettings-labs: Enable Phonos on beta enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824291 (https://phabricator.wikimedia.org/T314294) (owner: 10Samtar) [13:10:04] (03PS1) 10Cathal Mooney: Add shell user for 'trokhymovych' [puppet] - 10https://gerrit.wikimedia.org/r/824490 (https://phabricator.wikimedia.org/T315262) [13:10:38] right just syncing mine, then MatmaRex you're next [13:10:44] no-ops you say? [13:10:51] yes [13:11:27] MatmaRex: mind fixing the conflicts while I sync? [13:11:29] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10cmooney) I've completed the following actions: - User added to LDAP 'nda' group. - Kerberos principal created for username 'trokhymovych' and email 'trokhym... 
[13:12:41] doing [13:12:45] :) [13:12:56] (03PS1) 10JMeybohm: sre.k8s.reboot-nodes: Don't sleep that long between batches [cookbooks] - 10https://gerrit.wikimedia.org/r/824491 (https://phabricator.wikimedia.org/T260661) [13:13:30] (03PS1) 10Jbond: CHANGELOG: add changelogs for release v3.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/824492 [13:13:33] !log samtar@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:824291|InitialiseSettings-labs: Enable Phonos on beta enwiki (T314294)]] (duration: 03m 30s) [13:13:37] T314294: Deploy Phonos to beta cluster - https://phabricator.wikimedia.org/T314294 [13:13:55] (03CR) 10Jbond: [C: 03+2] CHANGELOG: add changelogs for release v3.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/824492 (owner: 10Jbond) [13:13:56] 10SRE, 10Goal, 10MW-1.38-notes (1.38.0-wmf.4; 2021-10-12), 10Patch-For-Review, and 2 others: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) A quick update on "high frequency" statsd producers sampled over 10 minutes on graphite1004. The list is getting shorter... [13:14:05] awight: I started deployment, so far only done my own patch [13:14:38] (03PS6) 10David Caro: quota_increase: pretty format SAL entry [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824166 (owner: 10RhinosF1) [13:14:39] awight: feel free to take over if you'd prefer? 
[13:14:40] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [13:14:45] TheresNoTime: thanks doing the deployment [13:14:49] i'm here too if needed [13:15:01] * TheresNoTime can continue :) [13:15:08] ty! [13:15:12] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1033.eqiad.wmnet with reason: host reimage [13:15:13] feel free to! i'll be on standby :) [13:15:36] (03PS2) 10Bartosz Dziewoński: Disable DiscussionTools pageframe everywhere except labs and mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824203 (owner: 10Esanders) [13:15:50] (03PS4) 10Bartosz Dziewoński: Remove unused config for Echo notification emails [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820546 (https://phabricator.wikimedia.org/T314604) [13:16:05] (ty ^) [13:16:08] done, sorry about that [13:16:19] no worries! [13:16:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:17:33] (03CR) 10Samtar: [C: 03+2] "deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824203 (owner: 10Esanders) [13:17:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:17:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:17:57] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1033.eqiad.wmnet with reason: host reimage [13:18:17] MatmaRex: are you going to be able to test these? 
[13:18:51] (03Merged) 10jenkins-bot: Disable DiscussionTools pageframe everywhere except labs and mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824203 (owner: 10Esanders) [13:19:03] TheresNoTime: they do nothing [13:19:30] in both cases the configs aren't used by the code (anymore, or yet) [13:19:39] ack :) [13:20:33] Now syncing 824203 [13:20:53] (03PS5) 10Samtar: Remove unused config for Echo notification emails [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820546 (https://phabricator.wikimedia.org/T314604) (owner: 10Bartosz Dziewoński) [13:21:24] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:21:27] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/824490 (https://phabricator.wikimedia.org/T315262) (owner: 10Cathal Mooney) [13:21:44] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:22:56] (03CR) 10David Caro: [C: 03+2] quota_increase: pretty format SAL entry [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824166 (owner: 10RhinosF1) [13:23:01] (03CR) 10David Caro: [C: 03+2] "Thanks!" 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824166 (owner: 10RhinosF1) [13:23:28] (03PS4) 10Jdlrobson: Enable new Vector skin on select pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823587 (https://phabricator.wikimedia.org/T314286) [13:23:30] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:23:46] !log samtar@deploy1002 Synchronized wmf-config: Config: [[gerrit:824203|Disable DiscussionTools pageframe everywhere except labs and mediawikiwiki]] (duration: 03m 26s) [13:23:48] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:24:16] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [13:24:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:24:30] !log awight@deploy1002 Started deploy [kartotherian/deploy@672af45]: Update kartotherian to 285fc7d [13:24:40] (03CR) 10Samtar: [C: 03+2] "deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820546 (https://phabricator.wikimedia.org/T314604) (owner: 10Bartosz Dziewoński) [13:24:54] (03CR) 10Andrew Bogott: [C: 03+1] "this is much easier to remember!" 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/823668 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [13:24:56] (03PS1) 10Jbond: Upstream release v3.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/824493 [13:25:02] !log jayme@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-codfw [13:25:13] 🥳 [13:25:46] I need to go, so I am going to reschedule my patch [13:25:46] (03Merged) 10jenkins-bot: Remove unused config for Echo notification emails [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820546 (https://phabricator.wikimedia.org/T314604) (owner: 10Bartosz Dziewoński) [13:25:49] (03CR) 10David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [13:26:01] zabe: okay! [13:26:06] PROBLEM - Host sretest1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:26:25] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route [13:27:17] (03CR) 10David Caro: "I had that review draft from some time ago sorry xd, will finish the whole review now" [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [13:27:58] syncing 820546 [13:28:16] !log awight@deploy1002 Finished deploy [kartotherian/deploy@672af45]: Update kartotherian to 285fc7d (duration: 03m 45s) [13:29:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10Andrew) [13:29:25] (03PS1) 10FNegri: Ceph OSD hosts: set mtu on both ifaces [puppet] - 10https://gerrit.wikimedia.org/r/824494 (https://phabricator.wikimedia.org/T315446) [13:29:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:29:51] (03PS1) 10Hashar: doc: redirect /mw-tools-scap/ 
to /scap/ [puppet] - 10https://gerrit.wikimedia.org/r/824495 (https://phabricator.wikimedia.org/T315541) [13:29:59] PROBLEM - Confd template for /var/lib/gdnsd/discovery-toolhub.state on authdns2001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-toolhub.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:30:17] PROBLEM - Confd template for /var/lib/gdnsd/discovery-toolhub.state on dns5001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-toolhub.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:30:21] scap should have a little https://en.wikipedia.org/wiki/Dinosaur_Game or something during the restarts [13:30:24] (03Merged) 10jenkins-bot: quota_increase: pretty format SAL entry [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824166 (owner: 10RhinosF1) [13:30:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:30:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:30:51] Jdlrobson: you're the next patch [13:30:53] PROBLEM - Confd template for /var/lib/gdnsd/discovery-toolhub.state on dns2002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-toolhub.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:31:01] TheresNoTime: awesome [13:31:08] (03PS5) 10Samtar: Enable new Vector skin on select pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823587 (https://phabricator.wikimedia.org/T314286) (owner: 10Jdlrobson) [13:31:14] !log samtar@deploy1002 Synchronized wmf-config: Config: [[gerrit:820546|Remove unused config for Echo notification emails (T314604)]] (duration: 03m 25s) [13:31:17] T314604: Remove no-reply-notifications@ email addresses and config using them (wgNotificationSender etc.) - https://phabricator.wikimedia.org/T314604 [13:31:19] TheresNoTime: i want to play dino game now and i got stuff to do! 
[13:31:46] !log jayme@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=toolhub [13:31:48] MatmaRex: your patches are live :) [13:32:11] (03CR) 10Samtar: [C: 03+2] "deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823587 (https://phabricator.wikimedia.org/T314286) (owner: 10Jdlrobson) [13:32:18] thanks TheresNoTime [13:32:27] 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10jhathaway) Thanks @TAndic I'll leave this in @bcampbell's hands unless I hear otherwise! [13:32:44] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1033.eqiad.wmnet with OS bullseye [13:32:57] PROBLEM - Confd template for /var/lib/gdnsd/discovery-toolhub.state on dns3001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-toolhub.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:32:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T312972)', diff saved to https://phabricator.wikimedia.org/P32523 and previous config saved to /var/cache/conftool/dbconfig/20220818-133257-marostegui.json [13:32:58] (03Merged) 10jenkins-bot: Enable new Vector skin on select pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823587 (https://phabricator.wikimedia.org/T314286) (owner: 10Jdlrobson) [13:33:01] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [13:33:15] (03CR) 10Jbond: [V: 03+2 C: 03+2] Upstream release v3.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/824493 (owner: 10Jbond) [13:33:31] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [13:33:35] (03CR) 10CI reject: [V: 04-1] doc: redirect /mw-tools-scap/ to /scap/ [puppet] - 10https://gerrit.wikimedia.org/r/824495 (https://phabricator.wikimedia.org/T315541) 
(owner: 10Hashar) [13:34:07] Jdlrobson: can you test on mwdebug1001 ? :) [13:34:17] yep! [13:34:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:36:16] LGTM TheresNoTime [13:36:19] please sync [13:36:25] ack :D [13:37:09] PROBLEM - Confd template for /var/lib/gdnsd/discovery-toolhub.state on authdns1001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-toolhub.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:37:17] syncing 823587, zabe if you happen to be back/still around, you would be next [13:37:28] !log release spicerack 3.2.0 [13:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:33] !log samtar@deploy1002 scap failed: average error rate on 5/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for details) [13:37:39] hm [13:37:46] !log uploaded spicerack_3.2.0 to apt.wikimedia.org bullseye-wikimedia [13:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:07] PROBLEM - Confd template for /var/lib/gdnsd/discovery-toolhub.state on dns6001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-toolhub.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:38:37] (03PS1) 10Andrew Bogott: profile::ceph::client::rbd_libvirt: remove refs to old absented files [puppet] - 10https://gerrit.wikimedia.org/r/824496 [13:38:39] (03PS1) 10Andrew Bogott: Provision cloudbackup100[34] to back up VM drives [puppet] - 10https://gerrit.wikimedia.org/r/824497 (https://phabricator.wikimedia.org/T302535) [13:38:40] Jdlrobson: seeing `PHP Warning: in_array() expects parameter 2 to be array, string given` in `/srv/mediawiki/php-1.39.0-wmf.25/skins/Vector/includes/Hooks.php:648` [13:38:43] 10SRE-OnFire, 10Discovery-Search, 10Sustainability (Incident Followup): Replace certificate on deployment-elastic09.deployment-prep.eqiad1.wikimedia.cloud - 
https://phabricator.wikimedia.org/T315386 (10bking) The work is complete; //deployment-elastic09.deployment-prep.eqiad1.wikimedia.cloud // has a valid S... [13:38:51] 10SRE-OnFire, 10Discovery-Search, 10Sustainability (Incident Followup): Replace certificate on deployment-elastic09.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T315386 (10bking) 05Open→03Resolved [13:39:04] TheresNoTime: ack.. looking [13:39:10] but not on `enwiki` so.. [13:39:19] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [13:39:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:39:31] 'default' => 'false', [13:39:39] TheresNoTime: that should be an array dooh [13:39:42] (03CR) 10CI reject: [V: 04-1] Provision cloudbackup100[34] to back up VM drives [puppet] - 10https://gerrit.wikimedia.org/r/824497 (https://phabricator.wikimedia.org/T302535) (owner: 10Andrew Bogott) [13:39:44] reverting [13:39:48] follow up or revert/try again? 
[13:40:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:40:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:40:40] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/824495 (https://phabricator.wikimedia.org/T315541) (owner: 10Hashar) [13:40:48] (03PS1) 10Jdlrobson: Enable new Vector skin on select pages (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824498 (https://phabricator.wikimedia.org/T314286) [13:41:10] ^ TheresNoTime there's the better version [13:41:15] Jdlrobson: ack [13:41:23] sorry that was a silly mistake to make [13:41:23] (03CR) 10CI reject: [V: 04-1] Enable new Vector skin on select pages (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824498 (https://phabricator.wikimedia.org/T314286) (owner: 10Jdlrobson) [13:41:28] wish we had some kind of validation on configuration [13:41:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:42:12] Jdlrobson: I am mid-revert so will need to wait [13:42:28] (03CR) 10Cathal Mooney: [C: 03+2] Add shell user for 'trokhymovych' [puppet] - 10https://gerrit.wikimedia.org/r/824490 (https://phabricator.wikimedia.org/T315262) (owner: 10Cathal Mooney) [13:42:47] urbanecm: ping as my first revert, following https://deploy-commands.toolforge.org/bacc/823587 though [13:42:50] TheresNoTime: ack [13:42:58] TheresNoTime: i'm still here [13:43:07] commands at deployment tools are right ones [13:43:09] PROBLEM - Confd template for /var/lib/gdnsd/discovery-toolhub.state on dns4001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-toolhub.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:43:14] let me know if you've questions [13:43:17] PROBLEM - Confd template for /var/lib/gdnsd/discovery-toolhub.state on dns3002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-toolhub.state is broken 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:43:32] so far so good, errors have stopped [13:44:03] (03PS2) 10Andrew Bogott: Provision cloudbackup100[34] to back up VM drives [puppet] - 10https://gerrit.wikimedia.org/r/824497 (https://phabricator.wikimedia.org/T302535) [13:44:29] Jdlrobson: we sort of have something like a validation, `operations-mw-config-php72-composer-diffConfig-docker` tells you the actual difference in config your patch makes. might be useful. [13:44:32] Jdlrobson: re config validation, for core config that is soon [13:44:42] (03PS1) 10Clément Goubert: admin: move cgoubert from sre-admins to ops group [puppet] - 10https://gerrit.wikimedia.org/r/824499 (https://phabricator.wikimedia.org/T315538) [13:44:51] (03CR) 10CI reject: [V: 04-1] Provision cloudbackup100[34] to back up VM drives [puppet] - 10https://gerrit.wikimedia.org/r/824497 (https://phabricator.wikimedia.org/T302535) (owner: 10Andrew Bogott) [13:44:58] (it needs you to have a look and check the variables, but at least it's something) [13:45:18] !log samtar@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: Revert: [[gerrit:823587|Enable new Vector skin on select pages (T314286)]] (duration: 03m 35s) [13:45:18] RhinosF1: that's lovely to hear :) [13:45:22] T314286: Run survey for vector 2022 on enwiki - https://phabricator.wikimedia.org/T314286 [13:45:27] (03PS2) 10Clément Goubert: admin: move cgoubert from sre-admins to ops group [puppet] - 10https://gerrit.wikimedia.org/r/824499 (https://phabricator.wikimedia.org/T315538) [13:45:33] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/824494 (https://phabricator.wikimedia.org/T315446) (owner: 10FNegri) [13:45:48] Jdlrobson, urbanecm: https://phabricator.wikimedia.org/T313128#8097748 discusses it [13:46:00] Jdlrobson: please rebase https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/824498/ [13:46:28] TheresNoTime: the push command from the revert section likely 
won't work (the default is now to push via ssh, not https). if that happens, i can share some config to put to ~/.gitconfig to fix that [13:46:42] urbanecm: *just* did that and got an error, yes [13:46:46] (03CR) 10Andrew Bogott: [C: 03+2] profile::ceph::client::rbd_libvirt: remove refs to old absented files [puppet] - 10https://gerrit.wikimedia.org/r/824496 (owner: 10Andrew Bogott) [13:46:47] okay, wait a sec [13:46:57] (actually, which error, just in case?) [13:47:08] TheresNoTime: where's the revert patch? [13:47:12] I need to rebase off of that [13:47:20] urbanecm: `samtar@gerrit.wikimedia.org: Permission denied (publickey)`, which is SSH so.. [13:47:24] ah, okay [13:47:29] Jdlrobson: it's at deployment server only atm [13:47:59] 10SRE-tools, 10Spicerack: spicerack.dnsdisc.Discovery should not allow pooling active/passive services in both datacenters - https://phabricator.wikimedia.org/T315560 (10JMeybohm) p:05Triage→03Medium [13:47:59] urbanecm: so should i wait for you to push? [13:48:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P32524 and previous config saved to /var/cache/conftool/dbconfig/20220818-134803-marostegui.json [13:48:07] or should I just apply on current master? [13:48:35] PROBLEM - Confd template for /var/lib/gdnsd/discovery-toolhub.state on dns1001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-toolhub.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:48:37] PROBLEM - Confd template for /var/lib/gdnsd/discovery-toolhub.state on dns4002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-toolhub.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:48:38] TheresNoTime: add this to your ~/.gitconfig. 
that should convince git to use https instead of ssh https://www.irccloud.com/pastebin/OXrrXaEV/ [13:48:43] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/824499 (https://phabricator.wikimedia.org/T315538) (owner: 10Clément Goubert) [13:49:12] ack [13:49:24] Jdlrobson: please wait a bit, it will be in gerrit soon. [13:49:34] urbanecm: no problemo [13:49:44] once it's in gerrit it should be easy to do the rebase [13:49:52] (03PS1) 10Urbanecm: admin: urbanecm's home: Update .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/824501 [13:50:11] (03PS3) 10Andrew Bogott: Provision cloudbackup100[34] to back up VM drives [puppet] - 10https://gerrit.wikimedia.org/r/824497 (https://phabricator.wikimedia.org/T302535) [13:50:30] (03PS2) 10Urbanecm: admin: urbanecm's home: Update .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/824501 [13:50:43] (03PS4) 10Aqu: Puppetize spark3 installation and configs using conda-analytics env V2 [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) [13:51:18] https://www.irccloud.com/pastebin/ASSY6m4C/ [13:51:19] PROBLEM - Confd template for /var/lib/gdnsd/discovery-toolhub.state on dns1002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-toolhub.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:51:21] PROBLEM - Confd template for /var/lib/gdnsd/discovery-toolhub.state on dns2001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-toolhub.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:51:29] urbanecm: ^ missing Change-Id [13:51:38] could you SSH in and do the push? [13:51:40] TheresNoTime: run git commit --amend and save without any changes [13:51:45] ah okay [13:52:05] that probably should be a part of deployment commands help. i'll send a patch for that. 
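(A rough sketch of the "missing Change-Id" situation above. Gerrit identifies a change by a `Change-Id:` line in the commit message, of the form `I` followed by 40 hexadecimal characters. Amending the commit presumably helps because a commit-msg hook then adds the line; that part is an assumption, since the log does not show the hook. The function name and the sample messages below are made up for illustration, and this simplified check accepts the line anywhere in the message rather than only in the footer.)

```python
import re

# "Change-Id: I" followed by 40 lowercase hex characters, on its own line.
CHANGE_ID_RE = re.compile(r"^Change-Id: I[0-9a-f]{40}$", re.MULTILINE)


def has_change_id(commit_message: str) -> bool:
    """Return True if the message contains a well-formed Change-Id line."""
    return CHANGE_ID_RE.search(commit_message) is not None


# Hypothetical commit messages, not taken from the actual revert.
without_id = 'Revert "Example change"\n\nThis reverts an example commit.\n'
with_id = without_id + "\nChange-Id: I" + "a" * 40 + "\n"

print(has_change_id(without_id))
print(has_change_id(with_id))
```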
[13:52:23] (03CR) 10CI reject: [V: 04-1] Puppetize spark3 installation and configs using conda-analytics env V2 [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [13:52:27] RECOVERY - Confd template for /var/lib/gdnsd/discovery-toolhub.state on authdns1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:52:27] (03PS3) 10Urbanecm: admin: urbanecm's home: Update .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/824501 [13:52:53] (I can also do the push if you want, but i'm pretty sure the change-id is the last thing) [13:53:09] Jdlrobson: urbanecm: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/824502 [13:53:19] RECOVERY - Confd template for /var/lib/gdnsd/discovery-toolhub.state on dns6001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:53:19] (remind me to properly set up my gitconfig etc) [13:53:23] RECOVERY - Confd template for /var/lib/gdnsd/discovery-toolhub.state on dns4001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:53:31] RECOVERY - Confd template for /var/lib/gdnsd/discovery-toolhub.state on dns1002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:53:33] RECOVERY - Confd template for /var/lib/gdnsd/discovery-toolhub.state on dns2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:53:35] sounds good! let's merge? [13:53:39] RECOVERY - Confd template for /var/lib/gdnsd/discovery-toolhub.state on dns3001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:53:39] RECOVERY - Confd template for /var/lib/gdnsd/discovery-toolhub.state on dns3002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:53:42] (it's in prod already, right?) 
[13:53:57] yeah [13:54:22] 10SRE-OnFire, 10Performance-Team, 10MW-1.39-notes (1.39.0-wmf.26; 2022-08-22), 10Wikimedia-Incident, 10Wikimedia-production-error: Wikimedia\Rdbms\DBTransactionError: Explicit transaction still active; a caller might have failed to call endAtomic() or cancelAtomi... - https://phabricator.wikimedia.org/T315274 [13:54:28] urbanecm: can I just +2 it or..? [13:54:32] yup [13:54:40] (03CR) 10Samtar: [C: 03+2] Revert "Enable new Vector skin on select pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824502 (owner: 10Samtar) [13:54:57] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 4 others: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (10JMeybohm) [13:54:57] sorry for the delay here Jdlrobson! [13:54:59] RECOVERY - Confd template for /var/lib/gdnsd/discovery-toolhub.state on dns1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:54:59] RECOVERY - Confd template for /var/lib/gdnsd/discovery-toolhub.state on authdns2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:54:59] RECOVERY - Confd template for /var/lib/gdnsd/discovery-toolhub.state on dns2002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:54:59] RECOVERY - Confd template for /var/lib/gdnsd/discovery-toolhub.state on dns4002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:55:01] RECOVERY - Confd template for /var/lib/gdnsd/discovery-toolhub.state on dns5001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:55:09] No problem TheresNoTime .. my error. :/ [13:55:19] * TheresNoTime probably should have caught that too [13:55:25] Seems like the error at least led to some useful knowledge transfer? 
[13:55:33] (03PS4) 10Jbond: sre.hardware.firmware-upgrade: power on server for firmware updates [cookbooks] - 10https://gerrit.wikimedia.org/r/824472 [13:55:42] (03Merged) 10jenkins-bot: Revert "Enable new Vector skin on select pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824502 (owner: 10Samtar) [13:56:12] (03PS1) 10Btullis: Fix a conflict between hdfs-fuse and systemd-timesync on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/824503 (https://phabricator.wikimedia.org/T310643) [13:56:14] https://www.irccloud.com/pastebin/VXysYXqT/ [13:56:20] urbanecm: ^ expected? [13:56:30] if you did not run git fetch yet, yes [13:56:37] ah, oops [13:57:03] (03CR) 10JMeybohm: [C: 03+2] sre.k8s.reboot-nodes: Don't sleep that long between batches [cookbooks] - 10https://gerrit.wikimedia.org/r/824491 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [13:57:11] Jdlrobson: okay, once you've rebased we can try again [13:57:59] TheresNoTime: ok thanks.. looking again [13:58:16] (03CR) 10Btullis: "I think this should fix the issue with hdfs-fuse and systemd-timesync, but I'm not sure if this is the perfect place to put the code in pu" [puppet] - 10https://gerrit.wikimedia.org/r/824503 (https://phabricator.wikimedia.org/T310643) (owner: 10Btullis) [13:58:22] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 4 others: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (10JMeybohm) Reboot of staging clusters and codfw (batchsize 1, took ~3.25 hours) went smoothly without any al... 
[13:58:32] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, and 2 others: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943 (10JMeybohm) [13:58:34] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 4 others: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (10JMeybohm) 05Open→03Resolved [13:58:42] TheresNoTime: OK this time we look good [13:58:48] seeing the empty array on mediawiki.org [13:59:01] please sync (with fingers crossed!) [13:59:07] TheresNoTime: fyi https://gerrit.wikimedia.org/r/c/labs/tools/deploy-commands/+/824505 [13:59:30] Jdlrobson: I've not merged https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/824498/ yet [13:59:41] (03PS4) 10Andrew Bogott: Provision cloudbackup100[34] to back up VM drives [puppet] - 10https://gerrit.wikimedia.org/r/824497 (https://phabricator.wikimedia.org/T302535) [13:59:43] needs rebasing I think? [14:00:04] TheresNoTime: ah okay because im seeing the current default :) [14:00:20] (I just did scap pull on debug, had forgot!) [14:00:24] (03PS2) 10Jdlrobson: Enable new Vector skin on select pages (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824498 (https://phabricator.wikimedia.org/T314286) [14:00:28] I hit the rebase icon in the UI [14:00:31] sorry I thought that had been done [14:00:51] (03CR) 10CI reject: [V: 04-1] sre.hardware.firmware-upgrade: power on server for firmware updates [cookbooks] - 10https://gerrit.wikimedia.org/r/824472 (owner: 10Jbond) [14:01:09] ah I could have done that sorry, I just saw "This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset." 
and assumed [14:01:27] yeah, always click rebase first, it's more reliable :) [14:01:38] !log extending deployment window slightly [14:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:01:44] (03Merged) 10jenkins-bot: sre.k8s.reboot-nodes: Don't sleep that long between batches [cookbooks] - 10https://gerrit.wikimedia.org/r/824491 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [14:02:00] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36817/console" [puppet] - 10https://gerrit.wikimedia.org/r/824503 (https://phabricator.wikimedia.org/T310643) (owner: 10Btullis) [14:02:02] (03CR) 10Samtar: [C: 03+2] "deploy take 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824498 (https://phabricator.wikimedia.org/T314286) (owner: 10Jdlrobson) [14:02:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:02:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:02:55] (03Merged) 10jenkins-bot: Enable new Vector skin on select pages (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824498 (https://phabricator.wikimedia.org/T314286) (owner: 10Jdlrobson) [14:03:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P32525 and previous config saved to /var/cache/conftool/dbconfig/20220818-140309-marostegui.json [14:03:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:03:44] Jdlrobson: right! 
now ready on mwdebug1001 [14:03:47] (03PS5) 10Andrew Bogott: Provision cloudbackup100[34] to back up VM drives [puppet] - 10https://gerrit.wikimedia.org/r/824497 (https://phabricator.wikimedia.org/T302535) [14:05:29] (03PS4) 10Urbanecm: admin: urbanecm's home: Update .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/824501 [14:05:33] (03CR) 10Ottomata: Add missing airflow service users to yarn's production queue (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824241 (https://phabricator.wikimedia.org/T312858) (owner: 10Xcollazo) [14:05:43] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:52] Jdlrobson: (looking good to me :) less *error-y*) [14:05:54] TheresNoTime: Okay looks good [14:05:58] syncing! [14:05:58] (03CR) 10Andrew Bogott: [C: 03+2] Provision cloudbackup100[34] to back up VM drives [puppet] - 10https://gerrit.wikimedia.org/r/824497 (https://phabricator.wikimedia.org/T302535) (owner: 10Andrew Bogott) [14:06:15] RECOVERY - Host sretest1002 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [14:08:28] TheresNoTime: fingers crossed :) [14:08:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:09:17] !log samtar@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:824498|Enable new Vector skin on select pages (take 2) (T314286)]] (duration: 03m 07s) [14:09:22] T314286: Run survey for vector 2022 on enwiki - https://phabricator.wikimedia.org/T314286 [14:09:37] Done :) [14:09:40] phew! [14:09:42] Thanks TheresNoTime [14:09:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:09:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:09:57] thanks for bearing with me! 
[14:10:12] !log UTC afternoon backport window done [14:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:50] and thank you for the help urbanecm [14:10:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:10:59] any time TheresNoTime! [14:11:48] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10cmooney) Hi @Trokhymovych I believe I have added all the elements that you require. Can you test the access and let us know if there is any problem? Thanks. [14:12:28] time to sort my gitconfig [14:12:58] TheresNoTime: pro-tip: use modules/admin/files/home/samtar in puppet repository [14:13:05] thanks TheresNoTime for the help today! [14:13:13] whatever you put there will be auto-propagated by puppet to all hosts you have access at [14:13:22] and it will be there when hosts are reinstalled, changed, whatever [14:13:22] urbanecm: oh awesome, thank you [14:13:23] (03PS1) 10BryanDavis: developer-portal: Bump container version to 2022-08-18-132255-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/824506 [14:13:43] see https://github.com/wikimedia/puppet/tree/production/modules/admin/files/home/urbanecm for my own dotfiles [14:13:54] (03PS5) 10Jbond: sre.hardware.firmware-upgrade: power on server for firmware updates [cookbooks] - 10https://gerrit.wikimedia.org/r/824472 [14:14:11] puppet patches can be scheduled for Puppet request windows (one is in two hours or so) [14:14:16] hope that helps TheresNoTime :) [14:14:28] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36820/console" [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [14:14:52] kinda glad I've had that deploy experience now to be honest [14:15:32] (03PS2) 10Btullis: Fix a conflict between 
hdfs-fuse and systemd-timesync on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/824503 (https://phabricator.wikimedia.org/T310643) [14:16:28] (03CR) 10Urbanecm: [C: 03+1] Deploy partial action blocks to cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824395 (https://phabricator.wikimedia.org/T315525) (owner: 10STran) [14:16:37] urbanecm: or you can just ping a friendly sre or add them as a gerrit reviewer [14:16:43] or that [14:17:01] TheresNoTime: Jdlrobson How is it that `PHP Warning: in_array() expects parameter 2 to be array, string given` wasn't caught during staging? Did you test and stage on the same server? was mwdebug reviewed to be empty in Logstash? [14:17:15] (03CR) 10Giuseppe Lavagetto: [C: 03+1] service: use --env-file for docker [puppet] - 10https://gerrit.wikimedia.org/r/824451 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [14:17:23] (03PS2) 10AOkoth: gitlab: revert gitlab-replica TTL to 600s [dns] - 10https://gerrit.wikimedia.org/r/824244 (https://phabricator.wikimedia.org/T296713) [14:17:56] Krinkle: I did not check https://logstash.wikimedia.org/app/dashboards#/view/mwdebug1002, entirely my fault. 
Sorry [14:18:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T312972)', diff saved to https://phabricator.wikimedia.org/P32526 and previous config saved to /var/cache/conftool/dbconfig/20220818-141815-marostegui.json [14:18:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [14:18:19] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [14:18:25] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36821/console" [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [14:18:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [14:18:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T312972)', diff saved to https://phabricator.wikimedia.org/P32527 and previous config saved to /var/cache/conftool/dbconfig/20220818-141835-marostegui.json [14:18:41] (03CR) 10AOkoth: gitlab: revert gitlab-replica TTL to 600s (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/824244 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth) [14:19:07] (not that I see that error in logstash..) 
[14:20:03] (03CR) 10Andrew Bogott: [C: 03+1] "I'm very surprised the change is in timesyncd and not in hdfs but I'll take whatever is on offer :)" [puppet] - 10https://gerrit.wikimedia.org/r/824503 (https://phabricator.wikimedia.org/T310643) (owner: 10Btullis) [14:20:15] PROBLEM - Host sretest1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T312972)', diff saved to https://phabricator.wikimedia.org/P32528 and previous config saved to /var/cache/conftool/dbconfig/20220818-142043-marostegui.json [14:21:02] (03CR) 10CI reject: [V: 04-1] sre.hardware.firmware-upgrade: power on server for firmware updates [cookbooks] - 10https://gerrit.wikimedia.org/r/824472 (owner: 10Jbond) [14:22:58] 10SRE, 10Data-Engineering, 10Foundational Technology Requests: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (10Ottomata) [14:24:51] (03CR) 10Herron: "Nice! 
Please see a few comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [14:26:10] (03CR) 10Herron: [C: 03+1] netmon: Set correct username/groupname mappings for Rancid [puppet] - 10https://gerrit.wikimedia.org/r/824286 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [14:26:22] (03CR) 10Herron: [C: 03+1] netmon: Set correct username/groupname mappings for LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/824284 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [14:27:46] (03CR) 10Ottomata: [C: 03+1] Fix a conflict between hdfs-fuse and systemd-timesync on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824503 (https://phabricator.wikimedia.org/T310643) (owner: 10Btullis) [14:32:02] (03CR) 10Herron: WIP: add profile::dispatch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [14:34:34] (03CR) 10Jbond: [C: 03+2] "LGTM will merge" [puppet] - 10https://gerrit.wikimedia.org/r/824501 (owner: 10Urbanecm) [14:34:37] (03PS1) 10Andrew Bogott: cloudbackup100[34]: specify limited rbd access [puppet] - 10https://gerrit.wikimedia.org/r/824510 (https://phabricator.wikimedia.org/T302535) [14:34:40] thanks jbond! 
 [14:35:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P32529 and previous config saved to /var/cache/conftool/dbconfig/20220818-143549-marostegui.json [14:36:55] (03PS1) 10Samtar: files/home: Add samtar [puppet] - 10https://gerrit.wikimedia.org/r/824511 [14:37:02] (03CR) 10Andrew Bogott: [C: 03+2] cloudbackup100[34]: specify limited rbd access [puppet] - 10https://gerrit.wikimedia.org/r/824510 (https://phabricator.wikimedia.org/T302535) (owner: 10Andrew Bogott) [14:41:45] TheresNoTime: it shows on the main logstash mw dashboard but seemingly not on mwdebug. I don't see any bug though, as logs before and after that period are fine. I see change 823587 merged 13:32 UTC. You asked for testing at 13:34. Your scap pulls at https://logstash.wikimedia.org/goto/807b8ec61b668860178769210666a10e for mwdebug1001 show at 13:20, 13:27, 13:33 (relevant) and 13:59. And then at [14:41:45] https://logstash.wikimedia.org/app/dashboards#/view/mwdebug1002 when disabling type:mediawiki filter (to see the fpm logs) I see that indeed at 13:33 fpm reports "exiting, bye-bye!" as expected, and then after that various webrequests by Jon to validate it on enwiki. There were no pageviews on other wikis though that I can see. [14:43:05] so opportunities for catching were: 1) test pageview other than enwiki and non-zero errors in logstash, and 2) Scap runs an eval.php check to validate configuration. However this doesn't invoke Vector hooks, so it slipped through there. 3) Scap runs swagger checks against various URLs on canary hosts. These should catch it afaik. [14:43:28] TheresNoTime: can you confirm that Scap aborted the deploy during canaries? Or did it fully go out to all servers? 
[14:43:40] 10SRE-OnFire, 10Performance-Team, 10MW-1.39-notes (1.39.0-wmf.26; 2022-08-22), 10Wikimedia-Incident, 10Wikimedia-production-error: Wikimedia\Rdbms\DBTransactionError: Explicit transaction still active; a caller might have failed to call endAtomic() or cancelAtomi... - https://phabricator.wikimedia.org/T315274 [14:43:41] Krinkle: it aborted itself [14:44:09] which check? [14:44:47] looking [14:45:27] `13:37:33 Check 'Logstash Error rate for mw1447.eqiad.wmnet' failed: ERROR: 96% OVER_THRESHOLD (Avg. Error rate: Before: 0.43, After: 117.00, Threshold: 4.31)` [14:46:01] same for mw1414, mw1450, mw1448 and mw1415 [14:46:20] then it failed itself with `average error rate on 5/9 canaries increased by 10x` [14:48:24] there is a fourth opportunity to catch things in scap which is what caught it: the order of magnitude increase in error rate on canary servers [14:49:05] it's getting some portion of traffic at that point, so not ideal, but better than all traffic [14:49:05] 10SRE, 10SRE-OnFire, 10Observability-Alerting: vopsbot's home directory doesn't get created - https://phabricator.wikimedia.org/T315568 (10fgiunchedi) [14:49:40] the swagger checks happen at the same stage [14:50:39] TheresNoTime: nice job on your first rollback: it's always stressful. 
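(A small sketch of the arithmetic behind the canary check quoted above. This is inferred from the numbers in the log line, not taken from scap's code: the threshold of 4.31 is about ten times the "before" rate of 0.43, and "96% OVER_THRESHOLD" matches the share of the "after" rate that lies above that threshold. The function name, the `factor` parameter, and the percentage formula are all assumptions.)

```python
def canary_check(before, after, factor=10.0):
    """Compare a canary's error rate after a deploy against factor * before.

    Returns (threshold, over_threshold, percent_over). The percentage is the
    part of `after` above the threshold, relative to `after`; this formula is
    a guess that happens to reproduce the 96% seen in the log.
    """
    threshold = before * factor
    over = after > threshold
    percent_over = (after - threshold) / after * 100 if over else 0.0
    return threshold, over, percent_over


# Rates quoted in the log for mw1447.eqiad.wmnet.
threshold, over, percent_over = canary_check(before=0.43, after=117.00)
print(f"threshold={threshold:.2f} over={over} percent_over={percent_over:.0f}%")
```

(The log shows a threshold of 4.31 rather than 4.30, presumably because the "before" rate is rounded for display. How many failing canaries it takes to abort a deploy is not stated in the log, so the sketch leaves that decision out.)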
[14:50:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P32530 and previous config saved to /var/cache/conftool/dbconfig/20220818-145055-marostegui.json [14:51:05] thanks :) [14:51:13] (03PS1) 10Btullis: Add dummy keytabs for new clouddumps100* servers [labs/private] - 10https://gerrit.wikimedia.org/r/824513 (https://phabricator.wikimedia.org/T309346) [14:51:17] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:52:21] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add dummy keytabs for new clouddumps100* servers [labs/private] - 10https://gerrit.wikimedia.org/r/824513 (https://phabricator.wikimedia.org/T309346) (owner: 10Btullis) [14:52:21] running git commands under duress is (at the moment) the main skill you gain from doing deployments [14:53:26] just to summarise, if this was tested on a wiki other than en.wiki while it was on mwdebug1001, we'd have logged errors? [14:53:32] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36823/console" [puppet] - 10https://gerrit.wikimedia.org/r/824503 (https://phabricator.wikimedia.org/T310643) (owner: 10Btullis) [14:53:59] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:22] 10SRE-OnFire, 10Beta-Cluster-Infrastructure, 10Discovery-Search, 10Release-Engineering-Team, and 5 others: Beta cluster Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T315350 (10Krinkle) [14:55:07] TheresNoTime: yes [14:56:33] thcipriani: perhaps we can run the swagger checks on mwdebug as well as part of scap-pull. 
ideally we'd be able to automatically also tell the scap-pull'er if there are non-zero exception/error entries during that window, but even if it's just running the reqs it'll give us something to look for in logstash [14:56:51] almost feels like a `scap pull` on mwdebug1001 should trigger a few requests to a couple of big & small wikis? [14:57:01] :) [14:57:15] ah! :P [14:58:01] !log dancy@deploy1002 Started deploy [integration/docroot@a43ff3b]: (no justification provided) [14:58:15] RECOVERY - Host sretest1002 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [14:58:30] (03PS6) 10Jbond: sre.hardware.firmware-upgrade: power on server for firmware updates [cookbooks] - 10https://gerrit.wikimedia.org/r/824472 [14:58:39] !log dancy@deploy1002 Finished deploy [integration/docroot@a43ff3b]: (no justification provided) (duration: 00m 38s) [14:58:54] Krinkle: Please file tickets for scap improvement requests [14:59:19] (03PS2) 10Papaul: Add kafka-stretch200[12] to site.pp and to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/824311 (https://phabricator.wikimedia.org/T314160) [14:59:24] We're working on a scap sprint next week and we might be able to take on a few additional mods [14:59:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36824/console" [puppet] - 10https://gerrit.wikimedia.org/r/824417 (https://phabricator.wikimedia.org/T314936) (owner: 10Jbond) [14:59:49] Krinkle: file or ping on existing ones [15:01:35] (03PS2) 10JMeybohm: sre.k8s.pool-depool-cluster: Add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) [15:02:11] (03CR) 10Jbond: [C: 03+2] "LGTM will merge" [puppet] - 10https://gerrit.wikimedia.org/r/824511 (owner: 10Samtar) [15:02:36] TheresNoTime: consider joining the train triage in 5min if you can. 
[15:02:42] (03PS7) 10Jbond: sre.hardware.firmware-upgrade: power on server for firmware updates [cookbooks] - 10https://gerrit.wikimedia.org/r/824472 [15:02:49] (03CR) 10Papaul: [C: 03+2] Add kafka-stretch200[12] to site.pp and to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/824311 (https://phabricator.wikimedia.org/T314160) (owner: 10Papaul) [15:02:51] (03PS3) 10JMeybohm: sre.k8s.pool-depool-cluster: Add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) [15:02:59] Krinkle: ah I'm in a steering meeting now :( will there be notes? [15:03:10] (03CR) 10Btullis: Fix a conflict between hdfs-fuse and systemd-timesync on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824503 (https://phabricator.wikimedia.org/T310643) (owner: 10Btullis) [15:04:08] TheresNoTime: nope, it's procedural. we go through logstash and phab prod-error inbox and file/triage stuff. but good exercise to watch (and once comfortable: e.g. take a turn yourself). 
[15:04:59] TheresNoTime: although these may be of use for offline prep: https://wikitech.wikimedia.org/wiki/OpenSearch_Dashboards#Tech_talk and https://wikitech.wikimedia.org/wiki/Performance/Runbook/Monitor_production_errors [15:05:42] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-stretch2001.codfw.wmnet with OS bullseye [15:05:50] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 3 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-stretch2001.codfw.wmnet with OS bullseye [15:06:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T312972)', diff saved to https://phabricator.wikimedia.org/P32531 and previous config saved to /var/cache/conftool/dbconfig/20220818-150601-marostegui.json [15:06:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [15:06:06] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [15:06:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [15:06:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T312972)', diff saved to https://phabricator.wikimedia.org/P32532 and previous config saved to /var/cache/conftool/dbconfig/20220818-150621-marostegui.json [15:08:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T312972)', diff saved to https://phabricator.wikimedia.org/P32533 and previous config saved to /var/cache/conftool/dbconfig/20220818-150829-marostegui.json [15:08:50] (03CR) 10Jbond: [C: 03+2] sre.hardware.firmware-upgrade: power on server for firmware updates [cookbooks] - 10https://gerrit.wikimedia.org/r/824472 (owner: 10Jbond) 
[15:09:35] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:10:07] (03PS4) 10JMeybohm: sre.k8s.pool-depool-cluster: Add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) [15:11:59] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:35] (03CR) 10David Caro: "This is looking great!" [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [15:15:00] (03PS1) 10Hnowlan: helmfile.d: add thumbor configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/824519 (https://phabricator.wikimedia.org/T233196) [15:18:20] (03CR) 10David Caro: p:admin: ensure the shells exist before the users are created (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824130 (owner: 10David Caro) [15:19:34] (03CR) 10David Caro: [C: 03+1] "LGTM 👍" [puppet] - 10https://gerrit.wikimedia.org/r/824164 (owner: 10Jbond) [15:23:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P32534 and previous config saved to /var/cache/conftool/dbconfig/20220818-152335-marostegui.json [15:26:41] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:26:53] (03CR) 10Klausman: [C: 03+2] ml-services: Add kowiki, srwiki, ukwiki & viwiki drafttopic isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/824420 (https://phabricator.wikimedia.org/T314456) (owner: 10Kevin Bazira) [15:27:48] (03PS1) 10Andrew Bogott: cloudbackup100[34]: further attempt to narrow rbd keyring settings [puppet] - 10https://gerrit.wikimedia.org/r/824523 
(https://phabricator.wikimedia.org/T302535) [15:30:03] (03CR) 10Andrew Bogott: [C: 03+2] cloudbackup100[34]: further attempt to narrow rbd keyring settings [puppet] - 10https://gerrit.wikimedia.org/r/824523 (https://phabricator.wikimedia.org/T302535) (owner: 10Andrew Bogott) [15:30:37] (03Merged) 10jenkins-bot: ml-services: Add kowiki, srwiki, ukwiki & viwiki drafttopic isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/824420 (https://phabricator.wikimedia.org/T314456) (owner: 10Kevin Bazira) [15:32:14] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10Papaul) [15:33:42] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10Papaul) @Ottomata i getting the error below on kafka-stretch2001. I will check the HW side if you have a minutes can you please double check th... 
[15:35:44] (03PS1) 10Andrew Bogott: wmcs_backup_instances: temporarily back up testlabs on cloudbackup1003 [puppet] - 10https://gerrit.wikimedia.org/r/824526 (https://phabricator.wikimedia.org/T302535) [15:37:30] (03PS4) 10Hnowlan: thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) [15:37:55] (03PS1) 10Jbond: P:systemd::timesyncd: allow overriding the protectsystem systemd param [puppet] - 10https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) [15:38:12] (03CR) 10Jbond: [C: 04-1] Fix a conflict between hdfs-fuse and systemd-timesync on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824503 (https://phabricator.wikimedia.org/T310643) (owner: 10Btullis) [15:38:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P32535 and previous config saved to /var/cache/conftool/dbconfig/20220818-153842-marostegui.json [15:39:25] (03PS3) 10Btullis: Fix a conflict between hdfs-fuse and systemd-timesync on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/824503 (https://phabricator.wikimedia.org/T310643) [15:41:26] (03CR) 10Andrew Bogott: [C: 03+2] wmcs_backup_instances: temporarily back up testlabs on cloudbackup1003 [puppet] - 10https://gerrit.wikimedia.org/r/824526 (https://phabricator.wikimedia.org/T302535) (owner: 10Andrew Bogott) [15:43:57] (03CR) 10Btullis: Fix a conflict between hdfs-fuse and systemd-timesync on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824503 (https://phabricator.wikimedia.org/T310643) (owner: 10Btullis) [15:45:35] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36826/console" [puppet] - 10https://gerrit.wikimedia.org/r/824503 (https://phabricator.wikimedia.org/T310643) (owner: 10Btullis) [15:52:43] (03CR) 10Jbond: Fix a conflict between 
hdfs-fuse and systemd-timesync on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824503 (https://phabricator.wikimedia.org/T310643) (owner: 10Btullis) [15:52:49] (03CR) 10Btullis: [V: 03+1] Fix a conflict between hdfs-fuse and systemd-timesync on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824503 (https://phabricator.wikimedia.org/T310643) (owner: 10Btullis) [15:53:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T312972)', diff saved to https://phabricator.wikimedia.org/P32536 and previous config saved to /var/cache/conftool/dbconfig/20220818-155348-marostegui.json [15:53:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [15:53:53] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [15:54:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [15:54:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T312972)', diff saved to https://phabricator.wikimedia.org/P32537 and previous config saved to /var/cache/conftool/dbconfig/20220818-155410-marostegui.json [15:54:29] (03CR) 10Jbond: Fix a conflict between hdfs-fuse and systemd-timesync on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824503 (https://phabricator.wikimedia.org/T310643) (owner: 10Btullis) [15:59:38] (03PS1) 10Andrew Bogott: role::wmcs::openstack::eqiad1::backy: add profile::ceph::client::rbd_libvirt [puppet] - 10https://gerrit.wikimedia.org/r/824529 (https://phabricator.wikimedia.org/T302535) [15:59:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T312972)', diff saved to https://phabricator.wikimedia.org/P32538 and previous config saved to /var/cache/conftool/dbconfig/20220818-155938-marostegui.json [15:59:43] T312972: 
Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [16:00:05] jbond and rzl: My dear minions, it's time we take the moon! Just kidding. Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220818T1600). [16:00:05] Urbanecm: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:21] already merged. thanks again jbond [16:00:41] (03CR) 10Btullis: "This is great, thanks so much." [puppet] - 10https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: 10Jbond) [16:01:07] (03CR) 10CI reject: [V: 04-1] role::wmcs::openstack::eqiad1::backy: add profile::ceph::client::rbd_libvirt [puppet] - 10https://gerrit.wikimedia.org/r/824529 (https://phabricator.wikimedia.org/T302535) (owner: 10Andrew Bogott) [16:02:21] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2110 - https://phabricator.wikimedia.org/T315229 (10wiki_willy) a:03Papaul [16:02:50] (03PS2) 10Andrew Bogott: role::wmcs::openstack::eqiad1::backy + ceph::client::rbd_libvirt [puppet] - 10https://gerrit.wikimedia.org/r/824529 (https://phabricator.wikimedia.org/T302535) [16:03:31] (03CR) 10Ladsgroup: "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816239 (https://phabricator.wikimedia.org/T313657) (owner: 10Stang) [16:03:33] (03CR) 10CI reject: [V: 04-1] role::wmcs::openstack::eqiad1::backy + ceph::client::rbd_libvirt [puppet] - 10https://gerrit.wikimedia.org/r/824529 (https://phabricator.wikimedia.org/T302535) (owner: 10Andrew Bogott) [16:03:44] (03PS3) 10Andrew Bogott: role::wmcs::openstack::eqiad1::backy: include profile::ceph::client::rbd_libvirt [puppet] - 10https://gerrit.wikimedia.org/r/824529 (https://phabricator.wikimedia.org/T302535) [16:03:53] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2067 - 
https://phabricator.wikimedia.org/T314049 (10Papaul) I received the new disk and I will need the server offline so i can work on it. Thanks [16:05:00] (03CR) 10Andrew Bogott: [C: 03+2] role::wmcs::openstack::eqiad1::backy: include profile::ceph::client::rbd_libvirt [puppet] - 10https://gerrit.wikimedia.org/r/824529 (https://phabricator.wikimedia.org/T302535) (owner: 10Andrew Bogott) [16:06:02] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10Ottomata) Hm, in netboot.cfg, I see that kafka-jumbo nodes are currently set to use partman/custom/reuse-kafka-jumbo.cfg. That recipe has `/dev... [16:09:10] no probs urbanecm [16:09:45] (03CR) 10David Caro: [C: 03+2] Ceph OSD hosts: set mtu on both ifaces [puppet] - 10https://gerrit.wikimedia.org/r/824494 (https://phabricator.wikimedia.org/T315446) (owner: 10FNegri) [16:10:18] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10Papaul) Thanks I will try with partman/custom/kafka-jumbo.cfg [16:14:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P32539 and previous config saved to /var/cache/conftool/dbconfig/20220818-161444-marostegui.json [16:17:01] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-stretch2001.codfw.wmnet with OS bullseye [16:17:07] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-stretch2001.codfw.wmnet with OS bullseye exe... 
[16:17:27] (03PS1) 10Papaul: Update partman for kafka-stretch200[12] [puppet] - 10https://gerrit.wikimedia.org/r/824530 (https://phabricator.wikimedia.org/T314160) [16:18:53] (03CR) 10Papaul: [C: 03+2] Update partman for kafka-stretch200[12] [puppet] - 10https://gerrit.wikimedia.org/r/824530 (https://phabricator.wikimedia.org/T314160) (owner: 10Papaul) [16:21:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-stretch2001.codfw.wmnet with OS bullseye [16:21:21] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 3 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-stretch2001.codfw.wmnet with OS bullseye [16:24:24] (03Abandoned) 10Btullis: Fix a conflict between hdfs-fuse and systemd-timesync on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/824503 (https://phabricator.wikimedia.org/T310643) (owner: 10Btullis) [16:26:13] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be2067.codfw.wmnet with reason: disk fault investigation [16:26:27] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be2067.codfw.wmnet with reason: disk fault investigation [16:26:31] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=eb9685af-e0f7-4513-a789-7a96488ffc40) set by mvernon@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services wit... [16:27:10] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10MatthewVernon) @Papaul I've shut this server down for you to work on it. 
[16:28:20] (03CR) 10AOkoth: [C: 03+2] gitlab: revert gitlab-replica TTL to 600s [dns] - 10https://gerrit.wikimedia.org/r/824244 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth) [16:29:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P32540 and previous config saved to /var/cache/conftool/dbconfig/20220818-162950-marostegui.json [16:30:29] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 3 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10Papaul) @Ottomata on the new recipe i am getting ` │ │ │ │... [16:31:55] (03PS4) 10Andrew Bogott: role::wmcs::openstack::eqiad1::backy: include profile::ceph::client::rbd_libvirt [puppet] - 10https://gerrit.wikimedia.org/r/824529 (https://phabricator.wikimedia.org/T302535) [16:32:20] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Papaul) thanks [16:33:11] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10Ottomata) Hm, okay, so do we need a new recipe then? This might be a recipe that will be reused for many Config I hosts. 
[16:33:14] (03CR) 10Andrew Bogott: [C: 03+2] role::wmcs::openstack::eqiad1::backy: include profile::ceph::client::rbd_libvirt [puppet] - 10https://gerrit.wikimedia.org/r/824529 (https://phabricator.wikimedia.org/T302535) (owner: 10Andrew Bogott) [16:33:31] (03CR) 10David Caro: Fix a conflict between hdfs-fuse and systemd-timesync on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824503 (https://phabricator.wikimedia.org/T310643) (owner: 10Btullis) [16:34:33] (03CR) 10David Caro: Fix a conflict between hdfs-fuse and systemd-timesync on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824503 (https://phabricator.wikimedia.org/T310643) (owner: 10Btullis) [16:34:43] (03CR) 10David Caro: P:systemd::timesyncd: allow overriding the protectsystem systemd param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: 10Jbond) [16:38:05] (03CR) 10Jbond: P:systemd::timesyncd: allow overriding the protectsystem systemd param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: 10Jbond) [16:41:39] (03CR) 10Btullis: P:systemd::timesyncd: allow overriding the protectsystem systemd param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: 10Jbond) [16:42:57] (03PS1) 10Andrew Bogott: Add new profile::ceph::client::rbd_backy profile [puppet] - 10https://gerrit.wikimedia.org/r/824533 (https://phabricator.wikimedia.org/T302535) [16:43:27] (03CR) 10Btullis: P:systemd::timesyncd: allow overriding the protectsystem systemd param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: 10Jbond) [16:44:12] !log demon@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.25 refs T314186 [16:44:16] T314186: 1.39.0-wmf.25 deployment blockers - 
https://phabricator.wikimedia.org/T314186 [16:44:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T312972)', diff saved to https://phabricator.wikimedia.org/P32541 and previous config saved to /var/cache/conftool/dbconfig/20220818-164456-marostegui.json [16:45:01] T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972 [16:45:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [16:45:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [16:45:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 13 hosts with reason: Maintenance [16:45:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 13 hosts with reason: Maintenance [16:46:59] (03CR) 10CI reject: [V: 04-1] Add new profile::ceph::client::rbd_backy profile [puppet] - 10https://gerrit.wikimedia.org/r/824533 (https://phabricator.wikimedia.org/T302535) (owner: 10Andrew Bogott) [16:47:33] !log demon@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.25 refs T314186 (duration: 03m 20s) [16:47:41] (03PS2) 10Andrew Bogott: Add new profile::ceph::client::rbd_backy profile [puppet] - 10https://gerrit.wikimedia.org/r/824533 (https://phabricator.wikimedia.org/T302535) [16:47:43] (03PS2) 10Hashar: doc: redirect /mw-tools-scap/ to /scap/ [puppet] - 10https://gerrit.wikimedia.org/r/824495 (https://phabricator.wikimedia.org/T315541) [16:47:45] (03CR) 10BCornwall: [C: 04-1] "Following the instructions in the README, I'm getting:" [dns] - 10https://gerrit.wikimedia.org/r/824452 (https://phabricator.wikimedia.org/T315536) (owner: 10MMandere) [16:51:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [16:51:24] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [16:51:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 9 hosts with reason: Maintenance [16:51:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 9 hosts with reason: Maintenance [16:53:08] (03CR) 10Andrew Bogott: [C: 03+2] Add new profile::ceph::client::rbd_backy profile [puppet] - 10https://gerrit.wikimedia.org/r/824533 (https://phabricator.wikimedia.org/T302535) (owner: 10Andrew Bogott) [16:53:18] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:00:04] bd808: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220818T1700). 
[17:01:28] (03CR) 10Hnowlan: [C: 03+1] postgresql: default to autodetecting pg version [cookbooks] - 10https://gerrit.wikimedia.org/r/824486 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [17:05:43] (03CR) 10David Caro: "Not finished yet though 😊" [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:07:57] (03CR) 10Andrea Denisse: netmon: Add the OpenSSH configuration file inside the rancid home directory (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/824299 (https://phabricator.wikimedia.org/T314936) (owner: 10Andrea Denisse) [17:08:01] (03CR) 10Andrea Denisse: [C: 03+2] netmon: Add the OpenSSH configuration file inside the rancid home directory [puppet] - 10https://gerrit.wikimedia.org/r/824299 (https://phabricator.wikimedia.org/T314936) (owner: 10Andrea Denisse) [17:08:26] !log hashar@deploy1002 Started deploy [integration/docroot@1aca57b]: doc: update links from /mw-tools-scap/ to /scap/ - T315541 [17:08:29] T315541: scap documentation is no more generated - https://phabricator.wikimedia.org/T315541 [17:08:35] !log hashar@deploy1002 Finished deploy [integration/docroot@1aca57b]: doc: update links from /mw-tools-scap/ to /scap/ - T315541 (duration: 00m 09s) [17:08:36] (03CR) 10Andrea Denisse: [C: 03+2] netmon: Set correct username/groupname mappings for Rancid [puppet] - 10https://gerrit.wikimedia.org/r/824286 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [17:09:12] (03CR) 10Andrea Denisse: [C: 03+2] netmon: Set correct username/groupname mappings for LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/824284 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [17:09:35] (03CR) 10Andrea Denisse: [C: 03+2] C:rancid: Drop unneeded dependencies [puppet] - 10https://gerrit.wikimedia.org/r/824417 (https://phabricator.wikimedia.org/T314936) (owner: 10Jbond) [17:09:46] (03PS3) 10Andrea Denisse: C:rancid: Drop unneeded 
dependencies [puppet] - 10https://gerrit.wikimedia.org/r/824417 (https://phabricator.wikimedia.org/T314936) (owner: 10Jbond) [17:10:08] (03CR) 10Andrea Denisse: "LGTM thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/824417 (https://phabricator.wikimedia.org/T314936) (owner: 10Jbond) [17:10:13] (03CR) 10David Caro: Modify maintain-dbusers.py to call the rest-api service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:10:14] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10Papaul) yes if we can make a new one works for config I that will be great . here is the HW raid setting that i have for the server ` Virtual... [17:10:16] (03CR) 10Andrea Denisse: [V: 03+2 C: 03+2] C:rancid: Drop unneeded dependencies [puppet] - 10https://gerrit.wikimedia.org/r/824417 (https://phabricator.wikimedia.org/T314936) (owner: 10Jbond) [17:10:24] (03CR) 10Cathal Mooney: [C: 03+2] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/824495 (https://phabricator.wikimedia.org/T315541) (owner: 10Hashar) [17:15:35] (03PS2) 10Samtar: CommonSettings: Load Phonos extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824294 (https://phabricator.wikimedia.org/T314294) [17:16:18] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-stretch2001.codfw.wmnet with OS bullseye [17:16:23] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-stretch2001.codfw.wmnet with OS bullseye exe... 
[17:17:20] PROBLEM - Host ms-be2067.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:20:08] 10SRE, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic-Icebox: Geoip lookup - Misidentifying country due to travelling - https://phabricator.wikimedia.org/T175691 (10nshahquinn-wmf) The main idea here is to regularly expire or refresh GeoIP cookies, so I'd say this is the same as T12... [17:20:31] 10SRE, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic-Icebox: Geoip lookup - Misidentifying country due to travelling - https://phabricator.wikimedia.org/T175691 (10nshahquinn-wmf) [17:23:44] RECOVERY - Host ms-be2067.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.81 ms [17:23:54] RECOVERY - HP RAID on ms-be2028 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:31:33] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container version to 2022-08-18-132255-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/824506 (owner: 10BryanDavis) [17:32:45] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Failed disk in ms-be2028 - https://phabricator.wikimedia.org/T315213 (10Papaul) 05Open→03Resolved Drive replaced from a decom server. [17:36:24] (03Merged) 10jenkins-bot: developer-portal: Bump container version to 2022-08-18-132255-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/824506 (owner: 10BryanDavis) [17:38:52] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10Ottomata) @papaul, I asked @RobH in #wikimedia-dcops on IRC: > i think the recipe is detecting the SSD as /dev/sdb and the HDD as /dev/sda yeah... 
[17:45:17] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Papaul) When the drive is removed from the server the IDRAC detected it and when it is re-placed back, the IDRAC detected it as well but the controller doesn't ` Drive 0 is install... [17:46:51] !log dancy@deploy1002 backport aborted: (duration: 00m 21s) [17:48:58] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-stretch2001.codfw.wmnet with OS bullseye [17:49:07] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host kafka-stretch2001.codfw.wmnet with OS bullseye [17:52:19] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:52:26] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:52:45] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:52:51] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:53:38] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:53:56] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:54:30] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:56:52] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host kafka-stretch2001.codfw.wmnet with OS bullseye [17:56:58] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - 
https://phabricator.wikimedia.org/T314160 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host kafka-stretch2001.codfw.wmnet with OS bullseye execu... [17:59:33] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10Papaul) @Ottomata thanks will look into it once i am done with this Dell call. [18:00:03] (03PS1) 10Hashar: doc: properly redirect back compat URLs [puppet] - 10https://gerrit.wikimedia.org/r/824542 (https://phabricator.wikimedia.org/T315541) [18:00:04] ^demon and dancy: Your horoscope predicts another unfortunate MediaWiki train - Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220818T1800). [18:01:48] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:32] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10RobH) I've gone ahead and checked and followed the comments posted on T297913#8041258 and it is still setting the SSDs as SDB in the installer.... [18:06:21] ^demon: you around? [18:07:31] !log Testing stashbot behavior #1 T315444 [18:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:36] T315444: `scap backport` should include phabricator task in SAL messages - https://phabricator.wikimedia.org/T315444 [18:07:38] (03PS2) 10Hashar: doc: properly redirect back compat URLs [puppet] - 10https://gerrit.wikimedia.org/r/824542 (https://phabricator.wikimedia.org/T315541) [18:08:04] !log Testing stashbot behavior #2. 
T315444, T314613 [18:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:09] T314613: Scap backport: Notify on irc when change has been deployed to mwdebug - https://phabricator.wikimedia.org/T314613 [18:08:40] Rolling the train [18:08:52] (03PS1) 10TrainBranchBot: group2 wikis to 1.39.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824543 (https://phabricator.wikimedia.org/T314186) [18:08:54] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.39.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824543 (https://phabricator.wikimedia.org/T314186) (owner: 10TrainBranchBot) [18:09:37] (03Merged) 10jenkins-bot: group2 wikis to 1.39.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824543 (https://phabricator.wikimedia.org/T314186) (owner: 10TrainBranchBot) [18:10:08] (03PS1) 10Andrew Bogott: backy2: start backing up recommendation-api vms on cloudbackup1004 [puppet] - 10https://gerrit.wikimedia.org/r/824544 (https://phabricator.wikimedia.org/T302535) [18:11:55] (03CR) 10Herron: [C: 03+1] "LGTM, group => 'librenms' would probably be ok too" [puppet] - 10https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T315393) (owner: 10Andrea Denisse) [18:13:39] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.39.0-wmf.25 refs T314186 [18:13:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:13:45] T314186: 1.39.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T314186 [18:14:33] (03CR) 10Andrew Bogott: [C: 03+2] backy2: start backing up recommendation-api vms on cloudbackup1004 [puppet] - 10https://gerrit.wikimedia.org/r/824544 (https://phabricator.wikimedia.org/T302535) (owner: 10Andrew Bogott) [18:14:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:14:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: 
apply [18:15:02] (03PS1) 10Brennen Bearnes: phabricator: remove phab_deploy_ensure_config_ownership.sh [puppet] - 10https://gerrit.wikimedia.org/r/824547 (https://phabricator.wikimedia.org/T313953) [18:15:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:17:25] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-stretch2001.codfw.wmnet with OS bullseye [18:17:25] (03PS2) 10Brennen Bearnes: phabricator: remove phab_deploy_ensure_config_ownership.sh [puppet] - 10https://gerrit.wikimedia.org/r/824547 (https://phabricator.wikimedia.org/T313953) [18:17:31] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host kafka-stretch2001.codfw.wmnet with OS bullseye [18:19:01] (03CR) 10Brennen Bearnes: "This change is ready for review." [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/822688 (https://phabricator.wikimedia.org/T313953) (owner: 10Brennen Bearnes) [18:21:10] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10RobH) >>! In T314160#8166577, @RobH wrote: > I've gone ahead and checked and followed the comments posted on T297913#8041258 and it is still set... 
[18:30:16] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10RobH) 05Open→03Resolved [18:30:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) [18:30:57] (03CR) 10Andrew Bogott: P:systemd::timesyncd: allow overriding the protectsystem systemd param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824527 (https://phabricator.wikimedia.org/T310643) (owner: 10Jbond) [18:34:13] (03CR) 10Gehel: [C: 03+2] elastic: upgrade to 7.10.2-2 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/824306 (https://phabricator.wikimedia.org/T299226) (owner: 10Ryan Kemper) [18:34:52] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:34:57] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10RobH) [18:35:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10RobH) In double checking the checklist steps, I can see that these haven't yet been received in on the coupa PO, so I unchecked that box in the... 
[18:36:49] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-stretch2001.codfw.wmnet with reason: host reimage [18:39:28] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10RobH) [18:40:28] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-stretch2001.codfw.wmnet with reason: host reimage [18:52:38] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [18:55:25] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-stretch2001.codfw.wmnet with OS bullseye [18:55:30] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host kafka-stretch2001.codfw.wmnet with OS bullseye compl... 
[18:55:53] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10RobH) [18:57:17] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS bullseye [18:57:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye [18:58:03] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1006.eqiad.wmnet with OS bullseye [18:58:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata10... 
[18:59:21] (03PS1) 10Jdrewniak: Set initial-zoom via JavaScript to avoid font-scaling issue in iPad [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824433 (https://phabricator.wikimedia.org/T311795) [19:00:00] (03PS1) 10Ryan Kemper: elastic: add elastic710 comp for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/824551 [19:00:10] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1007.eqiad.wmnet with OS bullseye [19:00:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye [19:00:40] (03PS2) 10Ryan Kemper: elastic: add elastic710 comp for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/824551 (https://phabricator.wikimedia.org/T299226) [19:01:40] (03CR) 10Gehel: elastic: add elastic710 comp for bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824551 (https://phabricator.wikimedia.org/T299226) (owner: 10Ryan Kemper) [19:03:14] (03PS3) 10Ryan Kemper: elastic: add elastic710 comp for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/824551 (https://phabricator.wikimedia.org/T299226) [19:04:00] (03CR) 10Gehel: [C: 04-1] elastic: add elastic710 comp for bullseye (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/824551 (https://phabricator.wikimedia.org/T299226) (owner: 10Ryan Kemper) [19:04:49] (03CR) 10Tchanders: [C: 03+1] Deploy partial action blocks to cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824395 (https://phabricator.wikimedia.org/T315525) (owner: 10STran) [19:07:43] (03PS4) 10Ryan Kemper: elastic: add elastic710 comp for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/824551 (https://phabricator.wikimedia.org/T299226) [19:09:44] (03CR) 10Gehel: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/824551 (https://phabricator.wikimedia.org/T299226) (owner: 10Ryan Kemper) [19:09:55] (03CR) 10Ryan Kemper: [C: 03+2] elastic: add elastic710 comp for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/824551 (https://phabricator.wikimedia.org/T299226) (owner: 10Ryan Kemper) [19:10:34] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [19:12:50] RECOVERY - Check systemd state on mw2339 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:16:02] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dumpsdata1007.eqiad.wmnet with reason: host reimage [19:19:06] (03PS1) 10Bking: wdqs: add bking as contact for wdqs alerts [puppet] - 10https://gerrit.wikimedia.org/r/824553 (https://phabricator.wikimedia.org/T313095) [19:19:39] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dumpsdata1007.eqiad.wmnet with reason: host reimage [19:26:33] (03PS1) 10Bking: elastic: enable ES7.10 in relforge env [puppet] - 10https://gerrit.wikimedia.org/r/824555 (https://phabricator.wikimedia.org/T315604) [19:34:45] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dumpsdata1007.eqiad.wmnet with OS bullseye [19:34:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye completed: - dumpsdata1007 (**PASS*... 
[19:44:41] (03PS1) 10Aaron Schulz: Switch $wgChronologyProtectorStash to "mcrouter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824556 (https://phabricator.wikimedia.org/T314453) [19:45:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) [19:47:42] !log temporarily disable puppet on an-master100* while applying change in test cluster - T312858 [19:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:47] T312858: New airflow instance related to Image Suggestion Jobs - https://phabricator.wikimedia.org/T312858 [19:48:11] 10SRE, 10Infrastructure-Foundations: icinga raid montioring broken for H750 controllers - https://phabricator.wikimedia.org/T315608 (10RobH) [19:48:26] 10SRE, 10Infrastructure-Foundations, 10observability: icinga raid montioring broken for H750 controllers - https://phabricator.wikimedia.org/T315608 (10RobH) [19:48:27] (03CR) 10Ottomata: [C: 03+2] Add missing airflow service users to yarn's production queue [puppet] - 10https://gerrit.wikimedia.org/r/824241 (https://phabricator.wikimedia.org/T312858) (owner: 10Xcollazo) [19:49:35] 10SRE, 10Infrastructure-Foundations, 10observability: icinga raid montioring broken for H750 controllers - https://phabricator.wikimedia.org/T315608 (10RobH) T308027 tracks the private repo deployment, but I didn't see anything to track the fix for icinga monitoring for the new perc h750 controllers. I've a... 
[19:49:46] 10SRE, 10Infrastructure-Foundations, 10observability: icinga raid montioring broken for H750 controllers - https://phabricator.wikimedia.org/T315608 (10RobH) a:05MoritzMuehlenhoff→03None [19:57:08] !log re-enable puppet on an-master* [19:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:00] (03CR) 10Ahmon Dancy: scap: add permission mangling, reorder checks (031 comment) [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/822688 (https://phabricator.wikimedia.org/T313953) (owner: 10Brennen Bearnes) [20:00:04] brennen: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220818T2000). [20:00:05] Tran, zabe, koi, and jan_drewniak: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:19] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1006.eqiad.wmnet with OS bullseye [20:00:22] 👋 [20:00:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye [20:00:28] o/ [20:00:52] (03PS3) 10Brennen Bearnes: scap: add permission mangling, reorder checks [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/822688 (https://phabricator.wikimedia.org/T313953) [20:00:54] o/ [20:01:04] o/ [20:01:10] o/ [20:01:16] (03CR) 10Brennen Bearnes: scap: add permission mangling, reorder checks (031 comment) [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/822688 (https://phabricator.wikimedia.org/T313953) (owner: 10Brennen Bearnes) [20:01:40] o/ [20:02:23] cool, looks like almost everyone's here zabe -- you around? 
[20:02:52] Hi o/ I'll be testing Tran's patch [20:03:03] (03CR) 10Ahmon Dancy: [C: 03+1] "LGTM" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/822688 (https://phabricator.wikimedia.org/T313953) (owner: 10Brennen Bearnes) [20:03:09] Tchanders: okie doke, we'll let you know when it's ready [20:04:46] (03PS2) 10Brennen Bearnes: Deploy partial action blocks to cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824395 (https://phabricator.wikimedia.org/T315525) (owner: 10STran) [20:07:59] (as a note for folks getting their patches deployed, we're also testing some scap changes so thank you in advance for bearing with us :)) [20:08:35] (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824395 (https://phabricator.wikimedia.org/T315525) (owner: 10STran) [20:09:24] (03Merged) 10jenkins-bot: Deploy partial action blocks to cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824395 (https://phabricator.wikimedia.org/T315525) (owner: 10STran) [20:09:56] !log brennen@deploy1002 Started scap: [[gerrit:824395|Deploy partial action blocks to cswiki (T315525)]] [20:10:00] T315525: Deploy action blocks to pilot wikis - https://phabricator.wikimedia.org/T315525 [20:10:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-stretch2002.codfw.wmnet with OS bullseye [20:10:37] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-stretch2002.codfw.wmnet with OS bullseye [20:13:06] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Degraded RAID on ms-be1054 - https://phabricator.wikimedia.org/T315480 (10wiki_willy) a:03Jclark-ctr Hi @Jclark-ctr - this one shows a purchase date of August 7, 2019. 
Technically, it's after the 3yr warranty, but can you try submitting a RMA with to se... [20:15:54] Tchanders: patch is on mwdebug1001 if you want to test [20:16:17] brennen: thanks, testing [20:17:01] brennen: looks good - thank you! [20:17:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:17:21] cool, syncing [20:18:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:18:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:19:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:19:20] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Papaul) I have a ticket open with Dell to send me a backplane. The server is back online for now. Thanks [20:20:42] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dumpsdata1006.eqiad.wmnet with OS bullseye [20:20:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors: - dumpsdata10... [20:21:10] 10SRE, 10Performance-Team (Radar): Set MW appserver scaling_governor to performance - https://phabricator.wikimedia.org/T315398 (10ori) Congratulations, this is a huge win! I think we should dig deeper to see if we can get the same or similar performance benefit, but waste less power. The intel_pstate docs st... [20:24:41] does anyone know how I can turn a flow board into wikitext? 
[20:24:47] what maint script [20:27:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) a:05RobH→03Cmjohnson When attempting to install dumpsdata1006, the NIC is not detecting correctly. Can someone onsite unseat and reseat the nic to see if that... [20:28:40] Maybe James_F knows how to turn a flow board to text [20:29:12] !log brennen@deploy1002 Finished scap: [[gerrit:824395|Deploy partial action blocks to cswiki (T315525)]] (duration: 19m 16s) [20:29:17] T315525: Deploy action blocks to pilot wikis - https://phabricator.wikimedia.org/T315525 [20:29:27] zabe: about? [20:29:38] (03PS2) 10Brennen Bearnes: Start writing to cuc_actor on s8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824152 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:29:56] Tchanders: your/STran's change should be live everywhere now :) [20:30:07] Amir1: Special:ChangeContentModel can remove flow on a page. I think usually the current flow board is moved to a sub-page as an archive first. I don't know of a Flow content -> wikitext thing, but maybe there is one somewhere. [20:30:09] Amir1: https://www.mediawiki.org/wiki/Structured_Discussions/FAQ#I_don't_want_Structured_Discussions_on_my_talk_page._Is_it_possible_to_change_it_back_to_the_old_wikitext_style? [20:30:22] Delete it. [20:30:24] Thank you so much [20:30:29] (03PS2) 10Brennen Bearnes: Allow admin to grant/revoke "transwiki" group on zh(wikt|wb|wq|ws) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816239 (https://phabricator.wikimedia.org/T313657) (owner: 10Stang) [20:30:47] James_F: that's the fun part, non-existent pages are set to flow board in mediawiki.org [20:30:54] koi: going ahead with yours [20:30:54] https://www.mediawiki.org/wiki/Talk:Reading/Web/Desktop_Improvements/Archive6 [20:31:08] Not for sysops. 
[20:31:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816239 (https://phabricator.wikimedia.org/T313657) (owner: 10Stang) [20:31:49] bd808: I swear, I have at least five different flow docs page open, even checking history [20:31:54] let me give that a try [20:32:15] (03Merged) 10jenkins-bot: Allow admin to grant/revoke "transwiki" group on zh(wikt|wb|wq|ws) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/816239 (https://phabricator.wikimedia.org/T313657) (owner: 10Stang) [20:32:25] nope, deleting didn't work https://www.mediawiki.org/wiki/Talk:Reading/Web/Desktop_Improvements/Archive5 [20:32:46] !log brennen@deploy1002 Started scap: [[gerrit:816239|Allow admin to grant/revoke "transwiki" group on zh(wikt|wb|wq|ws) (T313657)]] [20:32:50] T313657: Allow admin to grant/revoke "transwiki" group on zh(wikt|wb|wq|ws) - https://phabricator.wikimedia.org/T313657 [20:33:16] brennen, I thought I could test this one [20:33:29] aha, change content model worked, awesome [20:33:36] koi: on mwdebug1001 [20:33:44] looking [20:35:31] brennen: visited these four sites at special:usergrouprights and LGTM [20:35:51] koi: cool, syncing [20:36:14] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:37:32] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-stretch2002.codfw.wmnet with OS bullseye [20:37:37] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-stretch2002.codfw.wmnet with OS bullseye exe... 
[20:39:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:39:55] !log brennen@deploy1002 Finished scap: [[gerrit:816239|Allow admin to grant/revoke "transwiki" group on zh(wikt|wb|wq|ws) (T313657)]] (duration: 07m 09s) [20:39:59] T313657: Allow admin to grant/revoke "transwiki" group on zh(wikt|wb|wq|ws) - https://phabricator.wikimedia.org/T313657 [20:40:04] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-stretch2002.codfw.wmnet with OS bullseye [20:40:09] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-stretch2002.codfw.wmnet with OS bullseye [20:40:14] (03PS2) 10Brennen Bearnes: mrwiktionary: Set import source [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824308 (https://phabricator.wikimedia.org/T314939) (owner: 10Stang) [20:40:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:40:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:40:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824308 (https://phabricator.wikimedia.org/T314939) (owner: 10Stang) [20:41:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:42:14] !log brennen@deploy1002 Started scap: [[gerrit:824308|mrwiktionary: Set import source (T314939)]] [20:42:54] koi: mwdebug1001 has https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/824308 now [20:43:12] ...assuming there's anything testable there [20:43:32] I could not test this patch as I don't have sufficient permission on that site :( [20:43:46] but I thought it would be fine to sync it [20:44:47] ack, going ahead [20:45:39] !log 
pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-stretch2002.codfw.wmnet with reason: host reimage [20:46:22] koi: please make sure you can test patches for backport going forward. This one seems small and has consensus, but deployers depend on this step, generally. [20:46:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:47:08] thcipriani: got it [20:47:10] A bit late but just wanted to thank you for deploying! 🙇‍♂️ Is there any need to test it post-sync? I don't have permissions and I think Tchanders has stepped away (it's late in her day) [20:47:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:47:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:47:29] thcipriani: in koi's defense, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/824308 cannot be tested without advanced on-wiki permissions (sysop, importer or equivalent) [20:47:58] Tran: I checked Special:Block at cswiki, and I can see it there! [20:48:04] 🎉 [20:48:10] great thanks! 
[20:48:26] no problem [20:48:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:49:02] !log brennen@deploy1002 Finished scap: [[gerrit:824308|mrwiktionary: Set import source (T314939)]] (duration: 06m 48s) [20:49:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-stretch2002.codfw.wmnet with reason: host reimage [20:50:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824433 (https://phabricator.wikimedia.org/T311795) (owner: 10Jdrewniak) [20:56:47] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/824555 (https://phabricator.wikimedia.org/T315604) (owner: 10Bking) [20:57:08] jan_drewniak: waiting on CI for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/824433 - will overrun the end of the window a bit. [20:57:49] brennen: np [21:00:22] (03CR) 10Bking: [C: 03+2] elastic: enable ES7.10 in relforge env [puppet] - 10https://gerrit.wikimedia.org/r/824555 (https://phabricator.wikimedia.org/T315604) (owner: 10Bking) [21:03:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-stretch2002.codfw.wmnet with OS bullseye [21:03:27] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-stretch2002.codfw.wmnet with OS bullseye com... 
[21:05:37] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,elasticsearch_7@relforge-eqiad-small-alpha.service,elasticsearch_7@relforge-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:09:18] (03Merged) 10jenkins-bot: Set initial-zoom via JavaScript to avoid font-scaling issue in iPad [core] (wmf/1.39.0-wmf.25) - 10https://gerrit.wikimedia.org/r/824433 (https://phabricator.wikimedia.org/T311795) (owner: 10Jdrewniak) [21:09:51] !log brennen@deploy1002 Started scap: [[gerrit:824433|Set initial-zoom via JavaScript to avoid font-scaling issue in iPad (T311795)]] [21:09:55] T311795: [Bug] Safari doesn't allow font-size scaling on iPad for users viewing legacy and modern Vector - https://phabricator.wikimedia.org/T311795 [21:10:32] jan_drewniak: on mwdebug1001 [21:13:45] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_7@relforge-eqiad-small-alpha.service,elasticsearch_7@relforge-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:13:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:14:09] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: elastic 7 upgrade [21:14:25] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: elastic 7 upgrade [21:14:34] (guessing this is... not particularly easy to test?) [21:14:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:14:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:14:57] brennen: got an ipad emulator here :P [21:15:08] nice. 
:) [21:15:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:16:06] brennen: ok good to sync! [21:16:19] cool, syncing [21:20:07] !log brennen@deploy1002 Finished scap: [[gerrit:824433|Set initial-zoom via JavaScript to avoid font-scaling issue in iPad (T311795)]] (duration: 10m 16s) [21:20:11] T311795: [Bug] Safari doesn't allow font-size scaling on iPad for users viewing legacy and modern Vector - https://phabricator.wikimedia.org/T311795 [21:20:39] !log end of UTC late backport and config window [21:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:50] good night awesome deployers <3 [21:24:46] g'night xSavitar. :) [21:30:58] (03PS1) 10Eevans: Add user eevans to ops group [puppet] - 10https://gerrit.wikimedia.org/r/824567 [21:32:36] (03PS1) 10Ryan Kemper: bullseye: add thirdparty/elasticsearch-curator5 [puppet] - 10https://gerrit.wikimedia.org/r/824568 (https://phabricator.wikimedia.org/T315604) [21:33:12] (03CR) 10Bking: [C: 03+1] bullseye: add thirdparty/elasticsearch-curator5 [puppet] - 10https://gerrit.wikimedia.org/r/824568 (https://phabricator.wikimedia.org/T315604) (owner: 10Ryan Kemper) [21:43:40] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10Papaul) [21:44:03] 10SRE, 10ops-codfw, 10DC-Ops, 10Data Engineering Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10Papaul) @Ottomata all yours thanks for the help [21:47:28] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes2024 [21:48:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes2024 [21:50:22] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [22:02:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[22:02:49] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [22:05:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:09:08] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2024.mgmt.codfw.wmnet with reboot policy FORCED [22:09:41] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 377 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:12:01] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:15:22] (03PS1) 10Bking: bullseye: add thirdparty/elasticsearch-curator5 [puppet] - 10https://gerrit.wikimedia.org/r/824569 (https://phabricator.wikimedia.org/T315604) [22:15:59] (03CR) 10CI reject: [V: 04-1] bullseye: add thirdparty/elasticsearch-curator5 [puppet] - 10https://gerrit.wikimedia.org/r/824569 (https://phabricator.wikimedia.org/T315604) (owner: 10Bking) [22:16:19] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2024.mgmt.codfw.wmnet with reboot policy FORCED [22:24:44] !log xcollazo@deploy1002 Started deploy [airflow-dags/platform_eng@ff0a0e2]: (no justification provided) [22:25:03] !log xcollazo@deploy1002 Finished deploy [airflow-dags/platform_eng@ff0a0e2]: (no justification provided) (duration: 00m 19s) [22:25:44] !log Rolling the train back to group1 due to T315620 [22:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:47] T315620: Going to an #Anchor link, viewport jumps back to top of page (all wikis, all skins) - https://phabricator.wikimedia.org/T315620 [22:26:19] (03PS1) 10TrainBranchBot: 
group2 wikis to 1.39.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/824571 (https://phabricator.wikimedia.org/T314186)
[22:26:21] (CR) TrainBranchBot: [C: +2] group2 wikis to 1.39.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/824571 (https://phabricator.wikimedia.org/T314186) (owner: TrainBranchBot)
[22:27:14] (Merged) jenkins-bot: group2 wikis to 1.39.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/824571 (https://phabricator.wikimedia.org/T314186) (owner: TrainBranchBot)
[22:31:25] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.39.0-wmf.23 refs T314186
[22:31:29] T314186: 1.39.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T314186
[22:31:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:32:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:32:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:32:52] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[22:33:12] (PS1) Cathal Mooney: Add include statement for 2001:df2:e500:fe07::/64 reverse entries [dns] - https://gerrit.wikimedia.org/r/824572 (https://phabricator.wikimedia.org/T315429)
[22:33:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:33:43] dancy: I'm pretty sure it's the change I linked, just because no one complained about this until an hour ago. And this seems problematic enough group1 wikis would've complained too
[22:34:11] I can still reproduce it at https://commons.wikimedia.org/wiki/Commons:Village_pump#Public_domain_works_we_should_have
[22:35:27] That does track.
[22:35:36] OK. I'll revert that backport
[22:35:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:36:02] legoktm: that would make a lot more sense
[22:36:23] jan_drewniak: hi
[22:36:33] !log dancy@deploy1002 backport aborted: (duration: 00m 12s)
[22:37:04] (PS1) TrainBranchBot: Revert "Set initial-zoom via JavaScript to avoid font-scaling issue in iPad" [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824573
[22:37:46] yeah, that'd make sense.
[22:38:09] (CR) TrainBranchBot: [C: +2] "Approved via scap backport" [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824573 (owner: TrainBranchBot)
[22:40:17] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[22:43:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:49:24] (CR) Dzahn: [C: +2] phabricator: remove phab_deploy_ensure_config_ownership.sh [puppet] - https://gerrit.wikimedia.org/r/824547 (https://phabricator.wikimedia.org/T313953) (owner: Brennen Bearnes)
[22:52:00] (CR) Dzahn: [C: +2] "deployed in prod. the sudo rule for scap has been removed: -phab-deploy ALL=(root) NOPASSWD: /usr/local/sbin/phab_deploy_ensure_config_own" [puppet] - https://gerrit.wikimedia.org/r/824547 (https://phabricator.wikimedia.org/T313953) (owner: Brennen Bearnes)
[22:53:23] !log phab1001, phab2001: sudo rm /usr/local/sbin/phab_deploy_ensure_config_ownership (follow-up gerrit:824547 T313953)
[22:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:28] T313953: Scap3-ify Phabricator - https://phabricator.wikimedia.org/T313953
[22:54:11] (CR) Dzahn: [C: +2] "file wasn't absented in puppet but I deleted it manually on 2 servers (not in devtools)" [puppet] - https://gerrit.wikimedia.org/r/824547 (https://phabricator.wikimedia.org/T313953) (owner: Brennen Bearnes)
[22:56:05] (Merged) jenkins-bot: Revert "Set initial-zoom via JavaScript to avoid font-scaling issue in iPad" [core] (wmf/1.39.0-wmf.25) - https://gerrit.wikimedia.org/r/824573 (owner: TrainBranchBot)
[22:57:02] !log dancy@deploy1002 Started scap: Backport for [[gerrit:824573]] Revert "Set initial-zoom via JavaScript to avoid font-scaling issue in iPad"
[22:59:27] (CR) Dzahn: "Currently what I want is to apply the phabricator role phab1004 and phab2002 but -without- things like vcs / lvs. That's why I don't want " [puppet] - https://gerrit.wikimedia.org/r/824412 (owner: Jbond)
[23:02:46] (CR) Dzahn: [C: -1] "not yet but I can use it once the old hosts are shut down which is WIP" [puppet] - https://gerrit.wikimedia.org/r/824412 (owner: Jbond)
[23:03:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:05:54] (CR) Cwhite: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/822422 (https://phabricator.wikimedia.org/T257861) (owner: Cwhite)
[23:06:53] (CR) Dzahn: [C: +1] "script matches what was removed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/824547 go ahead with this (before next deploy, old" [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/822688 (https://phabricator.wikimedia.org/T313953) (owner: Brennen Bearnes)
[23:10:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:10:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:11:00] (CR) Dzahn: "above your new code it creates /var/log/librenms/ dir. is it intentional that this new log file is not inside that (/var/log/librenms.log " [puppet] - https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T315393) (owner: Andrea Denisse)
[23:12:29] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:824573]] Revert "Set initial-zoom via JavaScript to avoid font-scaling issue in iPad" (duration: 15m 27s)
[23:14:18] Rolling train forward again
[23:14:30] (PS1) TrainBranchBot: group2 wikis to 1.39.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/824575 (https://phabricator.wikimedia.org/T314186)
[23:14:32] (CR) TrainBranchBot: [C: +2] group2 wikis to 1.39.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/824575 (https://phabricator.wikimedia.org/T314186) (owner: TrainBranchBot)
[23:15:16] (Merged) jenkins-bot: group2 wikis to 1.39.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/824575 (https://phabricator.wikimedia.org/T314186) (owner: TrainBranchBot)
[23:16:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:19:17] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.39.0-wmf.25 refs T314186
[23:19:20] T314186: 1.39.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T314186
[23:19:56] (CR) Dzahn: "alright, thanks Arnold. did you run authdns-update script as well? I saw there were some (entirely unrelated) conflicts in the DNS repo.." [dns] - https://gerrit.wikimedia.org/r/824244 (https://phabricator.wikimedia.org/T296713) (owner: AOkoth)
[23:22:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:24:01] PROBLEM - Check systemd state on mw2397 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:29:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:29:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:33:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:33:14] (CR) Dzahn: [V: +1 C: +2] phabricator: move lvs::realserver inclusion to profile, create use_lvs parameter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/823755 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)