[00:30:35] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:48] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/991816 [00:38:50] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/991816 (owner: 10TrainBranchBot) [00:42:27] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:07] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:57:28] (03CR) 10CI reject: [V: 04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/991816 (owner: 10TrainBranchBot) [01:00:07] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/991816 (owner: 10TrainBranchBot) [01:32:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:03:57] PROBLEM - Check systemd state on phab1004 is CRITICAL: CRITICAL - degraded: The following units failed: phabricator_task_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:21:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:39:20] (JobUnavailable) firing: Reduced 
availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:51:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:09:20] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:54:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T352010)', diff saved to https://phabricator.wikimedia.org/P55116 and previous config saved to /var/cache/conftool/dbconfig/20240122-045445-ladsgroup.json [04:54:52] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [05:09:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P55117 and previous config saved to /var/cache/conftool/dbconfig/20240122-050952-ladsgroup.json [05:24:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P55118 and previous config saved to /var/cache/conftool/dbconfig/20240122-052458-ladsgroup.json [05:40:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T352010)', diff saved to https://phabricator.wikimedia.org/P55119 and previous config saved to /var/cache/conftool/dbconfig/20240122-054005-ladsgroup.json [05:40:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with 
reason: Maintenance [05:40:11] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [05:40:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [05:51:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:56:42] (03CR) 10Slavina Stefanova: [C: 03+1] maintain-dbusers: Fix passing parameters to delete API call [puppet] - 10https://gerrit.wikimedia.org/r/991924 (owner: 10Majavah) [06:00:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1157.eqiad.wmnet with reason: Maintenance [06:00:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1157.eqiad.wmnet with reason: Maintenance [06:05:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2109.codfw.wmnet with reason: Maintenance [06:05:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2109.codfw.wmnet with reason: Maintenance [06:05:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2109 (T354336)', diff saved to https://phabricator.wikimedia.org/P55120 and previous config saved to /var/cache/conftool/dbconfig/20240122-060529-marostegui.json [06:05:47] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [06:06:22] (03PS2) 10KartikMistry: Update MinT to 2024-01-22-053144-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/991578 (https://phabricator.wikimedia.org/T355303) [06:07:59] (03PS1) 10Marostegui: db1187: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/991983 (https://phabricator.wikimedia.org/T354506) [06:08:12] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1187 T354506', diff saved to https://phabricator.wikimedia.org/P55121 and previous config saved to /var/cache/conftool/dbconfig/20240122-060811-marostegui.json [06:08:16] T354506: Upgrade s6 hosts to Bookworm - https://phabricator.wikimedia.org/T354506 [06:09:16] (03CR) 10Marostegui: [C: 03+2] db1187: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/991983 (https://phabricator.wikimedia.org/T354506) (owner: 10Marostegui) [06:10:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1187.eqiad.wmnet with OS bookworm [06:13:46] (03CR) 10Santhosh: [C: 03+1] Update MinT to 2024-01-22-053144-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/991578 (https://phabricator.wikimedia.org/T355303) (owner: 10KartikMistry) [06:21:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:23:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1187.eqiad.wmnet with reason: host reimage [06:25:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T354336)', diff saved to https://phabricator.wikimedia.org/P55122 and previous config saved to /var/cache/conftool/dbconfig/20240122-062535-marostegui.json [06:25:40] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [06:26:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1187.eqiad.wmnet with reason: host reimage [06:35:37] marostegui: OK to deploy MinT service? [06:39:26] OK. 
I'll go ahead :) [06:39:52] (03CR) 10KartikMistry: [C: 03+2] Update MinT to 2024-01-22-053144-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/991578 (https://phabricator.wikimedia.org/T355303) (owner: 10KartikMistry) [06:40:03] (03PS1) 10Marostegui: Revert "db1187: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/991854 [06:40:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P55123 and previous config saved to /var/cache/conftool/dbconfig/20240122-064041-marostegui.json [06:40:59] (03Merged) 10jenkins-bot: Update MinT to 2024-01-22-053144-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/991578 (https://phabricator.wikimedia.org/T355303) (owner: 10KartikMistry) [06:43:52] (03CR) 10Marostegui: [C: 03+2] Revert "db1187: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/991854 (owner: 10Marostegui) [06:46:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1187.eqiad.wmnet with OS bookworm [06:46:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55124 and previous config saved to /var/cache/conftool/dbconfig/20240122-064657-root.json [06:47:45] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Dumps-Generation, 10Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228 (10LSobanski) [06:47:57] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [06:49:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2169:3316 db2169:3317', diff saved to https://phabricator.wikimedia.org/P55125 and previous config saved to /var/cache/conftool/dbconfig/20240122-064929-marostegui.json [06:50:29] (03PS1) 10Marostegui: db2169: Disable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/991984 (https://phabricator.wikimedia.org/T354506) [06:51:43] (03CR) 10Marostegui: [C: 03+2] db2169: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/991984 (https://phabricator.wikimedia.org/T354506) (owner: 10Marostegui) [06:52:13] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [06:52:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2169.codfw.wmnet with OS bookworm [06:55:02] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [06:55:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P55126 and previous config saved to /var/cache/conftool/dbconfig/20240122-065548-marostegui.json [07:00:59] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:02:00] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [07:02:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55127 and previous config saved to /var/cache/conftool/dbconfig/20240122-070202-root.json [07:10:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T354336)', diff saved to https://phabricator.wikimedia.org/P55128 and previous config saved to /var/cache/conftool/dbconfig/20240122-071054-marostegui.json [07:10:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2127.codfw.wmnet with reason: Maintenance [07:11:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2127.codfw.wmnet with reason: Maintenance [07:11:18] !log marostegui@cumin1002 START - 
Cookbook sre.hosts.downtime for 2:00:00 on db2169.codfw.wmnet with reason: host reimage [07:11:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2127 (T354336)', diff saved to https://phabricator.wikimedia.org/P55129 and previous config saved to /var/cache/conftool/dbconfig/20240122-071117-marostegui.json [07:11:23] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [07:13:05] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [07:14:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2169.codfw.wmnet with reason: host reimage [07:17:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55130 and previous config saved to /var/cache/conftool/dbconfig/20240122-071707-root.json [07:20:27] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [07:22:52] (03PS1) 10Marostegui: Revert "db2169: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/991855 [07:24:47] (03PS1) 10Slyngshede: Fix any broken tests or code formatting before hooking up CI. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/992074 [07:28:09] !log Updated MinT to 2024-01-22-053144-production (T355303, T338608, T353510, T354666) [07:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:17] T355303: Adjust multiple model support on MinT test instance - https://phabricator.wikimedia.org/T355303 [07:28:18] T338608: Support requesting translations from a specific model in MinT - https://phabricator.wikimedia.org/T338608 [07:28:19] T353510: Enable Content and Section translation on some Wikipedias with potential to be supported with MinT using MADLAD-400 - https://phabricator.wikimedia.org/T353510 [07:28:19] T354666: Enable MADLAD-400 in MinT test instance for Wikipedia languages not supported by other services - https://phabricator.wikimedia.org/T354666 [07:30:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T354336)', diff saved to https://phabricator.wikimedia.org/P55131 and previous config saved to /var/cache/conftool/dbconfig/20240122-073025-marostegui.json [07:30:42] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [07:31:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55132 and previous config saved to /var/cache/conftool/dbconfig/20240122-073148-root.json [07:32:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55133 and previous config saved to /var/cache/conftool/dbconfig/20240122-073212-root.json [07:32:17] 10SRE, 10SRE-Access-Requests, 10User-ItamarWMDE: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (10ArthurTaylor) I have successfully logged in to mwmaint1002 - many thanks for the support here! 
[07:33:18] (03CR) 10Marostegui: [C: 03+2] Revert "db2169: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/991855 (owner: 10Marostegui) [07:33:34] (03CR) 10Slyngshede: [C: 03+1] "LGTM, thank you" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/991086 (https://phabricator.wikimedia.org/T355174) (owner: 10Majavah) [07:34:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55134 and previous config saved to /var/cache/conftool/dbconfig/20240122-073435-root.json [07:36:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2169.codfw.wmnet with OS bookworm [07:45:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P55135 and previous config saved to /var/cache/conftool/dbconfig/20240122-074532-marostegui.json [07:46:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55136 and previous config saved to /var/cache/conftool/dbconfig/20240122-074653-root.json [07:47:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55137 and previous config saved to /var/cache/conftool/dbconfig/20240122-074717-root.json [07:49:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55138 and previous config saved to /var/cache/conftool/dbconfig/20240122-074940-root.json [07:51:18] (03PS1) 10Muehlenhoff: Remove access for shubhankar [puppet] - 10https://gerrit.wikimedia.org/r/992075 [07:53:45] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Shubhankar Patankar out of all services on: 
2208 hosts [07:54:42] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Shubhankar Patankar out of all services on: 2208 hosts [07:55:37] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for shubhankar [puppet] - 10https://gerrit.wikimedia.org/r/992075 (owner: 10Muehlenhoff) [08:00:04] Amir1 and Urbanecm: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240122T0800) [08:00:05] xSavitar: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:39] o/ [08:00:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P55139 and previous config saved to /var/cache/conftool/dbconfig/20240122-080038-marostegui.json [08:01:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55140 and previous config saved to /var/cache/conftool/dbconfig/20240122-080158-root.json [08:02:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55141 and previous config saved to /var/cache/conftool/dbconfig/20240122-080222-root.json [08:02:30] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: user-runtime-dir@38373.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:04:33] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting server access for MFischer (WMF) - https://phabricator.wikimedia.org/T355395 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:04:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', 
diff saved to https://phabricator.wikimedia.org/P55142 and previous config saved to /var/cache/conftool/dbconfig/20240122-080445-root.json [08:05:32] (03CR) 10Muehlenhoff: [C: 03+2] Add mfischer to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/991789 (https://phabricator.wikimedia.org/T355395) (owner: 10Muehlenhoff) [08:09:05] I'll go ahead and deploy [08:09:24] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting server access for MFischer (WMF) - https://phabricator.wikimedia.org/T355395 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff @MFischer I've enabled your access, it takes 30 minutes until the change has been applied across all ou... [08:10:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by derick@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988403 (https://phabricator.wikimedia.org/T336004) (owner: 10D3r1ck01) [08:11:01] (03PS3) 10D3r1ck01: wmf-config: Remove unused wgCentralAuthTokenCacheType [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988403 (https://phabricator.wikimedia.org/T336004) [08:11:23] (03CR) 10D3r1ck01: [C: 03+2] wmf-config: Remove unused wgCentralAuthTokenCacheType [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988403 (https://phabricator.wikimedia.org/T336004) (owner: 10D3r1ck01) [08:12:05] (03Merged) 10jenkins-bot: wmf-config: Remove unused wgCentralAuthTokenCacheType [mediawiki-config] - 10https://gerrit.wikimedia.org/r/988403 (https://phabricator.wikimedia.org/T336004) (owner: 10D3r1ck01) [08:12:38] (03CR) 10Muehlenhoff: [C: 03+2] Deprecate system::role for IF services (batch one) [puppet] - 10https://gerrit.wikimedia.org/r/991786 (owner: 10Muehlenhoff) [08:15:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T354336)', diff saved to https://phabricator.wikimedia.org/P55143 and previous config saved to /var/cache/conftool/dbconfig/20240122-081545-marostegui.json [08:15:47] !log
marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2149.codfw.wmnet with reason: Maintenance [08:15:51] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [08:15:54] !log derick@deploy2002 Started scap: Backport for [[gerrit:988403|wmf-config: Remove unused wgCentralAuthTokenCacheType (T336004)]] [08:15:58] T336004: Recognize 4th cache service interface in MediaWiki (Migrate ConfirmEdit tokens from MainStash to mcrouter-primary-dc) - https://phabricator.wikimedia.org/T336004 [08:16:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2149.codfw.wmnet with reason: Maintenance [08:16:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T354336)', diff saved to https://phabricator.wikimedia.org/P55144 and previous config saved to /var/cache/conftool/dbconfig/20240122-081618-marostegui.json [08:17:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55145 and previous config saved to /var/cache/conftool/dbconfig/20240122-081703-root.json [08:17:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55146 and previous config saved to /var/cache/conftool/dbconfig/20240122-081727-root.json [08:17:38] (03CR) 10Muehlenhoff: [C: 03+2] profile::openstack::codfw1dev::db: Convert ferm::rule into firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/991752 (owner: 10Muehlenhoff) [08:19:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55147 and previous config saved to /var/cache/conftool/dbconfig/20240122-081950-root.json [08:20:52] PROBLEM - Check systemd state on arclamp1001 is 
CRITICAL: CRITICAL - degraded: The following units failed: arclamp_compress_logs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:56] !log derick@deploy2002 d3r1ck01 and derick: Backport for [[gerrit:988403|wmf-config: Remove unused wgCentralAuthTokenCacheType (T336004)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:26:57] good morning [08:27:00] T336004: Recognize 4th cache service interface in MediaWiki (Migrate ConfirmEdit tokens from MainStash to mcrouter-primary-dc) - https://phabricator.wikimedia.org/T336004 [08:27:07] I will shutdown and upgrade Gerrit in ~ 30 minutes [08:27:10] (03CR) 10Ayounsi: Add BGP to the contributing protocols for aggregate routes on CRs (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/975070 (https://phabricator.wikimedia.org/T351456) (owner: 10Cathal Mooney) [08:27:13] jouncebot: next [08:27:13] In 0 hour(s) and 32 minute(s): Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240122T0900) [08:27:15] ^ :) [08:27:37] hashar, good morning sir. I'm almost done deploying a config patch. [08:27:45] !log derick@deploy2002 d3r1ck01 and derick: Continuing with sync [08:28:05] no worries I will wait for the backport & config window to complete :-] [08:28:22] Thanks. We don't have much in this window. Just 1 patch :D [08:28:30] So it'll finish early [08:29:08] yeah no worries [08:29:21] and even if it needs extra time, it is fine delaying the Gerrit upgrade [08:29:45] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting server access for MFischer (WMF) - https://phabricator.wikimedia.org/T355395 (10Nahid) Thank you all. 
[08:32:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55148 and previous config saved to /var/cache/conftool/dbconfig/20240122-083208-root.json [08:32:45] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1048.eqiad.wmnet [08:33:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T354336)', diff saved to https://phabricator.wikimedia.org/P55149 and previous config saved to /var/cache/conftool/dbconfig/20240122-083319-marostegui.json [08:33:21] (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc1048 [puppet] - 10https://gerrit.wikimedia.org/r/990998 (owner: 10Effie Mouzeli) [08:33:23] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [08:33:36] (03PS2) 10Muehlenhoff: Switch Mediawiki main memcache clusters to puppet 7: mc2048 [puppet] - 10https://gerrit.wikimedia.org/r/990999 (owner: 10Effie Mouzeli) [08:34:09] !log derick@deploy2002 Finished scap: Backport for [[gerrit:988403|wmf-config: Remove unused wgCentralAuthTokenCacheType (T336004)]] (duration: 18m 15s) [08:34:13] T336004: Recognize 4th cache service interface in MediaWiki (Migrate ConfirmEdit tokens from MainStash to mcrouter-primary-dc) - https://phabricator.wikimedia.org/T336004 [08:34:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55150 and previous config saved to /var/cache/conftool/dbconfig/20240122-083454-root.json [08:35:07] !log UTC morning backport window done! [08:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:20] hashar, over to you! 
:) [08:35:30] :-] [08:37:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1048.eqiad.wmnet [08:38:07] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2048.codfw.wmnet [08:38:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1213:3316 db1213:3315', diff saved to https://phabricator.wikimedia.org/P55151 and previous config saved to /var/cache/conftool/dbconfig/20240122-083812-marostegui.json [08:38:14] 10SRE, 10Infrastructure-Foundations, 10netops: Firewall filter blocking traceroute in underlay QFX5120 EVPN - https://phabricator.wikimedia.org/T348120 (10ayounsi) Thinking a bit more about that, as the loopback is already on a private IP it can't be targeted directly, and packets with TTL=1 being sent to th... [08:39:08] (03PS1) 10Marostegui: db1213: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/992080 (https://phabricator.wikimedia.org/T354506) [08:39:46] (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc2048 [puppet] - 10https://gerrit.wikimedia.org/r/990999 (owner: 10Effie Mouzeli) [08:39:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1213.eqiad.wmnet with OS bookworm [08:40:22] (03CR) 10Marostegui: [C: 03+2] db1213: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/992080 (https://phabricator.wikimedia.org/T354506) (owner: 10Marostegui) [08:43:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2048.codfw.wmnet [08:44:58] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: remove legacy check_nagios_paging [puppet] - 10https://gerrit.wikimedia.org/r/991801 (https://phabricator.wikimedia.org/T321808) (owner: 10Filippo Giunchedi) [08:46:06] (03Abandoned) 10Muehlenhoff: Switch mc2043 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991284 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:46:29] (03CR) 10Filippo Giunchedi: 
[C: 03+2] pontoon: include profile::monitoring in base [puppet] - 10https://gerrit.wikimedia.org/r/991363 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [08:46:31] (03PS2) 10Muehlenhoff: airflow::instance: Pass web server port as an integer [puppet] - 10https://gerrit.wikimedia.org/r/990060 [08:47:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55152 and previous config saved to /var/cache/conftool/dbconfig/20240122-084713-root.json [08:48:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P55153 and previous config saved to /var/cache/conftool/dbconfig/20240122-084825-marostegui.json [08:50:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55154 and previous config saved to /var/cache/conftool/dbconfig/20240122-084959-root.json [08:50:06] (03PS1) 10Filippo Giunchedi: wikimedia.org: clean up ldap-icinga [dns] - 10https://gerrit.wikimedia.org/r/992083 (https://phabricator.wikimedia.org/T333615) [08:50:23] (03CR) 10Muehlenhoff: [C: 03+1] wikimedia.org: clean up ldap-icinga [dns] - 10https://gerrit.wikimedia.org/r/992083 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [08:53:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1213.eqiad.wmnet with reason: host reimage [08:53:37] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] icinga: remove ldap-icinga remnants [puppet] - 10https://gerrit.wikimedia.org/r/991364 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [08:53:53] (03CR) 10Volans: "It's not clear to me if, based on how the services are configured on the host, you need to manual handling of bird at all during reboots. 
" [cookbooks] - 10https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779) (owner: 10BCornwall) [08:55:05] (03CR) 10Majavah: [C: 03+2] Fix documentation generation [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/991086 (https://phabricator.wikimedia.org/T355174) (owner: 10Majavah) [08:55:22] (03CR) 10Filippo Giunchedi: [C: 03+2] wikimedia.org: clean up ldap-icinga [dns] - 10https://gerrit.wikimedia.org/r/992083 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [08:55:47] (03CR) 10Majavah: [C: 03+2] maintain-dbusers: Fix passing parameters to delete API call [puppet] - 10https://gerrit.wikimedia.org/r/991924 (owner: 10Majavah) [08:56:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1213.eqiad.wmnet with reason: host reimage [08:56:31] (03Merged) 10jenkins-bot: Fix documentation generation [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/991086 (https://phabricator.wikimedia.org/T355174) (owner: 10Majavah) [08:58:02] * hashar grabs a pint of coffee [09:00:04] hashar: Deploy window Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240122T0900) [09:01:23] !log cgoubert@cumin1002 START - Cookbook sre.hosts.remove-downtime for mw2394.codfw.wmnet [09:01:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2394.codfw.wmnet [09:02:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169:3316 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55155 and previous config saved to /var/cache/conftool/dbconfig/20240122-090218-root.json [09:03:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P55156 and previous config saved to /var/cache/conftool/dbconfig/20240122-090332-marostegui.json [09:03:44] !log cgoubert@cumin1002 conftool action : 
set/pooled=no; selector: name=mw2444.codfw.wmnet [09:05:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169:3317 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55157 and previous config saved to /var/cache/conftool/dbconfig/20240122-090504-root.json [09:05:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1035.eqiad.wmnet [09:06:58] !log Upgrading Gerrit # T354885 [09:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:14] T354885: Upgrade to Gerrit 3.7 - https://phabricator.wikimedia.org/T354885 [09:08:39] !log hashar@deploy2002 Started deploy [gerrit/gerrit@bdd1a8b]: Gerrit to version 3.7.6 [09:08:49] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@bdd1a8b]: Gerrit to version 3.7.6 (duration: 00m 10s) [09:11:22] !log Gerrit: reindexing all changes for 3.6 > 3.7 migration # T354885 [09:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:29] Reindexing changes: changes: 2% (23122/980136), project-slices: 11% (383/3240), Slicing projects: 100% (2862/2862) (/) [09:11:39] ETA: I don't know [09:11:42] (ProbeDown) firing: Service gerrit1003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gerrit1003:29418 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:12:06] as for the alerting, I did put the hosts in maintenance mode in Icinga [09:12:14] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:12:18] and added some alerts filtering in alert manager but looks like that did not caught them [09:12:19] :/ [09:12:31] (ProbeDown) firing: (2) Service gerrit1003:443 has failed probes 
(http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:12:39] Reindexing changes: changes: 11% (114762/980136), project-slices: 19% (624/3240), Slicing projects: 100% (2862/2862) (/) [09:13:48] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:54] PROBLEM - Check systemd state on chartmuseum1001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:14:20] (JobUnavailable) firing: (2) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:14:38] PROBLEM - Check systemd state on db2169 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:15:28] (JobUnavailable) firing: (4) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:15:44] hashar: want me actually downtime the host? 
[09:15:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1035.eqiad.wmnet [09:16:16] claime: please yes :) [09:16:28] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:36] I don't know how I screwed up the silences in https://alerts.wikimedia.org/ :-\ [09:16:42] (ProbeDown) firing: (2) Service gerrit1003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gerrit1003:29418 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:17:26] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on gerrit[1003,2002].wikimedia.org with reason: Gerrit update [09:17:30] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on gerrit[1003,2002].wikimedia.org with reason: Gerrit update [09:17:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1213.eqiad.wmnet with OS bookworm [09:18:20] RECOVERY - Check systemd state on arclamp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:18:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T354336)', diff saved to https://phabricator.wikimedia.org/P55158 and previous config saved to /var/cache/conftool/dbconfig/20240122-091838-marostegui.json [09:18:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2156.codfw.wmnet with reason: Maintenance [09:18:43] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [09:18:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1035.eqiad.wmnet to
cluster eqiad and group A [09:18:54] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1035.eqiad.wmnet to cluster eqiad and group A [09:18:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2156.codfw.wmnet with reason: Maintenance [09:18:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance [09:19:10] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:19:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance [09:19:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T354336)', diff saved to https://phabricator.wikimedia.org/P55159 and previous config saved to /var/cache/conftool/dbconfig/20240122-091916-marostegui.json [09:21:24] hashar: I've added ad-hoc silences on the job as well, it'll last ~15 minutes [09:21:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55160 and previous config saved to /var/cache/conftool/dbconfig/20240122-092152-root.json [09:22:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55161 and previous config saved to /var/cache/conftool/dbconfig/20240122-092207-root.json [09:22:22] but every systemd timer git job is going to fail and I can't really help that without masking a bunch of potentially real alerts [09:25:12] PROBLEM - Check systemd state on chartmuseum2001 is CRITICAL: CRITICAL - degraded: The following units failed: helm-chartctl-package-all.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:05] claime: thanks! [09:26:05] !log cgoubert@cumin1002 conftool action : set/pooled=yes; selector: name=mw2444.codfw.wmnet [09:26:14] reindexing has completed [09:26:15] !log cgoubert@cumin1002 conftool action : set/pooled=no; selector: name=mw2394.codfw.wmnet [09:27:48] <_joe_> hashar: who's working with you on the SRE side? [09:28:51] <_joe_> we have a full team dedicated to working on these systems, please coordinate with them so they'll assist you on the SRE side (including setting up downtimes) [09:31:16] we did that previously [09:31:18] and the steps are documented [09:31:28] but wrong for some reason, I will look at it after the upgrade [09:33:12] Restarting Gerrit [09:36:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T354336)', diff saved to https://phabricator.wikimedia.org/P55162 and previous config saved to /var/cache/conftool/dbconfig/20240122-093638-marostegui.json [09:36:44] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [09:36:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55163 and previous config saved to /var/cache/conftool/dbconfig/20240122-093657-root.json [09:37:13] (03PS1) 10Majavah: maintain-views: Remove wb_terms [puppet] - 10https://gerrit.wikimedia.org/r/992086 (https://phabricator.wikimedia.org/T265137) [09:37:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55164 and previous config saved to /var/cache/conftool/dbconfig/20240122-093712-root.json [09:37:14] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[09:37:22] RECOVERY - Check systemd state on chartmuseum2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:37:22] RECOVERY - Check systemd state on chartmuseum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:37:39] (03CR) 10Majavah: [C: 03+2] P:openstack: keystone: use cloud-private for memcached access [puppet] - 10https://gerrit.wikimedia.org/r/991771 (https://phabricator.wikimedia.org/T355417) (owner: 10Majavah) [09:37:46] 10SRE, 10ops-codfw, 10serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (10Clement_Goubert) Repooled, thank you @Jhancock.wm [09:38:15] !log Restarted Gerrit with upgraded version 3.7.6 # T354885 [09:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:19] T354885: Upgrade to Gerrit 3.7 - https://phabricator.wikimedia.org/T354885 [09:38:26] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:40:09] (03PS1) 10Majavah: get_config: use .mailmap [puppet] - 10https://gerrit.wikimedia.org/r/992087 [09:43:33] (03PS1) 10Muehlenhoff: Add ganeti1035 to eqiad ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/992088 (https://phabricator.wikimedia.org/T349925) [09:44:00] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/992106 (owner: 10Marostegui) [09:44:34] (03PS1) 10Marostegui: Revert "db1213: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/992106 [09:44:47] (03CR) 10Volans: [C: 03+1] "Looks ok enough for me to start testing (see [1]). There are still a couple of optional comments open, but I'll leave that to you." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [09:44:55] (03CR) 10Muehlenhoff: [C: 03+2] Add ganeti1035 to eqiad ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/992088 (https://phabricator.wikimedia.org/T349925) (owner: 10Muehlenhoff) [09:45:28] (JobUnavailable) resolved: (4) Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:45:48] (03CR) 10Marostegui: [C: 03+2] Revert "db1213: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/992106 (owner: 10Marostegui) [09:45:59] moritzm: ok to merge? [09:46:42] marostegui: yes, please [09:46:49] moritzm: done [09:46:53] cheers [09:47:45] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1049.eqiad.wmnet [09:48:05] (03CR) 10Muehlenhoff: [C: 03+2] mc1049: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991293 (owner: 10Effie Mouzeli) [09:49:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1035.eqiad.wmnet to cluster eqiad and group A [09:51:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1035.eqiad.wmnet to cluster eqiad and group A [09:51:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P55165 and previous config saved to /var/cache/conftool/dbconfig/20240122-095145-marostegui.json [09:52:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55166 and previous config saved to /var/cache/conftool/dbconfig/20240122-095202-root.json [09:52:09] (03PS1) 10Brouberol: archiva: fix 400s when proxying requests with 
spaces in the URL [puppet] - 10https://gerrit.wikimedia.org/r/992089 (https://phabricator.wikimedia.org/T355352) [09:52:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55167 and previous config saved to /var/cache/conftool/dbconfig/20240122-095217-root.json [09:52:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1049.eqiad.wmnet [09:52:43] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2049.codfw.wmnet [09:53:03] (03CR) 10Muehlenhoff: [C: 03+2] mc2049: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991294 (owner: 10Effie Mouzeli) [09:54:00] (03PS2) 10Brouberol: archiva: fix 400s when proxying requests with spaces in the URL [puppet] - 10https://gerrit.wikimedia.org/r/992089 (https://phabricator.wikimedia.org/T355352) [09:54:14] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/992089 (https://phabricator.wikimedia.org/T355352) (owner: 10Brouberol) [09:55:09] (03CR) 10CI reject: [V: 04-1] archiva: fix 400s when proxying requests with spaces in the URL [puppet] - 10https://gerrit.wikimedia.org/r/992089 (https://phabricator.wikimedia.org/T355352) (owner: 10Brouberol) [09:55:44] (03PS3) 10Brouberol: archiva: fix 400s when proxying requests with spaces in the URL [puppet] - 10https://gerrit.wikimedia.org/r/992089 (https://phabricator.wikimedia.org/T355352) [09:56:21] (03PS2) 10Majavah: P:toolforge: move hba to grid-specific bastion profile [puppet] - 10https://gerrit.wikimedia.org/r/990702 [09:56:23] (03PS2) 10Majavah: O:toolforge: add role for grid-less bastions [puppet] - 10https://gerrit.wikimedia.org/r/990703 (https://phabricator.wikimedia.org/T314665) [09:56:25] (03PS2) 10Majavah: P:toolforge::shell_environ: remove packages not on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/990704 [09:56:52] !log stop envoy on 
ticket-test.wikimedia.org to test alerting - T354479 [09:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:56] T354479: ticket.wikimedia.org should page when down - https://phabricator.wikimedia.org/T354479 [09:57:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2049.codfw.wmnet [09:57:45] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/992089 (https://phabricator.wikimedia.org/T355352) (owner: 10Brouberol) [09:58:26] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:00:04] (ProbeDown) firing: Service vrts1002:1443 has failed probes (http_ticket_test_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1002:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:00:55] !log start envoy on ticket-test.wikimedia.org to test alerting - T354479 [10:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:51] (03PS4) 10Brouberol: archiva: fix 400s when proxying requests with spaces in the URL [puppet] - 10https://gerrit.wikimedia.org/r/992089 (https://phabricator.wikimedia.org/T355352) [10:02:03] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/992089 (https://phabricator.wikimedia.org/T355352) (owner: 10Brouberol) [10:04:26] !log gerrit: running jgit gc on every repository to regenerate potentially faulty reachability bitmaps files preventing fetches on some repositories # T355173 [10:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:31] T355173: git clone and git pull commands fail for refinery repo - https://phabricator.wikimedia.org/T355173 [10:04:49] my bet is ten years from now, someone will 
reach out to me asking what that `!log` line meant [10:05:04] and I would have forgotten about all of that [10:05:04] (ProbeDown) resolved: Service vrts1002:1443 has failed probes (http_ticket_test_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1002:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:05:38] same as https://bash.toolforge.org/quip/AU7VVd-E6snAnmqnK_xp [10:05:52] :D [10:06:11] I more or less have an idea about what git bitmap files are [10:06:15] but I can't explain it [10:06:24] which implies I don't know about them [10:06:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P55169 and previous config saved to /var/cache/conftool/dbconfig/20240122-100651-marostegui.json [10:07:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55170 and previous config saved to /var/cache/conftool/dbconfig/20240122-100707-root.json [10:07:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55171 and previous config saved to /var/cache/conftool/dbconfig/20240122-100722-root.json [10:12:57] (03CR) 10Btullis: [C: 03+1] "Great work!"
[puppet] - 10https://gerrit.wikimedia.org/r/992089 (https://phabricator.wikimedia.org/T355352) (owner: 10Brouberol) [10:13:24] (03CR) 10Brouberol: [C: 03+2] archiva: fix 400s when proxying requests with spaces in the URL [puppet] - 10https://gerrit.wikimedia.org/r/992089 (https://phabricator.wikimedia.org/T355352) (owner: 10Brouberol) [10:13:53] !log cgoubert@cumin1002 START - Cookbook sre.hosts.remove-downtime for gerrit[1003,2002].wikimedia.org [10:13:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for gerrit[1003,2002].wikimedia.org [10:16:10] (03PS1) 10Marostegui: db2158: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/992092 (https://phabricator.wikimedia.org/T354506) [10:16:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2158', diff saved to https://phabricator.wikimedia.org/P55172 and previous config saved to /var/cache/conftool/dbconfig/20240122-101634-marostegui.json [10:17:39] (03CR) 10Marostegui: [C: 03+2] db2158: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/992092 (https://phabricator.wikimedia.org/T354506) (owner: 10Marostegui) [10:18:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2158.codfw.wmnet with OS bookworm [10:21:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T354336)', diff saved to https://phabricator.wikimedia.org/P55173 and previous config saved to /var/cache/conftool/dbconfig/20240122-102158-marostegui.json [10:22:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2177.codfw.wmnet with reason: Maintenance [10:22:03] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [10:22:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55174 and previous config saved to 
/var/cache/conftool/dbconfig/20240122-102212-root.json [10:22:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2177.codfw.wmnet with reason: Maintenance [10:22:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T354336)', diff saved to https://phabricator.wikimedia.org/P55175 and previous config saved to /var/cache/conftool/dbconfig/20240122-102220-marostegui.json [10:22:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55176 and previous config saved to /var/cache/conftool/dbconfig/20240122-102227-root.json [10:33:45] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 2 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (10Clement_Goubert) 05Open→03Resolved >>! In T352893#9471788, @ayounsi wrote: > Nice !! > > T... 
[10:33:53] 10SRE, 10Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (10Clement_Goubert) [10:34:01] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, 10serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (10Clement_Goubert) 05In progress→03Resolved [10:34:11] 10SRE, 10Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (10Clement_Goubert) [10:35:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [10:35:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [10:35:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T352010)', diff saved to https://phabricator.wikimedia.org/P55177 and previous config saved to /var/cache/conftool/dbconfig/20240122-103520-ladsgroup.json [10:35:25] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [10:37:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55178 and previous config saved to /var/cache/conftool/dbconfig/20240122-103717-root.json [10:37:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2158.codfw.wmnet with reason: host reimage [10:37:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55179 and previous config saved to /var/cache/conftool/dbconfig/20240122-103732-root.json [10:38:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T354336)', 
diff saved to https://phabricator.wikimedia.org/P55180 and previous config saved to /var/cache/conftool/dbconfig/20240122-103820-marostegui.json [10:38:32] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [10:40:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2158.codfw.wmnet with reason: host reimage [10:41:08] (03CR) 10Jelto: [V: 03+1 C: 03+2] vrts: test delaying blackbox::check::http [puppet] - 10https://gerrit.wikimedia.org/r/991765 (https://phabricator.wikimedia.org/T354479) (owner: 10Jelto) [10:42:57] (03PS1) 10Vgutierrez: aptrepo,haproxy: Allow installing HAProxy 2.8 on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/992097 (https://phabricator.wikimedia.org/T354424) [10:46:31] (03CR) 10FNegri: "LGTM, one question inline." [puppet] - 10https://gerrit.wikimedia.org/r/990702 (owner: 10Majavah) [10:51:48] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1177/console" [puppet] - 10https://gerrit.wikimedia.org/r/992097 (https://phabricator.wikimedia.org/T354424) (owner: 10Vgutierrez) [10:52:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55181 and previous config saved to /var/cache/conftool/dbconfig/20240122-105222-root.json [10:52:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55182 and previous config saved to /var/cache/conftool/dbconfig/20240122-105237-root.json [10:53:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P55183 and previous config saved to 
/var/cache/conftool/dbconfig/20240122-105326-marostegui.json [10:55:12] (03PS1) 10Vgutierrez: hiera: Install HAProxy 2.8 on cp3066 [puppet] - 10https://gerrit.wikimedia.org/r/992098 [10:56:19] (03PS1) 10Marostegui: Revert "db2158: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/992107 [10:56:57] PROBLEM - Check systemd state on db1213 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:57:05] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1178/co" [puppet] - 10https://gerrit.wikimedia.org/r/992098 (owner: 10Vgutierrez) [10:57:36] (03CR) 10Fabfur: [C: 03+1] "ok for me!" [puppet] - 10https://gerrit.wikimedia.org/r/992097 (https://phabricator.wikimedia.org/T354424) (owner: 10Vgutierrez) [10:58:17] (03PS2) 10Vgutierrez: hiera: Install HAProxy 2.8 on cp3066 [puppet] - 10https://gerrit.wikimedia.org/r/992098 (https://phabricator.wikimedia.org/T354424) [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240122T1100) [11:01:24] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] aptrepo,haproxy: Allow installing HAProxy 2.8 on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/992097 (https://phabricator.wikimedia.org/T354424) (owner: 10Vgutierrez) [11:01:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2158.codfw.wmnet with OS bookworm [11:02:51] (03CR) 10Marostegui: [C: 03+2] Revert "db2158: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/992107 (owner: 10Marostegui) [11:03:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55184 and previous config saved to 
/var/cache/conftool/dbconfig/20240122-110321-root.json [11:04:14] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1048.eqiad.wmnet [11:04:30] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2048.codfw.wmnet [11:07:54] !log vgutierrez@apt1001:~$ sudo -i reprepro --noskipold --component thirdparty/haproxy28 update bullseye-wikimedia - T354424 [11:07:55] T354424: HAProxy 2.6.16 CPU spikes on cp3066 - https://phabricator.wikimedia.org/T354424 [11:08:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P55185 and previous config saved to /var/cache/conftool/dbconfig/20240122-110833-marostegui.json [11:10:08] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1048.eqiad.wmnet [11:10:48] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2048.codfw.wmnet [11:14:23] (03PS1) 10Muehlenhoff: Make ganeti1036 a ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/992099 (https://phabricator.wikimedia.org/T349925) [11:16:35] (03PS1) 10Hnowlan: kubernetes: reclaim eqiad jobrunner hosts [puppet] - 10https://gerrit.wikimedia.org/r/992100 (https://phabricator.wikimedia.org/T354791) [11:18:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55187 and previous config saved to /var/cache/conftool/dbconfig/20240122-111826-root.json [11:19:57] (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti1036 a ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/992099 (https://phabricator.wikimedia.org/T349925) (owner: 10Muehlenhoff) [11:21:26] !log stop envoy on ticket-test.wikimedia.org to test alerting - T354479 [11:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:31] T354479: ticket.wikimedia.org should page when down - 
https://phabricator.wikimedia.org/T354479 [11:23:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T354336)', diff saved to https://phabricator.wikimedia.org/P55188 and previous config saved to /var/cache/conftool/dbconfig/20240122-112339-marostegui.json [11:23:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2190.codfw.wmnet with reason: Maintenance [11:23:45] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [11:23:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2190.codfw.wmnet with reason: Maintenance [11:24:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T354336)', diff saved to https://phabricator.wikimedia.org/P55189 and previous config saved to /var/cache/conftool/dbconfig/20240122-112401-marostegui.json [11:26:04] (ProbeDown) firing: Service vrts1002:1443 has failed probes (http_ticket_test_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1002:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:26:12] !log start envoy on ticket-test.wikimedia.org to test alerting - T354479 [11:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:11] (03CR) 10Effie Mouzeli: [C: 03+1] "> > * should there be a keyspace value?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/991027 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [11:31:04] (ProbeDown) resolved: Service vrts1002:1443 has failed probes (http_ticket_test_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1002:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:31:29] (03PS1) 10Clément Goubert: Raise yaml_log_error logging level to error [software/conftool] - 10https://gerrit.wikimedia.org/r/992104 (https://phabricator.wikimedia.org/T355256) [11:31:31] (03PS1) 10Clément Goubert: Fix various pylint warnings [software/conftool] - 10https://gerrit.wikimedia.org/r/992105 [11:32:46] 10SRE-OnFire, 10Znuny, 10collaboration-services: ticket.wikimedia.org should page when down - https://phabricator.wikimedia.org/T354479 (10Jelto) The change above has the expected effect. The `ProbeDown` alert for vrts hosts changes from `for: 2m` to `for: 3m`. I checked the alert config in Thanos: ` name:... 
[11:32:57] 10SRE-OnFire, 10Znuny, 10collaboration-services: ticket.wikimedia.org should page when down - https://phabricator.wikimedia.org/T354479 (10Jelto) a:03Jelto [11:33:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55190 and previous config saved to /var/cache/conftool/dbconfig/20240122-113331-root.json [11:37:37] (03PS1) 10Jelto: Revert "vrts: test delaying blackbox::check::http" [puppet] - 10https://gerrit.wikimedia.org/r/992108 (https://phabricator.wikimedia.org/T354479) [11:40:27] (03CR) 10Vgutierrez: [C: 03+2] hiera: Install HAProxy 2.8 on cp3066 [puppet] - 10https://gerrit.wikimedia.org/r/992098 (https://phabricator.wikimedia.org/T354424) (owner: 10Vgutierrez) [11:41:14] !log update to HAProxy 2.8.5 on cp3066 - T354424 [11:41:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T354336)', diff saved to https://phabricator.wikimedia.org/P55191 and previous config saved to /var/cache/conftool/dbconfig/20240122-114115-marostegui.json [11:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:18] T354424: HAProxy 2.6.16 CPU spikes on cp3066 - https://phabricator.wikimedia.org/T354424 [11:41:22] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [11:48:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55192 and previous config saved to /var/cache/conftool/dbconfig/20240122-114836-root.json [11:48:38] (03CR) 10D3r1ck01: [C: 03+1] "LGTM! 
Thanks 😊" [deployment-charts] - 10https://gerrit.wikimedia.org/r/991787 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [11:49:32] 10SRE, 10MW-on-K8s, 10Trust and Safety Product Team, 10serviceops-radar, and 2 others: MediaModeration maintenance script scanFilesInScanTable.php indirectly calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10Clement_Goubert) [11:50:13] 10SRE, 10MW-on-K8s, 10Trust and Safety Product Team, 10serviceops-radar, and 2 others: MediaModeration maintenance script scanFilesInScanTable.php indirectly calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10Clement_Goubert) p:05Triage→03Medium [11:51:38] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Use core page html on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/991787 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [11:52:55] (03Merged) 10jenkins-bot: mobileapps: Use core page html on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/991787 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [11:53:11] (03PS1) 10Ladsgroup: Drop old virtual domain for url shortener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992129 [11:54:23] (03CR) 10Ladsgroup: [C: 03+1] "Wanna merge it yourself?" 
[puppet] - 10https://gerrit.wikimedia.org/r/992086 (https://phabricator.wikimedia.org/T265137) (owner: 10Majavah) [11:56:17] !log volans@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on sretest1001.eqiad.wmnet with reason: Testing [11:56:18] !log volans@cumin1002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 0:20:00 on sretest1001.eqiad.wmnet with reason: Testing [11:56:20] jouncebot: nowandnext [11:56:20] For the next 0 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240122T1100) [11:56:21] In 2 hour(s) and 3 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240122T1400) [11:56:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P55193 and previous config saved to /var/cache/conftool/dbconfig/20240122-115621-marostegui.json [11:56:36] !log volans@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on sretest1001.eqiad.wmnet with reason: Testing [11:56:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:57:05] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:59:09] 10SRE, 10Wikimedia-Mailing-lists: Close mailing list safetywikimania2021 - https://phabricator.wikimedia.org/T355480 (10Ladsgroup) a:03Ladsgroup [11:59:38] (03PS1) 10Jgiannelos: mobileapps: Configure core page html req template [deployment-charts] - 10https://gerrit.wikimedia.org/r/992130 [12:00:16] (03CR) 10CI reject: [V: 04-1] mobileapps: Configure core page html req template [deployment-charts] - 10https://gerrit.wikimedia.org/r/992130 (owner: 10Jgiannelos) [12:00:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 
OK - 8572 bytes in 5.534 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:00:39] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:01:30] (03PS2) 10Jgiannelos: mobileapps: Configure core page html req template [deployment-charts] - 10https://gerrit.wikimedia.org/r/992130 (https://phabricator.wikimedia.org/T339865) [12:02:25] (03CR) 10CI reject: [V: 04-1] mobileapps: Configure core page html req template [deployment-charts] - 10https://gerrit.wikimedia.org/r/992130 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [12:03:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55195 and previous config saved to /var/cache/conftool/dbconfig/20240122-120341-root.json [12:03:42] 10SRE, 10MW-on-K8s, 10Trust and Safety Product Team, 10serviceops-radar, 10Patch-For-Review: MediaModeration maintenance script scanFilesInScanTable.php indirectly calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10kostajh) [12:04:27] (03PS3) 10Jgiannelos: mobileapps: Configure core page html req template [deployment-charts] - 10https://gerrit.wikimedia.org/r/992130 (https://phabricator.wikimedia.org/T339865) [12:05:06] (03CR) 10CI reject: [V: 04-1] mobileapps: Configure core page html req template [deployment-charts] - 10https://gerrit.wikimedia.org/r/992130 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [12:06:36] !log volans@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:20:00 on sretest1001.eqiad.wmnet with reason: Testing [12:07:10] (03PS4) 10Jgiannelos: mobileapps: Configure core page html req template [deployment-charts] - 10https://gerrit.wikimedia.org/r/992130 (https://phabricator.wikimedia.org/T339865) [12:11:28] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P55197 and previous config saved to /var/cache/conftool/dbconfig/20240122-121128-marostegui.json [12:12:41] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [12:15:01] (03CR) 10Hnowlan: [C: 03+1] mobileapps: Configure core page html req template [deployment-charts] - 10https://gerrit.wikimedia.org/r/992130 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [12:17:48] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Configure core page html req template [deployment-charts] - 10https://gerrit.wikimedia.org/r/992130 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [12:18:46] (03Merged) 10jenkins-bot: mobileapps: Configure core page html req template [deployment-charts] - 10https://gerrit.wikimedia.org/r/992130 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [12:18:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55198 and previous config saved to /var/cache/conftool/dbconfig/20240122-121846-root.json [12:20:52] (03PS1) 10Jgiannelos: mobileapps: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/992136 [12:21:57] (03CR) 10Hnowlan: [C: 03+1] mobileapps: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/992136 (owner: 10Jgiannelos) [12:24:59] PROBLEM - Docker registry HTTPS interface on registry1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [12:25:36] (03PS2) 10Slyngshede: Code cleanup before enabling CI pipeline. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/992074 [12:26:17] RECOVERY - Docker registry HTTPS interface on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.337 second response time https://wikitech.wikimedia.org/wiki/Docker [12:26:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T354336)', diff saved to https://phabricator.wikimedia.org/P55199 and previous config saved to /var/cache/conftool/dbconfig/20240122-122634-marostegui.json [12:26:39] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [12:26:55] (03CR) 10Slyngshede: "The most annoying patch in the world." [software/bitu] - 10https://gerrit.wikimedia.org/r/992074 (owner: 10Slyngshede) [12:28:45] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/992136 (owner: 10Jgiannelos) [12:29:52] (03Merged) 10jenkins-bot: mobileapps: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/992136 (owner: 10Jgiannelos) [12:33:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55200 and previous config saved to /var/cache/conftool/dbconfig/20240122-123351-root.json [12:44:36] (03PS6) 10Klausman: helmfile/rbac: Allow deploy users to debug pods in experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/991309 (https://phabricator.wikimedia.org/T354516) [12:45:22] (03CR) 10Klausman: helmfile/rbac: Allow deploy users to debug pods in experimental (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/991309 (https://phabricator.wikimedia.org/T354516) (owner: 10Klausman) [12:45:41] (03PS1) 10Klausman: ml-serve/staging: Add group to allow debugging operations on [puppet] - 10https://gerrit.wikimedia.org/r/992152 (https://phabricator.wikimedia.org/T354516) [12:47:22] 10SRE, 
10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Move 40% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T355532 (10Clement_Goubert) [12:47:29] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:48:32] (03PS7) 10Klausman: helmfile/rbac: Allow deploy users to debug pods in experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/991309 (https://phabricator.wikimedia.org/T354516) [12:48:38] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Move 40% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T355532 (10Clement_Goubert) p:05Triage→03High [12:48:40] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:48:58] (03PS8) 10Klausman: helmfile/rbac: Allow deploy users to debug pods in experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/991309 (https://phabricator.wikimedia.org/T354516) [12:49:32] (03PS9) 10Klausman: helmfile/rbac: Allow deploy users to debug pods in experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/991309 (https://phabricator.wikimedia.org/T354516) [12:50:33] (ProbeDown) firing: (2) Service titan1001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:54:19] (ProbeDown) resolved: (2) Service titan1001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:55:20] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1050.eqiad.wmnet [12:56:14] (03CR) 
10Muehlenhoff: [C: 03+2] mc1050: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991295 (owner: 10Effie Mouzeli) [12:59:22] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2049.codfw.wmnet [12:59:27] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1049.eqiad.wmnet [12:59:32] 10SRE, 10MW-on-K8s, 10Quality-and-Test-Engineering-Team, 10serviceops: Move testwiki over to mw-on-k8s - https://phabricator.wikimedia.org/T355534 (10Clement_Goubert) [13:00:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1050.eqiad.wmnet [13:01:33] (03PS1) 10Clément Goubert: trafficserver: move 30% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/992158 (https://phabricator.wikimedia.org/T355532) [13:01:46] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2050.codfw.wmnet [13:02:05] (03CR) 10Muehlenhoff: [C: 03+2] mc2050: Switch MW memcache to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/991296 (owner: 10Effie Mouzeli) [13:05:22] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1049.eqiad.wmnet [13:05:30] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2049.codfw.wmnet [13:06:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1036.eqiad.wmnet [13:07:05] PROBLEM - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [13:07:45] PROBLEM - CirrusSearch full_text codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical 
threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [13:07:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2050.codfw.wmnet [13:08:48] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (10taavi) [13:13:49] RECOVERY - CirrusSearch full_text codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [13:14:41] RECOVERY - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [13:20:18] (03PS1) 10Marostegui: db1165: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/992160 (https://phabricator.wikimedia.org/T354506) [13:20:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1165', diff saved to https://phabricator.wikimedia.org/P55201 and previous config saved to /var/cache/conftool/dbconfig/20240122-132023-marostegui.json [13:20:26] (03PS1) 10Filippo Giunchedi: openstack: remove spreadcheck, absented [puppet] - 10https://gerrit.wikimedia.org/r/992162 (https://phabricator.wikimedia.org/T345294) [13:21:33] (03CR) 
10Marostegui: [C: 03+2] db1165: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/992160 (https://phabricator.wikimedia.org/T354506) (owner: 10Marostegui) [13:21:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1165.eqiad.wmnet with OS bookworm [13:21:52] (03CR) 10Majavah: [C: 03+2] maintain-views: Remove wb_terms [puppet] - 10https://gerrit.wikimedia.org/r/992086 (https://phabricator.wikimedia.org/T265137) (owner: 10Majavah) [13:22:00] !log Upgrade sanitarium master, there will be lag on s6 wiki replicas T354506 [13:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:19] T354506: Upgrade s6 hosts to Bookworm - https://phabricator.wikimedia.org/T354506 [13:23:29] (03PS1) 10Jgiannelos: mobileapps: Fix rest.php path for core page html requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/992163 [13:24:47] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti1036.eqiad.wmnet [13:25:45] PROBLEM - Check systemd state on ganeti1036 is CRITICAL: CRITICAL - degraded: The following units failed: networking.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:30:07] (03CR) 10Majavah: [C: 03+1] openstack: remove spreadcheck, absented [puppet] - 10https://gerrit.wikimedia.org/r/992162 (https://phabricator.wikimedia.org/T345294) (owner: 10Filippo Giunchedi) [13:33:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1165.eqiad.wmnet with reason: host reimage [13:33:57] (03CR) 10Filippo Giunchedi: [C: 03+2] openstack: remove spreadcheck, absented [puppet] - 10https://gerrit.wikimedia.org/r/992162 (https://phabricator.wikimedia.org/T345294) (owner: 10Filippo Giunchedi) [13:36:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1165.eqiad.wmnet with reason: host reimage [13:40:20] (03CR) 10Majavah: [C: 03+2] P:toolforge: move hba to grid-specific 
bastion profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/990702 (owner: 10Majavah) [13:41:47] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10Gehel) [13:47:09] (03PS47) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [13:57:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1165.eqiad.wmnet with OS bookworm [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240122T1400). nyaa~ [14:00:05] hubaishan: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:16] o/ [14:02:07] I can deploy [14:03:22] hubaishan: are you there? 
[14:03:29] yes [14:03:35] ok, I’m looking at your changes [14:04:47] (03CR) 10Lucas Werkmeister (WMDE): Restrict pagequality-validate right to patroller in arwikisource (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991379 (https://phabricator.wikimedia.org/T354503) (owner: 10Hubaishan) [14:07:39] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "+1 to run CI" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991379 (https://phabricator.wikimedia.org/T354503) (owner: 10Hubaishan) [14:07:48] (03PS1) 10Marostegui: db1134: Remove it from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/992187 (https://phabricator.wikimedia.org/T355541) [14:08:38] (03CR) 10CI reject: [V: 04-1] Restrict pagequality-validate right to patroller in arwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991379 (https://phabricator.wikimedia.org/T354503) (owner: 10Hubaishan) [14:09:27] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Alright, the diffConfig looks sensible – a few groups get reordered, but there don’t seem to be any rights changes other than the intended" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991379 (https://phabricator.wikimedia.org/T354503) (owner: 10Hubaishan) [14:09:30] (03CR) 10Marostegui: [C: 03+2] db1134: Remove it from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/992187 (https://phabricator.wikimedia.org/T355541) (owner: 10Marostegui) [14:10:09] hubaishan: I left some comments on the first change, please update it (once that’s done it should be okay to go) [14:10:12] looking at the second change now [14:10:56] PROBLEM - Host ganeti1036 is DOWN: PING CRITICAL - Packet loss = 100% [14:11:32] (03PS1) 10EoghanGaffney: [phabricator] Use python3 for task dump script [puppet] - 10https://gerrit.wikimedia.org/r/992189 (https://phabricator.wikimedia.org/T355502) [14:13:16] RECOVERY - Host ganeti1036 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [14:13:33] (03PS1) 10Marostegui: db1134: Temporary place in m2 [puppet] - 
10https://gerrit.wikimedia.org/r/992190 (https://phabricator.wikimedia.org/T355541) [14:13:36] PROBLEM - haproxy failover on dbproxy1023 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:13:48] PROBLEM - haproxy failover on dbproxy1025 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:15:04] (03CR) 10Marostegui: [C: 03+2] db1134: Temporary place in m2 [puppet] - 10https://gerrit.wikimedia.org/r/992190 (https://phabricator.wikimedia.org/T355541) (owner: 10Marostegui) [14:15:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:15:32] (03PS2) 10Lucas Werkmeister (WMDE): Set ShowRollbackConfirmation in arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991358 (https://phabricator.wikimedia.org/T355213) (owner: 10Hubaishan) [14:15:53] I made a tiny formatting improvement to the second change, otherwise that one should also be okay [14:15:56] (let’s see what CI says) [14:16:10] (03PS3) 10Hubaishan: Restrict pagequality-validate right to patroller in arwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991379 (https://phabricator.wikimedia.org/T354503) [14:16:38] PROBLEM - Host ganeti1036 is DOWN: PING CRITICAL - Packet loss = 100% [14:16:50] (03PS1) 10Marostegui: Revert "db1165: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/992111 [14:16:56] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "run CI again" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991379 (https://phabricator.wikimedia.org/T354503) (owner: 10Hubaishan) [14:17:56] (03CR) 10Hnowlan: [C: 03+1] 
trafficserver: move 30% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/992158 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert) [14:18:32] (03CR) 10Marostegui: [C: 03+2] Revert "db1165: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/992111 (owner: 10Marostegui) [14:18:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991379 (https://phabricator.wikimedia.org/T354503) (owner: 10Hubaishan) [14:19:28] RECOVERY - Check systemd state on ganeti1036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:30] RECOVERY - Host ganeti1036 is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms [14:19:57] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Restrict pagequality-validate right to patroller in arwikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991379 (https://phabricator.wikimedia.org/T354503) (owner: 10Hubaishan) [14:20:09] (03Merged) 10jenkins-bot: Restrict pagequality-validate right to patroller in arwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991379 (https://phabricator.wikimedia.org/T354503) (owner: 10Hubaishan) [14:20:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:20:24] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:991379|Restrict pagequality-validate right to patroller in arwikisource (T354503)]] [14:20:31] T354503: add pagequality-validate right to patroller user group in arwikisource - 
https://phabricator.wikimedia.org/T354503 [14:20:35] (03CR) 10Alexandros Kosiaris: [C: 03+1] mobileapps: Fix rest.php path for core page html requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/992163 (owner: 10Jgiannelos) [14:21:41] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and hubaishan: Backport for [[gerrit:991379|Restrict pagequality-validate right to patroller in arwikisource (T354503)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:22:13] hubaishan: please test on mwdebug [14:22:48] I compared https://ar.wikisource.org/w/api.php?action=query&format=json&meta=siteinfo&formatversion=2&siprop=usergroups with and without the change and to me the diff looks good, fwiw [14:22:49] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) p:05Triage→03Medium [14:23:08] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:24:15] it is OK in ar.wikisource [14:24:20] alright, thanks! 
[14:24:22] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and hubaishan: Continuing with sync [14:24:42] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:25:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:25:17] ^ me [14:25:20] fixing dbctl [14:25:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1134', diff saved to https://phabricator.wikimedia.org/P55203 and previous config saved to /var/cache/conftool/dbconfig/20240122-142530-marostegui.json [14:25:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 1%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55204 and previous config saved to /var/cache/conftool/dbconfig/20240122-142538-root.json [14:26:36] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1036.eqiad.wmnet to cluster eqiad and group B [14:28:31] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1036.eqiad.wmnet to cluster eqiad and group B [14:28:42] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:29:52] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:30:06] !log lucaswerkmeister-wmde@deploy2002 
Finished scap: Backport for [[gerrit:991379|Restrict pagequality-validate right to patroller in arwikisource (T354503)]] (duration: 09m 41s) [14:30:16] T354503: add pagequality-validate right to patroller user group in arwikisource - https://phabricator.wikimedia.org/T354503 [14:30:28] (03PS3) 10Lucas Werkmeister (WMDE): Set ShowRollbackConfirmation in arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991358 (https://phabricator.wikimedia.org/T355213) (owner: 10Hubaishan) [14:30:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991358 (https://phabricator.wikimedia.org/T355213) (owner: 10Hubaishan) [14:30:48] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:31:33] (03Merged) 10jenkins-bot: Set ShowRollbackConfirmation in arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991358 (https://phabricator.wikimedia.org/T355213) (owner: 10Hubaishan) [14:31:47] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:991358|Set ShowRollbackConfirmation in arwiki (T355213)]] [14:31:59] T355213: Set ShowRollbackConfirmationDefaultUserOptions on arwiki to true - https://phabricator.wikimedia.org/T355213 [14:32:17] (03PS4) 10Paladox: gerrit: Fix linking to hash url [puppet] - 10https://gerrit.wikimedia.org/r/992109 (https://phabricator.wikimedia.org/T354886) [14:32:18] (03CR) 10Paladox: "Thanks, done." 
[puppet] - 10https://gerrit.wikimedia.org/r/992109 (https://phabricator.wikimedia.org/T354886) (owner: 10Paladox) [14:33:20] !log lucaswerkmeister-wmde@deploy2002 hubaishan and lucaswerkmeister-wmde: Backport for [[gerrit:991358|Set ShowRollbackConfirmation in arwiki (T355213)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:33:30] hubaishan: please test :) [14:33:50] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:33:55] tested OK [14:35:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:35:16] !log lucaswerkmeister-wmde@deploy2002 hubaishan and lucaswerkmeister-wmde: Continuing with sync [14:35:20] alright, thanks! 
[14:35:35] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Fix rest.php path for core page html requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/992163 (owner: 10Jgiannelos) [14:36:34] (03Merged) 10jenkins-bot: mobileapps: Fix rest.php path for core page html requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/992163 (owner: 10Jgiannelos) [14:39:08] (03PS1) 10Hashar: Update Zuul plugin for Gerrit 3.7 [software/gerrit] (wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/992191 (https://phabricator.wikimedia.org/T355521) [14:39:20] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:14] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:40:17] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:40:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 5%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55205 and previous config saved to /var/cache/conftool/dbconfig/20240122-144043-root.json [14:40:55] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:991358|Set ShowRollbackConfirmation in arwiki (T355213)]] (duration: 09m 07s) [14:40:59] T355213: Set ShowRollbackConfirmationDefaultUserOptions on arwiki to true - https://phabricator.wikimedia.org/T355213 [14:41:04] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:41:28] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:41:29] alright, I think that’s it [14:41:34] !log UTC afternoon backport+config window done [14:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:15] !log 
jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:44:39] (03PS1) 10Andrew Bogott: Galera: switch eqiad1 nodes to replicate on private address [puppet] - 10https://gerrit.wikimedia.org/r/992192 (https://phabricator.wikimedia.org/T355418) [14:45:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2105.codfw.wmnet with reason: Maintenance [14:45:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2105.codfw.wmnet with reason: Maintenance [14:46:27] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/992192 (https://phabricator.wikimedia.org/T355418) (owner: 10Andrew Bogott) [14:46:37] (03CR) 10Hashar: [C: 03+2] Update Zuul plugin for Gerrit 3.7 [software/gerrit] (wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/992191 (https://phabricator.wikimedia.org/T355521) (owner: 10Hashar) [14:46:39] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (10cmooney) p:05Triage→03Medium [14:48:41] (03PS1) 10Hashar: Update Zuul plugin for Gerrit 3.7 [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/992193 (https://phabricator.wikimedia.org/T355521) [14:49:51] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (10cmooney) [14:49:57] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [14:51:23] (03CR) 10Hashar: [C: 03+2] Update Zuul plugin for Gerrit 3.7 [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/992193 (https://phabricator.wikimedia.org/T355521) (owner: 10Hashar) [14:53:31] (03Merged) 10jenkins-bot: Update Zuul 
plugin for Gerrit 3.7 [software/gerrit] (wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/992191 (https://phabricator.wikimedia.org/T355521) (owner: 10Hashar) [14:53:33] (03Merged) 10jenkins-bot: Update Zuul plugin for Gerrit 3.7 [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/992193 (https://phabricator.wikimedia.org/T355521) (owner: 10Hashar) [14:54:55] !log hashar@deploy2002 Started deploy [gerrit/gerrit@6257faa]: Update Zuul plugin for Gerrit 3.7 - T355521 [14:54:59] T355521: Gerrit Zuul plugin does not show Depends-On/Needed-By since Gerrit 3.7 - https://phabricator.wikimedia.org/T355521 [14:55:02] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@6257faa]: Update Zuul plugin for Gerrit 3.7 - T355521 (duration: 00m 07s) [14:55:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 10%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55206 and previous config saved to /var/cache/conftool/dbconfig/20240122-145548-root.json [14:56:11] (03PS5) 10Paladox: gerrit: Fix linking to hash url [puppet] - 10https://gerrit.wikimedia.org/r/992109 (https://phabricator.wikimedia.org/T354886) [14:56:33] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10observability, and 3 others: Upgrade Kafka to from 1.x to later version - https://phabricator.wikimedia.org/T300102 (10brouberol) [14:59:20] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1167.eqiad.wmnet with reason: Maintenance [15:00:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1167.eqiad.wmnet with reason: Maintenance [15:00:24] !log marostegui@cumin1002 START - 
Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:00:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:00:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T354336)', diff saved to https://phabricator.wikimedia.org/P55207 and previous config saved to /var/cache/conftool/dbconfig/20240122-150046-marostegui.json [15:01:00] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [15:01:14] ثءهف [15:01:20] exit [15:02:16] (03CR) 10Brouberol: [C: 03+2] global_config: list IPs of hadoop master/workers and kerberos nodes [puppet] - 10https://gerrit.wikimedia.org/r/987393 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [15:03:21] (03CR) 10Majavah: [C: 03+1] Galera: switch eqiad1 nodes to replicate on private address [puppet] - 10https://gerrit.wikimedia.org/r/992192 (https://phabricator.wikimedia.org/T355418) (owner: 10Andrew Bogott) [15:05:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T354336)', diff saved to https://phabricator.wikimedia.org/P55208 and previous config saved to /var/cache/conftool/dbconfig/20240122-150555-marostegui.json [15:06:10] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [15:07:49] (03CR) 10Andrew Bogott: [C: 03+2] Galera: switch eqiad1 nodes to replicate on private address [puppet] - 10https://gerrit.wikimedia.org/r/992192 (https://phabricator.wikimedia.org/T355418) (owner: 10Andrew Bogott) [15:08:54] (03PS1) 10Majavah: hieradata: bump striker container to 2024-01-22-150527-production [puppet] - 10https://gerrit.wikimedia.org/r/992197 (https://phabricator.wikimedia.org/T355519) [15:10:53] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'db1165 (re)pooling @ 25%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55209 and previous config saved to /var/cache/conftool/dbconfig/20240122-151052-root.json [15:12:08] (03CR) 10Majavah: [C: 03+2] hieradata: bump striker container to 2024-01-22-150527-production [puppet] - 10https://gerrit.wikimedia.org/r/992197 (https://phabricator.wikimedia.org/T355519) (owner: 10Majavah) [15:13:03] !log disable Puppet on A:dns-rec to decouple anycast-hc and pdns-rec systemd binding: CR 979159 [15:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:31] (03CR) 10Ssingh: [C: 03+2] hiera: dnsbox: remove anycast-hc dependency on pdns-rec [puppet] - 10https://gerrit.wikimedia.org/r/979159 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:15:46] (03PS1) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 30% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/992198 (https://phabricator.wikimedia.org/T355532) [15:16:07] (03PS2) 10Clément Goubert: trafficserver: move 30% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/992158 (https://phabricator.wikimedia.org/T355532) [15:16:29] (03PS1) 10Hnowlan: kubernetes: Add usernames for mw-videoscaler [puppet] - 10https://gerrit.wikimedia.org/r/992199 (https://phabricator.wikimedia.org/T355292) [15:18:11] (03PS2) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 30% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/992198 (https://phabricator.wikimedia.org/T355532) [15:18:19] (03CR) 10CI reject: [V: 04-1] trafficserver: move 30% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/992158 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert) [15:18:27] (03PS1) 10Hnowlan: admin_ng: add namespace for mw-videoscaler [deployment-charts] - 10https://gerrit.wikimedia.org/r/992200 (https://phabricator.wikimedia.org/T355292) [15:20:44] (03CR) 10Clément Goubert: 
"Just a commit message typo, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/992100 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [15:21:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P55210 and previous config saved to /var/cache/conftool/dbconfig/20240122-152102-marostegui.json [15:21:07] !log enable puppet on dns6001 and run agent to test CR 979159 [15:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:33] !log re-enable puppet on A:dns-rec and run agent to finish merging CR 979159 [15:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 50%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55211 and previous config saved to /var/cache/conftool/dbconfig/20240122-152557-root.json [15:26:44] !log sudo cumin -b1 -s120 "A:dns-rec and not P{dns6001*}" "enable-puppet 'do not enable' && run-puppet-agent" [15:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:41] (ProbeDown) firing: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:32:28] (03PS1) 10Andrew Bogott: Galera: open 4567 to UDP [puppet] - 10https://gerrit.wikimedia.org/r/992204 (https://phabricator.wikimedia.org/T355418) [15:33:28] (03CR) 10David Caro: [C: 03+1] Galera: open 4567 to UDP [puppet] - 10https://gerrit.wikimedia.org/r/992204 (https://phabricator.wikimedia.org/T355418) (owner: 10Andrew Bogott) [15:33:37] (03CR) 10Majavah: [C: 04-1] Galera: open 4567 to UDP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/992204 (https://phabricator.wikimedia.org/T355418) (owner: 
10Andrew Bogott) [15:33:41] (ProbeDown) resolved: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:34:02] (03CR) 10Muehlenhoff: Galera: open 4567 to UDP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/992204 (https://phabricator.wikimedia.org/T355418) (owner: 10Andrew Bogott) [15:35:18] (03PS2) 10Andrew Bogott: Galera: open 4567 to UDP [puppet] - 10https://gerrit.wikimedia.org/r/992204 (https://phabricator.wikimedia.org/T355418) [15:35:43] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10Jhancock.wm) @cmooney I think that's doable. I'll block out my schedule for it. [15:35:51] (03CR) 10Andrew Bogott: Galera: open 4567 to UDP (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/992204 (https://phabricator.wikimedia.org/T355418) (owner: 10Andrew Bogott) [15:36:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P55212 and previous config saved to /var/cache/conftool/dbconfig/20240122-153608-marostegui.json [15:37:01] (03CR) 10Majavah: [C: 03+1] Galera: open 4567 to UDP [puppet] - 10https://gerrit.wikimedia.org/r/992204 (https://phabricator.wikimedia.org/T355418) (owner: 10Andrew Bogott) [15:37:24] (03CR) 10Clément Goubert: [C: 03+1] kubernetes: Add usernames for mw-videoscaler [puppet] - 10https://gerrit.wikimedia.org/r/992199 (https://phabricator.wikimedia.org/T355292) (owner: 10Hnowlan) [15:37:58] (03CR) 10Clément Goubert: [C: 03+1] admin_ng: add namespace for mw-videoscaler [deployment-charts] - 10https://gerrit.wikimedia.org/r/992200 (https://phabricator.wikimedia.org/T355292) (owner: 10Hnowlan) [15:39:34] (03CR) 10Andrew Bogott: [C: 03+2] Galera: open 4567 to UDP 
[puppet] - 10https://gerrit.wikimedia.org/r/992204 (https://phabricator.wikimedia.org/T355418) (owner: 10Andrew Bogott) [15:41:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 75%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55213 and previous config saved to /var/cache/conftool/dbconfig/20240122-154102-root.json [15:41:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:41:36] (03CR) 10Clément Goubert: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/992158 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert) [15:43:49] (03PS2) 10Hnowlan: kubernetes: reclaim eqiad jobrunner hosts [puppet] - 10https://gerrit.wikimedia.org/r/992100 (https://phabricator.wikimedia.org/T354791) [15:45:42] kamila_: Can you check out the latency alert above please? 
[15:47:00] claime: ack, thanks [15:47:05] ty [15:48:49] (03CR) 10Hnowlan: [C: 03+2] kubernetes: reclaim eqiad jobrunner hosts [puppet] - 10https://gerrit.wikimedia.org/r/992100 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [15:49:40] kamila_: https://grafana.wikimedia.org/goto/LshkEX5Sk?orgId=1 < Looks like a good'ol request spike [15:50:00] claime: yeah, there've been a few [15:50:10] not sure what I should be doing about it [15:50:37] 10SRE, 10ops-codfw: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Papaul) [15:51:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:51:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T354336)', diff saved to https://phabricator.wikimedia.org/P55214 and previous config saved to /var/cache/conftool/dbconfig/20240122-155115-marostegui.json [15:51:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [15:51:20] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [15:51:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [15:51:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1172.eqiad.wmnet with reason: Maintenance [15:51:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1172.eqiad.wmnet with reason: Maintenance [15:51:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling 
db1172 (T354336)', diff saved to https://phabricator.wikimedia.org/P55215 and previous config saved to /var/cache/conftool/dbconfig/20240122-155154-marostegui.json [15:52:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T352010)', diff saved to https://phabricator.wikimedia.org/P55216 and previous config saved to /var/cache/conftool/dbconfig/20240122-155210-ladsgroup.json [15:52:14] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:52:34] RECOVERY - haproxy failover on dbproxy1023 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:53:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T354336)', diff saved to https://phabricator.wikimedia.org/P55217 and previous config saved to /var/cache/conftool/dbconfig/20240122-155302-marostegui.json [15:53:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:53:22] RECOVERY - haproxy failover on dbproxy1025 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:55:33] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1486.eqiad.wmnet with OS bullseye [15:55:42] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1495.eqiad.wmnet with OS bullseye [15:55:46] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1486.eqiad.wmnet with OS bullseye [15:55:56] 10SRE, 
10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1495.eqiad.wmnet with OS bullseye [15:56:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1165 (re)pooling @ 100%: Upgrade to 10.6.16 and bookworm', diff saved to https://phabricator.wikimedia.org/P55218 and previous config saved to /var/cache/conftool/dbconfig/20240122-155607-root.json [15:56:11] kamila_: I think we just need to add a few replicas to handle spikes, I'm not finding an obvious source rn [15:56:35] claime: OK, that's easy enough [15:56:47] 10SRE, 10ops-codfw: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Papaul) [15:56:57] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10CDanis) [15:57:01] 10SRE, 10Infrastructure-Foundations, 10Traffic: Mapping Client IPs to Resolver IPs - https://phabricator.wikimedia.org/T336947 (10CDanis) 05Open→03Declined Probably the "best" way to solve this is via the Alt-Svc mechanism, which Traffic means to experiment with at some point ({T208242}). That's somethi... 
[15:57:52] (seems like it's mobileapps traffic, https://grafana.wikimedia.org/d/5CmeRcnMz/mobileapps?orgId=1&from=1705935438864&to=1705939038864&viewPanel=10 ) [15:58:07] 10SRE-tools, 10Infrastructure-Foundations: Cookbook for rack depool - https://phabricator.wikimedia.org/T327300 (10joanna_borun) p:05Triage→03Medium [15:58:12] 10SRE, 10ops-codfw: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Papaul) [15:58:18] kamila_: looks like it yeah [15:59:21] 10SRE, 10SRE-swift-storage, 10Data-Persistence: GRUB fails to determine the disks to install to on swift backends - https://phabricator.wikimedia.org/T345816 (10joanna_borun) [16:00:18] 10SRE-tools, 10Infrastructure-Foundations: Add warning when provision cookbook is ran without the virtualization flag on hypervisors - https://phabricator.wikimedia.org/T344342 (10joanna_borun) p:05Triage→03Medium [16:00:21] 10SRE, 10ops-codfw: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Papaul) [16:02:36] (03PS1) 10Majavah: galera: Fix deployment name access [alerts] - 10https://gerrit.wikimedia.org/r/992205 [16:03:12] (03PS1) 10Ammarpad: ruwiki: Add 'edituserjson' right to 'engineers' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992206 (https://phabricator.wikimedia.org/T355499) [16:04:11] (03CR) 10CI reject: [V: 04-1] galera: Fix deployment name access [alerts] - 10https://gerrit.wikimedia.org/r/992205 (owner: 10Majavah) [16:05:55] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for apt-staging - https://phabricator.wikimedia.org/T347032 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This VM is up and running [16:06:02] (03PS2) 10Majavah: galera: Fix deployment name access [alerts] - 10https://gerrit.wikimedia.org/r/992205 [16:07:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P55219 and previous config saved to 
/var/cache/conftool/dbconfig/20240122-160716-ladsgroup.json [16:07:37] (03CR) 10CI reject: [V: 04-1] galera: Fix deployment name access [alerts] - 10https://gerrit.wikimedia.org/r/992205 (owner: 10Majavah) [16:08:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P55220 and previous config saved to /var/cache/conftool/dbconfig/20240122-160809-marostegui.json [16:08:56] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1486.eqiad.wmnet with reason: host reimage [16:09:08] (03PS3) 10Majavah: galera: Fix deployment name access [alerts] - 10https://gerrit.wikimedia.org/r/992205 [16:09:11] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1495.eqiad.wmnet with reason: host reimage [16:11:28] 10SRE, 10Infrastructure-Foundations, 10Traffic: decide on an aggregation function to combine multiple probes into a single measurement - https://phabricator.wikimedia.org/T337318 (10CDanis) 05Open→03Resolved a:03CDanis [16:11:35] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10CDanis) [16:12:20] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1486.eqiad.wmnet with reason: host reimage [16:13:58] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Revisit IP fragmention sysctl settings - https://phabricator.wikimedia.org/T345724 (10joanna_borun) p:05Triage→03Medium [16:14:00] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Revisit IP fragmention sysctl settings - https://phabricator.wikimedia.org/T345724 (10joanna_borun) a:03cmooney [16:14:40] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1495.eqiad.wmnet with reason: host reimage [16:16:21] 10SRE-tools, 10Infrastructure-Foundations: Cookbooks could be more verbose in listing the completed/missing steps - 
https://phabricator.wikimedia.org/T345375 (10Volans) 05Open→03Declined Declining because of inactivity and unclear line of action due to the opposed views. Feel free to re-open if you feel li... [16:17:23] 10SRE, 10ops-codfw: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Papaul) [16:18:15] (03PS1) 10Majavah: P:openstack: galera: always use cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/992208 (https://phabricator.wikimedia.org/T355418) [16:18:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:19:55] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1181/console" [puppet] - 10https://gerrit.wikimedia.org/r/992208 (https://phabricator.wikimedia.org/T355418) (owner: 10Majavah) [16:20:12] 10SRE, 10Infrastructure-Foundations: Adapt firewall logging for nftables - https://phabricator.wikimedia.org/T348736 (10MoritzMuehlenhoff) p:05Triage→03Medium [16:20:42] (03CR) 10Alexandros Kosiaris: [C: 03+1] helmfile/rbac: Allow deploy users to debug pods in experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/991309 (https://phabricator.wikimedia.org/T354516) (owner: 10Klausman) [16:20:48] (03PS1) 10Btullis: Fix the hostname for the wikishared password on superset [puppet] - 10https://gerrit.wikimedia.org/r/992209 (https://phabricator.wikimedia.org/T351925) [16:20:50] (03CR) 10Alexandros Kosiaris: [C: 03+1] ml-serve/staging: Add group to allow debugging operations on [puppet] - 10https://gerrit.wikimedia.org/r/992152 (https://phabricator.wikimedia.org/T354516) (owner: 
10Klausman) [16:21:22] (03CR) 10Klausman: [C: 03+2] ml-serve/staging: Add group to allow debugging operations on [puppet] - 10https://gerrit.wikimedia.org/r/992152 (https://phabricator.wikimedia.org/T354516) (owner: 10Klausman) [16:21:57] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1182/co" [puppet] - 10https://gerrit.wikimedia.org/r/992209 (https://phabricator.wikimedia.org/T351925) (owner: 10Btullis) [16:22:01] 10SRE, 10ops-codfw: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Papaul) [16:22:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P55221 and previous config saved to /var/cache/conftool/dbconfig/20240122-162223-ladsgroup.json [16:23:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P55222 and previous config saved to /var/cache/conftool/dbconfig/20240122-162315-marostegui.json [16:24:41] (03CR) 10Klausman: [C: 03+2] helmfile/rbac: Allow deploy users to debug pods in experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/991309 (https://phabricator.wikimedia.org/T354516) (owner: 10Klausman) [16:27:27] (03Merged) 10jenkins-bot: helmfile/rbac: Allow deploy users to debug pods in experimental [deployment-charts] - 10https://gerrit.wikimedia.org/r/991309 (https://phabricator.wikimedia.org/T354516) (owner: 10Klausman) [16:28:38] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Disk (sda) failed in ms-be2072 - https://phabricator.wikimedia.org/T355330 (10Jhancock.wm) 05Open→03Resolved replacement disk arrived, broken one shipped back. new one put back into stock [16:29:35] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. 
[16:29:44] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [16:30:04] jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikimedia Portals Update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240122T1630). [16:30:39] (03PS1) 10Ryan Kemper: wdqs graph-split: enable microsite [puppet] - 10https://gerrit.wikimedia.org/r/992115 (https://phabricator.wikimedia.org/T354658) [16:31:03] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1486.eqiad.wmnet with OS bullseye [16:31:11] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1486.eqiad.wmnet with OS bullseye completed: - mw1486 (**PASS**) - Downtimed on Icinga/Alertma... [16:33:16] (03PS1) 10Kamila Součková: mw-api-int: add 15 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/992210 [16:33:25] (03PS1) 10Clément Goubert: mw-api-int: Raise replicas to 150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/992212 [16:34:18] (03CR) 10Kamila Součková: [C: 03+1] mw-api-int: Raise replicas to 150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/992212 (owner: 10Clément Goubert) [16:35:14] (03CR) 10Clément Goubert: [C: 03+2] mw-api-int: Raise replicas to 150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/992212 (owner: 10Clément Goubert) [16:36:10] (03Merged) 10jenkins-bot: mw-api-int: Raise replicas to 150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/992212 (owner: 10Clément Goubert) [16:36:12] (03Abandoned) 10Kamila Součková: mw-api-int: add 15 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/992210 (owner: 10Kamila Součková) [16:37:27] (03PS1) 10Klausman: Lift Wing/Rec-API-NG: fix slightly misleading description text [grafana-grizzly] - 
10https://gerrit.wikimedia.org/r/992213 (https://phabricator.wikimedia.org/T347262) [16:37:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T352010)', diff saved to https://phabricator.wikimedia.org/P55224 and previous config saved to /var/cache/conftool/dbconfig/20240122-163729-ladsgroup.json [16:37:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [16:37:38] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [16:37:38] (03CR) 10Klausman: [V: 03+2 C: 03+2] Lift Wing/Rec-API-NG: fix slightly misleading description text [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/992213 (https://phabricator.wikimedia.org/T347262) (owner: 10Klausman) [16:37:44] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [16:37:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [16:37:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [16:38:01] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [16:38:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [16:38:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T352010)', diff saved to https://phabricator.wikimedia.org/P55225 and previous config saved to /var/cache/conftool/dbconfig/20240122-163808-ladsgroup.json [16:38:09] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [16:38:17] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [16:38:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T354336)', diff 
saved to https://phabricator.wikimedia.org/P55226 and previous config saved to /var/cache/conftool/dbconfig/20240122-163822-marostegui.json [16:38:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1177.eqiad.wmnet with reason: Maintenance [16:38:28] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [16:38:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1177.eqiad.wmnet with reason: Maintenance [16:38:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T354336)', diff saved to https://phabricator.wikimedia.org/P55227 and previous config saved to /var/cache/conftool/dbconfig/20240122-163844-marostegui.json [16:39:01] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1495.eqiad.wmnet with OS bullseye [16:39:10] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1495.eqiad.wmnet with OS bullseye completed: - mw1495 (**WARN**) - Downtimed on Icinga/Alertma... 
[16:40:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T354336)', diff saved to https://phabricator.wikimedia.org/P55228 and previous config saved to /var/cache/conftool/dbconfig/20240122-164053-marostegui.json [16:42:10] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/951086 (owner: 10PipelineBot) [16:42:24] !log T353459 Running mwscript /home/daimona/GenerateInvitationList.php to test the script before it reaches production [16:42:26] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/940224 (owner: 10PipelineBot) [16:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:28] T353459: Develop a prototype for Event Invitations with scoring on likelihood of valuable participation - https://phabricator.wikimedia.org/T353459 [16:42:39] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/977235 (owner: 10PipelineBot) [16:42:41] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/978654 (owner: 10PipelineBot) [16:42:43] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/980838 (owner: 10PipelineBot) [16:42:45] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/979465 (owner: 10PipelineBot) [16:42:51] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/981439 (owner: 10PipelineBot) [16:43:01] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/983236 (owner: 10PipelineBot) [16:43:02] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/986231 (owner: 10PipelineBot) [16:43:04] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/986835 (owner: 10PipelineBot) [16:43:06] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/988230 (owner: 10PipelineBot) [16:43:11] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/989769 (owner: 10PipelineBot) [16:43:13] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/990666 (owner: 10PipelineBot) [16:43:15] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/990673 (owner: 10PipelineBot) [16:43:29] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/991453 (owner: 10PipelineBot) [16:43:31] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/990672 (owner: 10PipelineBot) [16:43:33] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/990671 (owner: 10PipelineBot) [16:43:35] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/991462 (owner: 10PipelineBot) [16:43:41] (03Abandoned) 10Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/978600 (owner: 10PipelineBot) [16:43:43] (03Abandoned) 10Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/970844 (owner: 10PipelineBot) [16:43:45] (03Abandoned) 10Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/970845 (owner: 10PipelineBot) [16:43:47] (03Abandoned) 10Jforrester: wikifeeds: 
pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/959981 (owner: 10PipelineBot) [16:43:49] (03Abandoned) 10Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/980839 (owner: 10PipelineBot) [16:46:27] !log dcausse@deploy2002 Started deploy [airflow-dags/search@dcf08b2]: (no justification provided) [16:46:58] !log dcausse@deploy2002 Finished deploy [airflow-dags/search@dcf08b2]: (no justification provided) (duration: 00m 31s) [16:53:05] !log testing HAProxy tune.bufsize = 32768 in cp3066 - T354424 [16:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:10] T354424: HAProxy 2.6.16 CPU spikes on cp3066 - https://phabricator.wikimedia.org/T354424 [16:56:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P55229 and previous config saved to /var/cache/conftool/dbconfig/20240122-165559-marostegui.json [17:00:29] (03CR) 10Brouberol: [C: 03+1] Fix the hostname for the wikishared password on superset [puppet] - 10https://gerrit.wikimedia.org/r/992209 (https://phabricator.wikimedia.org/T351925) (owner: 10Btullis) [17:03:43] 10SRE, 10Wikimedia-Mailing-lists: Close mailing list safetywikimania2021 - https://phabricator.wikimedia.org/T355480 (10Ladsgroup) 05Open→03Resolved [17:05:24] !log restore HAProxy tune.bufsize = 16684 in cp3066 - T354424 [17:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:39] T354424: HAProxy 2.6.16 CPU spikes on cp3066 - https://phabricator.wikimedia.org/T354424 [17:11:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P55230 and previous config saved to /var/cache/conftool/dbconfig/20240122-171106-marostegui.json [17:14:16] RECOVERY - Check systemd state on db1213 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:32] RECOVERY - Check systemd state on db2169 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:51] 10SRE, 10ops-codfw: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Marostegui) @Papaul the hosts belonging to Data Persistence will be off and ready to be moved. [17:15:57] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Marostegui) [17:17:12] !log draining kubestage2001, uncordoning kubestage2002 to allow it to receive the pods. T355437 [17:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:17] T355437: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 [17:17:39] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10akosiaris) [17:18:05] (03PS1) 10Majavah: P:openstack: galera: migrate to firewall [puppet] - 10https://gerrit.wikimedia.org/r/992217 [17:19:52] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1183/co" [puppet] - 10https://gerrit.wikimedia.org/r/992217 (owner: 10Majavah) [17:24:22] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one suggestion inline" [puppet] - 10https://gerrit.wikimedia.org/r/992217 (owner: 10Majavah) [17:24:47] (03PS2) 10Majavah: P:openstack: galera: migrate to firewall [puppet] - 10https://gerrit.wikimedia.org/r/992217 [17:26:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T354336)', diff saved to https://phabricator.wikimedia.org/P55231 and previous config saved to /var/cache/conftool/dbconfig/20240122-172612-marostegui.json [17:26:15] !log marostegui@cumin1002 START - Cookbook 
sre.hosts.downtime for 8:00:00 on db1178.eqiad.wmnet with reason: Maintenance [17:26:18] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [17:26:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1178.eqiad.wmnet with reason: Maintenance [17:26:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T354336)', diff saved to https://phabricator.wikimedia.org/P55232 and previous config saved to /var/cache/conftool/dbconfig/20240122-172635-marostegui.json [17:27:00] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/992217 (owner: 10Majavah) [17:27:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T354336)', diff saved to https://phabricator.wikimedia.org/P55233 and previous config saved to /var/cache/conftool/dbconfig/20240122-172743-marostegui.json [17:30:21] 10ops-eqiad, 10DC-Ops: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571 (10RobH) [17:30:54] 10ops-eqiad, 10DC-Ops: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571 (10RobH) [17:32:50] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:32:56] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:35:17] 10SRE, 10ops-codfw, 10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10cmooney) @papaul we have lvs2011 in U43 in A2, so we can't put ml-serve2005 there. 
Also es2026 can't be connected at 1G on lsw1-a2-codfw port 41, as port 42 is connected to lvs2011 at... [17:35:51] (03PS7) 10Htriedman: T354456: update eventstream helm values.yaml file to include hard-coded list of redacted pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 [17:35:54] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:36:00] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:37:40] (ProbeDown) firing: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:38:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T352010)', diff saved to https://phabricator.wikimedia.org/P55234 and previous config saved to /var/cache/conftool/dbconfig/20240122-173840-ladsgroup.json [17:38:47] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:42:40] (ProbeDown) resolved: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:42:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P55235 and previous config saved to /var/cache/conftool/dbconfig/20240122-174249-marostegui.json [17:44:33] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2088.codfw.wmnet 
with OS bullseye [17:46:38] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2088.codfw.wmnet with OS bullseye [17:53:20] (03CR) 10Dzahn: [C: 03+2] "thanks for the quick fix!" [puppet] - 10https://gerrit.wikimedia.org/r/992189 (https://phabricator.wikimedia.org/T355502) (owner: 10EoghanGaffney) [17:53:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P55236 and previous config saved to /var/cache/conftool/dbconfig/20240122-175346-ladsgroup.json [17:55:50] (03CR) 10Dzahn: [C: 03+2] "this is fine but there is another problem that became obvious afterwards: ModuleNotFoundError: No module named 'bzlib'" [puppet] - 10https://gerrit.wikimedia.org/r/992189 (https://phabricator.wikimedia.org/T355502) (owner: 10EoghanGaffney) [17:57:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P55237 and previous config saved to /var/cache/conftool/dbconfig/20240122-175755-marostegui.json [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240122T1800) [18:00:04] ryankemper: OwO what's this, a deployment window?? Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240122T1800). nyaa~ [18:08:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P55238 and previous config saved to /var/cache/conftool/dbconfig/20240122-180853-ladsgroup.json [18:10:00] (03CR) 10Dzahn: "a patch from Paladox :) nice surprise. how are you?" [puppet] - 10https://gerrit.wikimedia.org/r/992109 (https://phabricator.wikimedia.org/T354886) (owner: 10Paladox) [18:10:57] (03CR) 10Dzahn: [C: 03+1] "This looks like a needed fix for an issue caused by today's upgrade, so yea! thanks. 
https://phabricator.wikimedia.org/T355537" [puppet] - 10https://gerrit.wikimedia.org/r/992109 (https://phabricator.wikimedia.org/T354886) (owner: 10Paladox) [18:11:36] (03CR) 10Dzahn: [C: 03+1] "thc" [puppet] - 10https://gerrit.wikimedia.org/r/992109 (https://phabricator.wikimedia.org/T354886) (owner: 10Paladox) [18:12:20] (03CR) 10Dzahn: "should be ready tomorrow, after we added the new names to the certs. this is planned for our office hours meeting. so wait about 24 hours" [puppet] - 10https://gerrit.wikimedia.org/r/992115 (https://phabricator.wikimedia.org/T354658) (owner: 10Ryan Kemper) [18:12:52] (03CR) 10Dzahn: [C: 03+2] Revert "vrts: test delaying blackbox::check::http" [puppet] - 10https://gerrit.wikimedia.org/r/992108 (https://phabricator.wikimedia.org/T354479) (owner: 10Jelto) [18:13:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T354336)', diff saved to https://phabricator.wikimedia.org/P55239 and previous config saved to /var/cache/conftool/dbconfig/20240122-181302-marostegui.json [18:13:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1192.eqiad.wmnet with reason: Maintenance [18:13:07] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [18:13:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1192.eqiad.wmnet with reason: Maintenance [18:13:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T354336)', diff saved to https://phabricator.wikimedia.org/P55240 and previous config saved to /var/cache/conftool/dbconfig/20240122-181324-marostegui.json [18:14:16] (03PS1) 10Ladsgroup: mariadb: prometheus on localhost grant should be VIA unix_socket [puppet] - 10https://gerrit.wikimedia.org/r/992220 [18:14:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T354336)', diff saved to 
https://phabricator.wikimedia.org/P55241 and previous config saved to /var/cache/conftool/dbconfig/20240122-181433-marostegui.json [18:15:16] PROBLEM - Check systemd state on mw1495 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:19] (03CR) 10Dzahn: [C: 03+1] "upstream did indeed "Remove html commentlink functionality."" [puppet] - 10https://gerrit.wikimedia.org/r/992109 (https://phabricator.wikimedia.org/T354886) (owner: 10Paladox) [18:17:13] (03CR) 10Btullis: [V: 03+1 C: 03+2] Fix the hostname for the wikishared password on superset [puppet] - 10https://gerrit.wikimedia.org/r/992209 (https://phabricator.wikimedia.org/T351925) (owner: 10Btullis) [18:21:23] 10SRE-tools, 10Infrastructure-Foundations: Reimage cookbook fails to downtime hosts when run concurrently - https://phabricator.wikimedia.org/T355187 (10Volans) Debugging this it seems that this was caused by a race condition in which `run-puppet-agent` check passed and said that puppet was not running but by... 
[18:24:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T352010)', diff saved to https://phabricator.wikimedia.org/P55242 and previous config saved to /var/cache/conftool/dbconfig/20240122-182359-ladsgroup.json [18:24:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [18:24:05] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:24:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [18:24:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T352010)', diff saved to https://phabricator.wikimedia.org/P55243 and previous config saved to /var/cache/conftool/dbconfig/20240122-182432-ladsgroup.json [18:29:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P55244 and previous config saved to /var/cache/conftool/dbconfig/20240122-182939-marostegui.json [18:34:45] (03PS1) 10Santiago Faci: [DNM] Update the WikiLambda instrumentation to use core interaction events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) [18:36:40] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:36:44] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:38:24] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:38:38] PROBLEM - Check whether ferm is active by checking the default input chain on mw1495 is CRITICAL: ERROR ferm input drop default policy not set, ferm 
might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:44:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P55245 and previous config saved to /var/cache/conftool/dbconfig/20240122-184446-marostegui.json [18:45:10] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2023-11-29-152839 to 2024-01-18-182456 [deployment-charts] - 10https://gerrit.wikimedia.org/r/992225 (https://phabricator.wikimedia.org/T278596) [18:48:36] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2023-11-29-143341 to 2024-01-18-182630 [deployment-charts] - 10https://gerrit.wikimedia.org/r/992227 (https://phabricator.wikimedia.org/T278596) [18:53:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:54:52] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.384 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:54:52] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:55:02] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:59:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T354336)', diff saved to https://phabricator.wikimedia.org/P55246 and previous config saved to /var/cache/conftool/dbconfig/20240122-185952-marostegui.json [18:59:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1193.eqiad.wmnet with reason: Maintenance [19:00:03] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [19:00:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1193.eqiad.wmnet with reason: Maintenance [19:00:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1193 (T354336)', diff saved to https://phabricator.wikimedia.org/P55247 and previous config saved to /var/cache/conftool/dbconfig/20240122-190014-marostegui.json [19:01:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T354336)', diff saved to https://phabricator.wikimedia.org/P55248 and previous config saved to /var/cache/conftool/dbconfig/20240122-190123-marostegui.json [19:06:54] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2088.codfw.wmnet with OS bullseye [19:13:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:14:45] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:16:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P55249 and previous config saved to /var/cache/conftool/dbconfig/20240122-191629-marostegui.json [19:19:45] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:26:07] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Upgrade orchestrator from 2023-11-29-152839 to 2024-01-18-182456 [deployment-charts] - 10https://gerrit.wikimedia.org/r/992225 (https://phabricator.wikimedia.org/T278596) (owner: 10Jforrester) [19:27:13] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2023-11-29-152839 to 2024-01-18-182456 [deployment-charts] - 10https://gerrit.wikimedia.org/r/992225 (https://phabricator.wikimedia.org/T278596) (owner: 10Jforrester) [19:28:13] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [19:28:56] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [19:31:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P55250 and previous config saved to /var/cache/conftool/dbconfig/20240122-193136-marostegui.json [19:35:28] 10SRE, 10SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10Arinaigu) [19:41:54] 10SRE, 10ops-codfw, 
10Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (10Papaul) @Marostegui thank you @cmooney i will again take a look at it thanks [19:41:59] (03CR) 10Jforrester: [C: 03+2] Revert "wikifunctions: Upgrade orchestrator from 2023-11-29-152839 to 2024-01-18-182456" [deployment-charts] - 10https://gerrit.wikimedia.org/r/992122 (https://phabricator.wikimedia.org/T355592) (owner: 10Jforrester) [19:43:06] (03Merged) 10jenkins-bot: Revert "wikifunctions: Upgrade orchestrator from 2023-11-29-152839 to 2024-01-18-182456" [deployment-charts] - 10https://gerrit.wikimedia.org/r/992122 (https://phabricator.wikimedia.org/T355592) (owner: 10Jforrester) [19:44:18] (03PS1) 10Jforrester: Revert "wikifunctions: Upgrade orchestrator from 2023-11-29-152839 to 2024-01-18-182456" [deployment-charts] - 10https://gerrit.wikimedia.org/r/992122 (https://phabricator.wikimedia.org/T355592) [19:45:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:45:30] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2023-11-29-152839 to 2024-01-09-190638 [deployment-charts] - 10https://gerrit.wikimedia.org/r/992229 (https://phabricator.wikimedia.org/T292804) [19:45:37] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Upgrade orchestrator from 2023-11-29-152839 to 2024-01-09-190638 [deployment-charts] - 10https://gerrit.wikimedia.org/r/992229 (https://phabricator.wikimedia.org/T292804) (owner: 10Jforrester) [19:45:38] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (10Andrew) [19:46:30] (03Merged) 
10jenkins-bot: wikifunctions: Upgrade orchestrator from 2023-11-29-152839 to 2024-01-09-190638 [deployment-charts] - 10https://gerrit.wikimedia.org/r/992229 (https://phabricator.wikimedia.org/T292804) (owner: 10Jforrester) [19:46:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T354336)', diff saved to https://phabricator.wikimedia.org/P55251 and previous config saved to /var/cache/conftool/dbconfig/20240122-194642-marostegui.json [19:46:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1203.eqiad.wmnet with reason: Maintenance [19:46:51] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [19:46:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1203.eqiad.wmnet with reason: Maintenance [19:47:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T354336)', diff saved to https://phabricator.wikimedia.org/P55252 and previous config saved to /var/cache/conftool/dbconfig/20240122-194704-marostegui.json [19:47:40] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [19:48:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T354336)', diff saved to https://phabricator.wikimedia.org/P55253 and previous config saved to /var/cache/conftool/dbconfig/20240122-194813-marostegui.json [19:48:24] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [19:48:52] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [19:50:01] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [19:50:03] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [19:51:13] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [19:52:14] 
(03PS2) 10Jforrester: wikifunctions: Upgrade evaluators from 2023-11-29-143341 to 2024-01-18-182630 [deployment-charts] - 10https://gerrit.wikimedia.org/r/992227 (https://phabricator.wikimedia.org/T278596) [19:52:17] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Upgrade evaluators from 2023-11-29-143341 to 2024-01-18-182630 [deployment-charts] - 10https://gerrit.wikimedia.org/r/992227 (https://phabricator.wikimedia.org/T278596) (owner: 10Jforrester) [19:53:12] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2023-11-29-143341 to 2024-01-18-182630 [deployment-charts] - 10https://gerrit.wikimedia.org/r/992227 (https://phabricator.wikimedia.org/T278596) (owner: 10Jforrester) [19:54:05] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [19:54:47] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [19:55:09] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [19:55:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:56:08] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [19:56:17] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [19:56:45] (03PS8) 10Gmodena: eventstreams: add redacted pages config. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/988114 (https://phabricator.wikimedia.org/T354456) (owner: 10Htriedman) [19:57:15] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [20:03:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P55254 and previous config saved to /var/cache/conftool/dbconfig/20240122-200319-marostegui.json [20:18:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P55255 and previous config saved to /var/cache/conftool/dbconfig/20240122-201826-marostegui.json [20:33:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T354336)', diff saved to https://phabricator.wikimedia.org/P55256 and previous config saved to /var/cache/conftool/dbconfig/20240122-203332-marostegui.json [20:33:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1211.eqiad.wmnet with reason: Maintenance [20:33:42] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [20:33:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1211.eqiad.wmnet with reason: Maintenance [20:33:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T354336)', diff saved to https://phabricator.wikimedia.org/P55257 and previous config saved to /var/cache/conftool/dbconfig/20240122-203354-marostegui.json [20:36:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T354336)', diff saved to https://phabricator.wikimedia.org/P55258 and previous config saved to /var/cache/conftool/dbconfig/20240122-203602-marostegui.json [20:38:14] (03PS1) 10Majavah: site: Add cloudrabbit1003 as insetup [puppet] - 10https://gerrit.wikimedia.org/r/992234 
(https://phabricator.wikimedia.org/T345610) [20:38:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:43:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 49.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:50:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:51:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P55259 and previous config saved to /var/cache/conftool/dbconfig/20240122-205109-marostegui.json [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240122T2100). nyaa~ [21:00:05] varnent: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:03:10] !log taavi@cumin1002 START - Cookbook sre.dns.netbox
[21:03:26] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (taavi) These hosts are still in Netbox and are marked as occupying switch ports etc - can those be cleaned up?
[21:04:20] (CR) Majavah: [C: +2] site: Add cloudrabbit1003 as insetup [puppet] - https://gerrit.wikimedia.org/r/992234 (https://phabricator.wikimedia.org/T345610) (owner: Majavah)
[21:05:17] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: allocate IPs for cloudrabbit1003 - taavi@cumin1002"
[21:06:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P55260 and previous config saved to /var/cache/conftool/dbconfig/20240122-210615-marostegui.json
[21:07:09] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: allocate IPs for cloudrabbit1003 - taavi@cumin1002"
[21:07:09] !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:07:37] !log taavi@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudrabbit1003
[21:07:59] !log taavi@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudrabbit1003
[21:09:30] SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (VRiley-WMF) Physically moved the server to F4, U18.
Port 4 CableID 2M-20220019
[21:17:11] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudrabbit1003.eqiad.wmnet with OS bookworm
[21:17:31] (CR) JHathaway: [C: +1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/992087 (owner: Majavah)
[21:20:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:21:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T354336)', diff saved to https://phabricator.wikimedia.org/P55261 and previous config saved to /var/cache/conftool/dbconfig/20240122-212122-marostegui.json
[21:21:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1214.eqiad.wmnet with reason: Maintenance
[21:21:33] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[21:21:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1214.eqiad.wmnet with reason: Maintenance
[21:21:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T354336)', diff saved to https://phabricator.wikimedia.org/P55262 and previous config saved to /var/cache/conftool/dbconfig/20240122-212144-marostegui.json
[21:22:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T354336)', diff saved to https://phabricator.wikimedia.org/P55263 and previous config saved to /var/cache/conftool/dbconfig/20240122-212252-marostegui.json
[21:24:00] !log taavi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudrabbit1003.eqiad.wmnet with OS bookworm
[21:24:34] !log
taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudrabbit1003.eqiad.wmnet with OS bookworm
[21:27:57] (PS9) Gmodena: eventstreams: add redacted pages config. [deployment-charts] - https://gerrit.wikimedia.org/r/988114 (https://phabricator.wikimedia.org/T354456) (owner: Htriedman)
[21:30:30] (PS10) Gmodena: eventstreams: add redacted pages config. [deployment-charts] - https://gerrit.wikimedia.org/r/988114 (https://phabricator.wikimedia.org/T354456) (owner: Htriedman)
[21:32:48] !log taavi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudrabbit1003.eqiad.wmnet with OS bookworm
[21:33:02] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudrabbit1003.eqiad.wmnet with OS bookworm
[21:34:01] (PS11) Gmodena: eventstreams: add redacted pages config. [deployment-charts] - https://gerrit.wikimedia.org/r/988114 (https://phabricator.wikimedia.org/T354456) (owner: Htriedman)
[21:34:30] (CR) TChin: [C: +1] eventstreams: add redacted pages config. [deployment-charts] - https://gerrit.wikimedia.org/r/988114 (https://phabricator.wikimedia.org/T354456) (owner: Htriedman)
[21:34:38] SRE, SRE-Access-Requests: Requesting access for amastilovic - https://phabricator.wikimedia.org/T355606 (amastilovic)
[21:36:16] (CR) Majavah: [C: +2] get_config: use .mailmap [puppet] - https://gerrit.wikimedia.org/r/992087 (owner: Majavah)
[21:37:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P55264 and previous config saved to /var/cache/conftool/dbconfig/20240122-213758-marostegui.json
[21:38:48] (CR) Htriedman: [C: +1] eventstreams: add redacted pages config.
[deployment-charts] - https://gerrit.wikimedia.org/r/988114 (https://phabricator.wikimedia.org/T354456) (owner: Htriedman)
[21:41:19] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for Aleksandar Mastilovic - https://phabricator.wikimedia.org/T355607 (amastilovic)
[21:45:02] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "set cloudrabbit1003 as active - taavi@cumin1002"
[21:46:30] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "set cloudrabbit1003 as active - taavi@cumin1002"
[21:48:26] !log taavi@cumin1002 START - Cookbook sre.dns.netbox
[21:50:33] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add cloudrabbit1003 cloud-private address - taavi@cumin1002"
[21:51:24] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add cloudrabbit1003 cloud-private address - taavi@cumin1002"
[21:51:25] !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:52:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:53:03] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit1003.eqiad.wmnet with reason: host reimage
[21:53:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P55265 and previous config saved to /var/cache/conftool/dbconfig/20240122-215305-marostegui.json
[21:56:21] !log
taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit1003.eqiad.wmnet with reason: host reimage
[21:57:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:00:05] Reedy, sbassett, Maryum, and manfredi: May I have your attention please! Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240122T2200)
[22:04:30] SRE, SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (Rmaung) @Arinaigu can you add analytics-privatedata-users to "Requested group membership"
[22:08:02] (PS1) Majavah: cr-labs: Remove temporary openstack-apis rule [homer/public] - https://gerrit.wikimedia.org/r/992244
[22:08:04] (PS1) Majavah: cr-labs: Add temporary term for cloudrabbit hosts [homer/public] - https://gerrit.wikimedia.org/r/992245 (https://phabricator.wikimedia.org/T345610)
[22:08:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T354336)', diff saved to https://phabricator.wikimedia.org/P55266 and previous config saved to /var/cache/conftool/dbconfig/20240122-220811-marostegui.json
[22:08:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[22:08:26] SRE, SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (Arinaigu)
[22:08:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[22:08:30] !log marostegui@cumin1002 START -
Cookbook sre.hosts.downtime for 8:00:00 on db1226.eqiad.wmnet with reason: Maintenance
[22:08:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1226.eqiad.wmnet with reason: Maintenance
[22:08:44] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[22:08:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T354336)', diff saved to https://phabricator.wikimedia.org/P55267 and previous config saved to /var/cache/conftool/dbconfig/20240122-220850-marostegui.json
[22:09:11] SRE, SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (Arinaigu)
[22:09:56] SRE, SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (Arinaigu)
[22:10:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T354336)', diff saved to https://phabricator.wikimedia.org/P55268 and previous config saved to /var/cache/conftool/dbconfig/20240122-221058-marostegui.json
[22:11:28] SRE, SRE-Access-Requests: Requesting access to deployment or deploy-service group for sbailey(WMF) - https://phabricator.wikimedia.org/T355612 (Sbailey)
[22:13:25] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - taavi@cumin1002"
[22:14:16] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - taavi@cumin1002"
[22:14:17] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit1003.eqiad.wmnet with OS bookworm
[22:24:56] !log Deployed patch for T355538
[22:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:05] !log marostegui@cumin1002 dbctl commit
(dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P55269 and previous config saved to /var/cache/conftool/dbconfig/20240122-222605-marostegui.json
[22:35:11] SRE, SRE-Access-Requests: Requesting access for amastilovic - https://phabricator.wikimedia.org/T355606 (Ahoelzl) As @amastilovic manager I approve the access request.
[22:41:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P55270 and previous config saved to /var/cache/conftool/dbconfig/20240122-224111-marostegui.json
[22:45:44] (PS1) Eevans: cassandra: reconfigure 'dev' target_version for a 4.x release [puppet] - https://gerrit.wikimedia.org/r/992249 (https://phabricator.wikimedia.org/T352469)
[22:46:10] (PS2) Eevans: cassandra: reconfigure 'dev' target_version for a 4.x release [puppet] - https://gerrit.wikimedia.org/r/992249 (https://phabricator.wikimedia.org/T352469)
[22:47:05] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2088']
[22:47:44] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2088']
[22:48:27] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/992249 (https://phabricator.wikimedia.org/T352469) (owner: Eevans)
[22:56:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T354336)', diff saved to https://phabricator.wikimedia.org/P55271 and previous config saved to /var/cache/conftool/dbconfig/20240122-225618-marostegui.json
[22:56:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[22:56:26] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[22:56:34] !log marostegui@cumin1002 END (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[22:57:24] jouncebot: nowandnext
[22:57:24] For the next 1 hour(s) and 2 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240122T2200)
[22:57:24] In 4 hour(s) and 2 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240123T0300)
[22:58:06] maryum: are you done with deploying?
[23:00:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:02:55] (PS1) Zabe: beta: Start reading from af_user(_text)/afh_user(_text) [mediawiki-config] - https://gerrit.wikimedia.org/r/992250 (https://phabricator.wikimedia.org/T355616)
[23:04:18] (PS2) Zabe: Stop setting wgShowIPinHeader [mediawiki-config] - https://gerrit.wikimedia.org/r/991930 (https://phabricator.wikimedia.org/T355479)
[23:04:20] (CR) Zabe: [C: +2] Stop setting wgShowIPinHeader [mediawiki-config] - https://gerrit.wikimedia.org/r/991930 (https://phabricator.wikimedia.org/T355479) (owner: Zabe)
[23:05:04] (Merged) jenkins-bot: Stop setting wgShowIPinHeader [mediawiki-config] - https://gerrit.wikimedia.org/r/991930 (https://phabricator.wikimedia.org/T355479) (owner: Zabe)
[23:05:21] (PS2) Zabe: beta: Start reading from af_user(_text)/afh_user(_text) [mediawiki-config] - https://gerrit.wikimedia.org/r/992250 (https://phabricator.wikimedia.org/T355616)
[23:05:24] (CR) Zabe: [C: +2] beta: Start reading from af_user(_text)/afh_user(_text) [mediawiki-config]
- https://gerrit.wikimedia.org/r/992250 (https://phabricator.wikimedia.org/T355616) (owner: Zabe)
[23:06:10] (Merged) jenkins-bot: beta: Start reading from af_user(_text)/afh_user(_text) [mediawiki-config] - https://gerrit.wikimedia.org/r/992250 (https://phabricator.wikimedia.org/T355616) (owner: Zabe)
[23:06:33] !log zabe@deploy2002 Started scap: Backport for [[gerrit:991930|Stop setting wgShowIPinHeader (T355479)]], [[gerrit:992250|beta: Start reading from af_user(_text)/afh_user(_text) (T355616)]]
[23:06:46] T355479: CommonSettings.php defines $wgShowIPinHeader, which was deprecated in 1.27 and has no effect anymore - https://phabricator.wikimedia.org/T355479
[23:06:46] T355616: Start reading from af_user(_text)/afh_user(_text) - https://phabricator.wikimedia.org/T355616
[23:08:20] !log zabe@deploy2002 zabe: Backport for [[gerrit:991930|Stop setting wgShowIPinHeader (T355479)]], [[gerrit:992250|beta: Start reading from af_user(_text)/afh_user(_text) (T355616)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:08:22] !log zabe@deploy2002 zabe: Continuing with sync
[23:09:30] SRE, LDAP-Access-Requests: Grant Access to wmf, sre-admins for swfrench - https://phabricator.wikimedia.org/T355618 (Scott_French)
[23:14:04] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:991930|Stop setting wgShowIPinHeader (T355479)]], [[gerrit:992250|beta: Start reading from af_user(_text)/afh_user(_text) (T355616)]] (duration: 07m 31s)
[23:14:10] T355479: CommonSettings.php defines $wgShowIPinHeader, which was deprecated in 1.27 and has no effect anymore - https://phabricator.wikimedia.org/T355479
[23:14:11] T355616: Start reading from af_user(_text)/afh_user(_text) - https://phabricator.wikimedia.org/T355616
[23:20:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:23:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:24:25] (PS1) Scott French: admin: add swfrench to data.yaml [puppet] - https://gerrit.wikimedia.org/r/992251 (https://phabricator.wikimedia.org/T355618)
[23:24:27] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - https://gerrit.wikimedia.org/r/992251 (https://phabricator.wikimedia.org/T355618) (owner: Scott French)
[23:26:00] (CR) CI reject: [V: -1] admin: add swfrench to data.yaml [puppet] - https://gerrit.wikimedia.org/r/992251 (https://phabricator.wikimedia.org/T355618) (owner: Scott French)
[23:26:59] (CR) Scott French: "Many thanks in advance for the review, Reuven." [puppet] - https://gerrit.wikimedia.org/r/992251 (https://phabricator.wikimedia.org/T355618) (owner: Scott French)
[23:28:03] (PS2) Scott French: admin: add swfrench to data.yaml [puppet] - https://gerrit.wikimedia.org/r/992251 (https://phabricator.wikimedia.org/T355618)
[23:28:05] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia!
:) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - https://gerrit.wikimedia.org/r/992251 (https://phabricator.wikimedia.org/T355618) (owner: Scott French)
[23:28:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:28:31] (PS3) Scott French: admin: add swfrench to data.yaml [puppet] - https://gerrit.wikimedia.org/r/992251 (https://phabricator.wikimedia.org/T355618)
[23:37:22] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to wmf, sre-admins for swfrench - https://phabricator.wikimedia.org/T355618 (Scott_French) Open→In progress
[23:43:38] (PS3) Varnent: Added Diff to approved list of RSS feeds for Foundation Governance Wiki and removed inoperative feed. [mediawiki-config] - https://gerrit.wikimedia.org/r/991100 (https://phabricator.wikimedia.org/T354790)
[23:46:40] Anyone around to do deployment of https://gerrit.wikimedia.org/r/c/991100/ - looks like deployment window for it closed but not merged yet.
[23:46:40] SRE, ops-codfw, Infrastructure-Foundations, netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (Papaul) @cmooney can we get those 2 hosts back in decom?
Thanks
[23:47:14] SRE, serviceops: Scap Error - https://phabricator.wikimedia.org/T355622 (Mstyles)
[23:47:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T352010)', diff saved to https://phabricator.wikimedia.org/P55272 and previous config saved to /var/cache/conftool/dbconfig/20240122-234757-ladsgroup.json
[23:48:06] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[23:53:45] SRE, ops-esams, DC-Ops, Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (Papaul)
[23:54:20] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to wmf, sre-admins for swfrench - https://phabricator.wikimedia.org/T355618 (Scott_French)
[23:55:45] (PS4) Scott French: admin: add swfrench to data.yaml [puppet] - https://gerrit.wikimedia.org/r/992251 (https://phabricator.wikimedia.org/T355618)
[23:57:20] (CR) CI reject: [V: -1] admin: add swfrench to data.yaml [puppet] - https://gerrit.wikimedia.org/r/992251 (https://phabricator.wikimedia.org/T355618) (owner: Scott French)
[23:57:24] (PS5) Scott French: admin: add swfrench to data.yaml [puppet] - https://gerrit.wikimedia.org/r/992251 (https://phabricator.wikimedia.org/T355618)