[00:02:19] (03PS1) 10Razzi: karapace: switch karapace to use kafka-jumbo1001 [puppet] - 10https://gerrit.wikimedia.org/r/787112 (https://phabricator.wikimedia.org/T301562) [00:02:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T298565)', diff saved to https://phabricator.wikimedia.org/P26847 and previous config saved to /var/cache/conftool/dbconfig/20220428-000244-ladsgroup.json [00:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:51] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:03:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [00:03:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [00:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T298565)', diff saved to https://phabricator.wikimedia.org/P26848 and previous config saved to /var/cache/conftool/dbconfig/20220428-000317-ladsgroup.json [00:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:36] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34971/console" [puppet] - 10https://gerrit.wikimedia.org/r/787112 (https://phabricator.wikimedia.org/T301562) (owner: 10Razzi) [00:05:27] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: 
dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:06:23] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:08:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P26849 and previous config saved to /var/cache/conftool/dbconfig/20220428-000849-ladsgroup.json [00:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:36] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on an-web1001.eqiad.wmnet with reason: Restart for kernel upgrade [00:09:38] !log razzi@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1:00:00 on an-web1001.eqiad.wmnet with reason: Restart for kernel upgrade [00:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:54] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on an-web1001.eqiad.wmnet with reason: Restart for kernel upgrade [00:09:56] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on an-web1001.eqiad.wmnet with reason: Restart for kernel upgrade [00:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:11] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-web1001.eqiad.wmnet [00:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:54] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-web1001.eqiad.wmnet [00:15:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:50] 10SRE, 10Dumps-Generation, 10Infrastructure-Foundations (FY2021/2022-Q4), 10SRE Observability (FY2021/2022-Q4), 10Security: Remaining data engineering host security restarts - https://phabricator.wikimedia.org/T307055 (10razzi) [00:23:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T306560)', diff saved to https://phabricator.wikimedia.org/P26850 and previous config saved to /var/cache/conftool/dbconfig/20220428-002354-ladsgroup.json [00:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:02] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [00:24:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T298565)', diff saved to https://phabricator.wikimedia.org/P26851 and previous config saved to /var/cache/conftool/dbconfig/20220428-002420-ladsgroup.json [00:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:26] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:39:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P26852 and previous config saved to /var/cache/conftool/dbconfig/20220428-003925-ladsgroup.json [00:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:35] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [00:52:53] RECOVERY - Citoid LVS eqiad 
on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [00:54:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P26853 and previous config saved to /var/cache/conftool/dbconfig/20220428-005430-ladsgroup.json [00:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:55:36] (03PS1) 10Stang: Revert Simplified Chinese logo of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787113 (https://phabricator.wikimedia.org/T276694) [00:56:19] (03Abandoned) 10Stang: Revert Simplified Chinese logo of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787113 (https://phabricator.wikimedia.org/T276694) (owner: 10Stang) [00:59:27] RECOVERY - DNS on logstash2028.mgmt is OK: DNS OK: 0.012 seconds response time. logstash2028.mgmt.codfw.wmnet returns 10.193.1.93 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:03:47] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:09:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T298565)', diff saved to https://phabricator.wikimedia.org/P26854 and previous config saved to /var/cache/conftool/dbconfig/20220428-010935-ladsgroup.json [01:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:09:42] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:10:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [01:10:02] (03CR) 
10Stang: "Please optimize png file via the following command" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775320 (https://phabricator.wikimedia.org/T276694) (owner: 10Steven Sun) [01:10:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [01:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:10:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T298565)', diff saved to https://phabricator.wikimedia.org/P26855 and previous config saved to /var/cache/conftool/dbconfig/20220428-011007-ladsgroup.json [01:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:32:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T298565)', diff saved to https://phabricator.wikimedia.org/P26856 and previous config saved to /var/cache/conftool/dbconfig/20220428-013218-ladsgroup.json [01:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:32:25] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:39:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P26857 and previous config saved to 
/var/cache/conftool/dbconfig/20220428-014723-ladsgroup.json [01:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:49:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:26] (CR) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf (3 comments) [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [02:02:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P26858 and previous config saved to /var/cache/conftool/dbconfig/20220428-020228-ladsgroup.json [02:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:04:57] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:10:01] (PS2) Steven Sun: Revert Simplified Chinese logo of zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/775320 (https://phabricator.wikimedia.org/T276694) [02:12:31] (CR) Steven Sun: Revert Simplified Chinese logo of zhwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/775320 (https://phabricator.wikimedia.org/T276694) (owner: Steven Sun) [02:17:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T298565)', diff saved to https://phabricator.wikimedia.org/P26859 and previous config saved to /var/cache/conftool/dbconfig/20220428-021733-ladsgroup.json [02:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:17:41] T298565: Fix mismatching field type of user table for columns user_email_authenticated, 
user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:17:59] (03PS1) 10Stang: zhwiki: Add comment to corresponding task of logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787114 (https://phabricator.wikimedia.org/T276694) [02:23:29] (03PS2) 10Stang: zhwiki: Add comment to corresponding task of logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787114 (https://phabricator.wikimedia.org/T276694) [02:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:38:14] (03PS1) 10Stang: zhwikiversity: Enable blocking feature of AbuseFilter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787115 (https://phabricator.wikimedia.org/T307007) [03:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:08:07] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:50:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:08:41] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:10:55] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:48:21] (03PS1) 10Marostegui: core-percona.my.cnf.erb: Update optimizer options [puppet] - 10https://gerrit.wikimedia.org/r/787119 (https://phabricator.wikimedia.org/T301879) [04:49:07] (03CR) 10Marostegui: [C: 03+2] core-percona.my.cnf.erb: Update optimizer options [puppet] - 10https://gerrit.wikimedia.org/r/787119 (https://phabricator.wikimedia.org/T301879) (owner: 10Marostegui) [05:19:41] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (21) node(s) change every puppet run: contint1001, contint2001, ms-be1068, ms-be1069, ms-be1070, ms-be1071, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, puppetmaster1001, puppetmaster2001, releases1002, releases2002, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://w [05:19:41] wikimedia.org/wiki/Puppet%23check_puppet_run_changes [05:38:27] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (21) node(s) change every puppet run: contint1001, contint2001, ms-be1068, ms-be1069, ms-be1070, ms-be1071, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, puppetmaster1001, puppetmaster2001, releases1002, releases2002, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://w [05:38:27] wikimedia.org/wiki/Puppet%23check_puppet_run_changes [05:45:02] (03PS1) 10Marostegui: Revert "es2022: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/787090 [05:45:47] (03CR) 10Marostegui: [C: 03+2] Revert 
"es2022: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/787090 (owner: 10Marostegui) [06:00:04] kormat, marostegui, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220428T0600). [06:11:39] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:33:22] (03PS3) 10Giuseppe Lavagetto: varnish: remove absented resource [puppet] - 10https://gerrit.wikimedia.org/r/778546 (https://phabricator.wikimedia.org/T305606) [06:34:13] (03CR) 10jerkins-bot: [V: 04-1] varnish: remove absented resource [puppet] - 10https://gerrit.wikimedia.org/r/778546 (https://phabricator.wikimedia.org/T305606) (owner: 10Giuseppe Lavagetto) [06:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:00:04] Amir1, apergos, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220428T0700). [07:00:04] kart_ and koi: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:14] o/ I can deploy today [07:00:16] morning! 6 patches in the window, that's max [07:00:20] hey taavi, awesome [07:00:24] no trainees today [07:00:28] kart_: do you want to self-service? 
[07:01:00] SRE, Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (MoritzMuehlenhoff) [07:01:02] * kart_ is here [07:01:23] taavi: sure. I'll deploy 2 patches and ping you. [07:01:33] sounds good, ty [07:01:38] koi: around? [07:01:42] (CR) KartikMistry: [C: +2] Enable Section Translation for Basque Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/786446 (https://phabricator.wikimedia.org/T304862) (owner: KartikMistry) [07:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:02:01] (PS3) KartikMistry: Enable Section Translation for Basque Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/786446 (https://phabricator.wikimedia.org/T304862) [07:04:47] ah. rebase. [07:06:02] (CR) Majavah: [C: -1] "see task" [mediawiki-config] - https://gerrit.wikimedia.org/r/786407 (https://phabricator.wikimedia.org/T306795) (owner: Stang) [07:06:59] (CR) Majavah: [C: -1] "please run `python3 logos/manage.py generate` to update logos.php with the new comment" [mediawiki-config] - https://gerrit.wikimedia.org/r/787114 (https://phabricator.wikimedia.org/T276694) (owner: Stang) [07:08:06] (PS4) Giuseppe Lavagetto: varnish: remove absented resource [puppet] - https://gerrit.wikimedia.org/r/778546 (https://phabricator.wikimedia.org/T305606) [07:08:31] Seems mwdebug1001 `scap pull` stuck at, `07:06:27 Started scap-cdb-rebuild` for me.. 
[07:08:39] (CR) jerkins-bot: [V: -1] varnish: remove absented resource [puppet] - https://gerrit.wikimedia.org/r/778546 (https://phabricator.wikimedia.org/T305606) (owner: Giuseppe Lavagetto) [07:08:43] urbanecm: if you're around, could you confirm that the growth team is fine with https://phabricator.wikimedia.org/T307005? [07:08:46] (PS5) Giuseppe Lavagetto: varnish: remove absented resource [puppet] - https://gerrit.wikimedia.org/r/778546 (https://phabricator.wikimedia.org/T305606) [07:09:03] taavi: checking [07:09:09] also good morning everyone [07:09:21] morning [07:09:25] morning! [07:09:31] (CR) Urbanecm: [C: +1] "SGTM from Growth perspective" [mediawiki-config] - https://gerrit.wikimedia.org/r/787111 (https://phabricator.wikimedia.org/T307005) (owner: Stang) [07:09:40] +1'ed taavi, feel free to go ahead. [07:10:11] OK. mwdebug is OK now. [07:10:20] thanks! [07:10:51] koi: {{ping}} [07:10:52] (CR) Giuseppe Lavagetto: [C: +2] varnish: remove absented resource [puppet] - https://gerrit.wikimedia.org/r/778546 (https://phabricator.wikimedia.org/T305606) (owner: Giuseppe Lavagetto) [07:11:54] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:786446|Enable Section Translation for Basque Wikipedia (T304862)]] (duration: 00m 50s) [07:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:02] T304862: Enable Content and Section Translation for Basque Wikipedia - https://phabricator.wikimedia.org/T304862 [07:12:26] (PS3) KartikMistry: Enable SectionTranslation in testwiki for Punjabi, Tsonga, Nepali, and Swahili [mediawiki-config] - https://gerrit.wikimedia.org/r/786947 (https://phabricator.wikimedia.org/T304828) [07:14:50] (CR) KartikMistry: [C: +2] Enable SectionTranslation in testwiki for Punjabi, Tsonga, Nepali, and Swahili [mediawiki-config] - https://gerrit.wikimedia.org/r/786947 (https://phabricator.wikimedia.org/T304828) (owner: 
10KartikMistry) [07:15:44] (03Merged) 10jenkins-bot: Enable SectionTranslation in testwiki for Punjabi, Tsonga, Nepali, and Swahili [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786947 (https://phabricator.wikimedia.org/T304828) (owner: 10KartikMistry) [07:19:40] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:786947|Enable SectionTranslation in testwiki for Punjabi, Tsonga, Nepali, and Swahili (T304828)]] (duration: 00m 50s) [07:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:47] T304828: Enable Section Translation in 13 wikis where Content Translation is already available as default - https://phabricator.wikimedia.org/T304828 [07:19:59] taavi: Sorry, took a bit long. I'm done. [07:20:00] PROBLEM - puppet last run on cp2031 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:20:04] PROBLEM - puppet last run on cp4024 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:20:04] PROBLEM - puppet last run on cp3052 is CRITICAL: CRITICAL: Puppet last ran 14 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:20:04] PROBLEM - puppet last run on cp3063 is CRITICAL: CRITICAL: Puppet last ran 14 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:20:06] no worries, thanks [07:20:12] PROBLEM - puppet last run on cp1079 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:20:12] PROBLEM - puppet last run on cp1076 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:20:12] PROBLEM - puppet last run on cp4034 is CRITICAL: CRITICAL: Puppet last ran 14 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:20:12] 
PROBLEM - puppet last run on cp6006 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:20:14] PROBLEM - puppet last run on cp5013 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:20:14] PROBLEM - puppet last run on cp1087 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:20:14] PROBLEM - puppet last run on cp1080 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:20:14] PROBLEM - puppet last run on cp1089 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:20:16] PROBLEM - puppet last run on cp2037 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:20:16] PROBLEM - puppet last run on cp4035 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:20:19] huh [07:20:22] _joe_: ^ these look like related to your requestctl work [07:20:30] PROBLEM - puppet last run on cp1075 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:20:30] <_joe_> it's me reenabling puppet yes [07:20:38] thanks, ignoring [07:20:38] <_joe_> it's the usual bug in puppet's reporting [07:20:42] PROBLEM - puppet last run on cp2038 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:20:49] <_joe_> the moment you reenable it it's suddenly old [07:20:50] PROBLEM - puppet last run on cp6004 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:20:51] <_joe_> sigh [07:21:11] I'm scared. 
Things break when I'm done with deployment :) [07:21:35] apergos: still waiting for koi to show up before continuing to those patches [07:21:45] PROBLEM - puppet last run on cp6003 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:21:48] yeah, I noticed the reviews you did on a couple too [07:21:49] <_joe_> forcing puppet runs so it clears faster [07:21:53] PROBLEM - puppet last run on cp2035 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:21:53] PROBLEM - puppet last run on cp1085 is CRITICAL: CRITICAL: Puppet last ran 14 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:21:56] PROBLEM - puppet last run on cp3059 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:22:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase db1132 weight T301879', diff saved to https://phabricator.wikimedia.org/P26860 and previous config saved to /var/cache/conftool/dbconfig/20220428-072200-marostegui.json [07:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:07] T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879 [07:22:15] PROBLEM - puppet last run on cp4021 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:22:27] PROBLEM - puppet last run on cp3056 is CRITICAL: CRITICAL: Puppet last ran 14 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:22:27] PROBLEM - puppet last run on cp3064 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:22:32] I need to leave by 07:40Z or so, you might need to take over if we're not done by that [07:22:45] (KubernetesRsyslogDown) firing: (4) 
rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:22:50] gotcha [07:22:55] PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:22:57] PROBLEM - puppet last run on cp5012 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:23:41] PROBLEM - puppet last run on cp4027 is CRITICAL: CRITICAL: Puppet last ran 15 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:24:59] RECOVERY - puppet last run on cp2031 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:25:03] RECOVERY - puppet last run on cp3063 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:25:11] RECOVERY - puppet last run on cp1079 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:25:11] RECOVERY - puppet last run on cp1076 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:25:13] RECOVERY - puppet last run on cp4034 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:25:13] RECOVERY - puppet last run on cp6006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:25:15] RECOVERY - puppet last run on cp1080 is OK: OK: Puppet is currently enabled, last run 11 
seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:25:15] RECOVERY - puppet last run on cp1087 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:25:15] RECOVERY - puppet last run on cp1089 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:25:17] RECOVERY - puppet last run on cp2037 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:25:31] RECOVERY - puppet last run on cp1075 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:25:41] RECOVERY - puppet last run on cp2038 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:25:51] RECOVERY - puppet last run on cp6004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:26:15] RECOVERY - puppet last run on cp6003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:26:55] RECOVERY - puppet last run on cp2035 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:26:57] RECOVERY - puppet last run on cp1085 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:27:01] RECOVERY - puppet last run on cp3059 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:27:21] RECOVERY - puppet last run on cp4021 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:27:33] RECOVERY - puppet last run on cp3056 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:27:33] RECOVERY - puppet last run on cp3064 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:28:07] RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:28:09] RECOVERY - puppet last run on cp5012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:28:55] RECOVERY - puppet last run on cp4027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:30:19] RECOVERY - puppet last run on cp4024 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:30:19] RECOVERY - puppet last run on cp3052 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:30:27] RECOVERY - puppet last run on cp5013 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:30:31] RECOVERY - puppet last run on cp4035 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:33:57] (03CR) 10Hashar: [C: 03+2] Gerrit v3.4.4 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/787000 (https://phabricator.wikimedia.org/T292759) (owner: 10Hashar) [07:34:16] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-cluster [07:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:22] (03Merged) 10jenkins-bot: Gerrit v3.4.4 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/787000 (https://phabricator.wikimedia.org/T292759) (owner: 10Hashar) [07:34:43] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp6008 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:34:43] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp1086 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:34:45] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp2030 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:34:45] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp2028 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:34:45] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp2032 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:34:59] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp2039 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:03] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp1090 
is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:07] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp2041 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:07] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp2040 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:07] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp1083 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:09] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp3052 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:11] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp5014 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:13] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp1084 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:13] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp1087 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:15] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp4036 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:15] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp4022 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring 
[07:35:15] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp3058 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:15] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp1089 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:15] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp4027 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:16] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp4035 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:16] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp6003 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:17] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp6011 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:17] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp6010 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:19] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp5015 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:23] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp2027 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:23] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp2033 is CRITICAL: File not found: 
/etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:23] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp2042 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:27] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp6013 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:29] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp4030 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:31] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp6001 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:33] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp4024 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:33] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp2034 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:33] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp6007 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:33] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp6005 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:36] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp6014 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:49] PROBLEM - Confd template 
for /etc/varnish/dynamic.actions.inc.vcl on cp5013 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:51] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp4026 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:51] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp3057 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:51] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp3050 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:35:59] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp2035 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:36:01] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp1082 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:36:05] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp6009 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:36:05] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp6002 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:36:09] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp5001 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:36:11] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp3051 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:36:23] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp3060 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:36:25] _joe_ I suppose you might want to look ^^ [07:36:27] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp3055 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:36:27] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp1079 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:36:29] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp5006 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:36:33] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp4034 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:36:33] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp4033 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:36:35] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp3065 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:36:39] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp5008 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:36:39] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp5012 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:38:11] <_joe_> 
doh [07:38:50] <_joe_> frankly, not sure why the alert is still there even [07:38:59] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp1075 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:39:01] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp2037 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:39:03] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp3056 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:39:03] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp3054 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:39:05] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp4025 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:39:06] <_joe_> sigh [07:40:07] <_joe_> ahhhh puppet still didn't run on the alerts host [07:40:09] <_joe_> sigh [07:40:25] <_joe_> why don't you have a decent orchestration, puppet [07:41:39] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp2038 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:41:41] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp3063 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:41:45] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp5004 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:41:58] <_joe_> ok solved [07:42:02] <_joe_> puppet is running on icinga 
[07:42:19] 👍 [07:42:49] <_joe_> once it manages to reload icinga the alerts should vanish [07:43:16] okeley dokeley [07:44:32] (03CR) 10Marostegui: [C: 03+1] ProductionServices: Promote pc1014 to primary of pc2. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786955 (https://phabricator.wikimedia.org/T306983) (owner: 10Kormat) [07:44:34] 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot pc1012 - https://phabricator.wikimedia.org/T306983 (10Marostegui) I would stop pc1014 to replicate from pc1012 before promoting it - I am not sure how MW copes with pcX masters and replication threads in terms of errors or expectations. The rest looks g... [07:50:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:53:31] koi, a merge at this point wouldn't complete in time for the end of the window, and we can't really run over today, as there is a gerrit upgrade scheduled right afterwards. you'll want to reschedule your patches, and make sure you can be here at the time... [07:55:58] gonna close the window I guess [07:56:24] !log UTC morning backport and config training window closed [07:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:04] hashar: Time to snap out of that daydream and deploy Gerrit. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220428T0800). [08:02:04] apergos: just in time perfect! 
[08:02:07] I am going to upgrade Gerrit [08:02:54] \o/ [08:03:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase-dev1004.eqiad.wmnet [08:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:37] !log hashar@deploy1002 Started deploy [gerrit/gerrit@031f315]: Gerrit to 3.4.4 on gerrit2001 # T292759 [08:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:44] T292759: Upgrade to Gerrit 3.4 - https://phabricator.wikimedia.org/T292759 [08:04:48] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@031f315]: Gerrit to 3.4.4 on gerrit2001 # T292759 (duration: 00m 11s) [08:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:16] (03CR) 10Volans: "I don't have context on the script itself, just left a comment for the Python aspect of it. Feel free to ignore it." [puppet] - 10https://gerrit.wikimedia.org/r/787108 (https://phabricator.wikimedia.org/T299648) (owner: 10Ahmon Dancy) [08:08:07] bringing up gerrit2001 [08:08:12] which is the replica [08:08:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase-dev1004.eqiad.wmnet [08:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:59] !log hashar@deploy1002 Started deploy [gerrit/gerrit@031f315]: Gerrit to 3.4.4 on gerrit1001 # T292759 [08:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:05] T292759: Upgrade to Gerrit 3.4 - https://phabricator.wikimedia.org/T292759 [08:10:08] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@031f315]: Gerrit to 3.4.4 on gerrit1001 # T292759 (duration: 00m 09s) [08:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:45] !log Stopping Gerrit for version upgrade # T292759 [08:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:08] hashar: I see you are busy. 
want to move our chat to tomorrow? [08:15:23] !log https://gerrit.wikimedia.org/ is now running 3.4.4 # T292759 [08:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:30] T292759: Upgrade to Gerrit 3.4 - https://phabricator.wikimedia.org/T292759 [08:15:38] duesen: ho was our meeting at 10:00 or 10:30 ? [08:15:51] 10:30 :) [08:15:56] I am done with the upgrade [08:16:32] 👏 [08:16:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase-dev1005.eqiad.wmnet [08:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:33] duesen: I will be there. The Gerrit upgrade is straightforward, assuming it got battle tested beforehand :] [08:18:08] looks like CI is happy [08:21:03] though it once again shows empty avatars for users bah [08:21:57] which I totally missed in my local testing [08:23:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase-dev1005.eqiad.wmnet [08:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:49] hashar: 🎉 congrats! 
[08:23:56] (on the deployment :)) [08:24:58] hashar: cool, i'll be there, then [08:28:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase-dev1006.eqiad.wmnet [08:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:47] !log upload scap 4.7.1 to {buster,stretch,bullseye}-wikimedia apt repos - T306998 [08:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:52] T306998: Deploy Scap version 4.7.1 - https://phabricator.wikimedia.org/T306998 [08:33:44] filed as https://phabricator.wikimedia.org/T307072 [08:36:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase-dev1006.eqiad.wmnet [08:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:04] (03PS2) 10Filippo Giunchedi: clinic-duty: add euNetworks support [software] - 10https://gerrit.wikimedia.org/r/786941 [08:42:39] PROBLEM - DPKG on ganeti-test2003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:50:29] PROBLEM - cassandra-c SSL 10.192.48.70:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [08:50:43] PROBLEM - cassandra-a SSL 10.192.48.68:7001 on restbase2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [08:50:46] 10SRE-Access-Requests: Requesting access to wmf ldap group for ejoseph - https://phabricator.wikimedia.org/T307074 (10dcausse) [08:51:03] ^ restbase2012 is me, ignore [08:51:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase2012.codfw.wmnet [08:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:46] !log jayme@deploy1002 helmfile 
[staging-eqiad] START helmfile.d/admin 'apply'. [08:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:48] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [08:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:07] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [08:57:07] 10SRE, 10conftool: requestctl v1 improvements - https://phabricator.wikimedia.org/T305580 (10Joe) [08:57:09] 10SRE, 10conftool, 10Patch-For-Review: Make the VCL that goes to production from requestctl safer/more explicit to apply - https://phabricator.wikimedia.org/T305606 (10Joe) 05Open→03Resolved [08:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:42] 10SRE, 10conftool: Annotate X-Analytics header with any matching actions - https://phabricator.wikimedia.org/T305582 (10Joe) a:03Joe [08:58:01] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [08:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:07] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [08:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:42] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [08:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:30] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host restbase2012.codfw.wmnet [09:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:29] 10SRE-Access-Requests: Requesting access to wmf ldap group for ejoseph - https://phabricator.wikimedia.org/T307074 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is done! Please check. Since the user is a wmf employee with production access already I've gone ahead and added to `wmf` ldap group. 
[09:11:03] 10SRE: phedenskog uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T307079 (10fgiunchedi) [09:13:11] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:19:09] (03CR) 10JMeybohm: [C: 04-1] docker: ensure apparmor package is installed if on bullseye (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/785226 (owner: 10Dzahn) [09:21:17] 10SRE-Access-Requests: Requesting access to wmf ldap group for ejoseph - https://phabricator.wikimedia.org/T307074 (10Aklapper) This isn't done per steps on https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#LDAP_access as the last bullet point was missed. [09:23:34] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase[2013-2015].codfw.wmnet with reason: reboot [09:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase[2013-2015].codfw.wmnet with reason: reboot [09:23:44] (03PS1) 10Vgutierrez: haproxy,ats-tls: Remove X-OpenStack-Request-ID response header [puppet] - 10https://gerrit.wikimedia.org/r/787432 [09:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:57] (03PS2) 10Vgutierrez: haproxy,ats-tls: Remove X-OpenStack-Request-ID response header [puppet] - 10https://gerrit.wikimedia.org/r/787432 [09:27:02] (03PS3) 10Vgutierrez: haproxy,ats-tls: Remove X-OpenStack-Request-ID response header [puppet] - 10https://gerrit.wikimedia.org/r/787432 [09:27:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase2013.codfw.wmnet [09:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:43] !log uploaded ganeti 3.0.1-2+deb11u0 to apt.wikimedia.org (backport of Py2->Py3 regression) T306499 [09:32:48] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:49] T306499: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 [09:34:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2013.codfw.wmnet [09:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:03] (03PS1) 10Jgiannelos: tegola: Use new container for maps tiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/787435 [09:39:16] (03CR) 10jerkins-bot: [V: 04-1] tegola: Use new container for maps tiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/787435 (owner: 10Jgiannelos) [09:40:00] (03CR) 10Jgiannelos: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/787435 (owner: 10Jgiannelos) [09:40:31] (03PS2) 10Jgiannelos: tegola: Use new container for maps tiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/787435 (https://phabricator.wikimedia.org/T306424) [09:40:50] (03CR) 10Filippo Giunchedi: haproxy,ats-tls: Remove X-OpenStack-Request-ID response header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/787432 (owner: 10Vgutierrez) [09:41:40] !log Change db1132 innodb_max_dirty_pages_pct from 90% to 75% T307082 [09:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:46] T307082: Investigate spikes on db1132 (mariadb 10.6 host) - https://phabricator.wikimedia.org/T307082 [09:44:38] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [09:45:55] (03CR) 10Filippo Giunchedi: "LGTM modulo missing dep" [debs/prometheus-es-exporter] (debian/sid) - 10https://gerrit.wikimedia.org/r/787046 (https://phabricator.wikimedia.org/T301017) (owner: 10Cwhite) [09:46:11] (03PS3) 10Jgiannelos: tegola: Use new swift container for maps tiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/787435 (https://phabricator.wikimedia.org/T306424) [09:51:29] !log Change pc2014 innodb_max_dirty_pages_pct from 90% to 75% T307082 [09:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:36] T307082: Investigate spikes on db1132 (mariadb 10.6 host) - https://phabricator.wikimedia.org/T307082 [09:52:31] (03PS1) 10Jbond: P:cumin::master: Add documentation and fix minor lint issue [puppet] - 10https://gerrit.wikimedia.org/r/787436 (https://phabricator.wikimedia.org/T306830) [09:52:53] (03PS2) 10Jbond: P:cumin::master: Add documentation and fix minor lint issue [puppet] - 10https://gerrit.wikimedia.org/r/787436 (https://phabricator.wikimedia.org/T306830) [09:53:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34972/console" [puppet] - 10https://gerrit.wikimedia.org/r/787436 (https://phabricator.wikimedia.org/T306830) (owner: 10Jbond) [09:54:59] (03CR) 10Muehlenhoff: [C: 03+2] Failover idp to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/786981 (owner: 10Muehlenhoff) [09:55:12] (03CR) 10Jbond: [V: 03+1] "PCC diff is just `ensure => present` vs `ensure => file`" [puppet] - 10https://gerrit.wikimedia.org/r/787436 (https://phabricator.wikimedia.org/T306830) (owner: 10Jbond) [09:55:30] !log failover idp.wikimedia.org to idp1001 for CAS update [09:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:59] (03PS4) 10Vgutierrez: haproxy,ats-tls: Remove 
X-OpenStack-Request-ID response header [puppet] - 10https://gerrit.wikimedia.org/r/787432 [09:56:04] 10SRE, 10SRE-Access-Requests: Requesting access to wmf ldap group for ejoseph - https://phabricator.wikimedia.org/T307074 (10fgiunchedi) >>! In T307074#7887034, @Aklapper wrote: > This isn't done per steps on https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#LDAP_access as the last bullet poi... [09:56:29] (03CR) 10Vgutierrez: haproxy,ats-tls: Remove X-OpenStack-Request-ID response header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/787432 (owner: 10Vgutierrez) [09:56:48] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/787435 (https://phabricator.wikimedia.org/T306424) (owner: 10Jgiannelos) [09:57:46] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/787432 (owner: 10Vgutierrez) [09:59:54] (03PS1) 10Giuseppe Lavagetto: requestctl: set an X-Requestctl header for matching rules [software/conftool] - 10https://gerrit.wikimedia.org/r/787437 (https://phabricator.wikimedia.org/T305582) [09:59:56] (03PS1) 10Giuseppe Lavagetto: requestctl: Allow detecting matching rules that are disabled [software/conftool] - 10https://gerrit.wikimedia.org/r/787438 (https://phabricator.wikimedia.org/T305582) [10:00:04] mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220428T1000). 
[10:00:54] (03CR) 10Volans: [C: 03+1] "LGTM, doc typo inline" [puppet] - 10https://gerrit.wikimedia.org/r/787436 (https://phabricator.wikimedia.org/T306830) (owner: 10Jbond) [10:03:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase2014.codfw.wmnet [10:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:17] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:10:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2014.codfw.wmnet [10:10:11] (03PS1) 10Jbond: P:cumin::master: Add contact owners aliases [puppet] - 10https://gerrit.wikimedia.org/r/787440 [10:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:18] (03PS2) 10Jbond: P:cumin::master: Add contact owners aliases [puppet] - 10https://gerrit.wikimedia.org/r/787440 [10:14:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34974/console" [puppet] - 10https://gerrit.wikimedia.org/r/787440 (owner: 10Jbond) [10:15:07] !log update scap to 4.7.1 on A:mw-canary or A:parsoid-canary or A:mw-jobrunner-canary - T306998 [10:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:13] T306998: Deploy Scap version 4.7.1 - https://phabricator.wikimedia.org/T306998 [10:16:03] (03PS3) 10Jbond: P:cumin::master: Add documentation and fix minor lint issue [puppet] - 10https://gerrit.wikimedia.org/r/787436 (https://phabricator.wikimedia.org/T306830) [10:16:14] (03CR) 10Jbond: P:cumin::master: Add documentation and fix minor lint issue (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/787436 (https://phabricator.wikimedia.org/T306830) (owner: 10Jbond) [10:18:14] (03PS3) 10Jbond: P:cumin::master: Add contact owners aliases [puppet] - 
10https://gerrit.wikimedia.org/r/787440 [10:18:49] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/787436 (https://phabricator.wikimedia.org/T306830) (owner: 10Jbond) [10:19:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34975/console" [puppet] - 10https://gerrit.wikimedia.org/r/787440 (owner: 10Jbond) [10:19:52] (03CR) 10Jbond: [V: 03+1] P:cumin::master: Add contact owners aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/787440 (owner: 10Jbond) [10:20:06] (03CR) 10Jbond: [C: 03+2] P:cumin::master: Add documentation and fix minor lint issue [puppet] - 10https://gerrit.wikimedia.org/r/787436 (https://phabricator.wikimedia.org/T306830) (owner: 10Jbond) [10:20:47] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [10:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:01] (03CR) 10Volans: "Nice! 
See few typos inline" [puppet] - 10https://gerrit.wikimedia.org/r/787440 (owner: 10Jbond) [10:23:12] (03PS1) 10Giuseppe Lavagetto: Rakefile: do not mix selection of an asset and its status [deployment-charts] - 10https://gerrit.wikimedia.org/r/787443 (https://phabricator.wikimedia.org/T307043) [10:23:20] <_joe_> jayme: ^^ [10:23:24] <_joe_> I expect CI to fail [10:23:32] !log update scap to 4.7.1 on restbase1016 (canary) - T306998 [10:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:38] T306998: Deploy Scap version 4.7.1 - https://phabricator.wikimedia.org/T306998 [10:24:03] !log elukey@deploy1002 Started deploy [restbase/deploy@0205f1d] (dev-cluster): (no justification provided) [10:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:22] !log elukey@deploy1002 Finished deploy [restbase/deploy@0205f1d] (dev-cluster): (no justification provided) (duration: 00m 18s) [10:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:02] _joe_: I assume the changes to regex strings are autoformatting? [10:25:17] <_joe_> sigh yes [10:25:25] (03CR) 10jerkins-bot: [V: 04-1] Rakefile: do not mix selection of an asset and its status [deployment-charts] - 10https://gerrit.wikimedia.org/r/787443 (https://phabricator.wikimedia.org/T307043) (owner: 10Giuseppe Lavagetto) [10:25:29] okay. Got a bit confused at first :) [10:25:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me (sans typos)" [puppet] - 10https://gerrit.wikimedia.org/r/787440 (owner: 10Jbond) [10:26:17] (03PS4) 10Jbond: P:cumin::master: Add contact owners aliases [puppet] - 10https://gerrit.wikimedia.org/r/787440 [10:26:40] _joe_: File.exist? as well? 
[10:26:52] <_joe_> yes [10:27:55] (03CR) 10Jbond: P:cumin::master: Add contact owners aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/787440 (owner: 10Jbond) [10:28:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase2015.codfw.wmnet [10:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:22] (03CR) 10Jbond: cumin: add "owner" aliases to get lists of host per SRE subteam (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/786430 (https://phabricator.wikimedia.org/T306830) (owner: 10Dzahn) [10:29:05] (03PS5) 10Jbond: P:cumin::master: Add contact owners aliases [puppet] - 10https://gerrit.wikimedia.org/r/787440 [10:35:19] (03PS5) 10Vgutierrez: haproxy,ats-tls: Remove X-OpenStack-Request-ID response header [puppet] - 10https://gerrit.wikimedia.org/r/787432 (https://phabricator.wikimedia.org/T279637) [10:36:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2015.codfw.wmnet [10:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:51] (03CR) 10Mvolz: [C: 03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/778299 (owner: 10PipelineBot) [10:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:40:58] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/778299 (owner: 10PipelineBot) [10:43:49] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase[2016-2018].codfw.wmnet with reason: reboot [10:43:53] 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot pc1012 - https://phabricator.wikimedia.org/T306983 (10Kormat) [10:43:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase[2016-2018].codfw.wmnet with reason: reboot [10:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:00] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [10:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:24] (03CR) 10Kormat: [C: 03+2] ProductionServices: Promote pc1014 to primary of pc2. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786955 (https://phabricator.wikimedia.org/T306983) (owner: 10Kormat) [10:44:35] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [10:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:03] (03Merged) 10jenkins-bot: ProductionServices: Promote pc1014 to primary of pc2. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786955 (https://phabricator.wikimedia.org/T306983) (owner: 10Kormat) [10:45:25] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [10:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:33] 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot pc1012 - https://phabricator.wikimedia.org/T306983 (10Kormat) [10:45:50] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on pc2012.codfw.wmnet,pc[1012,1014].eqiad.wmnet with reason: Rebooting pc1012 T306983 [10:45:53] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc[1012,1014].eqiad.wmnet with reason: Rebooting pc1012 T306983 [10:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:56] T306983: Reboot pc1012 - https://phabricator.wikimedia.org/T306983 [10:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:08] !log 
kormat@deploy1002 Synchronized wmf-config/ProductionServices.php: Set pc1014 as pc2 primary T306983 (duration: 00m 52s) [10:46:10] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [10:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:28] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [10:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:47] 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot pc1012 - https://phabricator.wikimedia.org/T306983 (10Kormat) [10:46:58] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on pc1012.eqiad.wmnet with reason: Rebooting for T303174 [10:46:58] 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot pc1012 - https://phabricator.wikimedia.org/T306983 (10Kormat) p:05Triage→03Medium [10:46:59] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on pc1012.eqiad.wmnet with reason: Rebooting for T303174 [10:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:12] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [10:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:34] (03PS1) 10Kormat: Revert "ProductionServices: Promote pc1014 to primary of pc2." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/787447 [10:48:46] jouncebot: nowandnext [10:48:46] For the next 0 hour(s) and 11 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220428T1000) [10:48:46] In 2 hour(s) and 11 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220428T1300) [10:52:55] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [10:52:56] (03PS1) 10Jbond: hieradata: remove ' SREs' from contacts [puppet] - 10https://gerrit.wikimedia.org/r/787444 [10:53:11] (03CR) 10Kormat: [C: 03+2] Revert "ProductionServices: Promote pc1014 to primary of pc2." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787447 (owner: 10Kormat) [10:53:58] (03Merged) 10jenkins-bot: Revert "ProductionServices: Promote pc1014 to primary of pc2." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/787447 (owner: 10Kormat) [10:54:06] (03CR) 10Jbond: P:cumin::master: Add contact owners aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/787440 (owner: 10Jbond) [10:55:14] !log kormat@deploy1002 Synchronized wmf-config/ProductionServices.php: Set pc1012 as pc2 primary T306983 (duration: 00m 57s) [10:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:22] T306983: Reboot pc1012 - https://phabricator.wikimedia.org/T306983 [10:55:53] (03PS2) 10Giuseppe Lavagetto: Rakefile: do not mix selection of an asset and its status [deployment-charts] - 10https://gerrit.wikimedia.org/r/787443 (https://phabricator.wikimedia.org/T307043) [10:55:55] (03PS2) 10Giuseppe Lavagetto: Revert "Update path to values file with image names" [deployment-charts] - 10https://gerrit.wikimedia.org/r/786424 (owner: 10Ahmon Dancy) [10:56:25] (03PS6) 10Jbond: P:cumin::master: Add contact owners aliases [puppet] - 10https://gerrit.wikimedia.org/r/787440 [10:56:29] (03CR) 10Jbond: P:cumin::master: Add contact owners aliases (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/787440 (owner: 10Jbond) [10:56:40] !log failover Ganeti master in ganeti-test to ganeti-test1001 (bullseye node) T306499 [10:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:46] T306499: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 [10:56:51] !log failover Ganeti master in ganeti-test to ganeti-test2001 (bullseye node) T306499 [10:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:18] 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot pc1012 - https://phabricator.wikimedia.org/T306983 (10Kormat) [10:57:19] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [10:57:44] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/787440 (owner: 10Jbond) [11:01:11] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-cluster [11:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:27] PROBLEM - ganeti-wconfd running on ganeti-test2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:01:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase2016.codfw.wmnet [11:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:04:04] 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot pc1012 - https://phabricator.wikimedia.org/T306983 (10Kormat) 05Open→03Resolved Success. [11:08:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2016.codfw.wmnet [11:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:25] 10SRE, 10Infrastructure-Foundations: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722 (10jcrespo) This also affects backup hosts- at the very least backup2002- most likely others backup[12]00[123]. 
[11:12:01] 10SRE, 10Infrastructure-Foundations: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722 (10jcrespo) [11:21:39] (03PS3) 10Giuseppe Lavagetto: Rakefile: do not mix selection of an asset and its status [deployment-charts] - 10https://gerrit.wikimedia.org/r/787443 (https://phabricator.wikimedia.org/T307043) [11:21:41] (03PS3) 10Giuseppe Lavagetto: Revert "Update path to values file with image names" [deployment-charts] - 10https://gerrit.wikimedia.org/r/786424 (owner: 10Ahmon Dancy) [11:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:22:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase2017.codfw.wmnet [11:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:11] (03CR) 10jerkins-bot: [V: 04-1] Rakefile: do not mix selection of an asset and its status [deployment-charts] - 10https://gerrit.wikimedia.org/r/787443 (https://phabricator.wikimedia.org/T307043) (owner: 10Giuseppe Lavagetto) [11:30:55] PROBLEM - Host mw1323 is DOWN: PING CRITICAL - Packet loss = 100% [11:31:26] ^ thats me restarting the host, is stuck in boot [11:33:38] !log powercycling restbase2017 [11:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:44] !log applying NIC firmware update onto backup2002 T286722 [11:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:51] T286722: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722 [11:39:27] (03PS1) 10Jbond: P:debmonitor::server: drop qery_nodes functions [puppet] - 10https://gerrit.wikimedia.org/r/787483 [11:39:29] (03PS1) 10Jbond: P:openstack::nova::compute::service: replace query_nodes with role_hosts [puppet] - 
10https://gerrit.wikimedia.org/r/787484 [11:39:31] (03PS1) 10Jbond: P:wmcs::nfs::maintain_dbusers: replace query_nodes with puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/787485 [11:39:35] (03CR) 10Jbond: [C: 03+2] P:scap::dsh: Add scap targets as a dsh group [puppet] - 10https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [11:40:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2017.codfw.wmnet [11:40:03] !log jelto@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=1) [11:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:24] (03CR) 10jerkins-bot: [V: 04-1] P:openstack::nova::compute::service: replace query_nodes with role_hosts [puppet] - 10https://gerrit.wikimedia.org/r/787484 (owner: 10Jbond) [11:41:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2005.codfw.wmnet [11:41:18] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:19] 10SRE, 10Infrastructure-Foundations: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722 (10jcrespo) For logging, the exact firmware I am using for backup2002 is: NetXtreme-E Network Device Firmware 22.0 Version: 22.00.07.60, 22.00.07.60 File: Network_Firmware_NPNT5_WN6... 
[11:46:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:51] (03PS2) 10Jbond: P:debmonitor::server: drop qery_nodes functions [puppet] - 10https://gerrit.wikimedia.org/r/787483 [11:48:38] (03PS2) 10Jbond: P:openstack::nova::compute::service: replace query_nodes with role_hosts [puppet] - 10https://gerrit.wikimedia.org/r/787484 [11:48:48] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34976/console" [puppet] - 10https://gerrit.wikimedia.org/r/787483 (owner: 10Jbond) [11:49:02] !log pool name=mw[1324-1326].eqiad.wmnet , manual puppet run and icinga green after reboot (cookbook failed because of mw1323 stuck in boot) [11:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:24] (03CR) 10jerkins-bot: [V: 04-1] P:openstack::nova::compute::service: replace query_nodes with role_hosts [puppet] - 10https://gerrit.wikimedia.org/r/787484 (owner: 10Jbond) [11:50:07] (03PS3) 10Jbond: P:openstack::nova::compute::service: replace query_nodes with role_hosts [puppet] - 10https://gerrit.wikimedia.org/r/787484 [11:50:14] (03PS2) 10Jbond: P:wmcs::nfs::maintain_dbusers: replace query_nodes with puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/787485 [11:50:18] !log jelto@cumin1001 conftool action : set/pooled=yes; selector: name=mw132[4-6].eqiad.wmnet [11:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:50:55] (03PS3) 10Jbond: P:wmcs::nfs::maintain_dbusers: replace query_nodes with puppetdb_query [puppet] - 
10https://gerrit.wikimedia.org/r/787485 [11:52:06] (03CR) 10jerkins-bot: [V: 04-1] P:openstack::nova::compute::service: replace query_nodes with role_hosts [puppet] - 10https://gerrit.wikimedia.org/r/787484 (owner: 10Jbond) [11:52:54] (03PS4) 10Jbond: P:openstack::nova::compute::service: replace query_nodes with role_hosts [puppet] - 10https://gerrit.wikimedia.org/r/787484 [11:52:57] (03CR) 10jerkins-bot: [V: 04-1] P:wmcs::nfs::maintain_dbusers: replace query_nodes with puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/787485 (owner: 10Jbond) [11:54:31] (03PS4) 10Jbond: P:wmcs::nfs::maintain_dbusers: replace query_nodes with puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/787485 [11:54:50] (03CR) 10jerkins-bot: [V: 04-1] P:openstack::nova::compute::service: replace query_nodes with role_hosts [puppet] - 10https://gerrit.wikimedia.org/r/787484 (owner: 10Jbond) [11:56:13] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-cluster [11:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:25] (03CR) 10jerkins-bot: [V: 04-1] P:wmcs::nfs::maintain_dbusers: replace query_nodes with puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/787485 (owner: 10Jbond) [11:57:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase2018.codfw.wmnet [11:57:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2005.codfw.wmnet [11:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:33] 10SRE, 10Infrastructure-Foundations: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722 (10jcrespo) @Papaul @MoritzMuehlenhoff I am a bit lost now, as apparently the NIC firmware upgrade didn't fix my issue, as it did for Filippo. Should we try upgrading more firmwares... 
[12:00:47] (03PS5) 10Jbond: P:openstack::nova::compute::service: replace query_nodes with role_hosts [puppet] - 10https://gerrit.wikimedia.org/r/787484 [12:03:53] (03PS3) 10Stang: zhwiki: Add comment to corresponding task of logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787114 (https://phabricator.wikimedia.org/T276694) [12:04:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2018.codfw.wmnet [12:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:06] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2019.codfw.wmnet with reason: reboot [12:05:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2019.codfw.wmnet with reason: reboot [12:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:58] (03CR) 10Stang: Add kanoner.com to the wgCopyUploadsDomains allowlist of Wikimedia Commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786407 (https://phabricator.wikimedia.org/T306795) (owner: 10Stang) [12:06:58] 10SRE: phedenskog uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T307079 (10fgiunchedi) p:05Triage→03Medium [12:07:05] (03PS1) 10Muehlenhoff: Add testvm2005 to DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/787490 [12:07:12] 10SRE, 10Dumps-Generation, 10Infrastructure-Foundations (FY2021/2022-Q4), 10SRE Observability (FY2021/2022-Q4), 10Security: Remaining data engineering host security restarts - https://phabricator.wikimedia.org/T307055 (10fgiunchedi) p:05Triage→03Medium [12:07:23] 10SRE, 10serviceops: Provide node14 images for running production node-based services - https://phabricator.wikimedia.org/T306996 (10fgiunchedi) p:05Triage→03Medium [12:07:37] 10SRE, 10serviceops: Migrate node-based services in 
production to node14 - https://phabricator.wikimedia.org/T306995 (10fgiunchedi) p:05Triage→03Medium [12:08:41] (03PS6) 10Krinkle: static.php: Fold "unknown" handling into "nohash" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777900 (https://phabricator.wikimedia.org/T302465) [12:08:45] (03CR) 10Krinkle: [C: 03+2] static.php: Fold "unknown" handling into "nohash" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777900 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [12:09:24] (03Merged) 10jenkins-bot: static.php: Fold "unknown" handling into "nohash" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/777900 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [12:14:29] (03CR) 10Hnowlan: [C: 03+2] Remove Thumbor Community Core as Wikimedia Thumbor dependency [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/785127 (https://phabricator.wikimedia.org/T305053) (owner: 10Roman Stolar) [12:15:05] (03PS3) 10Filippo Giunchedi: clinic-duty: add euNetworks support [software] - 10https://gerrit.wikimedia.org/r/786941 [12:17:17] (03Merged) 10jenkins-bot: Remove Thumbor Community Core as Wikimedia Thumbor dependency [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/785127 (https://phabricator.wikimedia.org/T305053) (owner: 10Roman Stolar) [12:20:32] (03CR) 10Muehlenhoff: [C: 03+2] Add testvm2005 to DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/787490 (owner: 10Muehlenhoff) [12:21:51] (03CR) 10Filippo Giunchedi: [C: 03+2] clinic-duty: add euNetworks support [software] - 10https://gerrit.wikimedia.org/r/786941 (owner: 10Filippo Giunchedi) [12:23:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34983/console" [puppet] - 10https://gerrit.wikimedia.org/r/787484 (owner: 10Jbond) [12:24:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase2019.codfw.wmnet [12:24:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34984/console" [puppet] - 10https://gerrit.wikimedia.org/r/787484 (owner: 10Jbond) [12:26:34] (03PS6) 10Jbond: P:openstack::nova::compute::service: replace query_nodes with role_hosts [puppet] - 10https://gerrit.wikimedia.org/r/787484 [12:28:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34985/console" [puppet] - 10https://gerrit.wikimedia.org/r/787484 (owner: 10Jbond) [12:28:20] !log jelto@cumin1001 conftool action : set/pooled=inactive; selector: name=mw1323.eqiad.wmnet [12:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:35] (03CR) 10Jbond: [V: 03+1] "PCC diff just relates to sorting the list earlier no real diff" [puppet] - 10https://gerrit.wikimedia.org/r/787484 (owner: 10Jbond) [12:29:47] (03PS5) 10Jbond: P:wmcs::nfs::maintain_dbusers: replace query_nodes with puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/787485 [12:31:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2019.codfw.wmnet [12:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:39] (03CR) 10jerkins-bot: [V: 04-1] P:wmcs::nfs::maintain_dbusers: replace query_nodes with puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/787485 (owner: 10Jbond) [12:32:46] !log krinkle@deploy1002 Synchronized w/static.php: I0bdf0b4038d8639858f6cb2ee90747d8c40e6294 (duration: 01m 56s) [12:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:39] (03PS6) 10Jbond: P:wmcs::nfs::maintain_dbusers: replace query_nodes with puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/787485 [12:36:50] (03PS7) 10Jbond: P:wmcs::nfs::maintain_dbusers: replace query_nodes with 
puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/787485 [12:37:30] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase[2020-2022].codfw.wmnet with reason: reboot [12:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase[2020-2022].codfw.wmnet with reason: reboot [12:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:39] !log installing testvm2005 T306499 [12:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:47] T306499: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 [12:41:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase2020.codfw.wmnet [12:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:58] jouncebot: next [12:42:58] In 0 hour(s) and 17 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220428T1300) [12:44:17] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] "CI was expected to fail on this change, force-merging it." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/787443 (https://phabricator.wikimedia.org/T307043) (owner: 10Giuseppe Lavagetto) [12:45:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "Update path to values file with image names" [deployment-charts] - 10https://gerrit.wikimedia.org/r/786424 (owner: 10Ahmon Dancy) [12:49:25] (03Merged) 10jenkins-bot: Revert "Update path to values file with image names" [deployment-charts] - 10https://gerrit.wikimedia.org/r/786424 (owner: 10Ahmon Dancy) [12:49:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2020.codfw.wmnet [12:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:49] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:53:52] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:01] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:54:05] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:18] 10SRE, 10Dumps-Generation, 10Infrastructure-Foundations (FY2021/2022-Q4), 10SRE Observability (FY2021/2022-Q4), 10Security: Remaining data engineering host security restarts - https://phabricator.wikimedia.org/T307055 (10Ottomata) Sounds good! 
[12:54:55] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:51] (03CR) 10Volans: [C: 03+1] "LGTM, couple of nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/787444 (owner: 10Jbond) [12:56:27] (03Abandoned) 10Ottomata: Add role::analytics_cluster::database::meta on an-db100[12] [puppet] - 10https://gerrit.wikimedia.org/r/736019 (https://phabricator.wikimedia.org/T284150) (owner: 10Ottomata) [12:56:29] (03CR) 10David Caro: [V: 03+1 C: 03+2] wmcs.codfw1: use the correct memcached port for the exporter [puppet] - 10https://gerrit.wikimedia.org/r/786330 (owner: 10David Caro) [12:57:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase2021.codfw.wmnet [12:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:14] (03PS4) 10Stang: zhwiki: Add comment to corresponding task of logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787114 (https://phabricator.wikimedia.org/T276694) [12:58:48] (03CR) 10Volans: [C: 03+1] "LGTM, PCC output looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/787440 (owner: 10Jbond) [13:00:04] RoanKattouw, Lucas_WMDE, and Urbanecm: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220428T1300). Please do the needful. [13:00:04] StevenSun and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:18] hi [13:00:53] apologize for the absence in the last window [13:01:24] (03PS1) 10Giuseppe Lavagetto: mwdebug: reference the correct deployment file [deployment-charts] - 10https://gerrit.wikimedia.org/r/787495 [13:02:31] (03PS2) 10Giuseppe Lavagetto: mwdebug: reference the correct deployment file [deployment-charts] - 10https://gerrit.wikimedia.org/r/787495 [13:02:39] (03PS8) 10Jbond: P:wmcs::nfs::maintain_dbusers: replace query_nodes with puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/787485 [13:03:28] 10SRE, 10DBA, 10Security: Reboot pc1013 - https://phabricator.wikimedia.org/T307101 (10Kormat) [13:03:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34989/console" [puppet] - 10https://gerrit.wikimedia.org/r/787485 (owner: 10Jbond) [13:04:07] (03PS1) 10Kormat: pc1014: Move to pc3. [puppet] - 10https://gerrit.wikimedia.org/r/787496 (https://phabricator.wikimedia.org/T307101) [13:04:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2021.codfw.wmnet [13:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:49] (03CR) 10Kormat: [C: 03+2] pc1014: Move to pc3. 
[puppet] - 10https://gerrit.wikimedia.org/r/787496 (https://phabricator.wikimedia.org/T307101) (owner: 10Kormat) [13:06:12] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/787483 (owner: 10Jbond) [13:07:08] (03PS9) 10Jbond: P:wmcs::nfs::maintain_dbusers: replace query_nodes with puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/787485 [13:07:21] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: reference the correct deployment file [deployment-charts] - 10https://gerrit.wikimedia.org/r/787495 (owner: 10Giuseppe Lavagetto) [13:09:25] 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot pc1013 - https://phabricator.wikimedia.org/T307101 (10Kormat) [13:09:34] 10SRE, 10Infrastructure-Foundations: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722 (10Papaul) @jcrespo we have seen some issues on version 22 on those cards with some of the cloud nodes. Maybe tried to downgrade the firmware to version 21.80.9 and let me know. 
[13:09:45] 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot pc1013 - https://phabricator.wikimedia.org/T307101 (10Kormat) p:05Triage→03Medium [13:10:44] (03PS1) 10Kormat: ProductionServices: Promote pc1014 to primary of pc3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787497 (https://phabricator.wikimedia.org/T307101) [13:11:05] 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot pc1013 - https://phabricator.wikimedia.org/T307101 (10Kormat) [13:11:28] 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot pc1013 - https://phabricator.wikimedia.org/T307101 (10Kormat) [13:11:55] (03Merged) 10jenkins-bot: mwdebug: reference the correct deployment file [deployment-charts] - 10https://gerrit.wikimedia.org/r/787495 (owner: 10Giuseppe Lavagetto) [13:14:03] jouncebot: next [13:14:03] In 2 hour(s) and 45 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220428T1600) [13:15:45] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:15:49] !log upgrade scap to 4.7.1 to all nodes (except ores[12]* since they need to be upgraded to buster first) [13:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:53] I can deploy in 15 minutes or so, if nobody else is around [13:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:33] 10ops-eqiad, 10DC-Ops: mw1323 stuck after reboot - https://phabricator.wikimedia.org/T307103 (10Jelto) [13:20:38] ok, I can deploy now [13:21:07] <_joe_> Lucas_WMDE: can you hold on for like 5 minutes? [13:21:14] ok sure [13:21:22] <_joe_> we want to test something on the sre side with your deployments [13:21:36] <_joe_> I'll give you the green light asap [13:21:42] I’ll look at the config changes in the meantime [13:21:45] <_joe_> jayme: let's merge your change? [13:21:48] hadn’t started reviewing them yet anyways [13:22:12] StevenSun: are you there? 
[13:22:27] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:38] Yes [13:22:41] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:04] (03PS1) 10QChris: Allow “Gerrit Managers” to import history [debs/python-opensearch] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/787500 [13:23:08] (03CR) 10QChris: [V: 03+2 C: 03+2] Allow “Gerrit Managers” to import history [debs/python-opensearch] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/787500 (owner: 10QChris) [13:23:16] _joe_: deploy_to_mwdebug user you mean? [13:23:17] ok [13:23:20] (03PS1) 10QChris: Import done. Revoke import grants [debs/python-opensearch] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/787501 [13:23:22] (03CR) 10QChris: [V: 03+2 C: 03+2] Import done. Revoke import grants [debs/python-opensearch] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/787501 (owner: 10QChris) [13:23:24] <_joe_> jayme: yes [13:23:26] sure [13:23:29] (03CR) 10JMeybohm: [C: 03+2] Revert "Revert "Revert "mwdebug_deploy: switch back to using the root user""" [puppet] - 10https://gerrit.wikimedia.org/r/786418 (owner: 10JMeybohm) [13:24:20] _joe_: you've puppet disabled on deploy1002 [13:24:26] <_joe_> jayme: I'll force-run puppet on deploy1002 as soon as I'm done with the deployment yes [13:24:33] ok [13:24:33] (03PS1) 10Ssingh: dnsdist: use absolute path for systemctl [puppet] - 10https://gerrit.wikimedia.org/r/787502 [13:24:38] patch is merged [13:24:43] <_joe_> ty! 
[13:24:51] 10SRE, 10Infrastructure-Foundations: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722 (10jcrespo) Thank you, @Papaul, that's exactly what I hoped to get- some insight from people that may had more experience with similar issues, to try something that could fix the is... [13:24:53] <_joe_> let's see when lucas deploys mediawiki how that goes [13:24:56] (03PS10) 10Jbond: P:wmcs::nfs::maintain_dbusers: replace query_nodes with puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/787485 [13:25:20] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34993/console" [puppet] - 10https://gerrit.wikimedia.org/r/787502 (owner: 10Ssingh) [13:26:07] StevenSun: is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/787114 also scheduled for deployment somewhere? [13:26:14] or maybe it should be combined with the other change, if they’re related? [13:26:38] (03CR) 10Ssingh: [V: 03+1 C: 03+2] dnsdist: use absolute path for systemctl [puppet] - 10https://gerrit.wikimedia.org/r/787502 (owner: 10Ssingh) [13:27:32] <_joe_> Lucas_WMDE: another couple minutes and we should be gtg [13:27:35] <_joe_> sorry for the delay [13:27:36] Lucas_WMDE: yes, it is scheduled [13:28:05] (03Abandoned) 10JMeybohm: Fix use of .Release.Name [deployment-charts] - 10https://gerrit.wikimedia.org/r/787058 (https://phabricator.wikimedia.org/T299648) (owner: 10Ahmon Dancy) [13:28:09] ah, it’s under your section, I didn’t think to look there ^^ [13:28:18] any thoughts on combining the two changes into one? [13:28:59] well it could be if StevenSun could do so [13:29:05] cherry-pick I guess? [13:29:14] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:22] <_joe_> Lucas_WMDE: green light! 
[13:29:26] I think technically you should also be able to push a new version of StevenSun’s patch [13:29:28] _joe_: ack [13:29:36] let’s just deploy the two changes separately [13:29:43] (but do those two first and then the other koi changes) [13:29:48] no, it could not amend it :( [13:29:51] *I [13:29:53] huh, ok [13:29:59] then I guess there’s some permission I’m not aware of [13:30:00] <_joe_> I mean you can also merge both and deploy them toghether :) [13:30:07] I guess so ^^ [13:30:10] it’ll be different syncs either way [13:30:17] (03PS3) 10Lucas Werkmeister (WMDE): Revert Simplified Chinese logo of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775320 (https://phabricator.wikimedia.org/T276694) (owner: 10Steven Sun) [13:30:24] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Revert Simplified Chinese logo of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775320 (https://phabricator.wikimedia.org/T276694) (owner: 10Steven Sun) [13:30:58] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "(Will be deployed together with I3506416b94, to add a reference to the task in logos.php.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775320 (https://phabricator.wikimedia.org/T276694) (owner: 10Steven Sun) [13:31:05] (03Merged) 10jenkins-bot: Revert Simplified Chinese logo of zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775320 (https://phabricator.wikimedia.org/T276694) (owner: 10Steven Sun) [13:31:35] (03PS11) 10Jbond: P:wmcs::nfs::maintain_dbusers: replace query_nodes with puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/787485 [13:31:56] !log drain ganeti-test2002 T306499 [13:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:02] T306499: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 [13:32:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase2022.codfw.wmnet [13:32:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:18] (03CR) 10Marostegui: [C: 03+1] ProductionServices: Promote pc1014 to primary of pc3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787497 (https://phabricator.wikimedia.org/T307101) (owner: 10Kormat) [13:32:39] syncing the 1x file [13:32:45] 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot pc1013 - https://phabricator.wikimedia.org/T307101 (10Marostegui) [13:32:54] (03PS5) 10Lucas Werkmeister (WMDE): zhwiki: Add comment to corresponding task of logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787114 (https://phabricator.wikimedia.org/T276694) (owner: 10Stang) [13:32:56] 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot pc1013 - https://phabricator.wikimedia.org/T307101 (10Marostegui) [13:33:11] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "(Logo itself updated in I932b9ce0fa.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787114 (https://phabricator.wikimedia.org/T276694) (owner: 10Stang) [13:33:16] 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot pc1013 - https://phabricator.wikimedia.org/T307101 (10Marostegui) [13:33:38] scap waiting for one in-flight sync-proxy [13:33:41] ah, no, now it’s done [13:33:56] 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot pc1013 - https://phabricator.wikimedia.org/T307101 (10Marostegui) Made some fixes - it looks good [13:33:58] !log lucaswerkmeister-wmde@deploy1002 Synchronized static/images/project-logos/zhwiki-hans.png: Config: [[gerrit:775320|Revert Simplified Chinese logo of zhwiki (T276694)]] (1/3) (duration: 01m 25s) [13:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:05] T276694: Simplified Chinese logo of zhwiki was overrided by an old version - https://phabricator.wikimedia.org/T276694 [13:34:15] (03Merged) 10jenkins-bot: zhwiki: Add comment to corresponding task of logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787114 (https://phabricator.wikimedia.org/T276694) 
(owner: 10Stang) [13:35:05] !log lucaswerkmeister-wmde@deploy1002 Synchronized static/images/project-logos/zhwiki-hans-1.5x.png: Config: [[gerrit:775320|Revert Simplified Chinese logo of zhwiki (T276694)]] (2/3) (duration: 00m 49s) [13:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:14] !log lucaswerkmeister-wmde@deploy1002 Synchronized static/images/project-logos/zhwiki-hans-2x.png: Config: [[gerrit:775320|Revert Simplified Chinese logo of zhwiki (T276694)]] (3/3) (duration: 00m 56s) [13:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:44] 10SRE, 10conftool, 10Patch-For-Review: Annotate X-Analytics header with any matching actions - https://phabricator.wikimedia.org/T305582 (10Joe) The idea with the above patches is that the vcl fragment from requestctl will be able to add an HTTP header `X-Requestctl` with a comma-separated list of rules that... [13:37:32] !log lucaswerkmeister-wmde@mwmaint1002:~$ printf 'https://en.wikipedia.org/static/images/project-logos/zhwiki-hans%s.png\n' '' '-1.5x' '-2x' | mwscript purgeList.php # T276694 [13:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:45] StevenSun, koi: the logo should be updated now, can you confirm? 
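A note on the purge command logged at 13:37:32: printf reuses its format string once for every remaining argument, so one call emits one URL per logo variant (plain, 1.5x, 2x). The sketch below reproduces only that expansion. The function name logo_urls is hypothetical, and the pipe into mwscript purgeList.php is left out because it is only available on the maintenance host.

```shell
#!/bin/sh
# Hypothetical helper (not part of the logged session): prints the three
# logo URLs that the logged command pipes into purgeList.php.
# printf repeats the format string for each argument: '', '-1.5x', '-2x'.
logo_urls() {
    printf 'https://en.wikipedia.org/static/images/project-logos/zhwiki-hans%s.png\n' '' '-1.5x' '-2x'
}

logo_urls
```

The empty first argument matters: it yields the base file name with no suffix, which is why the list starts with zhwiki-hans.png.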
[13:38:20] (03CR) 10Filippo Giunchedi: [C: 03+2] clinic-duty: Minor DOM handling clean up [software] - 10https://gerrit.wikimedia.org/r/717653 (owner: 10Krinkle) [13:38:28] confirmed, lgtm [13:38:29] (03CR) 10jerkins-bot: [V: 04-1] clinic-duty: Minor DOM handling clean up [software] - 10https://gerrit.wikimedia.org/r/717653 (owner: 10Krinkle) [13:38:34] great, thanks [13:38:56] ok, let’s continue with kanoner.com [13:39:03] which has a nice license declaration in the footer now, yay [13:39:09] (03PS4) 10Lucas Werkmeister (WMDE): Add kanoner.com to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786407 (https://phabricator.wikimedia.org/T306795) (owner: 10Stang) [13:39:16] (03CR) 10Kormat: [C: 03+2] ProductionServices: Promote pc1014 to primary of pc3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787497 (https://phabricator.wikimedia.org/T307101) (owner: 10Kormat) [13:39:24] Confirmed. Thanks! [13:40:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2022.codfw.wmnet [13:40:05] (03Merged) 10jenkins-bot: ProductionServices: Promote pc1014 to primary of pc3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787497 (https://phabricator.wikimedia.org/T307101) (owner: 10Kormat) [13:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:09] oh wait, I still need to sync the logos.php changes [13:40:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:40:13] Lucas_WMDE: er, crap. i probably shouldn't have merged that just now, should i? [13:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:14] if if they should be no-ops [13:40:26] kormat: uh, not sure [13:40:30] is that something that needs to be synced? 
[13:40:33] (03CR) 10Ottomata: docker: ensure apparmor package is installed if on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/785226 (owner: 10Dzahn) [13:40:45] <_joe_> Lucas_WMDE: yes [13:40:47] Lucas_WMDE: yeah. i wasn't paying attention to the window, and was just about to do it [13:40:59] you can sync it now if you want [13:41:04] <_joe_> it's a separate file though [13:41:05] ok great, thanks. and sorry! [13:41:06] and then I’ll do the logos afterwards [13:41:23] (I didn’t fetch the logos change yet so you should see two new commits) [13:42:06] !log kormat@deploy1002 Synchronized wmf-config/ProductionServices.php: Set pc1014 as pc3 primary T307101 (duration: 00m 52s) [13:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:12] T307101: Reboot pc1013 - https://phabricator.wikimedia.org/T307101 [13:42:35] Lucas_WMDE: done, i'm out of your way now. (i'll have a revert and another sync to do in a bit, but there's no urgent rush) [13:42:40] ok [13:42:49] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on pc2013.codfw.wmnet,pc[1013-1014].eqiad.wmnet with reason: Rebooting pc1013 T307101 [13:42:52] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc[1013-1014].eqiad.wmnet with reason: Rebooting pc1013 T307101 [13:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:20] 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot pc1013 - https://phabricator.wikimedia.org/T307101 (10Kormat) [13:43:27] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on pc1013.eqiad.wmnet with reason: Rebooting for T303174 [13:43:28] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on pc1013.eqiad.wmnet with reason: Rebooting for T303174 [13:43:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:49] (03PS5) 10Lucas Werkmeister (WMDE): Add kanoner.com to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786407 (https://phabricator.wikimedia.org/T306795) (owner: 10Stang) [13:44:01] !log lucaswerkmeister-wmde@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:787114|zhwiki: Add comment to corresponding task of logo (T276694)]] (1/2, no-op) (duration: 00m 50s) [13:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:07] T276694: Simplified Chinese logo of zhwiki was overrided by an old version - https://phabricator.wikimedia.org/T276694 [13:45:16] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:787114|zhwiki: Add comment to corresponding task of logo (T276694)]] (2/2, no-op) (duration: 00m 53s) [13:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:28] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add kanoner.com to the wgCopyUploadsDomains allowlist of Wikimedia Commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786407 (https://phabricator.wikimedia.org/T306795) (owner: 10Stang) [13:46:25] (03Merged) 10jenkins-bot: Add kanoner.com to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786407 (https://phabricator.wikimedia.org/T306795) (owner: 10Stang) [13:46:55] koi: the kanoner change is on mwdebug1001, can you test it? [13:47:12] it could not be tested IMO.. [13:47:31] I thought these changes are usually tested by uploading a file? 
[13:47:40] (03PS2) 10Filippo Giunchedi: clinic-duty: add Arelion to Telia detection [software] - 10https://gerrit.wikimedia.org/r/786893 [13:47:42] (03PS3) 10Filippo Giunchedi: clinic-duty: Minor DOM handling clean up [software] - 10https://gerrit.wikimedia.org/r/717653 (owner: 10Krinkle) [13:48:22] (03CR) 10Filippo Giunchedi: [C: 03+2] clinic-duty: add Arelion to Telia detection [software] - 10https://gerrit.wikimedia.org/r/786893 (owner: 10Filippo Giunchedi) [13:48:27] (03CR) 10Filippo Giunchedi: [C: 03+2] clinic-duty: Minor DOM handling clean up [software] - 10https://gerrit.wikimedia.org/r/717653 (owner: 10Krinkle) [13:48:47] (03PS1) 10JMeybohm: Make /etc/helmfile-defaults/private readable by deployment group [puppet] - 10https://gerrit.wikimedia.org/r/787504 (https://phabricator.wikimedia.org/T305729) [13:48:50] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase[2023-2026].codfw.wmnet with reason: reboot [13:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase[2023-2026].codfw.wmnet with reason: reboot [13:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:11] tested, lgtm [13:50:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:41] ok, thanks [13:50:56] (tested at https://commons.wikimedia.org/wiki/File:Zheleznaja-doroha-na-synopskoj-naberezhnoj.jpg, for the record) [13:50:57] (03CR) 10Jgiannelos: [C: 03+2] tegola: Use new swift container for maps tiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/787435 (https://phabricator.wikimedia.org/T306424) (owner: 10Jgiannelos) [13:51:10] (03PS1) 10Kormat: Revert "ProductionServices: Promote pc1014 to primary of pc3" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/787449 [13:51:59] Lucas_WMDE: that revert is ready. let me know when's a good time to merge+sync it. [13:51:59] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:786407|Add kanoner.com to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T306795)]] (duration: 00m 50s) [13:52:03] let’s do zhwikiversity before itwiki, the abusefilter change sounds more important [13:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:05] T306795: Add kanoner.com to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T306795 [13:52:09] kormat: is it ok if it’s after the end of the hour? [13:52:14] (03PS2) 10Lucas Werkmeister (WMDE): zhwikiversity: Enable blocking feature of AbuseFilter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787115 (https://phabricator.wikimedia.org/T307007) (owner: 10Stang) [13:52:16] Lucas_WMDE: yeah sure [13:52:19] ok [13:52:29] ok [13:52:59] (03CR) 10Muehlenhoff: hieradata: remove ' SREs' from contacts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/787444 (owner: 10Jbond) [13:53:19] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] zhwikiversity: Enable blocking feature of AbuseFilter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787115 (https://phabricator.wikimedia.org/T307007) (owner: 10Stang) [13:53:58] (03Merged) 10jenkins-bot: zhwikiversity: Enable blocking feature of AbuseFilter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787115 (https://phabricator.wikimedia.org/T307007) (owner: 10Stang) [13:53:59] koi: can the zhwikiversity change be tested? 
[13:54:06] (in https://zh.wikiversity.org/wiki/Special:%E7%94%A8%E6%88%B7%E6%9D%83%E9%99%90/Stang?uselang=en it looks like you’re not an admin) [13:54:39] (03PS1) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [13:55:00] I am poking a local sysop, you could move on the next patch :) [13:55:06] (03PS2) 10JMeybohm: Make /etc/helmfile-defaults/private readable by deployment group [puppet] - 10https://gerrit.wikimedia.org/r/787504 (https://phabricator.wikimedia.org/T305729) [13:55:14] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [13:55:23] (03Merged) 10jenkins-bot: tegola: Use new swift container for maps tiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/787435 (https://phabricator.wikimedia.org/T306424) (owner: 10Jgiannelos) [13:55:48] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Make /etc/helmfile-defaults/private readable by deployment group [puppet] - 10https://gerrit.wikimedia.org/r/787504 (https://phabricator.wikimedia.org/T305729) (owner: 10JMeybohm) [13:55:59] well, I already merged this one [13:56:17] (03PS2) 10Jforrester: SpecialExport: Add page table once [core] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787089 (https://phabricator.wikimedia.org/T307037) (owner: 10Ladsgroup) [13:56:30] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:40] <_joe_> ^^ known [13:56:44] koi: the zhwikiversity change is on mwdebug1001 [13:56:56] maybe I can test it myself with createAndPromote.php, let me see… [13:57:02] ack [13:57:32] (03CR) 10JMeybohm: [C: 03+2] Make /etc/helmfile-defaults/private readable by deployment group [puppet] - 10https://gerrit.wikimedia.org/r/787504 
(https://phabricator.wikimedia.org/T305729) (owner: 10JMeybohm) [13:57:53] (03PS1) 10Jgiannelos: proton: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/787506 [13:57:56] hm, can I also remove the rights again from myself afterwards? [13:57:59] * Lucas_WMDE looks for docs on wikitech [13:58:39] oh, not needed now [13:58:50] did you find a sysop? [13:59:07] Even if you could not change the filter, you could still see whether the block setting is available [13:59:13] ah [13:59:23] like https://zh.wikiversity.org/w/index.php?title=Special:%E6%BB%A5%E7%94%A8%E8%BF%87%E6%BB%A4%E5%99%A8/4&uselang=en [13:59:30] (03PS12) 10Jbond: P:wmcs::nfs::maintain_dbusers: replace query_nodes with puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/787485 [13:59:40] there is a "Block the user and/or IP address from editing" option on mwdebug1001 [13:59:46] you’re right, I can see it at https://zh.wikiversity.org/wiki/Special:%E6%BB%A5%E7%94%A8%E8%BF%87%E6%BB%A4%E5%99%A8/1?uselang=en [13:59:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:54] ok, good enough :) thanks! 
[13:59:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase2023.codfw.wmnet [14:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:11] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup1004.eqiad.wmnet [14:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:17] syncing [14:00:23] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34995/console" [puppet] - 10https://gerrit.wikimedia.org/r/787485 (owner: 10Jbond) [14:00:32] jouncebot: nowandnext [14:00:32] No deployments scheduled for the next 1 hour(s) and 59 minute(s) [14:00:32] In 1 hour(s) and 59 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220428T1600) [14:00:47] I’ll just extend the window a bit for the itwiki change [14:00:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:00:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:00] (03PS2) 10Lucas Werkmeister (WMDE): itwiki: assign 'setmentor' to 'bot' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787111 (https://phabricator.wikimedia.org/T307005) (owner: 10Stang) [14:01:04] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:05] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/abusefilter.php: Config: [[gerrit:787115|zhwikiversity: Enable blocking feature of AbuseFilter (T307007)]] (duration: 00m 51s) [14:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:01:10] T307007: Enable "auto-block" feature for AbuseFilter on Chinese Wikiversity - https://phabricator.wikimedia.org/T307007 [14:01:45] there seems to be a shift in performance on all percentiles for appservers since around 13:42: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=1651143679256&orgId=1&to=1651154479256&var-cluster=appserver&var-datasource=eqiad+prometheus%2Fops&var-method=GET&var-code=200 [14:02:04] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] itwiki: assign 'setmentor' to 'bot' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787111 (https://phabricator.wikimedia.org/T307005) (owner: 10Stang) [14:02:06] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2071.codfw.wmnet with reason: Rebooting for T303171 [14:02:08] !log hnowlan@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [14:02:08] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2071.codfw.wmnet with reason: Rebooting for T303171 [14:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:13] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [14:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:38] <_joe_> jynus: that is probably kormat's change? [14:02:48] don't think so [14:02:51] (03Merged) 10jenkins-bot: itwiki: assign 'setmentor' to 'bot' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787111 (https://phabricator.wikimedia.org/T307005) (owner: 10Stang) [14:02:54] or maybe ? 
[14:03:07] <_joe_> jynus: definitely is [14:03:10] mc router seems to be running lots of queries [14:03:15] pc3 currently has an "empty" primary [14:03:31] ok, if it returns to normal after repool, then it would be it [14:03:33] and no worries [14:03:34] koi: itwiki change is on mwdebug1001 [14:03:53] (03PS2) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [14:04:05] I will keep an eye on it in case it doesn't get back to normal later [14:04:25] I thought at first it was traffic-pattern related [14:04:26] itwiki change looks good to me, no diff apart from one added setmentor [14:04:27] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [14:04:29] Lucas_WMDE, lgtm [14:04:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:02] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [14:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:08] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [14:05:09] syncing [14:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:44] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [14:05:46] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [14:05:49] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [14:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:54] !log lucaswerkmeister-wmde@deploy1002 Synchronized 
wmf-config/InitialiseSettings.php: Config: [[gerrit:787111|itwiki: assign 'setmentor' to 'bot' usergroup (T307005)]] (duration: 00m 55s) [14:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:04] T307005: Assign "setmentor" right to "bot" usergroup on itwiki - https://phabricator.wikimedia.org/T307005 [14:06:19] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [14:06:21] !log UTC afternoon backport window done [14:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:30] kormat: I’m done, feel free to deploy the pc changes [14:06:34] Lucas_WMDE: cheers! [14:06:47] (03CR) 10Kormat: [C: 03+2] Revert "ProductionServices: Promote pc1014 to primary of pc3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787449 (owner: 10Kormat) [14:06:49] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [14:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:28] (03Merged) 10jenkins-bot: Revert "ProductionServices: Promote pc1014 to primary of pc3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787449 (owner: 10Kormat) [14:07:40] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1004.eqiad.wmnet [14:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:11] (03PS13) 10Jbond: P:wmcs::nfs::maintain_dbusers: replace query_nodes with puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/787485 [14:08:28] 10SRE: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 (10MoritzMuehlenhoff) [14:08:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
restbase2023.codfw.wmnet [14:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:48] !log kormat@deploy1002 Synchronized wmf-config/ProductionServices.php: Set pc1013 as pc3 primary T307101 (duration: 00m 54s) [14:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:54] T307101: Reboot pc1013 - https://phabricator.wikimedia.org/T307101 [14:09:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34996/console" [puppet] - 10https://gerrit.wikimedia.org/r/787485 (owner: 10Jbond) [14:09:30] !log reimaging s1 to bullseye on s2@codfw dbmaint (T303171) [14:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:36] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [14:09:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:09] (03CR) 10Jbond: [V: 03+1] "latest PCC is noop" [puppet] - 10https://gerrit.wikimedia.org/r/787485 (owner: 10Jbond) [14:10:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:10:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:00] PROBLEM - ganeti-noded running on ganeti-test2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [14:11:06] PROBLEM - ganeti-confd running on ganeti-test2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [14:11:34] !log mwdebug-deploy@deploy1002 helmfile 
[codfw] DONE helmfile.d/services/mwdebug: apply [14:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:45] (03PS3) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [14:12:10] PROBLEM - ganeti-mond running on ganeti-test2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [14:12:18] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [14:12:32] (03PS3) 10Jbond: P:debmonitor::server: drop query_nodes functions [puppet] - 10https://gerrit.wikimedia.org/r/787483 [14:12:34] (03CR) 10Jbond: "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/787483 (owner: 10Jbond) [14:12:51] (03PS6) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [14:13:33] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [14:13:46] jouncebot: nowandnext [14:13:46] No deployments scheduled for the next 1 hour(s) and 46 minute(s) [14:13:46] In 1 hour(s) and 46 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220428T1600) [14:14:01] !log kormat@cumin1001 START - Cookbook sre.hosts.reimage for host db2071.codfw.wmnet with OS bullseye [14:14:03] (03CR) 10Ladsgroup: [C: 03+2] SpecialExport: Add page table once [core] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787089 (https://phabricator.wikimedia.org/T307037) (owner: 10Ladsgroup) [14:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:11] (03CR) 10Vgutierrez: [C: 
03+2] haproxy,ats-tls: Remove X-OpenStack-Request-ID response header [puppet] - 10https://gerrit.wikimedia.org/r/787432 (https://phabricator.wikimedia.org/T279637) (owner: 10Vgutierrez) [14:14:32] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [14:15:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti-test2002.codfw.wmnet with OS bullseye [14:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:40] 10SRE: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti-test2002.codfw.wmnet with OS bullseye [14:16:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:15] (03PS2) 10Jbond: hieradata: remove ' SREs' from contacts [puppet] - 10https://gerrit.wikimedia.org/r/787444 [14:17:27] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [14:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:17:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:45] (03CR) 10Jbond: hieradata: remove ' SREs' from contacts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/787444 (owner: 10Jbond) [14:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:26] (03CR) 10Jbond: hieradata: remove ' SREs' from contacts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/787444 (owner: 10Jbond) [14:18:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] 
DONE helmfile.d/services/mwdebug: apply [14:18:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase2024.codfw.wmnet [14:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:15] (03CR) 10Jbond: [C: 03+2] P:debmonitor::server: drop query_nodes functions [puppet] - 10https://gerrit.wikimedia.org/r/787483 (owner: 10Jbond) [14:21:57] 10SRE, 10Search-Console-access-request: Update Documentation and Process for Access to Search Consoles - https://phabricator.wikimedia.org/T303513 (10jcrespo) The final[sic] (there is always room for improvements) wording on SRE side has been documented at: https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/A... [14:23:54] (03PS7) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [14:24:04] (03PS1) 10Kormat: pc1014: Move back to pc1. [puppet] - 10https://gerrit.wikimedia.org/r/787511 (https://phabricator.wikimedia.org/T307101) [14:24:22] 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot pc1013 - https://phabricator.wikimedia.org/T307101 (10Kormat) [14:24:29] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [14:25:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2024.codfw.wmnet [14:25:37] (03CR) 10Kormat: [C: 03+2] pc1014: Move back to pc1. 
[puppet] - 10https://gerrit.wikimedia.org/r/787511 (https://phabricator.wikimedia.org/T307101) (owner: 10Kormat) [14:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:53] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti-test2002.codfw.wmnet with reason: host reimage [14:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:33] (03CR) 10jerkins-bot: [V: 04-1] SpecialExport: Add page table once [core] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787089 (https://phabricator.wikimedia.org/T307037) (owner: 10Ladsgroup) [14:32:17] 10SRE, 10DBA, 10Security: Reboot pc1013 - https://phabricator.wikimedia.org/T307101 (10Kormat) 05Open→03Resolved [14:32:21] (03CR) 10Ladsgroup: [C: 03+2] "try again" [core] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787089 (https://phabricator.wikimedia.org/T307037) (owner: 10Ladsgroup) [14:32:52] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2071.codfw.wmnet with reason: host reimage [14:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti-test2002.codfw.wmnet with reason: host reimage [14:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:23] 10SRE-swift-storage, 10Data-Persistence, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Patch-For-Review: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (10Jgiannelos) 05Open→03Resolved a:03Jgiannelos Closing this ticket since productio... 
[14:36:10] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2071.codfw.wmnet with reason: host reimage [14:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:38:42] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2092.codfw.wmnet with reason: Rebooting for T303171 [14:38:44] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2092.codfw.wmnet with reason: Rebooting for T303171 [14:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:49] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [14:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host restbase2025.codfw.wmnet [14:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:30] 10SRE, 10serviceops: Provide node14 images for running production node-based services - https://phabricator.wikimedia.org/T306996 (10MoritzMuehlenhoff) While Debian has e.g. 14 and 16 in various development branches none of those are going to be continously updated (e.g. 14 will be replaced by 16 in testing so... 
[14:40:06] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2112.codfw.wmnet with reason: Rebooting for T303171 [14:40:08] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2112.codfw.wmnet with reason: Rebooting for T303171 [14:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:34] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2130.codfw.wmnet with reason: Rebooting for T303171 [14:40:35] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2130.codfw.wmnet with reason: Rebooting for T303171 [14:40:36] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2116.codfw.wmnet with reason: Rebooting for T303171 [14:40:38] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2116.codfw.wmnet with reason: Rebooting for T303171 [14:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:35] (03PS8) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [14:41:39] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2145.codfw.wmnet with reason: Rebooting for T303171 [14:41:41] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2145.codfw.wmnet with reason: Rebooting for T303171 [14:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:49] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:05] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2146.codfw.wmnet with reason: Rebooting for T303171 [14:42:07] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2146.codfw.wmnet with reason: Rebooting for T303171 [14:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [14:42:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [14:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T306560)', diff saved to https://phabricator.wikimedia.org/P26861 and previous config saved to /var/cache/conftool/dbconfig/20220428-144252-ladsgroup.json [14:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:00] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [14:43:30] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [14:43:34] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2094.codfw.wmnet with reason: Reimaging db2072 T303171 [14:43:36] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2094.codfw.wmnet with reason: Reimaging db2072 T303171 [14:43:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T306560)', diff saved to https://phabricator.wikimedia.org/P26862 and previous config saved to /var/cache/conftool/dbconfig/20220428-144401-ladsgroup.json [14:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:23] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2072.codfw.wmnet with reason: Rebooting for T303171 [14:44:25] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2072.codfw.wmnet with reason: Rebooting for T303171 [14:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:30] T303171: Upgrade s1 to Bullseye - https://phabricator.wikimedia.org/T303171 [14:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti-test2002.codfw.wmnet with OS bullseye [14:45:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [14:45:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [14:45:19] 10SRE: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti-test2002.codfw.wmnet with OS bullseye completed: - ganeti-test2002 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled P... 
[14:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:45:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298558)', diff saved to https://phabricator.wikimedia.org/P26863 and previous config saved to /var/cache/conftool/dbconfig/20220428-144528-ladsgroup.json [14:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:44] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [14:45:56] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:46:06] (03PS9) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [14:46:18] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:46:35] !log powercycling restbase2025 [14:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:28] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:47:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298558)', diff saved to https://phabricator.wikimedia.org/P26864 and previous config saved to /var/cache/conftool/dbconfig/20220428-144744-ladsgroup.json [14:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:56] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [14:48:23] (03Merged) 10jenkins-bot: SpecialExport: Add page table once [core] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787089 (https://phabricator.wikimedia.org/T307037) (owner: 10Ladsgroup) [14:49:08] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@8278877]: (no justification provided) [14:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:16] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@8278877]: (no justification provided) (duration: 00m 07s) [14:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:18] (ProbeDown) resolved: Service api-https:443 
has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:51:25] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2071.codfw.wmnet with OS bullseye [14:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:46] (03CR) 10Ahmon Dancy: Allow deploy-mwdebug.py to be paused externally (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/787108 (https://phabricator.wikimedia.org/T299648) (owner: 10Ahmon Dancy) [14:53:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:33] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.9/includes/specials/SpecialExport.php: Backport: [[gerrit:787089|SpecialExport: Add page table once (T307037)]] (duration: 00m 51s) [14:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:40] T307037: DBQueryError: Error 1066: Not unique table/alias: 'page' ([db]) Function: SpecialExport::getLinksQuery - https://phabricator.wikimedia.org/T307037 [14:56:01] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:56:20] (03PS3) 10Ahmon Dancy: Allow deploy-mwdebug.py to be paused externally [puppet] - 10https://gerrit.wikimedia.org/r/787108 (https://phabricator.wikimedia.org/T299648) [14:56:34] (03CR) 10Ahmon Dancy: Allow deploy-mwdebug.py to be paused externally (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/787108 (https://phabricator.wikimedia.org/T299648) (owner: 10Ahmon Dancy) [14:56:39] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host restbase2025.codfw.wmnet [14:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:11] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:58:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:58:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:58:10] !log kormat@cumin1001 START - Cookbook sre.hosts.reimage for host db2092.codfw.wmnet with OS bullseye [14:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:39] !log kormat@cumin1001 START - Cookbook sre.hosts.reimage for host db2112.codfw.wmnet with OS bullseye [14:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:58:58] !log kormat@cumin1001 START - Cookbook sre.hosts.reimage for host db2116.codfw.wmnet with OS bullseye [14:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P26866 and previous config saved to /var/cache/conftool/dbconfig/20220428-145906-ladsgroup.json [14:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:46] !log 
kormat@cumin1001 START - Cookbook sre.hosts.reimage for host db2130.codfw.wmnet with OS bullseye [14:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:53] !log kormat@cumin1001 START - Cookbook sre.hosts.reimage for host db2145.codfw.wmnet with OS bullseye [14:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:01] !log kormat@cumin1001 START - Cookbook sre.hosts.reimage for host db2146.codfw.wmnet with OS bullseye [15:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:10] !log kormat@cumin1001 START - Cookbook sre.hosts.reimage for host db2072.codfw.wmnet with OS bullseye [15:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:45] (03PS1) 10Ebernhardson: [DNM] test puppet globals [puppet] - 10https://gerrit.wikimedia.org/r/787515 [15:01:17] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/787515 (owner: 10Ebernhardson) [15:01:19] (03CR) 10jerkins-bot: [V: 04-1] [DNM] test puppet globals [puppet] - 10https://gerrit.wikimedia.org/r/787515 (owner: 10Ebernhardson) [15:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:02:27] (03CR) 10David Caro: Create REST api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [15:02:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P26867 and previous config saved to /var/cache/conftool/dbconfig/20220428-150249-ladsgroup.json [15:02:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:33] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:05:03] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:05:49] 10SRE, 10serviceops: Provide node14 images for running production node-based services - https://phabricator.wikimedia.org/T306996 (10Jdforrester-WMF) >>! In T306996#7888124, @MoritzMuehlenhoff wrote: > We can import the nodesource packages into separate repository components, e.g. thirdparty/node14 and thirdpa... [15:06:49] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:08:56] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2092.codfw.wmnet with reason: host reimage [15:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:58] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/787485 (owner: 10Jbond) [15:11:40] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2116.codfw.wmnet with reason: host reimage [15:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:50] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2092.codfw.wmnet with reason: host reimage [15:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:48] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2112.codfw.wmnet with reason: host reimage [15:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:54] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2145.codfw.wmnet with reason: host reimage [15:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:08] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2146.codfw.wmnet with reason: host reimage [15:14:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P26868 and previous config saved to /var/cache/conftool/dbconfig/20220428-151411-ladsgroup.json [15:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:13] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: 
HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 754 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:25] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2072.codfw.wmnet with reason: host reimage [15:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:40] !log kormat@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db2146.codfw.wmnet with reason: host reimage [15:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:54] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2130.codfw.wmnet with reason: host reimage [15:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:12] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2116.codfw.wmnet with reason: host reimage [15:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:11] (03Abandoned) 10Ebernhardson: [DNM] test puppet globals [puppet] - 10https://gerrit.wikimedia.org/r/787515 (owner: 10Ebernhardson) [15:17:24] (03CR) 10David Caro: [C: 03+1] "As we talked the other day I think this is ok." 
[puppet] - 10https://gerrit.wikimedia.org/r/779915 (https://phabricator.wikimedia.org/T298940) (owner: 10Btullis) [15:17:26] !log kormat@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db2145.codfw.wmnet with reason: host reimage [15:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P26869 and previous config saved to /var/cache/conftool/dbconfig/20220428-151754-ladsgroup.json [15:17:55] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2112.codfw.wmnet with reason: host reimage [15:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: memory error for elastic1097 - https://phabricator.wikimedia.org/T306449 (10Cmjohnson) 05Open→03Resolved replaced the DIMM, cleared the log. 
[15:20:01] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup2002.codfw.wmnet [15:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:09] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2130.codfw.wmnet with reason: host reimage [15:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:23:07] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2072.codfw.wmnet with reason: host reimage [15:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:13] PROBLEM - Host db2146 is DOWN: PING CRITICAL - Packet loss = 100% [15:26:15] RECOVERY - Host mw1323 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [15:26:31] RECOVERY - Host db2146 is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms [15:26:54] (03PS10) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [15:27:29] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [15:27:34] PROBLEM - mysqld processes on db2146 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:27:50] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2092.codfw.wmnet with OS bullseye [15:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:17] PROBLEM - Host db2145 is 
DOWN: PING CRITICAL - Packet loss = 100% [15:28:39] PROBLEM - MariaDB Replica IO: s1 on db2146 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:28:55] RECOVERY - Host db2145 is UP: PING OK - Packet loss = 0%, RTA = 31.67 ms [15:29:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T306560)', diff saved to https://phabricator.wikimedia.org/P26870 and previous config saved to /var/cache/conftool/dbconfig/20220428-152916-ladsgroup.json [15:29:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [15:29:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [15:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:23] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [15:29:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T306560)', diff saved to https://phabricator.wikimedia.org/P26871 and previous config saved to /var/cache/conftool/dbconfig/20220428-152924-ladsgroup.json [15:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:34] (CR) Cwhite: [V: +2 C: +2] "Builds successfully." 
[debs/prometheus-es-exporter] (debian/sid) - https://gerrit.wikimedia.org/r/787046 (https://phabricator.wikimedia.org/T301017) (owner: Cwhite) [15:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:59] PROBLEM - MariaDB Replica SQL: s1 on db2146 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:29:59] PROBLEM - MariaDB read only s1 on db2146 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:30:14] PROBLEM - MariaDB Replica IO: s1 on db2145 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:31:31] PROBLEM - mysqld processes on db2145 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:31:31] PROBLEM - MariaDB Replica SQL: s1 on db2145 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:31:55] SRE, ops-eqiad, DC-Ops: mw1323 stuck after reboot - https://phabricator.wikimedia.org/T307103 (Cmjohnson) Open→Resolved a:Cmjohnson did a hard power reset, server came back okay. 
no hardware issues were found [15:31:56] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2146.codfw.wmnet with OS bullseye [15:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T306560)', diff saved to https://phabricator.wikimedia.org/P26872 and previous config saved to /var/cache/conftool/dbconfig/20220428-153233-ladsgroup.json [15:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:45] SRE, ops-eqiad, Cassandra, DC-Ops, User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (Cmjohnson) @Eevans Can this happen anytime or do you need a work window? [15:32:55] PROBLEM - MariaDB read only s1 on db2145 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:32:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298558)', diff saved to https://phabricator.wikimedia.org/P26873 and previous config saved to /var/cache/conftool/dbconfig/20220428-153259-ladsgroup.json [15:33:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [15:33:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [15:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:06] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [15:33:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T298558)', diff saved to https://phabricator.wikimedia.org/P26874 and previous config saved to /var/cache/conftool/dbconfig/20220428-153307-ladsgroup.json 
[15:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:23] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2116.codfw.wmnet with OS bullseye [15:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:32] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (Cmjohnson) [15:33:54] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298558)', diff saved to https://phabricator.wikimedia.org/P26875 and previous config saved to /var/cache/conftool/dbconfig/20220428-153523-ladsgroup.json [15:35:28] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2145.codfw.wmnet with OS bullseye [15:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:39] RECOVERY - mysqld processes on db2146 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:35:39] RECOVERY - MariaDB Replica IO: s1 on db2146 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:35:39] RECOVERY - MariaDB Replica SQL: s1 on db2146 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:35:43] RECOVERY - MariaDB read only s1 on db2146 is OK: Version 10.4.22-MariaDB-log, Uptime 
36s, read_only: True, event_scheduler: True, 1353.78 QPS, connection latency: 0.004133s, query latency: 0.000499s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:36:40] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2112.codfw.wmnet with OS bullseye [15:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:29] RECOVERY - mysqld processes on db2145 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:37:31] RECOVERY - MariaDB Replica SQL: s1 on db2145 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:37:31] RECOVERY - MariaDB Replica IO: s1 on db2145 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:37:31] RECOVERY - MariaDB read only s1 on db2145 is OK: Version 10.4.22-MariaDB-log, Uptime 67s, read_only: True, event_scheduler: True, 722.13 QPS, connection latency: 0.003312s, query latency: 0.000456s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:37:51] (PS1) Papaul: Fix partman for new gitlab nodes [puppet] - https://gerrit.wikimedia.org/r/787521 (https://phabricator.wikimedia.org/T306989) [15:38:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:38:44] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2130.codfw.wmnet with OS bullseye [15:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:21] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2072.codfw.wmnet with OS bullseye [15:39:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:51] (CR) Papaul: [C: +2] Fix partman for new gitlab nodes [puppet] - https://gerrit.wikimedia.org/r/787521 (https://phabricator.wikimedia.org/T306989) (owner: Papaul) [15:41:26] (PS1) Gergő Tisza: [beta] GrowthExperiments: Enable AddLink where it's enabled in production [mediawiki-config] - https://gerrit.wikimedia.org/r/787522 (https://phabricator.wikimedia.org/T306833) [15:43:09] SRE, ops-codfw, DC-Ops, Patch-For-Review, cloud-services-team (Kanban): decommission cloudnet200[2,4]-dev.wikimedia.org - https://phabricator.wikimedia.org/T306989 (Papaul) [15:43:27] SRE, ops-codfw, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (Papaul) [15:43:43] SRE, ops-codfw, DC-Ops, Patch-For-Review, cloud-services-team (Kanban): decommission cloudnet200[2,4]-dev.wikimedia.org - https://phabricator.wikimedia.org/T306989 (Papaul) Open→Resolved complete [15:44:20] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab-runner2002.codfw.wmnet with OS bullseye [15:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:28] SRE, ops-codfw, DC-Ops, GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gitlab-runner2002.codfw.w... 
[15:45:28] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1003.mgmt.eqiad.wmnet with reboot policy FORCED [15:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:34] (PS2) Cwhite: opensearch: ensure curator is >=5.8.1 [puppet] - https://gerrit.wikimedia.org/r/787109 (https://phabricator.wikimedia.org/T301017) [15:47:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P26876 and previous config saved to /var/cache/conftool/dbconfig/20220428-154738-ladsgroup.json [15:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:46] (PS1) Hnowlan: changeprop: use helm3 semantics for generating beta config [deployment-charts] - https://gerrit.wikimedia.org/r/787526 (https://phabricator.wikimedia.org/T295578) [15:48:33] (CR) Cwhite: "PCC checks out https://puppet-compiler.wmflabs.org/pcc-worker1003/34997/" [puppet] - https://gerrit.wikimedia.org/r/787109 (https://phabricator.wikimedia.org/T301017) (owner: Cwhite) [15:50:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P26877 and previous config saved to /var/cache/conftool/dbconfig/20220428-155028-ladsgroup.json [15:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:53:42] (CR) Ahmon Dancy: "Covered by https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/787495" [deployment-charts] - https://gerrit.wikimedia.org/r/787058 (https://phabricator.wikimedia.org/T299648) (owner: Ahmon 
Dancy) [15:54:01] (CR) Jbond: [V: +1] P:wmcs::nfs::maintain_dbusers: replace query_nodes with puppetdb_query (1 comment) [puppet] - https://gerrit.wikimedia.org/r/787485 (owner: Jbond) [15:54:42] (CR) Jbond: [C: +2] hieradata: remove ' SREs' from contacts [puppet] - https://gerrit.wikimedia.org/r/787444 (owner: Jbond) [15:55:28] jouncebot nowandnext [15:55:28] No deployments scheduled for the next 0 hour(s) and 4 minute(s) [15:55:28] In 0 hour(s) and 4 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220428T1600) [15:56:53] dancy: nothing in the puppet window, it's all yours [15:57:01] Thanks! [15:57:26] on that note, can you review/merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/787108 ? [15:58:59] checking [15:59:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab-runner2003.codfw.wmnet with OS bullseye [15:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:53] SRE, ops-codfw, DC-Ops, GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gitlab-runner2003.codfw.w... [16:00:04] jbond and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220428T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:00:37] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:01:01] PROBLEM - SSH on analytics1061.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:01:25] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1003.mgmt.eqiad.wmnet with reboot policy FORCED [16:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:41] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1015.mgmt.eqiad.wmnet with reboot policy FORCED [16:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:50] dancy: I can understand wanting to ignore --force when paused, but it might be surprising -- do you want to add a warning message if --force is mistakenly passed anyway? [16:02:10] no wrong answer, happy to merge this way if you prefer it [16:02:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner2002.codfw.wmnet with reason: host reimage [16:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P26878 and previous config saved to /var/cache/conftool/dbconfig/20220428-160243-ladsgroup.json [16:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:21] rzl: Sure I can do that. 
[16:03:42] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup1009.eqiad.wmnet - https://phabricator.wikimedia.org/T307048 (RobH) [16:04:45] (PS4) Ahmon Dancy: Allow deploy-mwdebug.py to be paused externally [puppet] - https://gerrit.wikimedia.org/r/787108 (https://phabricator.wikimedia.org/T299648) [16:05:13] (PS1) David Caro: lvm::volume: add createonly flag and use in cinder backups [puppet] - https://gerrit.wikimedia.org/r/787527 (https://phabricator.wikimedia.org/T307117) [16:05:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P26879 and previous config saved to /var/cache/conftool/dbconfig/20220428-160533-ladsgroup.json [16:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:48] looks good, will merge as soon as jerkins stamps it [16:05:51] thanks! [16:05:54] Thanks! [16:06:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner2002.codfw.wmnet with reason: host reimage [16:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:34] !log dcaro@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cloudbackup2002.codfw.wmnet [16:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:27] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1015.mgmt.eqiad.wmnet with reboot policy FORCED [16:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:37] (PS2) David Caro: lvm::volume: add createonly flag and use in cinder backups [puppet] - https://gerrit.wikimedia.org/r/787527 (https://phabricator.wikimedia.org/T307117) [16:11:23] (CR) RLazarus: [C: +2] Allow deploy-mwdebug.py to be paused externally [puppet] - 
https://gerrit.wikimedia.org/r/787108 (https://phabricator.wikimedia.org/T299648) (owner: Ahmon Dancy) [16:14:25] (PS3) David Caro: lvm::volume: add createonly flag and use in cinder backups [puppet] - https://gerrit.wikimedia.org/r/787527 (https://phabricator.wikimedia.org/T307117) [16:14:58] dancy: merged, and updated on deploy1002 [16:15:07] Awesome. Many thanks! [16:16:05] (CR) David Caro: [V: +1] "PCC SUCCESS (NOOP 4 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35000/console" [puppet] - https://gerrit.wikimedia.org/r/787527 (https://phabricator.wikimedia.org/T307117) (owner: David Caro) [16:17:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner2003.codfw.wmnet with reason: host reimage [16:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T306560)', diff saved to https://phabricator.wikimedia.org/P26880 and previous config saved to /var/cache/conftool/dbconfig/20220428-161748-ladsgroup.json [16:17:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [16:17:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [16:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:56] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [16:17:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [16:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance 
[16:18:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [16:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [16:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [16:18:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [16:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [16:18:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [16:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner2002.codfw.wmnet with OS bullseye [16:18:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T306560)', diff saved to https://phabricator.wikimedia.org/P26881 and previous config saved to /var/cache/conftool/dbconfig/20220428-161828-ladsgroup.json [16:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:35] SRE, ops-codfw, DC-Ops, GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - 
https://phabricator.wikimedia.org/T301183 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gitlab-runner2002.codfw.wmnet... [16:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:49] (PS1) Volans: sre.hosts.reimage: allow to not remove downtime [cookbooks] - https://gerrit.wikimedia.org/r/787528 [16:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:30] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:20:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab-runner2004.codfw.wmnet with OS bullseye [16:20:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298558)', diff saved to https://phabricator.wikimedia.org/P26882 and previous config saved to /var/cache/conftool/dbconfig/20220428-162039-ladsgroup.json [16:20:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [16:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [16:20:45] SRE, ops-codfw, DC-Ops, GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gitlab-runner2004.codfw.w... 
[16:20:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T298558)', diff saved to https://phabricator.wikimedia.org/P26883 and previous config saved to /var/cache/conftool/dbconfig/20220428-162047-ladsgroup.json [16:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:48] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [16:20:49] (PS4) David Caro: lvm::volume: add createonly flag and use in cinder backups [puppet] - https://gerrit.wikimedia.org/r/787527 (https://phabricator.wikimedia.org/T307117) [16:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner2003.codfw.wmnet with reason: host reimage [16:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:18] (PS1) Cmjohnson: add new parsoid nodes parse1001-1024 to site.pp [puppet] - https://gerrit.wikimedia.org/r/787530 (https://phabricator.wikimedia.org/T299573) [16:21:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T306560)', diff saved to https://phabricator.wikimedia.org/P26884 and previous config saved to /var/cache/conftool/dbconfig/20220428-162138-ladsgroup.json [16:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:56] (CR) jerkins-bot: [V: -1] add new parsoid nodes parse1001-1024 to site.pp [puppet] - https://gerrit.wikimedia.org/r/787530 (https://phabricator.wikimedia.org/T299573) (owner: Cmjohnson) [16:21:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T298558)', diff saved to 
https://phabricator.wikimedia.org/P26885 and previous config saved to /var/cache/conftool/dbconfig/20220428-162159-ladsgroup.json [16:22:03] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: proc-sys-fs-binfmt_misc.automount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:07] SRE, ops-eqiad, DC-Ops, serviceops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (Cmjohnson) [16:26:03] (CR) Kormat: [C: +1] "<3 !" [cookbooks] - https://gerrit.wikimedia.org/r/787528 (owner: Volans) [16:28:45] (Abandoned) Cmjohnson: add new parsoid nodes parse1001-1024 to site.pp [puppet] - https://gerrit.wikimedia.org/r/787530 (https://phabricator.wikimedia.org/T299573) (owner: Cmjohnson) [16:31:31] !log dancy@deploy1002 Started scap: testing mediawiki container image build and deploy [16:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner2003.codfw.wmnet with OS bullseye [16:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:07] (PS1) Cmjohnson: add new parsoid servers parse1001-1024 to site.pp [puppet] - https://gerrit.wikimedia.org/r/787532 (https://phabricator.wikimedia.org/T299573) [16:33:10] SRE, ops-codfw, DC-Ops, GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gitlab-runner2003.codfw.wmnet... 
[16:34:57] (CR) Cmjohnson: [C: +2] add new parsoid servers parse1001-1024 to site.pp [puppet] - https://gerrit.wikimedia.org/r/787532 (https://phabricator.wikimedia.org/T299573) (owner: Cmjohnson) [16:36:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P26886 and previous config saved to /var/cache/conftool/dbconfig/20220428-163643-ladsgroup.json [16:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:59] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:55] (CR) Volans: [C: +2] sre.hosts.reimage: allow to not remove downtime [cookbooks] - https://gerrit.wikimedia.org/r/787528 (owner: Volans) [16:38:09] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1029.mgmt.eqiad.wmnet with reboot policy FORCED [16:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:41] (PS1) Aqu: Bumping refine job version to 0.1.27 [puppet] - https://gerrit.wikimedia.org/r/787533 (https://phabricator.wikimedia.org/T305386) [16:38:51] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1030.mgmt.eqiad.wmnet with reboot policy FORCED [16:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner2004.codfw.wmnet with reason: host reimage [16:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:26] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1031.mgmt.eqiad.wmnet with reboot policy FORCED [16:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision 
for host ganeti1032.mgmt.eqiad.wmnet with reboot policy FORCED [16:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:48] (Merged) jenkins-bot: sre.hosts.reimage: allow to not remove downtime [cookbooks] - https://gerrit.wikimedia.org/r/787528 (owner: Volans) [16:41:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1031.mgmt.eqiad.wmnet with reboot policy FORCED [16:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner2004.codfw.wmnet with reason: host reimage [16:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:16] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:09] SRE-OnFire (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (lmata) [16:46:51] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1001.eqiad.wmnet with OS buster [16:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:59] SRE, ops-eqiad, DC-Ops, serviceops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1001.eqiad.wmnet with OS buster [16:51:25] SRE-OnFire (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (lmata) [16:51:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P26887 and previous config saved to 
/var/cache/conftool/dbconfig/20220428-165148-ladsgroup.json [16:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:44] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1029.mgmt.eqiad.wmnet with reboot policy FORCED [16:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:52] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1030.mgmt.eqiad.wmnet with reboot policy FORCED [16:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:23] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1032.mgmt.eqiad.wmnet with reboot policy FORCED [16:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner2004.codfw.wmnet with OS bullseye [16:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:36] SRE, ops-codfw, DC-Ops, GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gitlab-runner2004.codfw.wmnet... 
[16:54:17] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1002.eqiad.wmnet with OS buster [16:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1002.eqiad.wmnet with OS buster [16:54:43] 10SRE-OnFire (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) [16:55:16] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1004.eqiad.wmnet with OS buster [16:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1004.eqiad.wmnet with OS buster [16:55:47] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1005.eqiad.wmnet with OS buster [16:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1005.eqiad.wmnet with OS buster [16:56:21] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1006.eqiad.wmnet with OS buster [16:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need 
By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1006.eqiad.wmnet with OS buster [16:57:12] !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:50] 10SRE, 10SRE-OnFire (FY2021/2022-Q2), 10Sustainability (Incident Followup): Incident: 2021-12-03 mx2001->Gmail delivery issues - https://phabricator.wikimedia.org/T297127 (10lmata) 05Open→03Resolved docs metadata and scorecard filled. Resolving. [16:57:53] 10SRE, 10Infrastructure-Foundations, 10Mail: MX record issue on mx2001.wikimedia.org - https://phabricator.wikimedia.org/T297017 (10lmata) [16:57:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1001.eqiad.wmnet with reason: host reimage [16:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:26] 10SRE-OnFire (FY2021/2022-Q2): 2021-11-23 Core Network Routing - https://phabricator.wikimedia.org/T299969 (10lmata) 05Open→03Resolved docs on wikitech, metadata, and scorecard filled. Resolving. [16:58:50] 10SRE-OnFire (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) [16:58:56] (03CR) 10Btullis: [C: 03+2] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/787533 (https://phabricator.wikimedia.org/T305386) (owner: 10Aqu) [16:59:52] 10SRE-OnFire (FY2021/2022-Q2): 2021-11-18 codfw ipv6 network - https://phabricator.wikimedia.org/T299968 (10lmata) 05Open→03Resolved docs on wikitech, metadata and scorecard filled. Resolving. 
[17:00:15] SRE, Infrastructure-Foundations: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722 (jcrespo) The firmware update worked- just I don't know which version (I am currently using 21.80.16.95). The problem was that, after it had first failed, when installing an older... [17:00:20] SRE-OnFire (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (lmata) [17:01:22] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1001.eqiad.wmnet with reason: host reimage [17:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2002.wikimedia.org with OS bullseye [17:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:20] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1003.eqiad.wmnet with OS buster [17:02:23] SRE, ops-codfw, DC-Ops, GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gitlab2002.wikimedia.org... 
[17:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1003.eqiad.wmnet with OS buster [17:03:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1007.eqiad.wmnet with OS buster [17:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1007.eqiad.wmnet with OS buster [17:03:12] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:42] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1008.eqiad.wmnet with OS buster [17:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1008.eqiad.wmnet with OS buster [17:04:53] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:05:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:15] !log cmjohnson@cumin1001 
START - Cookbook sre.hosts.downtime for 2:00:00 on parse1002.eqiad.wmnet with reason: host reimage [17:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:05:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:13] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1004.eqiad.wmnet with reason: host reimage [17:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:45] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1005.eqiad.wmnet with reason: host reimage [17:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:06:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T306560)', diff saved to https://phabricator.wikimedia.org/P26889 and previous config saved to /var/cache/conftool/dbconfig/20220428-170652-ladsgroup.json [17:06:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [17:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [17:07:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T306560)', diff saved to https://phabricator.wikimedia.org/P26890 and previous config saved to /var/cache/conftool/dbconfig/20220428-170700-ladsgroup.json [17:07:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:02] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [17:07:05] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1006.eqiad.wmnet with reason: host reimage [17:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [17:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [17:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [17:07:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [17:07:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [17:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [17:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[17:08:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [17:08:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [17:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [17:08:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [17:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T298558)', diff saved to https://phabricator.wikimedia.org/P26891 and previous config saved to /var/cache/conftool/dbconfig/20220428-170820-ladsgroup.json [17:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:23] (03PS1) 10Cmjohnson: Add ganeti10[29|3(012)] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/787537 (https://phabricator.wikimedia.org/T299459) [17:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:28] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [17:08:43] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1002.eqiad.wmnet with reason: host reimage [17:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T306560)', diff saved to https://phabricator.wikimedia.org/P26892 and previous config saved to 
/var/cache/conftool/dbconfig/20220428-170908-ladsgroup.json [17:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:19] (03CR) 10Cmjohnson: [C: 03+2] Add ganeti10[29|3(012)] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/787537 (https://phabricator.wikimedia.org/T299459) (owner: 10Cmjohnson) [17:10:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298558)', diff saved to https://phabricator.wikimedia.org/P26893 and previous config saved to /var/cache/conftool/dbconfig/20220428-171035-ladsgroup.json [17:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:07] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on parse1006.eqiad.wmnet with reason: host reimage [17:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:38] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1004.eqiad.wmnet with reason: host reimage [17:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:02] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1001.eqiad.wmnet with OS buster [17:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1001.eqiad.wmnet with OS buster complet... 
[17:12:27] RECOVERY - SSH on analytics1061.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:13:09] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1009.eqiad.wmnet with OS buster [17:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:17] SRE, ops-eqiad, DC-Ops, serviceops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1009.eqiad.wmnet with OS buster [17:13:17] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1003.eqiad.wmnet with reason: host reimage [17:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:43] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1005.eqiad.wmnet with reason: host reimage [17:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1007.eqiad.wmnet with reason: host reimage [17:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:38] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1008.eqiad.wmnet with reason: host reimage [17:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:18] (PS1) Jbond: C:docker_registry_ha::web: use puppetdb_query instead of query_facts [puppet] - https://gerrit.wikimedia.org/r/787538 [17:16:26] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35001/console" [puppet] - https://gerrit.wikimedia.org/r/787538 (owner: Jbond) [17:16:40] !log cmjohnson@cumin1001 END 
(FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on parse1008.eqiad.wmnet with reason: host reimage [17:16:42] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye [17:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:51] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gitlab2003.wikimedia.org... [17:17:11] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1003.eqiad.wmnet with reason: host reimage [17:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:41] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1029.eqiad.wmnet with OS buster [17:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1029.eqiad.wmnet with OS buster [17:17:51] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1030.eqiad.wmnet with OS buster [17:17:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1031.eqiad.wmnet with OS buster [17:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:57] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1032.eqiad.wmnet with OS buster [17:17:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q3:(Need By: TBD) 
rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1030.eqiad.wmnet with OS buster [17:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1031.eqiad.wmnet with OS buster [17:18:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1032.eqiad.wmnet with OS buster [17:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:40] (03PS11) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [17:19:15] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:19:15] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1002.eqiad.wmnet with OS buster [17:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1002.eqiad.wmnet with OS buster complet... 
[17:19:29] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1007.eqiad.wmnet with reason: host reimage [17:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:34] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1010.eqiad.wmnet with OS buster [17:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1010.eqiad.wmnet with OS buster [17:22:29] (03CR) 10Jbond: [V: 03+1] "> Patch Set 1: Verified+1" [puppet] - 10https://gerrit.wikimedia.org/r/787538 (owner: 10Jbond) [17:23:09] (03PS2) 10Jbond: C:docker_registry_ha::web: use puppetdb_query instead of query_facts [puppet] - 10https://gerrit.wikimedia.org/r/787538 [17:23:54] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1006.eqiad.wmnet with OS buster [17:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1006.eqiad.wmnet with OS buster complet... 
[17:24:04] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1009.eqiad.wmnet with reason: host reimage [17:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P26894 and previous config saved to /var/cache/conftool/dbconfig/20220428-172413-ladsgroup.json [17:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:20] (03PS3) 10Jbond: C:docker_registry_ha::web: use puppetdb_query instead of query_facts [puppet] - 10https://gerrit.wikimedia.org/r/787538 [17:25:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P26895 and previous config saved to /var/cache/conftool/dbconfig/20220428-172540-ladsgroup.json [17:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:52] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1004.eqiad.wmnet with OS buster [17:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1004.eqiad.wmnet with OS buster complet... 
[17:26:52] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1005.eqiad.wmnet with OS buster [17:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1005.eqiad.wmnet with OS buster complet... [17:27:10] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1011.eqiad.wmnet with OS buster [17:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1011.eqiad.wmnet with OS buster [17:27:24] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1012.eqiad.wmnet with OS buster [17:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:29] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1009.eqiad.wmnet with reason: host reimage [17:27:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1012.eqiad.wmnet with OS buster [17:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:35] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1013.eqiad.wmnet with OS buster [17:27:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1013.eqiad.wmnet with OS buster [17:28:24] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10Papaul) [17:29:17] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1029.eqiad.wmnet with reason: host reimage [17:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:32] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1030.eqiad.wmnet with reason: host reimage [17:29:36] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1031.eqiad.wmnet with reason: host reimage [17:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:38] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1008.eqiad.wmnet with OS buster [17:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:45] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1003.eqiad.wmnet with OS buster [17:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1008.eqiad.wmnet with OS buster complet... 
[17:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1003.eqiad.wmnet with OS buster complet... [17:30:13] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1014.eqiad.wmnet with OS buster [17:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1014.eqiad.wmnet with OS buster [17:30:28] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1015.eqiad.wmnet with OS buster [17:30:28] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host gitlab2002.wikimedia.org with OS bullseye [17:30:31] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1010.eqiad.wmnet with reason: host reimage [17:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:36] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gitlab2002.wikimedia.org with... 
[17:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1015.eqiad.wmnet with OS buster [17:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:45] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1007.eqiad.wmnet with OS buster [17:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1007.eqiad.wmnet with OS buster complet... 
[17:30:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1016.eqiad.wmnet with OS buster [17:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1016.eqiad.wmnet with OS buster [17:31:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2002.wikimedia.org with OS bullseye [17:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:41] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gitlab2002.wikimedia.org... 
[17:32:37] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1029.eqiad.wmnet with reason: host reimage [17:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:04] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 (Cmjohnson) [17:34:41] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 (Cmjohnson) Open→Resolved Removed from rack, updated scs [17:35:24] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1010.eqiad.wmnet with reason: host reimage [17:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:48] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1032.eqiad.wmnet with reason: host reimage [17:35:53] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1016.eqiad.wmnet with OS buster [17:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:59] SRE, ops-eqiad, DC-Ops, serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1016.eqiad.wmnet with OS buster [17:36:00] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host parse1016.eqiad.wmnet with OS buster [17:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:05] SRE, ops-eqiad, DC-Ops, serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1016.eqiad.wmnet with OS buster executed with errors: - parse... [17:36:40] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1016.eqiad.wmnet with OS buster [17:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:47] SRE, ops-eqiad, DC-Ops, serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1016.eqiad.wmnet with OS buster [17:36:48] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host parse1016.eqiad.wmnet with OS buster [17:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:52] SRE, ops-eqiad, DC-Ops, serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1016.eqiad.wmnet with OS buster executed with errors: - parse...
[17:37:23] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1030.eqiad.wmnet with reason: host reimage [17:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:07] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1011.eqiad.wmnet with reason: host reimage [17:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:15] !log testing logging: [[gerrit:9999]] [17:38:18] (03PS4) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) [17:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1012.eqiad.wmnet with reason: host reimage [17:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:31] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1013.eqiad.wmnet with reason: host reimage [17:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P26896 and previous config saved to /var/cache/conftool/dbconfig/20220428-173918-ladsgroup.json [17:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:36] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ganeti1031.eqiad.wmnet with reason: host reimage [17:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:53] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787505 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [17:40:08] !log cmjohnson@cumin1001 START - Cookbook 
sre.hosts.reimage for host parse1016.eqiad.wmnet with OS buster [17:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1016.eqiad.wmnet with OS buster [17:40:15] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host parse1016.eqiad.wmnet with OS buster [17:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1016.eqiad.wmnet with OS buster executed with errors: - parse... [17:40:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P26897 and previous config saved to /var/cache/conftool/dbconfig/20220428-174046-ladsgroup.json [17:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:58] (03Abandoned) 10Dzahn: DHCP: make doh and durum hosts use the bullseye installer [puppet] - 10https://gerrit.wikimedia.org/r/779531 (https://phabricator.wikimedia.org/T305589) (owner: 10Dzahn) [17:41:10] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1014.eqiad.wmnet with reason: host reimage [17:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:37] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1015.eqiad.wmnet with reason: host reimage [17:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:55] !log cmjohnson@cumin1001 
START - Cookbook sre.hosts.downtime for 2:00:00 on parse1016.eqiad.wmnet with reason: host reimage [17:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1017.eqiad.wmnet with OS buster [17:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1017.eqiad.wmnet with OS buster [17:43:29] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1013.eqiad.wmnet with reason: host reimage [17:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:42] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1029.eqiad.wmnet with OS buster [17:44:47] 10SRE, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1029.eqiad.wmnet with OS buster completed: - ganeti1029 (**PASS**) - R... 
[17:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:25] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1018.eqiad.wmnet with OS buster [17:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:30] SRE, ops-eqiad, DC-Ops, serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1018.eqiad.wmnet with OS buster [17:45:38] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1009.eqiad.wmnet with OS buster [17:45:39] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on parse1015.eqiad.wmnet with reason: host reimage [17:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:45] SRE, ops-eqiad, DC-Ops, serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1009.eqiad.wmnet with OS buster completed: - parse1009 (**PAS...
[17:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:49] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ganeti1032.eqiad.wmnet with reason: host reimage [17:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1019.eqiad.wmnet with OS buster [17:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1019.eqiad.wmnet with OS buster [17:46:10] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1016.eqiad.wmnet with reason: host reimage [17:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:34] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2002.wikimedia.org with reason: host reimage [17:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [17:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:13] (03PS12) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [17:47:25] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1010.eqiad.wmnet with OS buster [17:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) 
rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1010.eqiad.wmnet with OS buster completed: - parse1010 (**PAS... [17:47:37] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1020.eqiad.wmnet with OS buster [17:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1020.eqiad.wmnet with OS buster [17:47:48] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:48:07] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on parse1011.eqiad.wmnet with reason: host reimage [17:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:22] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on parse1012.eqiad.wmnet with reason: host reimage [17:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:47] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1030.eqiad.wmnet with OS buster [17:48:51] 10SRE, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1030.eqiad.wmnet with OS buster completed: - ganeti1030 (**PASS**) - R... 
[17:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:03] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1021.eqiad.wmnet with OS buster [17:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1021.eqiad.wmnet with OS buster [17:51:10] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on parse1014.eqiad.wmnet with reason: host reimage [17:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:38] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1031.eqiad.wmnet with OS buster [17:51:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [17:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:43] 10SRE, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1031.eqiad.wmnet with OS buster completed: - ganeti1031 (**WARN**) - R... 
[17:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1022.eqiad.wmnet with OS buster [17:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1022.eqiad.wmnet with OS buster [17:53:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1017.eqiad.wmnet with reason: host reimage [17:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2002.wikimedia.org with reason: host reimage [17:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:19] (03CR) 10Dzahn: [C: 03+2] admin/dzahn: use the run-puppet-agent wrapper in my personal aliases [puppet] - 10https://gerrit.wikimedia.org/r/787107 (owner: 10Dzahn) [17:54:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T306560)', diff saved to https://phabricator.wikimedia.org/P26898 and previous config saved to /var/cache/conftool/dbconfig/20220428-175423-ladsgroup.json [17:54:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [17:54:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [17:54:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:54:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:30] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [17:54:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T306560)', diff saved to https://phabricator.wikimedia.org/P26899 and previous config saved to /var/cache/conftool/dbconfig/20220428-175436-ladsgroup.json [17:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:19] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1013.eqiad.wmnet with OS buster [17:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1013.eqiad.wmnet with OS buster completed: - parse1013 (**PAS... 
[17:55:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298558)', diff saved to https://phabricator.wikimedia.org/P26900 and previous config saved to /var/cache/conftool/dbconfig/20220428-175551-ladsgroup.json [17:55:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [17:55:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [17:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:58] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [17:55:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T298558)', diff saved to https://phabricator.wikimedia.org/P26901 and previous config saved to /var/cache/conftool/dbconfig/20220428-175559-ladsgroup.json [17:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:36] (03CR) 10Dzahn: [C: 03+2] "thank you all" [puppet] - 10https://gerrit.wikimedia.org/r/787440 (owner: 10Jbond) [17:56:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T306560)', diff saved to https://phabricator.wikimedia.org/P26902 and previous config saved to /var/cache/conftool/dbconfig/20220428-175644-ladsgroup.json [17:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1019.eqiad.wmnet with reason: host reimage [17:56:59] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1018.eqiad.wmnet with reason: host reimage [17:57:00] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:11] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1017.eqiad.wmnet with reason: host reimage [17:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:39] (03Abandoned) 10Dzahn: cumin: add "owner" aliases to get lists of host per SRE subteam [puppet] - 10https://gerrit.wikimedia.org/r/786430 (https://phabricator.wikimedia.org/T306830) (owner: 10Dzahn) [17:58:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298558)', diff saved to https://phabricator.wikimedia.org/P26903 and previous config saved to /var/cache/conftool/dbconfig/20220428-175815-ladsgroup.json [17:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1032.eqiad.wmnet with OS buster [17:58:34] (03CR) 10Dzahn: cumin: add "owner" aliases to get lists of host per SRE subteam (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/786430 (https://phabricator.wikimedia.org/T306830) (owner: 10Dzahn) [17:58:36] 10SRE, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1032.eqiad.wmnet with OS buster completed: - ganeti1032 (**WARN**) - R... 
[17:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:38] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1020.eqiad.wmnet with reason: host reimage [17:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:46] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1015.eqiad.wmnet with OS buster [17:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1015.eqiad.wmnet with OS buster completed: - parse1015 (**WAR... [17:59:11] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host parse1016.eqiad.wmnet with OS buster [17:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1016.eqiad.wmnet with OS buster completed: - parse1016 (**FAI... [17:59:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1016.eqiad.wmnet with OS buster executed with errors: - parse... 
[17:59:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1023.eqiad.wmnet with OS buster [17:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:36] SRE, ops-eqiad, DC-Ops, serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1023.eqiad.wmnet with OS buster [17:59:50] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on parse1018.eqiad.wmnet with reason: host reimage [17:59:51] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host parse1024.eqiad.wmnet with OS buster [17:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:56] SRE, ops-eqiad, DC-Ops, serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host parse1024.eqiad.wmnet with OS buster [17:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1021.eqiad.wmnet with reason: host reimage [18:00:05] brennen and jeena: Dear deployers, time to do the MediaWiki train - Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220428T1800).
[18:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:12] o/ [18:00:14] o/ [18:00:23] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1019.eqiad.wmnet with reason: host reimage [18:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:01] !log train 1.39.0-wmf.9 (T305215): no current blockers, logs fairly clear, proceeding to all wikis as soon as i finish this burrito [18:01:04] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1012.eqiad.wmnet with OS buster [18:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:07] T305215: 1.39.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T305215 [18:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1012.eqiad.wmnet with OS buster completed: - parse1012 (**WAR... [18:01:17] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1011.eqiad.wmnet with OS buster [18:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1011.eqiad.wmnet with OS buster completed: - parse1011 (**WAR... 
[18:01:28] (03CR) 10Dzahn: [C: 03+2] "works for me on cumin2002 now:" [puppet] - 10https://gerrit.wikimedia.org/r/787440 (owner: 10Jbond) [18:01:46] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1014.eqiad.wmnet with OS buster [18:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1014.eqiad.wmnet with OS buster completed: - parse1014 (**WAR... [18:02:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1022.eqiad.wmnet with reason: host reimage [18:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:09] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1020.eqiad.wmnet with reason: host reimage [18:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:53] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:04:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2003.wikimedia.org with OS bullseye [18:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:28] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gitlab2003.wikimedia.org with... 
[18:04:37] (03PS1) 10Brennen Bearnes: all wikis to 1.39.0-wmf.9 refs T305215 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787540 [18:04:39] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.39.0-wmf.9 refs T305215 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787540 (owner: 10Brennen Bearnes) [18:05:02] 10SRE, 10Patch-For-Review: role_contacts (service owners) as a custom puppet fact / cumin aliases for owners - https://phabricator.wikimedia.org/T306830 (10Dzahn) [18:05:19] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.9 refs T305215 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787540 (owner: 10Brennen Bearnes) [18:05:31] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1021.eqiad.wmnet with reason: host reimage [18:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:42] 10SRE, 10Patch-For-Review: role_contacts (service owners) as a custom puppet fact / cumin aliases for owners - https://phabricator.wikimedia.org/T306830 (10Dzahn) 05Open→03Resolved a:03Dzahn This was resolved by John's final change https://gerrit.wikimedia.org/r/c/operations/puppet/+/787440 which I just... [18:06:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2002.wikimedia.org with OS bullseye [18:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:26] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gitlab2002.wikimedia.org with... [18:06:30] 10SRE: mail alias jason@wikipedia.org - https://phabricator.wikimedia.org/T280026 (10JAdams) Hi @Dzahn -- I am commenting on this resolved thread with a related question. 
When recipients reply to this email, it's supposed to reach our endowment@wikimedia.org email address. However, it has been directing all me... [18:07:43] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.9 refs T305215 [18:07:44] (03PS1) 10Jbond: P:mariadb::proxy::multiinstance_replicas: drop use of query_facts [puppet] - 10https://gerrit.wikimedia.org/r/787541 [18:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:49] T305215: 1.39.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T305215 [18:08:22] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1022.eqiad.wmnet with reason: host reimage [18:08:27] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1017.eqiad.wmnet with OS buster [18:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:30] (03PS4) 10Jbond: C:docker_registry_ha::web: use puppetdb_query instead of query_facts [puppet] - 10https://gerrit.wikimedia.org/r/787538 [18:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1017.eqiad.wmnet with OS buster completed: - parse1017 (**PAS... 
[18:09:33] 10SRE, 10Infrastructure-Foundations, 10netops: Configure cloudsw1-e4-eqiad and cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T304936 (10cmooney) [18:09:43] 10SRE, 10Infrastructure-Foundations, 10netops: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10cmooney) [18:09:58] 10SRE, 10Infrastructure-Foundations, 10netops: Configure cloudsw1-e4-eqiad and cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T304936 (10cmooney) [18:10:04] 10SRE, 10Infrastructure-Foundations, 10netops: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10cmooney) [18:10:16] (03PS2) 10Jbond: P:mariadb::proxy::multiinstance_replicas: drop use of query_facts [puppet] - 10https://gerrit.wikimedia.org/r/787541 [18:10:30] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1023.eqiad.wmnet with reason: host reimage [18:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:36] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:10:50] 10SRE, 10Infrastructure-Foundations, 10netops: Configure cloudsw1-e4-eqiad and cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T304936 (10cmooney) [18:10:50] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1024.eqiad.wmnet with reason: host reimage [18:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10cmooney) [18:11:13] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35003/console" [puppet] - 10https://gerrit.wikimedia.org/r/787541 (owner: 10Jbond) [18:11:30] (03PS1) 10Gergő Tisza: Video landing page: Don't show campaign body text on mobile [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787459 (https://phabricator.wikimedia.org/T303785) [18:11:35] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1018.eqiad.wmnet with OS buster [18:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1018.eqiad.wmnet with OS buster completed: - parse1018 (**WAR... [18:11:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P26904 and previous config saved to /var/cache/conftool/dbconfig/20220428-181149-ladsgroup.json [18:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:17] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1019.eqiad.wmnet with OS buster [18:13:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P26905 and previous config saved to /var/cache/conftool/dbconfig/20220428-181320-ladsgroup.json [18:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) 
rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1019.eqiad.wmnet with OS buster completed: - parse1019 (**PAS... [18:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:57] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1023.eqiad.wmnet with reason: host reimage [18:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:25] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10Papaul) [18:16:14] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10Papaul) 05In progress→03Resolved This is complete [18:16:17] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1024.eqiad.wmnet with reason: host reimage [18:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:35] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10Papaul) @Eevans thanks [18:16:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:17:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:22] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1020.eqiad.wmnet with OS buster [18:17:27] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1020.eqiad.wmnet with OS buster completed: - parse1020 (**PAS... [18:17:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:01] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1021.eqiad.wmnet with OS buster [18:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1021.eqiad.wmnet with OS buster completed: - parse1021 (**PAS... [18:19:49] 10SRE: mail alias jason@wikipedia.org - https://phabricator.wikimedia.org/T280026 (10Dzahn) Hello @JAdams all we do is forward the special alias jason@wikipedia.org to endowment@wikimedia.org. But what happens to mail sent to endowment@wikimedia.org is not under our control. That is Tech Support / Google doma... [18:21:06] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1022.eqiad.wmnet with OS buster [18:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1022.eqiad.wmnet with OS buster completed: - parse1022 (**PAS... 
[18:22:19] 10SRE: mail alias jason@wikipedia.org - https://phabricator.wikimedia.org/T280026 (10JAdams) @Dzahn - Thanks for this clarification. I will work on this with Tech Support from here. Okay to resolve this ticket again! [18:25:44] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1023.eqiad.wmnet with OS buster [18:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1023.eqiad.wmnet with OS buster completed: - parse1023 (**PAS... [18:26:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P26906 and previous config saved to /var/cache/conftool/dbconfig/20220428-182654-ladsgroup.json [18:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:02] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1024.eqiad.wmnet with OS buster [18:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host parse1024.eqiad.wmnet with OS buster completed: - parse1024 (**PAS... 
[18:28:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P26907 and previous config saved to /var/cache/conftool/dbconfig/20220428-182825-ladsgroup.json [18:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:22] (03CR) 10jerkins-bot: [V: 04-1] Video landing page: Don't show campaign body text on mobile [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787459 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [18:35:06] 10SRE-OnFire (FY2021/2022-Q2): 2021-11-05 TOC language converter - https://phabricator.wikimedia.org/T299966 (10lmata) 05Open→03Resolved wikitech is updated with docs and scorecard, resolving [18:35:39] 10SRE-OnFire (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) [18:36:09] (03PS3) 10Jbond: P:mariadb::proxy::multiinstance_replicas: drop use of query_facts [puppet] - 10https://gerrit.wikimedia.org/r/787541 [18:36:11] (03PS1) 10Jbond: puppetdb: add query_facts function [puppet] - 10https://gerrit.wikimedia.org/r/787547 [18:36:13] (03PS1) 10Jbond: C:ssh::publish_fingerprints: update to use new_query facts function [puppet] - 10https://gerrit.wikimedia.org/r/787548 [18:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:37:55] 10SRE-OnFire (FY2021/2022-Q2): 2021-11-04 large file upload timeouts - https://phabricator.wikimedia.org/T299965 (10lmata) 05Open→03Resolved scorecard updated, docs on wikitech. 
resolving [18:38:19] 10SRE-OnFire (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) [18:39:13] 10SRE-OnFire (FY2021/2022-Q2), 10cloud-services-team (Kanban): 2021-11-02 Cloud VPS networking - https://phabricator.wikimedia.org/T299964 (10lmata) 05In progress→03Resolved this incident is scored, reviewed and updated docs in wikitech. resolving, [18:41:03] (03PS4) 10Jbond: P:mariadb::proxy::multiinstance_replicas: drop use of query_facts [puppet] - 10https://gerrit.wikimedia.org/r/787541 [18:41:26] (03PS2) 10Jbond: C:ssh::publish_fingerprints: update to use new_query facts function [puppet] - 10https://gerrit.wikimedia.org/r/787548 [18:41:52] 10SRE, 10Infrastructure-Foundations, 10netops: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10cmooney) Just a quick update here, will provide a fuller update (incl. updated diagrams etc.) next week. I've been working in a virtualize... 
[18:42:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T306560)', diff saved to https://phabricator.wikimedia.org/P26908 and previous config saved to /var/cache/conftool/dbconfig/20220428-184159-ladsgroup.json [18:42:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [18:42:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [18:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:07] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [18:42:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T306560)', diff saved to https://phabricator.wikimedia.org/P26909 and previous config saved to /var/cache/conftool/dbconfig/20220428-184207-ladsgroup.json [18:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:22] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:43:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298558)', diff saved to https://phabricator.wikimedia.org/P26910 and previous config saved to /var/cache/conftool/dbconfig/20220428-184330-ladsgroup.json [18:43:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [18:43:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [18:43:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:37] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [18:43:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T298558)', diff saved to https://phabricator.wikimedia.org/P26911 and previous config saved to /var/cache/conftool/dbconfig/20220428-184338-ladsgroup.json [18:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T306560)', diff saved to https://phabricator.wikimedia.org/P26912 and previous config saved to /var/cache/conftool/dbconfig/20220428-184415-ladsgroup.json [18:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298558)', diff saved to https://phabricator.wikimedia.org/P26913 and previous config saved to /var/cache/conftool/dbconfig/20220428-184554-ladsgroup.json [18:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10cmooney) a:05Cmjohnson→03cmooney This requires the updated WMCS network design to be agreed / validated (T304989) after which we can... 
[18:46:32] 10SRE-OnFire (FY2021/2022-Q2): 2021-10-29 graphite - https://phabricator.wikimedia.org/T295157 (10lmata) 05In progress→03Resolved metadata and docs updated on wikitech, scorecard complete, resolving [18:46:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install cloudvirt10[48-50].eqiad.wmnet - https://phabricator.wikimedia.org/T299574 (10cmooney) a:05Cmjohnson→03cmooney This requires the updated WMCS network design to be agreed / validated (T304989) after whic... [18:48:07] 10SRE-OnFire (FY2021/2022-Q2): 2021-10-25 s3 db recentchanges replica - https://phabricator.wikimedia.org/T295154 (10lmata) 05Open→03Resolved docs updated, scorecard complete. resolving [18:48:38] 10SRE-OnFire (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) [18:48:50] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:49:37] 10SRE-OnFire (FY2021/2022-Q2): 2021-10-22 eqiad return path timeouts - https://phabricator.wikimedia.org/T295152 (10lmata) 05Open→03Resolved docs updated on wikitech, scorecard is scored. 
resolving [18:50:09] 10SRE-OnFire (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) [18:51:13] 10SRE-OnFire (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) [18:55:37] 10SRE, 10SRE-OnFire (FY2021/2022-Q2), 10Sustainability (Incident Followup): Incident: 2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users - https://phabricator.wikimedia.org/T292792 (10lmata) [18:56:20] 10SRE-OnFire (FY2021/2022-Q2): Incident: 2021-10-22 eqiad return path timeouts - https://phabricator.wikimedia.org/T295152 (10lmata) [18:57:01] 10SRE-OnFire (FY2021/2022-Q2): Incident: 2021-10-25 s3 db recentchanges replica - https://phabricator.wikimedia.org/T295154 (10lmata) [18:57:09] 10SRE-OnFire (FY2021/2022-Q2): Incident: 2021-10-29 graphite - https://phabricator.wikimedia.org/T295157 (10lmata) [18:57:37] 10SRE-OnFire (FY2021/2022-Q2), 10cloud-services-team (Kanban): Incident: 2021-11-02 Cloud VPS networking - https://phabricator.wikimedia.org/T299964 (10lmata) [18:57:45] 10SRE-OnFire (FY2021/2022-Q2): Incident: 2021-11-04 large file upload timeouts - https://phabricator.wikimedia.org/T299965 (10lmata) [18:57:52] 10SRE-OnFire (FY2021/2022-Q2): Incident: 2021-11-05 TOC language converter - https://phabricator.wikimedia.org/T299966 (10lmata) [18:58:02] 10SRE-OnFire (FY2021/2022-Q2): Incident: 2021-11-10 cirrussearch commonsfile outage - https://phabricator.wikimedia.org/T299967 (10lmata) [18:58:10] 10SRE-OnFire (FY2021/2022-Q2): Incident: 2021-11-18 codfw ipv6 network - https://phabricator.wikimedia.org/T299968 (10lmata) [18:58:21] 10SRE-OnFire (FY2021/2022-Q2): Incident: 2021-11-25 eventgate-main outage - https://phabricator.wikimedia.org/T299970 (10lmata) [18:58:34] 10SRE-OnFire (FY2021/2022-Q2): Incident: 2021-11-23 Core Network Routing - 
https://phabricator.wikimedia.org/T299969 (10lmata) [18:59:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26914 and previous config saved to /var/cache/conftool/dbconfig/20220428-185920-ladsgroup.json [18:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P26915 and previous config saved to /var/cache/conftool/dbconfig/20220428-190059-ladsgroup.json [19:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:02:19] 10SRE, 10SRE-swift-storage, 10ops-codfw: upgrade firmware on ms-be2040 - https://phabricator.wikimedia.org/T306988 (10Papaul) 05Open→03Resolved This is complete [19:02:22] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10Papaul) [19:05:13] (03CR) 10Dzahn: "thanks!"
[puppet] - 10https://gerrit.wikimedia.org/r/787521 (https://phabricator.wikimedia.org/T306989) (owner: 10Papaul) [19:11:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10Cmjohnson) [19:12:45] (03PS13) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [19:13:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10Cmjohnson) 05Open→03Resolved these have all been installed [19:14:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26917 and previous config saved to /var/cache/conftool/dbconfig/20220428-191425-ladsgroup.json [19:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:41] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [19:16:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P26918 and previous config saved to /var/cache/conftool/dbconfig/20220428-191604-ladsgroup.json [19:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:24] (03PS1) 10Dzahn: admin: revoke prod key of phedenskog [puppet] - 10https://gerrit.wikimedia.org/r/787551 [19:21:32] (03PS1) 10Gergő Tisza: Disable change tag test, broke CI [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787552 [19:21:51] (03CR) 10Dzahn: [C: 03+2] admin: revoke prod key of phedenskog [puppet] - 10https://gerrit.wikimedia.org/r/787551 (owner: 10Dzahn) [19:22:27]
(03PS14) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [19:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:22:45] (03PS2) 10Gergő Tisza: Video landing page: Don't show campaign body text on mobile [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787459 (https://phabricator.wikimedia.org/T303785) [19:23:14] 10SRE, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (10Cmjohnson) [19:24:15] 10SRE, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (10Cmjohnson) 05Open→03Resolved the BIOS has virtualization enabled and the OS is installed. DC ops tasks have been completed.
[19:24:21] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [19:29:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T306560)', diff saved to https://phabricator.wikimedia.org/P26919 and previous config saved to /var/cache/conftool/dbconfig/20220428-192930-ladsgroup.json [19:29:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [19:29:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [19:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:37] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [19:29:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T306560)', diff saved to https://phabricator.wikimedia.org/P26920 and previous config saved to /var/cache/conftool/dbconfig/20220428-192938-ladsgroup.json [19:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:21] (03PS15) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [19:31:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298558)', diff saved to https://phabricator.wikimedia.org/P26921 and previous config saved to /var/cache/conftool/dbconfig/20220428-193109-ladsgroup.json [19:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:16] T298558: 
Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [19:32:12] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [19:50:01] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:50:37] (03PS16) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [19:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:52:35] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [19:59:16] (03PS1) 10Gergő Tisza: Video landing page: Record campaign parameter for control users [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787461 (https://phabricator.wikimedia.org/T303785) [20:00:05] brennen: #bothumor I � Unicode. All rise for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220428T2000). [20:00:05] tgr: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:31] o/ [20:00:41] I can deploy [20:01:19] unless you are doing training brennen [20:02:27] tgr: go ahead, i don't think we've got anybody for training [20:03:53] (03PS2) 10Gergő Tisza: [beta] GrowthExperiments: Enable AddLink where it's enabled in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787522 (https://phabricator.wikimedia.org/T306833) [20:04:04] (03CR) 10Gergő Tisza: [C: 03+2] [beta] GrowthExperiments: Enable AddLink where it's enabled in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787522 (https://phabricator.wikimedia.org/T306833) (owner: 10Gergő Tisza) [20:04:08] (03CR) 10Gergő Tisza: [C: 03+2] Disable change tag test, broke CI [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787552 (owner: 10Gergő Tisza) [20:04:12] (03CR) 10Gergő Tisza: [C: 03+2] Video landing page: Don't show campaign body text on mobile [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787459 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [20:04:49] (03Merged) 10jenkins-bot: [beta] GrowthExperiments: Enable AddLink where it's enabled in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787522 (https://phabricator.wikimedia.org/T306833) (owner: 10Gergő Tisza) [20:05:10] (03CR) 10Gergő Tisza: [C: 03+2] Video landing page: Record campaign parameter for control users [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787461 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [20:06:35] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [20:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:10] !log no trainee attendees for backport & config session; tgr self-serving some patches, calling end of training window [20:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:01] !log mwdebug-deploy@deploy1002 helmfile 
[eqiad] START helmfile.d/services/mwdebug: apply [20:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:08] tgr: looks like you need to fix a phpcs lint error [20:09:57] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:09:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:09:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:11] oh, duh. [20:11:44] is it just me or is new gerrit kinda broken? [20:11:54] sometimes I just get an empty page with a header [20:12:41] (03PS2) 10Gergő Tisza: Video landing page: Record campaign parameter for control users [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787461 (https://phabricator.wikimedia.org/T303785) [20:12:56] (03CR) 10Gergő Tisza: [C: 03+2] "phpcs fix" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787461 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [20:13:55] (03CR) 10Dzahn: "I see your comments but that code isn't really from this change." 
[puppet] - 10https://gerrit.wikimedia.org/r/785226 (owner: 10Dzahn) [20:19:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (10Cmjohnson) [20:21:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wqds101[4,5,6] - https://phabricator.wikimedia.org/T307138 (10RobH) [20:22:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wqds101[4,5,6] - https://phabricator.wikimedia.org/T307138 (10RobH) [20:23:42] (03PS17) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [20:25:37] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [20:29:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T306560)', diff saved to https://phabricator.wikimedia.org/P26922 and previous config saved to /var/cache/conftool/dbconfig/20220428-202952-ladsgroup.json [20:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:00] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [20:30:19] (03CR) 10Dzahn: [C: 03+2] "mailed Peter about this" [puppet] - 10https://gerrit.wikimedia.org/r/787551 (owner: 10Dzahn) [20:30:52] (03CR) 10jerkins-bot: [V: 04-1] Disable change tag test, broke CI [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787552 (owner: 10Gergő Tisza) [20:30:58] (03CR) 10jerkins-bot: [V: 04-1] Video landing page: Don't show campaign body text on mobile [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - 
10https://gerrit.wikimedia.org/r/787459 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [20:31:04] (03CR) 10jerkins-bot: [V: 04-1] Video landing page: Record campaign parameter for control users [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787461 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [20:34:26] 10SRE, 10Infrastructure-Foundations, 10netops: Configure cloudsw1-e4-eqiad and cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T304936 (10RobH) https://netbox.wikimedia.org/extras/reports/results/2954579/ cloudsw1-e4-eqiad - missing Netbox device from LibreNMS of role asw [20:34:49] PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:35:30] (03PS18) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [20:35:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Cmjohnson) @Volans @Papaul Have you had a chance to look into the partman recipe for this? [20:37:37] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [20:44:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P26923 and previous config saved to /var/cache/conftool/dbconfig/20220428-204457-ladsgroup.json [20:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:50] thcipriani: Hello? 
[20:47:46] I was on the street, and now I'm back late. Can you tell me if you can deploy now? [20:48:20] 10ops-eqiad, 10DC-Ops: Recycling Pickup for EQIAD - https://phabricator.wikimedia.org/T307140 (10wiki_willy) [20:49:43] (03PS4) 10Juan90264: Fix: Enable '$wgCopyUploadsDomains' to viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785208 (https://phabricator.wikimedia.org/T303577) [20:50:12] Juan_90264: the backport window is over in 10 minutes, I don't have the time to start deploying patches now. Please join the next available window. [20:51:08] thcipriani: No problem, thanks for letting me know! [20:51:20] thanks :) [20:51:34] I just wanted to be sure [20:54:55] (03PS19) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [20:54:57] I will probably run over with the window anyway, if it's a config change, I can deploy it. [20:55:44] not sure what's changed but recently, GrowthExperiments backports can take over an hour just to merge :/ [20:55:58] Juan_90264: ^ [20:56:54] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [21:00:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P26924 and previous config saved to /var/cache/conftool/dbconfig/20220428-210002-ladsgroup.json [21:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:47] (03CR) 10Gergő Tisza: [C: 03+2] "Random Parsoid CI error, retrying." [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787552 (owner: 10Gergő Tisza) [21:04:05] (03CR) 10Gergő Tisza: [V: 03+2 C: 03+2] "On second thought, this is test-only so safe to force." 
[extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787552 (owner: 10Gergő Tisza) [21:07:12] (03CR) 10Gergő Tisza: [V: 03+2 C: 03+2] "Trivial change, passed the normal test job, only failed because of a random CI issue with parent. Forcing to avoid another 40min wait for " [extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787461 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [21:07:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [21:07:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [21:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [21:08:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [21:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:10] (03CR) 10Gergő Tisza: [C: 04-2] "Probably not needed after all, and the backport is taking way too long already, so putting this aside for now." 
[extensions/GrowthExperiments] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787459 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [21:11:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:12:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:53] RECOVERY - Check systemd state on elastic2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:13:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:42] (03PS7) 10Dzahn: docker: ensure apparmor package is installed if on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/785226 [21:14:19] (03CR) 10Ottomata: [C: 03+1] docker: ensure apparmor package is installed if on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/785226 (owner: 10Dzahn) [21:15:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T306560)', diff saved to https://phabricator.wikimedia.org/P26925 and previous config saved to /var/cache/conftool/dbconfig/20220428-211507-ladsgroup.json [21:15:11] (03CR) 10Dzahn: docker: ensure apparmor package is installed if on bullseye (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/785226 (owner: 10Dzahn) [21:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:15] T306560: Fix nullability of img_major_mime and oi_major_mime - 
https://phabricator.wikimedia.org/T306560 [21:15:19] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [21:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [21:17:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [21:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T306560)', diff saved to https://phabricator.wikimedia.org/P26926 and previous config saved to /var/cache/conftool/dbconfig/20220428-211727-ladsgroup.json [21:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:45] (03CR) 10Muehlenhoff: docker: ensure apparmor package is installed if on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/785226 (owner: 10Dzahn) [21:19:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T306560)', diff saved to https://phabricator.wikimedia.org/P26927 and previous config saved to /var/cache/conftool/dbconfig/20220428-211934-ladsgroup.json [21:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:09] 10SRE-OnFire (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) Aggregated scores for FY2021-2022 Q2 |**Incident**|**Score (15 max)**| |**Q2 - Average Incident Engagement Score**|**6.1**| |T292792: Incident: 2021-10-07 netwo... 
[21:23:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [21:23:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [21:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P26928 and previous config saved to /var/cache/conftool/dbconfig/20220428-212331-ladsgroup.json [21:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:41] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:24:39] 10SRE-OnFire (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) 05Open→03In progress p:05Triage→03Medium [21:25:45] 10SRE-OnFire (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) 05In progress→03Resolved Resolving task for now but we need a better mechanism to store these aggregated scores. 
[21:26:51] !log tgr@deploy1002 Synchronized php-1.39.0-wmf.9/extensions/GrowthExperiments/includes/VariantHooks.php: Backport: [[gerrit:787461|Video landing page: Record campaign parameter for control users (T303785)]] (duration: 00m 54s) [21:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:58] T303785: Account creation: social media landing pages - https://phabricator.wikimedia.org/T303785 [21:27:26] !log UTC late deploys done [21:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:16] 10SRE-OnFire (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) 05Resolved→03Open Some entries seem to be missing, reopening to amend and update tally [21:34:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P26929 and previous config saved to /var/cache/conftool/dbconfig/20220428-213440-ladsgroup.json [21:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P26930 and previous config saved to /var/cache/conftool/dbconfig/20220428-213507-ladsgroup.json [21:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:13] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:35:24] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (10Dzahn) Also 
see T297140#7886240 where 2 new namespaces were added, one for developer-portal and this over here... [21:36:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wqds101[4,5,6] - https://phabricator.wikimedia.org/T307138 (10RobH) [21:38:48] 10SRE-OnFire (FY2021/2022-Q2): Incident: 2021-10-06: esams - https://phabricator.wikimedia.org/T307145 (10lmata) [21:40:18] 10SRE-OnFire (FY2021/2022-Q2): Incident: 2021-10-06: esams - https://phabricator.wikimedia.org/T307145 (10lmata) private - scorecard complete [21:40:34] 10SRE-OnFire (FY2021/2022-Q2): Incident: 2021-10-06: esams - https://phabricator.wikimedia.org/T307145 (10lmata) 05Open→03Resolved a:03lmata [21:44:07] 10SRE-OnFire (FY2021/2022-Q2): Incident: 2021-10-06: esams eqsin - https://phabricator.wikimedia.org/T307146 (10lmata) [21:44:51] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:46:36] 10SRE-OnFire (FY2021/2022-Q2): Incident: 2021-10-06: esams eqsin - https://phabricator.wikimedia.org/T307146 (10lmata) 05Open→03Resolved a:03lmata scorecard published. [21:47:27] 10SRE-OnFire (FY2021/2022-Q2): Incident: 2021-10-14_Heavy_outbound_traffic - https://phabricator.wikimedia.org/T307149 (10lmata) [21:49:24] 10SRE: phedenskog uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T307079 (10Dzahn) I did https://gerrit.wikimedia.org/r/c/operations/puppet/+/787551 and talked to Peter about this. He will send a new key tomorrow or so.
[21:49:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P26931 and previous config saved to /var/cache/conftool/dbconfig/20220428-214945-ladsgroup.json [21:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P26932 and previous config saved to /var/cache/conftool/dbconfig/20220428-215012-ladsgroup.json [21:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:31] 10SRE-OnFire (FY2021/2022-Q2): Incident: 2021-10-14_Heavy_outbound_traffic - https://phabricator.wikimedia.org/T307149 (10lmata) scorecard complete [21:52:04] 10SRE-OnFire (FY2021/2022-Q2): Incident: 2021-11-19_esams - https://phabricator.wikimedia.org/T307150 (10lmata) [21:52:11] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:52:35] 10SRE-OnFire (FY2021/2022-Q2): Incident: 2021-11-21_upload-lb - https://phabricator.wikimedia.org/T307151 (10lmata) [21:53:10] 10SRE-OnFire (FY2021/2022-Q2): Incident: 2021-11-29_esams - https://phabricator.wikimedia.org/T307152 (10lmata) [21:55:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [21:55:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [21:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T306560)', diff saved to https://phabricator.wikimedia.org/P26933 and previous config saved to /var/cache/conftool/dbconfig/20220428-215547-ladsgroup.json [21:55:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:57] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [21:59:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T306560)', diff saved to https://phabricator.wikimedia.org/P26934 and previous config saved to /var/cache/conftool/dbconfig/20220428-215902-ladsgroup.json [21:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:20] 10SRE-OnFire (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) [22:02:00] 10SRE, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10Dzahn) @Papaul I removed you from the ticket and any tags related to dcops though. Still an issue? [22:04:12] (03CR) 10BryanDavis: [C: 03+1] "Very much trusting Timo's testing, but seems worth trying to me. 
Worst case is we mess up some pages on officewiki and learn something in " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780636 (https://phabricator.wikimedia.org/T305782) (owner: 10MarcoAurelio) [22:04:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T306560)', diff saved to https://phabricator.wikimedia.org/P26935 and previous config saved to /var/cache/conftool/dbconfig/20220428-220450-ladsgroup.json [22:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:57] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [22:05:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P26936 and previous config saved to /var/cache/conftool/dbconfig/20220428-220517-ladsgroup.json [22:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:13] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, and 2 others: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 (10Dzahn) namespace image-suggestion has now been created on all 4 clusters, staging and production. [22:14:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P26937 and previous config saved to /var/cache/conftool/dbconfig/20220428-221407-ladsgroup.json [22:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:59] 10SRE-OnFire (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) >>! In T292254#7889424, @lmata wrote: > Some entries seem to be missing, reopening to amend and update tally the table above has been amended and updated with a... 
[22:18:45] 10SRE-OnFire (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) [22:20:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host aqs2001.mgmt.codfw.wmnet with reboot policy FORCED [22:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P26938 and previous config saved to /var/cache/conftool/dbconfig/20220428-222022-ladsgroup.json [22:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:28] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:20:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [22:20:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [22:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P26939 and previous config saved to /var/cache/conftool/dbconfig/20220428-222035-ladsgroup.json [22:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:01] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Data-Engineering, 10Event-Platform, and 2 others: Incident: Banner sampling leading to a relatively wide site outage (mostly esams) - 
https://phabricator.wikimedia.org/T303036 (10lmata) [22:21:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host aqs2002.mgmt.codfw.wmnet with reboot policy FORCED [22:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:22] 10SRE-OnFire, 10Data-Persistence (Consultation), 10Platform Engineering, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Incident: 2022-03-10 MediaWiki availability affected due to a database query processing slowdown affecting most of the rest ... - https://phabricator.wikimedia.org/T303499 [22:24:11] 10SRE-OnFire (FY2021/2022-Q3): incidents occurring in Q3 have been scored with the scorecard - https://phabricator.wikimedia.org/T299977 (10lmata) [22:24:20] (03PS6) 10MarcoAurelio: Enable $wgFixDoubleRedirects on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780636 (https://phabricator.wikimedia.org/T305782) [22:27:44] (03CR) 10MarcoAurelio: "PS 6 is a manual rebase; please re-check as these are tricky." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/780636 (https://phabricator.wikimedia.org/T305782) (owner: 10MarcoAurelio) [22:29:02] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Data-Engineering, 10Event-Platform, and 2 others: Incident: 2022-03-4 Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (10lmata) [22:29:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P26940 and previous config saved to /var/cache/conftool/dbconfig/20220428-222912-ladsgroup.json [22:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:13] 10SRE-OnFire (FY2021/2022-Q3), 10Data-Persistence (Consultation), 10Platform Engineering, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Incident: 2022-03-10 MediaWiki availability affected due to a database query processing slowdown affecting ... - https://phabricator.wikimedia.org/T303499 [22:31:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P26941 and previous config saved to /var/cache/conftool/dbconfig/20220428-223145-ladsgroup.json [22:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:52] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:32:17] 10SRE-OnFire (FY2021/2022-Q3): Incident: 2022-02-01_ulsfo_network - https://phabricator.wikimedia.org/T307154 (10lmata) [22:34:45] 10SRE-OnFire (FY2021/2022-Q3): Incident: 2022-02-06_wdqs_updater - https://phabricator.wikimedia.org/T307156 (10lmata) [22:36:32] 10SRE-OnFire (FY2021/2022-Q3): Incident: 
2022-02-10_Envoy_overflow - https://phabricator.wikimedia.org/T307157 (10lmata) [22:37:34] 10SRE-OnFire (FY2021/2022-Q3): Incident: eqiad-eqord saturation - https://phabricator.wikimedia.org/T307158 (10lmata) [22:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:38:07] 10SRE-OnFire (FY2021/2022-Q3): Incident: 2022-02-22_vrts - https://phabricator.wikimedia.org/T307159 (10lmata) [22:38:45] 10SRE-OnFire (FY2021/2022-Q3): Incident: 2022-02-22 wdqs updater codfw - https://phabricator.wikimedia.org/T307160 (10lmata) [22:39:16] 10SRE-OnFire (FY2021/2022-Q3): Incident: 2022-03-01_ulsfo_network - https://phabricator.wikimedia.org/T307161 (10lmata) [22:40:11] 10SRE-OnFire (FY2021/2022-Q3): Incident: 2022-03-06 wdqs categories - https://phabricator.wikimedia.org/T307162 (10lmata) [22:44:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T306560)', diff saved to https://phabricator.wikimedia.org/P26942 and previous config saved to /var/cache/conftool/dbconfig/20220428-224417-ladsgroup.json [22:44:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [22:44:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [22:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:26] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [22:44:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T306560)', diff saved to https://phabricator.wikimedia.org/P26943 and previous config saved to /var/cache/conftool/dbconfig/20220428-224426-ladsgroup.json 
[22:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T306560)', diff saved to https://phabricator.wikimedia.org/P26944 and previous config saved to /var/cache/conftool/dbconfig/20220428-224540-ladsgroup.json [22:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:01] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:46:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P26945 and previous config saved to /var/cache/conftool/dbconfig/20220428-224650-ladsgroup.json [22:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs2001.mgmt.codfw.wmnet with reboot policy FORCED [22:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs2002.mgmt.codfw.wmnet with reboot policy FORCED [22:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host aqs2003.mgmt.codfw.wmnet with reboot policy FORCED [22:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host aqs2004.mgmt.codfw.wmnet with reboot policy FORCED [22:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[22:58:23] (03CR) 10Cwhite: "It may be worth outlining the problem and starting a discussion on Phabricator. I don't know enough about the issue we hope to solve to w" [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [23:00:38] (03PS6) 10Cwhite: logstash: set partition on legacy indexes [puppet] - 10https://gerrit.wikimedia.org/r/777880 (https://phabricator.wikimedia.org/T305175) [23:00:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P26946 and previous config saved to /var/cache/conftool/dbconfig/20220428-230045-ladsgroup.json [23:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:57] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:01:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P26947 and previous config saved to /var/cache/conftool/dbconfig/20220428-230156-ladsgroup.json [23:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:55] (03CR) 10Krinkle: [C: 03+1] Enable $wgFixDoubleRedirects on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780636 (https://phabricator.wikimedia.org/T305782) (owner: 10MarcoAurelio) [23:15:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P26948 and previous config saved to /var/cache/conftool/dbconfig/20220428-231550-ladsgroup.json [23:15:55] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P26949 and previous config saved to /var/cache/conftool/dbconfig/20220428-231701-ladsgroup.json [23:17:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs2004.mgmt.codfw.wmnet with reboot policy FORCED [23:17:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aqs2003.mgmt.codfw.wmnet with reboot policy FORCED [23:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:08] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:17:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [23:17:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [23:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P26950 and previous config saved to /var/cache/conftool/dbconfig/20220428-231714-ladsgroup.json [23:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:30] !log pt1979@cumin2002 
START - Cookbook sre.hosts.provision for host aqs2005.mgmt.codfw.wmnet with reboot policy FORCED [23:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:42] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host aqs2005.mgmt.codfw.wmnet with reboot policy FORCED [23:17:42] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aqs2005.mgmt.codfw.wmnet with reboot policy FORCED [23:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host aqs2005.mgmt.codfw.wmnet with reboot policy FORCED [23:17:56] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host aqs2005.mgmt.codfw.wmnet with reboot policy FORCED [23:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host aqs2006.mgmt.codfw.wmnet with reboot policy FORCED [23:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:28:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P26951 and previous config saved to /var/cache/conftool/dbconfig/20220428-232805-ladsgroup.json [23:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:12] T298565: Fix mismatching field type of user table for columns user_email_authenticated, 
user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:30:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T306560)', diff saved to https://phabricator.wikimedia.org/P26952 and previous config saved to /var/cache/conftool/dbconfig/20220428-233055-ladsgroup.json [23:30:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [23:30:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [23:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:02] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [23:31:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T306560)', diff saved to https://phabricator.wikimedia.org/P26953 and previous config saved to /var/cache/conftool/dbconfig/20220428-233103-ladsgroup.json [23:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T306560)', diff saved to https://phabricator.wikimedia.org/P26954 and previous config saved to /var/cache/conftool/dbconfig/20220428-233317-ladsgroup.json [23:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:19] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (19) node(s) change every puppet run: contint1001, contint2001, ms-be1068, 
ms-be1069, ms-be1070, ms-be1071, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, releases1002, releases2002, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23 [23:40:19] ppet_run_changes [23:41:28] (03CR) 10Cwhite: [C: 03+2] logstash: set partition on legacy indexes [puppet] - 10https://gerrit.wikimedia.org/r/777880 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [23:42:07] !log dzahn@deploy1002 helmfile [staging] START helmfile.d/services/push-notifications: apply [23:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P26955 and previous config saved to /var/cache/conftool/dbconfig/20220428-234310-ladsgroup.json [23:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:30] !log dzahn@deploy1002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [23:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:53] !log dzahn@deploy1002 helmfile [codfw] START helmfile.d/services/push-notifications: apply [23:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P26956 and previous config saved to /var/cache/conftool/dbconfig/20220428-234822-ladsgroup.json [23:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:34] !log dzahn@deploy1002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply [23:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:19] !log dzahn@deploy1002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply [23:49:23] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:55] !log dzahn@deploy1002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply [23:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:54:35] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:54:57] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:58:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P26957 and previous config saved to /var/cache/conftool/dbconfig/20220428-235815-ladsgroup.json [23:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log