[00:17:54] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:43:28] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:38:00] (JobUnavailable) firing: (3) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:40:22] (JobUnavailable) firing: (3) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[03:22:04] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:16:42] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:23:16] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:18:04] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:30:08] (PS1) Ladsgroup: db1156: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/762982 (https://phabricator.wikimedia.org/T300510)
[05:31:39] (CR) Ladsgroup: [V: +2 C: +2] db1156: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/762982 (https://phabricator.wikimedia.org/T300510) (owner: Ladsgroup)
[05:43:00] (JobUnavailable) firing: (2) Reduced availability for job
etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[05:44:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[05:44:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[05:44:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[05:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[05:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[05:46:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[05:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[05:47:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[05:47:41] !log ladsgroup@cumin1001 START
- Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[05:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[05:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:48] sorry for the spam
[05:47:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T300510)', diff saved to https://phabricator.wikimedia.org/P20852 and previous config saved to /var/cache/conftool/dbconfig/20220216-054749-ladsgroup.json
[05:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:56] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510
[05:52:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1156.eqiad.wmnet with OS bullseye
[05:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1156.eqiad.wmnet with reason: host reimage
[06:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1156.eqiad.wmnet with reason: host reimage
[06:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:29] (PS1) Ladsgroup: Clean up flaggedtemplate rows for deleted pages too [extensions/FlaggedRevs] (wmf/1.38.0-wmf.20) - https://gerrit.wikimedia.org/r/762911 (https://phabricator.wikimedia.org/T296380)
[06:12:37] (CR)
Ladsgroup: [C: +2] Clean up flaggedtemplate rows for deleted pages too [extensions/FlaggedRevs] (wmf/1.38.0-wmf.20) - https://gerrit.wikimedia.org/r/762911 (https://phabricator.wikimedia.org/T296380) (owner: Ladsgroup)
[06:13:28] (PS1) Marostegui: db1123: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/763102
[06:14:10] (CR) Marostegui: [C: +2] db1123: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/763102 (owner: Marostegui)
[06:16:10] (Merged) jenkins-bot: Clean up flaggedtemplate rows for deleted pages too [extensions/FlaggedRevs] (wmf/1.38.0-wmf.20) - https://gerrit.wikimedia.org/r/762911 (https://phabricator.wikimedia.org/T296380) (owner: Ladsgroup)
[06:19:01] (CR) Ladsgroup: "I needed to backport this to wmf.21 :face-palm:" [extensions/FlaggedRevs] (wmf/1.38.0-wmf.20) - https://gerrit.wikimedia.org/r/762911 (https://phabricator.wikimedia.org/T296380) (owner: Ladsgroup)
[06:19:10] (PS1) Ladsgroup: Clean up flaggedtemplate rows for deleted pages too [extensions/FlaggedRevs] (wmf/1.38.0-wmf.21) - https://gerrit.wikimedia.org/r/762912 (https://phabricator.wikimedia.org/T296380)
[06:19:15] (CR) Ladsgroup: [C: +2] Clean up flaggedtemplate rows for deleted pages too [extensions/FlaggedRevs] (wmf/1.38.0-wmf.21) - https://gerrit.wikimedia.org/r/762912 (https://phabricator.wikimedia.org/T296380) (owner: Ladsgroup)
[06:20:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1156.eqiad.wmnet with OS bullseye
[06:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:21:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:21:34] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[06:22:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[06:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:54] (Merged) jenkins-bot: Clean up flaggedtemplate rows for deleted pages too [extensions/FlaggedRevs] (wmf/1.38.0-wmf.21) - https://gerrit.wikimedia.org/r/762912 (https://phabricator.wikimedia.org/T296380) (owner: Ladsgroup)
[06:24:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[06:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[06:25:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[06:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T300510)', diff saved to https://phabricator.wikimedia.org/P20853 and previous config saved to /var/cache/conftool/dbconfig/20220216-062610-ladsgroup.json
[06:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:16] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510
[06:26:53] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.21/extensions/FlaggedRevs/maintenance/pruneRevData.php:
Backport: [[gerrit:762912|Clean up flaggedtemplate rows for deleted pages too (T296380)]] (duration: 00m 52s)
[06:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:59] T296380: flaggedtemplates table is still too big - https://phabricator.wikimedia.org/T296380
[06:27:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[06:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[06:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[06:33:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[06:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[06:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P20854 and previous config saved to /var/cache/conftool/dbconfig/20220216-064115-ladsgroup.json
[06:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:16] (Abandoned) Legoktm: shellbox-timeline: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/751919 (owner: PipelineBot)
[06:51:18] (Abandoned) Legoktm: shellbox: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/751918 (owner: PipelineBot)
[06:51:20] (Abandoned) Legoktm: shellbox-constraints:
pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/751914 (owner: PipelineBot)
[06:51:22] (Abandoned) Legoktm: shellbox-timeline: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/759722 (owner: PipelineBot)
[06:51:24] (Abandoned) Legoktm: shellbox: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/759719 (owner: PipelineBot)
[06:51:26] (Abandoned) Legoktm: shellbox-constraints: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/759718 (owner: PipelineBot)
[06:52:49] SRE, Security-Team, Performance-Team (Radar), Security: Security API Storage Needs - https://phabricator.wikimedia.org/T301428 (Joe) As a general rule, storing large files inside containers is a bad idea. MaxMind files are small enough that they can fit into an image without increasing its size s...
[06:53:35] (PS1) Elukey: logstash::input::kafka: allow a custom truststore path [puppet] - https://gerrit.wikimedia.org/r/763110 (https://phabricator.wikimedia.org/T300130)
[06:56:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P20855 and previous config saved to /var/cache/conftool/dbconfig/20220216-065620-ladsgroup.json
[06:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1133.eqiad.wmnet with OS bullseye
[07:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:22] (JobUnavailable) firing: (3) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[07:06:26] (PS1) Elukey: profile::logstash::beta: move to profile::base::certificate's truststore [puppet] -
https://gerrit.wikimedia.org/r/763113 (https://phabricator.wikimedia.org/T300130)
[07:07:33] (PS3) Legoktm: Use $wgGroupInheritsPermissions for "confirmed" group [mediawiki-config] - https://gerrit.wikimedia.org/r/747977 (https://phabricator.wikimedia.org/T275334)
[07:10:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[07:10:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[07:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T300510)', diff saved to https://phabricator.wikimedia.org/P20856 and previous config saved to /var/cache/conftool/dbconfig/20220216-071125-ladsgroup.json
[07:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:30] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510
[07:12:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1133.eqiad.wmnet with reason: host reimage
[07:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:45] (PS1) Ladsgroup: Revert "db1156: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/762913
[07:14:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1133.eqiad.wmnet with reason: host reimage
[07:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:55] (PS2) Ladsgroup: Revert "db1156: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/762913
[07:15:10] (CR) Ladsgroup: [V: +2 C: +2] Revert "db1156: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/762913 (owner:
Ladsgroup)
[07:16:44] (PS2) Elukey: logstash::input::kafka: allow a custom truststore path [puppet] - https://gerrit.wikimedia.org/r/763110 (https://phabricator.wikimedia.org/T300130)
[07:16:46] (PS2) Elukey: profile::logstash::beta: move to profile::base::certificate's truststore [puppet] - https://gerrit.wikimedia.org/r/763113 (https://phabricator.wikimedia.org/T300130)
[07:18:05] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33814/console" [puppet] - https://gerrit.wikimedia.org/r/763110 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[07:20:22] (JobUnavailable) firing: (3) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[07:21:54] (PS3) Elukey: logstash::input::kafka: allow a custom truststore path [puppet] - https://gerrit.wikimedia.org/r/763110 (https://phabricator.wikimedia.org/T300130)
[07:21:56] (PS3) Elukey: profile::logstash::beta: move to profile::base::certificate's truststore [puppet] - https://gerrit.wikimedia.org/r/763113 (https://phabricator.wikimedia.org/T300130)
[07:22:41] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33815/console" [puppet] - https://gerrit.wikimedia.org/r/763110 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[07:24:43] (PS4) Elukey: logstash::input::kafka: allow a custom truststore path [puppet] - https://gerrit.wikimedia.org/r/763110 (https://phabricator.wikimedia.org/T300130)
[07:24:45] (PS4) Elukey: profile::logstash::beta: move to profile::base::certificate's truststore [puppet] - https://gerrit.wikimedia.org/r/763113 (https://phabricator.wikimedia.org/T300130)
[07:28:22] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 3):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33816/console" [puppet] - https://gerrit.wikimedia.org/r/763110 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[07:30:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1133.eqiad.wmnet with OS bullseye
[07:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:58] (CR) Elukey: "Folks not sure what instance runs profile::logstash::beta, I looked for something in deployment-prep and I didn't find much. If it is in a" [puppet] - https://gerrit.wikimedia.org/r/763113 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[07:36:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[07:36:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[07:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:07] (PS1) Urbanecm: Deploy Growth features to 100% of newcomers on most Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/763171 (https://phabricator.wikimedia.org/T301820)
[07:38:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 10%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20857 and previous config saved to /var/cache/conftool/dbconfig/20220216-073818-root.json
[07:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:26] (PS1) Elukey: profile::logstash::production: use base truststore [puppet] - https://gerrit.wikimedia.org/r/763172 (https://phabricator.wikimedia.org/T300130)
[07:39:24] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 2):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33817/console" [puppet] - https://gerrit.wikimedia.org/r/763172 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[07:41:46] (Abandoned) Giuseppe Lavagetto: Rakefile: rationalize task arguments treatment [deployment-charts] - https://gerrit.wikimedia.org/r/757976 (owner: Giuseppe Lavagetto)
[07:41:55] (PS2) Elukey: profile::logstash::production: use base truststore [puppet] - https://gerrit.wikimedia.org/r/763172 (https://phabricator.wikimedia.org/T300130)
[07:42:40] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33818/console" [puppet] - https://gerrit.wikimedia.org/r/763172 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[07:45:41] (PS2) Giuseppe Lavagetto: mwdebug: switch to using sockets for fcgi proxying [deployment-charts] - https://gerrit.wikimedia.org/r/709986
[07:46:02] (Abandoned) Giuseppe Lavagetto: mwdebug: switch to using sockets for fcgi proxying [deployment-charts] - https://gerrit.wikimedia.org/r/709986 (owner: Giuseppe Lavagetto)
[07:51:09] (PS1) Legoktm: shellbox: Update to 2022-02-04-153221 [deployment-charts] - https://gerrit.wikimedia.org/r/763175 (https://phabricator.wikimedia.org/T298399)
[07:53:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20858 and previous config saved to /var/cache/conftool/dbconfig/20220216-075321-root.json
[07:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:18] (PS3) Elukey: profile::logstash::production: use base truststore [puppet] - https://gerrit.wikimedia.org/r/763172 (https://phabricator.wikimedia.org/T300130)
[07:55:50] (PS1) Ladsgroup: db1146: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/763177
(https://phabricator.wikimedia.org/T300510)
[07:56:07] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33819/console" [puppet] - https://gerrit.wikimedia.org/r/763172 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[07:57:13] (PS2) Ladsgroup: db1146: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/763177 (https://phabricator.wikimedia.org/T300510)
[07:57:21] (CR) Ladsgroup: [V: +2 C: +2] db1146: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/763177 (https://phabricator.wikimedia.org/T300510) (owner: Ladsgroup)
[07:57:55] (CR) Elukey: profile::logstash::beta: move to profile::base::certificate's truststore (1 comment) [puppet] - https://gerrit.wikimedia.org/r/763113 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[08:00:05] Amir1, awight, and Urbanecm: Your horoscope predicts another unfortunate UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220216T0800).
[08:00:05] legoktm: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:22] hello
[08:01:01] Hey legoktm
[08:02:07] legoktm: I can deploy today.
[08:02:16] ty!
[08:02:29] this one should be easy to test, we just look at Special:ListGroupRights
[08:03:22] (CR) Urbanecm: [C: +2] Use $wgGroupInheritsPermissions for "confirmed" group [mediawiki-config] - https://gerrit.wikimedia.org/r/747977 (https://phabricator.wikimedia.org/T275334) (owner: Legoktm)
[08:03:52] Good to know.
[08:04:12] (Merged) jenkins-bot: Use $wgGroupInheritsPermissions for "confirmed" group [mediawiki-config] - https://gerrit.wikimedia.org/r/747977 (https://phabricator.wikimedia.org/T275334) (owner: Legoktm)
[08:05:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[08:05:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[08:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T300510)', diff saved to https://phabricator.wikimedia.org/P20859 and previous config saved to /var/cache/conftool/dbconfig/20220216-080531-ladsgroup.json
[08:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:38] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510
[08:06:32] let me know when it's on mwdebug
[08:06:34] legoktm: pulled to mwdebug1001
[08:06:37] :D
[08:07:13] lgtm on enwp, let me check one more wiki
[08:07:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T300510)', diff saved to https://phabricator.wikimedia.org/P20860 and previous config saved to /var/cache/conftool/dbconfig/20220216-080717-ladsgroup.json
[08:07:19] lgtm on cswp too
[08:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:50] lgtm on commons
[08:07:53] (CR) Giuseppe Lavagetto: [C: +1] helmfiles: log helmfile deploy only once in SAL [deployment-charts] - https://gerrit.wikimedia.org/r/760524 (owner: Jelto)
[08:08:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20861 and previous config
saved to /var/cache/conftool/dbconfig/20220216-080825-root.json
[08:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:39] urbanecm: go for it
[08:08:43] legoktm: logstash's happy too. so, syncing
[08:09:35] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 9001a8ce7d94408c9af072d4743e2cc9ab25abbe: Use $wgGroupInheritsPermissions for "confirmed" group (T275334; 1/2) (duration: 00m 51s)
[08:09:36] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts prometheus1004.eqiad.wmnet
[08:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[08:10:41] T275334: Changing user groups from $wgExtensionFunctions no longer works reliably - https://phabricator.wikimedia.org/T275334
[08:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[08:10:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[08:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T300381)', diff saved to https://phabricator.wikimedia.org/P20862 and previous config saved to /var/cache/conftool/dbconfig/20220216-081056-marostegui.json
[08:11:37] check-and-restart-php takes some time today...
[08:11:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[08:11:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[08:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:31] (CR) Muehlenhoff: [C: +1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/758986 (https://phabricator.wikimedia.org/T300682) (owner: Dduvall)
[08:12:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[08:13:02] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[08:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:15] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 9001a8ce7d94408c9af072d4743e2cc9ab25abbe: Use $wgGroupInheritsPermissions for "confirmed" group (T275334; 2/2) (duration: 03m 39s)
[08:13:18] legoktm: should be live now. anything else?
[08:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:41] nope, that's it for today. thanks!!
[08:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:55] I really like the new time for this window
[08:13:58] no problem!
Thanks for removing one instance of an extension function
[08:14:04] and I'm glad you like the new schedule too
[08:14:35] if you have suggestions for how to move forward on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/710136 that would be appreciated
[08:15:20] wrt extension functions, i plan to do T112147 next week (already tech news announced) to remove the oversight/suppress hack in our config
[08:15:20] T112147: Rename the oversight group on WMF projects to the MediaWiki standard (whatever that is) - https://phabricator.wikimedia.org/T112147
[08:16:01] * legoktm reads up
[08:17:56] nice
[08:18:15] urbanecm: if we haven't run it in a while, would be good to check locally that migrateUserGroup.php still works properly
[08:18:24] yup, will do too :)
[08:18:40] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts prometheus1004.eqiad.wmnet
[08:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:16] specifically I wonder if User::invalidateCache() is still good enough these days, or if there's some UserGroupManager cache that needs invalidation too
[08:20:14] (PS1) Marostegui: add_tl_target_id_T300775.py: Add ALGORITHM=COPY [software/schema-changes] - https://gerrit.wikimedia.org/r/763183 (https://phabricator.wikimedia.org/T300775)
[08:20:47] (CR) Filippo Giunchedi: [C: +2] cr: remove prometheus[12]00[34] from ACLs [homer/public] - https://gerrit.wikimedia.org/r/762827 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi)
[08:21:07] (CR) Muehlenhoff: [C: +2] aptrepo: add docker packages to thirdparty/ci for buster [puppet] - https://gerrit.wikimedia.org/r/758986 (https://phabricator.wikimedia.org/T300682) (owner: Dduvall)
[08:23:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20863 and previous config saved to
/var/cache/conftool/dbconfig/20220216-082329-root.json
[08:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:48] legoktm: commented on the other patch with some thoughts i have about it. hope it helps :)
[08:26:27] (PS4) Majavah: dynamicproxy: manage dns in the api [puppet] - https://gerrit.wikimedia.org/r/762871 (https://phabricator.wikimedia.org/T295246)
[08:26:29] ty, will look tomorrow
[08:26:41] (CR) Ladsgroup: [C: +2] "magic" [software/schema-changes] - https://gerrit.wikimedia.org/r/763183 (https://phabricator.wikimedia.org/T300775) (owner: Marostegui)
[08:27:10] Amir1: i like that +2 comment :)
[08:27:30] (Merged) jenkins-bot: add_tl_target_id_T300775.py: Add ALGORITHM=COPY [software/schema-changes] - https://gerrit.wikimedia.org/r/763183 (https://phabricator.wikimedia.org/T300775) (owner: Marostegui)
[08:27:35] marostegui does magical stuff. I'm just stating the facts ;)
[08:29:31] (CR) Elukey: ml-services: add arwiki & bnwiki editquality isvcs (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/762533 (https://phabricator.wikimedia.org/T301415) (owner: Accraze)
[08:31:16] (CR) Elukey: "Hey Kevin, can you rebase your change on top of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/762533 ?
It is currently n" [deployment-charts] - 10https://gerrit.wikimedia.org/r/762777 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [08:33:01] !log restarting blazegraph on wdqs1005 (jvm stuck for 4hours) [08:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:13] (03PS1) 10Muehlenhoff: Add missing update config for thirdparty/docker-ci-buster [puppet] - 10https://gerrit.wikimedia.org/r/763184 (https://phabricator.wikimedia.org/T300682) [08:34:53] (03PS1) 10ArielGlenn: Add Atieno (PET manager) to approvers for snapshot/dumps-related access requests [puppet] - 10https://gerrit.wikimedia.org/r/763185 [08:36:31] 10ops-eqiad, 10decommission-hardware: decommission prometheus1005.eqiad.wmnet - https://phabricator.wikimedia.org/T301851 (10fgiunchedi) [08:37:01] 10ops-eqiad, 10decommission-hardware: decommission prometheus1004.eqiad.wmnet - https://phabricator.wikimedia.org/T301851 (10fgiunchedi) [08:37:37] 10ops-codfw, 10decommission-hardware: decommission prometheus2004.codfw.wmnet - https://phabricator.wikimedia.org/T301852 (10fgiunchedi) [08:38:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20864 and previous config saved to /var/cache/conftool/dbconfig/20220216-083832-root.json [08:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:52] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10fgiunchedi) [08:39:06] !log Set an email for developer account Osnard and re-enable it (T301796) [08:39:30] (03CR) 10Muehlenhoff: [C: 03+2] Add missing update config for thirdparty/docker-ci-buster [puppet] - 10https://gerrit.wikimedia.org/r/763184 (https://phabricator.wikimedia.org/T300682) (owner: 10Muehlenhoff) [08:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:42] 
T301796: Password reset for Toolforge Account - https://phabricator.wikimedia.org/T301796 [08:42:36] (03CR) 10ArielGlenn: [C: 03+2] Add Atieno (PET manager) to approvers for snapshot/dumps-related access requests [puppet] - 10https://gerrit.wikimedia.org/r/763185 (owner: 10ArielGlenn) [08:54:43] (03CR) 10Elukey: ml-services: add arwiki & bnwiki editquality isvcs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/762533 (https://phabricator.wikimedia.org/T301415) (owner: 10Accraze) [08:57:28] (03PS4) 10Jelto: helmfiles: log helmfile deploy only once in SAL [deployment-charts] - 10https://gerrit.wikimedia.org/r/760524 [09:00:04] hashar and jeena: #bothumor I � Unicode. All rise for MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220216T0900). [09:00:34] still in a meeting, will do the train after [09:01:57] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [09:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:46] (03CR) 10Jelto: "amended SAL logging change to all other helmfiles in patch set 4." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/760524 (owner: 10Jelto) [09:06:37] (03PS1) 10Hashar: group1 wikis to 1.38.0-wmf.22 refs T300198 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763187 [09:06:39] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.38.0-wmf.22 refs T300198 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763187 (owner: 10Hashar) [09:06:43] (03PS2) 10Kevin Bazira: ml-services: add bswiki & cawiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/762777 (https://phabricator.wikimedia.org/T301415) [09:07:14] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:21] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.22 refs T300198 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763187 (owner: 10Hashar) [09:07:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T300510)', diff saved to https://phabricator.wikimedia.org/P20865 and previous config saved to /var/cache/conftool/dbconfig/20220216-090737-ladsgroup.json [09:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:42] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [09:08:37] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.22 refs T300198 [09:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:42] T300198: 1.38.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T300198 [09:08:57] (03CR) 10Kevin Bazira: ml-services: add bswiki & cawiki editquality isvcs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/762777 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [09:09:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'T300510', diff saved to https://phabricator.wikimedia.org/P20866 and previous 
config saved to /var/cache/conftool/dbconfig/20220216-090924-ladsgroup.json [09:09:27] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.22 refs T300198 (duration: 00m 49s) [09:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:10] (03PS1) 10Joal: Temporarily disable traffic data purge [puppet] - 10https://gerrit.wikimedia.org/r/763189 (https://phabricator.wikimedia.org/T300164) [09:13:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [09:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [09:14:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [09:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [09:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:23] (03CR) 10Elukey: [C: 03+2] "Checked all the swift model URIs, checked the diff from CI, LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/762777 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [09:19:10] 10SRE, 10MediaWiki-extensions-PropertySuggester, 10Wikidata, 10wdwb-tech, 10Service-deployment-requests: New Service Request SchemaTree - https://phabricator.wikimedia.org/T301471 (10Joe) Hi, if this service is to be used in the WMF production environment (and given the call graph, it will), it needs to... [09:20:44] (03CR) 10DCausse: "thanks!" 
[alerts] - 10https://gerrit.wikimedia.org/r/762902 (https://phabricator.wikimedia.org/T289077) (owner: 10Ebernhardson) [09:23:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1146.eqiad.wmnet with OS bullseye [09:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:28] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [09:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:06] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [09:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:45] (03CR) 10Volans: [C: 04-1] "I had forgot to add a comment on the docs. (the -1 is there still just to wait for T276589#7420124 )" [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [09:26:06] (03CR) 10Elukey: [C: 03+1] Temporarily disable traffic data purge [puppet] - 10https://gerrit.wikimedia.org/r/763189 (https://phabricator.wikimedia.org/T300164) (owner: 10Joal) [09:28:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T300381)', diff saved to https://phabricator.wikimedia.org/P20867 and previous config saved to /var/cache/conftool/dbconfig/20220216-092832-marostegui.json [09:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:38] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [09:35:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1146.eqiad.wmnet with reason: host reimage [09:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:33] (03CR) 10DCausse: search-platform: Port alerts from icinga (031 
comment) [alerts] - 10https://gerrit.wikimedia.org/r/762902 (https://phabricator.wikimedia.org/T289077) (owner: 10Ebernhardson) [09:37:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1146.eqiad.wmnet with reason: host reimage [09:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:52] (03CR) 10Jbond: [C: 03+1] "LGTM thx" [puppet] - 10https://gerrit.wikimedia.org/r/762823 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [09:43:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P20868 and previous config saved to /var/cache/conftool/dbconfig/20220216-094337-marostegui.json [09:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:07] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [09:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:27] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [09:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:37] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [09:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:08] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael.hay - https://phabricator.wikimedia.org/T301782 (10MMandere) 05Open→03In progress [09:52:22] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[09:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1146.eqiad.wmnet with OS bullseye [09:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:20] I count just 5 errors, a couple SQL deadlocks and 3 sql queries timing out [09:54:23] so probably not a big deal [09:55:47] (03CR) 10Btullis: [C: 03+2] Temporarily disable traffic data purge [puppet] - 10https://gerrit.wikimedia.org/r/763189 (https://phabricator.wikimedia.org/T300164) (owner: 10Joal) [09:55:59] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:57:13] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:58:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P20869 and previous config saved to /var/cache/conftool/dbconfig/20220216-095841-marostegui.json [09:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:07] 10SRE, 10MediaWiki-extensions-PropertySuggester, 10Wikidata, 10wdwb-tech, 10Service-deployment-requests: New Service Request SchemaTree - https://phabricator.wikimedia.org/T301471 (10Michaelcochez) @Joe we have created files for blubber before, I assume what is needed is very similar to that? I am not s... [10:11:40] 10SRE-OnFire (FY2021/2022-Q2): 2021-11-25 eventgate-main outage - https://phabricator.wikimedia.org/T299970 (10akosiaris) 05Open→03Resolved metadata and scorecard filled. Resolving this. 
[10:13:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T300381)', diff saved to https://phabricator.wikimedia.org/P20870 and previous config saved to /var/cache/conftool/dbconfig/20220216-101346-marostegui.json [10:13:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance [10:13:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance [10:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:52] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [10:13:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T300381)', diff saved to https://phabricator.wikimedia.org/P20871 and previous config saved to /var/cache/conftool/dbconfig/20220216-101354-marostegui.json [10:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:17] 10SRE-OnFire (FY2021/2022-Q3): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10akosiaris) [10:15:19] 10SRE-OnFire (FY2021/2022-Q2): 2021-11-25 eventgate-main outage - https://phabricator.wikimedia.org/T299970 (10akosiaris) [10:15:36] 10SRE-OnFire (FY2021/2022-Q3): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10akosiaris) [10:18:38] (03PS1) 10Majavah: codfw1dev network tests: update tools-codfw1dev bastion [puppet] - 10https://gerrit.wikimedia.org/r/763194 [10:20:49] !log installing expat security updates [10:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:03] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T300510)', diff saved to https://phabricator.wikimedia.org/P20872 and previous config saved to /var/cache/conftool/dbconfig/20220216-102302-ladsgroup.json [10:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:08] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [10:31:03] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [10:31:04] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [10:31:05] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [10:31:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] codfw1dev network tests: update tools-codfw1dev bastion [puppet] - 10https://gerrit.wikimedia.org/r/763194 (owner: 10Majavah) [10:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:12] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [10:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:17] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [10:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:18] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [10:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:17] (03PS1) 10Filippo Giunchedi: am: link alerts to their Icinga 
web page [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/763197 (https://phabricator.wikimedia.org/T300859) [10:38:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P20873 and previous config saved to /var/cache/conftool/dbconfig/20220216-103807-ladsgroup.json [10:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: update default aptly-host for wmcs-package-build.py [puppet] - 10https://gerrit.wikimedia.org/r/762961 (owner: 10BryanDavis) [10:43:25] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [10:43:26] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [10:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:25] (03PS25) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [10:44:53] (03CR) 10Jbond: "updated, comments/questions inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [10:45:43] (03PS26) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [10:48:30] (03PS1) 10MMandere: admin: Change Kinneret username [puppet] - 10https://gerrit.wikimedia.org/r/763200 (https://phabricator.wikimedia.org/T301098) [10:48:35] (03CR) 10Jelto: [C: 03+2] gitlab: move gitlab test instance to wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/762823 
(https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [10:51:46] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/763200 (https://phabricator.wikimedia.org/T301098) (owner: 10MMandere) [10:53:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P20875 and previous config saved to /var/cache/conftool/dbconfig/20220216-105312-ladsgroup.json [10:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:44] (03CR) 10David Caro: [C: 03+1] "LGTM: did not test it though, let me know if you want a more thorough check (ex. running it and testing that it shows up)." [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/763197 (https://phabricator.wikimedia.org/T300859) (owner: 10Filippo Giunchedi) [10:54:56] (03CR) 10jerkins-bot: [V: 04-1] reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [10:55:10] (03CR) 10MMandere: [C: 03+2] admin: Change Kinneret username [puppet] - 10https://gerrit.wikimedia.org/r/763200 (https://phabricator.wikimedia.org/T301098) (owner: 10MMandere) [10:55:34] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [10:55:36] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [10:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:41] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T300774)', diff saved to https://phabricator.wikimedia.org/P20877 and previous config saved to /var/cache/conftool/dbconfig/20220216-105540-kormat.json [10:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:45] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:46] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [10:57:49] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T300774)', diff saved to https://phabricator.wikimedia.org/P20878 and previous config saved to /var/cache/conftool/dbconfig/20220216-105748-kormat.json [10:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:24] (03PS2) 10Filippo Giunchedi: am: link alerts to their Icinga web page [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/763197 (https://phabricator.wikimedia.org/T300859) [11:04:06] (03PS1) 10Ladsgroup: ParserOutputAccess: Cache Parsing inside the class as well [core] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/762914 (https://phabricator.wikimedia.org/T301310) [11:04:19] (03PS1) 10Ladsgroup: ParserOutputAccess: Cache Parsing inside the class as well [core] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/762915 (https://phabricator.wikimedia.org/T301310) [11:04:32] jouncebot: nowandnext [11:04:32] No deployments scheduled for the next 2 hour(s) and 55 minute(s) [11:04:32] In 2 hour(s) and 55 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220216T1400) [11:04:38] niiice [11:04:45] (03CR) 10Ladsgroup: [C: 03+2] ParserOutputAccess: Cache Parsing inside the class as well [core] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/762914 (https://phabricator.wikimedia.org/T301310) (owner: 10Ladsgroup) [11:04:48] (03CR) 10Ladsgroup: [C: 03+2] ParserOutputAccess: Cache Parsing inside the class as well [core] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/762915 (https://phabricator.wikimedia.org/T301310) (owner: 10Ladsgroup) [11:07:15] !log restarting apache on prometheus nodes to pick up expat security updates [11:07:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:56] (03CR) 10Filippo Giunchedi: "Thank you David for the quick review!" [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/763197 (https://phabricator.wikimedia.org/T300859) (owner: 10Filippo Giunchedi) [11:08:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T300510)', diff saved to https://phabricator.wikimedia.org/P20879 and previous config saved to /var/cache/conftool/dbconfig/20220216-110816-ladsgroup.json [11:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:22] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [11:12:54] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P20880 and previous config saved to /var/cache/conftool/dbconfig/20220216-111253-kormat.json [11:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:14] (03PS1) 10Muehlenhoff: Prometheus: Add conftool::scripts [puppet] - 10https://gerrit.wikimedia.org/r/763201 [11:19:44] (03CR) 10jerkins-bot: [V: 04-1] ParserOutputAccess: Cache Parsing inside the class as well [core] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/762914 (https://phabricator.wikimedia.org/T301310) (owner: 10Ladsgroup) [11:20:43] (03PS1) 10Arturo Borrero Gonzalez: cmd-checklist-runner: refresh code [puppet] - 10https://gerrit.wikimedia.org/r/763202 [11:21:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T300510)', diff saved to https://phabricator.wikimedia.org/P20881 and previous config saved to /var/cache/conftool/dbconfig/20220216-112145-ladsgroup.json [11:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:53] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [11:22:15] (03CR) 10Filippo Giunchedi: [C: 03+1] 
Prometheus: Add conftool::scripts [puppet] - 10https://gerrit.wikimedia.org/r/763201 (owner: 10Muehlenhoff) [11:23:00] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [11:23:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T300381)', diff saved to https://phabricator.wikimedia.org/P20882 and previous config saved to /var/cache/conftool/dbconfig/20220216-112326-marostegui.json [11:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:32] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [11:24:50] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael.hay - https://phabricator.wikimedia.org/T301782 (10MMandere) [11:27:29] (03PS1) 10Volans: setup.py: temporary limit redis library [software/spicerack] - 10https://gerrit.wikimedia.org/r/763203 [11:27:59] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P20883 and previous config saved to /var/cache/conftool/dbconfig/20220216-112758-kormat.json [11:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:09] (03PS1) 10Marostegui: db2088: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/763204 (https://phabricator.wikimedia.org/T301848) [11:30:53] (03CR) 10Marostegui: [C: 03+2] db2088: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/763204 (https://phabricator.wikimedia.org/T301848) (owner: 10Marostegui) [11:36:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P20884 and previous config saved to 
/var/cache/conftool/dbconfig/20220216-113650-ladsgroup.json [11:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P20885 and previous config saved to /var/cache/conftool/dbconfig/20220216-113831-marostegui.json [11:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:09] (03PS1) 10Marostegui: add_tl_target_id_T300775.py: Change downtime time [software/schema-changes] - 10https://gerrit.wikimedia.org/r/763205 (https://phabricator.wikimedia.org/T300775) [11:39:46] (03CR) 10Marostegui: [C: 03+2] add_tl_target_id_T300775.py: Change downtime time [software/schema-changes] - 10https://gerrit.wikimedia.org/r/763205 (https://phabricator.wikimedia.org/T300775) (owner: 10Marostegui) [11:40:09] (03Merged) 10jenkins-bot: add_tl_target_id_T300775.py: Change downtime time [software/schema-changes] - 10https://gerrit.wikimedia.org/r/763205 (https://phabricator.wikimedia.org/T300775) (owner: 10Marostegui) [11:40:13] (03CR) 10Jbond: [C: 03+1] setup.py: temporary limit redis library [software/spicerack] - 10https://gerrit.wikimedia.org/r/763203 (owner: 10Volans) [11:40:50] (03CR) 10Volans: [C: 03+2] setup.py: temporary limit redis library [software/spicerack] - 10https://gerrit.wikimedia.org/r/763203 (owner: 10Volans) [11:41:48] (03PS1) 10Majavah: P:mariadb::cloudinfra: add prometheus mariadb exporter [puppet] - 10https://gerrit.wikimedia.org/r/763206 [11:43:03] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T300774)', diff saved to https://phabricator.wikimedia.org/P20886 and previous config saved to /var/cache/conftool/dbconfig/20220216-114303-kormat.json [11:43:05] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [11:43:06] !log kormat@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [11:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:09] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [11:43:11] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T300774)', diff saved to https://phabricator.wikimedia.org/P20887 and previous config saved to /var/cache/conftool/dbconfig/20220216-114310-kormat.json [11:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:26] (03PS2) 10Majavah: P:mariadb::cloudinfra: add prometheus exporter grants [puppet] - 10https://gerrit.wikimedia.org/r/763206 [11:47:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1011.eqiad.wmnet [11:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:11] (03Merged) 10jenkins-bot: setup.py: temporary limit redis library [software/spicerack] - 10https://gerrit.wikimedia.org/r/763203 (owner: 10Volans) [11:51:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P20888 and previous config saved to /var/cache/conftool/dbconfig/20220216-115155-ladsgroup.json [11:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P20889 and previous config saved to /var/cache/conftool/dbconfig/20220216-115336-marostegui.json [11:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
ganeti1011.eqiad.wmnet [11:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:36] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1011.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [11:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:11] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T300774)', diff saved to https://phabricator.wikimedia.org/P20890 and previous config saved to /var/cache/conftool/dbconfig/20220216-115711-kormat.json [11:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:16] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [11:57:16] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [11:57:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1011.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [11:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:56] (03PS27) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [12:05:49] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Kiron Lebeck (klebeck-tmlt) - https://phabricator.wikimedia.org/T301680 (10MMandere) [12:06:10] !log configure ganeti1024/ganeti1027/ganeti1028 as master candidates for eqiad Ganeti cluster [12:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T300510)', diff saved to https://phabricator.wikimedia.org/P20891 and previous config saved to /var/cache/conftool/dbconfig/20220216-120659-ladsgroup.json [12:07:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:06] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [12:08:18] (03CR) 10jerkins-bot: [V: 04-1] reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [12:08:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T300381)', diff saved to https://phabricator.wikimedia.org/P20892 and previous config saved to /var/cache/conftool/dbconfig/20220216-120840-marostegui.json [12:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:46] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [12:08:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2079.codfw.wmnet with reason: Maintenance [12:08:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2079.codfw.wmnet with reason: Maintenance [12:08:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance [12:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance [12:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:39] (03CR) 10Cathal Mooney: [C: 03+2] Adjust CR Internal Anycast BGP Templates (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/762807 (https://phabricator.wikimedia.org/T301165) (owner: 10Cathal Mooney) [12:12:16] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P20893 and previous config saved to /var/cache/conftool/dbconfig/20220216-121215-kormat.json [12:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:29] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Adjust CR Internal Anycast BGP Templates [homer/public] - 10https://gerrit.wikimedia.org/r/762807 (https://phabricator.wikimedia.org/T301165) (owner: 10Cathal Mooney) [12:15:15] (03PS1) 10Marostegui: phabricator_instance.my.cnf.erb: innodb_adaptive_hash_index=OFF [puppet] - 10https://gerrit.wikimedia.org/r/763213 (https://phabricator.wikimedia.org/T268869) [12:17:19] (03CR) 10Kosta Harlan: [C: 03+1] Deploy Growth features to 100% of newcomers on most Wikipedias (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763171 (https://phabricator.wikimedia.org/T301820) (owner: 10Urbanecm) [12:18:49] (03CR) 10Marostegui: [C: 03+2] phabricator_instance.my.cnf.erb: innodb_adaptive_hash_index=OFF [puppet] - 10https://gerrit.wikimedia.org/r/763213 (https://phabricator.wikimedia.org/T268869) (owner: 10Marostegui) [12:19:19] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Kiron Lebeck (klebeck-tmlt) - https://phabricator.wikimedia.org/T301680 (10MMandere) 05Open→03In progress Hi there @JBennett @milimetric , please help approve @Klebeck-tmlt request. 
[12:19:27] (03CR) 10Urbanecm: Deploy Growth features to 100% of newcomers on most Wikipedias (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763171 (https://phabricator.wikimedia.org/T301820) (owner: 10Urbanecm) [12:22:41] (03PS1) 10Jbond: WIP: Early start on firmware cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 [12:24:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:mariadb::cloudinfra: add prometheus exporter grants [puppet] - 10https://gerrit.wikimedia.org/r/763206 (owner: 10Majavah) [12:25:34] (03CR) 10Kosta Harlan: [C: 03+1] Deploy Growth features to 100% of newcomers on most Wikipedias (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763171 (https://phabricator.wikimedia.org/T301820) (owner: 10Urbanecm) [12:25:47] jouncebot: next [12:25:47] In 1 hour(s) and 34 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220216T1400) [12:27:21] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P20894 and previous config saved to /var/cache/conftool/dbconfig/20220216-122720-kormat.json [12:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:32] (03PS1) 10Phuedx: Validate Metrics Platform Client configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763216 (https://phabricator.wikimedia.org/T299916) [12:33:21] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Tom Magerlein - https://phabricator.wikimedia.org/T301679 (10MMandere) [12:33:39] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Tom Magerlein - https://phabricator.wikimedia.org/T301679 (10MMandere) Hi there @JBennett @Milimetric , please help approve @Tom_Magerlein request. 
[12:42:25] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T300774)', diff saved to https://phabricator.wikimedia.org/P20895 and previous config saved to /var/cache/conftool/dbconfig/20220216-124225-kormat.json [12:42:26] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [12:42:28] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [12:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:31] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [12:42:33] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T300774)', diff saved to https://phabricator.wikimedia.org/P20896 and previous config saved to /var/cache/conftool/dbconfig/20220216-124232-kormat.json [12:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cmd-checklist-runner: refresh code [puppet] - 10https://gerrit.wikimedia.org/r/763202 (owner: 10Arturo Borrero Gonzalez) [12:46:11] !log installing apache-log4j1.2 security updates [12:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damiendf - https://phabricator.wikimedia.org/T301659 (10MMandere) [12:51:28] (03PS5) 10Minato826: Enable RelatedArticles for desktop (non-mobile) view at zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762761 (https://phabricator.wikimedia.org/T299856) [12:54:21] (03CR) 10Volans: "replies inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 
(https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [12:56:02] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damiendf - https://phabricator.wikimedia.org/T301659 (10MMandere) 05Open→03In progress Hi there @JBennett @Milimetric , please help approve @Damiendf request. [12:56:35] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Tom Magerlein - https://phabricator.wikimedia.org/T301679 (10MMandere) 05Open→03In progress [13:00:45] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T300774)', diff saved to https://phabricator.wikimedia.org/P20897 and previous config saved to /var/cache/conftool/dbconfig/20220216-130044-kormat.json [13:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:51] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [13:04:48] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (10MMandere) Hi there @JBennett , please help approve @skyenet request. 
[13:05:11] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (10MMandere) 05Open→03In progress [13:08:27] 10SRE, 10Wikimedia-Etherpad, 10serviceops: Prometheus etherpad scrape failure - https://phabricator.wikimedia.org/T301872 (10fgiunchedi) [13:10:37] (03PS1) 10Cathal Mooney: Add CR router IPv6 loopbacks to Bird config for esams [puppet] - 10https://gerrit.wikimedia.org/r/763222 (https://phabricator.wikimedia.org/T301165) [13:12:20] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [13:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:41] (03CR) 10JMeybohm: [C: 03+2] Sync cfssl-issuer app and chart versions to latest release [deployment-charts] - 10https://gerrit.wikimedia.org/r/762405 (owner: 10JMeybohm) [13:15:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [13:15:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [13:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:50] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P20898 and previous config saved to /var/cache/conftool/dbconfig/20220216-131549-kormat.json [13:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:53] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:30] (03Merged) 10jenkins-bot: Sync cfssl-issuer app and chart versions to latest release [deployment-charts] - 10https://gerrit.wikimedia.org/r/762405 (owner: 10JMeybohm) [13:19:45] (03CR) 
10JMeybohm: [C: 03+1] "This LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/760524 (owner: 10Jelto) [13:21:03] !log jayme@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [13:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:16] !log jayme@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [13:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance [13:23:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance [13:23:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:23:14] !log jayme@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[13:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T300775)', diff saved to https://phabricator.wikimedia.org/P20899 and previous config saved to /var/cache/conftool/dbconfig/20220216-132322-marostegui.json [13:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:34] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [13:23:54] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:43] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10akosiaris) [13:27:17] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [13:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:57] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[13:28:00] 10SRE-swift-storage: swift-ring-builder exits 1 (WARN) on unhandled exceptions - https://phabricator.wikimedia.org/T301875 (10MatthewVernon) [13:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:07] !log jayme@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [13:28:10] 10SRE-swift-storage: swift-ring-builder exits 1 (WARN) on unhandled exceptions - https://phabricator.wikimedia.org/T301875 (10MatthewVernon) 05Open→03Resolved [13:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:01] !log jayme@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [13:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:09] !log jayme@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [13:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:28] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] Enable RelatedArticles for desktop (non-mobile) view at zhwikinews (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762761 (https://phabricator.wikimedia.org/T299856) (owner: 10Minato826) [13:29:44] !log jayme@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[13:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:47] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:30:54] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P20900 and previous config saved to /var/cache/conftool/dbconfig/20220216-133054-kormat.json [13:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:55] (03PS1) 104nn1l2: InitialiseSettings: General cleanup, wgAddGroups (J-P) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763225 (https://phabricator.wikimedia.org/T301647) [13:35:00] (03PS6) 10Minato826: Enable RelatedArticles for desktop (non-mobile) view at zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762761 (https://phabricator.wikimedia.org/T299856) [13:38:17] (03CR) 10Minato826: Enable RelatedArticles for desktop (non-mobile) view at zhwikinews (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762761 (https://phabricator.wikimedia.org/T299856) (owner: 10Minato826) [13:40:12] (03CR) 104nn1l2: "Again, sorry for the rather late submit. I'll try to do it sooner tomorrow." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/763225 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [13:43:43] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:44:43] (03CR) 10Ssingh: [C: 03+1] Add CR router IPv6 loopbacks to Bird config for esams [puppet] - 10https://gerrit.wikimedia.org/r/763222 (https://phabricator.wikimedia.org/T301165) (owner: 10Cathal Mooney) [13:45:47] PROBLEM - Check systemd state on etherpad1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-etherpad-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:59] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T300774)', diff saved to https://phabricator.wikimedia.org/P20901 and previous config saved to /var/cache/conftool/dbconfig/20220216-134559-kormat.json [13:46:01] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [13:46:03] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [13:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:04] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:46:05] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [13:46:08] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:12] !log kormat@cumin1001 dbctl commit 
(dc=all): 'Depooling db1165 (T300774)', diff saved to https://phabricator.wikimedia.org/P20902 and previous config saved to /var/cache/conftool/dbconfig/20220216-134612-kormat.json [13:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:05] 10SRE-Access-Requests: Requesting access to the production cluster as a deployer for HANNAH OKWELUM - https://phabricator.wikimedia.org/T301876 (10Hokwelum) [13:48:29] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] InitialiseSettings: General cleanup, wgAddGroups (J-P) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763225 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [13:49:19] 10SRE, 10Wikimedia-Etherpad, 10serviceops: Prometheus etherpad scrape failure - https://phabricator.wikimedia.org/T301872 (10akosiaris) Hm, this shows up in the logs ` Feb 16 13:40:29 etherpad1003 prometheus-etherpad-exporter[805482]: UnboundLocalError: local variable 'metric_name' referenced before assignm... 
[13:50:21] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T300774)', diff saved to https://phabricator.wikimedia.org/P20903 and previous config saved to /var/cache/conftool/dbconfig/20220216-135021-kormat.json [13:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:43] (03Abandoned) 10Ssingh: Add Wikidough's IPv6 anycast network in esams [homer/public] - 10https://gerrit.wikimedia.org/r/761364 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [13:50:53] (03CR) 10Cathal Mooney: [C: 03+2] Add CR router IPv6 loopbacks to Bird config for esams [puppet] - 10https://gerrit.wikimedia.org/r/763222 (https://phabricator.wikimedia.org/T301165) (owner: 10Cathal Mooney) [13:50:54] 10SRE-Access-Requests: Requesting access to the production cluster as a deployer for HANNAH OKWELUM - https://phabricator.wikimedia.org/T301876 (10ArielGlenn) [13:52:05] (03PS1) 10Alexandros Kosiaris: Make sure that metric_name is defined in all cases [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/763231 (https://phabricator.wikimedia.org/T301872) [13:53:33] 10SRE, 10Wikimedia-Etherpad, 10serviceops, 10Patch-For-Review: Prometheus etherpad scrape failure - https://phabricator.wikimedia.org/T301872 (10akosiaris) There seem to be quite a few new metrics around. ` { "httpStartTime": 1644710413335, "memoryUsage": 328769536, "memoryUsageHeap": 170679824,... [13:55:08] 10SRE-Access-Requests: Requesting access to the production cluster as a deployer for HANNAH OKWELUM - https://phabricator.wikimedia.org/T301876 (10ArielGlenn) shell account name: hokwelum [13:55:16] (03PS1) 10Ssingh: Set anycast_neighbors for Wikidough IPv6 in esams [homer/public] - 10https://gerrit.wikimedia.org/r/763233 (https://phabricator.wikimedia.org/T301165) [13:58:06] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! Thanks." 
[homer/public] - 10https://gerrit.wikimedia.org/r/763233 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [13:58:52] (03CR) 10Cathal Mooney: [C: 03+2] Set anycast_neighbors for Wikidough IPv6 in esams [homer/public] - 10https://gerrit.wikimedia.org/r/763233 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [13:59:27] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Set anycast_neighbors for Wikidough IPv6 in esams [homer/public] - 10https://gerrit.wikimedia.org/r/763233 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [14:00:05] Lucas_WMDE and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220216T1400). [14:00:05] anoop and nn1l2: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:11] hi [14:00:29] I’m in a meeting unfortunately [14:00:30] hello [14:00:33] might be able to deploy later [14:00:47] but if anyone else can deploy, please do [14:03:15] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:05:26] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P20905 and previous config saved to /var/cache/conftool/dbconfig/20220216-140526-kormat.json [14:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:50] (03CR) 10Giuseppe Lavagetto: k8s: add module (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [14:06:19] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/761373 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [14:08:17] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:10:20] ^ topranks and I are working on this, this is expected [14:11:13] (03PS3) 10Ssingh: hiera: add IPv6 support to Wikidough [puppet] - 10https://gerrit.wikimedia.org/r/761373 (https://phabricator.wikimedia.org/T301165) [14:11:37] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:12:27] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33822/console" [puppet] - 10https://gerrit.wikimedia.org/r/761373 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [14:13:07] ACKNOWLEDGEMENT - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast Cathal Mooney Bringing up v6 to doh and durum VMs. - The acknowledgement expires at: 2022-02-17 14:12:45. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:13:17] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:13:21] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: add IPv6 support to Wikidough [puppet] - 10https://gerrit.wikimedia.org/r/761373 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [14:13:22] ACKNOWLEDGEMENT - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast Cathal Mooney Bringing up v6 to doh and durum VMs. - The acknowledgement expires at: 2022-02-17 14:13:08. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:13:31] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@8991326]: (no justification provided) [14:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:39] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@8991326]: (no justification provided) (duration: 00m 07s) [14:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:07] RECOVERY - Check systemd state on etherpad1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:15] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:14:27] (KubernetesCalicoDown) firing: ml-serve2006.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [14:15:22] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [14:15:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance [14:15:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance [14:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T300381)', diff saved to https://phabricator.wikimedia.org/P20906 and previous config saved to /var/cache/conftool/dbconfig/20220216-141546-marostegui.json [14:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:54] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [14:16:20] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host elastic2073.mgmt.codfw.wmnet with reboot policy FORCED [14:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:53] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2073.mgmt.codfw.wmnet with reboot policy FORCED [14:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:48] 10SRE-Access-Requests: Requesting access to the production cluster as a deployer for HANNAH OKWELUM - https://phabricator.wikimedia.org/T301876 (10WDoranWMF) As Platform Engineering manager, I approve. 
[14:17:52] !log failover the ganeti master to ganeti1024 T296721 [14:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:56] T296721: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 [14:17:58] !log disabled puppet on all doh* hosts except doh3001 [14:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:03] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 121, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:18:59] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 88, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:19:26] (KubernetesCalicoDown) resolved: ml-serve2006.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [14:20:31] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P20907 and previous config saved to /var/cache/conftool/dbconfig/20220216-142030-kormat.json [14:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:17] !log migrate instances off ganeti1017 [14:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:23] (03PS28) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [14:22:21] PROBLEM - Check systemd state on doh3001 is CRITICAL: CRITICAL - degraded: The following units failed: bird6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:40] ^ expected [14:23:43] PROBLEM - ganeti-wconfd running on ganeti1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:26:33] 
10SRE-Access-Requests: Requesting access to the production cluster as a deployer for HANNAH OKWELUM - https://phabricator.wikimedia.org/T301876 (10ArielGlenn) >>! In T301876#7714786, @WDoranWMF wrote: > As Platform Engineering manager, I approve. Note that's the approval for the data.yaml side of things, we sti... [14:27:11] 10SRE, 10ops-codfw, 10Discovery-Search (Current work), 10Patch-For-Review: Degraded RAID on elastic2035 - https://phabricator.wikimedia.org/T298853 (10bking) Thanks Volans, I thought we manually worked around that. Will make a note to reach out next time. [14:29:04] 10SRE, 10ops-codfw, 10Discovery-Search (Current work), 10Patch-For-Review: Degraded RAID on elastic2035 - https://phabricator.wikimedia.org/T298853 (10Volans) >>! In T298853#7714808, @bking wrote: > Thanks Volans, I thought we manually worked around that. Will make a note to reach out next time. Worked ar... [14:29:18] RECOVERY - Check systemd state on doh3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:26] wb old friend [14:30:37] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:30:43] (03PS1) 10Alexandros Kosiaris: Add new metrics present in 1.8.16 [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/763238 (https://phabricator.wikimedia.org/T301872) [14:30:45] (03PS1) 10Alexandros Kosiaris: Release 0.6 prometheus-etherpad-exporter [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/763239 (https://phabricator.wikimedia.org/T301872) [14:31:08] (03CR) 10jerkins-bot: [V: 04-1] reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [14:33:36] 10SRE, 10Wikimedia-Etherpad, 10serviceops, 10Patch-For-Review: Prometheus 
etherpad scrape failure - https://phabricator.wikimedia.org/T301872 (10akosiaris) I've hotpatched this in production to stop the bleeding for now but the proper way to solve this is of course to add support for the new metrics and m... [14:35:35] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T300774)', diff saved to https://phabricator.wikimedia.org/P20908 and previous config saved to /var/cache/conftool/dbconfig/20220216-143535-kormat.json [14:35:37] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [14:35:38] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [14:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:41] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [14:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:01] 10SRE-Access-Requests: Requesting access to the production cluster as a deployer for HANNAH OKWELUM - https://phabricator.wikimedia.org/T301876 (10Atieno) As Hannah's manager, I approve. 
[14:42:44] (03PS29) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [14:42:56] (03CR) 10Jbond: reposync: add new class to manage syncing repositories (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [14:43:53] No deploys in this window, I guess :( [14:44:02] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Restarting to pick up Java security updates - hnowlan@cumin1001 [14:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:17] Let's ping Urbanecm. Can you deploy in this window? (only 15 mins left) [14:45:36] I'm also in a meeting, sorry :( [14:46:39] 10SRE, 10SRE-Access-Requests: Requesting access to the production cluster as a deployer for HANNAH OKWELUM - https://phabricator.wikimedia.org/T301876 (10ArielGlenn) [14:47:20] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [14:47:22] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [14:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:27] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T300774)', diff saved to https://phabricator.wikimedia.org/P20909 and previous config saved to /var/cache/conftool/dbconfig/20220216-144726-kormat.json [14:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:33] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [14:47:43] jouncebot: now [14:47:43] For the next 0 hour(s) and 12 minute(s): UTC afternoon backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220216T1400) [14:47:51] Okay, no worries, but please fix the timetable for tomorrow: https://wikitech.wikimedia.org/wiki/Talk:Deployments#Thurdsay I'm confused and don't know how to reschedule my patch for tomorrow [14:49:35] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T300774)', diff saved to https://phabricator.wikimedia.org/P20910 and previous config saved to /var/cache/conftool/dbconfig/20220216-144934-kormat.json [14:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:33] (03PS1) 10Ssingh: bird: use IPv6 address as router_id in bird6.conf [puppet] - 10https://gerrit.wikimedia.org/r/763244 (https://phabricator.wikimedia.org/T301165) [14:52:45] (03CR) 10jerkins-bot: [V: 04-1] reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [14:53:28] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33823/console" [puppet] - 10https://gerrit.wikimedia.org/r/763244 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [14:55:02] (03CR) 10Jelto: [C: 03+2] helmfiles: log helmfile deploy only once in SAL [deployment-charts] - 10https://gerrit.wikimedia.org/r/760524 (owner: 10Jelto) [14:56:24] (03CR) 10JMeybohm: [C: 03+1] k8s: add module (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: 10Giuseppe Lavagetto) [14:57:29] (03CR) 10Ssingh: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/33825/doh3001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/763244 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [14:58:02] nn1l2: I’ll try to fix it later (the script that generates the table was already fixed, so starting 
next week this shouldn’t happen again) [14:58:15] not enough time in the window left to do the deployment now, I’m afraid [14:58:25] no problem [14:58:47] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [14:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:05] (03Merged) 10jenkins-bot: helmfiles: log helmfile deploy only once in SAL [deployment-charts] - 10https://gerrit.wikimedia.org/r/760524 (owner: 10Jelto) [15:00:17] (03PS30) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [15:00:24] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/termbox: apply [15:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:54] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/termbox: apply [15:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:16] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply [15:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:03] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [15:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:40] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P20911 and previous config saved to /var/cache/conftool/dbconfig/20220216-150439-kormat.json [15:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:08] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - 
https://phabricator.wikimedia.org/T299608 (10Papaul) [15:09:41] (03CR) 10jerkins-bot: [V: 04-1] reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [15:11:00] (03PS1) 10Papaul: Add conting2002 and gerrit2002 to site.pp and netboot [puppet] - 10https://gerrit.wikimedia.org/r/763252 (https://phabricator.wikimedia.org/T299575) [15:12:05] (03CR) 10jerkins-bot: [V: 04-1] Add conting2002 and gerrit2002 to site.pp and netboot [puppet] - 10https://gerrit.wikimedia.org/r/763252 (https://phabricator.wikimedia.org/T299575) (owner: 10Papaul) [15:12:21] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:18:58] 10SRE, 10Wikimedia-Etherpad, 10serviceops, 10Patch-For-Review: Prometheus etherpad scrape failure - https://phabricator.wikimedia.org/T301872 (10akosiaris) p:05Triage→03Medium [15:19:44] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P20912 and previous config saved to /var/cache/conftool/dbconfig/20220216-151944-kormat.json [15:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:36] (03PS2) 10Papaul: Add conting2002 and gerrit2002 to site.pp and netboot [puppet] - 10https://gerrit.wikimedia.org/r/763252 (https://phabricator.wikimedia.org/T299575) [15:21:33] (03CR) 10jerkins-bot: [V: 04-1] Add conting2002 and gerrit2002 to site.pp and netboot [puppet] - 10https://gerrit.wikimedia.org/r/763252 (https://phabricator.wikimedia.org/T299575) (owner: 10Papaul) [15:25:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T300381)', diff saved to https://phabricator.wikimedia.org/P20913 and previous config saved to /var/cache/conftool/dbconfig/20220216-152529-marostegui.json [15:25:42] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:43] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [15:26:59] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:27:01] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:27:27] (KubernetesCalicoDown) firing: ml-serve2006.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:29:59] PROBLEM - Host cp1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:30:00] (03PS3) 10Papaul: Add conting2002 and gerrit2002 to site.pp and netboot [puppet] - 10https://gerrit.wikimedia.org/r/763252 (https://phabricator.wikimedia.org/T299575) [15:31:43] (03CR) 10jerkins-bot: [V: 04-1] Add conting2002 and gerrit2002 to site.pp and netboot [puppet] - 10https://gerrit.wikimedia.org/r/763252 (https://phabricator.wikimedia.org/T299575) (owner: 10Papaul) [15:31:47] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 88, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:31:47] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 121, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:32:27] (KubernetesCalicoDown) resolved: ml-serve2006.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:34:49] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T300774)', diff saved to https://phabricator.wikimedia.org/P20914 and previous config saved to 
/var/cache/conftool/dbconfig/20220216-153448-kormat.json [15:34:50] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [15:34:52] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [15:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:55] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [15:34:56] !log kormat@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T300774)', diff saved to https://phabricator.wikimedia.org/P20915 and previous config saved to /var/cache/conftool/dbconfig/20220216-153456-kormat.json [15:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:46] !log installing zsh security updates [15:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:21] RECOVERY - Host cp1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms [15:40:14] (03PS2) 10Ssingh: bird: use IPv6 address as router_id in bird6.conf [puppet] - 10https://gerrit.wikimedia.org/r/763244 (https://phabricator.wikimedia.org/T301165) [15:40:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P20916 and previous config saved to /var/cache/conftool/dbconfig/20220216-154037-marostegui.json [15:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:58] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33826/console" [puppet] - 10https://gerrit.wikimedia.org/r/763244 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [15:44:38] 
jouncebot: nowandnext [15:44:38] No deployments scheduled for the next 3 hour(s) and 15 minute(s) [15:44:38] In 3 hour(s) and 15 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220216T1900) [15:44:38] In 3 hour(s) and 15 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220216T1900) [15:44:42] oof [15:45:04] (03CR) 10Ladsgroup: [C: 03+2] ParserOutputAccess: Cache Parsing inside the class as well [core] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/762915 (https://phabricator.wikimedia.org/T301310) (owner: 10Ladsgroup) [15:45:08] (03CR) 10Ladsgroup: [C: 03+2] ParserOutputAccess: Cache Parsing inside the class as well [core] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/762914 (https://phabricator.wikimedia.org/T301310) (owner: 10Ladsgroup) [15:45:38] (03PS4) 10Papaul: Add conting2002 and gerrit2002 to site.pp and netboot [puppet] - 10https://gerrit.wikimedia.org/r/763252 (https://phabricator.wikimedia.org/T299575) [15:46:17] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:47:52] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T300774)', diff saved to https://phabricator.wikimedia.org/P20917 and previous config saved to /var/cache/conftool/dbconfig/20220216-154752-kormat.json [15:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:01] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [15:55:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P20918 and previous config saved to /var/cache/conftool/dbconfig/20220216-155542-marostegui.json [15:55:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:32] (03Merged) 10jenkins-bot: ParserOutputAccess: Cache Parsing inside the class as well [core] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/762915 (https://phabricator.wikimedia.org/T301310) (owner: 10Ladsgroup) [16:01:21] (03Merged) 10jenkins-bot: ParserOutputAccess: Cache Parsing inside the class as well [core] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/762914 (https://phabricator.wikimedia.org/T301310) (owner: 10Ladsgroup) [16:02:48] (03PS31) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [16:02:57] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P20919 and previous config saved to /var/cache/conftool/dbconfig/20220216-160257-kormat.json [16:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:02] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. 
Need to mention that the description on the CR is inaccurate, purpose of the change is to set the local IP for each BGP session, th" [puppet] - 10https://gerrit.wikimedia.org/r/763244 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [16:03:59] (03CR) 10Accraze: ml-services: add arwiki & bnwiki editquality isvcs (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/762533 (https://phabricator.wikimedia.org/T301415) (owner: 10Accraze) [16:05:39] (03CR) 10Eigyan: [wmf-config]: Deploy the fawiki test safety survey to production (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [16:05:44] (03PS6) 10Eigyan: [wmf-config]: Deploy the fawiki test safety survey to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) [16:06:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:57] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.22/includes/page/ParserOutputAccess.php: Backport: [[gerrit:762915|ParserOutputAccess: Cache Parsing inside the class as well (T301310)]] (duration: 00m 54s) [16:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:02] T301310: CommonsMetadata extension is triggering a duplicate parse in commons - https://phabricator.wikimedia.org/T301310 [16:07:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:07:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply 
[16:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:31] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.21/includes/page/ParserOutputAccess.php: Backport: [[gerrit:762914|ParserOutputAccess: Cache Parsing inside the class as well (T301310)]] (duration: 00m 52s) [16:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T300381)', diff saved to https://phabricator.wikimedia.org/P20920 and previous config saved to /var/cache/conftool/dbconfig/20220216-161047-marostegui.json [16:10:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance [16:10:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance [16:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:54] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [16:10:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T300381)', diff saved to https://phabricator.wikimedia.org/P20921 and previous config saved to /var/cache/conftool/dbconfig/20220216-161054-marostegui.json [16:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:31] (03CR) 10jerkins-bot: [V: 04-1] reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [16:13:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:14:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:34] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10Jdforrester-WMF) [16:14:36] 10SRE, 10Discovery-Search: Migrate Elasticsearch to Debian Buster - https://phabricator.wikimedia.org/T244736 (10Jdforrester-WMF) [16:14:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:14:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:04] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P20922 and previous config saved to /var/cache/conftool/dbconfig/20220216-161803-kormat.json [16:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:49] (03PS1) 10Ladsgroup: Use ParserOutputAccess for accessing ParserOutput [extensions/FlaggedRevs] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/762925 (https://phabricator.wikimedia.org/T283029) [16:18:54] (03CR) 10Ladsgroup: [C: 03+2] Use ParserOutputAccess for accessing ParserOutput [extensions/FlaggedRevs] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/762925 (https://phabricator.wikimedia.org/T283029) (owner: 10Ladsgroup) [16:22:59] (03Merged) 10jenkins-bot: Use ParserOutputAccess for accessing ParserOutput [extensions/FlaggedRevs] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/762925 (https://phabricator.wikimedia.org/T283029) (owner: 10Ladsgroup) [16:26:41] !log ladsgroup@deploy1002 Synchronized 
php-1.38.0-wmf.21/extensions/FlaggedRevs/backend/FlaggedRevs.php: Backport: [[gerrit:762925|Use ParserOutputAccess for accessing ParserOutput (T283029)]] (duration: 00m 49s) [16:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:48] T283029: FlaggableWikiPage::preloadPreparedEdit() does not actually carry over the parser output, leading to double parses on save - https://phabricator.wikimedia.org/T283029 [16:27:02] (03PS1) 10Cathal Mooney: Modify vars for esams to announce IPv6 Anycast range [homer/public] - 10https://gerrit.wikimedia.org/r/763270 (https://phabricator.wikimedia.org/T301165) [16:29:00] (03PS3) 10AOkoth: vrts: rename module class variables [puppet] - 10https://gerrit.wikimedia.org/r/762935 (https://phabricator.wikimedia.org/T293942) [16:31:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:31:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:04] (03CR) 10Cathal Mooney: [C: 03+2] Modify vars for esams to announce IPv6 Anycast range [homer/public] - 10https://gerrit.wikimedia.org/r/763270 (https://phabricator.wikimedia.org/T301165) (owner: 10Cathal Mooney) [16:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:10] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:32:16] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Modify vars for esams to announce IPv6 Anycast range [homer/public] - 10https://gerrit.wikimedia.org/r/763270 (https://phabricator.wikimedia.org/T301165) (owner: 10Cathal Mooney) [16:33:01] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:08] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T300774)', diff saved to https://phabricator.wikimedia.org/P20923 and previous config saved to /var/cache/conftool/dbconfig/20220216-163308-kormat.json [16:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:14] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [16:34:58] (03CR) 10Dzahn: [C: 03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/762935 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [16:36:36] (03CR) 10Papaul: [C: 03+2] Add conting2002 and gerrit2002 to site.pp and netboot [puppet] - 10https://gerrit.wikimedia.org/r/763252 (https://phabricator.wikimedia.org/T299575) (owner: 10Papaul) [16:37:10] (03PS1) 10Cathal Mooney: Add IPv6 Anycast range to knams public announcements. [homer/public] - 10https://gerrit.wikimedia.org/r/763272 (https://phabricator.wikimedia.org/T301165) [16:38:48] (03CR) 10Cathal Mooney: [C: 03+2] Add IPv6 Anycast range to knams public announcements. [homer/public] - 10https://gerrit.wikimedia.org/r/763272 (https://phabricator.wikimedia.org/T301165) (owner: 10Cathal Mooney) [16:39:05] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Add IPv6 Anycast range to knams public announcements. 
[homer/public] - 10https://gerrit.wikimedia.org/r/763272 (https://phabricator.wikimedia.org/T301165) (owner: 10Cathal Mooney) [16:40:12] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:41:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host contint2002.wikimedia.org with OS buster [16:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:12] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host contint2002.wikimedia.org with OS buster [16:46:20] 10SRE, 10Scap: scap fails deployments on bullseye/python 3.9 - https://phabricator.wikimedia.org/T299501 (10dancy) 05Open→03Resolved a:03dancy Fixed in scap 4.2.0. Currently deployed scap is 4.3.1. 
[16:47:09] 10SRE-swift-storage: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380 (10fkaelin) @MatthewVernon - ping regarding the question above about how to access file objects from publicly [16:48:38] (03CR) 10Dzahn: [C: 03+2] contint: Install docker 20.10 from thirdparty/ci on buster [puppet] - 10https://gerrit.wikimedia.org/r/758987 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [16:51:15] !log contint2001 - temp disabled puppet (active CI server) - contint1001 - attempting to install newer docker version (gerrit:758987 T300682) [16:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:21] T300682: contint1001 and contint2001 need a newer version of Docker installed - https://phabricator.wikimedia.org/T300682 [16:51:34] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission prometheus1004.eqiad.wmnet - https://phabricator.wikimedia.org/T301851 (10wiki_willy) a:03Cmjohnson [16:53:05] 10SRE, 10Data-Engineering, 10Metrics-Platform, 10Traffic: VarnishKafka to propagate user agent client hints headers to webrequest - https://phabricator.wikimedia.org/T299401 (10jbond) p:05Triage→03Medium [16:54:33] 10SRE, 10Infrastructure-Foundations, 10observability: Enable drbd collector on ganeti nodes - https://phabricator.wikimedia.org/T299560 (10jbond) [16:54:39] 10SRE, 10Infrastructure-Foundations, 10observability: Enable drbd collector on ganeti nodes - https://phabricator.wikimedia.org/T299560 (10jbond) p:05Triage→03Medium [16:54:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on contint2002.wikimedia.org with reason: host reimage [16:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:12] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Pairing tool for new SREs using sudo under supervision - https://phabricator.wikimedia.org/T299989 (10jbond) p:05Triage→03Medium [16:56:00] 
10Puppet, 10SRE, 10Infrastructure-Foundations, 10Maps, 10netbox: Postgres puppet modules use MD5 for users by default - https://phabricator.wikimedia.org/T300048 (10jbond) p:05Triage→03Medium [16:56:33] 10SRE, 10Data-Engineering, 10observability, 10serviceops: Upgrade Kafka to 2.x - https://phabricator.wikimedia.org/T300102 (10jbond) p:05Triage→03Medium [16:56:45] 10SRE, 10observability, 10Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10jbond) p:05Triage→03Medium [16:57:52] 10SRE, 10DNS, 10Traffic, 10WMSE (IT): Need Assistance adding DNS records to claim domain - https://phabricator.wikimedia.org/T300076 (10MRamirez_WMF) Hi @jbond , these required settings remain the same. Thanks for the assist! TXT name ‎@‎ (or skip if not supported by provider) TXT value Copy recordMS=08C5... [16:57:58] 10SRE, 10Traffic, 10SRE Observability (FY2021/2022-Q3), 10User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (10jbond) p:05Triage→03Medium [16:58:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on contint2002.wikimedia.org with reason: host reimage [16:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host gerrit2002.wikimedia.org with OS bullseye [16:58:10] (03PS1) 10Elukey: role::ml_k8s::master: enable Priority plugin [puppet] - 10https://gerrit.wikimedia.org/r/763277 (https://phabricator.wikimedia.org/T289131) [16:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:16] 10Puppet, 10SRE, 10Infrastructure-Foundations: Refactor P:base::firewall to pull host directly from puppetdb - https://phabricator.wikimedia.org/T300957 (10jbond) p:05Triage→03Low [16:58:31] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict 
domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10jbond) p:05Triage→03Low [16:58:31] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gerrit2002.wikimedia.org with OS bullseye [16:58:57] 10SRE, 10Observability-Logging, 10User-fgiunchedi: Ingest webrequest sampled 1000 into logstash - https://phabricator.wikimedia.org/T301110 (10jbond) p:05Triage→03Medium [16:59:25] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33827/console" [puppet] - 10https://gerrit.wikimedia.org/r/763277 (https://phabricator.wikimedia.org/T289131) (owner: 10Elukey) [17:00:14] 10SRE, 10Discovery, 10Infrastructure-Foundations, 10netops: Speed up network connections for Elastic hosts - https://phabricator.wikimedia.org/T301577 (10jbond) p:05Triage→03Medium [17:01:47] (03CR) 10Elukey: role::ml_k8s::master: enable Priority plugin [puppet] - 10https://gerrit.wikimedia.org/r/763277 (https://phabricator.wikimedia.org/T289131) (owner: 10Elukey) [17:03:20] (03PS3) 10Accraze: ml-services: add arwiki & bnwiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/762533 (https://phabricator.wikimedia.org/T301415) [17:07:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host contint2002.wikimedia.org with OS buster [17:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:03] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host contint2002.wikimedia.org with OS buster completed: - c... 
[17:08:20] (03PS1) 10Marostegui: Revert "db2088: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/763286 [17:08:27] (03PS32) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [17:09:13] (03CR) 10Elukey: [C: 03+2] ml-services: add arwiki & bnwiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/762533 (https://phabricator.wikimedia.org/T301415) (owner: 10Accraze) [17:10:03] (03CR) 10Marostegui: [C: 03+2] Revert "db2088: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/763286 (owner: 10Marostegui) [17:12:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage [17:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:46] (03PS1) 10Marostegui: db2126,db2095: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/763280 (https://phabricator.wikimedia.org/T301848) [17:13:29] !log accraze@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [17:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:34] (03CR) 10Marostegui: [C: 03+2] db2126,db2095: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/763280 (https://phabricator.wikimedia.org/T301848) (owner: 10Marostegui) [17:13:58] !log accraze@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . 
[17:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit2002.wikimedia.org with reason: host reimage [17:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [17:19:46] (03CR) 10jerkins-bot: [V: 04-1] reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [17:21:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [17:25:52] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: Restarting to pick up Java security updates - hnowlan@cumin1001 [17:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gerrit2002.wikimedia.org with OS bullseye [17:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:44] 10SRE, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gerrit2002.wikimedia.org with OS bullseye completed: - gerrit2002 (**PASS**)... 
[17:28:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance [17:28:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance [17:28:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance [17:30:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance [17:30:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T300381)', diff saved to https://phabricator.wikimedia.org/P20925 and previous config saved to /var/cache/conftool/dbconfig/20220216-173137-marostegui.json [17:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:42] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [17:32:43] (03PS1) 10Dzahn: Revert "contint: Install docker 20.10 from thirdparty/ci on buster" [puppet] - 10https://gerrit.wikimedia.org/r/763287 [17:35:35] (03CR) 10Dzahn: [C: 03+2] Revert "contint: Install docker 20.10 from thirdparty/ci on buster" [puppet] - 10https://gerrit.wikimedia.org/r/763287 (owner: 10Dzahn) [17:36:53] (03CR) 10Phuedx: [C: 04-1] "See inline." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: 10AGueyte) [17:38:14] (03CR) 10Jdlrobson: [wmf-config]: Deploy the fawiki test safety survey to production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [17:40:30] (03PS1) 10Marostegui: add_tl_target_id_T300775.py: Remove COPY from the index creation [software/schema-changes] - 10https://gerrit.wikimedia.org/r/763284 (https://phabricator.wikimedia.org/T300775) [17:41:06] (03CR) 10Marostegui: [C: 03+2] add_tl_target_id_T300775.py: Remove COPY from the index creation [software/schema-changes] - 10https://gerrit.wikimedia.org/r/763284 (https://phabricator.wikimedia.org/T300775) (owner: 10Marostegui) [17:41:13] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:41:30] (03Merged) 10jenkins-bot: add_tl_target_id_T300775.py: Remove COPY from the index creation [software/schema-changes] - 
10https://gerrit.wikimedia.org/r/763284 (https://phabricator.wikimedia.org/T300775) (owner: 10Marostegui) [17:46:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P20926 and previous config saved to /var/cache/conftool/dbconfig/20220216-174641-marostegui.json [17:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:28] jouncebot now [17:50:28] No deployments scheduled for the next 1 hour(s) and 9 minute(s) [17:50:39] I'm going to run a test on deployment.eqiad.wmnet [17:50:49] (03PS4) 10Razzi: analytics_cluster::datahub::opensearch: start of puppet role [puppet] - 10https://gerrit.wikimedia.org/r/762957 (https://phabricator.wikimedia.org/T301382) [17:52:17] and done. [17:59:12] (03CR) 10AOkoth: [C: 03+2] vrts: rename module class variables [puppet] - 10https://gerrit.wikimedia.org/r/762935 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [17:59:43] (03CR) 10AOkoth: [C: 03+2] vrts: rename module class variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/762935 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [17:59:56] (03PS5) 10Razzi: analytics_cluster::datahub::opensearch: start of puppet role [puppet] - 10https://gerrit.wikimedia.org/r/762957 (https://phabricator.wikimedia.org/T301382) [18:00:04] (03CR) 10AOkoth: [C: 03+2] vrts: rename module class variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/762935 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [18:00:49] (03PS1) 10JHathaway: Remove ordered_json function [puppet] - 10https://gerrit.wikimedia.org/r/763309 [18:01:06] 10SRE, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10Papaul) [18:01:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P20929 and 
previous config saved to /var/cache/conftool/dbconfig/20220216-180146-marostegui.json [18:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:59] 10SRE, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10Papaul) 05Open→03Resolved @Dzahn @akosiaris this is complete [18:03:16] 10SRE, 10ops-codfw, 10DC-Ops, 10Platform Engineering, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase-dev200[123].codfw.wmnet - https://phabricator.wikimedia.org/T299437 (10Papaul) [18:04:32] (03PS1) 10AOkoth: Revert "vrts: rename module class variables" [puppet] - 10https://gerrit.wikimedia.org/r/763288 [18:07:27] (03CR) 10AOkoth: [C: 03+2] Revert "vrts: rename module class variables" [puppet] - 10https://gerrit.wikimedia.org/r/763288 (owner: 10AOkoth) [18:09:02] (03PS1) 10Andrew Bogott: wmcs-cinder-backup-manager.py: increase total backup timeout [puppet] - 10https://gerrit.wikimedia.org/r/763310 (https://phabricator.wikimedia.org/T292546) [18:10:11] (03PS1) 10JHathaway: ini(), php_ini(): convert to modern Ruby function API [puppet] - 10https://gerrit.wikimedia.org/r/763311 [18:11:05] (03CR) 10jerkins-bot: [V: 04-1] ini(), php_ini(): convert to modern Ruby function API [puppet] - 10https://gerrit.wikimedia.org/r/763311 (owner: 10JHathaway) [18:15:05] (03PS6) 10Razzi: analytics_cluster::datahub::opensearch: start of puppet role [puppet] - 10https://gerrit.wikimedia.org/r/762957 (https://phabricator.wikimedia.org/T301382) [18:15:49] (03PS2) 10JHathaway: ini(), php_ini(): convert to modern Ruby function API [puppet] - 10https://gerrit.wikimedia.org/r/763311 [18:15:56] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33829/console" [puppet] - 10https://gerrit.wikimedia.org/r/762957 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [18:16:51] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1126 (T300381)', diff saved to https://phabricator.wikimedia.org/P20930 and previous config saved to /var/cache/conftool/dbconfig/20220216-181651-marostegui.json [18:16:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance [18:16:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance [18:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:16:57] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [18:17:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T300381)', diff saved to https://phabricator.wikimedia.org/P20931 and previous config saved to /var/cache/conftool/dbconfig/20220216-181706-marostegui.json [18:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:00] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [18:24:07] 10SRE, 10Scap, 
10Release-Engineering-Team (Radar): Scap deployers should have the ability to depool and restart HHVM - https://phabricator.wikimedia.org/T208813 (10dancy) 05Open→03Declined HHVM is no longer in use. [18:33:10] 10SRE, 10Observability-Metrics, 10observability, 10Graphite: unused grafana-dashboard indices on elasticsearch / logstash - https://phabricator.wikimedia.org/T174172 (10colewhite) 05Open→03Resolved a:03colewhite It appears this task was completed some time ago as the described indexes do not exist on... [18:33:13] 10SRE, 10Observability-Logging, 10Patch-For-Review: Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (10colewhite) [18:43:04] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-cinder-backup-manager.py: increase total backup timeout [puppet] - 10https://gerrit.wikimedia.org/r/763310 (https://phabricator.wikimedia.org/T292546) (owner: 10Andrew Bogott) [18:49:17] !log deploying OTRS config change [18:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:54] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:54:06] (03PS3) 10Ssingh: bird: for bird6, set local IP to IPv6 instead of IPv4 [puppet] - 10https://gerrit.wikimedia.org/r/763244 (https://phabricator.wikimedia.org/T301165) [18:56:02] (03CR) 10Ssingh: bird: for bird6, set local IP to IPv6 instead of IPv4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763244 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [18:59:35] (03CR) 10Ssingh: [C: 03+2] bird: for bird6, set local IP to IPv6 instead of IPv4 [puppet] - 10https://gerrit.wikimedia.org/r/763244 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [19:00:05] hashar and jeena: I, the Bot under the Fountain, call upon thee, The Deployer, to do Train log triage with CPT deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220216T1900). [19:00:05] hashar and jeena: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220216T1900). [19:10:36] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (10Milimetric) [19:11:11] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damiendf - https://phabricator.wikimedia.org/T301659 (10Milimetric) [19:12:36] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damiendf - https://phabricator.wikimedia.org/T301659 (10Milimetric) Approved in Andrew Otto's absence. I'm not sure if he then merged any relevant patches (like the `analytics-privatedata-users` membership, I can take care of t... [19:14:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:15:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:58] (03PS1) 10Cathal Mooney: Rename variable used for router id setting in Bird config [puppet] - 10https://gerrit.wikimedia.org/r/763319 [19:21:35] (03CR) 10jerkins-bot: [V: 04-1] Rename variable used for router id setting in Bird config [puppet] - 
10https://gerrit.wikimedia.org/r/763319 (owner: 10Cathal Mooney) [19:24:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T300381)', diff saved to https://phabricator.wikimedia.org/P20932 and previous config saved to /var/cache/conftool/dbconfig/20220216-192400-marostegui.json [19:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:07] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [19:24:19] (03PS1) 10Jbond: wikimedia.org: Add MS O365 txt record [dns] - 10https://gerrit.wikimedia.org/r/763323 (https://phabricator.wikimedia.org/T300076) [19:24:23] (03PS2) 10Cathal Mooney: Rename variable used for router id setting in Bird config [puppet] - 10https://gerrit.wikimedia.org/r/763319 (https://phabricator.wikimedia.org/T301165) [19:25:16] (03PS1) 10CDanis: Revert "text-lb: normalize unused query param" [puppet] - 10https://gerrit.wikimedia.org/r/763289 [19:25:25] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review, 10WMSE (IT): Need Assistance adding DNS records to claim domain - https://phabricator.wikimedia.org/T300076 (10jbond) I have created the change and will make sure it gets deployed tomorrow. > P.S I'll probably need to request addition of another set of DNS re... [19:25:36] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review, 10WMSE (IT): Need Assistance adding DNS records to claim domain - https://phabricator.wikimedia.org/T300076 (10jbond) p:05Triage→03Medium [19:25:57] (03CR) 10RLazarus: [C: 03+1] Revert "text-lb: normalize unused query param" [puppet] - 10https://gerrit.wikimedia.org/r/763289 (owner: 10CDanis) [19:31:15] (03CR) 10Ssingh: [C: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/33830/doh3001.wikimedia.org/index.html PCC looks happy, no change to config!" 
[puppet] - 10https://gerrit.wikimedia.org/r/763319 (https://phabricator.wikimedia.org/T301165) (owner: 10Cathal Mooney) [19:33:09] (03CR) 10Cathal Mooney: [C: 03+2] Rename variable used for router id setting in Bird config [puppet] - 10https://gerrit.wikimedia.org/r/763319 (https://phabricator.wikimedia.org/T301165) (owner: 10Cathal Mooney) [19:33:29] !log removing 28 files for legal compliance [19:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P20933 and previous config saved to /var/cache/conftool/dbconfig/20220216-193905-marostegui.json [19:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:06] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:53:58] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (10cmooney) Up and running from esams :) ` cathal@nbgw:~$ dig -b 2001:470:1f09:32c::103 +nsid +https www.toutless.com @wikimedia-dns.org. ; <<>> DiG 9.17.2... [19:54:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P20934 and previous config saved to /var/cache/conftool/dbconfig/20220216-195410-marostegui.json [19:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:39] (03PS1) 10Gergő Tisza: GrowthExperiments: Enable image recommendations on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763326 (https://phabricator.wikimedia.org/T301276) [19:58:57] (03CR) 10Kosta Harlan: [C: 03+1] "thanks for the cleanup!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/763326 (https://phabricator.wikimedia.org/T301276) (owner: 10Gergő Tisza) [19:59:10] (03PS1) 10Clare Ming: Add back flex-grow for sticky header search bar [skins/Vector] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/763290 [20:01:20] (03CR) 10EllenR: "Came back around, lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [20:07:53] anyone around who could review https://gerrit.wikimedia.org/r/c/mediawiki/core/+/763282 ? i'd like to backport it today [20:09:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T300381)', diff saved to https://phabricator.wikimedia.org/P20936 and previous config saved to /var/cache/conftool/dbconfig/20220216-200914-marostegui.json [20:09:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance [20:09:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance [20:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:21] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [20:09:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T300381)', diff saved to https://phabricator.wikimedia.org/P20937 and previous config saved to /var/cache/conftool/dbconfig/20220216-200922-marostegui.json [20:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:28] (03PS1) 10Cwhite: grafana-next: set grafana codfw base domain to grafana next [puppet] - 10https://gerrit.wikimedia.org/r/763329 
(https://phabricator.wikimedia.org/T282863) [20:32:19] (03CR) 10Cwhite: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1003/33831/" [puppet] - 10https://gerrit.wikimedia.org/r/763329 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [20:39:26] (03CR) 10Razzi: [V: 03+1] "Might need some further tweaks, especially regarding alerting and security. Let me know how it looks for now though!" [puppet] - 10https://gerrit.wikimedia.org/r/762957 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [20:44:16] (03PS2) 10Jforrester: Drop CodeReview, Part I: Stop loading it anywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593350 (https://phabricator.wikimedia.org/T116948) [20:44:22] (03PS2) 10Jforrester: Drop CodeReview, Part II: Stop configuring it anywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593351 (https://phabricator.wikimedia.org/T116948) [20:44:26] (03PS2) 10Jforrester: Drop CodeReview, Part III: Drop from i18n build step [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593352 (https://phabricator.wikimedia.org/T116948) [20:53:18] (03PS1) 10Ssingh: Add all doh* and durum* hosts to anycast_neighbors to enable IPv6 [homer/public] - 10https://gerrit.wikimedia.org/r/763331 (https://phabricator.wikimedia.org/T301165) [20:55:21] (03CR) 10Tchanders: [C: 04-1] Update Event Stream for IPInfo events (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: 10AGueyte) [20:58:14] (03PS1) 10Bartosz Dziewoński: EditPage: Parse wikitext in the usual way in the copyright message [core] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/763291 (https://phabricator.wikimedia.org/T301890) [20:59:13] (03PS1) 10Bartosz Dziewoński: Add Ӷ and Ԥ to Abkhaz collation [core] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/763292 (https://phabricator.wikimedia.org/T298309) [20:59:36] (03PS1) 10Bartosz Dziewoński: Add Ӷ and Ԥ to Abkhaz collation [core] 
(wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/763293 (https://phabricator.wikimedia.org/T298309) [21:00:05] chrisalbon and accraze: I, the Bot under the Fountain, call upon thee, The Deployer, to do Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220216T2100). [21:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220216T2100). [21:00:05] nn1l2 and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:08] hi [21:00:12] o/ [21:00:26] i also have some late additions [21:00:29] (sorry) [21:02:12] I can do the backport [21:02:40] MatmaRex: can you add them to the wiki page? [21:02:43] (added to https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220216T2100) [21:02:49] tgr: yeah, i was just editing [21:05:05] (03CR) 10Gergő Tisza: [C: 03+2] InitialiseSettings: General cleanup, wgAddGroups (J-P) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763225 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [21:06:07] (03Merged) 10jenkins-bot: InitialiseSettings: General cleanup, wgAddGroups (J-P) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763225 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [21:06:41] (03CR) 10Herron: [C: 03+1] grafana-next: set grafana codfw base domain to grafana next [puppet] - 10https://gerrit.wikimedia.org/r/763329 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [21:06:44] tgr: they're mediawiki/core backports, we might want to +2 them earlier so that we don't have to wait for the merges later [21:06:51] (be right back) [21:08:18] (03CR) 10Gergő Tisza: [C: 03+2] Add Ӷ and Ԥ to Abkhaz collation [core] (wmf/1.38.0-wmf.21) - 
10https://gerrit.wikimedia.org/r/763292 (https://phabricator.wikimedia.org/T298309) (owner: 10Bartosz Dziewoński) [21:08:25] (03CR) 10Gergő Tisza: [C: 03+2] Add Ӷ and Ԥ to Abkhaz collation [core] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/763293 (https://phabricator.wikimedia.org/T298309) (owner: 10Bartosz Dziewoński) [21:09:44] (03CR) 10Gergő Tisza: [C: 03+2] EditPage: Parse wikitext in the usual way in the copyright message [core] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/763291 (https://phabricator.wikimedia.org/T301890) (owner: 10Bartosz Dziewoński) [21:10:02] good point, thanks [21:10:26] (03PS2) 10Gergő Tisza: GrowthExperiments: Enable image recommendations on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763326 (https://phabricator.wikimedia.org/T301276) [21:10:36] thanks for deploying :) [21:10:45] (03CR) 10Tchanders: [C: 04-1] Update Event Stream for IPInfo events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: 10AGueyte) [21:12:58] (03CR) 10jerkins-bot: [V: 04-1] EditPage: Parse wikitext in the usual way in the copyright message [core] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/763291 (https://phabricator.wikimedia.org/T301890) (owner: 10Bartosz Dziewoński) [21:13:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:52] no deployers available in this window? [21:14:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:14:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:14:34] (03CR) 10Btullis: "Looking good. Can you do a pcc run against datahubsearch1001, to see if it install cleanly?" 
[puppet] - 10https://gerrit.wikimedia.org/r/762957 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [21:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:28] (03Abandoned) 10Clare Ming: Add back flex-grow for sticky header search bar [skins/Vector] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/763290 (owner: 10Clare Ming) [21:16:33] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:763225|InitialiseSettings: General cleanup, wgAddGroups (J-P) (T301647)]] (duration: 00m 51s) [21:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:39] T301647: Clean up InitialiseSettings - https://phabricator.wikimedia.org/T301647 [21:16:55] nn1l2: it's live [21:17:00] (03PS1) 10Cwhite: remove deprecated piechart plugin [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763334 (https://phabricator.wikimedia.org/T282863) [21:17:02] (03PS1) 10Cwhite: update grafana-image-renderer to 3.3.0 [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763335 (https://phabricator.wikimedia.org/T282863) [21:17:15] thanks! 
[21:17:31] (03CR) 10Gergő Tisza: [C: 03+2] GrowthExperiments: Enable image recommendations on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763326 (https://phabricator.wikimedia.org/T301276) (owner: 10Gergő Tisza) [21:18:26] (03Merged) 10jenkins-bot: GrowthExperiments: Enable image recommendations on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763326 (https://phabricator.wikimedia.org/T301276) (owner: 10Gergő Tisza) [21:18:38] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:19:20] 10SRE, 10Security-Team, 10Performance-Team (Radar), 10Security: Security API Storage Needs - https://phabricator.wikimedia.org/T301428 (10Mstyles) Thanks @Joe, is there a hard limit on file sizes that can be stored inside the container? We might have other options with the file sizes. Do you have any guid... [21:23:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 10%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20938 and previous config saved to /var/cache/conftool/dbconfig/20220216-212315-root.json [21:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:24] (03Merged) 10jenkins-bot: Add Ӷ and Ԥ to Abkhaz collation [core] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/763292 (https://phabricator.wikimedia.org/T298309) (owner: 10Bartosz Dziewoński) [21:23:30] (03Merged) 10jenkins-bot: Add Ӷ and Ԥ to Abkhaz collation [core] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/763293 (https://phabricator.wikimedia.org/T298309) (owner: 10Bartosz Dziewoński) [21:24:11] 10SRE, 10Security-Team, 10Performance-Team (Radar), 10Security: Security API Storage Needs - https://phabricator.wikimedia.org/T301428 (10Mstyles) It seems like the simplest way forward for us would be to use the existing [[ 
https://wikitech.wikimedia.org/wiki/Cassandra | Cassandra cluster ]] with a new ac... [21:24:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance [21:24:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance [21:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:48] (03PS1) 10Andrew Bogott: backy2: on Bullseye, hack around a silly package name mismatch [puppet] - 10https://gerrit.wikimedia.org/r/763336 (https://phabricator.wikimedia.org/T301909) [21:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:52] (03Merged) 10jenkins-bot: EditPage: Parse wikitext in the usual way in the copyright message [core] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/763291 (https://phabricator.wikimedia.org/T301890) (owner: 10Bartosz Dziewoński) [21:24:53] (03PS1) 10Cwhite: update grafana-simple-json-datasource to 1.4.2 [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/763337 (https://phabricator.wikimedia.org/T282863) [21:25:34] (03CR) 10jerkins-bot: [V: 04-1] backy2: on Bullseye, hack around a silly package name mismatch [puppet] - 10https://gerrit.wikimedia.org/r/763336 (https://phabricator.wikimedia.org/T301909) (owner: 10Andrew Bogott) [21:25:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:27:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:54] (03PS2) 10Andrew Bogott: 
backy2: on Bullseye, hack around a silly package name mismatch [puppet] - 10https://gerrit.wikimedia.org/r/763336 (https://phabricator.wikimedia.org/T301909) [21:28:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:30] (03CR) 10jerkins-bot: [V: 04-1] backy2: on Bullseye, hack around a silly package name mismatch [puppet] - 10https://gerrit.wikimedia.org/r/763336 (https://phabricator.wikimedia.org/T301909) (owner: 10Andrew Bogott) [21:29:06] (03PS1) 10Ssingh: Add CR router IPv6 loopbacks to Bird for {eqsin,eqiad,ulsfo,codwf} [puppet] - 10https://gerrit.wikimedia.org/r/763339 (https://phabricator.wikimedia.org/T301165) [21:29:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T300381)', diff saved to https://phabricator.wikimedia.org/P20939 and previous config saved to /var/cache/conftool/dbconfig/20220216-212934-marostegui.json [21:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:40] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [21:30:47] tgr: you there? [21:31:36] (03PS3) 10Andrew Bogott: backy2: on Bullseye, hack around a silly package name mismatch [puppet] - 10https://gerrit.wikimedia.org/r/763336 (https://phabricator.wikimedia.org/T301909) [21:31:38] had to hop around to find an ISP not under an aggressive range block for testing account creation :( [21:31:59] heh, oops [21:32:15] * AntiComposite blocks tgr for block evasion [21:32:16] no hurry, i just wanted to make sure you're still doing the deployment [21:33:11] tgr: also i could probably test the whatever account creation thing if you can't? 
[21:33:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:00] my third option worked [21:34:12] (03CR) 10Andrew Bogott: [C: 03+2] backy2: on Bullseye, hack around a silly package name mismatch [puppet] - 10https://gerrit.wikimedia.org/r/763336 (https://phabricator.wikimedia.org/T301909) (owner: 10Andrew Bogott) [21:34:31] now I just need to get lucky with the A/B test [21:34:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:34:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:44] ...aaand get around the signup throttling... [21:36:50] testing account creation is fun. 
[21:37:42] (03PS1) 10Andrew Bogott: backy2: /fully/qualify/grep because puppet likes it that way [puppet] - 10https://gerrit.wikimedia.org/r/763341 (https://phabricator.wikimedia.org/T301909) [21:37:58] (03CR) 10Dzahn: [C: 03+2] Make sure that metric_name is defined in all cases [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/763231 (https://phabricator.wikimedia.org/T301872) (owner: 10Alexandros Kosiaris) [21:38:02] (03CR) 10Dzahn: [V: 03+2 C: 03+2] Make sure that metric_name is defined in all cases [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/763231 (https://phabricator.wikimedia.org/T301872) (owner: 10Alexandros Kosiaris) [21:38:14] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:38:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20940 and previous config saved to /var/cache/conftool/dbconfig/20220216-213819-root.json [21:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:27] (03CR) 10Andrew Bogott: [C: 03+2] backy2: /fully/qualify/grep because puppet likes it that way [puppet] - 10https://gerrit.wikimedia.org/r/763341 (https://phabricator.wikimedia.org/T301909) (owner: 10Andrew Bogott) [21:40:51] (03CR) 10Dzahn: [C: 03+2] Add new metrics present in 1.8.16 [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/763238 (https://phabricator.wikimedia.org/T301872) (owner: 10Alexandros Kosiaris) [21:40:55] (03CR) 10Dzahn: [V: 03+2 C: 03+2] Add new metrics present in 1.8.16 [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/763238 (https://phabricator.wikimedia.org/T301872) (owner: 10Alexandros Kosiaris) [21:40:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply 
[21:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:33] not hitting a 40% window out of 8 tries, that doesn't look good. [21:41:44] (03CR) 10Dzahn: [C: 03+2] Release 0.6 prometheus-etherpad-exporter [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/763239 (https://phabricator.wikimedia.org/T301872) (owner: 10Alexandros Kosiaris) [21:41:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:41:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:00] (03PS2) 10Ssingh: Add CR router IPv6 loopbacks to Bird for {eqsin,eqiad,ulsfo,codwf} [puppet] - 10https://gerrit.wikimedia.org/r/763339 (https://phabricator.wikimedia.org/T301165) [21:42:22] oh well, let's do the backports first. [21:42:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:15] MatmaRex: updateCollation uses the db, right? [21:43:23] so I assume that can't be tested on mwdebug [21:43:26] tgr: yes [21:44:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P20941 and previous config saved to /var/cache/conftool/dbconfig/20220216-214439-marostegui.json [21:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:35] MatmaRex: the code is on mwdebug1001 [21:47:07] the EditPage change? 
[21:47:30] oh, yes, i see it now [21:47:34] looks good on https://zh.wikivoyage.org/w/index.php?title=Wikivoyage:互助客栈&action=edit [21:48:11] !log tgr@deploy1002 Synchronized php-1.38.0-wmf.21/includes/collation/AbkhazUppercaseCollation.php: Backport: [[gerrit:763292|Add Ӷ and Ԥ to Abkhaz collation (T298309)]] (duration: 00m 49s) [21:48:13] i can actually verify that the collation change has an effect if you only deploy on mwdebug, but it won't display correctly [21:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:16] T298309: Rename the Abkhaz language - https://phabricator.wikimedia.org/T298309 [21:48:25] here's a category that should work for testing: https://ab.wikipedia.org/wiki/Акатегориа:Аԥсны_ажурналистцәа [21:48:35] after deploying the change, the headings will be all wrong :) [21:48:50] and after running the script too, they will be right again, plus "Ԥ" will appear after "П" [21:49:41] !log tgr@deploy1002 Synchronized php-1.38.0-wmf.22/includes/collation/AbkhazUppercaseCollation.php: Backport: [[gerrit:763293|Add Ӷ and Ԥ to Abkhaz collation (T298309)]] (duration: 00m 49s) [21:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:15] (the script should only take a couple seconds) [21:51:00] should I use updateCollation.php --remote? [21:51:06] "Use Shellbox to calculate the new sort keys remotely" [21:51:22] not sure how mature that option is [21:51:24] uhh, probably not? it's a new thing, i don't know if it works yet [21:51:32] it shouldn't be needed [21:51:55] (03CR) 10Cwhite: [C: 03+1] "Thanks for the refactor!" [puppet] - 10https://gerrit.wikimedia.org/r/763110 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [21:51:59] (i mean, i trust that the code works, but i don't know if we have everything for it configured in production) [21:52:51] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! 
Checked the IPs against reverse DNS and it matches the devices mentioned in the comments." [puppet] - 10https://gerrit.wikimedia.org/r/763339 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [21:52:56] !log ran mwscript updateCollation.php abwiki --force [21:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20942 and previous config saved to /var/cache/conftool/dbconfig/20220216-215322-root.json [21:53:43] (03CR) 10Ssingh: [C: 03+2] Add CR router IPv6 loopbacks to Bird for {eqsin,eqiad,ulsfo,codwf} [puppet] - 10https://gerrit.wikimedia.org/r/763339 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [21:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:23] tgr: thanks. looks good [21:54:40] !log merged Alex's changes, built prometheus-etherpad-exporter_0.6 on deneb, imported on apt1001, ran reprepro export, installed new version on etherpad1003 T301872 [21:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:47] T301872: Prometheus etherpad scrape failure - https://phabricator.wikimedia.org/T301872 [21:55:13] !log tgr@deploy1002 Synchronized php-1.38.0-wmf.22/includes/EditPage.php: Backport: [[gerrit:763291|EditPage: Parse wikitext in the usual way in the copyright message (T301890)]] (duration: 00m 49s) [21:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:18] T301890: Wikimedia-copyrightwarning no longer support wikitext table syntax in 1.38.0-wmf.22 - https://phabricator.wikimedia.org/T301890 [21:55:24] there will be one more patch, I'll add it to the wiki in a moment [21:56:09] (plus I still have to figure out if the GrowthExperiments patch is actually working) [21:56:27] (03CR) 10Cwhite: [C: 03+1] "Looks good!" 
[debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/763197 (https://phabricator.wikimedia.org/T300859) (owner: 10Filippo Giunchedi) [21:57:10] (03PS9) 10AGueyte: Update Event Stream for IPInfo events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) [21:57:44] 10SRE, 10Wikimedia-Etherpad, 10serviceops, 10Patch-For-Review: Prometheus etherpad scrape failure - https://phabricator.wikimedia.org/T301872 (10Dzahn) ` [apt1001:~] $ sudo -E reprepro ls prometheus-etherpad-exporter prometheus-etherpad-exporter | 0.3 | buster-wikimedia | amd64, i386, source prometheus-e... [21:57:46] (03PS1) 10Andrew Bogott: backy2: don't back up shelved instances [puppet] - 10https://gerrit.wikimedia.org/r/763345 [21:59:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P20943 and previous config saved to /var/cache/conftool/dbconfig/20220216-215944-marostegui.json [21:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:48] (03CR) 10Cathal Mooney: [C: 03+1] "Yep looks good. thanks!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/763331 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [22:01:22] (03CR) 10Ssingh: [C: 03+2] Add all doh* and durum* hosts to anycast_neighbors to enable IPv6 [homer/public] - 10https://gerrit.wikimedia.org/r/763331 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [22:01:45] (03PS10) 10AGueyte: Update Event Stream for IPInfo events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) [22:02:41] (03Merged) 10jenkins-bot: Add all doh* and durum* hosts to anycast_neighbors to enable IPv6 [homer/public] - 10https://gerrit.wikimedia.org/r/763331 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [22:05:48] (03PS3) 10Ssingh: durum: add support for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/762521 (https://phabricator.wikimedia.org/T301165) [22:05:51] (03PS11) 10AGueyte: Update Event Stream for IPInfo events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) [22:08:02] (03CR) 10Ssingh: [C: 03+2] durum: add support for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/762521 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [22:08:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20944 and previous config saved to /var/cache/conftool/dbconfig/20220216-220826-root.json [22:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:30] PROBLEM - Check systemd state on doh6002 is CRITICAL: CRITICAL - degraded: The following units failed: bird6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:11:08] PROBLEM - Check systemd state on doh6001 is CRITICAL: CRITICAL - degraded: The following units failed: bird6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:11:21] ^ topranks 
ha [22:11:46] I think we should just ACK this particular case and not worry about it since we are not serving traffic anyway [22:12:05] agreed [22:12:07] ok [22:12:17] adding the far side, it'll need a restart. [22:12:20] can you ack it? [22:12:30] yep will do [22:13:02] optionally: systemctl stop .. systemctl mask .. systemctl reset-failed probably makes it happy [22:13:34] yeah that's not a bad idea as well but I think there will be durum errors as well, so we need to do all four hosts [22:13:52] do you guys know how I can find alerts in alertmanager that are ..not alerting [22:14:10] I want to find the instance=etherpad1003 but can't [22:14:24] ACK, sukhe [22:14:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T300381)', diff saved to https://phabricator.wikimedia.org/P20945 and previous config saved to /var/cache/conftool/dbconfig/20220216-221448-marostegui.json [22:14:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance [22:14:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance [22:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:55] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [22:14:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1111 (T300381)', diff saved to https://phabricator.wikimedia.org/P20946 and previous config saved to /var/cache/conftool/dbconfig/20220216-221456-marostegui.json [22:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:04] that's auto-manuel working [22:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:28] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime 
for 5 days, 0:00:00 on durum[6001-6002].drmrs.wmnet with reason: T301165; errors expected, not serving any traffic [22:15:30] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on durum[6001-6002].drmrs.wmnet with reason: T301165; errors expected, not serving any traffic [22:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:33] T301165: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 [22:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:38] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:15:43] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on doh[6001-6002].wikimedia.org with reason: T301165; errors expected, not serving any traffic [22:15:45] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on doh[6001-6002].wikimedia.org with reason: T301165; errors expected, not serving any traffic [22:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:00] (JobUnavailable) firing: Reduced availability for job mjolnir in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [22:18:17] (03CR) 10AGueyte: Update Event Stream for IPInfo events (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: 10AGueyte) [22:19:52] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:21:58] RECOVERY - Host check.wikimedia-dns.org is UP: PING OK - 
Packet loss = 0%, RTA = 0.43 ms [22:22:08] ^ wb [22:23:10] https://www.youtube.com/watch?v=FPQlXNH36mI [22:23:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20948 and previous config saved to /var/cache/conftool/dbconfig/20220216-222329-root.json [22:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:53] sukhe: lol [22:28:39] mutante: credit goes to topranks :) [22:28:50] (for all the work, not the video!) [22:30:41] video annotations are always fun :) [22:34:44] 10SRE, 10Wikimedia-Etherpad, 10serviceops: Prometheus etherpad scrape failure - https://phabricator.wikimedia.org/T301872 (10Dzahn) 05Open→03Resolved @akosiaris :) reviewed / merged patches, built package on deneb, uploaded package on apt1001, imported with reprepro, exported indices, installed package... [22:34:47] 10SRE, 10Wikimedia-Etherpad, 10serviceops, 10vm-requests, 10Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it to 1.8.16) - https://phabricator.wikimedia.org/T300568 (10Dzahn) [22:34:53] 10SRE: ProdPasteBot uses deprecated certificate auth - https://phabricator.wikimedia.org/T242857 (10Aklapper) Following the steps at https://www.mediawiki.org/wiki/Phabricator/Bots#Phabricator_admins:_Steps_to_perform ; https://phabricator.wikimedia.org/settings/user/ProdPasteBot/page/apitokens/ does not list an... 
[22:38:47] (03PS1) 10Zabe: Enable huwiki 500K articles milestone logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763353 (https://phabricator.wikimedia.org/T301923) [22:39:32] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:40:03] (03PS1) 10Gergő Tisza: Add huwiki 500k milestone logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763354 (https://phabricator.wikimedia.org/T301923) [22:40:05] (03PS1) 10Gergő Tisza: Use huwiki 500k milestone logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763355 (https://phabricator.wikimedia.org/T301923) [22:49:18] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:763326|GrowthExperiments: Enable image recommendations on eswiki (T301276)]] (duration: 00m 52s) [22:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:24] T301276: Turn Add an image on at Spanish Wikipedia - https://phabricator.wikimedia.org/T301276 [22:49:55] (03Abandoned) 10Zabe: Enable huwiki 500K articles milestone logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763353 (https://phabricator.wikimedia.org/T301923) (owner: 10Zabe) [22:51:32] (03CR) 10Cwhite: profile::logstash::beta: move to profile::base::certificate's truststore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763113 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [22:53:53] (03CR) 10Gergő Tisza: [C: 03+2] Add huwiki 500k milestone logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763354 (https://phabricator.wikimedia.org/T301923) (owner: 10Gergő Tisza) [22:55:05] (03Merged) 10jenkins-bot: Add huwiki 500k milestone logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763354 (https://phabricator.wikimedia.org/T301923) (owner: 10Gergő Tisza) [22:57:32] !log tgr@deploy1002 Synchronized static/images/project-logos/: Config: 
[[gerrit:763354|Add huwiki 500k milestone logos (T301923)]] (duration: 00m 50s) [22:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:38] T301923: Enable milestone logo for hu.wikipedia - 500K articles - https://phabricator.wikimedia.org/T301923 [22:58:14] (03CR) 10Cwhite: [C: 03+1] "This looks right to me." [puppet] - 10https://gerrit.wikimedia.org/r/763172 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [22:58:29] !log tgr@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:763354|Add huwiki 500k milestone logos (T301923)]] (duration: 00m 49s) [22:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:35] (03CR) 10Gergő Tisza: [C: 03+2] Use huwiki 500k milestone logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763355 (https://phabricator.wikimedia.org/T301923) (owner: 10Gergő Tisza) [22:59:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:59:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:27] (03Merged) 10jenkins-bot: Use huwiki 500k milestone logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/763355 (https://phabricator.wikimedia.org/T301923) (owner: 10Gergő Tisza) [23:00:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:05:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:07:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:45] !log tgr@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:763355|Use huwiki 500k milestone logos (T301923)]] (duration: 00m 49s) [23:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:51] T301923: Enable milestone logo for hu.wikipedia - 500K articles - https://phabricator.wikimedia.org/T301923 [23:12:11] (03CR) 10Cwhite: profile::logstash::beta: move to profile::base::certificate's truststore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/763113 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [23:18:04] (03PS1) 10JHathaway: Remove ordered_yaml function [puppet] - 10https://gerrit.wikimedia.org/r/763362 [23:19:55] (03PS2) 10JHathaway: Remove ordered_json function [puppet] - 10https://gerrit.wikimedia.org/r/763309 [23:20:42] (03CR) 10jerkins-bot: [V: 04-1] Remove ordered_yaml function [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway) [23:27:11] (03PS2) 10JHathaway: Remove ordered_yaml function [puppet] - 10https://gerrit.wikimedia.org/r/763362 [23:28:54] !log test reboot of lsw1-e1-eqiad - not in service. 
[23:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:53] (03CR) 10jerkins-bot: [V: 04-1] Remove ordered_yaml function [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway) [23:31:04] (03PS1) 10JHathaway: Remove puppet:///files and move files to modules [puppet] - 10https://gerrit.wikimedia.org/r/763370 [23:33:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T300381)', diff saved to https://phabricator.wikimedia.org/P20949 and previous config saved to /var/cache/conftool/dbconfig/20220216-233345-marostegui.json [23:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:52] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [23:37:24] (03PS3) 10JHathaway: Remove ordered_yaml function [puppet] - 10https://gerrit.wikimedia.org/r/763362 [23:38:42] (03PS4) 10JHathaway: Remove ordered_yaml function [puppet] - 10https://gerrit.wikimedia.org/r/763362 [23:39:25] (03PS5) 10JHathaway: Remove ordered_yaml function [puppet] - 10https://gerrit.wikimedia.org/r/763362 [23:40:10] RECOVERY - Check systemd state on doh6002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:41:42] (03CR) 10jerkins-bot: [V: 04-1] Remove ordered_yaml function [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway) [23:48:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P20950 and previous config saved to /var/cache/conftool/dbconfig/20220216-234850-marostegui.json [23:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:22] (03PS6) 10JHathaway: Remove ordered_yaml function [puppet] - 10https://gerrit.wikimedia.org/r/763362 [23:54:09] (03CR) 10jerkins-bot: [V: 04-1] Remove ordered_yaml function [puppet] - 
10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway) [23:55:50] (03PS7) 10JHathaway: Remove ordered_yaml function [puppet] - 10https://gerrit.wikimedia.org/r/763362 [23:56:36] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:58:44] (03CR) 10jerkins-bot: [V: 04-1] Remove ordered_yaml function [puppet] - 10https://gerrit.wikimedia.org/r/763362 (owner: 10JHathaway)