[00:00:20] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1094126 (https://phabricator.wikimedia.org/T352113) (owner: Tim Starling)
[00:00:21] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1096839 (https://phabricator.wikimedia.org/T352113) (owner: Tim Starling)
[00:00:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T370903)', diff saved to https://phabricator.wikimedia.org/P71271 and previous config saved to /var/cache/conftool/dbconfig/20241128-000023-ladsgroup.json
[00:00:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2154.codfw.wmnet with reason: Maintenance
[00:00:32] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[00:00:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2154.codfw.wmnet with reason: Maintenance
[00:00:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T370903)', diff saved to https://phabricator.wikimedia.org/P71272 and previous config saved to /var/cache/conftool/dbconfig/20241128-000046-ladsgroup.json
[00:01:11] (Merged) jenkins-bot: Move default main page text for new wikis to config [mediawiki-config] - https://gerrit.wikimedia.org/r/1094126 (https://phabricator.wikimedia.org/T352113) (owner: Tim Starling)
[00:01:14] (Merged) jenkins-bot: Introduce preinstall.dblist for wikis that haven't been installed yet [mediawiki-config] - https://gerrit.wikimedia.org/r/1096839 (https://phabricator.wikimedia.org/T352113) (owner: Tim Starling)
[00:01:47] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1094126|Move default main page text for new wikis to config (T352113)]],
[[gerrit:1096839|Introduce preinstall.dblist for wikis that haven't been installed yet (T352113)]]
[00:01:51] T352113: Move the addWiki.php maintenance script from WikimediaMaintenance into MediaWiki core - https://phabricator.wikimedia.org/T352113
[00:07:21] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1094126|Move default main page text for new wikis to config (T352113)]], [[gerrit:1096839|Introduce preinstall.dblist for wikis that haven't been installed yet (T352113)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[00:07:25] T352113: Move the addWiki.php maintenance script from WikimediaMaintenance into MediaWiki core - https://phabricator.wikimedia.org/T352113
[00:09:50] !log tstarling@deploy2002 tstarling: Continuing with sync
[00:15:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T370903)', diff saved to https://phabricator.wikimedia.org/P71273 and previous config saved to /var/cache/conftool/dbconfig/20241128-001528-ladsgroup.json
[00:15:33] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[00:16:29] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1094126|Move default main page text for new wikis to config (T352113)]], [[gerrit:1096839|Introduce preinstall.dblist for wikis that haven't been installed yet (T352113)]] (duration: 14m 42s)
[00:16:33] T352113: Move the addWiki.php maintenance script from WikimediaMaintenance into MediaWiki core - https://phabricator.wikimedia.org/T352113
[00:30:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P71274 and previous config saved to /var/cache/conftool/dbconfig/20241128-003035-ladsgroup.json
[00:38:27] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) -
https://gerrit.wikimedia.org/r/1098646
[00:38:27] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1098646 (owner: TrainBranchBot)
[00:38:54] (CR) Jdlrobson: [C:-1] Allow defaulting to Parsoid Read Views when MobileFrontEnd is active (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1098549 (https://phabricator.wikimedia.org/T381002) (owner: C. Scott Ananian)
[00:45:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P71275 and previous config saved to /var/cache/conftool/dbconfig/20241128-004542-ladsgroup.json
[00:47:11] (PS10) BryanDavis: [WIP] Allow provisioning MediaWiki with PHP 8.1 [puppet] - https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752)
[00:47:11] (PS1) BryanDavis: deployment-prep: Add PHP 8.1 appservers [puppet] - https://gerrit.wikimedia.org/r/1098647 (https://phabricator.wikimedia.org/T378752)
[00:56:46] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1098646 (owner: TrainBranchBot)
[01:00:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T370903)', diff saved to https://phabricator.wikimedia.org/P71276 and previous config saved to /var/cache/conftool/dbconfig/20241128-010049-ladsgroup.json
[01:00:52] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2161.codfw.wmnet with reason: Maintenance
[01:00:55] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[01:01:05] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2161.codfw.wmnet with reason: Maintenance
[01:01:12] !log ladsgroup@cumin1002 dbctl commit (dc=all):
'Depooling db2161 (T370903)', diff saved to https://phabricator.wikimedia.org/P71277 and previous config saved to /var/cache/conftool/dbconfig/20241128-010112-ladsgroup.json
[01:03:36] PROBLEM - dump of x1 in codfw on backupmon1001 is CRITICAL: dump for x1 at codfw (db2197) taken more than a week ago: Most recent backup 2024-11-19 00:49:20 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:08:14] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1098651
[01:08:14] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1098651 (owner: TrainBranchBot)
[01:16:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T370903)', diff saved to https://phabricator.wikimedia.org/P71278 and previous config saved to /var/cache/conftool/dbconfig/20241128-011559-ladsgroup.json
[01:16:05] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[01:26:24] (PS1) Tim Starling: Add frwiki on labs for new addWiki.php test [mediawiki-config] - https://gerrit.wikimedia.org/r/1098652
[01:27:22] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1098651 (owner: TrainBranchBot)
[01:31:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P71279 and previous config saved to /var/cache/conftool/dbconfig/20241128-013106-ladsgroup.json
[01:46:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P71280 and previous config saved to /var/cache/conftool/dbconfig/20241128-014613-ladsgroup.json
[01:47:22] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL -
/srv/docker/overlay2/1c839c80f5364bbf427963aee48b37467b14b9aa844afef0d7b69339d3615845/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:01:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T370903)', diff saved to https://phabricator.wikimedia.org/P71281 and previous config saved to /var/cache/conftool/dbconfig/20241128-020120-ladsgroup.json
[02:01:23] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2162.codfw.wmnet with reason: Maintenance
[02:01:26] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[02:01:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2162.codfw.wmnet with reason: Maintenance
[02:01:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2162 (T370903)', diff saved to https://phabricator.wikimedia.org/P71282 and previous config saved to /var/cache/conftool/dbconfig/20241128-020143-ladsgroup.json
[02:02:06] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:07:22] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:08:58] (CR) Samwilson: [C:+1] Add frwiki on labs for new addWiki.php test [mediawiki-config] - https://gerrit.wikimedia.org/r/1098652 (owner: Tim Starling)
[02:09:27] (CR) Tim
Starling: [C:+2] Add frwiki on labs for new addWiki.php test [mediawiki-config] - https://gerrit.wikimedia.org/r/1098652 (owner: Tim Starling)
[02:10:09] (Merged) jenkins-bot: Add frwiki on labs for new addWiki.php test [mediawiki-config] - https://gerrit.wikimedia.org/r/1098652 (owner: Tim Starling)
[02:16:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T370903)', diff saved to https://phabricator.wikimedia.org/P71283 and previous config saved to /var/cache/conftool/dbconfig/20241128-021629-ladsgroup.json
[02:16:34] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[02:31:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P71284 and previous config saved to /var/cache/conftool/dbconfig/20241128-023136-ladsgroup.json
[02:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[02:39:26] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:46:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P71285 and previous config saved to /var/cache/conftool/dbconfig/20241128-024644-ladsgroup.json
[03:01:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T370903)', diff saved to https://phabricator.wikimedia.org/P71286 and previous config saved to
/var/cache/conftool/dbconfig/20241128-030151-ladsgroup.json
[03:01:53] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2163.codfw.wmnet with reason: Maintenance
[03:01:56] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[03:02:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2163.codfw.wmnet with reason: Maintenance
[03:02:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T370903)', diff saved to https://phabricator.wikimedia.org/P71287 and previous config saved to /var/cache/conftool/dbconfig/20241128-030213-ladsgroup.json
[03:09:26] FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:17:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T370903)', diff saved to https://phabricator.wikimedia.org/P71288 and previous config saved to /var/cache/conftool/dbconfig/20241128-031726-ladsgroup.json
[03:17:33] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[03:32:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P71289 and previous config saved to /var/cache/conftool/dbconfig/20241128-033234-ladsgroup.json
[03:39:11] FIRING: [13x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:39:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance
cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[03:47:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P71290 and previous config saved to /var/cache/conftool/dbconfig/20241128-034741-ladsgroup.json
[04:02:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T370903)', diff saved to https://phabricator.wikimedia.org/P71291 and previous config saved to /var/cache/conftool/dbconfig/20241128-040248-ladsgroup.json
[04:02:51] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2164.codfw.wmnet with reason: Maintenance
[04:02:53] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[04:03:04] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2164.codfw.wmnet with reason: Maintenance
[04:03:06] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance
[04:03:19] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance
[04:03:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T370903)', diff saved to https://phabricator.wikimedia.org/P71292 and previous config saved to /var/cache/conftool/dbconfig/20241128-040326-ladsgroup.json
[04:18:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T370903)', diff saved to https://phabricator.wikimedia.org/P71294 and previous config saved to
/var/cache/conftool/dbconfig/20241128-041807-ladsgroup.json
[04:18:12] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[04:33:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P71296 and previous config saved to /var/cache/conftool/dbconfig/20241128-043314-ladsgroup.json
[04:38:00] (PS1) Santhosh: recommendation-api: Fix entrypoint [deployment-charts] - https://gerrit.wikimedia.org/r/1098684
[04:48:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P71297 and previous config saved to /var/cache/conftool/dbconfig/20241128-044822-ladsgroup.json
[04:59:41] (CR) KartikMistry: [C:+2] recommendation-api: Fix entrypoint [deployment-charts] - https://gerrit.wikimedia.org/r/1098684 (owner: Santhosh)
[05:01:03] (Merged) jenkins-bot: recommendation-api: Fix entrypoint [deployment-charts] - https://gerrit.wikimedia.org/r/1098684 (owner: Santhosh)
[05:03:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T370903)', diff saved to https://phabricator.wikimedia.org/P71298 and previous config saved to /var/cache/conftool/dbconfig/20241128-050329-ladsgroup.json
[05:03:31] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2166.codfw.wmnet with reason: Maintenance
[05:03:34] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[05:03:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2166.codfw.wmnet with reason: Maintenance
[05:03:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T370903)', diff saved to https://phabricator.wikimedia.org/P71299 and previous config saved
to /var/cache/conftool/dbconfig/20241128-050352-ladsgroup.json
[05:06:58] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[05:16:14] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1098652|Add frwiki on labs for new addWiki.php test]]
[05:18:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T370903)', diff saved to https://phabricator.wikimedia.org/P71300 and previous config saved to /var/cache/conftool/dbconfig/20241128-051833-ladsgroup.json
[05:18:39] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[05:22:00] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1098652|Add frwiki on labs for new addWiki.php test]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[05:23:15] !log tstarling@deploy2002 tstarling: Continuing with sync
[05:26:36] RECOVERY - snapshot of x1 in codfw on backupmon1001 is OK: Last snapshot for x1 at codfw (db2197) taken on 2024-11-28 04:58:13 (361 GiB, +0.4 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[05:29:56] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098652|Add frwiki on labs for new addWiki.php test]] (duration: 13m 41s)
[05:33:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P71301 and previous config saved to /var/cache/conftool/dbconfig/20241128-053340-ladsgroup.json
[05:37:38] (PS1) KartikMistry: Update recommendation-api to 2024-11-28-052541-production [deployment-charts] - https://gerrit.wikimedia.org/r/1098692
[05:41:44] (CR) KartikMistry: [C:+2] Update recommendation-api to 2024-11-28-052541-production [deployment-charts] - https://gerrit.wikimedia.org/r/1098692 (owner: KartikMistry)
[05:42:46]
(Merged) jenkins-bot: Update recommendation-api to 2024-11-28-052541-production [deployment-charts] - https://gerrit.wikimedia.org/r/1098692 (owner: KartikMistry)
[05:48:39] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[05:48:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P71302 and previous config saved to /var/cache/conftool/dbconfig/20241128-054847-ladsgroup.json
[06:03:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T370903)', diff saved to https://phabricator.wikimedia.org/P71303 and previous config saved to /var/cache/conftool/dbconfig/20241128-060355-ladsgroup.json
[06:03:58] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2167.codfw.wmnet with reason: Maintenance
[06:04:00] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[06:04:12] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2167.codfw.wmnet with reason: Maintenance
[06:04:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T370903)', diff saved to https://phabricator.wikimedia.org/P71304 and previous config saved to /var/cache/conftool/dbconfig/20241128-060418-ladsgroup.json
[06:16:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T370903)', diff saved to https://phabricator.wikimedia.org/P71305 and previous config saved to /var/cache/conftool/dbconfig/20241128-061647-ladsgroup.json
[06:16:52] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[06:31:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to
https://phabricator.wikimedia.org/P71306 and previous config saved to /var/cache/conftool/dbconfig/20241128-063155-ladsgroup.json
[06:33:17] (PS1) KartikMistry: recommendation-api: Increase helm timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1098811
[06:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[06:38:53] (CR) KartikMistry: [C:+2] recommendation-api: Increase helm timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1098811 (owner: KartikMistry)
[06:39:26] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:40:07] (Merged) jenkins-bot: recommendation-api: Increase helm timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1098811 (owner: KartikMistry)
[06:47:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P71307 and previous config saved to /var/cache/conftool/dbconfig/20241128-064702-ladsgroup.json
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T0700)
[07:00:05] marostegui, Amir1, and arnaudb: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T0700). nyaa~
[07:02:08] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[07:02:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T370903)', diff saved to https://phabricator.wikimedia.org/P71308 and previous config saved to /var/cache/conftool/dbconfig/20241128-070209-ladsgroup.json
[07:02:11] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2181.codfw.wmnet with reason: Maintenance
[07:02:16] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[07:02:25] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2181.codfw.wmnet with reason: Maintenance
[07:02:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T370903)', diff saved to https://phabricator.wikimedia.org/P71309 and previous config saved to /var/cache/conftool/dbconfig/20241128-070231-ladsgroup.json
[07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:07:32] PROBLEM - Disk space on thanos-be1002 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sde1 174649 MB (4% inode=92%): /srv/swift-storage/sdc1 151377 MB (3% inode=91%): /srv/swift-storage/sdf1 169531 MB (4% inode=91%): /srv/swift-storage/sdd1 177193 MB (4% inode=92%): /srv/swift-storage/sdg1 179316 MB (4% inode=92%): /srv/swift-storage/sdh1 163322 MB (4% inode=91%): /srv/swift-storage/sdi1 211835 MB (5% inode=92%): /srv/swift-st
[07:07:32] j1 163074 MB (4% inode=92%): /srv/swift-storage/sdk1 162405 MB (4% inode=91%): /srv/swift-storage/sdm1 175927 MB (4% inode=92%): /srv/swift-storage/sdn1 186693 MB (4% inode=92%): /srv/swift-storage/sdl1 152737 MB (4% inode=91%):
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1002&var-datasource=eqiad+prometheus/ops
[07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:26] FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:09:32] (PS1) Giuseppe Lavagetto: hiddenparma: add CSRF token config [puppet] - https://gerrit.wikimedia.org/r/1098819
[07:13:05] (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4604/co" [puppet] - https://gerrit.wikimedia.org/r/1098819 (owner: Giuseppe Lavagetto)
[07:13:53] (CR) Giuseppe Lavagetto: [V:+1 C:+2] hiddenparma: add CSRF token config [puppet] - https://gerrit.wikimedia.org/r/1098819 (owner: Giuseppe Lavagetto)
[07:15:12] FIRING: HelmReleaseBadStatus: Helm release recommendation-api-ng/main on k8s-mlstaging@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=recommendation-api-ng - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[07:17:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T370903)', diff saved to https://phabricator.wikimedia.org/P71310 and previous config saved to
/var/cache/conftool/dbconfig/20241128-071700-ladsgroup.json
[07:17:07] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[07:22:11] (PS1) Giuseppe Lavagetto: Release CSRF token support, some UI improvements [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1098863
[07:22:35] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Release CSRF token support, some UI improvements [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1098863 (owner: Giuseppe Lavagetto)
[07:22:59] !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "CSRF token support - oblivian@cumin1002"
[07:23:02] !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: CSRF token support - oblivian@cumin1002
[07:23:37] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: CSRF token support - oblivian@cumin1002
[07:23:38] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "CSRF token support - oblivian@cumin1002"
[07:24:28] (PS1) Varnent: Add foundation to list of wikis Office Wiki can import from.
[mediawiki-config] - https://gerrit.wikimedia.org/r/1098865 (https://phabricator.wikimedia.org/T381063)
[07:25:12] RESOLVED: HelmReleaseBadStatus: Helm release recommendation-api-ng/main on k8s-mlstaging@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=recommendation-api-ng - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[07:31:31] (PS1) Varnent: Enable Wikilove extension on Foundation Governance Wiki. [mediawiki-config] - https://gerrit.wikimedia.org/r/1098867 (https://phabricator.wikimedia.org/T381065)
[07:32:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P71312 and previous config saved to /var/cache/conftool/dbconfig/20241128-073207-ladsgroup.json
[07:39:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:42:06] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:43:43] (PS1) Varnent: Allow importing from Commons and English Wikipedia to Foundation Governance Wiki.
[mediawiki-config] - https://gerrit.wikimedia.org/r/1098868 (https://phabricator.wikimedia.org/T381066)
[07:44:48] (PS1) KartikMistry: Revert "recommendation-api: Increase helm timeout" [deployment-charts] - https://gerrit.wikimedia.org/r/1098869
[07:46:19] (CR) KartikMistry: [C:+2] Revert "recommendation-api: Increase helm timeout" [deployment-charts] - https://gerrit.wikimedia.org/r/1098869 (owner: KartikMistry)
[07:47:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P71313 and previous config saved to /var/cache/conftool/dbconfig/20241128-074714-ladsgroup.json
[07:47:22] (Merged) jenkins-bot: Revert "recommendation-api: Increase helm timeout" [deployment-charts] - https://gerrit.wikimedia.org/r/1098869 (owner: KartikMistry)
[07:56:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1022.eqiad.wmnet
[07:56:47] SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10364628 (ops-monitoring-bot) Draining ganeti1022.eqiad.wmnet of running VMs
[08:00:05] Amir1, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:02:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T370903)', diff saved to https://phabricator.wikimedia.org/P71314 and previous config saved to /var/cache/conftool/dbconfig/20241128-080221-ladsgroup.json [08:02:23] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2195.codfw.wmnet with reason: Maintenance [08:02:26] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [08:02:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2195.codfw.wmnet with reason: Maintenance [08:02:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T370903)', diff saved to https://phabricator.wikimedia.org/P71315 and previous config saved to /var/cache/conftool/dbconfig/20241128-080244-ladsgroup.json [08:15:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T370903)', diff saved to https://phabricator.wikimedia.org/P71316 and previous config saved to /var/cache/conftool/dbconfig/20241128-081514-ladsgroup.json [08:15:20] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [08:25:23] (03CR) 10AOkoth: [C:03+2] mailman: run tasks every 24 hours [puppet] - 10https://gerrit.wikimedia.org/r/1098489 (https://phabricator.wikimedia.org/T377045) (owner: 10AOkoth) [08:25:51] (03PS1) 10JMeybohm: jayme: Add basic cookbook bash completion [puppet] - 10https://gerrit.wikimedia.org/r/1098875 [08:27:10] (03CR) 10JMeybohm: [C:03+2] jayme: Add basic cookbook bash completion [puppet] - 10https://gerrit.wikimedia.org/r/1098875 (owner: 10JMeybohm) [08:30:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P71317 and previous config saved to 
/var/cache/conftool/dbconfig/20241128-083021-ladsgroup.json [08:32:48] (03PS1) 10Ilias Sarantopoulos: ml-services: increase readiness prob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098877 [08:35:56] (03PS2) 10Ilias Sarantopoulos: ml-services: increase readiness prob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098877 [08:41:31] !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [08:43:17] !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [08:44:36] (03PS2) 10Varnent: Allow importing from Commons and English Wikipedia to Foundation Governance Wiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098868 (https://phabricator.wikimedia.org/T381066) [08:45:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P71318 and previous config saved to /var/cache/conftool/dbconfig/20241128-084528-ladsgroup.json [08:45:49] (03PS3) 10Ilias Sarantopoulos: ml-services: increase readiness prob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098877 [08:46:58] (03CR) 10Vgutierrez: [C:03+1] benthos: add benthos for haproxy debug functions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [08:48:04] (03CR) 10Fabfur: benthos: add benthos for haproxy debug functions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [08:48:53] (03PS10) 10Fabfur: benthos: add benthos for haproxy debug functions [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) [08:49:20] (03PS1) 10Slyngshede: Only show sign in link for anonymous users [software/bitu] - 10https://gerrit.wikimedia.org/r/1098881 (https://phabricator.wikimedia.org/T380998) 
[08:53:40] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1058612 (https://phabricator.wikimedia.org/T371501) (owner: 10Cathal Mooney) [08:54:12] FIRING: HelmReleaseBadStatus: Helm release recommendation-api-ng/main on k8s-mlstaging@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=recommendation-api-ng - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:55:45] (03PS1) 10Slyngshede: Remove potential caching issue with permission info [software/bitu] - 10https://gerrit.wikimedia.org/r/1098884 [08:59:12] RESOLVED: HelmReleaseBadStatus: Helm release recommendation-api-ng/main on k8s-mlstaging@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=recommendation-api-ng - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:00:05] hashar and andre: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T0900) [09:00:11] (03PS1) 10DCausse: flink-app: add a component label to the flink-app configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098885 [09:00:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T370903)', diff saved to https://phabricator.wikimedia.org/P71319 and previous config saved to /var/cache/conftool/dbconfig/20241128-090035-ladsgroup.json [09:00:38] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2198.codfw.wmnet with reason: Maintenance [09:00:41] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [09:00:51] !log 
ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2198.codfw.wmnet with reason: Maintenance [09:06:02] o/ [09:06:18] !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [09:08:25] I am going to promote all wikis [09:08:55] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098887 (https://phabricator.wikimedia.org/T375664) [09:08:57] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098887 (https://phabricator.wikimedia.org/T375664) (owner: 10TrainBranchBot) [09:09:11] !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [09:09:40] (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098887 (https://phabricator.wikimedia.org/T375664) (owner: 10TrainBranchBot) [09:10:53] (03CR) 10Slyngshede: [C:03+2] Blocking: Allow multiple account managers groups [software/bitu] - 10https://gerrit.wikimedia.org/r/1097336 (owner: 10Slyngshede) [09:11:42] (03CR) 10KartikMistry: [C:03+2] ml-services: increase readiness prob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098877 (owner: 10Ilias Sarantopoulos) [09:12:29] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1098884 (owner: 10Slyngshede) [09:12:45] (03Merged) 10jenkins-bot: ml-services: increase readiness prob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098877 (owner: 10Ilias Sarantopoulos) [09:12:55] (03CR) 10Harroyo-wmf: [C:03+1] "I've found this useful to understand what this setting does: https://phabricator.wikimedia.org/diffusion/EEVB/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098561 (https://phabricator.wikimedia.org/T380599) (owner: 10Máté Szabó) [09:13:54] !log ladsgroup@cumin1002 START - 
Cookbook sre.hosts.downtime for 8:00:00 on db2200.codfw.wmnet with reason: Maintenance [09:14:08] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2200.codfw.wmnet with reason: Maintenance [09:14:15] (03Merged) 10jenkins-bot: Blocking: Allow multiple account managers groups [software/bitu] - 10https://gerrit.wikimedia.org/r/1097336 (owner: 10Slyngshede) [09:15:31] (03CR) 10Slyngshede: [C:03+2] Remove potential caching issue with permission info [software/bitu] - 10https://gerrit.wikimedia.org/r/1098884 (owner: 10Slyngshede) [09:17:55] (03Merged) 10jenkins-bot: Remove potential caching issue with permission info [software/bitu] - 10https://gerrit.wikimedia.org/r/1098884 (owner: 10Slyngshede) [09:18:36] (03PS11) 10Fabfur: benthos: add benthos for haproxy debug functions [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) [09:18:47] (03CR) 10Kosta Harlan: [C:03+1] Allow IRS to record server-side interaction events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098561 (https://phabricator.wikimedia.org/T380599) (owner: 10Máté Szabó) [09:19:15] (03PS1) 10Ilias Sarantopoulos: ml-services: recapi increase readiness prob in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098890 [09:20:19] (03PS12) 10Fabfur: benthos: add benthos for haproxy debug functions [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) [09:20:33] (03CR) 10Slyngshede: [C:03+2] Show CN as signed in username [software/bitu] - 10https://gerrit.wikimedia.org/r/1098470 (https://phabricator.wikimedia.org/T378344) (owner: 10Slyngshede) [09:20:41] (03PS9) 10Fabfur: hiera: add log ring to cp4039 [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) [09:20:58] (03CR) 10KartikMistry: [C:03+2] ml-services: recapi increase readiness prob in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098890 (owner: 
10Ilias Sarantopoulos) [09:21:17] 07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10364803 (10dcaro) >>! In T379927#10354355, @Andrew wrote: > From Gerrit, @dcaro writes: > > >> >> Did a quick test, there's thre... [09:22:12] (03Merged) 10jenkins-bot: ml-services: recapi increase readiness prob in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098890 (owner: 10Ilias Sarantopoulos) [09:22:19] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.5 refs T375664 [09:22:24] T375664: 1.44.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T375664 [09:23:09] (03Merged) 10jenkins-bot: Show CN as signed in username [software/bitu] - 10https://gerrit.wikimedia.org/r/1098470 (https://phabricator.wikimedia.org/T378344) (owner: 10Slyngshede) [09:23:41] !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[09:26:58] (03PS10) 10Fabfur: hiera: add log ring to cp4039 [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) [09:30:50] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@34b35a5] (releasing): Update Jenkins version on releases2003.codfw.wmnet [09:31:43] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@34b35a5] (releasing): Update Jenkins version on releases2003.codfw.wmnet (duration: 01m 27s) [09:33:37] (03CR) 10Slyngshede: [C:03+2] Blocking: Show current user LDAP status [software/bitu] - 10https://gerrit.wikimedia.org/r/1097378 (owner: 10Slyngshede) [09:35:13] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@34b35a5] (releasing): Update Jenkins version on releases1003.eqiad.wmnet [09:35:47] (03PS1) 10Jelto: trafficserver: switch query-scholarly to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1098891 (https://phabricator.wikimedia.org/T350793) [09:36:17] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@34b35a5] (releasing): Update Jenkins version on releases1003.eqiad.wmnet (duration: 01m 22s) [09:36:37] (03Merged) 10jenkins-bot: Blocking: Show current user LDAP status [software/bitu] - 10https://gerrit.wikimedia.org/r/1097378 (owner: 10Slyngshede) [09:39:48] (03CR) 10JMeybohm: [C:03+1] trafficserver: switch query-scholarly to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1098891 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [09:42:19] train is rather quiet as far as I can see [09:45:46] (03CR) 10JMeybohm: [C:03+1] modules: add mesh.configuration 1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098511 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey) [09:46:38] (03PS13) 10Fabfur: benthos: add benthos for haproxy debug functions [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) [09:47:04] (03PS11) 10Fabfur: hiera: add log ring to cp4039 [puppet] - 
10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) [09:47:20] (03CR) 10JMeybohm: modules: add health checks to the mesh's _tcp_cluster config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey) [09:47:37] (03CR) 10JMeybohm: [C:03+1] charts: update tegola-vector-tiles to mesh.configuration:1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey) [09:49:16] (03PS1) 10Alexandros Kosiaris: miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [09:49:16] (03CR) 10Alexandros Kosiaris: [C:04-1] "A couple of questions." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [09:52:41] (03PS7) 10Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) [09:52:42] (03PS6) 10Elukey: charts: update tegola-vector-tiles to mesh.configuration:1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647) [09:52:42] (03PS8) 10Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647) [09:53:07] (03CR) 10Elukey: modules: add health checks to the mesh's _tcp_cluster config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey) [09:53:52] (03PS8) 10Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) [09:53:52] (03PS7) 10Elukey: 
charts: update tegola-vector-tiles to mesh.configuration:1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647) [09:53:52] (03PS9) 10Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647) [09:54:41] (03PS9) 10Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) [09:54:41] (03PS8) 10Elukey: charts: update tegola-vector-tiles to mesh.configuration:1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647) [09:54:41] (03PS10) 10Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647) [09:55:32] (03PS10) 10Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) [09:55:32] (03PS9) 10Elukey: charts: update tegola-vector-tiles to mesh.configuration:1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647) [09:55:32] (03PS11) 10Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647) [09:55:48] (03CR) 10Elukey: "ok now it should be done :D" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey) [09:58:18] (03CR) 10JMeybohm: services: add health checks to Tegola's postgres TCP proxy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey) [09:59:02] (03CR) 10JMeybohm: 
[C:03+1] modules: add health checks to the mesh's _tcp_cluster config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey) [10:10:44] (03CR) 10Elukey: services: add health checks to Tegola's postgres TCP proxy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey) [10:11:05] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [10:11:31] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2012.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:11:47] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift [10:11:47] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [10:11:47] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [10:11:47] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [10:11:47] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [10:11:57] FIRING: ProbeDown: Service swift-https:443 has failed probes 
(http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:12:23] here [10:12:49] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [10:12:51] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.782 second response time https://wikitech.wikimedia.org/wiki/Swift [10:12:55] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.398 second response time https://wikitech.wikimedia.org/wiki/Swift [10:13:33] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2010.codfw.wmnet, ms-fe2013.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:13:43] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [10:13:44] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [10:13:49] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.819 second response time https://wikitech.wikimedia.org/wiki/Swift [10:13:49] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 
bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [10:13:57] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 9.841 second response time https://wikitech.wikimedia.org/wiki/Swift [10:14:05] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.383 second response time https://wikitech.wikimedia.org/wiki/Swift [10:14:12] FIRING: [13x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:14:26] Emperor: ^^ swift is struggling in codfw? [10:14:51] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 3.189 second response time https://wikitech.wikimedia.org/wiki/Swift [10:14:59] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:15:00] some kind of write timeouts? 
[10:15:05] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [10:15:30] seeing a bunch of swift.common.exceptions.ChunkWriteTimeout: 60.0 seconds [10:15:33] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:15:49] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [10:15:56] (03PS1) 10Slyngshede: data.yaml: Extend MOU for aitolkyn [puppet] - 10https://gerrit.wikimedia.org/r/1098900 [10:16:09] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 4.549 second response time https://wikitech.wikimedia.org/wiki/Swift [10:16:49] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift [10:16:51] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.969 second response time https://wikitech.wikimedia.org/wiki/Swift [10:17:21] the swift backends look impaired too [10:17:33] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:17:59] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:18:49] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.247 second response time https://wikitech.wikimedia.org/wiki/Swift [10:18:53] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.201 second response time https://wikitech.wikimedia.org/wiki/Swift [10:19:11] FIRING: [13x] ProbeDown: Service 
ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:19:13] yeah Swift had a hard time apparently [10:19:23] the FileOperation logging bucket has an elevated rate of errors [10:19:37] https://grafana.wikimedia.org/d/000000102/mediawiki-production-logging?orgId=1&refresh=5m&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-level=ERROR&var-channel=FileOperation [10:19:49] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift [10:20:05] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [10:20:33] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:20:39] [{reqId}] {exception_url} Wikimedia\FileBackend\FileBackendError: Iterator page I/O error. 
:) [10:20:49] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [10:20:59] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 9.428 second response time https://wikitech.wikimedia.org/wiki/Swift [10:21:05] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift [10:21:33] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:21:59] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:21:59] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:21:59] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:22:01] could it be related to the swift proxies in need of a restart? 
[10:22:49] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift [10:22:49] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift [10:22:49] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [10:22:49] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [10:22:49] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [10:22:59] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:23:33] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:23:33] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:23:49] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [10:23:55] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.355 second response time https://wikitech.wikimedia.org/wiki/Swift [10:23:55] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 5.593 second response time https://wikitech.wikimedia.org/wiki/Swift [10:23:57] or maybe an ms-be misbehaving [10:24:16] elukey: see -private [10:24:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx 
errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [10:26:05] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift [10:26:33] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:26:51] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift [10:26:51] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift [10:26:57] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 7.254 second response time https://wikitech.wikimedia.org/wiki/Swift [10:26:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:26:59] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:27:06] FIRING: [13x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:27:09] RECOVERY - Swift https backend on 
ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 5.204 second response time https://wikitech.wikimedia.org/wiki/Swift [10:27:25] !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [10:27:27] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:27:53] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.839 second response time https://wikitech.wikimedia.org/wiki/Swift [10:28:49] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift [10:29:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [10:29:57] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.473 second response time https://wikitech.wikimedia.org/wiki/Swift [10:30:34] !incidents [10:30:35] 5492 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule) [10:30:35] 5493 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [10:30:35] 5494 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [10:30:35] 5495 (ACKED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [10:30:36] 5491 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [10:30:36] 5482 (RESOLVED) [10x] ProbeDown sre (probes/service magru) [10:30:49] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 
200 OK - 506 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Swift [10:30:59] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [10:31:33] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet, ms-fe2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:31:49] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [10:32:05] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift [10:32:27] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:32:33] !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[10:32:33] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:32:51] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [10:32:51] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [10:33:33] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:33:44] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [10:33:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [10:33:49] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift [10:34:49] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [10:34:51] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift [10:34:59] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift 
[10:35:05] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift [10:35:49] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.267 second response time https://wikitech.wikimedia.org/wiki/Swift [10:35:49] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Swift [10:36:43] RECOVERY - snapshot of x1 in eqiad on backupmon1001 is OK: Last snapshot for x1 at eqiad (db1216) taken on 2024-11-28 10:17:50 (326 GiB, +0.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [10:36:48] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:37:06] FIRING: [13x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:39:26] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:39:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [10:40:45] 06SRE, 10Incident-Reporting-System (Pilot wiki release December 2024), 10Trust and Safety Product Sprint (Sprint Gong (November 18 - December 6)): Allow Extension:ReportIncident to make POST requests to wikimediats.zendesk.com - https://phabricator.wikimedia.org/T380908#10364928 
(10kostajh) 05In progres... [10:41:45] !incidents [10:41:45] 5494 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [10:41:45] 5493 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [10:41:46] 5492 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [10:41:46] 5495 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [10:41:46] 5491 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [10:41:46] 5482 (RESOLVED) [10x] ProbeDown sre (probes/service magru) [10:42:12] (03PS1) 10Gerrit maintenance bot: Add zh to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1098905 (https://phabricator.wikimedia.org/T380119) [10:43:33] (03CR) 10CI reject: [V:04-1] Add zh to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1098905 (https://phabricator.wikimedia.org/T380119) (owner: 10Gerrit maintenance bot) [10:44:54] 07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10365001 (10fnegri) @dcaro thanks for that analysis! I had a look at the [source code for Resolv::DNS](https://github.com/ruby/ruby/... 
[10:48:02] jouncebot: next [10:48:02] In 0 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1100) [10:49:51] RESOLVED: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [10:50:45] (03PS1) 10Giuseppe Lavagetto: Bugfix for commit [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1098909 [10:50:53] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Bugfix for commit [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1098909 (owner: 10Giuseppe Lavagetto) [10:51:15] !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Fix commit bug - oblivian@cumin1002" [10:51:17] !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix commit bug - oblivian@cumin1002 [10:51:49] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix commit bug - oblivian@cumin1002 [10:51:50] !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Fix commit bug - oblivian@cumin1002" [10:52:28] (03PS1) 10Muehlenhoff: Extend access for baitolykin [puppet] - 10https://gerrit.wikimedia.org/r/1098910 [10:57:23] (03PS12) 10Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647) [10:57:42] (03CR) 10Elukey: services: add health checks to Tegola's postgres TCP proxy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey) 
[11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1100) [11:00:18] (03PS1) 10Ilias Sarantopoulos: Revert "ml-services: recapi increase readiness prob in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098911 [11:00:25] (03PS1) 10Ilias Sarantopoulos: Revert "ml-services: increase readiness prob" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098912 [11:03:28] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.pool db2237 gradually with 4 steps - Maint over (T379813) [11:03:32] T379813: Wikimedia\Rdbms\DBQueryError: Error 1034: Index for table 'wbc_entity_usage' is corrupt; try to repair itFunction: Wikibase\Client\Usage\Sql\EntityUsageTable::queryUsagesQuery: SELECT eu_aspect,eu_entity_id FROM `wbc_entity - https://phabricator.wikimedia.org/T379813 [11:03:56] (03CR) 10JMeybohm: [C:03+1] services: add health checks to Tegola's postgres TCP proxy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey) [11:04:18] (03PS1) 10Máté Szabó: ReportIncident: Enable instrumentation on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098913 (https://phabricator.wikimedia.org/T372823) [11:04:37] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2033.codfw.wmnet with reason: Maintenance [11:04:51] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2033.codfw.wmnet with reason: Maintenance [11:04:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling es2033 (T376905)', diff saved to https://phabricator.wikimedia.org/P71324 and previous config saved to /var/cache/conftool/dbconfig/20241128-110457-ladsgroup.json [11:06:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098561 (https://phabricator.wikimedia.org/T380599) (owner: 10Máté Szabó) [11:08:18] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2204.codfw.wmnet with reason: Maintenance [11:08:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2204.codfw.wmnet with reason: Maintenance [11:09:26] FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:10:20] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1201.eqiad.wmnet with reason: Maintenance [11:10:34] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1201.eqiad.wmnet with reason: Maintenance [11:10:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1022.eqiad.wmnet [11:10:51] 07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10365134 (10fnegri) [11:11:29] !log removing ganeti1022 from active Ganeti nodes T378921 [11:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:33] T378921: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921 [11:11:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2033 (T376905)', diff saved to https://phabricator.wikimedia.org/P71325 and previous config saved to /var/cache/conftool/dbconfig/20241128-111154-ladsgroup.json [11:12:40] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on 
db1236.eqiad.wmnet with reason: Maintenance [11:12:49] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10365139 (10MoritzMuehlenhoff) [11:12:54] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1236.eqiad.wmnet with reason: Maintenance [11:13:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1236 (T370903)', diff saved to https://phabricator.wikimedia.org/P71326 and previous config saved to /var/cache/conftool/dbconfig/20241128-111300-ladsgroup.json [11:13:06] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [11:13:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098913 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó) [11:14:01] PROBLEM - ganeti-noded running on ganeti1022 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [11:14:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1018.eqiad.wmnet [11:14:28] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10365147 (10ops-monitoring-bot) Draining ganeti1018.eqiad.wmnet of running VMs [11:14:35] PROBLEM - ganeti-confd running on ganeti1022 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [11:15:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T370903)', diff saved to https://phabricator.wikimedia.org/P71327 and 
previous config saved to /var/cache/conftool/dbconfig/20241128-111510-ladsgroup.json [11:17:06] FIRING: [13x] ProbeDown: Service ganeti1022:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:21:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1018.eqiad.wmnet [11:22:06] FIRING: [13x] ProbeDown: Service ganeti1022:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:23:31] jouncebot: nowandnext [11:23:31] For the next 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1100) [11:23:31] In 1 hour(s) and 36 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1300) [11:27:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2033', diff saved to https://phabricator.wikimedia.org/P71329 and previous config saved to /var/cache/conftool/dbconfig/20241128-112701-ladsgroup.json [11:29:14] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1098910 (owner: 10Muehlenhoff) [11:29:22] (03Abandoned) 10Slyngshede: data.yaml: Extend MOU for aitolkyn [puppet] - 10https://gerrit.wikimedia.org/r/1098900 (owner: 10Slyngshede) [11:30:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P71330 and previous config saved to /var/cache/conftool/dbconfig/20241128-113017-ladsgroup.json [11:30:24] (03PS1) 10Ladsgroup: Bump ratio of new parsercache key spec to 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098914 
(https://phabricator.wikimedia.org/T373037) [11:30:25] (03CR) 10Muehlenhoff: [C:03+2] Extend access for baitolykin [puppet] - 10https://gerrit.wikimedia.org/r/1098910 (owner: 10Muehlenhoff) [11:31:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1018.eqiad.wmnet [11:32:32] (03PS1) 10Tim Starling: addWiki.php tweaks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098915 [11:32:32] (03PS1) 10Tim Starling: Run dumpInterwiki.php locally with no changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098916 [11:32:32] (03PS1) 10Tim Starling: Prepare id.wikivoyage.org for installation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098917 (https://phabricator.wikimedia.org/T380726) [11:32:35] (03PS1) 10Tim Starling: Activate id.wikivoyage.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098918 (https://phabricator.wikimedia.org/T380726) [11:33:23] (03CR) 10CI reject: [V:04-1] Prepare id.wikivoyage.org for installation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098917 (https://phabricator.wikimedia.org/T380726) (owner: 10Tim Starling) [11:34:11] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10365209 (10ops-monitoring-bot) Draining ganeti1018.eqiad.wmnet of running VMs [11:34:44] (03CR) 10Ladsgroup: [C:03+1] addWiki.php tweaks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098915 (owner: 10Tim Starling) [11:38:27] (03CR) 10Ladsgroup: Run dumpInterwiki.php locally with no changes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098916 (owner: 10Tim Starling) [11:39:27] (03PS2) 10Tim Starling: addWiki.php tweaks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098915 [11:39:27] (03PS2) 10Tim Starling: Run dumpInterwiki.php locally with no changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098916 [11:39:27] (03PS2) 10Tim Starling: Prepare 
id.wikivoyage.org for installation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098917 (https://phabricator.wikimedia.org/T380726) [11:39:28] (03PS2) 10Tim Starling: Activate id.wikivoyage.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098918 (https://phabricator.wikimedia.org/T380726) [11:39:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:41:25] 07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10365244 (10dcaro) Nice! I tried with: ` resolver = Resolv::DNS.new( :nameserver => '127.0.0.1', :raise_timeout_erros =>... [11:41:29] (03CR) 10Ladsgroup: [C:03+1] addWiki.php tweaks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098915 (owner: 10Tim Starling) [11:42:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2033', diff saved to https://phabricator.wikimedia.org/P71333 and previous config saved to /var/cache/conftool/dbconfig/20241128-114208-ladsgroup.json [11:44:42] 10SRE-swift-storage: internal_api_error_UploadChunkFileException - https://phabricator.wikimedia.org/T381093#10365253 (10Nemoralis) I also encountered this error when uploading a bulk file recently. 
> An unknown error occurred in storage backend "local-swift-codfw" [11:45:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P71334 and previous config saved to /var/cache/conftool/dbconfig/20241128-114524-ladsgroup.json [11:48:52] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2237 gradually with 4 steps - Maint over (T379813) [11:48:56] T379813: Wikimedia\Rdbms\DBQueryError: Error 1034: Index for table 'wbc_entity_usage' is corrupt; try to repair itFunction: Wikibase\Client\Usage\Sql\EntityUsageTable::queryUsagesQuery: SELECT eu_aspect,eu_entity_id FROM `wbc_entity - https://phabricator.wikimedia.org/T379813 [11:50:12] (03CR) 10Muehlenhoff: "Looks good in general, two comments/questions inline" [puppet] - 10https://gerrit.wikimedia.org/r/1098023 (https://phabricator.wikimedia.org/T380402) (owner: 10Slyngshede) [11:50:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098914 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup) [11:51:34] (03Merged) 10jenkins-bot: Bump ratio of new parsercache key spec to 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098914 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup) [11:51:57] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1098914|Bump ratio of new parsercache key spec to 2 (T373037)]] [11:52:02] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [11:53:09] (03CR) 10Muehlenhoff: "Well, no your patch needs to be merged first :-)" [puppet] - 10https://gerrit.wikimedia.org/r/1074381 (owner: 10Muehlenhoff) [11:54:55] (03PS3) 10Muehlenhoff: Move Puppet CA monitoring out of the puppetmaster module [puppet] - 10https://gerrit.wikimedia.org/r/1091194 (https://phabricator.wikimedia.org/T365798) [11:57:12] !log 
ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1098914|Bump ratio of new parsercache key spec to 2 (T373037)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:57:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2033 (T376905)', diff saved to https://phabricator.wikimedia.org/P71336 and previous config saved to /var/cache/conftool/dbconfig/20241128-115715-ladsgroup.json [11:57:17] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [11:57:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2031.codfw.wmnet with reason: Maintenance [11:57:35] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2031.codfw.wmnet with reason: Maintenance [11:57:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling es2031 (T376905)', diff saved to https://phabricator.wikimedia.org/P71337 and previous config saved to /var/cache/conftool/dbconfig/20241128-115741-ladsgroup.json [11:57:46] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [11:58:32] (03CR) 10Kosta Harlan: [C:03+1] ReportIncident: Enable instrumentation on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098913 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó) [11:59:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1091194 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [12:00:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T370903)', diff saved to https://phabricator.wikimedia.org/P71338 and previous config saved to /var/cache/conftool/dbconfig/20241128-120031-ladsgroup.json [12:00:36] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [12:04:35] !log ladsgroup@deploy2002 Finished scap sync-world: 
Backport for [[gerrit:1098914|Bump ratio of new parsercache key spec to 2 (T373037)]] (duration: 12m 37s) [12:04:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2031 (T376905)', diff saved to https://phabricator.wikimedia.org/P71339 and previous config saved to /var/cache/conftool/dbconfig/20241128-120437-ladsgroup.json [12:04:39] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [12:05:19] 10SRE-swift-storage: internal_api_error_UploadChunkFileException - https://phabricator.wikimedia.org/T381093#10365302 (10Yann) Again after the last chunk :((( `02957: FAILED: internal_api_error_DBQueryError: [6d8711c6-4dea-4c57-a254-3e8c35471315] Caught exception of type Wikimedia\Rdbms\DBQueryError` [12:11:15] (03PS14) 10Fabfur: benthos: add benthos for haproxy debug functions [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) [12:11:41] (03PS12) 10Fabfur: hiera: add log ring to cp4039 [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) [12:17:20] (03PS1) 10Máté Szabó: ReportIncident: Setup $wgReportIncidentLocalLinks for ptwiki pilot deploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098929 (https://phabricator.wikimedia.org/T380277) [12:19:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2031', diff saved to https://phabricator.wikimedia.org/P71340 and previous config saved to /var/cache/conftool/dbconfig/20241128-121943-ladsgroup.json [12:21:21] 10SRE-swift-storage: internal_api_error_UploadChunkFileException - https://phabricator.wikimedia.org/T381093#10365335 (10MatthewVernon) There was an incident that impacted codfw swift earlier today (from around 09:55 to 10:55 UTC); this seems likely a consequence of that, so I'd expect a retry would now be succe... 
[12:23:48] !log klausman@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:23:52] (03PS9) 10Clément Goubert: mediawiki: Add mwcron feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) [12:24:49] (03CR) 10Cathal Mooney: [C:03+1] LVS: enable do_ipv6_ra_primary in all sites [puppet] - 10https://gerrit.wikimedia.org/r/1093958 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh) [12:25:16] (03Abandoned) 10Cathal Mooney: Temporarily change cumin installserver alias to not include mgaru [puppet] - 10https://gerrit.wikimedia.org/r/1093322 (https://phabricator.wikimedia.org/T376737) (owner: 10Cathal Mooney) [12:28:37] (03CR) 10Muehlenhoff: [C:04-1] cloudweb: Restrict access to Envoy port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1098556 (owner: 10Muehlenhoff) [12:29:05] (03CR) 10Clément Goubert: mediawiki: Add mwcron feature (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [12:29:35] (03CR) 10Clément Goubert: mediawiki: Add mwcron feature (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [12:34:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2031', diff saved to https://phabricator.wikimedia.org/P71342 and previous config saved to /var/cache/conftool/dbconfig/20241128-123451-ladsgroup.json [12:39:56] (03PS1) 10Jaime Nuche: scap target: ensure scap is installed on host before it is required [puppet] - 10https://gerrit.wikimedia.org/r/1098933 (https://phabricator.wikimedia.org/T378769) [12:42:43] (03CR) 10Jaime Nuche: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1098933 (https://phabricator.wikimedia.org/T378769) (owner: 10Jaime Nuche) [12:49:50] (03PS1) 10Muehlenhoff: Add 
ferm macro/nftables set for loadbalancer nodes [puppet] - 10https://gerrit.wikimedia.org/r/1098936 [12:49:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2031 (T376905)', diff saved to https://phabricator.wikimedia.org/P71343 and previous config saved to /var/cache/conftool/dbconfig/20241128-124957-ladsgroup.json [12:52:27] (03CR) 10Arturo Borrero Gonzalez: toolforge::prometheus: add exporter for the k8s cert expiry (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro) [12:53:38] (03PS1) 10Btullis: Add keytab files for the hadoop workers in the analytics horizon project [labs/private] - 10https://gerrit.wikimedia.org/r/1098937 (https://phabricator.wikimedia.org/T381087) [12:54:36] (03CR) 10Btullis: [V:03+2 C:03+2] Add keytab files for the hadoop workers in the analytics horizon project [labs/private] - 10https://gerrit.wikimedia.org/r/1098937 (https://phabricator.wikimedia.org/T381087) (owner: 10Btullis) [12:55:33] (03CR) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro) [12:56:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098509 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [12:56:49] PROBLEM - Kafka broker TLS certificate validity on kafka-main1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [12:57:08] PROBLEM - Kafka Broker Server #page on kafka-main1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties 
https://wikitech.wikimedia.org/wiki/Kafka/Administration [12:57:08] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka-main1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [12:57:21] !incidents [12:57:21] 5496 (UNACKED) kafka-main1002/Kafka Broker Server (paged) [12:57:21] 5494 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [12:57:22] 5493 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [12:57:22] 5492 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [12:57:22] 5495 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [12:57:22] 5491 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw) [12:57:27] (03PS16) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) [12:57:30] !ack 5496 [12:57:30] 5496 (ACKED) kafka-main1002/Kafka Broker Server (paged) [12:58:08] (03CR) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro) [12:58:14] effie: might this be related to your work? 
[12:58:40] but kafka-main1002 is now a spare [12:58:41] sigh [12:58:48] downtime just expired [12:59:26] FIRING: [3x] SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:59:42] sure, but, since yesterday the server is a spare server [12:59:56] puppet is disabled [13:00:03] so that hasn't taken effect [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1300) [13:00:08] wait [13:00:24] "Puppet is disabled. Hardware [13:00:59] ok ok my miss, however I still have questions [13:01:02] I will run puppet now [13:01:34] what's going on with kafka-main? it's impacting the CDN [13:03:39] 06SRE, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#10365419 (10Krd) Please unbreak now. [13:04:46] effie has been refreshing some of the hosts but it shouldn't be impacting the CDN. vgutierrez: where can I see the impact? [13:04:51] (03CR) 10Arturo Borrero Gonzalez: toolforge::prometheus: add exporter for the k8s cert expiry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro) [13:05:23] vgutierrez: we have switched to 1007 since yesterday [13:05:52] we've been getting lag alerts this morning [13:06:17] vgutierrez: can you please elaborate ? 
[13:06:22] and some yesterday too, but as they were localized in magru I thought was due to the activities [13:06:56] (03PS1) 10NMW03: Revert^2 "Add contact form for U4C" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098939 [13:07:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098939 (owner: 10NMW03) [13:08:02] effie: -sre-private [13:08:28] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1098936 (owner: 10Muehlenhoff) [13:09:36] (03CR) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro) [13:09:50] (03PS2) 10NMW03: Revert^2 "Add contact form for U4C" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098939 [13:10:14] effie: sorry for the double channel switch, my bad, the alert we received were about lag: `PurgedHighEventLag: High event process lag with purged on cp5017` and such [13:10:45] no problem [13:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:11:40] (03CR) 10Muehlenhoff: [C:03+2] Move Puppet CA monitoring out of the puppetmaster module [puppet] - 10https://gerrit.wikimedia.org/r/1091194 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [13:12:18] jouncebot: next [13:12:18] In 0 hour(s) and 47 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1400) [13:12:46] 
jouncebot: now [13:12:46] For the next 0 hour(s) and 47 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1300) [13:13:15] (03PS1) 10Cathal Mooney: Remove JIO direct path via peering from AVOIDED-PATHS in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/1098942 (https://phabricator.wikimedia.org/T373015) [13:14:15] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Packet loss reflected in NELs for traffic to Reliance Jio Infocomm Ltd over BBIX Singapore - https://phabricator.wikimedia.org/T373015#10365435 (10cmooney) I tested removing this as-path from being avoided on cr2-eqsin and there was no pack... [13:14:43] (03CR) 10Cathal Mooney: [C:03+2] Remove JIO direct path via peering from AVOIDED-PATHS in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/1098942 (https://phabricator.wikimedia.org/T373015) (owner: 10Cathal Mooney) [13:15:29] (03Merged) 10jenkins-bot: Remove JIO direct path via peering from AVOIDED-PATHS in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/1098942 (https://phabricator.wikimedia.org/T373015) (owner: 10Cathal Mooney) [13:16:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:21:57] (03CR) 10Arturo Borrero Gonzalez: toolforge::prometheus: add exporter for the k8s cert expiry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro) [13:24:54] (03CR) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 
10David Caro) [13:27:14] 07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10365459 (10fnegri) > kinda weird behavior if you ask me I agree this is quite confusing and also poorly documented. One thing I d... [13:30:28] (03PS17) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) [13:31:09] (03CR) 10CI reject: [V:04-1] toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro) [13:37:03] (03PS5) 10Krinkle: [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01) [13:39:02] (03CR) 10Krinkle: [C:03+1] "I've folded the code into cluster_fe_hash and added a doc comment indicating the restrictions and caveats learned from last time this was " [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01) [13:39:25] (03CR) 10Zabe: [C:04-1] "nope" [dns] - 10https://gerrit.wikimedia.org/r/1098905 (https://phabricator.wikimedia.org/T380119) (owner: 10Gerrit maintenance bot) [13:39:41] (03PS6) 10Krinkle: [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01) [13:45:06] (03PS1) 10Muehlenhoff: ganeti1022: update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1098951 [13:46:39] (03PS7) 10Krinkle: [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01) [13:46:49] (03CR) 10Gergő Tisza: [SUL3] varnish: Split frontend 
cache on `sul3OptIn` cookie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01) [13:47:09] (03CR) 10Krinkle: [C:03+1] "I've tried to summarise the situation in the commit message as best I can, for review by SRE. @Derick/Gergo is this accurate?" [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01) [13:48:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2028.codfw.wmnet with reason: Maintenance [13:48:53] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2028.codfw.wmnet with reason: Maintenance [13:49:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling es2028 (T376905)', diff saved to https://phabricator.wikimedia.org/P71344 and previous config saved to /var/cache/conftool/dbconfig/20241128-134859-ladsgroup.json [13:49:07] (03CR) 10Muehlenhoff: [C:03+2] ganeti1022: update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1098951 (owner: 10Muehlenhoff) [13:49:57] (03CR) 10Krinkle: [C:03+1] [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01) [13:54:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2028 (T376905)', diff saved to https://phabricator.wikimedia.org/P71345 and previous config saved to /var/cache/conftool/dbconfig/20241128-135451-ladsgroup.json [13:55:23] (03PS1) 10Muehlenhoff: cloudweb/codfw1dev: Use firewall::service for firewall definitions [puppet] - 10https://gerrit.wikimedia.org/r/1098952 [13:56:50] (03CR) 10Elukey: [C:03+2] modules: add mesh.configuration 1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098511 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey) [13:57:52] (03Merged) 10jenkins-bot: modules: add 
mesh.configuration 1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098511 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey) [13:58:11] (03CR) 10Elukey: [C:03+2] modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey) [13:58:18] (03CR) 10CI reject: [V:04-1] modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey) [13:58:20] (03PS11) 10Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) [13:58:26] (03PS10) 10Elukey: charts: update tegola-vector-tiles to mesh.configuration:1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647) [13:58:31] (03PS13) 10Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647) [13:58:47] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1098952 (owner: 10Muehlenhoff) [13:58:56] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey) [13:59:18] (03PS18) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) [13:59:57] (03Merged) 10jenkins-bot: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey) [14:00:00] (03CR) 10CI reject: [V:04-1] toolforge::prometheus: add exporter for the k8s cert expiry 
[puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1400). nyaa~ [14:00:05] tgr, mszabo, abijeet, and Nemoralis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:07] (03CR) 10Elukey: [C:03+2] charts: update tegola-vector-tiles to mesh.configuration:1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey) [14:00:14] (03CR) 10Elukey: [C:03+2] services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey) [14:00:42] o/ [14:00:44] o/ [14:01:41] (03PS19) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) [14:02:17] (03CR) 10Majavah: cloudweb: Restrict access to Envoy port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1098556 (owner: 10Muehlenhoff) [14:02:25] (03CR) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro) [14:02:26] I can probably deploy in a few minutes [14:02:27] (03CR) 10CI reject: [V:04-1] toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro) [14:02:30] o/ [14:02:42] thanks Lucas [14:02:55] tgr|away: I’d say feel free to start if you want to self-service :) 
[14:02:55] i can deploy now if you want me to? [14:02:59] or that [14:03:02] sure! [14:03:11] tgr's patches are backports, so they'd take 20 mins on CI anyway [14:03:15] (03PS20) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) [14:03:17] (or well, maybe not for CA) [14:03:20] (03CR) 10Urbanecm: [C:03+2] Use `useformat` query param for device detection or mobile domain (m.) [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1098622 (https://phabricator.wikimedia.org/T380646) (owner: 10Gergő Tisza) [14:03:20] (03CR) 10Urbanecm: [C:03+2] Use `useformat` query param for device detection or mobile domain (m.) [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098623 (https://phabricator.wikimedia.org/T380646) (owner: 10Gergő Tisza) [14:03:51] mszabo: i have to say, starting commit messages with "Allow IRS to record" prompts whole other meanings in my head [14:04:06] (03PS2) 10Máté Szabó: Allow IRS to record server-side interaction events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098561 (https://phabricator.wikimedia.org/T380599) [14:04:10] (03CR) 10Urbanecm: [C:03+2] Allow IRS to record server-side interaction events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098561 (https://phabricator.wikimedia.org/T380599) (owner: 10Máté Szabó) [14:04:24] urbanecm: I've cracked that joke a few times in meetings but nobody laughed, in part possibly due to negative past interactions with said abbreviation [14:04:40] hehe [14:04:56] (03Merged) 10jenkins-bot: Allow IRS to record server-side interaction events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098561 (https://phabricator.wikimedia.org/T380599) (owner: 10Máté Szabó) [14:04:57] "server-side events now subject to 3.5% VAT" [14:05:14] or a 10% tariff... 
[14:05:35] (03CR) 10Urbanecm: ReportIncident: Enable instrumentation on labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098913 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó) [14:05:51] submit a W-1776 to report an incident [14:05:52] now we need a backronym for HMRC [14:06:13] mszabo: i can never remember those numbers [14:06:34] (03CR) 10David Caro: [C:03+2] toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro) [14:06:38] (03PS3) 10NMW03: Revert^2 "Add contact form for U4C" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098939 [14:06:42] (03CR) 10Urbanecm: [C:03+2] Revert^2 "Add contact form for U4C" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098939 (owner: 10NMW03) [14:06:49] urbanecm: yeah, I'm happy not to have to deal with that [14:06:51] !log klausman@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [14:06:54] yep yep [14:07:06] mszabo: on a serious note, can you take a look at my comment at https://gerrit.wikimedia.org/r/1098913, please? [14:07:23] abijeet: hi, around too? [14:07:24] (03Merged) 10jenkins-bot: Revert^2 "Add contact form for U4C" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098939 (owner: 10NMW03) [14:07:25] urbanecm: sure, one sec [14:08:11] tgr|away: phan fails for one of your backports (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1098622), can you take a look, please? 
[14:08:27] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1098561|Allow IRS to record server-side interaction events (T380599)]], [[gerrit:1098939|Revert^2 "Add contact form for U4C"]] [14:08:27] (03PS1) 10ArielGlenn: extend account creation lookup service to cover forced creations by others [extensions/CheckUser] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098956 (https://phabricator.wikimedia.org/T378401) [14:08:32] T380599: Record server-side interaction event for IRS non-emergency flow submissions - https://phabricator.wikimedia.org/T380599 [14:08:39] * Lucas_WMDE is now also available for deployment if needed [14:08:55] although i guess `Error cloning https://gerrit.wikimedia.org/r/mediawiki/extensions/CheckUser to /workspace/src/extensions/CheckUser` might be transient... [14:09:31] (03CR) 10CI reject: [V:04-1] Use `useformat` query param for device detection or mobile domain (m.) [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1098622 (https://phabricator.wikimedia.org/T380646) (owner: 10Gergő Tisza) [14:09:48] (03CR) 10Urbanecm: Use `useformat` query param for device detection or mobile domain (m.) [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1098622 (https://phabricator.wikimedia.org/T380646) (owner: 10Gergő Tisza) [14:09:52] i'll try again [14:09:54] (03CR) 10Urbanecm: [C:03+2] Use `useformat` query param for device detection or mobile domain (m.) 
[extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1098622 (https://phabricator.wikimedia.org/T380646) (owner: 10Gergő Tisza) [14:09:56] thx [14:09:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2028', diff saved to https://phabricator.wikimedia.org/P71346 and previous config saved to /var/cache/conftool/dbconfig/20241128-140958-ladsgroup.json [14:10:02] (03PS2) 10Máté Szabó: ReportIncident: Enable instrumentation on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098913 (https://phabricator.wikimedia.org/T372823) [14:10:07] (03CR) 10Máté Szabó: ReportIncident: Enable instrumentation on labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098913 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó) [14:10:14] urbanecm: done [14:10:33] mszabo: ty! [14:11:37] (03CR) 10Urbanecm: [C:03+2] ReportIncident: Enable instrumentation on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098913 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó) [14:12:20] (03Merged) 10jenkins-bot: ReportIncident: Enable instrumentation on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098913 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó) [14:12:39] mszabo: should be deployed on beta automatically [14:13:41] (03CR) 10Gergő Tisza: [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01) [14:14:28] (03Merged) 10jenkins-bot: Use `useformat` query param for device detection or mobile domain (m.) 
[extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098623 (https://phabricator.wikimedia.org/T380646) (owner: 10Gergő Tisza) [14:14:32] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10365530 (10MoritzMuehlenhoff) [14:14:37] !log urbanecm@deploy2002 nmw03, mszabo, urbanecm: Backport for [[gerrit:1098561|Allow IRS to record server-side interaction events (T380599)]], [[gerrit:1098939|Revert^2 "Add contact form for U4C"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:14:42] T380599: Record server-side interaction event for IRS non-emergency flow submissions - https://phabricator.wikimedia.org/T380599 [14:14:45] mszabo: Nemoralis: can you test at mwdebug, please? [14:14:48] sure [14:14:57] !log installing apr security updates [14:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:37] LGTM! [14:15:46] ty! [14:15:47] urbanecm: lgtm [14:15:49] ty [14:15:50] !log urbanecm@deploy2002 nmw03, mszabo, urbanecm: Continuing with sync [14:15:53] proceeding [14:16:29] abijeet: hi, around for deployment? [14:17:47] urbanecm, hey, I'm here [14:17:57] hello :) [14:18:05] (03CR) 10Urbanecm: [C:03+2] Enable message group subscription feature for some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098509 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [14:18:06] urbanecm, sorry for the delayed response [14:18:09] let's deploy then [14:18:10] no worries [14:18:26] tgr|away: while you're waiting on your patches: your +2 on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1090920 failed to go through ("This change depends on a change that failed to merge."), can you kick it again? 
[14:18:57] (03Merged) 10jenkins-bot: Enable message group subscription feature for some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098509 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [14:20:08] (03Merged) 10jenkins-bot: Use `useformat` query param for device detection or mobile domain (m.) [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1098622 (https://phabricator.wikimedia.org/T380646) (owner: 10Gergő Tisza) [14:20:35] urbanecm: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ is what I need to track to see the progress of the beta update right? [14:21:44] apergos: of, right, because dependency is not limited by branch [14:22:03] ty for the kick [14:22:27] !log Restarted MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration [14:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:35] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098561|Allow IRS to record server-side interaction events (T380599)]], [[gerrit:1098939|Revert^2 "Add contact form for U4C"]] (duration: 14m 07s) [14:22:40] T380599: Record server-side interaction event for IRS non-emergency flow submissions - https://phabricator.wikimedia.org/T380599 [14:23:20] mszabo: that and the sync job [14:23:29] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1098623|Use `useformat` query param for device detection or mobile domain (m.) (T380646 T375788)]], [[gerrit:1098913|ReportIncident: Enable instrumentation on labs (T372823)]], [[gerrit:1098509|Enable message group subscription feature for some wikis (T372386)]], [[gerrit:1098622|Use `useformat` query param for device detection or mobile domain (m.) 
( [14:23:29] T380646 T375788)]] [14:23:30] awesome, thanks [14:23:31] https://integration.wikimedia.org/ci/job/beta-scap-sync-world/, which would be triggered once this one finishes [14:23:38] T380646: Centralize SUL2 and SUL3 device detection - https://phabricator.wikimedia.org/T380646 [14:23:38] T375788: Implement SUL3 central autologin - https://phabricator.wikimedia.org/T375788 [14:23:39] T372823: Instrumentation for Incident Reporting System - https://phabricator.wikimedia.org/T372823 [14:23:39] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386 [14:24:34] still here under a slightly different name :-) [14:25:02] !log Started MediaModeration scanning scripts to run again over all wikis [14:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2028', diff saved to https://phabricator.wikimedia.org/P71347 and previous config saved to /var/cache/conftool/dbconfig/20241128-142505-ladsgroup.json [14:27:56] but why isn't it in the gate-and-submit queue now? does jenkins need to be told the equiv of recheck? [14:28:11] tgr|away: ^^ [14:28:27] apergos: unmet dependencies [14:28:34] it's now waiting on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/1098956 [14:28:41] !log urbanecm@deploy2002 urbanecm, tgr, abi, mszabo: Backport for [[gerrit:1098623|Use `useformat` query param for device detection or mobile domain (m.) (T380646 T375788)]], [[gerrit:1098913|ReportIncident: Enable instrumentation on labs (T372823)]], [[gerrit:1098509|Enable message group subscription feature for some wikis (T372386)]], [[gerrit:1098622|Use `useformat` query param for device detection or mobile domain (m. 
[14:28:42] ) (T380646 T375788)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:28:49] T380646: Centralize SUL2 and SUL3 device detection - https://phabricator.wikimedia.org/T380646 [14:28:50] T375788: Implement SUL3 central autologin - https://phabricator.wikimedia.org/T375788 [14:28:50] T372823: Instrumentation for Incident Reporting System - https://phabricator.wikimedia.org/T372823 [14:28:50] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386 [14:28:53] tgr|away: apergos: please test at mwdebug [14:28:54] eh [14:28:58] *abijeet_ ^^ [14:29:00] (03PS1) 10Muehlenhoff: cloudcontrol/codfw1dev:: Enable profile::auto_restarts::service for apache2 [puppet] - 10https://gerrit.wikimedia.org/r/1098962 (https://phabricator.wikimedia.org/T135991) [14:29:12] urbanecm, ok [14:30:02] (not much to test, it's a service, which is only exercised by a maintenance script. however, the wikis still work :-P ) [14:30:17] urbanecm: on it, will take a bit [14:30:51] ack [14:30:56] (but I'm not the one with the currently scapped change, only tgr) [14:31:46] urbanecm, looks good [14:32:03] ty abijeet_ [14:33:08] !log installing node-es-module-lexer updates from Bookworm point release [14:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:26] (03PS1) 10ArielGlenn: extend account creation backfill script to forced account creations by others [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098965 (https://phabricator.wikimedia.org/T378401) [14:36:25] !log [urbanecm@deploy2002 ~]$ mwscript-k8s -f extensions/Flow/maintenance/FlowMoveBoardsToSubpages.php -- --wiki=bswiki # T378827 [14:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:30] T378827: Run Flow migration script at *Phase 1* wikis - https://phabricator.wikimedia.org/T378827 [14:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA 
certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:39:13] (03PS1) 10Brouberol: airflow: fix typo in the REQUESTS_CA_BUNDLE env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098966 [14:39:42] (03CR) 10Stevemunene: [C:03+1] airflow: fix typo in the REQUESTS_CA_BUNDLE env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098966 (owner: 10Brouberol) [14:39:54] !log [urbanecm@deploy2002 ~]$ while read wiki; do echo "== $wiki"; mwscript-k8s extensions/Flow/maintenance/FlowMoveBoardsToSubpages.php -- --wiki=$wiki; done < wikis.txt # wikis.txt is at P71349 # T378827 [14:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2028 (T376905)', diff saved to https://phabricator.wikimedia.org/P71350 and previous config saved to /var/cache/conftool/dbconfig/20241128-144012-ladsgroup.json [14:40:18] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2032.codfw.wmnet with reason: Maintenance [14:40:30] are autocreations disallowed on wikitech? [14:40:32] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2032.codfw.wmnet with reason: Maintenance [14:40:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling es2032 (T376905)', diff saved to https://phabricator.wikimedia.org/P71351 and previous config saved to /var/cache/conftool/dbconfig/20241128-144039-ladsgroup.json [14:41:16] I get " [14:41:18] Auto-creation of a local account failed: You are not allowed to execute the action you have requested." [14:41:38] that doesn't feel right [14:42:21] anyone has a unified account on wikitech and willing to do a quick test? 
[14:42:50] tgr|away: i get the same error [14:42:53] i have two unified accs [14:43:07] and https://wikitech.wikimedia.org/wiki/Special:ListGroupRights says `createaccount` is not assigned to anyone [14:43:45] ...looking at commits, i see `labswiki: Disallow account autocreation` [14:43:48] authored by MYSELF [14:44:04] ha [14:44:13] maybe you have an evil clone [14:44:41] (03CR) 10Brouberol: [C:03+2] airflow: fix typo in the REQUESTS_CA_BUNDLE env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098966 (owner: 10Brouberol) [14:44:44] anyway, what's the test? [14:44:46] urbanecm: can you do a login on the mobile interface of wikitech, and check if you got automatically logged in to, say, en.m.wikipedia.org as well? [14:44:54] with mwdebug i presume [14:45:08] if you use firefox you'll have to disable extended tracking protection first [14:45:11] yeah [14:45:15] chrome [14:45:21] then its fine [14:45:55] ...actually, that's not what should be tested, sorry [14:46:13] tgr|away: it does not work [14:46:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2032 (T376905)', diff saved to https://phabricator.wikimedia.org/P71352 and previous config saved to /var/cache/conftool/dbconfig/20241128-144641-ladsgroup.json [14:46:46] (03PS1) 10Btullis: Add a keystore password for analytics-hadoop-labs [labs/private] - 10https://gerrit.wikimedia.org/r/1098967 (https://phabricator.wikimedia.org/T381087) [14:46:47] tgr|away: but also note `wgCentralAuthCookies = false` for labswiki [14:46:55] ooh [14:46:59] never mind then [14:47:08] thanks for checking that [14:47:15] the patch is good to go then [14:47:19] okay, proceeding [14:47:23] (03CR) 10Btullis: [V:03+2 C:03+2] Add a keystore password for analytics-hadoop-labs [labs/private] - 10https://gerrit.wikimedia.org/r/1098967 (https://phabricator.wikimedia.org/T381087) (owner: 10Btullis) [14:47:23] !log urbanecm@deploy2002 urbanecm, tgr, abi, mszabo: Continuing with sync [14:51:24] !log 
ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1165.eqiad.wmnet with reason: Maintenance [14:51:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1165.eqiad.wmnet with reason: Maintenance [14:51:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:51:55] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:51:59] (03CR) 10Muehlenhoff: [C:03+2] turnilo: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1094420 (owner: 10Muehlenhoff) [14:52:04] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1168.eqiad.wmnet with reason: Maintenance [14:52:18] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1168.eqiad.wmnet with reason: Maintenance [14:52:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1173.eqiad.wmnet with reason: Maintenance [14:52:40] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1173.eqiad.wmnet with reason: Maintenance [14:52:49] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1180.eqiad.wmnet with reason: Maintenance [14:53:02] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1180.eqiad.wmnet with reason: Maintenance [14:53:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1187.eqiad.wmnet with reason: Maintenance [14:53:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1187.eqiad.wmnet with reason: Maintenance [14:53:32] !log ladsgroup@cumin1002 START - 
Cookbook sre.hosts.downtime for 2:00:00 on db1225.eqiad.wmnet with reason: Maintenance [14:53:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1225.eqiad.wmnet with reason: Maintenance [14:54:01] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1231.eqiad.wmnet with reason: Maintenance [14:54:03] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098623|Use `useformat` query param for device detection or mobile domain (m.) (T380646 T375788)]], [[gerrit:1098913|ReportIncident: Enable instrumentation on labs (T372823)]], [[gerrit:1098509|Enable message group subscription feature for some wikis (T372386)]], [[gerrit:1098622|Use `useformat` query param for device detection or mobile domain (m.) [14:54:03] (T380646 T375788)]] (duration: 30m 33s) [14:54:09] should be live [14:54:11] T380646: Centralize SUL2 and SUL3 device detection - https://phabricator.wikimedia.org/T380646 [14:54:12] T375788: Implement SUL3 central autologin - https://phabricator.wikimedia.org/T375788 [14:54:12] T372823: Instrumentation for Incident Reporting System - https://phabricator.wikimedia.org/T372823 [14:54:12] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386 [14:54:13] anything else? 
[14:54:14] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1231.eqiad.wmnet with reason: Maintenance [14:54:22] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [14:54:36] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [14:54:48] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2151.codfw.wmnet with reason: Maintenance [14:54:57] Amir1: the downtime cookbook accepts cumin queries to match multiple hosts at once if that helps ;) [14:55:02] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2151.codfw.wmnet with reason: Maintenance [14:55:06] thanks! [14:55:20] if you or tgr could help me untangle the merge-depends-on-backport issue, that would be lovely; should I abandon the one backport and wait for the merge on the other patch (or will it need to be kicked again) or...? [14:55:20] volans: it's a bit complicated [14:55:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2158.codfw.wmnet with reason: Maintenance [14:55:35] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2158.codfw.wmnet with reason: Maintenance [14:55:36] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2187.codfw.wmnet with reason: Maintenance [14:55:43] it's not parallel, it's serial but just fats (the table is small) [14:55:49] *fast [14:55:50] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2187.codfw.wmnet with reason: Maintenance [14:55:51] apergos: you mean the https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1090920 one? 
[14:56:08] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2169.codfw.wmnet with reason: Maintenance [14:56:10] ah ok [14:56:13] that's the one I would like to merge and won't right now. yep [14:56:19] e.g. if I run the same script on s3, it's gonna take an hour between each [14:56:22] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2169.codfw.wmnet with reason: Maintenance [14:56:25] looked too fast for usual DB maintenance :D [14:56:28] probably just get rid of the Depends-On line [14:56:31] apergos: on that change, i'd just remove depends-on [14:56:39] yeah, the table is tiny [14:56:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2180.codfw.wmnet with reason: Maintenance [14:56:42] it tries to ensure it is merged in all branches it exists in [14:56:43] I don't think it's recoverable otherwise [14:56:50] good point, since it's in master anyways [14:56:54] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2180.codfw.wmnet with reason: Maintenance [14:57:00] which you actually don't want [14:57:05] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2193.codfw.wmnet with reason: Maintenance [14:57:19] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2193.codfw.wmnet with reason: Maintenance [14:57:21] (abandoning backport, re-+2ing and restoring would probably also work, but there's little point in doing that) [14:57:29] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2197.codfw.wmnet with reason: Maintenance [14:57:29] (see https://www.mediawiki.org/wiki/Gerrit/Cross-repo_dependencies#Possible_problems ) [14:57:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2197.codfw.wmnet with reason: Maintenance [14:57:58] !log ladsgroup@cumin1002 START - Cookbook
sre.hosts.downtime for 2:00:00 on db2217.codfw.wmnet with reason: Maintenance [14:58:03] You can just link the dependency in freetext. [14:58:07] (I shall) [14:58:12] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2217.codfw.wmnet with reason: Maintenance [14:58:29] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2224.codfw.wmnet with reason: Maintenance [14:58:42] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2224.codfw.wmnet with reason: Maintenance [14:58:44] the schema change is idempotent, I probably can even run it on master with replication but I'm just nervous about it :D [14:58:59] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2229.codfw.wmnet with reason: Maintenance [14:59:13] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2229.codfw.wmnet with reason: Maintenance [14:59:36] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1201.eqiad.wmnet with reason: Maintenance [14:59:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1201.eqiad.wmnet with reason: Maintenance [14:59:45] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10365694 (10MoritzMuehlenhoff) [14:59:47] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2214.codfw.wmnet with reason: Maintenance [15:00:01] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2214.codfw.wmnet with reason: Maintenance [15:00:36] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1150.eqiad.wmnet with reason: Maintenance [15:00:50] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1150.eqiad.wmnet with reason: 
Maintenance [15:01:03] 10SRE-Access-Requests, 06Machine-Learning-Team, 10Recommendation-API: Access to deploy recommendation API ML service for Stephane - https://phabricator.wikimedia.org/T381108 (10Nikerabbit) 03NEW [15:01:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2032', diff saved to https://phabricator.wikimedia.org/P71369 and previous config saved to /var/cache/conftool/dbconfig/20241128-150148-ladsgroup.json [15:02:01] (03Abandoned) 10ArielGlenn: extend account creation backfill script to forced account creations by others [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098965 (https://phabricator.wikimedia.org/T378401) (owner: 10ArielGlenn) [15:02:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1157.eqiad.wmnet with reason: Maintenance [15:02:49] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1157.eqiad.wmnet with reason: Maintenance [15:04:32] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1166.eqiad.wmnet with reason: Maintenance [15:04:46] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1166.eqiad.wmnet with reason: Maintenance [15:06:29] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1175.eqiad.wmnet with reason: Maintenance [15:06:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1175.eqiad.wmnet with reason: Maintenance [15:07:10] 10SRE-swift-storage: internal_api_error_UploadChunkFileException - https://phabricator.wikimedia.org/T381093#10365747 (10Yann) Now with another file, I got `03494: FAILED: stashfailed: Could not acquire lock. 
Somebody else is doing something to this file.` [15:08:25] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1198.eqiad.wmnet with reason: Maintenance [15:08:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1198.eqiad.wmnet with reason: Maintenance [15:08:51] only two minutes on all of s3, that's cool [15:09:26] FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:10:24] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1212.eqiad.wmnet with reason: Maintenance [15:10:38] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1212.eqiad.wmnet with reason: Maintenance [15:10:40] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:10:56] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:12:01] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for JLy-WMF - https://phabricator.wikimedia.org/T380523#10365763 (10Aklapper) 05Resolved→03Open Reopening per second bullet point on https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMF_Group [15:12:59] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1223.eqiad.wmnet with reason: Maintenance [15:13:13] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1223.eqiad.wmnet with reason: Maintenance [15:13:20] 10SRE-swift-storage, 06Commons, 10UploadWizard: internal_api_error_UploadChunkFileException - 
https://phabricator.wikimedia.org/T381093#10365766 (10MatthewVernon) That's usually a sign that something has gone wrong above the swift level, I'm afraid (and previously when I've had it reported it has self-resolv... [15:13:25] sigh that broke replication to wikireplicas [15:15:15] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1240.eqiad.wmnet with reason: Maintenance [15:15:28] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1240.eqiad.wmnet with reason: Maintenance [15:15:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1018.eqiad.wmnet [15:16:38] !log gmodena@deploy2002 Started deploy [analytics/refinery@ac87303]: Gobblin config changes [analytics/refinery@ac873037] [15:16:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2032', diff saved to https://phabricator.wikimedia.org/P71370 and previous config saved to /var/cache/conftool/dbconfig/20241128-151655-ladsgroup.json [15:18:30] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [15:18:44] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [15:19:15] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [15:19:44] !log gmodena@deploy2002 Finished deploy [analytics/refinery@ac87303]: Gobblin config changes [analytics/refinery@ac873037] (duration: 03m 05s) [15:20:25] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2139.codfw.wmnet with reason: Maintenance [15:20:32] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [15:20:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2139.codfw.wmnet with reason: Maintenance 
[15:20:42] tgr|away: still no gate and submit jobs, with the dependency line removed, do you need to remove your vote and redo again? (sorry) [15:21:38] !log removing ganeti1018 from active Ganeti nodes T378921 [15:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:42] T378921: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921 [15:22:06] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:23:32] thanks yet again... [15:24:08] (03CR) 10Muehlenhoff: [C:03+2] Switch idp-test to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1094426 (owner: 10Muehlenhoff) [15:24:17] PROBLEM - ganeti-confd running on ganeti1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [15:24:47] PROBLEM - ganeti-noded running on ganeti1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [15:25:00] !log gmodena@deploy2002 Started deploy [analytics/refinery@ac87303] (thin): Gobblin config changes THIN [analytics/refinery@ac873037] [15:25:30] !log gmodena@deploy2002 Finished deploy [analytics/refinery@ac87303] (thin): Gobblin config changes THIN [analytics/refinery@ac873037] (duration: 00m 30s) [15:26:22] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10365819 (10LSobanski) The potential fix was merged in https://gitlab.com/mailman/mailman/-/issues/1151 and is included in Mailman version 3.3.10. 
[15:26:39] !log gmodena@deploy2002 Started deploy [analytics/refinery@ac87303] (hadoop-test): Gobblin config changes [analytics/refinery@ac873037] [15:27:05] !log gmodena@deploy2002 Finished deploy [analytics/refinery@ac87303] (hadoop-test): Gobblin config changes [analytics/refinery@ac873037] (duration: 00m 26s) [15:27:06] FIRING: [13x] ProbeDown: Service ganeti1018:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:29:55] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2149.codfw.wmnet with reason: Maintenance [15:30:09] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2149.codfw.wmnet with reason: Maintenance [15:32:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2032 (T376905)', diff saved to https://phabricator.wikimedia.org/P71371 and previous config saved to /var/cache/conftool/dbconfig/20241128-153202-ladsgroup.json [15:32:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2005.wikimedia.org [15:33:13] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for JLy-WMF - https://phabricator.wikimedia.org/T380523#10365848 (10elukey) 05Open→03Resolved TIL, already done thanks! 
[15:36:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2005.wikimedia.org [15:37:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2004.wikimedia.org [15:39:28] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2156.codfw.wmnet with reason: Maintenance [15:39:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:39:42] (03Restored) 10ArielGlenn: extend account creation backfill script to forced account creations by others [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098965 (https://phabricator.wikimedia.org/T378401) (owner: 10ArielGlenn) [15:39:42] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2156.codfw.wmnet with reason: Maintenance [15:39:42] (03PS1) 10Muehlenhoff: ganeti1018: Update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1098981 [15:39:44] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2186.codfw.wmnet with reason: Maintenance [15:39:46] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2186.codfw.wmnet with reason: Maintenance [15:40:07] (03PS2) 10ArielGlenn: extend account creation backfill script to forced account creations by others [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098965 (https://phabricator.wikimedia.org/T378401) [15:44:22] (03PS1) 10Muehlenhoff: Fix keytab locations [labs/private] - 10https://gerrit.wikimedia.org/r/1098982 [15:46:05] (03CR) 10Cathal Mooney: [C:03+1] hiera: set do_ipv6_primary_ra for 
all LVS in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1093957 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh) [15:46:25] (03CR) 10Ssingh: [C:03+2] hiera: set do_ipv6_primary_ra for all LVS in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1093957 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh) [15:46:36] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host idp-test2004.wikimedia.org [15:48:40] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2177.codfw.wmnet with reason: Maintenance [15:48:54] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2177.codfw.wmnet with reason: Maintenance [15:50:08] (03PS1) 10Muehlenhoff: Add component/maps [puppet] - 10https://gerrit.wikimedia.org/r/1098984 [15:50:48] (03CR) 10CI reject: [V:04-1] extend account creation backfill script to forced account creations by others [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098965 (https://phabricator.wikimedia.org/T378401) (owner: 10ArielGlenn) [15:55:08] (03PS10) 10Clément Goubert: mediawiki: Add mwcron feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) [15:56:12] (03PS1) 10Gmodena: dse-k8s-services: mw-dump: version bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098985 (https://phabricator.wikimedia.org/T381040) [15:56:35] (03CR) 10Brouberol: [C:03+1] "Thanks!" 
[labs/private] - 10https://gerrit.wikimedia.org/r/1098982 (owner: 10Muehlenhoff) [15:57:28] (03CR) 10Stevemunene: [C:03+1] Fix keytab locations [labs/private] - 10https://gerrit.wikimedia.org/r/1098982 (owner: 10Muehlenhoff) [15:58:12] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2190.codfw.wmnet with reason: Maintenance [15:58:26] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2190.codfw.wmnet with reason: Maintenance [15:58:40] (03CR) 10Muehlenhoff: tftpboot: squash puppetserver log warning. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073531 (https://phabricator.wikimedia.org/T374885) (owner: 10JHathaway) [16:00:05] hashar and andre: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1600) [16:00:18] oh true [16:00:19] well [16:01:36] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [16:01:42] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [16:04:29] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Fix keytab locations [labs/private] - 10https://gerrit.wikimedia.org/r/1098982 (owner: 10Muehlenhoff) [16:04:38] (03PS1) 10Muehlenhoff: Update cloudcephmon secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1098988 (https://phabricator.wikimedia.org/T364870) [16:07:48] (03PS1) 10Muehlenhoff: Add missing entry for recent LDAP addition [puppet] - 10https://gerrit.wikimedia.org/r/1098989 (https://phabricator.wikimedia.org/T380091) [16:07:56] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2194.codfw.wmnet with reason: Maintenance [16:07:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2194.codfw.wmnet with reason: Maintenance [16:08:04] (03PS11) 10Stevemunene: Enable pod-scoped "external 
services" network policies for airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094422 (https://phabricator.wikimedia.org/T377926) [16:08:18] (03CR) 10Btullis: [C:03+1] dse-k8s-services: mw-dump: version bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098985 (https://phabricator.wikimedia.org/T381040) (owner: 10Gmodena) [16:09:31] (03CR) 10Gmodena: [C:03+2] dse-k8s-services: mw-dump: version bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098985 (https://phabricator.wikimedia.org/T381040) (owner: 10Gmodena) [16:11:21] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1098989 (https://phabricator.wikimedia.org/T380091) (owner: 10Muehlenhoff) [16:11:46] (03Merged) 10jenkins-bot: dse-k8s-services: mw-dump: version bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098985 (https://phabricator.wikimedia.org/T381040) (owner: 10Gmodena) [16:13:51] (03PS1) 10Bartosz Dziewoński: Localisation updates (November 26) [extensions/VisualEditor] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098990 (https://phabricator.wikimedia.org/T372175) [16:14:06] (03CR) 10Brouberol: "Looks good, with a tiny nit!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094422 (https://phabricator.wikimedia.org/T377926) (owner: 10Stevemunene) [16:14:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/VisualEditor] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098990 (https://phabricator.wikimedia.org/T372175) (owner: 10Bartosz Dziewoński) [16:16:04] (03CR) 10Elukey: [C:03+2] "ah snap sorry! 
Thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/1098989 (https://phabricator.wikimedia.org/T380091) (owner: 10Muehlenhoff) [16:17:29] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2205.codfw.wmnet with reason: Maintenance [16:17:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2205.codfw.wmnet with reason: Maintenance [16:19:28] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2081.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:19:44] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2081.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:19:53] (03PS12) 10Stevemunene: Enable pod-scoped "external services" network policies for airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094422 (https://phabricator.wikimedia.org/T377926) [16:19:57] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2082.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:20:00] (03CR) 10Stevemunene: Enable pod-scoped "external services" network policies for airflow (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094422 (https://phabricator.wikimedia.org/T377926) (owner: 10Stevemunene) [16:20:12] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2082.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:20:44] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2083.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:21:00] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2083.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:21:20] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2084.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:21:45] !log elukey@cumin1002 END (PASS) - 
Cookbook sre.hosts.provision (exit_code=0) for host ms-be2084.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:22:04] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2085.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:22:06] FIRING: [13x] ProbeDown: Service ganeti1018:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:22:19] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2085.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:22:39] !log gmodena@deploy2002 Started deploy [airflow-dags/analytics@d7c0f58]: webrequest_frontend post deployment fixes [16:22:39] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2086.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:23:00] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2086.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:23:38] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2087.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:23:55] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2087.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:24:09] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2088.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:24:24] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2088.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:24:35] !log gmodena@deploy2002 Finished deploy [airflow-dags/analytics@d7c0f58]: webrequest_frontend post deployment fixes (duration: 02m 22s) [16:24:59] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: 
Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10366013 (10elukey) Re-ran provision on all those, we are good, no changes registered. Now it is the turn of reimages, I'll kick off some. [16:27:14] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2227.codfw.wmnet with reason: Maintenance [16:27:28] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2227.codfw.wmnet with reason: Maintenance [16:28:21] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2081.codfw.wmnet with OS bullseye [16:37:09] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1189.eqiad.wmnet with reason: Maintenance [16:37:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1189.eqiad.wmnet with reason: Maintenance [16:38:23] (03CR) 10Kamila Součková: [C:03+1] "go go go!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098499 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [16:39:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2209.codfw.wmnet with reason: Maintenance [16:39:21] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2209.codfw.wmnet with reason: Maintenance [16:41:29] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2081.codfw.wmnet with reason: host reimage [16:41:39] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10366043 (10MatthewVernon) We had a problem with codfw swift this morning, with the sort of load pattern that I'd normally expect to "just" result in swift filling a network connectio... 
[16:42:19] (03CR) 10Vgutierrez: Add ferm macro/nftables set for loadbalancer nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1098936 (owner: 10Muehlenhoff) [16:43:02] 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10366047 (10elukey) Tried megactl (packaged by Moritz) on ms-be2082, this is the result: ` elukey@ms-be2082:~$ su... [16:44:37] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2081.codfw.wmnet with reason: host reimage [16:46:06] (03CR) 10Elukey: [C:03+1] Add component/maps [puppet] - 10https://gerrit.wikimedia.org/r/1098984 (owner: 10Muehlenhoff) [16:47:40] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10366052 (10Ladsgroup) I'm not saying it's impossible but it's unlikely. The number of scripts per host is quite small (6-7) and they are mostly I/O bound waiting for the backends to... 
[16:47:42] win 14 [16:48:41] (03PS3) 10Hashar: extend account creation backfill script to forced account creations by others [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098965 (https://phabricator.wikimedia.org/T378401) (owner: 10ArielGlenn) [16:49:33] (03CR) 10Kosta Harlan: [C:03+1] ReportIncident: Setup $wgReportIncidentLocalLinks for ptwiki pilot deploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098929 (https://phabricator.wikimedia.org/T380277) (owner: 10Máté Szabó) [16:51:14] !log depool/restart swift/repool ms-fe2009 [16:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:31] !log depool/restart swift/repool ms-fe2014 [16:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:22] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10366070 (10MatthewVernon) No, all frontends had problems, the entire cluster was very sad cf [[ https://grafana.wikimedia.org/goto/Lapva97NR?orgId=1 | envoy on graphana ]], which is... [16:55:05] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10366079 (10elukey) ms-be2081 done reimaged! [16:55:09] 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10366080 (10MatthewVernon) Megactl is correct that the battery is missing, but obviously on nodes where we expect... [16:57:59] 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10366085 (10elukey) >>! In T377853#10366080, @MatthewVernon wrote: > Megactl is correct that the battery is missin... 
[16:59:26] FIRING: [3x] SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:00:04] jhathaway and rzl: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1700). [17:00:04] No Gerrit patches in the queue for this window AFAICS. [17:00:57] I have closed the train blocker task and I have claimed 1.44.0-wmf.5 to be a successful rollout [17:06:08] (03PS3) 10Ssingh: trafficserver: remove inbound TLS and related settings [puppet] - 10https://gerrit.wikimedia.org/r/1091748 [17:06:18] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2081.codfw.wmnet with OS bullseye [17:07:45] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4605/co" [puppet] - 10https://gerrit.wikimedia.org/r/1091748 (owner: 10Ssingh) [17:08:24] (03PS4) 10Ssingh: trafficserver: remove inbound TLS and related settings [puppet] - 10https://gerrit.wikimedia.org/r/1091748 [17:09:54] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4606/co" [puppet] - 10https://gerrit.wikimedia.org/r/1091748 (owner: 10Ssingh) [17:13:07] (03CR) 10Vgutierrez: [C:03+1] trafficserver: remove inbound TLS and related settings [puppet] - 10https://gerrit.wikimedia.org/r/1091748 (owner: 10Ssingh) [17:15:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CheckUser] (wmf/1.44.0-wmf.5) - 
10https://gerrit.wikimedia.org/r/1098956 (https://phabricator.wikimedia.org/T378401) (owner: 10ArielGlenn) [17:16:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098965 (https://phabricator.wikimedia.org/T378401) (owner: 10ArielGlenn) [17:17:11] (03CR) 10Vgutierrez: benthos: add benthos for haproxy debug functions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [17:45:14] !log kamila@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1276-1277].eqiad.wmnet} and (A:wikikube-staging-master-codfw or A:wikikube-staging-worker-codfw or A:wikikube-staging-master-eqiad or A:wikikube-staging-worker-eqiad or A:wikikube-master-codfw or A:wikikube-worker-codfw or A:wikikube-master-eqiad or A:wikikube-worker-eqiad or A:ml-serve-master-eqiad or A:ml-serve-worker- [17:45:14] eqiad or A:ml-serve-master-codfw or A:ml-serve-worker-codfw or A:ml-staging-master or A:ml-staging-worker or A:dse-k8s-master or A:dse-k8s-worker or A:aux-master or A:aux-worker) [17:47:11] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1276.eqiad.wmnet with OS bookworm [17:50:56] (03CR) 10Fabfur: benthos: add benthos for haproxy debug functions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [17:51:02] PROBLEM - BGP status on lsw1-e5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:52:15] (03PS15) 10Fabfur: benthos: add benthos for haproxy debug functions [puppet] - 10https://gerrit.wikimedia.org/r/1093413 
(https://phabricator.wikimedia.org/T329332) [17:52:30] (03PS13) 10Fabfur: hiera: add log ring to cp4039 [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) [17:57:32] (03PS1) 10Joal: Move hourly gobblin event start-time later [puppet] - 10https://gerrit.wikimedia.org/r/1099010 (https://phabricator.wikimedia.org/T376144) [17:57:45] (03CR) 10David Caro: [C:03+1] "Thanks!" [labs/private] - 10https://gerrit.wikimedia.org/r/1098988 (https://phabricator.wikimedia.org/T364870) (owner: 10Muehlenhoff) [18:00:05] bd808: Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1800). Please do the needful. [18:00:06] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1800) [18:00:56] 06SRE, 06Infrastructure-Foundations, 10netops: Packet loss reflected in NELs for traffic to Reliance Jio Infocomm Ltd over BBIX Singapore - https://phabricator.wikimedia.org/T373015#10366155 (10cmooney) 05Open→03Resolved [18:00:58] not today jouncebot. I'm "on holiday" [18:06:37] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1276.eqiad.wmnet with reason: host reimage [18:09:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1276.eqiad.wmnet with reason: host reimage [18:14:41] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T380182#10366167 (10phaultfinder) [18:28:03] RECOVERY - BGP status on lsw1-e5-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:28:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1276.eqiad.wmnet with OS bookworm [18:33:39] thanks urbanecm for deploying the configuration change. 
:-) [18:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:41:51] (03CR) 10Thiemo Kreuz (WMDE): "Looks good. 👍 For reference: This is basically a revert of Idce1027." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098045 (https://phabricator.wikimedia.org/T377809) (owner: 10Joely Rooke WMDE) [19:05:25] 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10366216 (10MoritzMuehlenhoff) It differentiates states already, ms-be2082 has "module missing, pack missing, char... [19:08:31] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1277.eqiad.wmnet with OS bookworm [19:09:26] FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:12:05] PROBLEM - BGP status on lsw1-e5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:18:47] 06SRE, 06Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10366224 (10cmooney) [19:23:59] 06SRE, 06Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10366225 (10ssingh) > Which will hopefully verify everything is consistent. In terms of the wider work to integrate with Netbox and get data onto our authdns hosts I will need to wor... 
[19:26:05] 06SRE, 06Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10366232 (10cmooney) [19:27:58] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1277.eqiad.wmnet with reason: host reimage [19:28:07] (03Abandoned) 10Ssingh: dnsbox: set anycast-hc num_backups to one [puppet] - 10https://gerrit.wikimedia.org/r/1051381 (owner: 10Ssingh) [19:28:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098929 (https://phabricator.wikimedia.org/T380277) (owner: 10Máté Szabó) [19:29:55] 06SRE, 06Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10366237 (10cmooney) [19:31:58] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1277.eqiad.wmnet with reason: host reimage [19:39:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:39:54] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:40:18] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:43:08] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:43:44] RECOVERY 
- mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:50:08] RECOVERY - BGP status on lsw1-e5-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:50:32] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1277.eqiad.wmnet with OS bookworm [19:50:34] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1276-1277].eqiad.wmnet} and (A:wikikube-staging-master-codfw or A:wikikube-staging-worker-codfw or A:wikikube-staging-master-eqiad or A:wikikube-staging-worker-eqiad or A:wikikube-master-codfw or A:wikikube-worker-codfw or A:wikikube-master-eqiad or A:wikikube-worker-eqiad or A:ml-serve-master-eqiad or [19:50:34] A:ml-serve-worker-eqiad or A:ml-serve-master-codfw or A:ml-serve-worker-codfw or A:ml-staging-master or A:ml-staging-worker or A:dse-k8s-master or A:dse-k8s-worker or A:aux-master or A:aux-worker) [19:51:54] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:52:22] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:00:38] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:05:52] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 8.401 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:06:18] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:06:28] 
RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 08 Feb 2025 11:19:52 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:09:51] (03PS1) 10Arlolra: Add myself to releasers-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1099034 [20:13:33] jouncebot: nowandnext [20:13:33] No deployments scheduled for the next 0 hour(s) and 46 minute(s) [20:13:33] In 0 hour(s) and 46 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T2100) [20:13:52] (03CR) 10Brouberol: [C:03+1] "Nicely done!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094422 (https://phabricator.wikimedia.org/T377926) (owner: 10Stevemunene) [20:15:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098929 (https://phabricator.wikimedia.org/T380277) (owner: 10Máté Szabó) [20:16:38] (03Merged) 10jenkins-bot: ReportIncident: Setup $wgReportIncidentLocalLinks for ptwiki pilot deploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098929 (https://phabricator.wikimedia.org/T380277) (owner: 10Máté Szabó) [20:16:55] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1098929|ReportIncident: Setup $wgReportIncidentLocalLinks for ptwiki pilot deploy (T380277)]] [20:17:00] T380277: Prepare local links configuration for IRS pilot wiki - https://phabricator.wikimedia.org/T380277 [20:22:06] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:23:02] !log kharlan@deploy2002 kharlan, mszabo: Backport for [[gerrit:1098929|ReportIncident: Setup $wgReportIncidentLocalLinks for ptwiki 
pilot deploy (T380277)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:23:07] T380277: Prepare local links configuration for IRS pilot wiki - https://phabricator.wikimedia.org/T380277 [20:23:17] !log kharlan@deploy2002 kharlan, mszabo: Continuing with sync [20:30:04] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098929|ReportIncident: Setup $wgReportIncidentLocalLinks for ptwiki pilot deploy (T380277)]] (duration: 13m 08s) [20:30:11] T380277: Prepare local links configuration for IRS pilot wiki - https://phabricator.wikimedia.org/T380277 [20:36:46] (03CR) 10MSantos: [C:03+1] Add myself to releasers-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1099034 (owner: 10Arlolra) [20:36:50] I'm done with my backport [20:38:32] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:39:54] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:41:24] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:41:44] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 0.122 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:47:32] PROBLEM - Disk space on serpens is CRITICAL: DISK CRITICAL - free space: / 448 MB (2% inode=92%): /tmp 448 MB (2% inode=92%): /var/tmp 448 MB (2% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=serpens&var-datasource=codfw+prometheus/ops [20:54:47] 06SRE, 10SRE-Access-Requests: Requesting access to releasers-mediawiki group for ABreault (WMF) - https://phabricator.wikimedia.org/T381123 (10ABreault-WMF) 03NEW [20:55:12] 
(03PS2) 10Arlolra: Add myself to releasers-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1099034 (https://phabricator.wikimedia.org/T381123) [20:58:34] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:59:26] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:59:26] FIRING: [3x] SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T2100). [21:00:05] danisztls, MatmaRex, apergos, and kostajh: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:17] o/ [21:00:20] hi [21:00:20] here, believe it or not :-P [21:01:55] FIRING: MaxConntrack: Max conntrack at 99.26% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [21:04:52] I wonder who's running the window [21:05:36] I can do if there are no takers [21:06:03] ah you're here anyways [21:06:29] just got here [21:06:55] RESOLVED: MaxConntrack: Max conntrack at 99.26% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [21:07:29] I'd say if none of the named suspects shows in the next couple minutes, go ahead and run it [21:07:45] danisztls: can the two config patches be deployed together? [21:08:23] tgr|away: yes [21:09:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098617 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [21:09:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098627 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [21:09:57] (03CR) 10Gergő Tisza: [C:03+2] Localisation updates (November 26) [extensions/VisualEditor] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098990 (https://phabricator.wikimedia.org/T372175) (owner: 10Bartosz Dziewoński) [21:10:25] (03Merged) 10jenkins-bot: Reader Survey: Undeploy on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098617 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [21:10:28] (03Merged) 10jenkins-bot: Reader Survey: Deploy on multiple wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098627 
(https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza) [21:10:44] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1098617|Reader Survey: Undeploy on enwiki (T378660)]], [[gerrit:1098627|Reader Survey: Deploy on multiple wikis (T378660)]] [21:10:50] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660 [21:11:18] apergos: you don't need to test the patch, I assume? [21:12:20] not really. I mean "did it break normal wiki operation" for the checkuser service, I guess, but certainly not the centralauth script [21:12:41] the service is not used by anything else, right? [21:13:13] no, and ci should have caught any changes in servicewiring that would be a problem [21:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:13:48] I'll just deploy it together with something else then [21:14:03] (03CR) 10Gergő Tisza: [C:03+2] extend account creation backfill script to forced account creations by others [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098965 (https://phabricator.wikimedia.org/T378401) (owner: 10ArielGlenn) [21:14:12] (03CR) 10Gergő Tisza: [C:03+2] extend account creation lookup service to cover forced creations by others [extensions/CheckUser] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098956 (https://phabricator.wikimedia.org/T378401) (owner: 10ArielGlenn) [21:15:31] MatmaRex: will you need to test the VE changes? 
[21:15:55] tgr|away: not really, although i could [21:15:59] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@6d38940]: Generate canary events faster in Airflow [21:16:12] but it's a localisation backport, can't really brak anything [21:16:20] !log tgr@deploy2002 tgr, dani: Backport for [[gerrit:1098617|Reader Survey: Undeploy on enwiki (T378660)]], [[gerrit:1098627|Reader Survey: Deploy on multiple wikis (T378660)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:16:21] break* [21:16:24] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660 [21:17:38] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@6d38940]: Generate canary events faster in Airflow (duration: 01m 39s) [21:18:40] TheresNoTime: all looks good [21:18:50] !log tgr@deploy2002 tgr, dani: Continuing with sync [21:24:22] tgr|away: thanks [21:25:27] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098617|Reader Survey: Undeploy on enwiki (T378660)]], [[gerrit:1098627|Reader Survey: Deploy on multiple wikis (T378660)]] (duration: 14m 43s) [21:25:32] T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660 [21:31:06] (03Merged) 10jenkins-bot: Localisation updates (November 26) [extensions/VisualEditor] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098990 (https://phabricator.wikimedia.org/T372175) (owner: 10Bartosz Dziewoński) [21:32:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:36:20] (03Merged) 10jenkins-bot: extend account creation lookup service to cover forced creations by others [extensions/CheckUser] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098956 (https://phabricator.wikimedia.org/T378401) (owner: 
10ArielGlenn) [21:36:21] (03Merged) 10jenkins-bot: extend account creation backfill script to forced account creations by others [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098965 (https://phabricator.wikimedia.org/T378401) (owner: 10ArielGlenn) [21:39:25] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1098990|Localisation updates (November 26) (T372175)]], [[gerrit:1098956|extend account creation lookup service to cover forced creations by others (T378401)]], [[gerrit:1098965|extend account creation backfill script to forced account creations by others (T378401)]], [[gerrit:1098929|ReportIncident: Setup $wgReportIncidentLocalLinks for ptwiki pilot deplo [21:39:25] y (T380277)]] [21:39:30] T372175: Allow tag labels and links to be translateable separately - https://phabricator.wikimedia.org/T372175 [21:39:31] T378401: Start running backfillLocalAccounts.php - https://phabricator.wikimedia.org/T378401 [21:39:31] T380277: Prepare local links configuration for IRS pilot wiki - https://phabricator.wikimedia.org/T380277 [21:41:32] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Idle https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:50:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1178 depool (T361627)', diff saved to https://phabricator.wikimedia.org/P71373 and previous config saved to /var/cache/conftool/dbconfig/20241128-215026-ladsgroup.json [21:50:31] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [21:51:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1178.eqiad.wmnet with reason: Schema change (T361627) [21:51:21] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1178.eqiad.wmnet with reason: Schema change (T361627) 
[21:53:42] !log tgr@deploy2002 tgr, ariel, matmarex, mszabo: Backport for [[gerrit:1098990|Localisation updates (November 26) (T372175)]], [[gerrit:1098956|extend account creation lookup service to cover forced creations by others (T378401)]], [[gerrit:1098965|extend account creation backfill script to forced account creations by others (T378401)]], [[gerrit:1098929|ReportIncident: Setup $wgReportIncidentLocalLinks for ptwiki pilot [21:53:42] deploy (T380277)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:53:47] T372175: Allow tag labels and links to be translateable separately - https://phabricator.wikimedia.org/T372175 [21:53:47] T378401: Start running backfillLocalAccounts.php - https://phabricator.wikimedia.org/T378401 [21:53:48] T380277: Prepare local links configuration for IRS pilot wiki - https://phabricator.wikimedia.org/T380277 [21:54:42] tgr|away: my change looks good on mwdebug [21:56:08] kostajh: do you want to test the patch? [21:56:48] all is fine here (checked reads, recentchanges, edit :-P) [22:01:01] per https://phabricator.wikimedia.org/T380277#10366334 I suppose the answer is no [22:03:56] (03CR) 10Ladsgroup: [C:03+1] Run dumpInterwiki.php locally with no changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098916 (owner: 10Tim Starling) [22:04:12] (03CR) 10Ladsgroup: [C:03+1] Activate id.wikivoyage.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098918 (https://phabricator.wikimedia.org/T380726) (owner: 10Tim Starling) [22:04:20] (03CR) 10Ladsgroup: [C:03+1] Prepare id.wikivoyage.org for installation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098917 (https://phabricator.wikimedia.org/T380726) (owner: 10Tim Starling) [22:05:28] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.pool db1178 gradually with 4 steps - Maint over (T361627) [22:05:32] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively 
on WMF wikis - https://phabricator.wikimedia.org/T361627 [22:06:50] I guess that patch got deployed already outside the window? [22:06:56] that was confusing [22:07:06] !log tgr@deploy2002 tgr, ariel, matmarex, mszabo: Continuing with sync [22:07:49] I saw the checkmark and had no idea what that was about (in the deployment calendar) [22:08:19] I didn't even notice that [22:14:58] Sorry. I wrote in the channel and added the “Done” check mark in the calendar [22:15:12] Should I have removed it from the calendar? [22:15:52] ah that was the source of the Done! :-D maybe a strikethrough if you wanted a record of the deployment to be someplace...? [22:15:54] or I should have paid more attention, I guess [22:17:03] having it in the calendar is generally useful for people trying to see what changed (not so much for this patch, but if it's a code change that can break something, it's better to have a paper trail) [22:17:13] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098990|Localisation updates (November 26) (T372175)]], [[gerrit:1098956|extend account creation lookup service to cover forced creations by others (T378401)]], [[gerrit:1098965|extend account creation backfill script to forced account creations by others (T378401)]], [[gerrit:1098929|ReportIncident: Setup $wgReportIncidentLocalLinks for ptwiki pilot depl [22:17:13] oy (T380277)]] (duration: 37m 48s) [22:17:19] T372175: Allow tag labels and links to be translateable separately - https://phabricator.wikimedia.org/T372175 [22:17:19] T378401: Start running backfillLocalAccounts.php - https://phabricator.wikimedia.org/T378401 [22:17:20] T380277: Prepare local links configuration for IRS pilot wiki - https://phabricator.wikimedia.org/T380277 [22:17:53] !log UTC late deploys done [22:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:16] thanks for deploying tgr|away [22:18:34] yep thanks fr the deploys [22:19:41] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound 
interface errors - https://phabricator.wikimedia.org/T380182#10366490 (10phaultfinder) [22:22:30] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance [22:22:44] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance [22:22:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T328817)', diff saved to https://phabricator.wikimedia.org/P71376 and previous config saved to /var/cache/conftool/dbconfig/20241128-222250-ladsgroup.json [22:22:56] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [22:27:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T328817)', diff saved to https://phabricator.wikimedia.org/P71377 and previous config saved to /var/cache/conftool/dbconfig/20241128-222751-ladsgroup.json [22:27:57] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [22:33:57] (03PS1) 10Tim Starling: Fix various installPreConfigured bugs [core] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099059 (https://phabricator.wikimedia.org/T352113) [22:34:35] (03PS1) 10Tim Starling: installer: Fix failure to install blobs table [core] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099060 [22:35:29] (03PS1) 10Tim Starling: Convert addWiki.php to a wrapper around core installPreConfigured.php [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099061 (https://phabricator.wikimedia.org/T352113) [22:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:39:21] !log ladsgroup@cumin1002 START - 
Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [22:39:35] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [22:39:36] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:39:53] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:40:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T376905)', diff saved to https://phabricator.wikimedia.org/P71379 and previous config saved to /var/cache/conftool/dbconfig/20241128-223959-ladsgroup.json [22:42:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P71380 and previous config saved to /var/cache/conftool/dbconfig/20241128-224258-ladsgroup.json [22:45:56] (03PS2) 10Tim Starling: Convert addWiki.php to a wrapper around core installPreConfigured.php [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099061 (https://phabricator.wikimedia.org/T352113) [22:45:56] (03PS1) 10Tim Starling: addWiki: Add UpdateSearchIndexConfig [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099064 [22:45:56] (03PS1) 10Tim Starling: dumpInterwiki: read from preinstall.dblist [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099065 (https://phabricator.wikimedia.org/T352113) [22:45:57] (03PS1) 10Tim Starling: addWiki: Move DB_ADMIN to core [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099066 [22:49:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db1165 (T376905)', diff saved to https://phabricator.wikimedia.org/P71381 and previous config saved to /var/cache/conftool/dbconfig/20241128-224905-ladsgroup.json [22:50:51] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1178 gradually with 4 steps - Maint over (T361627) [22:50:55] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [22:56:36] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:56:56] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:58:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P71383 and previous config saved to /var/cache/conftool/dbconfig/20241128-225805-ladsgroup.json [22:58:26] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:58:46] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:04:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P71384 and previous config saved to /var/cache/conftool/dbconfig/20241128-230412-ladsgroup.json [23:09:26] FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:13:13] 
!log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T328817)', diff saved to https://phabricator.wikimedia.org/P71385 and previous config saved to /var/cache/conftool/dbconfig/20241128-231312-ladsgroup.json [23:13:15] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance [23:13:17] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [23:13:28] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance [23:13:30] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [23:13:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [23:13:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T328817)', diff saved to https://phabricator.wikimedia.org/P71386 and previous config saved to /var/cache/conftool/dbconfig/20241128-231350-ladsgroup.json [23:16:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T328817)', diff saved to https://phabricator.wikimedia.org/P71387 and previous config saved to /var/cache/conftool/dbconfig/20241128-231650-ladsgroup.json [23:19:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P71388 and previous config saved to /var/cache/conftool/dbconfig/20241128-231919-ladsgroup.json [23:31:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P71389 and previous config saved to /var/cache/conftool/dbconfig/20241128-233157-ladsgroup.json [23:34:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db1165 (T376905)', diff saved to https://phabricator.wikimedia.org/P71390 and previous config saved to /var/cache/conftool/dbconfig/20241128-233426-ladsgroup.json [23:39:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:47:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P71391 and previous config saved to /var/cache/conftool/dbconfig/20241128-234704-ladsgroup.json [23:47:25] FIRING: [3x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:50:10] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.017e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad