[00:00:20] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1094126 (https://phabricator.wikimedia.org/T352113) (owner: Tim Starling)
[00:00:21] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1096839 (https://phabricator.wikimedia.org/T352113) (owner: Tim Starling)
[00:00:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T370903)', diff saved to https://phabricator.wikimedia.org/P71271 and previous config saved to /var/cache/conftool/dbconfig/20241128-000023-ladsgroup.json
[00:00:26] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2154.codfw.wmnet with reason: Maintenance
[00:00:32] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[00:00:39] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2154.codfw.wmnet with reason: Maintenance
[00:00:46] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T370903)', diff saved to https://phabricator.wikimedia.org/P71272 and previous config saved to /var/cache/conftool/dbconfig/20241128-000046-ladsgroup.json
[00:01:11] <wikibugs>	 (Merged) jenkins-bot: Move default main page text for new wikis to config [mediawiki-config] - https://gerrit.wikimedia.org/r/1094126 (https://phabricator.wikimedia.org/T352113) (owner: Tim Starling)
[00:01:14] <wikibugs>	 (Merged) jenkins-bot: Introduce preinstall.dblist for wikis that haven't been installed yet [mediawiki-config] - https://gerrit.wikimedia.org/r/1096839 (https://phabricator.wikimedia.org/T352113) (owner: Tim Starling)
[00:01:47] <logmsgbot>	 !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1094126|Move default main page text for new wikis to config (T352113)]], [[gerrit:1096839|Introduce preinstall.dblist for wikis that haven't been installed yet (T352113)]]
[00:01:51] <stashbot>	 T352113: Move the addWiki.php maintenance script from WikimediaMaintenance into MediaWiki core - https://phabricator.wikimedia.org/T352113
[00:07:21] <logmsgbot>	 !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1094126|Move default main page text for new wikis to config (T352113)]], [[gerrit:1096839|Introduce preinstall.dblist for wikis that haven't been installed yet (T352113)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[00:07:25] <stashbot>	 T352113: Move the addWiki.php maintenance script from WikimediaMaintenance into MediaWiki core - https://phabricator.wikimedia.org/T352113
[00:09:50] <logmsgbot>	 !log tstarling@deploy2002 tstarling: Continuing with sync
[00:15:29] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T370903)', diff saved to https://phabricator.wikimedia.org/P71273 and previous config saved to /var/cache/conftool/dbconfig/20241128-001528-ladsgroup.json
[00:15:33] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[00:16:29] <logmsgbot>	 !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1094126|Move default main page text for new wikis to config (T352113)]], [[gerrit:1096839|Introduce preinstall.dblist for wikis that haven't been installed yet (T352113)]] (duration: 14m 42s)
[00:16:33] <stashbot>	 T352113: Move the addWiki.php maintenance script from WikimediaMaintenance into MediaWiki core - https://phabricator.wikimedia.org/T352113
[00:30:36] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P71274 and previous config saved to /var/cache/conftool/dbconfig/20241128-003035-ladsgroup.json
[00:38:27] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1098646
[00:38:27] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1098646 (owner: TrainBranchBot)
[00:38:54] <wikibugs>	 (CR) Jdlrobson: [C:-1] Allow defaulting to Parsoid Read Views when MobileFrontEnd is active (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1098549 (https://phabricator.wikimedia.org/T381002) (owner: C. Scott Ananian)
[00:45:43] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P71275 and previous config saved to /var/cache/conftool/dbconfig/20241128-004542-ladsgroup.json
[00:47:11] <wikibugs>	 (PS10) BryanDavis: [WIP] Allow provisioning MediaWiki with PHP 8.1 [puppet] - https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752)
[00:47:11] <wikibugs>	 (PS1) BryanDavis: deployment-prep: Add PHP 8.1 appservers [puppet] - https://gerrit.wikimedia.org/r/1098647 (https://phabricator.wikimedia.org/T378752)
[00:56:46] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1098646 (owner: TrainBranchBot)
[01:00:50] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T370903)', diff saved to https://phabricator.wikimedia.org/P71276 and previous config saved to /var/cache/conftool/dbconfig/20241128-010049-ladsgroup.json
[01:00:52] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2161.codfw.wmnet with reason: Maintenance
[01:00:55] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[01:01:05] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2161.codfw.wmnet with reason: Maintenance
[01:01:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T370903)', diff saved to https://phabricator.wikimedia.org/P71277 and previous config saved to /var/cache/conftool/dbconfig/20241128-010112-ladsgroup.json
[01:03:36] <icinga-wm>	 PROBLEM - dump of x1 in codfw on backupmon1001 is CRITICAL: dump for x1 at codfw (db2197) taken more than a week ago: Most recent backup 2024-11-19 00:49:20 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:08:14] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1098651
[01:08:14] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1098651 (owner: TrainBranchBot)
[01:16:00] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T370903)', diff saved to https://phabricator.wikimedia.org/P71278 and previous config saved to /var/cache/conftool/dbconfig/20241128-011559-ladsgroup.json
[01:16:05] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[01:26:24] <wikibugs>	 (PS1) Tim Starling: Add frwiki on labs for new addWiki.php test [mediawiki-config] - https://gerrit.wikimedia.org/r/1098652
[01:27:22] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1098651 (owner: TrainBranchBot)
[01:31:07] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P71279 and previous config saved to /var/cache/conftool/dbconfig/20241128-013106-ladsgroup.json
[01:46:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P71280 and previous config saved to /var/cache/conftool/dbconfig/20241128-014613-ladsgroup.json
[01:47:22] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/1c839c80f5364bbf427963aee48b37467b14b9aa844afef0d7b69339d3615845/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:01:21] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T370903)', diff saved to https://phabricator.wikimedia.org/P71281 and previous config saved to /var/cache/conftool/dbconfig/20241128-020120-ladsgroup.json
[02:01:23] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2162.codfw.wmnet with reason: Maintenance
[02:01:26] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[02:01:37] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2162.codfw.wmnet with reason: Maintenance
[02:01:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2162 (T370903)', diff saved to https://phabricator.wikimedia.org/P71282 and previous config saved to /var/cache/conftool/dbconfig/20241128-020143-ladsgroup.json
[02:02:06] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:07:22] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:08:58] <wikibugs>	 (CR) Samwilson: [C:+1] Add frwiki on labs for new addWiki.php test [mediawiki-config] - https://gerrit.wikimedia.org/r/1098652 (owner: Tim Starling)
[02:09:27] <wikibugs>	 (CR) Tim Starling: [C:+2] Add frwiki on labs for new addWiki.php test [mediawiki-config] - https://gerrit.wikimedia.org/r/1098652 (owner: Tim Starling)
[02:10:09] <wikibugs>	 (Merged) jenkins-bot: Add frwiki on labs for new addWiki.php test [mediawiki-config] - https://gerrit.wikimedia.org/r/1098652 (owner: Tim Starling)
[02:16:30] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T370903)', diff saved to https://phabricator.wikimedia.org/P71283 and previous config saved to /var/cache/conftool/dbconfig/20241128-021629-ladsgroup.json
[02:16:34] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[02:31:37] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P71284 and previous config saved to /var/cache/conftool/dbconfig/20241128-023136-ladsgroup.json
[02:36:47] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[02:39:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:46:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P71285 and previous config saved to /var/cache/conftool/dbconfig/20241128-024644-ladsgroup.json
[03:01:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T370903)', diff saved to https://phabricator.wikimedia.org/P71286 and previous config saved to /var/cache/conftool/dbconfig/20241128-030151-ladsgroup.json
[03:01:53] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2163.codfw.wmnet with reason: Maintenance
[03:01:56] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[03:02:07] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2163.codfw.wmnet with reason: Maintenance
[03:02:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T370903)', diff saved to https://phabricator.wikimedia.org/P71287 and previous config saved to /var/cache/conftool/dbconfig/20241128-030213-ladsgroup.json
[03:09:26] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:17:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T370903)', diff saved to https://phabricator.wikimedia.org/P71288 and previous config saved to /var/cache/conftool/dbconfig/20241128-031726-ladsgroup.json
[03:17:33] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[03:32:34] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P71289 and previous config saved to /var/cache/conftool/dbconfig/20241128-033234-ladsgroup.json
[03:39:11] <jinxer-wm>	 FIRING: [13x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:39:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[03:47:41] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P71290 and previous config saved to /var/cache/conftool/dbconfig/20241128-034741-ladsgroup.json
[04:02:48] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T370903)', diff saved to https://phabricator.wikimedia.org/P71291 and previous config saved to /var/cache/conftool/dbconfig/20241128-040248-ladsgroup.json
[04:02:51] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2164.codfw.wmnet with reason: Maintenance
[04:02:53] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[04:03:04] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2164.codfw.wmnet with reason: Maintenance
[04:03:06] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance
[04:03:19] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance
[04:03:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T370903)', diff saved to https://phabricator.wikimedia.org/P71292 and previous config saved to /var/cache/conftool/dbconfig/20241128-040326-ladsgroup.json
[04:18:08] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T370903)', diff saved to https://phabricator.wikimedia.org/P71294 and previous config saved to /var/cache/conftool/dbconfig/20241128-041807-ladsgroup.json
[04:18:12] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[04:33:15] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P71296 and previous config saved to /var/cache/conftool/dbconfig/20241128-043314-ladsgroup.json
[04:38:00] <wikibugs>	 (PS1) Santhosh: recommendation-api: Fix entrypoint [deployment-charts] - https://gerrit.wikimedia.org/r/1098684
[04:48:22] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P71297 and previous config saved to /var/cache/conftool/dbconfig/20241128-044822-ladsgroup.json
[04:59:41] <wikibugs>	 (CR) KartikMistry: [C:+2] recommendation-api: Fix entrypoint [deployment-charts] - https://gerrit.wikimedia.org/r/1098684 (owner: Santhosh)
[05:01:03] <wikibugs>	 (Merged) jenkins-bot: recommendation-api: Fix entrypoint [deployment-charts] - https://gerrit.wikimedia.org/r/1098684 (owner: Santhosh)
[05:03:29] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T370903)', diff saved to https://phabricator.wikimedia.org/P71298 and previous config saved to /var/cache/conftool/dbconfig/20241128-050329-ladsgroup.json
[05:03:31] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2166.codfw.wmnet with reason: Maintenance
[05:03:34] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[05:03:45] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2166.codfw.wmnet with reason: Maintenance
[05:03:53] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T370903)', diff saved to https://phabricator.wikimedia.org/P71299 and previous config saved to /var/cache/conftool/dbconfig/20241128-050352-ladsgroup.json
[05:06:58] <logmsgbot>	 !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[05:16:14] <logmsgbot>	 !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1098652|Add frwiki on labs for new addWiki.php test]]
[05:18:34] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T370903)', diff saved to https://phabricator.wikimedia.org/P71300 and previous config saved to /var/cache/conftool/dbconfig/20241128-051833-ladsgroup.json
[05:18:39] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[05:22:00] <logmsgbot>	 !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1098652|Add frwiki on labs for new addWiki.php test]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[05:23:15] <logmsgbot>	 !log tstarling@deploy2002 tstarling: Continuing with sync
[05:26:36] <icinga-wm>	 RECOVERY - snapshot of x1 in codfw on backupmon1001 is OK: Last snapshot for x1 at codfw (db2197) taken on 2024-11-28 04:58:13 (361 GiB, +0.4 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[05:29:56] <logmsgbot>	 !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098652|Add frwiki on labs for new addWiki.php test]] (duration: 13m 41s)
[05:33:41] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P71301 and previous config saved to /var/cache/conftool/dbconfig/20241128-053340-ladsgroup.json
[05:37:38] <wikibugs>	 (PS1) KartikMistry: Update recommendation-api to 2024-11-28-052541-production [deployment-charts] - https://gerrit.wikimedia.org/r/1098692
[05:41:44] <wikibugs>	 (CR) KartikMistry: [C:+2] Update recommendation-api to 2024-11-28-052541-production [deployment-charts] - https://gerrit.wikimedia.org/r/1098692 (owner: KartikMistry)
[05:42:46] <wikibugs>	 (Merged) jenkins-bot: Update recommendation-api to 2024-11-28-052541-production [deployment-charts] - https://gerrit.wikimedia.org/r/1098692 (owner: KartikMistry)
[05:48:39] <logmsgbot>	 !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[05:48:49] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P71302 and previous config saved to /var/cache/conftool/dbconfig/20241128-054847-ladsgroup.json
[06:03:56] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T370903)', diff saved to https://phabricator.wikimedia.org/P71303 and previous config saved to /var/cache/conftool/dbconfig/20241128-060355-ladsgroup.json
[06:03:58] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2167.codfw.wmnet with reason: Maintenance
[06:04:00] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[06:04:12] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2167.codfw.wmnet with reason: Maintenance
[06:04:19] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T370903)', diff saved to https://phabricator.wikimedia.org/P71304 and previous config saved to /var/cache/conftool/dbconfig/20241128-060418-ladsgroup.json
[06:16:48] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T370903)', diff saved to https://phabricator.wikimedia.org/P71305 and previous config saved to /var/cache/conftool/dbconfig/20241128-061647-ladsgroup.json
[06:16:52] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[06:31:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P71306 and previous config saved to /var/cache/conftool/dbconfig/20241128-063155-ladsgroup.json
[06:33:17] <wikibugs>	 (PS1) KartikMistry: recommendation-api: Increase helm timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1098811
[06:36:47] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[06:38:53] <wikibugs>	 (CR) KartikMistry: [C:+2] recommendation-api: Increase helm timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1098811 (owner: KartikMistry)
[06:39:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:40:07] <wikibugs>	 (Merged) jenkins-bot: recommendation-api: Increase helm timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1098811 (owner: KartikMistry)
[06:47:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P71307 and previous config saved to /var/cache/conftool/dbconfig/20241128-064702-ladsgroup.json
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T0700)
[07:00:05] <jouncebot>	 marostegui, Amir1, and arnaudb: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T0700). nyaa~
[07:02:08] <logmsgbot>	 !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[07:02:09] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T370903)', diff saved to https://phabricator.wikimedia.org/P71308 and previous config saved to /var/cache/conftool/dbconfig/20241128-070209-ladsgroup.json
[07:02:11] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2181.codfw.wmnet with reason: Maintenance
[07:02:16] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[07:02:25] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2181.codfw.wmnet with reason: Maintenance
[07:02:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T370903)', diff saved to https://phabricator.wikimedia.org/P71309 and previous config saved to /var/cache/conftool/dbconfig/20241128-070231-ladsgroup.json
[07:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:07:32] <icinga-wm>	 PROBLEM - Disk space on thanos-be1002 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sde1 174649 MB (4% inode=92%): /srv/swift-storage/sdc1 151377 MB (3% inode=91%): /srv/swift-storage/sdf1 169531 MB (4% inode=91%): /srv/swift-storage/sdd1 177193 MB (4% inode=92%): /srv/swift-storage/sdg1 179316 MB (4% inode=92%): /srv/swift-storage/sdh1 163322 MB (4% inode=91%): /srv/swift-storage/sdi1 211835 MB (5% inode=92%): /srv/swift-st
[07:07:32] <icinga-wm>	 j1 163074 MB (4% inode=92%): /srv/swift-storage/sdk1 162405 MB (4% inode=91%): /srv/swift-storage/sdm1 175927 MB (4% inode=92%): /srv/swift-storage/sdn1 186693 MB (4% inode=92%): /srv/swift-storage/sdl1 152737 MB (4% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1002&var-datasource=eqiad+prometheus/ops
[07:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:09:26] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:09:32] <wikibugs>	 (PS1) Giuseppe Lavagetto: hiddenparma: add CSRF token config [puppet] - https://gerrit.wikimedia.org/r/1098819
[07:13:05] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4604/co" [puppet] - https://gerrit.wikimedia.org/r/1098819 (owner: Giuseppe Lavagetto)
[07:13:53] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+1 C:+2] hiddenparma: add CSRF token config [puppet] - https://gerrit.wikimedia.org/r/1098819 (owner: Giuseppe Lavagetto)
[07:15:12] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release recommendation-api-ng/main on k8s-mlstaging@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=recommendation-api-ng - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[07:17:00] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T370903)', diff saved to https://phabricator.wikimedia.org/P71310 and previous config saved to /var/cache/conftool/dbconfig/20241128-071700-ladsgroup.json
[07:17:07] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[07:22:11] <wikibugs>	 (PS1) Giuseppe Lavagetto: Release CSRF token support, some UI improvements [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1098863
[07:22:35] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+2 C:+2] Release CSRF token support, some UI improvements [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1098863 (owner: Giuseppe Lavagetto)
[07:22:59] <logmsgbot>	 !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "CSRF token support - oblivian@cumin1002"
[07:23:02] <logmsgbot>	 !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: CSRF token support - oblivian@cumin1002
[07:23:37] <logmsgbot>	 !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: CSRF token support - oblivian@cumin1002
[07:23:38] <logmsgbot>	 !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "CSRF token support - oblivian@cumin1002"
[07:24:28] <wikibugs>	 (PS1) Varnent: Add foundation to list of wikis Office Wiki can import from. [mediawiki-config] - https://gerrit.wikimedia.org/r/1098865 (https://phabricator.wikimedia.org/T381063)
[07:25:12] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release recommendation-api-ng/main on k8s-mlstaging@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=recommendation-api-ng - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[07:31:31] <wikibugs>	 (PS1) Varnent: Enable Wikilove extension on Foundation Governance Wiki. [mediawiki-config] - https://gerrit.wikimedia.org/r/1098867 (https://phabricator.wikimedia.org/T381065)
[07:32:07] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P71312 and previous config saved to /var/cache/conftool/dbconfig/20241128-073207-ladsgroup.json
[07:39:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:42:06] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:43:43] <wikibugs>	 (PS1) Varnent: Allow importing from Commons and English Wikipedia to Foundation Governance Wiki. [mediawiki-config] - https://gerrit.wikimedia.org/r/1098868 (https://phabricator.wikimedia.org/T381066)
[07:44:48] <wikibugs>	 (PS1) KartikMistry: Revert "recommendation-api: Increase helm timeout" [deployment-charts] - https://gerrit.wikimedia.org/r/1098869
[07:46:19] <wikibugs>	 (CR) KartikMistry: [C:+2] Revert "recommendation-api: Increase helm timeout" [deployment-charts] - https://gerrit.wikimedia.org/r/1098869 (owner: KartikMistry)
[07:47:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P71313 and previous config saved to /var/cache/conftool/dbconfig/20241128-074714-ladsgroup.json
[07:47:22] <wikibugs>	 (Merged) jenkins-bot: Revert "recommendation-api: Increase helm timeout" [deployment-charts] - https://gerrit.wikimedia.org/r/1098869 (owner: KartikMistry)
[07:56:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1022.eqiad.wmnet
[07:56:47] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10364628 (ops-monitoring-bot) Draining ganeti1022.eqiad.wmnet of running VMs
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:02:21] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T370903)', diff saved to https://phabricator.wikimedia.org/P71314 and previous config saved to /var/cache/conftool/dbconfig/20241128-080221-ladsgroup.json
[08:02:23] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2195.codfw.wmnet with reason: Maintenance
[08:02:26] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[08:02:37] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2195.codfw.wmnet with reason: Maintenance
[08:02:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T370903)', diff saved to https://phabricator.wikimedia.org/P71315 and previous config saved to /var/cache/conftool/dbconfig/20241128-080244-ladsgroup.json
[08:15:15] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T370903)', diff saved to https://phabricator.wikimedia.org/P71316 and previous config saved to /var/cache/conftool/dbconfig/20241128-081514-ladsgroup.json
[08:15:20] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[08:25:23] <wikibugs>	 (CR) AOkoth: [C:+2] mailman: run tasks every 24 hours [puppet] - https://gerrit.wikimedia.org/r/1098489 (https://phabricator.wikimedia.org/T377045) (owner: AOkoth)
[08:25:51] <wikibugs>	 (PS1) JMeybohm: jayme: Add basic cookbook bash completion [puppet] - https://gerrit.wikimedia.org/r/1098875
[08:27:10] <wikibugs>	 (CR) JMeybohm: [C:+2] jayme: Add basic cookbook bash completion [puppet] - https://gerrit.wikimedia.org/r/1098875 (owner: JMeybohm)
[08:30:22] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P71317 and previous config saved to /var/cache/conftool/dbconfig/20241128-083021-ladsgroup.json
[08:32:48] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: increase readiness prob [deployment-charts] - https://gerrit.wikimedia.org/r/1098877
[08:35:56] <wikibugs>	 (PS2) Ilias Sarantopoulos: ml-services: increase readiness prob [deployment-charts] - https://gerrit.wikimedia.org/r/1098877
[08:41:31] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[08:43:17] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[08:44:36] <wikibugs>	 (PS2) Varnent: Allow importing from Commons and English Wikipedia to Foundation Governance Wiki. [mediawiki-config] - https://gerrit.wikimedia.org/r/1098868 (https://phabricator.wikimedia.org/T381066)
[08:45:29] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P71318 and previous config saved to /var/cache/conftool/dbconfig/20241128-084528-ladsgroup.json
[08:45:49] <wikibugs>	 (PS3) Ilias Sarantopoulos: ml-services: increase readiness prob [deployment-charts] - https://gerrit.wikimedia.org/r/1098877
[08:46:58] <wikibugs>	 (CR) Vgutierrez: [C:+1] benthos: add benthos for haproxy debug functions (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[08:48:04] <wikibugs>	 (CR) Fabfur: benthos: add benthos for haproxy debug functions (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[08:48:53] <wikibugs>	 (PS10) Fabfur: benthos: add benthos for haproxy debug functions [puppet] - https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332)
[08:49:20] <wikibugs>	 (PS1) Slyngshede: Only show sign in link for anonymous users [software/bitu] - https://gerrit.wikimedia.org/r/1098881 (https://phabricator.wikimedia.org/T380998)
[08:53:40] <wikibugs>	 (CR) David Caro: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1058612 (https://phabricator.wikimedia.org/T371501) (owner: Cathal Mooney)
[08:54:12] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release recommendation-api-ng/main on k8s-mlstaging@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=recommendation-api-ng - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:55:45] <wikibugs>	 (PS1) Slyngshede: Remove potential caching issue with permission info [software/bitu] - https://gerrit.wikimedia.org/r/1098884
[08:59:12] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release recommendation-api-ng/main on k8s-mlstaging@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=recommendation-api-ng - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:00:05] <jouncebot>	 hashar and andre: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T0900)
[09:00:11] <wikibugs>	 (PS1) DCausse: flink-app: add a component label to the flink-app configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1098885
[09:00:36] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T370903)', diff saved to https://phabricator.wikimedia.org/P71319 and previous config saved to /var/cache/conftool/dbconfig/20241128-090035-ladsgroup.json
[09:00:38] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2198.codfw.wmnet with reason: Maintenance
[09:00:41] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[09:00:51] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2198.codfw.wmnet with reason: Maintenance
[09:06:02] <hashar>	 o/
[09:06:18] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[09:08:25] <hashar>	 I am going to promote all wikis
[09:08:55] <wikibugs>	 (PS1) TrainBranchBot: group2 to 1.44.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1098887 (https://phabricator.wikimedia.org/T375664)
[09:08:57] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group2 to 1.44.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1098887 (https://phabricator.wikimedia.org/T375664) (owner: TrainBranchBot)
[09:09:11] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[09:09:40] <wikibugs>	 (Merged) jenkins-bot: group2 to 1.44.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1098887 (https://phabricator.wikimedia.org/T375664) (owner: TrainBranchBot)
[09:10:53] <wikibugs>	 (CR) Slyngshede: [C:+2] Blocking: Allow multiple account managers groups [software/bitu] - https://gerrit.wikimedia.org/r/1097336 (owner: Slyngshede)
[09:11:42] <wikibugs>	 (CR) KartikMistry: [C:+2] ml-services: increase readiness prob [deployment-charts] - https://gerrit.wikimedia.org/r/1098877 (owner: Ilias Sarantopoulos)
[09:12:29] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/1098884 (owner: Slyngshede)
[09:12:45] <wikibugs>	 (Merged) jenkins-bot: ml-services: increase readiness prob [deployment-charts] - https://gerrit.wikimedia.org/r/1098877 (owner: Ilias Sarantopoulos)
[09:12:55] <wikibugs>	 (CR) Harroyo-wmf: [C:+1] "I've found this useful to understand what this setting does: https://phabricator.wikimedia.org/diffusion/EEVB/" [mediawiki-config] - https://gerrit.wikimedia.org/r/1098561 (https://phabricator.wikimedia.org/T380599) (owner: Máté Szabó)
[09:13:54] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2200.codfw.wmnet with reason: Maintenance
[09:14:08] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2200.codfw.wmnet with reason: Maintenance
[09:14:15] <wikibugs>	 (Merged) jenkins-bot: Blocking: Allow multiple account managers groups [software/bitu] - https://gerrit.wikimedia.org/r/1097336 (owner: Slyngshede)
[09:15:31] <wikibugs>	 (CR) Slyngshede: [C:+2] Remove potential caching issue with permission info [software/bitu] - https://gerrit.wikimedia.org/r/1098884 (owner: Slyngshede)
[09:17:55] <wikibugs>	 (Merged) jenkins-bot: Remove potential caching issue with permission info [software/bitu] - https://gerrit.wikimedia.org/r/1098884 (owner: Slyngshede)
[09:18:36] <wikibugs>	 (PS11) Fabfur: benthos: add benthos for haproxy debug functions [puppet] - https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332)
[09:18:47] <wikibugs>	 (CR) Kosta Harlan: [C:+1] Allow IRS to record server-side interaction events [mediawiki-config] - https://gerrit.wikimedia.org/r/1098561 (https://phabricator.wikimedia.org/T380599) (owner: Máté Szabó)
[09:19:15] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: recapi increase readiness prob in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1098890
[09:20:19] <wikibugs>	 (PS12) Fabfur: benthos: add benthos for haproxy debug functions [puppet] - https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332)
[09:20:33] <wikibugs>	 (CR) Slyngshede: [C:+2] Show CN as signed in username [software/bitu] - https://gerrit.wikimedia.org/r/1098470 (https://phabricator.wikimedia.org/T378344) (owner: Slyngshede)
[09:20:41] <wikibugs>	 (PS9) Fabfur: hiera: add log ring to cp4039 [puppet] - https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332)
[09:20:58] <wikibugs>	 (CR) KartikMistry: [C:+2] ml-services: recapi increase readiness prob in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1098890 (owner: Ilias Sarantopoulos)
[09:21:17] <wikibugs>	 Puppet, cloud-services-team, Cloud-VPS, Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10364803 (dcaro) >>! In T379927#10354355, @Andrew wrote: > From Gerrit, @dcaro writes: >  >  >>  >> Did a quick test, there's thre...
[09:22:12] <wikibugs>	 (Merged) jenkins-bot: ml-services: recapi increase readiness prob in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1098890 (owner: Ilias Sarantopoulos)
[09:22:19] <logmsgbot>	 !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.5  refs T375664
[09:22:24] <stashbot>	 T375664: 1.44.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T375664
[09:23:09] <wikibugs>	 (Merged) jenkins-bot: Show CN as signed in username [software/bitu] - https://gerrit.wikimedia.org/r/1098470 (https://phabricator.wikimedia.org/T378344) (owner: Slyngshede)
[09:23:41] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[09:26:58] <wikibugs>	 (PS10) Fabfur: hiera: add log ring to cp4039 [puppet] - https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332)
[09:30:50] <logmsgbot>	 !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@34b35a5] (releasing): Update Jenkins version on releases2003.codfw.wmnet
[09:31:43] <logmsgbot>	 !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@34b35a5] (releasing): Update Jenkins version on releases2003.codfw.wmnet (duration: 01m 27s)
[09:33:37] <wikibugs>	 (CR) Slyngshede: [C:+2] Blocking: Show current user LDAP status [software/bitu] - https://gerrit.wikimedia.org/r/1097378 (owner: Slyngshede)
[09:35:13] <logmsgbot>	 !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@34b35a5] (releasing): Update Jenkins version on releases1003.eqiad.wmnet
[09:35:47] <wikibugs>	 (PS1) Jelto: trafficserver: switch query-scholarly to wikikube [puppet] - https://gerrit.wikimedia.org/r/1098891 (https://phabricator.wikimedia.org/T350793)
[09:36:17] <logmsgbot>	 !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@34b35a5] (releasing): Update Jenkins version on releases1003.eqiad.wmnet (duration: 01m 22s)
[09:36:37] <wikibugs>	 (Merged) jenkins-bot: Blocking: Show current user LDAP status [software/bitu] - https://gerrit.wikimedia.org/r/1097378 (owner: Slyngshede)
[09:39:48] <wikibugs>	 (CR) JMeybohm: [C:+1] trafficserver: switch query-scholarly to wikikube [puppet] - https://gerrit.wikimedia.org/r/1098891 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[09:42:19] <hashar>	 train is rather quiet as far as I can see
[09:45:46] <wikibugs>	 (CR) JMeybohm: [C:+1] modules: add mesh.configuration 1.10.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1098511 (https://phabricator.wikimedia.org/T322647) (owner: Elukey)
[09:46:38] <wikibugs>	 (PS13) Fabfur: benthos: add benthos for haproxy debug functions [puppet] - https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332)
[09:47:04] <wikibugs>	 (PS11) Fabfur: hiera: add log ring to cp4039 [puppet] - https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332)
[09:47:20] <wikibugs>	 (CR) JMeybohm: modules: add health checks to the mesh's _tcp_cluster config (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) (owner: Elukey)
[09:47:37] <wikibugs>	 (CR) JMeybohm: [C:+1] charts: update tegola-vector-tiles to mesh.configuration:1.10.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647) (owner: Elukey)
[09:49:16] <wikibugs>	 (PS1) Alexandros Kosiaris: miscweb: support os-reports deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[09:49:16] <wikibugs>	 (CR) Alexandros Kosiaris: [C:-1] "A couple of questions." [deployment-charts] - https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[09:52:41] <wikibugs>	 (PS7) Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647)
[09:52:42] <wikibugs>	 (PS6) Elukey: charts: update tegola-vector-tiles to mesh.configuration:1.10.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647)
[09:52:42] <wikibugs>	 (PS8) Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647)
[09:53:07] <wikibugs>	 (CR) Elukey: modules: add health checks to the mesh's _tcp_cluster config (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) (owner: Elukey)
[09:53:52] <wikibugs>	 (PS8) Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647)
[09:53:52] <wikibugs>	 (PS7) Elukey: charts: update tegola-vector-tiles to mesh.configuration:1.10.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647)
[09:53:52] <wikibugs>	 (PS9) Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647)
[09:54:41] <wikibugs>	 (PS9) Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647)
[09:54:41] <wikibugs>	 (PS8) Elukey: charts: update tegola-vector-tiles to mesh.configuration:1.10.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647)
[09:54:41] <wikibugs>	 (PS10) Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647)
[09:55:32] <wikibugs>	 (PS10) Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647)
[09:55:32] <wikibugs>	 (PS9) Elukey: charts: update tegola-vector-tiles to mesh.configuration:1.10.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647)
[09:55:32] <wikibugs>	 (PS11) Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647)
[09:55:48] <wikibugs>	 (CR) Elukey: "ok now it should be done :D" [deployment-charts] - https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) (owner: Elukey)
[09:58:18] <wikibugs>	 (CR) JMeybohm: services: add health checks to Tegola's postgres TCP proxy (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647) (owner: Elukey)
[09:59:02] <wikibugs>	 (CR) JMeybohm: [C:+1] modules: add health checks to the mesh's _tcp_cluster config (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) (owner: Elukey)
[10:10:44] <wikibugs>	 (CR) Elukey: services: add health checks to Tegola's postgres TCP proxy (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647) (owner: Elukey)
[10:11:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:11:31] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2012.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:11:47] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:11:47] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:11:47] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:11:47] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:11:47] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:11:57] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:12:23] <hnowlan>	 here
[10:12:49] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:12:51] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.782 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:12:55] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.398 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:13:33] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2010.codfw.wmnet, ms-fe2013.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:13:43] <jinxer-wm>	 FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[10:13:44] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[10:13:49] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.819 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:13:49] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:13:57] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 9.841 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:14:05] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.383 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:14:12] <jinxer-wm>	 FIRING: [13x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:14:26] <vgutierrez>	 Emperor: ^^ swift is struggling in codfw?
[10:14:51] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 3.189 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:14:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:15:00] <hnowlan>	 some kind of write timeouts?
[10:15:05] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:15:30] <hnowlan>	 seeing a bunch of swift.common.exceptions.ChunkWriteTimeout: 60.0 seconds
[10:15:33] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:15:49] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:15:56] <wikibugs>	 (PS1) Slyngshede: data.yaml: Extend MOU for aitolkyn [puppet] - https://gerrit.wikimedia.org/r/1098900
[10:16:09] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 4.549 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:16:49] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:16:51] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.969 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:17:21] <hnowlan>	 the swift backends look impaired too
[10:17:33] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:17:59] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:18:49] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.247 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:18:53] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.201 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:19:11] <jinxer-wm>	 FIRING: [13x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:19:13] <hashar>	 yeah Swift had a hard time apparently
[10:19:23] <hashar>	 the FileOperation logging bucket has an elevated rate of errors
[10:19:37] <hashar>	 https://grafana.wikimedia.org/d/000000102/mediawiki-production-logging?orgId=1&refresh=5m&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-level=ERROR&var-channel=FileOperation
[10:19:49] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:20:05] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:20:33] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift_80: Servers ms-fe2009.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:20:39] <hashar>	 [{reqId}] {exception_url} Wikimedia\FileBackend\FileBackendError: Iterator page I/O error. :)
[10:20:49] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:20:59] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 9.428 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:21:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:21:33] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:21:59] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:21:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:21:59] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:22:01] <elukey>	 could it be related to the swift proxies in need of a restart?
[10:22:49] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:22:49] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:22:49] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:22:49] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:22:49] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:22:59] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:23:33] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:23:33] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:23:49] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:23:55] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.355 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:23:55] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 5.593 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:23:57] <elukey>	 or maybe an ms-be misbehaving
[10:24:16] <jelto>	 elukey: see -private
[10:24:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[10:26:05] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:26:33] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:26:51] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:26:51] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:26:57] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 7.254 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:26:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:26:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:27:06] <jinxer-wm>	 FIRING: [13x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:27:09] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 5.204 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:27:25] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[10:27:27] <jinxer-wm>	 FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:27:53] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.839 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:28:49] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:29:51] <jinxer-wm>	 FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[10:29:57] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.473 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:30:34] <jelto>	 !incidents
[10:30:35] <sirenbot>	 5492 (ACKED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[10:30:35] <sirenbot>	 5493 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[10:30:35] <sirenbot>	 5494 (ACKED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin)
[10:30:35] <sirenbot>	 5495 (ACKED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[10:30:36] <sirenbot>	 5491 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[10:30:36] <sirenbot>	 5482 (RESOLVED)  [10x] ProbeDown sre (probes/service magru)
[10:30:49] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:30:59] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:31:33] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet, ms-fe2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:31:49] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:32:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:32:27] <jinxer-wm>	 RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:32:33] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[10:32:33] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:32:51] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:32:51] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:33:33] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:33:44] <jinxer-wm>	 RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[10:33:44] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[10:33:49] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:34:49] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:34:51] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:34:59] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[10:35:05] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 505 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:35:49] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.267 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:35:49] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:36:43] <icinga-wm>	 RECOVERY - snapshot of x1 in eqiad on backupmon1001 is OK: Last snapshot for x1 at eqiad (db1216) taken on 2024-11-28 10:17:50 (326 GiB, +0.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[10:36:48] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[10:37:06] <jinxer-wm>	 FIRING: [13x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:39:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:39:51] <jinxer-wm>	 FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[10:40:45] <wikibugs>	 SRE, Incident-Reporting-System (Pilot wiki release December 2024), Trust and Safety Product Sprint (Sprint Gong (November 18 - December 6)): Allow Extension:ReportIncident to make POST requests to wikimediats.zendesk.com - https://phabricator.wikimedia.org/T380908#10364928 (kostajh) In progres...
[10:41:45] <jelto>	 !incidents
[10:41:45] <sirenbot>	 5494 (ACKED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin)
[10:41:45] <sirenbot>	 5493 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[10:41:46] <sirenbot>	 5492 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[10:41:46] <sirenbot>	 5495 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[10:41:46] <sirenbot>	 5491 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[10:41:46] <sirenbot>	 5482 (RESOLVED)  [10x] ProbeDown sre (probes/service magru)
[10:42:12] <wikibugs>	 (PS1) Gerrit maintenance bot: Add zh to langlist helper [dns] - https://gerrit.wikimedia.org/r/1098905 (https://phabricator.wikimedia.org/T380119)
[10:43:33] <wikibugs>	 (CR) CI reject: [V:-1] Add zh to langlist helper [dns] - https://gerrit.wikimedia.org/r/1098905 (https://phabricator.wikimedia.org/T380119) (owner: Gerrit maintenance bot)
[10:44:54] <wikibugs>	 Puppet, cloud-services-team, Cloud-VPS, Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10365001 (fnegri) @dcaro thanks for that analysis! I had a look at the [source code for Resolv::DNS](https://github.com/ruby/ruby/...
[10:48:02] <mszabo>	 jouncebot: next
[10:48:02] <jouncebot>	 In 0 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1100)
[10:49:51] <jinxer-wm>	 RESOLVED: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging  - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[10:50:45] <wikibugs>	 (PS1) Giuseppe Lavagetto: Bugfix for commit [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1098909
[10:50:53] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+2 C:+2] Bugfix for commit [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1098909 (owner: Giuseppe Lavagetto)
[10:51:15] <logmsgbot>	 !log oblivian@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Fix commit bug - oblivian@cumin1002"
[10:51:17] <logmsgbot>	 !log oblivian@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix commit bug - oblivian@cumin1002
[10:51:49] <logmsgbot>	 !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix commit bug - oblivian@cumin1002
[10:51:50] <logmsgbot>	 !log oblivian@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Fix commit bug - oblivian@cumin1002"
[10:52:28] <wikibugs>	 (PS1) Muehlenhoff: Extend access for baitolykin [puppet] - https://gerrit.wikimedia.org/r/1098910
[10:57:23] <wikibugs>	 (PS12) Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647)
[10:57:42] <wikibugs>	 (CR) Elukey: services: add health checks to Tegola's postgres TCP proxy (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647) (owner: Elukey)
[11:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1100)
[11:00:18] <wikibugs>	 (PS1) Ilias Sarantopoulos: Revert "ml-services: recapi increase readiness prob in staging" [deployment-charts] - https://gerrit.wikimedia.org/r/1098911
[11:00:25] <wikibugs>	 (PS1) Ilias Sarantopoulos: Revert "ml-services: increase readiness prob" [deployment-charts] - https://gerrit.wikimedia.org/r/1098912
[11:03:28] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.mysql.pool db2237 gradually with 4 steps - Maint over (T379813)
[11:03:32] <stashbot>	 T379813: Wikimedia\Rdbms\DBQueryError: Error 1034: Index for table 'wbc_entity_usage' is corrupt; try to repair itFunction: Wikibase\Client\Usage\Sql\EntityUsageTable::queryUsagesQuery: SELECT  eu_aspect,eu_entity_id  FROM `wbc_entity - https://phabricator.wikimedia.org/T379813
[11:03:56] <wikibugs>	 (CR) JMeybohm: [C:+1] services: add health checks to Tegola's postgres TCP proxy (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647) (owner: Elukey)
[11:04:18] <wikibugs>	 (PS1) Máté Szabó: ReportIncident: Enable instrumentation on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/1098913 (https://phabricator.wikimedia.org/T372823)
[11:04:37] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2033.codfw.wmnet with reason: Maintenance
[11:04:51] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2033.codfw.wmnet with reason: Maintenance
[11:04:58] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling es2033 (T376905)', diff saved to https://phabricator.wikimedia.org/P71324 and previous config saved to /var/cache/conftool/dbconfig/20241128-110457-ladsgroup.json
[11:06:31] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1098561 (https://phabricator.wikimedia.org/T380599) (owner: Máté Szabó)
[11:08:18] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2204.codfw.wmnet with reason: Maintenance
[11:08:31] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2204.codfw.wmnet with reason: Maintenance
[11:09:26] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:10:20] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[11:10:34] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[11:10:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1022.eqiad.wmnet
[11:10:51] <wikibugs>	 Puppet, cloud-services-team, Cloud-VPS, Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10365134 (fnegri)
[11:11:29] <moritzm>	 !log removing ganeti1022 from active Ganeti nodes T378921
[11:11:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:33] <stashbot>	 T378921: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921
[11:11:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2033 (T376905)', diff saved to https://phabricator.wikimedia.org/P71325 and previous config saved to /var/cache/conftool/dbconfig/20241128-111154-ladsgroup.json
[11:12:40] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[11:12:49] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10365139 (MoritzMuehlenhoff)
[11:12:54] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[11:13:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1236 (T370903)', diff saved to https://phabricator.wikimedia.org/P71326 and previous config saved to /var/cache/conftool/dbconfig/20241128-111300-ladsgroup.json
[11:13:06] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[11:13:34] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1098913 (https://phabricator.wikimedia.org/T372823) (owner: Máté Szabó)
[11:14:01] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti1022 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[11:14:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1018.eqiad.wmnet
[11:14:28] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10365147 (ops-monitoring-bot) Draining ganeti1018.eqiad.wmnet of running VMs
[11:14:35] <icinga-wm>	 PROBLEM - ganeti-confd running on ganeti1022 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[11:15:10] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T370903)', diff saved to https://phabricator.wikimedia.org/P71327 and previous config saved to /var/cache/conftool/dbconfig/20241128-111510-ladsgroup.json
[11:17:06] <jinxer-wm>	 FIRING: [13x] ProbeDown: Service ganeti1022:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:21:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1018.eqiad.wmnet
[11:22:06] <jinxer-wm>	 FIRING: [13x] ProbeDown: Service ganeti1022:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:23:31] <Amir1>	 jouncebot: nowandnext
[11:23:31] <jouncebot>	 For the next 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1100)
[11:23:31] <jouncebot>	 In 1 hour(s) and 36 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1300)
[11:27:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2033', diff saved to https://phabricator.wikimedia.org/P71329 and previous config saved to /var/cache/conftool/dbconfig/20241128-112701-ladsgroup.json
[11:29:14] <wikibugs>	 (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1098910 (owner: Muehlenhoff)
[11:29:22] <wikibugs>	 (Abandoned) Slyngshede: data.yaml: Extend MOU for aitolkyn [puppet] - https://gerrit.wikimedia.org/r/1098900 (owner: Slyngshede)
[11:30:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P71330 and previous config saved to /var/cache/conftool/dbconfig/20241128-113017-ladsgroup.json
[11:30:24] <wikibugs>	 (PS1) Ladsgroup: Bump ratio of new parsercache key spec to 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1098914 (https://phabricator.wikimedia.org/T373037)
[11:30:25] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Extend access for baitolykin [puppet] - https://gerrit.wikimedia.org/r/1098910 (owner: Muehlenhoff)
[11:31:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1018.eqiad.wmnet
[11:32:32] <wikibugs>	 (PS1) Tim Starling: addWiki.php tweaks [mediawiki-config] - https://gerrit.wikimedia.org/r/1098915
[11:32:32] <wikibugs>	 (PS1) Tim Starling: Run dumpInterwiki.php locally with no changes [mediawiki-config] - https://gerrit.wikimedia.org/r/1098916
[11:32:32] <wikibugs>	 (PS1) Tim Starling: Prepare id.wikivoyage.org for installation [mediawiki-config] - https://gerrit.wikimedia.org/r/1098917 (https://phabricator.wikimedia.org/T380726)
[11:32:35] <wikibugs>	 (PS1) Tim Starling: Activate id.wikivoyage.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1098918 (https://phabricator.wikimedia.org/T380726)
[11:33:23] <wikibugs>	 (CR) CI reject: [V:-1] Prepare id.wikivoyage.org for installation [mediawiki-config] - https://gerrit.wikimedia.org/r/1098917 (https://phabricator.wikimedia.org/T380726) (owner: Tim Starling)
[11:34:11] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10365209 (ops-monitoring-bot) Draining ganeti1018.eqiad.wmnet of running VMs
[11:34:44] <wikibugs>	 (CR) Ladsgroup: [C:+1] addWiki.php tweaks [mediawiki-config] - https://gerrit.wikimedia.org/r/1098915 (owner: Tim Starling)
[11:38:27] <wikibugs>	 (CR) Ladsgroup: Run dumpInterwiki.php locally with no changes (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1098916 (owner: Tim Starling)
[11:39:27] <wikibugs>	 (PS2) Tim Starling: addWiki.php tweaks [mediawiki-config] - https://gerrit.wikimedia.org/r/1098915
[11:39:27] <wikibugs>	 (PS2) Tim Starling: Run dumpInterwiki.php locally with no changes [mediawiki-config] - https://gerrit.wikimedia.org/r/1098916
[11:39:27] <wikibugs>	 (PS2) Tim Starling: Prepare id.wikivoyage.org for installation [mediawiki-config] - https://gerrit.wikimedia.org/r/1098917 (https://phabricator.wikimedia.org/T380726)
[11:39:28] <wikibugs>	 (PS2) Tim Starling: Activate id.wikivoyage.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1098918 (https://phabricator.wikimedia.org/T380726)
[11:39:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[11:41:25] <wikibugs>	 Puppet, cloud-services-team, Cloud-VPS, Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10365244 (dcaro) Nice!  I tried with: `   resolver = Resolv::DNS.new(     :nameserver => '127.0.0.1',     :raise_timeout_erros =>...
[11:41:29] <wikibugs>	 (CR) Ladsgroup: [C:+1] addWiki.php tweaks [mediawiki-config] - https://gerrit.wikimedia.org/r/1098915 (owner: Tim Starling)
[11:42:09] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2033', diff saved to https://phabricator.wikimedia.org/P71333 and previous config saved to /var/cache/conftool/dbconfig/20241128-114208-ladsgroup.json
[11:44:42] <wikibugs>	 SRE-swift-storage: internal_api_error_UploadChunkFileException - https://phabricator.wikimedia.org/T381093#10365253 (Nemoralis) I also encountered this error when uploading a bulk file recently. > An unknown error occurred in storage backend "local-swift-codfw"
[11:45:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P71334 and previous config saved to /var/cache/conftool/dbconfig/20241128-114524-ladsgroup.json
[11:48:52] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2237 gradually with 4 steps - Maint over (T379813)
[11:48:56] <stashbot>	 T379813: Wikimedia\Rdbms\DBQueryError: Error 1034: Index for table 'wbc_entity_usage' is corrupt; try to repair itFunction: Wikibase\Client\Usage\Sql\EntityUsageTable::queryUsagesQuery: SELECT  eu_aspect,eu_entity_id  FROM `wbc_entity - https://phabricator.wikimedia.org/T379813
[11:50:12] <wikibugs>	 (CR) Muehlenhoff: "Looks good in general, two comments/questions inline" [puppet] - https://gerrit.wikimedia.org/r/1098023 (https://phabricator.wikimedia.org/T380402) (owner: Slyngshede)
[11:50:52] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1098914 (https://phabricator.wikimedia.org/T373037) (owner: Ladsgroup)
[11:51:34] <wikibugs>	 (Merged) jenkins-bot: Bump ratio of new parsercache key spec to 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1098914 (https://phabricator.wikimedia.org/T373037) (owner: Ladsgroup)
[11:51:57] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1098914|Bump ratio of new parsercache key spec to 2 (T373037)]]
[11:52:02] <stashbot>	 T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037
[11:53:09] <wikibugs>	 (CR) Muehlenhoff: "Well, no your patch needs to be merged first :-)" [puppet] - https://gerrit.wikimedia.org/r/1074381 (owner: Muehlenhoff)
[11:54:55] <wikibugs>	 (PS3) Muehlenhoff: Move Puppet CA monitoring out of the puppetmaster module [puppet] - https://gerrit.wikimedia.org/r/1091194 (https://phabricator.wikimedia.org/T365798)
[11:57:12] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1098914|Bump ratio of new parsercache key spec to 2 (T373037)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[11:57:16] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2033 (T376905)', diff saved to https://phabricator.wikimedia.org/P71336 and previous config saved to /var/cache/conftool/dbconfig/20241128-115715-ladsgroup.json
[11:57:17] <stashbot>	 T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037
[11:57:21] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2031.codfw.wmnet with reason: Maintenance
[11:57:35] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2031.codfw.wmnet with reason: Maintenance
[11:57:41] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling es2031 (T376905)', diff saved to https://phabricator.wikimedia.org/P71337 and previous config saved to /var/cache/conftool/dbconfig/20241128-115741-ladsgroup.json
[11:57:46] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[11:58:32] <wikibugs>	 (CR) Kosta Harlan: [C:+1] ReportIncident: Enable instrumentation on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/1098913 (https://phabricator.wikimedia.org/T372823) (owner: Máté Szabó)
[11:59:19] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1091194 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[12:00:31] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T370903)', diff saved to https://phabricator.wikimedia.org/P71338 and previous config saved to /var/cache/conftool/dbconfig/20241128-120031-ladsgroup.json
[12:00:36] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[12:04:35] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098914|Bump ratio of new parsercache key spec to 2 (T373037)]] (duration: 12m 37s)
[12:04:37] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2031 (T376905)', diff saved to https://phabricator.wikimedia.org/P71339 and previous config saved to /var/cache/conftool/dbconfig/20241128-120437-ladsgroup.json
[12:04:39] <stashbot>	 T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037
[12:05:19] <wikibugs>	 SRE-swift-storage: internal_api_error_UploadChunkFileException - https://phabricator.wikimedia.org/T381093#10365302 (Yann) Again after the last chunk :(((  `02957: FAILED: internal_api_error_DBQueryError: [6d8711c6-4dea-4c57-a254-3e8c35471315] Caught exception of type Wikimedia\Rdbms\DBQueryError`
[12:11:15] <wikibugs>	 (PS14) Fabfur: benthos: add benthos for haproxy debug functions [puppet] - https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332)
[12:11:41] <wikibugs>	 (PS12) Fabfur: hiera: add log ring to cp4039 [puppet] - https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332)
[12:17:20] <wikibugs>	 (PS1) Máté Szabó: ReportIncident: Setup $wgReportIncidentLocalLinks for ptwiki pilot deploy [mediawiki-config] - https://gerrit.wikimedia.org/r/1098929 (https://phabricator.wikimedia.org/T380277)
[12:19:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2031', diff saved to https://phabricator.wikimedia.org/P71340 and previous config saved to /var/cache/conftool/dbconfig/20241128-121943-ladsgroup.json
[12:21:21] <wikibugs>	 10SRE-swift-storage: internal_api_error_UploadChunkFileException - https://phabricator.wikimedia.org/T381093#10365335 (10MatthewVernon) There was an incident that impacted codfw swift earlier today (from around 09:55 to 10:55 UTC); this seems likely a consequence of that, so I'd expect a retry would now be succe...
[12:23:48] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[12:23:52] <wikibugs>	 (03PS9) 10Clément Goubert: mediawiki: Add mwcron feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555)
[12:24:49] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] LVS: enable do_ipv6_ra_primary in all sites [puppet] - 10https://gerrit.wikimedia.org/r/1093958 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh)
[12:25:16] <wikibugs>	 (03Abandoned) 10Cathal Mooney: Temporarily change cumin installserver alias to not include mgaru [puppet] - 10https://gerrit.wikimedia.org/r/1093322 (https://phabricator.wikimedia.org/T376737) (owner: 10Cathal Mooney)
[12:28:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C:04-1] cloudweb: Restrict access to Envoy port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1098556 (owner: 10Muehlenhoff)
[12:29:05] <wikibugs>	 (03CR) 10Clément Goubert: mediawiki: Add mwcron feature (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert)
[12:29:35] <wikibugs>	 (03CR) 10Clément Goubert: mediawiki: Add mwcron feature (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert)
[12:34:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2031', diff saved to https://phabricator.wikimedia.org/P71342 and previous config saved to /var/cache/conftool/dbconfig/20241128-123451-ladsgroup.json
[12:39:56] <wikibugs>	 (03PS1) 10Jaime Nuche: scap target: ensure scap is installed on host before it is required [puppet] - 10https://gerrit.wikimedia.org/r/1098933 (https://phabricator.wikimedia.org/T378769)
[12:42:43] <wikibugs>	 (03CR) 10Jaime Nuche: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1098933 (https://phabricator.wikimedia.org/T378769) (owner: 10Jaime Nuche)
[12:49:50] <wikibugs>	 (03PS1) 10Muehlenhoff: Add ferm macro/nftables set for loadbalancer nodes [puppet] - 10https://gerrit.wikimedia.org/r/1098936
[12:49:58] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2031 (T376905)', diff saved to https://phabricator.wikimedia.org/P71343 and previous config saved to /var/cache/conftool/dbconfig/20241128-124957-ladsgroup.json
[12:52:27] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: toolforge::prometheus: add exporter for the k8s cert expiry (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro)
[12:53:38] <wikibugs>	 (03PS1) 10Btullis: Add keytab files for the hadoop workers in the analytics horizon project [labs/private] - 10https://gerrit.wikimedia.org/r/1098937 (https://phabricator.wikimedia.org/T381087)
[12:54:36] <wikibugs>	 (03CR) 10Btullis: [V:03+2 C:03+2] Add keytab files for the hadoop workers in the analytics horizon project [labs/private] - 10https://gerrit.wikimedia.org/r/1098937 (https://phabricator.wikimedia.org/T381087) (owner: 10Btullis)
[12:55:33] <wikibugs>	 (03CR) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro)
[12:56:33] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098509 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro)
[12:56:49] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-main1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[12:57:08] <icinga-wm>	 PROBLEM - Kafka Broker Server #page on kafka-main1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[12:57:08] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka-main1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker
[12:57:21] <kamila_>	 !incidents
[12:57:21] <sirenbot>	 5496 (UNACKED)  kafka-main1002/Kafka Broker Server (paged)
[12:57:21] <sirenbot>	 5494 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin)
[12:57:22] <sirenbot>	 5493 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[12:57:22] <sirenbot>	 5492 (RESOLVED)  VarnishUnavailable global sre (varnish-upload thanos-rule)
[12:57:22] <sirenbot>	 5495 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[12:57:22] <sirenbot>	 5491 (RESOLVED)  ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[12:57:27] <wikibugs>	 (03PS16) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579)
[12:57:30] <kamila_>	 !ack 5496
[12:57:30] <sirenbot>	 5496 (ACKED)  kafka-main1002/Kafka Broker Server (paged)
[12:58:08] <wikibugs>	 (03CR) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro)
[12:58:14] <hnowlan>	 effie: might this be related to your work? 
[12:58:40] <effie>	 but kafka-main1002 is now a spare
[12:58:41] <effie>	 sigh 
[12:58:48] <claime>	 downtime just expired
[12:59:26] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:59:42] <effie>	 sure, but, since yesterday the server is a spare server 
[12:59:56] <hnowlan>	 puppet is disabled
[13:00:03] <hnowlan>	 so that hasn't taken effect
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1300)
[13:00:08] <effie>	 wait
[13:00:24] <hnowlan>	 "Puppet is disabled. Hardware
[13:00:59] <effie>	 ok ok my miss, however I still have questions
[13:01:02] <effie>	 I will run puppet now 
[13:01:34] <vgutierrez>	 what's going on with kafka-main? it's impacting the CDN
[13:03:39] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#10365419 (10Krd) Please unbreak now.
[13:04:46] <hnowlan>	 effie has been refreshing some of the hosts but it shouldn't be impacting the CDN. vgutierrez: where can I see the impact?  
[13:04:51] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: toolforge::prometheus: add exporter for the k8s cert expiry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro)
[13:05:23] <effie>	 vgutierrez: we have switched to 1007 since yesterday 
[13:05:52] <vgutierrez>	 we've been getting lag alerts this morning 
[13:06:17] <effie>	 vgutierrez: can you please elaborate ?
[13:06:22] <fabfur>	 and some yesterday too, but as they were localized in magru I thought it was due to the activities
[13:06:56] <wikibugs>	 (03PS1) 10NMW03: Revert^2 "Add contact form for U4C" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098939
[13:07:31] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098939 (owner: 10NMW03)
[13:08:02] <fabfur>	 effie: -sre-private
[13:08:28] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1098936 (owner: 10Muehlenhoff)
[13:09:36] <wikibugs>	 (03CR) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro)
[13:09:50] <wikibugs>	 (03PS2) 10NMW03: Revert^2 "Add contact form for U4C" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098939
[13:10:14] <fabfur>	 effie: sorry for the double channel switch, my bad, the alerts we received were about lag: `PurgedHighEventLag: High event process lag with purged on cp5017` and such
[13:10:45] <effie>	 no problem 
[13:11:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:11:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Move Puppet CA monitoring out of the puppetmaster module [puppet] - 10https://gerrit.wikimedia.org/r/1091194 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff)
[13:12:18] <Nemoralis>	 jouncebot: next
[13:12:18] <jouncebot>	 In 0 hour(s) and 47 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1400)
[13:12:46] <Nemoralis>	 jouncebot: now
[13:12:46] <jouncebot>	 For the next 0 hour(s) and 47 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1300)
[13:13:15] <wikibugs>	 (03PS1) 10Cathal Mooney: Remove JIO direct path via peering from AVOIDED-PATHS in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/1098942 (https://phabricator.wikimedia.org/T373015)
[13:14:15] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Packet loss reflected in NELs for traffic to Reliance Jio Infocomm Ltd over BBIX Singapore - https://phabricator.wikimedia.org/T373015#10365435 (10cmooney) I tested removing this as-path from being avoided on cr2-eqsin and there was no pack...
[13:14:43] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Remove JIO direct path via peering from AVOIDED-PATHS in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/1098942 (https://phabricator.wikimedia.org/T373015) (owner: 10Cathal Mooney)
[13:15:29] <wikibugs>	 (03Merged) 10jenkins-bot: Remove JIO direct path via peering from AVOIDED-PATHS in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/1098942 (https://phabricator.wikimedia.org/T373015) (owner: 10Cathal Mooney)
[13:16:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:21:57] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: toolforge::prometheus: add exporter for the k8s cert expiry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro)
[13:24:54] <wikibugs>	 (03CR) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro)
[13:27:14] <wikibugs>	 07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10365459 (10fnegri) > kinda weird behavior if you ask me  I agree this is quite confusing and also poorly documented.  One thing I d...
[13:30:28] <wikibugs>	 (03PS17) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579)
[13:31:09] <wikibugs>	 (03CR) 10CI reject: [V:04-1] toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro)
[13:37:03] <wikibugs>	 (03PS5) 10Krinkle: [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01)
[13:39:02] <wikibugs>	 (03CR) 10Krinkle: [C:03+1] "I've folded the code into cluster_fe_hash and added a doc comment indicating the restrictions and caveats learned from last time this was " [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01)
[13:39:25] <wikibugs>	 (03CR) 10Zabe: [C:04-1] "nope" [dns] - 10https://gerrit.wikimedia.org/r/1098905 (https://phabricator.wikimedia.org/T380119) (owner: 10Gerrit maintenance bot)
[13:39:41] <wikibugs>	 (03PS6) 10Krinkle: [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01)
[13:45:06] <wikibugs>	 (03PS1) 10Muehlenhoff: ganeti1022: update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1098951
[13:46:39] <wikibugs>	 (03PS7) 10Krinkle: [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01)
[13:46:49] <wikibugs>	 (03CR) 10Gergő Tisza: [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01)
[13:47:09] <wikibugs>	 (03CR) 10Krinkle: [C:03+1] "I've tried to summarise the situation in the commit message as best I can, for review by SRE. @Derick/Gergo is this accurate?" [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01)
[13:48:39] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2028.codfw.wmnet with reason: Maintenance
[13:48:53] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2028.codfw.wmnet with reason: Maintenance
[13:49:00] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling es2028 (T376905)', diff saved to https://phabricator.wikimedia.org/P71344 and previous config saved to /var/cache/conftool/dbconfig/20241128-134859-ladsgroup.json
[13:49:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] ganeti1022: update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1098951 (owner: 10Muehlenhoff)
[13:49:57] <wikibugs>	 (03CR) 10Krinkle: [C:03+1] [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01)
[13:54:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2028 (T376905)', diff saved to https://phabricator.wikimedia.org/P71345 and previous config saved to /var/cache/conftool/dbconfig/20241128-135451-ladsgroup.json
[13:55:23] <wikibugs>	 (03PS1) 10Muehlenhoff: cloudweb/codfw1dev: Use firewall::service for firewall definitions [puppet] - 10https://gerrit.wikimedia.org/r/1098952
[13:56:50] <wikibugs>	 (03CR) 10Elukey: [C:03+2] modules: add mesh.configuration 1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098511 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey)
[13:57:52] <wikibugs>	 (03Merged) 10jenkins-bot: modules: add mesh.configuration 1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098511 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey)
[13:58:11] <wikibugs>	 (03CR) 10Elukey: [C:03+2] modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey)
[13:58:18] <wikibugs>	 (03CR) 10CI reject: [V:04-1] modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey)
[13:58:20] <wikibugs>	 (03PS11) 10Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647)
[13:58:26] <wikibugs>	 (03PS10) 10Elukey: charts: update tegola-vector-tiles to mesh.configuration:1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647)
[13:58:31] <wikibugs>	 (03PS13) 10Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647)
[13:58:47] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1098952 (owner: 10Muehlenhoff)
[13:58:56] <wikibugs>	 (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey)
[13:59:18] <wikibugs>	 (03PS18) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579)
[13:59:57] <wikibugs>	 (03Merged) 10jenkins-bot: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey)
[14:00:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro)
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1400). nyaa~
[14:00:05] <jouncebot>	 tgr, mszabo, abijeet, and Nemoralis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:07] <wikibugs>	 (03CR) 10Elukey: [C:03+2] charts: update tegola-vector-tiles to mesh.configuration:1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey)
[14:00:14] <wikibugs>	 (03CR) 10Elukey: [C:03+2] services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647) (owner: 10Elukey)
[14:00:42] <Nemoralis>	 o/
[14:00:44] <mszabo>	 o/
[14:01:41] <wikibugs>	 (03PS19) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579)
[14:02:17] <wikibugs>	 (03CR) 10Majavah: cloudweb: Restrict access to Envoy port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1098556 (owner: 10Muehlenhoff)
[14:02:25] <wikibugs>	 (03CR) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro)
[14:02:26] <Lucas_WMDE>	 I can probably deploy in a few minutes
[14:02:27] <wikibugs>	 (03CR) 10CI reject: [V:04-1] toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro)
[14:02:30] <tgr|away>	 o/
[14:02:42] <Nemoralis>	 thanks Lucas
[14:02:55] <Lucas_WMDE>	 tgr|away: I’d say feel free to start if you want to self-service :)
[14:02:55] <urbanecm>	 i can deploy now if you want me to?
[14:02:59] <Lucas_WMDE>	 or that
[14:03:02] <Lucas_WMDE>	 sure!
[14:03:11] <urbanecm>	 tgr's patches are backports, so they'd take 20 mins on CI anyway
[14:03:15] <wikibugs>	 (03PS20) 10David Caro: toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579)
[14:03:17] <urbanecm>	 (or well, maybe not for CA)
[14:03:20] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Use `useformat` query param for device detection or mobile domain (m.) [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1098622 (https://phabricator.wikimedia.org/T380646) (owner: 10Gergő Tisza)
[14:03:20] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Use `useformat` query param for device detection or mobile domain (m.) [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098623 (https://phabricator.wikimedia.org/T380646) (owner: 10Gergő Tisza)
[14:03:51] <urbanecm>	 mszabo: i have to say, starting commit messages with "Allow IRS to record" prompts whole other meanings in my head
[14:04:06] <wikibugs>	 (03PS2) 10Máté Szabó: Allow IRS to record server-side interaction events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098561 (https://phabricator.wikimedia.org/T380599)
[14:04:10] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Allow IRS to record server-side interaction events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098561 (https://phabricator.wikimedia.org/T380599) (owner: 10Máté Szabó)
[14:04:24] <mszabo>	 urbanecm: I've cracked that joke a few times in meetings but nobody laughed, in part possibly due to negative past interactions with said abbreviation
[14:04:40] <urbanecm>	 hehe
[14:04:56] <wikibugs>	 (03Merged) 10jenkins-bot: Allow IRS to record server-side interaction events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098561 (https://phabricator.wikimedia.org/T380599) (owner: 10Máté Szabó)
[14:04:57] <tgr|away>	 "server-side events now subject to 3.5% VAT"
[14:05:14] <apergos>	 or a 10% tariff...
[14:05:35] <wikibugs>	 (03CR) 10Urbanecm: ReportIncident: Enable instrumentation on labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098913 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó)
[14:05:51] <mszabo>	 submit a W-1776 to report an incident
[14:05:52] <Lucas_WMDE>	 now we need a backronym for HMRC
[14:06:13] <urbanecm>	 mszabo: i can never remember those numbers
[14:06:34] <wikibugs>	 (03CR) 10David Caro: [C:03+2] toolforge::prometheus: add exporter for the k8s cert expiry [puppet] - 10https://gerrit.wikimedia.org/r/1097450 (https://phabricator.wikimedia.org/T366579) (owner: 10David Caro)
[14:06:38] <wikibugs>	 (03PS3) 10NMW03: Revert^2 "Add contact form for U4C" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098939
[14:06:42] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Revert^2 "Add contact form for U4C" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098939 (owner: 10NMW03)
[14:06:49] <mszabo>	 urbanecm: yeah, I'm happy not to have to deal with that
[14:06:51] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[14:06:54] <urbanecm>	 yep yep
[14:07:06] <urbanecm>	 mszabo: on a serious note, can you take a look at my comment at https://gerrit.wikimedia.org/r/1098913, please?
[14:07:23] <urbanecm>	 abijeet: hi, around too?
[14:07:24] <wikibugs>	 (03Merged) 10jenkins-bot: Revert^2 "Add contact form for U4C" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098939 (owner: 10NMW03)
[14:07:25] <mszabo>	 urbanecm: sure, one sec
[14:08:11] <urbanecm>	 tgr|away: phan fails for one of your backports (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1098622), can you take a look, please?
[14:08:27] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1098561|Allow IRS to record server-side interaction events (T380599)]], [[gerrit:1098939|Revert^2 "Add contact form for U4C"]]
[14:08:27] <wikibugs>	 (03PS1) 10ArielGlenn: extend account creation lookup service to cover forced creations by others [extensions/CheckUser] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098956 (https://phabricator.wikimedia.org/T378401)
[14:08:32] <stashbot>	 T380599: Record server-side interaction event for IRS non-emergency flow submissions - https://phabricator.wikimedia.org/T380599
[14:08:39] * Lucas_WMDE is now also available for deployment if needed
[14:08:55] <urbanecm>	 although i guess `Error cloning https://gerrit.wikimedia.org/r/mediawiki/extensions/CheckUser to /workspace/src/extensions/CheckUser` might be transient...
[14:09:31] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Use `useformat` query param for device detection or mobile domain (m.) [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1098622 (https://phabricator.wikimedia.org/T380646) (owner: 10Gergő Tisza)
[14:09:48] <wikibugs>	 (03CR) 10Urbanecm: Use `useformat` query param for device detection or mobile domain (m.) [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1098622 (https://phabricator.wikimedia.org/T380646) (owner: 10Gergő Tisza)
[14:09:52] <urbanecm>	 i'll try again
[14:09:54] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Use `useformat` query param for device detection or mobile domain (m.) [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1098622 (https://phabricator.wikimedia.org/T380646) (owner: 10Gergő Tisza)
[14:09:56] <tgr|away>	 thx
[14:09:58] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2028', diff saved to https://phabricator.wikimedia.org/P71346 and previous config saved to /var/cache/conftool/dbconfig/20241128-140958-ladsgroup.json
[14:10:02] <wikibugs>	 (03PS2) 10Máté Szabó: ReportIncident: Enable instrumentation on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098913 (https://phabricator.wikimedia.org/T372823)
[14:10:07] <wikibugs>	 (03CR) 10Máté Szabó: ReportIncident: Enable instrumentation on labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098913 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó)
[14:10:14] <mszabo>	 urbanecm: done
[14:10:33] <urbanecm>	 mszabo: ty!
[14:11:37] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] ReportIncident: Enable instrumentation on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098913 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó)
[14:12:20] <wikibugs>	 (03Merged) 10jenkins-bot: ReportIncident: Enable instrumentation on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098913 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó)
[14:12:39] <urbanecm>	 mszabo: should be deployed on beta automatically
[14:13:41] <wikibugs>	 (03CR) 10Gergő Tisza: [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01)
[14:14:28] <wikibugs>	 (03Merged) 10jenkins-bot: Use `useformat` query param for device detection or mobile domain (m.) [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098623 (https://phabricator.wikimedia.org/T380646) (owner: 10Gergő Tisza)
[14:14:32] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10365530 (10MoritzMuehlenhoff)
[14:14:37] <logmsgbot>	 !log urbanecm@deploy2002 nmw03, mszabo, urbanecm: Backport for [[gerrit:1098561|Allow IRS to record server-side interaction events (T380599)]], [[gerrit:1098939|Revert^2 "Add contact form for U4C"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:14:42] <stashbot>	 T380599: Record server-side interaction event for IRS non-emergency flow submissions - https://phabricator.wikimedia.org/T380599
[14:14:45] <urbanecm>	 mszabo: Nemoralis: can you test at mwdebug, please?
[14:14:48] <Nemoralis>	 sure
[14:14:57] <moritzm>	 !log installing apr security updates
[14:15:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:37] <Nemoralis>	 LGTM!
[14:15:46] <urbanecm>	 ty!
[14:15:47] <mszabo>	 urbanecm: lgtm
[14:15:49] <urbanecm>	 ty
[14:15:50] <logmsgbot>	 !log urbanecm@deploy2002 nmw03, mszabo, urbanecm: Continuing with sync
[14:15:53] <urbanecm>	 proceeding
[14:16:29] <urbanecm>	 abijeet: hi, around for deployment?
[14:17:47] <abijeet>	 urbanecm, hey, I'm here
[14:17:57] <urbanecm>	 hello :)
[14:18:05] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Enable message group subscription feature for some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098509 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro)
[14:18:06] <abijeet>	 urbanecm, sorry for the delayed response
[14:18:09] <urbanecm>	 let's deploy then
[14:18:10] <urbanecm>	 no worries
[14:18:26] <apergos>	 tgr|away: while you're waiting on your patches: your +2 on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1090920 failed to go through ("This change depends on a change that failed to merge."), can you kick it again? 
[14:18:57] <wikibugs>	 (03Merged) 10jenkins-bot: Enable message group subscription feature for some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098509 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro)
[14:20:08] <wikibugs>	 (03Merged) 10jenkins-bot: Use `useformat` query param for device detection or mobile domain (m.) [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1098622 (https://phabricator.wikimedia.org/T380646) (owner: 10Gergő Tisza)
[14:20:35] <mszabo>	 urbanecm: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ is what I need to track to see the progress of the beta update right?
[14:21:44] <tgr|away>	 apergos: oh, right, because dependency is not limited by branch
[14:22:03] <apergos>	 ty for the kick 
[14:22:27] <Dreamy_Jazz>	 !log Restarted MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration
[14:22:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:35] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098561|Allow IRS to record server-side interaction events (T380599)]], [[gerrit:1098939|Revert^2 "Add contact form for U4C"]] (duration: 14m 07s)
[14:22:40] <stashbot>	 T380599: Record server-side interaction event for IRS non-emergency flow submissions - https://phabricator.wikimedia.org/T380599
[14:23:20] <urbanecm>	 mszabo: that and the sync job
[14:23:29] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1098623|Use `useformat` query param for device detection or mobile domain (m.) (T380646 T375788)]], [[gerrit:1098913|ReportIncident: Enable instrumentation on labs (T372823)]], [[gerrit:1098509|Enable message group subscription feature for some wikis (T372386)]], [[gerrit:1098622|Use `useformat` query param for device detection or mobile domain (m.) (T380646 T375788)]]
[14:23:30] <mszabo>	 awesome, thanks
[14:23:31] <urbanecm>	 https://integration.wikimedia.org/ci/job/beta-scap-sync-world/, which would be triggered once this one finishes
[14:23:38] <stashbot>	 T380646: Centralize SUL2 and SUL3 device detection - https://phabricator.wikimedia.org/T380646
[14:23:38] <stashbot>	 T375788: Implement SUL3 central autologin - https://phabricator.wikimedia.org/T375788
[14:23:39] <stashbot>	 T372823: Instrumentation for Incident Reporting System - https://phabricator.wikimedia.org/T372823
[14:23:39] <stashbot>	 T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386
[14:24:34] <abijeet_>	 still here under a slightly different name :-)
[14:25:02] <Dreamy_Jazz>	 !log Started MediaModeration scanning scripts to run again over all wikis
[14:25:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:05] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2028', diff saved to https://phabricator.wikimedia.org/P71347 and previous config saved to /var/cache/conftool/dbconfig/20241128-142505-ladsgroup.json
[14:27:56] <apergos>	 but why isn't it in the gate-and-submit queue now?  does jenkins need to be told the equiv of recheck? 
[14:28:11] <apergos>	 tgr|away:  ^^ 
[14:28:27] <urbanecm>	 apergos: unmet dependencies
[14:28:34] <urbanecm>	 it's now waiting on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/1098956
[14:28:41] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm, tgr, abi, mszabo: Backport for [[gerrit:1098623|Use `useformat` query param for device detection or mobile domain (m.) (T380646 T375788)]], [[gerrit:1098913|ReportIncident: Enable instrumentation on labs (T372823)]], [[gerrit:1098509|Enable message group subscription feature for some wikis (T372386)]], [[gerrit:1098622|Use `useformat` query param for device detection or mobile domain (m.) (T380646 T375788)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:28:49] <stashbot>	 T380646: Centralize SUL2 and SUL3 device detection - https://phabricator.wikimedia.org/T380646
[14:28:50] <stashbot>	 T375788: Implement SUL3 central autologin - https://phabricator.wikimedia.org/T375788
[14:28:50] <stashbot>	 T372823: Instrumentation for Incident Reporting System - https://phabricator.wikimedia.org/T372823
[14:28:50] <stashbot>	 T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386
[14:28:53] <urbanecm>	 tgr|away: apergos: please test at mwdebug
[14:28:54] <urbanecm>	 eh
[14:28:58] <urbanecm>	 *abijeet_ ^^
[14:29:00] <wikibugs>	 (03PS1) 10Muehlenhoff: cloudcontrol/codfw1dev:: Enable profile::auto_restarts::service for apache2 [puppet] - 10https://gerrit.wikimedia.org/r/1098962 (https://phabricator.wikimedia.org/T135991)
[14:29:12] <abijeet_>	 urbanecm, ok
[14:30:02] <apergos>	 (not much to test, it's a service, which is only exercised by a maintenance script. however, the wikis still work :-P )
[14:30:17] <tgr|away>	 urbanecm: on it, will take a bit
[14:30:51] <urbanecm>	 ack
[14:30:56] <apergos>	 (but I'm not the one with  the currently scapped change, only tgr) 
[14:31:46] <abijeet_>	 urbanecm, looks good
[14:32:03] <urbanecm>	 ty abijeet_ 
[14:33:08] <moritzm>	 !log installing node-es-module-lexer updates from Bookworm point release
[14:33:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:26] <wikibugs>	 (03PS1) 10ArielGlenn: extend account creation backfill script to forced account creations by others [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098965 (https://phabricator.wikimedia.org/T378401)
[14:36:25] <urbanecm>	 !log [urbanecm@deploy2002 ~]$ mwscript-k8s -f extensions/Flow/maintenance/FlowMoveBoardsToSubpages.php -- --wiki=bswiki # T378827
[14:36:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:30] <stashbot>	 T378827: Run Flow migration script at *Phase 1* wikis - https://phabricator.wikimedia.org/T378827
[14:36:47] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[14:39:13] <wikibugs>	 (03PS1) 10Brouberol: airflow: fix typo in the REQUESTS_CA_BUNDLE env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098966
[14:39:42] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] airflow: fix typo in the REQUESTS_CA_BUNDLE env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098966 (owner: 10Brouberol)
[14:39:54] <urbanecm>	 !log [urbanecm@deploy2002 ~]$ while read wiki; do echo "== $wiki"; mwscript-k8s extensions/Flow/maintenance/FlowMoveBoardsToSubpages.php -- --wiki=$wiki; done < wikis.txt # wikis.txt is at P71349 # T378827
[14:39:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
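
The loop logged above runs one maintenance-script invocation per wiki listed in wikis.txt. A minimal Python sketch of the same iteration follows; it only builds and prints the command lines and does not execute anything. The helper name `build_commands` and the sample wiki list are assumptions for illustration, while the script path and the `--wiki` argument are copied from the logged command.

```python
import shlex

# Script path and argument layout copied from the logged command.
SCRIPT = "extensions/Flow/maintenance/FlowMoveBoardsToSubpages.php"


def build_commands(wikis):
    """Return one argv list per wiki, skipping blank lines (hypothetical helper)."""
    commands = []
    for wiki in wikis:
        wiki = wiki.strip()
        if not wiki:
            continue
        commands.append(["mwscript-k8s", SCRIPT, "--", f"--wiki={wiki}"])
    return commands


if __name__ == "__main__":
    # Sample input only; the real list is the wikis.txt referenced in the log.
    for argv in build_commands(["bswiki", ""]):
        print("==", argv[-1].split("=", 1)[1])
        print(shlex.join(argv))
```

Building the argv lists first makes it easy to review the full set of invocations before anything is run.
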
[14:40:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2028 (T376905)', diff saved to https://phabricator.wikimedia.org/P71350 and previous config saved to /var/cache/conftool/dbconfig/20241128-144012-ladsgroup.json
[14:40:18] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2032.codfw.wmnet with reason: Maintenance
[14:40:30] <tgr|away>	 are autocreations disallowed on wikitech?
[14:40:32] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2032.codfw.wmnet with reason: Maintenance
[14:40:39] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling es2032 (T376905)', diff saved to https://phabricator.wikimedia.org/P71351 and previous config saved to /var/cache/conftool/dbconfig/20241128-144039-ladsgroup.json
[14:41:16] <tgr|away>	 I get "
[14:41:18] <tgr|away>	 Auto-creation of a local account failed: You are not allowed to execute the action you have requested."
[14:41:38] <urbanecm>	 that doesn't feel right
[14:42:21] <tgr|away>	 does anyone have a unified account on wikitech and is willing to do a quick test?
[14:42:50] <urbanecm>	 tgr|away: i get the same error
[14:42:53] <urbanecm>	 i have two unified accs
[14:43:07] <urbanecm>	 and https://wikitech.wikimedia.org/wiki/Special:ListGroupRights says `createaccount` is not assigned to anyone
[14:43:45] <urbanecm>	 ...looking at commits, i see `labswiki: Disallow account autocreation` 
[14:43:48] <urbanecm>	 authored by MYSELF
[14:44:04] <tgr|away>	 ha
[14:44:13] <tgr|away>	 maybe you have an evil clone
[14:44:41] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: fix typo in the REQUESTS_CA_BUNDLE env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098966 (owner: 10Brouberol)
[14:44:44] <urbanecm>	 anyway, what's the test?
[14:44:46] <tgr|away>	 urbanecm: can you do a login on the mobile interface of wikitech, and check if you got automatically logged in to, say, en.m.wikipedia.org as well?
[14:44:54] <urbanecm>	 with mwdebug i presume
[14:45:08] <tgr|away>	 if you use firefox you'll have to disable extended tracking protection first
[14:45:11] <tgr|away>	 yeah
[14:45:15] <urbanecm>	 chrome
[14:45:21] <tgr|away>	 then it's fine
[14:45:55] <tgr|away>	 ...actually, that's not what should be tested, sorry
[14:46:13] <urbanecm>	 tgr|away: it does not work
[14:46:42] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2032 (T376905)', diff saved to https://phabricator.wikimedia.org/P71352 and previous config saved to /var/cache/conftool/dbconfig/20241128-144641-ladsgroup.json
[14:46:46] <wikibugs>	 (03PS1) 10Btullis: Add a keystore password for analytics-hadoop-labs [labs/private] - 10https://gerrit.wikimedia.org/r/1098967 (https://phabricator.wikimedia.org/T381087)
[14:46:47] <urbanecm>	 tgr|away: but also note `wgCentralAuthCookies = false` for labswiki
[14:46:55] <tgr|away>	 ooh
[14:46:59] <tgr|away>	 never mind then
[14:47:08] <tgr|away>	 thanks for checking that
[14:47:15] <tgr|away>	 the patch is good to go then
[14:47:19] <urbanecm>	 okay, proceeding
[14:47:23] <wikibugs>	 (03CR) 10Btullis: [V:03+2 C:03+2] Add a keystore password for analytics-hadoop-labs [labs/private] - 10https://gerrit.wikimedia.org/r/1098967 (https://phabricator.wikimedia.org/T381087) (owner: 10Btullis)
[14:47:23] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm, tgr, abi, mszabo: Continuing with sync
[14:51:24] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[14:51:37] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[14:51:39] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[14:51:55] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[14:51:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] turnilo: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1094420 (owner: 10Muehlenhoff)
[14:52:04] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[14:52:18] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[14:52:26] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[14:52:40] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[14:52:49] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[14:53:02] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[14:53:10] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[14:53:23] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[14:53:32] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[14:53:45] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[14:54:01] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1231.eqiad.wmnet with reason: Maintenance
[14:54:03] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098623|Use `useformat` query param for device detection or mobile domain (m.) (T380646 T375788)]], [[gerrit:1098913|ReportIncident: Enable instrumentation on labs (T372823)]], [[gerrit:1098509|Enable message group subscription feature for some wikis (T372386)]], [[gerrit:1098622|Use `useformat` query param for device detection or mobile domain (m.) (T380646 T375788)]] (duration: 30m 33s)
[14:54:09] <urbanecm>	 should be live
[14:54:11] <stashbot>	 T380646: Centralize SUL2 and SUL3 device detection - https://phabricator.wikimedia.org/T380646
[14:54:12] <stashbot>	 T375788: Implement SUL3 central autologin - https://phabricator.wikimedia.org/T375788
[14:54:12] <stashbot>	 T372823: Instrumentation for Incident Reporting System - https://phabricator.wikimedia.org/T372823
[14:54:12] <stashbot>	 T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386
[14:54:13] <urbanecm>	 anything else?
[14:54:14] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1231.eqiad.wmnet with reason: Maintenance
[14:54:22] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[14:54:36] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[14:54:48] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2151.codfw.wmnet with reason: Maintenance
[14:54:57] <volans>	 Amir1: the downtime cookbook accepts cumin queries to match multiple hosts at once if that helps ;) 
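
Some of the downtime log lines above already name several hosts in one expression, for example `clouddb[1015,1019].eqiad.wmnet`. A minimal Python sketch of how such a bracketed expression stands for multiple hostnames follows. It is a hand-rolled approximation and not Cumin's own parser; the helper name `expand_hosts` is hypothetical, and the handling of `a-b` ranges inside the brackets is an assumption that does not appear in this log.

```python
import re

# Matches a single bracket group such as "clouddb[1015,1019].eqiad.wmnet".
_GROUP = re.compile(
    r"^(?P<prefix>[^\[\]]*)\[(?P<body>[^\[\]]+)\](?P<suffix>[^\[\]]*)$"
)


def expand_hosts(expr):
    """Expand one bracketed host expression into a list of hostnames.

    Handles comma lists and simple numeric ranges ("a-b") inside a single
    bracket group only (hypothetical helper, not the real query grammar).
    """
    match = _GROUP.match(expr)
    if match is None:
        return [expr]  # plain hostname, nothing to expand
    prefix, suffix = match["prefix"], match["suffix"]
    hosts = []
    for part in match["body"].split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            width = len(start)  # keep zero padding of the range start
            for number in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{number:0{width}d}{suffix}")
        else:
            hosts.append(f"{prefix}{part}{suffix}")
    return hosts


if __name__ == "__main__":
    print(expand_hosts("clouddb[1015,1019].eqiad.wmnet"))
```
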
[14:55:02] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2151.codfw.wmnet with reason: Maintenance
[14:55:06] <tgr|away>	 thanks!
[14:55:20] <apergos>	 if you or tgr could help me untangle the merge-depends-on-backport issue, that would be lovely; should I abandon the one backport and wait for the merge on the other patch (or will it need to be kicked again) or...?
[14:55:20] <Amir1>	 volans: it's a bit complicated
[14:55:21] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2158.codfw.wmnet with reason: Maintenance
[14:55:35] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2158.codfw.wmnet with reason: Maintenance
[14:55:36] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2187.codfw.wmnet with reason: Maintenance
[14:55:43] <Amir1>	 it's not parallel, it's serial but just fast (the table is small)
[14:55:50] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2187.codfw.wmnet with reason: Maintenance
[14:55:51] <urbanecm>	 apergos: you mean the https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1090920 one?
[14:56:08] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2169.codfw.wmnet with reason: Maintenance
[14:56:10] <volans>	 ah ok
[14:56:13] <apergos>	 that's the one I would like to merge and won't right now. yep
[14:56:19] <Amir1>	 e.g. if I run the same script on s3, it's gonna take an hour between each
[14:56:22] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2169.codfw.wmnet with reason: Maintenance
[14:56:25] <volans>	 looked too fast for usual DB maintenance :D
[14:56:28] <tgr|away>	 probably just get rid of the Depends-On line
[14:56:31] <urbanecm>	 apergos: on that change, i'd just remove depends-on
[14:56:39] <Amir1>	 yeah, the table is tiny
[14:56:41] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2180.codfw.wmnet with reason: Maintenance
[14:56:42] <urbanecm>	 it tries to ensure it is merged in all branches it exists in
[14:56:43] <tgr|away>	 I don't think it's recoverable otherwise
[14:56:50] <apergos>	 good point, since it's in master anyways
[14:56:54] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2180.codfw.wmnet with reason: Maintenance
[14:57:00] <urbanecm>	 which you actually don't want
[14:57:05] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2193.codfw.wmnet with reason: Maintenance
[14:57:19] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2193.codfw.wmnet with reason: Maintenance
[14:57:21] <urbanecm>	 (abandoning backport, re-+2ing and restoring would probably also work, but there's little point in doing that)
[14:57:29] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2197.codfw.wmnet with reason: Maintenance
[14:57:29] <tgr|away>	 (see https://www.mediawiki.org/wiki/Gerrit/Cross-repo_dependencies#Possible_problems )
[14:57:43] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2197.codfw.wmnet with reason: Maintenance
[14:57:58] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2217.codfw.wmnet with reason: Maintenance
[14:58:03] <tgr|away>	 You can just link the dependency in freetext.
[14:58:07] <apergos>	 (I shall)
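
The fix suggested above is to drop the Depends-On line from the commit message and mention the dependency as free text instead. A minimal Python sketch of that rewrite follows, assuming the dependency is declared on a line that starts with `Depends-On:`. The helper name `drop_depends_on`, the "Related change:" wording, and the sample change identifiers are hypothetical.

```python
def drop_depends_on(message):
    """Return (cleaned message, removed dependency values).

    Lines starting with "Depends-On:" are removed. If any were removed and
    the message has a trailing footer paragraph, a free-text note naming
    them is inserted as its own paragraph before that footer.
    """
    kept, removed = [], []
    for line in message.splitlines():
        if line.lower().startswith("depends-on:"):
            removed.append(line.split(":", 1)[1].strip())
        else:
            kept.append(line)
    cleaned = "\n".join(kept)

    body, sep, footer = cleaned.rpartition("\n\n")
    if removed and sep:
        # Plain prose mention: not a Depends-On footer, so nothing gates on it.
        note = "Related change: " + ", ".join(removed)
        cleaned = body + "\n\n" + note + "\n\n" + footer
    return cleaned, removed


if __name__ == "__main__":
    # Hypothetical commit message with placeholder identifiers.
    sample = "Fix thing\n\nDetails.\n\nBug: T1\nDepends-On: Iabc\nChange-Id: Idef"
    cleaned, removed = drop_depends_on(sample)
    print(cleaned)
    print("removed:", removed)
```

Keeping the note in its own paragraph above the footer leaves the remaining footer lines together at the end of the message.
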
[14:58:12] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2217.codfw.wmnet with reason: Maintenance
[14:58:29] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2224.codfw.wmnet with reason: Maintenance
[14:58:42] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2224.codfw.wmnet with reason: Maintenance
[14:58:44] <Amir1>	 the schema change is idempotent, I probably can even run it on master with replication but I'm just nervous about it :D
[14:58:59] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2229.codfw.wmnet with reason: Maintenance
[14:59:13] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2229.codfw.wmnet with reason: Maintenance
[14:59:36] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[14:59:39] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[14:59:45] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10365694 (10MoritzMuehlenhoff)
[14:59:47] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2214.codfw.wmnet with reason: Maintenance
[15:00:01] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2214.codfw.wmnet with reason: Maintenance
[15:00:36] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[15:00:50] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[15:01:03] <wikibugs>	 10SRE-Access-Requests, 06Machine-Learning-Team, 10Recommendation-API: Access to deploy recommendation API ML service for Stephane - https://phabricator.wikimedia.org/T381108 (10Nikerabbit) 03NEW
[15:01:49] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2032', diff saved to https://phabricator.wikimedia.org/P71369 and previous config saved to /var/cache/conftool/dbconfig/20241128-150148-ladsgroup.json
[15:02:01] <wikibugs>	 (03Abandoned) 10ArielGlenn: extend account creation backfill script to forced account creations by others [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098965 (https://phabricator.wikimedia.org/T378401) (owner: 10ArielGlenn)
[15:02:35] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[15:02:49] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[15:04:32] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[15:04:46] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[15:06:29] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[15:06:43] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[15:07:10] <wikibugs>	 10SRE-swift-storage: internal_api_error_UploadChunkFileException - https://phabricator.wikimedia.org/T381093#10365747 (10Yann) Now with another file, I got  `03494: FAILED: stashfailed: Could not acquire lock. Somebody else is doing something to this file.`
[15:08:25] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[15:08:39] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[15:08:51] <Amir1>	 only two minutes on all of s3, that's cool
[15:09:26] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:10:24] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[15:10:38] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[15:10:40] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[15:10:56] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[15:12:01] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for JLy-WMF - https://phabricator.wikimedia.org/T380523#10365763 (10Aklapper) 05Resolved→03Open Reopening per second bullet point on https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMF_Group
[15:12:59] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[15:13:13] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[15:13:20] <wikibugs>	 10SRE-swift-storage, 06Commons, 10UploadWizard: internal_api_error_UploadChunkFileException - https://phabricator.wikimedia.org/T381093#10365766 (10MatthewVernon) That's usually a sign that something has gone wrong above the swift level, I'm afraid (and previously when I've had it reported it has self-resolv...
[15:13:25] <Amir1>	 sigh that broke replication to wikireplicas
[15:15:15] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[15:15:28] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[15:15:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1018.eqiad.wmnet
[15:16:38] <logmsgbot>	 !log gmodena@deploy2002 Started deploy [analytics/refinery@ac87303]: Gobblin config changes [analytics/refinery@ac873037]
[15:16:56] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2032', diff saved to https://phabricator.wikimedia.org/P71370 and previous config saved to /var/cache/conftool/dbconfig/20241128-151655-ladsgroup.json
[15:18:30] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[15:18:44] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[15:19:15] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync
[15:19:44] <logmsgbot>	 !log gmodena@deploy2002 Finished deploy [analytics/refinery@ac87303]: Gobblin config changes [analytics/refinery@ac873037] (duration: 03m 05s)
[15:20:25] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2139.codfw.wmnet with reason: Maintenance
[15:20:32] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync
[15:20:39] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2139.codfw.wmnet with reason: Maintenance
[15:20:42] <apergos>	 tgr|away: still no gate and submit jobs, with the dependency line removed, do you need to remove your vote and redo again? (sorry)
[15:21:38] <moritzm>	 !log removing ganeti1018 from active Ganeti nodes T378921
[15:21:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:42] <stashbot>	 T378921: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921
[15:22:06] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:23:32] <apergos>	 thanks yet again...
[15:24:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch idp-test to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1094426 (owner: 10Muehlenhoff)
[15:24:17] <icinga-wm>	 PROBLEM - ganeti-confd running on ganeti1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[15:24:47] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[15:25:00] <logmsgbot>	 !log gmodena@deploy2002 Started deploy [analytics/refinery@ac87303] (thin): Gobblin config changes THIN [analytics/refinery@ac873037]
[15:25:30] <logmsgbot>	 !log gmodena@deploy2002 Finished deploy [analytics/refinery@ac87303] (thin): Gobblin config changes THIN [analytics/refinery@ac873037] (duration: 00m 30s)
[15:26:22] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10365819 (10LSobanski) The potential fix was merged in https://gitlab.com/mailman/mailman/-/issues/1151 and is included in Mailman version 3.3.10.
[15:26:39] <logmsgbot>	 !log gmodena@deploy2002 Started deploy [analytics/refinery@ac87303] (hadoop-test): Gobblin config changes [analytics/refinery@ac873037]
[15:27:05] <logmsgbot>	 !log gmodena@deploy2002 Finished deploy [analytics/refinery@ac87303] (hadoop-test): Gobblin config changes [analytics/refinery@ac873037] (duration: 00m 26s)
[15:27:06] <jinxer-wm>	 FIRING: [13x] ProbeDown: Service ganeti1018:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:29:55] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2149.codfw.wmnet with reason: Maintenance
[15:30:09] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2149.codfw.wmnet with reason: Maintenance
[15:32:03] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2032 (T376905)', diff saved to https://phabricator.wikimedia.org/P71371 and previous config saved to /var/cache/conftool/dbconfig/20241128-153202-ladsgroup.json
[15:32:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2005.wikimedia.org
[15:33:13] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for JLy-WMF - https://phabricator.wikimedia.org/T380523#10365848 (10elukey) 05Open→03Resolved TIL, already done thanks!
[15:36:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2005.wikimedia.org
[15:37:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2004.wikimedia.org
[15:39:28] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2156.codfw.wmnet with reason: Maintenance
[15:39:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[15:39:42] <wikibugs>	 (03Restored) 10ArielGlenn: extend account creation backfill script to forced account creations by others [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098965 (https://phabricator.wikimedia.org/T378401) (owner: 10ArielGlenn)
[15:39:42] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2156.codfw.wmnet with reason: Maintenance
[15:39:42] <wikibugs>	 (03PS1) 10Muehlenhoff: ganeti1018: Update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1098981
[15:39:44] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2186.codfw.wmnet with reason: Maintenance
[15:39:46] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2186.codfw.wmnet with reason: Maintenance
[15:40:07] <wikibugs>	 (03PS2) 10ArielGlenn: extend account creation backfill script to forced account creations by others [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098965 (https://phabricator.wikimedia.org/T378401)
[15:44:22] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix keytab locations [labs/private] - 10https://gerrit.wikimedia.org/r/1098982
[15:46:05] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] hiera: set do_ipv6_primary_ra for all LVS in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1093957 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh)
[15:46:25] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] hiera: set do_ipv6_primary_ra for all LVS in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1093957 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh)
[15:46:36] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host idp-test2004.wikimedia.org
[15:48:40] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2177.codfw.wmnet with reason: Maintenance
[15:48:54] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2177.codfw.wmnet with reason: Maintenance
[15:50:08] <wikibugs>	 (03PS1) 10Muehlenhoff: Add component/maps [puppet] - 10https://gerrit.wikimedia.org/r/1098984
[15:50:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] extend account creation backfill script to forced account creations by others [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098965 (https://phabricator.wikimedia.org/T378401) (owner: 10ArielGlenn)
[15:55:08] <wikibugs>	 (03PS10) 10Clément Goubert: mediawiki: Add mwcron feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555)
[15:56:12] <wikibugs>	 (03PS1) 10Gmodena: dse-k8s-services: mw-dump: version bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098985 (https://phabricator.wikimedia.org/T381040)
[15:56:35] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "Thanks!" [labs/private] - 10https://gerrit.wikimedia.org/r/1098982 (owner: 10Muehlenhoff)
[15:57:28] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] Fix keytab locations [labs/private] - 10https://gerrit.wikimedia.org/r/1098982 (owner: 10Muehlenhoff)
[15:58:12] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2190.codfw.wmnet with reason: Maintenance
[15:58:26] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2190.codfw.wmnet with reason: Maintenance
[15:58:40] <wikibugs>	 (03CR) 10Muehlenhoff: tftpboot: squash puppetserver log warning. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1073531 (https://phabricator.wikimedia.org/T374885) (owner: 10JHathaway)
[16:00:05] <jouncebot>	 hashar and andre: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1600)
[16:00:18] <hashar>	 oh true
[16:00:19] <hashar>	 well
[16:01:36] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply
[16:01:42] <logmsgbot>	 !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply
[16:04:29] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Fix keytab locations [labs/private] - 10https://gerrit.wikimedia.org/r/1098982 (owner: 10Muehlenhoff)
[16:04:38] <wikibugs>	 (03PS1) 10Muehlenhoff: Update cloudcephmon secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1098988 (https://phabricator.wikimedia.org/T364870)
[16:07:48] <wikibugs>	 (03PS1) 10Muehlenhoff: Add missing entry for recent LDAP addition [puppet] - 10https://gerrit.wikimedia.org/r/1098989 (https://phabricator.wikimedia.org/T380091)
[16:07:56] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2194.codfw.wmnet with reason: Maintenance
[16:07:59] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2194.codfw.wmnet with reason: Maintenance
[16:08:04] <wikibugs>	 (03PS11) 10Stevemunene: Enable pod-scoped "external services" network policies for airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094422 (https://phabricator.wikimedia.org/T377926)
[16:08:18] <wikibugs>	 (03CR) 10Btullis: [C:03+1] dse-k8s-services: mw-dump: version bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098985 (https://phabricator.wikimedia.org/T381040) (owner: 10Gmodena)
[16:09:31] <wikibugs>	 (03CR) 10Gmodena: [C:03+2] dse-k8s-services: mw-dump: version bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098985 (https://phabricator.wikimedia.org/T381040) (owner: 10Gmodena)
[16:11:21] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1098989 (https://phabricator.wikimedia.org/T380091) (owner: 10Muehlenhoff)
[16:11:46] <wikibugs>	 (03Merged) 10jenkins-bot: dse-k8s-services: mw-dump: version bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098985 (https://phabricator.wikimedia.org/T381040) (owner: 10Gmodena)
[16:13:51] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Localisation updates (November 26) [extensions/VisualEditor] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098990 (https://phabricator.wikimedia.org/T372175)
[16:14:06] <wikibugs>	 (03CR) 10Brouberol: "Looks good, with a tiny nit!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094422 (https://phabricator.wikimedia.org/T377926) (owner: 10Stevemunene)
[16:14:44] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/VisualEditor] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098990 (https://phabricator.wikimedia.org/T372175) (owner: 10Bartosz Dziewoński)
[16:16:04] <wikibugs>	 (03CR) 10Elukey: [C:03+2] "ah snap sorry! Thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/1098989 (https://phabricator.wikimedia.org/T380091) (owner: 10Muehlenhoff)
[16:17:29] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2205.codfw.wmnet with reason: Maintenance
[16:17:43] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2205.codfw.wmnet with reason: Maintenance
[16:19:28] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2081.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:19:44] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2081.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:19:53] <wikibugs>	 (03PS12) 10Stevemunene: Enable pod-scoped "external services" network policies for airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094422 (https://phabricator.wikimedia.org/T377926)
[16:19:57] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2082.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:20:00] <wikibugs>	 (03CR) 10Stevemunene: Enable pod-scoped "external services" network policies for airflow (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094422 (https://phabricator.wikimedia.org/T377926) (owner: 10Stevemunene)
[16:20:12] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2082.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:20:44] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2083.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:21:00] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2083.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:21:20] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2084.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:21:45] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2084.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:22:04] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2085.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:22:06] <jinxer-wm>	 FIRING: [13x] ProbeDown: Service ganeti1018:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:22:19] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2085.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:22:39] <logmsgbot>	 !log gmodena@deploy2002 Started deploy [airflow-dags/analytics@d7c0f58]: webrequest_frontend post deployment fixes
[16:22:39] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2086.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:23:00] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2086.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:23:38] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2087.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:23:55] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2087.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:24:09] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2088.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:24:24] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2088.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:24:35] <logmsgbot>	 !log gmodena@deploy2002 Finished deploy [airflow-dags/analytics@d7c0f58]: webrequest_frontend post deployment fixes (duration: 02m 22s)
[16:24:59] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10366013 (10elukey) Re-ran provision on all those, we are good, no changes registered. Now it is the turn of reimages, I'll kick off some.
[16:27:14] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2227.codfw.wmnet with reason: Maintenance
[16:27:28] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2227.codfw.wmnet with reason: Maintenance
[16:28:21] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2081.codfw.wmnet with OS bullseye
[16:37:09] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[16:37:23] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[16:38:23] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] "go go go!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098499 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan)
[16:39:07] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2209.codfw.wmnet with reason: Maintenance
[16:39:21] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2209.codfw.wmnet with reason: Maintenance
[16:41:29] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2081.codfw.wmnet with reason: host reimage
[16:41:39] <wikibugs>	 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10366043 (10MatthewVernon) We had a problem with codfw swift this morning, with the sort of load pattern that I'd normally expect to "just" result in swift filling a network connectio...
[16:42:19] <wikibugs>	 (03CR) 10Vgutierrez: Add ferm macro/nftables set for loadbalancer nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1098936 (owner: 10Muehlenhoff)
[16:43:02] <wikibugs>	 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10366047 (10elukey) Tried megactl (packaged by Moritz) on ms-be2082, this is the result:  ` elukey@ms-be2082:~$ su...
[16:44:37] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2081.codfw.wmnet with reason: host reimage
[16:46:06] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Add component/maps [puppet] - 10https://gerrit.wikimedia.org/r/1098984 (owner: 10Muehlenhoff)
[16:47:40] <wikibugs>	 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10366052 (10Ladsgroup) I'm not saying it's impossible but it's unlikely. The number of scripts per host is quite small (6-7) and they are mostly I/O bound waiting for the backends to...
[16:47:42] <sukhe>	 win 14
[16:48:41] <wikibugs>	 (03PS3) 10Hashar: extend account creation backfill script to forced account creations by others [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098965 (https://phabricator.wikimedia.org/T378401) (owner: 10ArielGlenn)
[16:49:33] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] ReportIncident: Setup $wgReportIncidentLocalLinks for ptwiki pilot deploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098929 (https://phabricator.wikimedia.org/T380277) (owner: 10Máté Szabó)
[16:51:14] <Emperor>	 !log depool/restart swift/repool ms-fe2009
[16:51:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:31] <Emperor>	 !log depool/restart swift/repool ms-fe2014
[16:51:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:22] <wikibugs>	 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10366070 (10MatthewVernon) No, all frontends had problems, the entire cluster was very sad cf [[ https://grafana.wikimedia.org/goto/Lapva97NR?orgId=1 | envoy on grafana ]], which is...
[16:55:05] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10366079 (10elukey) ms-be2081 done reimaged!
[16:55:09] <wikibugs>	 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10366080 (10MatthewVernon) Megactl is correct that the battery is missing, but obviously on nodes where we expect...
[16:57:59] <wikibugs>	 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10366085 (10elukey) >>! In T377853#10366080, @MatthewVernon wrote: > Megactl is correct that the battery is missin...
[16:59:26] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:00:04] <jouncebot>	 jhathaway and rzl: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1700).
[17:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:57] <hashar>	 I have closed the train blocker task and I have claimed 1.44.0-wmf.5 to be a successful rollout
[17:06:08] <wikibugs>	 (03PS3) 10Ssingh: trafficserver: remove inbound TLS and related settings [puppet] - 10https://gerrit.wikimedia.org/r/1091748
[17:06:18] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2081.codfw.wmnet with OS bullseye
[17:07:45] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4605/co" [puppet] - 10https://gerrit.wikimedia.org/r/1091748 (owner: 10Ssingh)
[17:08:24] <wikibugs>	 (03PS4) 10Ssingh: trafficserver: remove inbound TLS and related settings [puppet] - 10https://gerrit.wikimedia.org/r/1091748
[17:09:54] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4606/co" [puppet] - 10https://gerrit.wikimedia.org/r/1091748 (owner: 10Ssingh)
[17:13:07] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] trafficserver: remove inbound TLS and related settings [puppet] - 10https://gerrit.wikimedia.org/r/1091748 (owner: 10Ssingh)
[17:15:32] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CheckUser] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098956 (https://phabricator.wikimedia.org/T378401) (owner: 10ArielGlenn)
[17:16:10] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098965 (https://phabricator.wikimedia.org/T378401) (owner: 10ArielGlenn)
[17:17:11] <wikibugs>	 (03CR) 10Vgutierrez: benthos: add benthos for haproxy debug functions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur)
[17:45:14] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1276-1277].eqiad.wmnet} and (A:wikikube-staging-master-codfw or A:wikikube-staging-worker-codfw or A:wikikube-staging-master-eqiad or A:wikikube-staging-worker-eqiad or A:wikikube-master-codfw or A:wikikube-worker-codfw or A:wikikube-master-eqiad or A:wikikube-worker-eqiad or A:ml-serve-master-eqiad or A:ml-serve-worker-eqiad or A:ml-serve-master-codfw or A:ml-serve-worker-codfw or A:ml-staging-master or A:ml-staging-worker or A:dse-k8s-master or A:dse-k8s-worker or A:aux-master or A:aux-worker)
[17:47:11] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1276.eqiad.wmnet with OS bookworm
[17:50:56] <wikibugs>	 (03CR) 10Fabfur: benthos: add benthos for haproxy debug functions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur)
[17:51:02] <icinga-wm>	 PROBLEM - BGP status on lsw1-e5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:52:15] <wikibugs>	 (03PS15) 10Fabfur: benthos: add benthos for haproxy debug functions [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332)
[17:52:30] <wikibugs>	 (03PS13) 10Fabfur: hiera: add log ring to cp4039 [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332)
[17:57:32] <wikibugs>	 (03PS1) 10Joal: Move hourly gobblin event start-time later [puppet] - 10https://gerrit.wikimedia.org/r/1099010 (https://phabricator.wikimedia.org/T376144)
[17:57:45] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "Thanks!" [labs/private] - 10https://gerrit.wikimedia.org/r/1098988 (https://phabricator.wikimedia.org/T364870) (owner: 10Muehlenhoff)
[18:00:05] <jouncebot>	 bd808: Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1800). Please do the needful.
[18:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T1800)
[18:00:56] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Packet loss reflected in NELs for traffic to Reliance Jio Infocomm Ltd over BBIX Singapore - https://phabricator.wikimedia.org/T373015#10366155 (10cmooney) 05Open→03Resolved
[18:00:58] <bd808>	 not today jouncebot. I'm "on holiday"
[18:06:37] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1276.eqiad.wmnet with reason: host reimage
[18:09:56] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1276.eqiad.wmnet with reason: host reimage
[18:14:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T380182#10366167 (10phaultfinder)
[18:28:03] <icinga-wm>	 RECOVERY - BGP status on lsw1-e5-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:28:49] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1276.eqiad.wmnet with OS bookworm
[18:33:39] <abijeet>	 thanks urbanecm for deploying the configuration change. :-)
[18:36:47] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[18:41:51] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): "Looks good. 👍 For reference: This is basically a revert of Idce1027." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098045 (https://phabricator.wikimedia.org/T377809) (owner: 10Joely Rooke WMDE)
[19:05:25] <wikibugs>	 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10366216 (10MoritzMuehlenhoff) It differentiates states already, ms-be2082 has "module missing, pack missing, char...
[19:08:31] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1277.eqiad.wmnet with OS bookworm
[19:09:26] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:12:05] <icinga-wm>	 PROBLEM - BGP status on lsw1-e5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:18:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10366224 (10cmooney)
[19:23:59] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10366225 (10ssingh) > Which will hopefully verify everything is consistent. In terms of the wider work to integrate with Netbox and get data onto our authdns hosts I will need to wor...
[19:26:05] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10366232 (10cmooney)
[19:27:58] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1277.eqiad.wmnet with reason: host reimage
[19:28:07] <wikibugs>	 (03Abandoned) 10Ssingh: dnsbox: set anycast-hc num_backups to one [puppet] - 10https://gerrit.wikimedia.org/r/1051381 (owner: 10Ssingh)
[19:28:43] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098929 (https://phabricator.wikimedia.org/T380277) (owner: 10Máté Szabó)
[19:29:55] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10366237 (10cmooney)
[19:31:58] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1277.eqiad.wmnet with reason: host reimage
[19:39:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[19:39:54] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:40:18] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:43:08] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:43:44] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:50:08] <icinga-wm>	 RECOVERY - BGP status on lsw1-e5-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:50:32] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1277.eqiad.wmnet with OS bookworm
[19:50:34] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1276-1277].eqiad.wmnet} and (A:wikikube-staging-master-codfw or A:wikikube-staging-worker-codfw or A:wikikube-staging-master-eqiad or A:wikikube-staging-worker-eqiad or A:wikikube-master-codfw or A:wikikube-worker-codfw or A:wikikube-master-eqiad or A:wikikube-worker-eqiad or A:ml-serve-master-eqiad or A:ml-serve-worker-eqiad or A:ml-serve-master-codfw or A:ml-serve-worker-codfw or A:ml-staging-master or A:ml-staging-worker or A:dse-k8s-master or A:dse-k8s-worker or A:aux-master or A:aux-worker)
[19:51:54] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:52:22] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:00:38] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:05:52] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 8.401 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:06:18] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:06:28] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 08 Feb 2025 11:19:52 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:09:51] <wikibugs>	 (03PS1) 10Arlolra: Add myself to releasers-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1099034
[20:13:33] <kostajh>	 jouncebot: nowandnext
[20:13:33] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 46 minute(s)
[20:13:33] <jouncebot>	 In 0 hour(s) and 46 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T2100)
[20:13:52] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "Nicely done!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1094422 (https://phabricator.wikimedia.org/T377926) (owner: 10Stevemunene)
[20:15:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098929 (https://phabricator.wikimedia.org/T380277) (owner: 10Máté Szabó)
[20:16:38] <wikibugs>	 (03Merged) 10jenkins-bot: ReportIncident: Setup $wgReportIncidentLocalLinks for ptwiki pilot deploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098929 (https://phabricator.wikimedia.org/T380277) (owner: 10Máté Szabó)
[20:16:55] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1098929|ReportIncident: Setup $wgReportIncidentLocalLinks for ptwiki pilot deploy (T380277)]]
[20:17:00] <stashbot>	 T380277: Prepare local links configuration for IRS pilot wiki - https://phabricator.wikimedia.org/T380277
[20:22:06] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:23:02] <logmsgbot>	 !log kharlan@deploy2002 kharlan, mszabo: Backport for [[gerrit:1098929|ReportIncident: Setup $wgReportIncidentLocalLinks for ptwiki pilot deploy (T380277)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:23:07] <stashbot>	 T380277: Prepare local links configuration for IRS pilot wiki - https://phabricator.wikimedia.org/T380277
[20:23:17] <logmsgbot>	 !log kharlan@deploy2002 kharlan, mszabo: Continuing with sync
[20:30:04] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098929|ReportIncident: Setup $wgReportIncidentLocalLinks for ptwiki pilot deploy (T380277)]] (duration: 13m 08s)
[20:30:11] <stashbot>	 T380277: Prepare local links configuration for IRS pilot wiki - https://phabricator.wikimedia.org/T380277
[20:36:46] <wikibugs>	 (03CR) 10MSantos: [C:03+1] Add myself to releasers-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1099034 (owner: 10Arlolra)
[20:36:50] <kostajh>	 I'm done with my backport
[20:38:32] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:39:54] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:41:24] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:41:44] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 0.122 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:47:32] <icinga-wm>	 PROBLEM - Disk space on serpens is CRITICAL: DISK CRITICAL - free space: / 448 MB (2% inode=92%): /tmp 448 MB (2% inode=92%): /var/tmp 448 MB (2% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=serpens&var-datasource=codfw+prometheus/ops
[20:54:47] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to releasers-mediawiki group for ABreault (WMF) - https://phabricator.wikimedia.org/T381123 (10ABreault-WMF) 03NEW
[20:55:12] <wikibugs>	 (03PS2) 10Arlolra: Add myself to releasers-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1099034 (https://phabricator.wikimedia.org/T381123)
[20:58:34] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:59:26] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:59:26] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241128T2100).
[21:00:05] <jouncebot>	 danisztls, MatmaRex, apergos, and kostajh: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:17] <danisztls>	 o/
[21:00:20] <MatmaRex>	 hi
[21:00:20] <apergos>	 here, believe it or not :-P
[21:01:55] <jinxer-wm>	 FIRING: MaxConntrack: Max conntrack at 99.26% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[21:04:52] <apergos>	 I wonder who's running the window 
[21:05:36] <tgr|away>	 I can do it if there are no takers
[21:06:03] <apergos>	 ah you're here anyways
[21:06:29] <tgr|away>	 just got here
[21:06:55] <jinxer-wm>	 RESOLVED: MaxConntrack: Max conntrack at 99.26% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[21:07:29] <apergos>	 I'd say if none of the named suspects shows in the next couple minutes, go ahead and run it
[21:07:45] <tgr|away>	 danisztls: can the two config patches be deployed together?
[21:08:23] <danisztls>	 tgr|away: yes
[21:09:37] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098617 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza)
[21:09:37] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098627 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza)
[21:09:57] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] Localisation updates (November 26) [extensions/VisualEditor] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098990 (https://phabricator.wikimedia.org/T372175) (owner: 10Bartosz Dziewoński)
[21:10:25] <wikibugs>	 (03Merged) 10jenkins-bot: Reader Survey: Undeploy on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098617 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza)
[21:10:28] <wikibugs>	 (03Merged) 10jenkins-bot: Reader Survey: Deploy on multiple wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098627 (https://phabricator.wikimedia.org/T378660) (owner: 10DDesouza)
[21:10:44] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1098617|Reader Survey: Undeploy on enwiki (T378660)]], [[gerrit:1098627|Reader Survey: Deploy on multiple wikis (T378660)]]
[21:10:50] <stashbot>	 T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660
[21:11:18] <tgr|away>	 apergos: you don't need to test the patch, I assume?
[21:12:20] <apergos>	 not really. I mean "did it break normal wiki operation" for the checkuser service, I guess, but certainly not the centralauth script
[21:12:41] <tgr|away>	 the service is not used by anything else, right?
[21:13:13] <apergos>	 no, and ci should have caught any changes in servicewiring that would be a problem
[21:13:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:13:48] <tgr|away>	 I'll just deploy it together with something else then
[21:14:03] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] extend account creation backfill script to forced account creations by others [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098965 (https://phabricator.wikimedia.org/T378401) (owner: 10ArielGlenn)
[21:14:12] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] extend account creation lookup service to cover forced creations by others [extensions/CheckUser] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098956 (https://phabricator.wikimedia.org/T378401) (owner: 10ArielGlenn)
[21:15:31] <tgr|away>	 MatmaRex: will you need to test the VE changes?
[21:15:55] <MatmaRex>	 tgr|away: not really, although i could
[21:15:59] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics@6d38940]: Generate canary events faster in Airflow
[21:16:12] <MatmaRex>	 but it's a localisation backport, can't really brak anything
[21:16:20] <logmsgbot>	 !log tgr@deploy2002 tgr, dani: Backport for [[gerrit:1098617|Reader Survey: Undeploy on enwiki (T378660)]], [[gerrit:1098627|Reader Survey: Deploy on multiple wikis (T378660)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:16:21] <MatmaRex>	 break*
[21:16:24] <stashbot>	 T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660
[21:17:38] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@6d38940]: Generate canary events faster in Airflow (duration: 01m 39s)
[21:18:40] <danisztls>	 TheresNoTime: all looks good
[21:18:50] <logmsgbot>	 !log tgr@deploy2002 tgr, dani: Continuing with sync
[21:24:22] <danisztls>	 tgr|away: thanks
[21:25:27] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098617|Reader Survey: Undeploy on enwiki (T378660)]], [[gerrit:1098627|Reader Survey: Deploy on multiple wikis (T378660)]] (duration: 14m 43s)
[21:25:32] <stashbot>	 T378660: Quicksurvey deployment for Reader Survey - https://phabricator.wikimedia.org/T378660
[21:31:06] <wikibugs>	 (03Merged) 10jenkins-bot: Localisation updates (November 26) [extensions/VisualEditor] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098990 (https://phabricator.wikimedia.org/T372175) (owner: 10Bartosz Dziewoński)
[21:32:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:36:20] <wikibugs>	 (03Merged) 10jenkins-bot: extend account creation lookup service to cover forced creations by others [extensions/CheckUser] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098956 (https://phabricator.wikimedia.org/T378401) (owner: 10ArielGlenn)
[21:36:21] <wikibugs>	 (03Merged) 10jenkins-bot: extend account creation backfill script to forced account creations by others [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1098965 (https://phabricator.wikimedia.org/T378401) (owner: 10ArielGlenn)
[21:39:25] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1098990|Localisation updates (November 26) (T372175)]], [[gerrit:1098956|extend account creation lookup service to cover forced creations by others (T378401)]], [[gerrit:1098965|extend account creation backfill script to forced account creations by others (T378401)]], [[gerrit:1098929|ReportIncident: Setup $wgReportIncidentLocalLinks for ptwiki pilot deploy (T380277)]]
[21:39:30] <stashbot>	 T372175: Allow tag labels and links to be translateable separately - https://phabricator.wikimedia.org/T372175
[21:39:31] <stashbot>	 T378401: Start running backfillLocalAccounts.php - https://phabricator.wikimedia.org/T378401
[21:39:31] <stashbot>	 T380277: Prepare local links configuration for IRS pilot wiki - https://phabricator.wikimedia.org/T380277
[21:41:32] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Idle https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:50:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1178 depool (T361627)', diff saved to https://phabricator.wikimedia.org/P71373 and previous config saved to /var/cache/conftool/dbconfig/20241128-215026-ladsgroup.json
[21:50:31] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[21:51:07] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1178.eqiad.wmnet with reason: Schema change (T361627)
[21:51:21] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1178.eqiad.wmnet with reason: Schema change (T361627)
[21:53:42] <logmsgbot>	 !log tgr@deploy2002 tgr, ariel, matmarex, mszabo: Backport for [[gerrit:1098990|Localisation updates (November 26) (T372175)]], [[gerrit:1098956|extend account creation lookup service to cover forced creations by others (T378401)]], [[gerrit:1098965|extend account creation backfill script to forced account creations by others (T378401)]], [[gerrit:1098929|ReportIncident: Setup $wgReportIncidentLocalLinks for ptwiki pilot deploy (T380277)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:53:47] <stashbot>	 T372175: Allow tag labels and links to be translateable separately - https://phabricator.wikimedia.org/T372175
[21:53:47] <stashbot>	 T378401: Start running backfillLocalAccounts.php - https://phabricator.wikimedia.org/T378401
[21:53:48] <stashbot>	 T380277: Prepare local links configuration for IRS pilot wiki - https://phabricator.wikimedia.org/T380277
[21:54:42] <MatmaRex>	 tgr|away: my change looks good on mwdebug
[21:56:08] <tgr|away>	 kostajh: do you want to test the patch?
[21:56:48] <apergos>	 all is fine here (checked reads, recentchanges, edit :-P)
[22:01:01] <tgr|away>	 per https://phabricator.wikimedia.org/T380277#10366334 I suppose the answer is no
[22:03:56] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] Run dumpInterwiki.php locally with no changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098916 (owner: 10Tim Starling)
[22:04:12] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] Activate id.wikivoyage.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098918 (https://phabricator.wikimedia.org/T380726) (owner: 10Tim Starling)
[22:04:20] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] Prepare id.wikivoyage.org for installation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098917 (https://phabricator.wikimedia.org/T380726) (owner: 10Tim Starling)
[22:05:28] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.mysql.pool db1178 gradually with 4 steps - Maint over (T361627)
[22:05:32] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[22:06:50] <tgr|away>	 I guess that patch got deployed already outside the window?
[22:06:56] <tgr|away>	 that was confusing
[22:07:06] <logmsgbot>	 !log tgr@deploy2002 tgr, ariel, matmarex, mszabo: Continuing with sync
[22:07:49] <apergos>	 I saw the checkmark and had no idea what that was about (in the deployment calendar)
[22:08:19] <tgr|away>	 I didn't even notice that
[22:14:58] <kostajh>	 Sorry. I wrote in the channel and added the “Done” check mark in the calendar
[22:15:12] <kostajh>	 Should I have removed it from the calendar?
[22:15:52] <apergos>	 ah that was the source of the Done! :-D maybe a strikethrough if you wanted a record of the deployment to be someplace...?
[22:15:54] <tgr|away>	 or I should have paid more attention, I guess
[22:17:03] <tgr|away>	 having it in the calendar is generally useful for people trying to see what changed (not so much for this patch, but if it's a code change that can break something, it's better to have a paper trail)
[22:17:13] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098990|Localisation updates (November 26) (T372175)]], [[gerrit:1098956|extend account creation lookup service to cover forced creations by others (T378401)]], [[gerrit:1098965|extend account creation backfill script to forced account creations by others (T378401)]], [[gerrit:1098929|ReportIncident: Setup $wgReportIncidentLocalLinks for ptwiki pilot deploy (T380277)]] (duration: 37m 48s)
[22:17:19] <stashbot>	 T372175: Allow tag labels and links to be translateable separately - https://phabricator.wikimedia.org/T372175
[22:17:19] <stashbot>	 T378401: Start running backfillLocalAccounts.php - https://phabricator.wikimedia.org/T378401
[22:17:20] <stashbot>	 T380277: Prepare local links configuration for IRS pilot wiki - https://phabricator.wikimedia.org/T380277
[22:17:53] <tgr|away>	 !log UTC late deploys done
[22:17:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:16] <MatmaRex>	 thanks for deploying tgr|away
[22:18:34] <apergos>	 yep thanks for the deploys
[22:19:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T380182#10366490 (10phaultfinder)
[22:22:30] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance
[22:22:44] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance
[22:22:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T328817)', diff saved to https://phabricator.wikimedia.org/P71376 and previous config saved to /var/cache/conftool/dbconfig/20241128-222250-ladsgroup.json
[22:22:56] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[22:27:52] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T328817)', diff saved to https://phabricator.wikimedia.org/P71377 and previous config saved to /var/cache/conftool/dbconfig/20241128-222751-ladsgroup.json
[22:27:57] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[22:33:57] <wikibugs>	 (03PS1) 10Tim Starling: Fix various installPreConfigured bugs [core] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099059 (https://phabricator.wikimedia.org/T352113)
[22:34:35] <wikibugs>	 (03PS1) 10Tim Starling: installer: Fix failure to install blobs table [core] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099060
[22:35:29] <wikibugs>	 (03PS1) 10Tim Starling: Convert addWiki.php to a wrapper around core installPreConfigured.php [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099061 (https://phabricator.wikimedia.org/T352113)
[22:36:47] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[22:39:21] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[22:39:35] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[22:39:36] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[22:39:53] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[22:40:00] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T376905)', diff saved to https://phabricator.wikimedia.org/P71379 and previous config saved to /var/cache/conftool/dbconfig/20241128-223959-ladsgroup.json
[22:42:59] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P71380 and previous config saved to /var/cache/conftool/dbconfig/20241128-224258-ladsgroup.json
[22:45:56] <wikibugs>	 (03PS2) 10Tim Starling: Convert addWiki.php to a wrapper around core installPreConfigured.php [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099061 (https://phabricator.wikimedia.org/T352113)
[22:45:56] <wikibugs>	 (03PS1) 10Tim Starling: addWiki: Add UpdateSearchIndexConfig [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099064
[22:45:56] <wikibugs>	 (03PS1) 10Tim Starling: dumpInterwiki: read from preinstall.dblist [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099065 (https://phabricator.wikimedia.org/T352113)
[22:45:57] <wikibugs>	 (03PS1) 10Tim Starling: addWiki: Move DB_ADMIN to core [extensions/WikimediaMaintenance] (wmf/1.44.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1099066
[22:49:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T376905)', diff saved to https://phabricator.wikimedia.org/P71381 and previous config saved to /var/cache/conftool/dbconfig/20241128-224905-ladsgroup.json
[22:50:51] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1178 gradually with 4 steps - Maint over (T361627)
[22:50:55] <stashbot>	 T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627
[22:56:36] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:56:56] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:58:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P71383 and previous config saved to /var/cache/conftool/dbconfig/20241128-225805-ladsgroup.json
[22:58:26] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:58:46] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:04:13] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P71384 and previous config saved to /var/cache/conftool/dbconfig/20241128-230412-ladsgroup.json
[23:09:26] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:13:13] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T328817)', diff saved to https://phabricator.wikimedia.org/P71385 and previous config saved to /var/cache/conftool/dbconfig/20241128-231312-ladsgroup.json
[23:13:15] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance
[23:13:17] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[23:13:28] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance
[23:13:30] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[23:13:43] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[23:13:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T328817)', diff saved to https://phabricator.wikimedia.org/P71386 and previous config saved to /var/cache/conftool/dbconfig/20241128-231350-ladsgroup.json
[23:16:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T328817)', diff saved to https://phabricator.wikimedia.org/P71387 and previous config saved to /var/cache/conftool/dbconfig/20241128-231650-ladsgroup.json
[23:19:20] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P71388 and previous config saved to /var/cache/conftool/dbconfig/20241128-231919-ladsgroup.json
[23:31:58] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P71389 and previous config saved to /var/cache/conftool/dbconfig/20241128-233157-ladsgroup.json
[23:34:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T376905)', diff saved to https://phabricator.wikimedia.org/P71390 and previous config saved to /var/cache/conftool/dbconfig/20241128-233426-ladsgroup.json
[23:39:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[23:47:05] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P71391 and previous config saved to /var/cache/conftool/dbconfig/20241128-234704-ladsgroup.json
[23:47:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: confd_prometheus_metrics.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:50:10] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.017e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad