[00:00:09] (CR) Zabe: [C: +2] Start reading from af_user(_text)/afh_user(_text) in testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/992830 (https://phabricator.wikimedia.org/T355616) (owner: Zabe)
[00:01:41] (Merged) jenkins-bot: Start reading from af_user(_text)/afh_user(_text) in testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/992830 (https://phabricator.wikimedia.org/T355616) (owner: Zabe)
[00:02:31] !log zabe@deploy2002 Started scap: Backport for [[gerrit:992830|Start reading from af_user(_text)/afh_user(_text) in testwiki (T355616)]]
[00:02:36] T355616: Start reading from af_user(_text)/afh_user(_text) - https://phabricator.wikimedia.org/T355616
[00:03:59] !log zabe@deploy2002 zabe: Backport for [[gerrit:992830|Start reading from af_user(_text)/afh_user(_text) in testwiki (T355616)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[00:04:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T354336)', diff saved to https://phabricator.wikimedia.org/P55587 and previous config saved to /var/cache/conftool/dbconfig/20240125-000452-marostegui.json
[00:04:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1247.eqiad.wmnet with reason: Maintenance
[00:05:01] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[00:05:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1247.eqiad.wmnet with reason: Maintenance
[00:05:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1247 (T354336)', diff saved to https://phabricator.wikimedia.org/P55588 and previous config saved to /var/cache/conftool/dbconfig/20240125-000515-marostegui.json
[00:05:36] !log zabe@deploy2002 zabe: Continuing with sync
[00:07:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T354336)', diff saved to
https://phabricator.wikimedia.org/P55589 and previous config saved to /var/cache/conftool/dbconfig/20240125-000726-marostegui.json
[00:12:02] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to (general SRE production SSH access) for swfrench - https://phabricator.wikimedia.org/T355834 (Scott_French) In progress→Resolved
[00:12:08] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:992830|Start reading from af_user(_text)/afh_user(_text) in testwiki (T355616)]] (duration: 09m 36s)
[00:12:27] T355616: Start reading from af_user(_text)/afh_user(_text) - https://phabricator.wikimedia.org/T355616
[00:12:56] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2103.codfw.wmnet with OS bullseye
[00:22:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P55590 and previous config saved to /var/cache/conftool/dbconfig/20240125-002233-marostegui.json
[00:37:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P55591 and previous config saved to /var/cache/conftool/dbconfig/20240125-003739-marostegui.json
[00:38:55] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/992654
[00:38:58] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/992654 (owner: TrainBranchBot)
[00:52:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T354336)', diff saved to https://phabricator.wikimedia.org/P55592 and previous config saved to /var/cache/conftool/dbconfig/20240125-005245-marostegui.json
[00:52:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1248.eqiad.wmnet with reason: Maintenance
[00:52:51]
T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[00:53:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1248.eqiad.wmnet with reason: Maintenance
[00:53:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1248 (T354336)', diff saved to https://phabricator.wikimedia.org/P55593 and previous config saved to /var/cache/conftool/dbconfig/20240125-005307-marostegui.json
[00:54:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T354336)', diff saved to https://phabricator.wikimedia.org/P55594 and previous config saved to /var/cache/conftool/dbconfig/20240125-005417-marostegui.json
[01:00:43] (CR) CI reject: [V: -1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/992654 (owner: TrainBranchBot)
[01:01:45] (PS1) Cwhite: logstash: consume from mediawiki accesslog sampled topics [puppet] - https://gerrit.wikimedia.org/r/992656 (https://phabricator.wikimedia.org/T355836)
[01:01:47] (PS1) Cwhite: logstash: stop consuming the full mediawiki accesslog topics [puppet] - https://gerrit.wikimedia.org/r/992657 (https://phabricator.wikimedia.org/T355836)
[01:09:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P55595 and previous config saved to /var/cache/conftool/dbconfig/20240125-010923-marostegui.json
[01:24:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P55596 and previous config saved to /var/cache/conftool/dbconfig/20240125-012430-marostegui.json
[01:28:03] !log fab@deploy2002 Started deploy [airflow-dags/research@e6aa85a]: (no justification provided)
[01:28:17] !log fab@deploy2002 Finished deploy [airflow-dags/research@e6aa85a]: (no justification
provided) (duration: 00m 13s)
[01:38:51] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[01:39:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T354336)', diff saved to https://phabricator.wikimedia.org/P55597 and previous config saved to /var/cache/conftool/dbconfig/20240125-013936-marostegui.json
[01:39:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[01:39:43] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[01:39:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[01:39:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1249 (T354336)', diff saved to https://phabricator.wikimedia.org/P55598 and previous config saved to /var/cache/conftool/dbconfig/20240125-013958-marostegui.json
[01:42:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T354336)', diff saved to https://phabricator.wikimedia.org/P55599 and previous config saved to /var/cache/conftool/dbconfig/20240125-014208-marostegui.json
[01:57:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P55600 and previous config saved to /var/cache/conftool/dbconfig/20240125-015714-marostegui.json
[02:12:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P55601 and previous config saved to
/var/cache/conftool/dbconfig/20240125-021221-marostegui.json
[02:27:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T354336)', diff saved to https://phabricator.wikimedia.org/P55602 and previous config saved to /var/cache/conftool/dbconfig/20240125-022727-marostegui.json
[02:27:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[02:27:34] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[02:27:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[02:29:16] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:29:16] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:29:50] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:39:21] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:50:13] (PS1) Andrew Bogott: disable_tool: remove the archive_db stage from the cron host [puppet] - https://gerrit.wikimedia.org/r/992835 (https://phabricator.wikimedia.org/T353642)
[02:51:24] (PS2) Andrew Bogott: disable_tool: remove the archive_db stage from the cron host [puppet] - https://gerrit.wikimedia.org/r/992835 (https://phabricator.wikimedia.org/T353642)
[02:55:49] (CR) Andrew Bogott: [C: +2] disable_tool:
remove the archive_db stage from the cron host [puppet] - https://gerrit.wikimedia.org/r/992835 (https://phabricator.wikimedia.org/T353642) (owner: Andrew Bogott)
[03:09:21] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:57:29] (CR) Samwilson: [C: +1] "I've double-checked it and it's right." [mediawiki-config] - https://gerrit.wikimedia.org/r/992632 (https://phabricator.wikimedia.org/T350653) (owner: Samtar)
[05:03:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-sqoop-mediawiki-production-daily.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:38:51] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[05:50:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[05:51:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[05:55:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2099.codfw.wmnet with reason: Maintenance
[05:55:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2099.codfw.wmnet with reason: Maintenance
[05:56:01] (PS1) Marostegui: Revert "mariadb: Disable notifications on A1 hosts" [puppet] - https://gerrit.wikimedia.org/r/992780
[05:56:05] !log marostegui@cumin1002 START -
Cookbook sre.hosts.downtime for 8:00:00 on db2106.codfw.wmnet with reason: Maintenance
[05:56:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2106.codfw.wmnet with reason: Maintenance
[05:56:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2106 (T354336)', diff saved to https://phabricator.wikimedia.org/P55603 and previous config saved to /var/cache/conftool/dbconfig/20240125-055626-marostegui.json
[05:56:31] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[05:58:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T354336)', diff saved to https://phabricator.wikimedia.org/P55604 and previous config saved to /var/cache/conftool/dbconfig/20240125-055837-marostegui.json
[06:00:07] (CR) Marostegui: [C: +2] Revert "mariadb: Disable notifications on A1 hosts" [puppet] - https://gerrit.wikimedia.org/r/992780 (owner: Marostegui)
[06:02:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 1%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55605 and previous config saved to /var/cache/conftool/dbconfig/20240125-060214-root.json
[06:02:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2157 (re)pooling @ 1%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55606 and previous config saved to /var/cache/conftool/dbconfig/20240125-060222-root.json
[06:02:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 1%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55607 and previous config saved to /var/cache/conftool/dbconfig/20240125-060240-root.json
[06:02:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 1%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55608 and previous config saved to
/var/cache/conftool/dbconfig/20240125-060249-root.json
[06:10:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 28 hosts with reason: Primary switchover s2 T355682
[06:10:28] T355682: Switchover s2 master (db2107 -> db2104) - https://phabricator.wikimedia.org/T355682
[06:10:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2104 with weight 0 T355682', diff saved to https://phabricator.wikimedia.org/P55609 and previous config saved to /var/cache/conftool/dbconfig/20240125-061048-root.json
[06:11:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s2 T355682
[06:11:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:12:40] (CR) Marostegui: [C: +2] mariadb: Promote db2104 to s2 master [puppet] - https://gerrit.wikimedia.org/r/992428 (https://phabricator.wikimedia.org/T355682) (owner: Gerrit maintenance bot)
[06:13:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P55610 and previous config saved to /var/cache/conftool/dbconfig/20240125-061344-marostegui.json
[06:15:19] (PS1) Marostegui: ProductionServices.php: Promote pc2014 [mediawiki-config] - https://gerrit.wikimedia.org/r/992842 (https://phabricator.wikimedia.org/T355683)
[06:17:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 5%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55611 and previous config saved to /var/cache/conftool/dbconfig/20240125-061719-root.json
[06:17:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2157 (re)pooling @ 5%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55612 and previous
config saved to /var/cache/conftool/dbconfig/20240125-061727-root.json
[06:17:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 5%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55613 and previous config saved to /var/cache/conftool/dbconfig/20240125-061745-root.json
[06:17:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 5%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55614 and previous config saved to /var/cache/conftool/dbconfig/20240125-061753-root.json
[06:26:32] (CR) Marostegui: [C: +2] ProductionServices.php: Promote pc2014 [mediawiki-config] - https://gerrit.wikimedia.org/r/992842 (https://phabricator.wikimedia.org/T355683) (owner: Marostegui)
[06:27:15] (Merged) jenkins-bot: ProductionServices.php: Promote pc2014 [mediawiki-config] - https://gerrit.wikimedia.org/r/992842 (https://phabricator.wikimedia.org/T355683) (owner: Marostegui)
[06:28:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P55615 and previous config saved to /var/cache/conftool/dbconfig/20240125-062851-marostegui.json
[06:29:04] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:992842|ProductionServices.php: Promote pc2014 (T355683)]]
[06:29:09] T355683: Switchover pc2 master - https://phabricator.wikimedia.org/T355683
[06:30:58] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:992842|ProductionServices.php: Promote pc2014 (T355683)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[06:31:25] !log marostegui@deploy2002 marostegui: Continuing with sync
[06:32:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55616 and previous config saved to /var/cache/conftool/dbconfig/20240125-063225-root.json
[06:32:32] !log
marostegui@cumin1002 dbctl commit (dc=all): 'db2157 (re)pooling @ 10%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55617 and previous config saved to /var/cache/conftool/dbconfig/20240125-063232-root.json
[06:32:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 10%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55618 and previous config saved to /var/cache/conftool/dbconfig/20240125-063250-root.json
[06:32:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 10%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55619 and previous config saved to /var/cache/conftool/dbconfig/20240125-063258-root.json
[06:37:46] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:992842|ProductionServices.php: Promote pc2014 (T355683)]] (duration: 08m 42s)
[06:37:51] T355683: Switchover pc2 master - https://phabricator.wikimedia.org/T355683
[06:38:06] (PS1) Marostegui: pc2: Enable notifications on the master [puppet] - https://gerrit.wikimedia.org/r/992843 (https://phabricator.wikimedia.org/T355683)
[06:39:17] (CR) Marostegui: [C: +2] pc2: Enable notifications on the master [puppet] - https://gerrit.wikimedia.org/r/992843 (https://phabricator.wikimedia.org/T355683) (owner: Marostegui)
[06:41:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:43:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T354336)', diff saved to https://phabricator.wikimedia.org/P55620 and previous config saved to /var/cache/conftool/dbconfig/20240125-064357-marostegui.json
[06:44:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2110.codfw.wmnet with reason: Maintenance
[06:44:03]
T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[06:44:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2110.codfw.wmnet with reason: Maintenance
[06:44:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2110 (T354336)', diff saved to https://phabricator.wikimedia.org/P55621 and previous config saved to /var/cache/conftool/dbconfig/20240125-064420-marostegui.json
[06:47:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 25%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55622 and previous config saved to /var/cache/conftool/dbconfig/20240125-064729-root.json
[06:47:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2157 (re)pooling @ 25%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55623 and previous config saved to /var/cache/conftool/dbconfig/20240125-064737-root.json
[06:47:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 25%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55624 and previous config saved to /var/cache/conftool/dbconfig/20240125-064755-root.json
[06:48:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 25%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55625 and previous config saved to /var/cache/conftool/dbconfig/20240125-064803-root.json
[06:53:46] SRE, ops-codfw, Data-Persistence: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (Marostegui) Database related hosts are being repooled
[06:55:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T354336)', diff saved to https://phabricator.wikimedia.org/P55626 and previous config saved to /var/cache/conftool/dbconfig/20240125-065535-marostegui.json
[06:55:41] T354336: Add columns
cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T0700)
[07:00:04] kormat, marostegui, and Amir1: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T0700)
[07:00:08] arnaudb: ready?
[07:00:41] ready
[07:00:46] oooook
[07:00:55] !log Starting s2 codfw failover from db2107 to db2104 - T355682
[07:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:03] T355682: Switchover s2 master (db2107 -> db2104) - https://phabricator.wikimedia.org/T355682
[07:01:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s2 codfw as read-only for maintenance - T355682', diff saved to https://phabricator.wikimedia.org/P55627 and previous config saved to /var/cache/conftool/dbconfig/20240125-070120-marostegui.json
[07:01:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2104 to s2 primary and set section read-write T355682', diff saved to https://phabricator.wikimedia.org/P55628 and previous config saved to /var/cache/conftool/dbconfig/20240125-070153-marostegui.json
[07:02:08] arnaudb: done, can you check you can write in any s2 wiki?
[07:02:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 50%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55629 and previous config saved to /var/cache/conftool/dbconfig/20240125-070234-root.json
[07:02:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2157 (re)pooling @ 50%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55630 and previous config saved to /var/cache/conftool/dbconfig/20240125-070242-root.json
[07:02:47] one sec, on it
[07:03:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 50%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55631 and previous config saved to /var/cache/conftool/dbconfig/20240125-070300-root.json
[07:03:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 50%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55632 and previous config saved to /var/cache/conftool/dbconfig/20240125-070308-root.json
[07:05:26] (CR) Marostegui: [C: +2] wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/992429 (https://phabricator.wikimedia.org/T355682) (owner: Gerrit maintenance bot)
[07:06:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2107 T355682', diff saved to https://phabricator.wikimedia.org/P55633 and previous config saved to /var/cache/conftool/dbconfig/20240125-070604-marostegui.json
[07:06:21] T355682: Switchover s2 master (db2107 -> db2104) - https://phabricator.wikimedia.org/T355682
[07:07:07] (CR) Mxmxchere: "Hi Joe and thanks for the prompt review.
Your goal is that for etcd 3.3/Debian 11 machines the config file should remain untouched to circ" [puppet] - https://gerrit.wikimedia.org/r/992629 (owner: Mxmxchere)
[07:08:06] everything looks ok on my end
[07:08:11] ok thanks
[07:08:36] SRE, ops-codfw, Data-Persistence, Infrastructure-Foundations, netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (Marostegui)
[07:12:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2159 db2160 db2109 db2107 db2137:3314 db2135:3315 db2143 db2147 db2177 db2178 db2188 T355549', diff saved to https://phabricator.wikimedia.org/P55634 and previous config saved to /var/cache/conftool/dbconfig/20240125-071253-marostegui.json
[07:12:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P55635 and previous config saved to /var/cache/conftool/dbconfig/20240125-071259-marostegui.json
[07:13:00] T355549: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549
[07:13:54] SRE, ops-codfw, Data-Persistence, Infrastructure-Foundations, netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (Marostegui) Database hosts are depooled - @cmooney confirm if you will downtime them or if I should do it myself
[07:17:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55636 and previous config saved to /var/cache/conftool/dbconfig/20240125-071739-root.json
[07:17:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2157 (re)pooling @ 75%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55637 and previous config saved to /var/cache/conftool/dbconfig/20240125-071747-root.json
[07:18:06] !log marostegui@cumin1002 dbctl
commit (dc=all): 'es2026 (re)pooling @ 75%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55638 and previous config saved to /var/cache/conftool/dbconfig/20240125-071805-root.json
[07:18:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 75%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55639 and previous config saved to /var/cache/conftool/dbconfig/20240125-071813-root.json
[07:20:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2137:3315 T355549', diff saved to https://phabricator.wikimedia.org/P55640 and previous config saved to /var/cache/conftool/dbconfig/20240125-072010-marostegui.json
[07:20:19] T355549: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549
[07:28:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P55641 and previous config saved to /var/cache/conftool/dbconfig/20240125-072806-marostegui.json
[07:31:56] (PS4) Slyngshede: Debian packaging, dependencies and permissions [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/992739
[07:32:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55642 and previous config saved to /var/cache/conftool/dbconfig/20240125-073244-root.json
[07:32:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2157 (re)pooling @ 100%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55643 and previous config saved to /var/cache/conftool/dbconfig/20240125-073252-root.json
[07:32:58] (CR) Slyngshede: Debian packaging, dependencies and permissions (1 comment) [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/992739 (owner: Slyngshede)
[07:33:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026
(re)pooling @ 100%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55644 and previous config saved to /var/cache/conftool/dbconfig/20240125-073310-root.json
[07:33:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 100%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P55645 and previous config saved to /var/cache/conftool/dbconfig/20240125-073319-root.json
[07:43:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T354336)', diff saved to https://phabricator.wikimedia.org/P55646 and previous config saved to /var/cache/conftool/dbconfig/20240125-074312-marostegui.json
[07:43:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2119.codfw.wmnet with reason: Maintenance
[07:43:18] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[07:43:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2119.codfw.wmnet with reason: Maintenance
[07:43:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2119 (T354336)', diff saved to https://phabricator.wikimedia.org/P55647 and previous config saved to /var/cache/conftool/dbconfig/20240125-074334-marostegui.json
[07:45:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T354336)', diff saved to https://phabricator.wikimedia.org/P55648 and previous config saved to /var/cache/conftool/dbconfig/20240125-074546-marostegui.json
[07:59:00] (CR) Alexandros Kosiaris: [C: +1] ml-serve: Drop explicit list of deployExtraClusterRoles [deployment-charts] - https://gerrit.wikimedia.org/r/992764 (https://phabricator.wikimedia.org/T354516) (owner: Klausman)
[08:00:04] Amir1 and Urbanecm: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T0800).
[08:00:04] Tran: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:16] 👋
[08:00:49] (CR) Muehlenhoff: Debian packaging, dependencies and permissions (1 comment) [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/992739 (owner: Slyngshede)
[08:00:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P55650 and previous config saved to /var/cache/conftool/dbconfig/20240125-080053-marostegui.json
[08:00:54] hi
[08:01:23] (CR) Kosta Harlan: "Yes" [extensions/CentralAuth] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/992123 (https://phabricator.wikimedia.org/T354928) (owner: Kosta Harlan)
[08:01:30] I'm going to add https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/992123 to the calendar
[08:01:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[08:01:45] Hi Tran! I can deploy your patch
[08:02:01] I can also run the deploy steps myself if you're here to help bail me out if I mess it up?
[08:02:10] I do have access to the deploy server
[08:03:10] hmm, actually, sorry I just got a notice from my calendar reminding me I need to leave soon
[08:03:16] Amir1, are you around?
[08:04:42] or perhaps hashar?
[08:05:57] Alternatively, I could just deploy it and revert immediately if something goes pear shaped. The steps look reasonable.
[08:06:36] (03CR) 10Kosta Harlan: Update beta configs to reflect new temp account naming pattern (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503) (owner: 10STran) [08:07:11] Tran: yeah going with https://deploy-commands.toolforge.org/bacc/992670 should be pretty straightforward [08:07:34] if you're comfortable doing so, I'm around for another 5 minutes or so [08:07:43] Okay I can start [08:08:09] likewise, if you're comfortable syncing https://deploy-commands.toolforge.org/bacc/992123, I'd appreciate that. The patch is already live on wmf.14 and merged into master, it just didn't make the branch cut for wmf.15. [08:08:45] Let's see if I can get this first one done without problem and if I can, I'll do yours too. [08:09:19] (03PS2) 10Muehlenhoff: Remove long-absented resource [puppet] - 10https://gerrit.wikimedia.org/r/992700 [08:09:25] Actually, let me do yours first so I can answer your comment on my patch without rushing [08:10:32] (03CR) 10STran: [C: 03+2] "backporting" [extensions/CentralAuth] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992123 (https://phabricator.wikimedia.org/T354928) (owner: 10Kosta Harlan) [08:11:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by stran@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992123 (https://phabricator.wikimedia.org/T354928) (owner: 10Kosta Harlan) [08:12:50] Tran: for verifying https://gerrit.wikimedia.org/r/992123, you'd create a new account on test.wikipedia.org via mwdebug2002 and then have a look at logstash debug dashboard https://logstash.wikimedia.org/app/dashboards#/view/mwdebug1002?_g=h@48fceb7&_a=h@b20f488 [08:15:02] (03CR) 10Muehlenhoff: [C: 03+2] Remove long-absented resource [puppet] - 10https://gerrit.wikimedia.org/r/992700 (owner: 10Muehlenhoff) [08:15:46] (03Merged) 10jenkins-bot: PreAuthenticationProvider: Allow blocking account creation based on IP 
reputation [extensions/CentralAuth] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992123 (https://phabricator.wikimedia.org/T354928) (owner: 10Kosta Harlan) [08:16:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P55651 and previous config saved to /var/cache/conftool/dbconfig/20240125-081559-marostegui.json [08:16:12] !log stran@deploy2002 Started scap: Backport for [[gerrit:992123|PreAuthenticationProvider: Allow blocking account creation based on IP reputation (T354928)]] [08:16:17] T354928: Allow denial of account creation for IPs known to ipoid - https://phabricator.wikimedia.org/T354928 [08:16:23] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete setting [puppet] - 10https://gerrit.wikimedia.org/r/992407 (owner: 10Muehlenhoff) [08:19:53] (03PS1) 10Muehlenhoff: Default insetup::buster role to not send notifications as well [puppet] - 10https://gerrit.wikimedia.org/r/992846 [08:20:16] (03PS2) 10Muehlenhoff: Switch hadoop master/standby roles to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/990693 (https://phabricator.wikimedia.org/T349619) [08:22:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/990693 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:28:19] (03CR) 10Filippo Giunchedi: [C: 03+1] Default insetup::buster role to not send notifications as well [puppet] - 10https://gerrit.wikimedia.org/r/992846 (owner: 10Muehlenhoff) [08:31:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T354336)', diff saved to https://phabricator.wikimedia.org/P55652 and previous config saved to /var/cache/conftool/dbconfig/20240125-083106-marostegui.json [08:31:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2136.codfw.wmnet with reason: Maintenance [08:31:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 8:00:00 on db2136.codfw.wmnet with reason: Maintenance [08:31:12] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [08:31:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance [08:31:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2137.codfw.wmnet with reason: Maintenance [08:31:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:40:25] (03CR) 10Muehlenhoff: [C: 03+2] Default insetup::buster role to not send notifications as well [puppet] - 10https://gerrit.wikimedia.org/r/992846 (owner: 10Muehlenhoff) [08:40:50] (03CR) 10Muehlenhoff: [C: 03+2] Remove Marko from a few groups no longer needed/used [puppet] - 10https://gerrit.wikimedia.org/r/991774 (owner: 10Muehlenhoff) [08:44:58] !log stran@deploy2002 stran and kharlan: Backport for [[gerrit:992123|PreAuthenticationProvider: Allow blocking account creation based on IP reputation (T354928)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:45:08] T354928: Allow denial of account creation for IPs known to ipoid - https://phabricator.wikimedia.org/T354928 [08:45:42] (03Abandoned) 10Filippo Giunchedi: puppet: fail the run with puppet 7 and buster [puppet] - 10https://gerrit.wikimedia.org/r/991540 (owner: 10Filippo Giunchedi) [08:49:54] (03Abandoned) 10Filippo Giunchedi: profile: restart postgres on first install / bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/705704 (owner: 10Filippo Giunchedi) [08:50:54] (03CR) 10Filippo Giunchedi: profile: restart postgres on first install / bootstrap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705704 (owner: 10Filippo Giunchedi) [08:53:56] Currently 
testing 992123, hoping to be done before the window ends and apologies if I run over. [08:59:10] Tran: I'm back; how's it going? [08:59:50] Testing it right now. It took longer to deploy than expected. I was able to successfully create an account and logs looked okay to me. Could you double check? [09:00:05] hashar and jnuche: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T0900) [09:00:22] the train is blocked [09:00:42] Tran: yeah, checking it [09:00:43] a blocker due to CentralAuth got added yesterday night [09:00:52] I have to announce it [09:01:02] hashar: do you have a link? [09:01:27] tgr: https://gerrit.wikimedia.org/r/992804 UserGroupManager: Fix cross-wiki database access [09:01:30] (03CR) 10Muehlenhoff: [C: 03+2] udp2log: Replace ferm rules with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/991793 (owner: 10Muehlenhoff) [09:01:41] due to some heavy refactoring in the mediawiki DB layer https://gerrit.wikimedia.org/r/c/mediawiki/core/+/990745 [09:01:54] according to taavi (but I see no reason to not trust his judgement :] ) [09:02:07] I am mentioning him for reference [09:02:14] and all that code completely escapes me [09:02:26] Tran: it looks good to me [09:02:41] great thanks I'll continue with the sync [09:04:21] kostajh> do you know how I can recover from a disconnected pipe. I forgot to run this in a screen. [09:05:53] Tran: I am not sure. [09:06:05] Well that's awkward. Do I re-run scap? or revert? [09:06:12] hashar: give me ten minutes to test. [09:06:39] tgr: yeah no worries, that got reported last night [09:06:49] I think you can just re-run `scap backport {changeid}` [09:06:55] Tran: check with ps if it's still running? 
[09:06:57] but maybe someone else here knows [09:07:13] I think it's paused at the "test on mwdebug" stage [09:07:24] oh, right [09:07:59] that's probably not recoverable without root [09:08:06] but yeah you can just re-run it [09:08:07] I'm not sure if `scap backport` works on an already merged patch, but `scap sync-world` will surely do the right thing since the patch was already merged and pulled to deploy2002 [09:08:24] backport works too, it just skips the merge part then [09:08:36] oh even better [09:08:46] but yeah sync is a little faster [09:09:00] you will need to abort the old scap since it has a lock system [09:09:28] maybe there is a command line parameter for that? [09:09:57] would that be `scap backport --revert `? [09:09:58] if not, probably fine to just kill it, if it's waiting for a keypress [09:10:12] I don't have access to the process, based on what `ps` is telling me [09:10:14] no, revert would try to undo the change [09:10:47] I would try `scap backport 992123`. (after invoking `screen` or `tmux`) [09:11:09] okay let me try that and yes, lesson learned. Use `tmux`. 
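The `ps` check suggested above can also be scripted. What follows is a minimal sketch, assuming a POSIX host and Python 3; `pid_is_running` is a hypothetical helper written for illustration and is not part of scap. It distinguishes "no such process" from "the process exists but belongs to another user", which is the situation described when `ps` shows a process you cannot control.

```python
import os


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        # Zero and negative values address process groups, not one process.
        raise ValueError("pid must be a positive integer")
    try:
        # Signal 0 runs the existence and permission checks
        # without delivering any signal to the target process.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no process with this PID
    except PermissionError:
        return True  # process exists but is owned by another user
    return True


if __name__ == "__main__":
    # Check the current interpreter as a simple demonstration.
    print(pid_is_running(os.getpid()))
```

A check like this only reports whether the PID exists. It does not say whether the process is still making progress, so the log dashboard remains the better signal for a long-running sync.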
[09:12:02] !log stran@deploy2002 Started scap: Backport for [[gerrit:992123|PreAuthenticationProvider: Allow blocking account creation based on IP reputation (T354928)]] [09:12:07] T354928: Allow denial of account creation for IPs known to ipoid - https://phabricator.wikimedia.org/T354928 [09:13:30] Tran: you seem to be owning process 29685 [09:14:04] (03PS1) 10Muehlenhoff: Revert "udp2log: Replace ferm rules with firewall::service" [puppet] - 10https://gerrit.wikimedia.org/r/992880 [09:14:16] !log stran@deploy2002 kharlan and stran: Backport for [[gerrit:992123|PreAuthenticationProvider: Allow blocking account creation based on IP reputation (T354928)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:15:46] (03CR) 10Muehlenhoff: [C: 03+2] Revert "udp2log: Replace ferm rules with firewall::service" [puppet] - 10https://gerrit.wikimedia.org/r/992880 (owner: 10Muehlenhoff) [09:16:26] PROBLEM - Host mwlog2002 is DOWN: PING CRITICAL - Packet loss = 100% [09:16:44] I made https://gitlab.wikimedia.org/toolforge-repos/deploy-commands/-/merge_requests/1 to update the deployment commands page to reference tmux/screen [09:16:50] tgr: is that the old scap I disconnected from? I think 12856 is the new one I just kicked off. [09:18:12] !log stran@deploy2002 kharlan and stran: Continuing with sync [09:18:12] it looks like 29685 was `scap backport` which has invoked a new process for `sync-world` which is 12856 [09:18:23] new scap is syncing [09:19:40] if it doesn't prevent you from running scap again it's fine. I thought it uses a lockfile but maybe that's only done during the sync step.
[09:21:50] RECOVERY - Host mwlog2002 is UP: PING OK - Packet loss = 0%, RTA = 30.29 ms [09:22:41] I can't remember where (or whether) the scap log files are, but it emits its logs over syslog which can then be seen in Kibana https://logstash.wikimedia.org/app/dashboards#/view/f7e31de0-9f0d-11eb-863c-3588009e4dd9 [09:22:57] so you can potentially check the progress from there [09:23:24] yesterday a backport took 10/11 minutes, I am guessing that is the new baseline [09:25:20] and pid 12856 is still emitting logs (can be checked by filtering on `process.pid:12856`) [09:29:26] !log stran@deploy2002 Finished scap: Backport for [[gerrit:992123|PreAuthenticationProvider: Allow blocking account creation based on IP reputation (T354928)]] (duration: 17m 24s) [09:29:31] T354928: Allow denial of account creation for IPs known to ipoid - https://phabricator.wikimedia.org/T354928 [09:29:56] Well I think that finished successfully [09:30:09] it took 17 minutes [09:30:30] and the `!log` shows it has completed [09:30:38] no clue why it took SO long though :-\\\\\ [09:31:08] checking `ps`, I don't see any processes I own that refer to the commands I ran about 40 minutes ago so I guess it timed out?
[09:31:08] (03CR) 10Kosta Harlan: Update beta configs to reflect new temp account naming pattern (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503) (owner: 10STran) [09:31:40] hashar: the old one is not emitting logs: https://logstash.wikimedia.org/goto/9ccb86a05ea94cb1f559061b9d21e0cb [09:32:04] I would assume the old process was just killed when the SSH session timed out [09:32:07] right [09:32:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T354336)', diff saved to https://phabricator.wikimedia.org/P55653 and previous config saved to /var/cache/conftool/dbconfig/20240125-093208-marostegui.json [09:32:16] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [09:32:33] Tran: will you sync the config patch or do you want to do that another time? I had a question about one of the values there, so maybe a later window is better. [09:32:41] and as the last scap run finished successfully, everything should be in a consistent state now [09:33:22] kostajh We're out of the window so I can reschedule it. I came to the same assumption you did but I pinged someone with more context about it and we can wait for that answer. [09:33:29] ok [09:33:41] taavi: do you know how to reproduce the train blocked bug locally? [09:33:55] I guess I need to clear the central user cache? [09:34:21] Tran: the train is blocked so syncing a config change should be fine [09:34:39] hashar: ^ right? 
[09:34:45] yes [09:34:55] Tran: yes please continue with your deployment [09:35:00] jouncebot: now [09:35:00] For the next 1 hour(s) and 24 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T0900) [09:35:12] we are not running the train this morning [09:35:33] and I am usually more than happy having the backport window to be extended as long as all parties are aware :) [09:36:04] FWIW the fix for T355813 looks good, I just need to figure out how to test it [09:36:04] T355813: CentralAuth doesn't shows user rights correctly - https://phabricator.wikimedia.org/T355813 [09:36:14] ah great [09:36:34] tgr: I'm able to repro just by visiting Special:CA on my local wiki without the fix applied [09:36:52] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/990693 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:36:59] I am not qualified at all in reviewing any of that since I know nothing about CentralAuth, shared DB or the DB abstraction layer or architecture [09:37:16] taavi: then I guess we can cherry pick and try it out on mwdebug? [09:37:41] PROBLEM - Host mwlog2002 is DOWN: PING CRITICAL - Packet loss = 100% [09:38:17] eek [09:38:39] moritzm: is mwlog2002 being down related to the udp2log patch you merged earlier? 
[09:38:42] I can ssh on mwlog2002 [09:38:51] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [09:39:01] RECOVERY - Host mwlog2002 is UP: PING OK - Packet loss = 0%, RTA = 31.31 ms [09:39:10] duh, I'm being stupid [09:39:30] of course it works locally if I have the same groups on every wiki [09:40:21] whoops [09:41:18] (03CR) 10Muehlenhoff: [C: 03+2] Switch hadoop master/standby roles to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/990693 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:42:01] Sorry for the delay, was discussing if we were ready to deploy the config change. If I could still make it in 5-10 minutes, that would be great otherwise we're not in a rush. [09:43:58] (03PS5) 10Slyngshede: Debian packaging, dependencies and database migration. 
[software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/992739 [09:44:06] (03Abandoned) 10Muehlenhoff: Also default insetup::buster role disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/990695 (owner: 10Muehlenhoff) [09:45:30] (03PS4) 10STran: Update beta configs to reflect new temp account naming pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503) [09:46:48] (03CR) 10Kosta Harlan: Update beta configs to reflect new temp account naming pattern (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503) (owner: 10STran) [09:47:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P55654 and previous config saved to /var/cache/conftool/dbconfig/20240125-094714-marostegui.json [09:47:34] (03CR) 10Kosta Harlan: [C: 03+1] Update beta configs to reflect new temp account naming pattern (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503) (owner: 10STran) [09:48:11] (03CR) 10Muehlenhoff: Debian packaging, dependencies and database migration. (031 comment) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/992739 (owner: 10Slyngshede) [09:50:01] (03PS5) 10STran: Update beta configs to reflect new temp account naming pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503) [09:50:37] (03CR) 10Kosta Harlan: [C: 03+1] "thanks! Looks good to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503) (owner: 10STran) [09:51:01] (03CR) 10Muehlenhoff: [C: 03+2] Fold linux44 into the regular wmf kmod::blacklist [puppet] - 10https://gerrit.wikimedia.org/r/992702 (owner: 10Muehlenhoff) [09:51:39] hashar: +2-d. 
I'll leave the backport to someone else, it's getting late. [09:53:00] If no one has any objections, could I start my config backport of 992670? [09:53:33] Tran: I'd say go for it [09:53:48] core merges take way longer than config merges [09:53:57] alright then I'm starting [09:54:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by stran@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503) (owner: 10STran) [09:55:18] (03Merged) 10jenkins-bot: Update beta configs to reflect new temp account naming pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992670 (https://phabricator.wikimedia.org/T349503) (owner: 10STran) [09:55:32] (03PS6) 10Slyngshede: Debian packaging, dependencies and database migration. [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/992739 [09:59:26] (03CR) 10Muehlenhoff: [C: 03+1] "Ship it :-)" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/992739 (owner: 10Slyngshede) [09:59:51] (03CR) 10Slyngshede: [C: 03+2] Debian packaging, dependencies and database migration. (031 comment) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/992739 (owner: 10Slyngshede) [10:00:05] backport of 992670 is done [10:01:34] hashar: should we log that the backport window is done? [10:02:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P55655 and previous config saved to /var/cache/conftool/dbconfig/20240125-100221-marostegui.json [10:02:52] (03Merged) 10jenkins-bot: Debian packaging, dependencies and database migration. 
[software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/992739 (owner: 10Slyngshede) [10:05:34] (03CR) 10Majavah: [V: 03+1 C: 03+2] Bring cloudrabbit1003 in service as a new cluster [puppet] - 10https://gerrit.wikimedia.org/r/992725 (owner: 10Majavah) [10:07:46] (03PS1) 10Muehlenhoff: Install debmonitor-server on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/992881 (https://phabricator.wikimedia.org/T241049) [10:12:45] (03PS1) 10Majavah: P:openstack: rabbitmq: fix RABBITMQ_NODENAME [puppet] - 10https://gerrit.wikimedia.org/r/992882 [10:12:47] (03PS1) 10Majavah: rabbitmq: fix order of invalidate_rabbitmq_guest_account [puppet] - 10https://gerrit.wikimedia.org/r/992883 [10:14:12] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1208/co" [puppet] - 10https://gerrit.wikimedia.org/r/992883 (owner: 10Majavah) [10:17:04] !log upgrading python-pymysql in S6 DB hosts to 1.0.2-2~wmf11u1 T355531 [10:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:10] T355531: Migrate all db-* scripts to Bookworm - https://phabricator.wikimedia.org/T355531 [10:17:24] (03CR) 10Majavah: [C: 03+2] P:openstack: rabbitmq: fix RABBITMQ_NODENAME [puppet] - 10https://gerrit.wikimedia.org/r/992882 (owner: 10Majavah) [10:17:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T354336)', diff saved to https://phabricator.wikimedia.org/P55656 and previous config saved to /var/cache/conftool/dbconfig/20240125-101728-marostegui.json [10:17:29] (03CR) 10Majavah: [V: 03+1 C: 03+2] rabbitmq: fix order of invalidate_rabbitmq_guest_account [puppet] - 10https://gerrit.wikimedia.org/r/992883 (owner: 10Majavah) [10:17:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance [10:17:33] T354336: Add columns cul_result_id and 
cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [10:17:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance [10:17:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2138:3314 (T354336)', diff saved to https://phabricator.wikimedia.org/P55657 and previous config saved to /var/cache/conftool/dbconfig/20240125-101750-marostegui.json [10:20:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T354336)', diff saved to https://phabricator.wikimedia.org/P55658 and previous config saved to /var/cache/conftool/dbconfig/20240125-102002-marostegui.json [10:21:24] 10SRE, 10ops-codfw, 10Data-Persistence, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (10cmooney) >>! In T355549#9487462, @Marostegui wrote: > Database hosts are depooled - @cmooney confirm if you wi... [10:21:42] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudrabbit1003.eqiad.wmnet with OS bookworm [10:22:04] (03PS1) 10Majavah: wikimediacloud.org: Move RabbitMQ traffic to cloudrabbit1003 [dns] - 10https://gerrit.wikimedia.org/r/992884 (https://phabricator.wikimedia.org/T345610) [10:27:37] 10SRE, 10ops-codfw, 10Data-Persistence, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (10Marostegui) Great thank you! 
[10:31:37] (03CR) 10Slyngshede: Install debmonitor-server on bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/992881 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [10:35:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P55659 and previous config saved to /var/cache/conftool/dbconfig/20240125-103509-marostegui.json [10:35:55] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit1003.eqiad.wmnet with reason: host reimage [10:38:18] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:39:10] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit1003.eqiad.wmnet with reason: host reimage [10:43:18] (03PS1) 10Hnowlan: tegola: temporarily disable maps2006 db [deployment-charts] - 10https://gerrit.wikimedia.org/r/992887 (https://phabricator.wikimedia.org/T355549) [10:45:21] (03CR) 10Clément Goubert: [C: 03+1] tegola: temporarily disable maps2006 db [deployment-charts] - 10https://gerrit.wikimedia.org/r/992887 (https://phabricator.wikimedia.org/T355549) (owner: 10Hnowlan) [10:46:02] (03CR) 10Muehlenhoff: Install debmonitor-server on bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/992881 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [10:48:41] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] deployment_server: add dummy oauth2-proxy secrets for jaeger [labs/private] - 10https://gerrit.wikimedia.org/r/992699 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [10:48:42] (03CR) 10Muehlenhoff: [C: 03+2] mariadb::monitor_memory: Update package name [puppet] - 10https://gerrit.wikimedia.org/r/983721 (owner: 10Muehlenhoff) [10:49:24] godog: merging your oauth labs-private patch [10:49:33] done [10:49:42] (03PS1) 10Majavah: systemd: 
timer_service: Move ConditionPathExists to correct section [puppet] - 10https://gerrit.wikimedia.org/r/992888 [10:50:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P55660 and previous config saved to /var/cache/conftool/dbconfig/20240125-105015-marostegui.json [10:50:58] moritzm: thank you! [10:52:34] (03PS1) 10Muehlenhoff: mariabdb::monitor_memory: Also update update name in dependency [puppet] - 10https://gerrit.wikimedia.org/r/992890 [10:52:53] (03PS1) 10Zabe: UserGroupManager: Fix cross-wiki database access [core] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992781 (https://phabricator.wikimedia.org/T355813) [10:53:46] kostajh: sorry I was in meeting. `!log` the end of the backport window is often done yes, that is a good way to broadcast it has completed :) [10:53:55] the alternative is to ask / sync up here [10:54:06] (03CR) 10Muehlenhoff: [C: 03+2] mariabdb::monitor_memory: Also update update name in dependency [puppet] - 10https://gerrit.wikimedia.org/r/992890 (owner: 10Muehlenhoff) [10:54:40] hashar: T.gr mentioned +2'ing some core change, did you end up syncing that? Or was that not for a backport? 
[10:54:50] the cherry pick is in the pipe https://gerrit.wikimedia.org/r/c/mediawiki/core/+/992781 [10:55:03] I think taavi now how to reproduces it [10:55:08] s/now/know/ [10:55:10] I will deploy it [10:55:14] (03CR) 10Hashar: [C: 03+2] UserGroupManager: Fix cross-wiki database access [core] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992781 (https://phabricator.wikimedia.org/T355813) (owner: 10Zabe) [10:57:15] looks like I can check it comparing meta vs frwiki [10:57:17] https://meta.wikimedia.org/wiki/Special:CentralAuth?target=hashar [10:57:23] https://fr.wikipedia.org/wiki/Special:CentralAuth?target=hashar [10:57:27] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit1003.eqiad.wmnet with OS bookworm [10:58:16] (03PS1) 10Muehlenhoff: ganeti: Stop using transition package [puppet] - 10https://gerrit.wikimedia.org/r/992891 [11:00:05] mvolz: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Citoid / Zotero . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T1100). 
[11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T1100) [11:05:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T354336)', diff saved to https://phabricator.wikimedia.org/P55662 and previous config saved to /var/cache/conftool/dbconfig/20240125-110521-marostegui.json [11:05:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance [11:05:27] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [11:05:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance [11:05:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2147.codfw.wmnet with reason: Maintenance [11:05:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2147.codfw.wmnet with reason: Maintenance [11:07:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T354336)', diff saved to https://phabricator.wikimedia.org/P55663 and previous config saved to /var/cache/conftool/dbconfig/20240125-110714-marostegui.json [11:10:23] (03PS1) 10Btullis: varnish: enrich X-Analytics for browser prefetch / prerender / preview [puppet] - 10https://gerrit.wikimedia.org/r/992782 (https://phabricator.wikimedia.org/T346463) [11:11:25] (03PS2) 10Btullis: varnish: enrich X-Analytics for browser prefetch / prerender / preview [puppet] - 10https://gerrit.wikimedia.org/r/992782 (https://phabricator.wikimedia.org/T346463) [11:12:30] (03CR) 10CI reject: [V: 04-1] varnish: enrich X-Analytics for browser prefetch / prerender / preview [puppet] - 10https://gerrit.wikimedia.org/r/992782 (https://phabricator.wikimedia.org/T346463) (owner: 10Btullis) [11:13:10] (03PS3) 10Btullis: 
varnish: enrich X-Analytics for browser prefetch / prerender / preview [puppet] - 10https://gerrit.wikimedia.org/r/992782 (https://phabricator.wikimedia.org/T346463) [11:15:22] (03CR) 10Muehlenhoff: Upstream release v0.3.4 (031 comment) [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/992788 (owner: 10Volans) [11:15:52] (03Merged) 10jenkins-bot: UserGroupManager: Fix cross-wiki database access [core] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/992781 (https://phabricator.wikimedia.org/T355813) (owner: 10Zabe) [11:16:55] (03CR) 10Muehlenhoff: "They all look harmless or are not relevant to us (like the standards version), I'd say we can ignore them and revisit later if/when we aim" [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/992788 (owner: 10Volans) [11:19:33] (03PS1) 10Zabe: Start reading from af_actor/afh_actor in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992894 (https://phabricator.wikimedia.org/T355616) [11:19:53] jouncebot: now [11:19:53] For the next 0 hour(s) and 40 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T1100) [11:19:54] For the next 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T1100) [11:20:06] I am deploying that mediawiki/core patch for CentralAuth [11:20:52] !log hashar@deploy2002 Started scap: Backport for [[gerrit:992781|UserGroupManager: Fix cross-wiki database access (T355813)]] [11:20:58] T355813: CentralAuth doesn't shows user rights correctly - https://phabricator.wikimedia.org/T355813 [11:22:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P55664 and previous config saved to /var/cache/conftool/dbconfig/20240125-112220-marostegui.json [11:22:36] !log hashar@deploy2002 hashar and zabe: Backport for 
[[gerrit:992781|UserGroupManager: Fix cross-wiki database access (T355813)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:23:25] !log hashar@deploy2002 hashar and zabe: Continuing with sync [11:25:24] (03PS2) 10Volans: Upstream release v0.3.4 [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/992788 [11:25:29] (03CR) 10Volans: "addressed comments" [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/992788 (owner: 10Volans) [11:26:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance [11:26:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance [11:26:43] !log Restarting ferm.service on k8s node kubernetes2036.codfw.wmnet - T354855 [11:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:48] T354855: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855 [11:26:55] (03CR) 10Muehlenhoff: "One final bit inline" [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/992788 (owner: 10Volans) [11:27:00] RECOVERY - Check systemd state on kubernetes2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:59] (03PS3) 10Volans: Upstream release v0.3.4 [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/992788 [11:28:01] (03CR) 10Volans: Upstream release v0.3.4 (031 comment) [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/992788 (owner: 10Volans) [11:28:35] I will run the train after lunch [11:29:43] !log hashar@deploy2002 Finished scap: Backport for [[gerrit:992781|UserGroupManager: Fix cross-wiki database access (T355813)]] (duration: 08m 50s) [11:29:48] T355813: CentralAuth doesn't 
shows user rights correctly - https://phabricator.wikimedia.org/T355813 [11:31:34] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/992660 [11:31:51] (03CR) 10Muehlenhoff: [C: 03+1] "Nice, ship it :-)" [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/992788 (owner: 10Volans) [11:33:44] (03CR) 10Vgutierrez: hiera: add acls for heavy ratelimiting abusing ip from list (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: 10Fabfur) [11:33:46] (03PS1) 10Slyngshede: Enable debmonitor service on installation [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/992898 [11:35:21] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10cmooney) p:05Triage→03Medium [11:35:25] (03CR) 10Zabe: [C: 03+2] Start reading from af_actor/afh_actor in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992894 (https://phabricator.wikimedia.org/T355616) (owner: 10Zabe) [11:35:39] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10cmooney) [11:35:45] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [11:36:08] (03Merged) 10jenkins-bot: Start reading from af_actor/afh_actor in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992894 (https://phabricator.wikimedia.org/T355616) (owner: 10Zabe) [11:36:42] !log zabe@deploy2002 Started scap: Backport for [[gerrit:992894|Start reading from af_actor/afh_actor in group0 wikis (T355616)]] [11:36:48] T355616: Start reading from af_actor/afh_actor - https://phabricator.wikimedia.org/T355616 [11:36:52] 
10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 (10cmooney) p:05Triage→03Medium [11:37:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P55665 and previous config saved to /var/cache/conftool/dbconfig/20240125-113727-marostegui.json [11:37:51] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (10cmooney) p:05Triage→03Medium [11:38:01] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (10cmooney) [11:38:11] !log zabe@deploy2002 zabe: Backport for [[gerrit:992894|Start reading from af_actor/afh_actor in group0 wikis (T355616)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:38:17] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 (10cmooney) [11:38:43] !log zabe@deploy2002 zabe: Continuing with sync [11:39:04] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10cmooney) p:05Triage→03Medium [11:39:15] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [11:39:21] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (10cmooney) [11:39:31] (03PS1) 10Alexandros Kosiaris: ipoid: Fix chart default ports 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/992899 (https://phabricator.wikimedia.org/T355167) [11:40:28] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 (10cmooney) [11:40:33] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [11:41:03] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [11:41:09] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (10cmooney) [11:42:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1038.eqiad.wmnet to cluster eqiad and group D [11:42:30] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (10cmooney) p:05Triage→03Medium [11:42:39] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (10cmooney) [11:42:45] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [11:43:28] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (10cmooney) p:05Triage→03Medium [11:43:35] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (10cmooney) 
[11:43:41] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [11:44:22] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1038.eqiad.wmnet to cluster eqiad and group D [11:45:07] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:992894|Start reading from af_actor/afh_actor in group0 wikis (T355616)]] (duration: 08m 25s) [11:45:19] T355616: Start reading from af_actor/afh_actor - https://phabricator.wikimedia.org/T355616 [11:45:22] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/992661 [11:45:23] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A7 from asw-a7-codfw to lsw1-a7-codfw - https://phabricator.wikimedia.org/T355867 (10cmooney) p:05Triage→03Medium [11:45:31] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A7 from asw-a7-codfw to lsw1-a7-codfw - https://phabricator.wikimedia.org/T355867 (10cmooney) [11:45:37] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [11:46:58] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868 (10cmooney) p:05Triage→03Medium [11:47:07] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [11:47:13] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868 (10cmooney) [11:47:49] (03CR) 10Vgutierrez: [C: 04-1] "this doesn't seem to be a bug, see 
I7fb15acdf1c5cd6e6b257d1de82437b33f96fbc3." [puppet] - 10https://gerrit.wikimedia.org/r/991409 (https://phabricator.wikimedia.org/T355158) (owner: 10Fabfur) [11:52:03] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@708f0f3]: (no justification provided) [11:52:10] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2036 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:52:12] 10SRE, 10Infrastructure-Foundations, 10netops: Create netbox script to support moving a cable from one network port to another - https://phabricator.wikimedia.org/T355869 (10cmooney) p:05Triage→03Low [11:52:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T354336)', diff saved to https://phabricator.wikimedia.org/P55666 and previous config saved to /var/cache/conftool/dbconfig/20240125-115233-marostegui.json [11:52:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2155.codfw.wmnet with reason: Maintenance [11:52:40] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [11:52:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2155.codfw.wmnet with reason: Maintenance [11:52:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [11:52:53] (03CR) 10Fabfur: [C: 03+1] "looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/992782 (https://phabricator.wikimedia.org/T346463) (owner: 10Btullis) [11:53:13] 10SRE, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870 (10cmooney) p:05Triage→03Medium [11:53:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance [11:53:20] 10SRE, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870 (10cmooney) [11:53:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T354336)', diff saved to https://phabricator.wikimedia.org/P55667 and previous config saved to /var/cache/conftool/dbconfig/20240125-115322-marostegui.json [11:53:26] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [11:54:31] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B6 from asw-b6-codfw to lsw1-b6-codfw - https://phabricator.wikimedia.org/T355871 (10cmooney) p:05Triage→03Medium [11:54:40] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B6 from asw-b6-codfw to lsw1-b6-codfw - https://phabricator.wikimedia.org/T355871 (10cmooney) [11:54:46] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [11:55:29] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B7 from asw-b7-codfw to lsw1-b7-codfw - https://phabricator.wikimedia.org/T355872 (10cmooney) p:05Triage→03Medium [11:55:40] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW 
devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [11:55:46] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B7 from asw-b7-codfw to lsw1-b7-codfw - https://phabricator.wikimedia.org/T355872 (10cmooney) [11:56:28] (03PS1) 10Hnowlan: installserver: fix disk profiles for new k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/992900 (https://phabricator.wikimedia.org/T354791) [11:56:31] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873 (10cmooney) p:05Triage→03Medium [11:56:42] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [11:56:48] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw - https://phabricator.wikimedia.org/T355873 (10cmooney) [11:57:35] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [12:01:00] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874 (10cmooney) p:05Triage→03Medium [12:01:09] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874 (10cmooney) [12:01:15] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [12:06:36] !log installing openssh security updates [12:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:56] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" 
[puppet] - 10https://gerrit.wikimedia.org/r/992900 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [12:10:10] (03CR) 10Clément Goubert: [C: 03+1] installserver: fix disk profiles for new k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/992900 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [12:12:32] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@708f0f3]: (no justification provided) (duration: 20m 28s) [12:13:07] (03CR) 10Muehlenhoff: Enable debmonitor service on installation (031 comment) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/992898 (owner: 10Slyngshede) [12:19:52] (03PS31) 10Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) [12:21:48] (03PS32) 10Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) [12:23:03] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [12:24:21] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: 10Fabfur) [12:25:14] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:25:55] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/992662 [12:26:23] (03PS5) 10AOkoth: vrts: enable connection pooling [puppet] - 10https://gerrit.wikimedia.org/r/988679 [12:26:24] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [12:26:34] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874 (10cmooney) [12:26:53] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:28:26] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:29:48] * topranks looking [12:30:04] (03PS2) 10Slyngshede: Enable debmonitor service on installation [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/992898 [12:30:13] (03CR) 10Slyngshede: Enable debmonitor service on installation (031 comment) [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/992898 (owner: 10Slyngshede) [12:31:14] topranks: check with hnowlan if it's not the hosts he's working on [12:31:15] <_joe_> jouncebot: nowandnext [12:31:16] No deployments scheduled for the next 0 hour(s) and 28 minute(s) [12:31:16] In 0 hour(s) and 28 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T1300) [12:31:53] 
(KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:32:05] claime: ok will do [12:32:36] hnowlan: alert for BGP on cr in codfw, for hosts mw2395, mw2267 and mw2357 - that related to anything you're doing? [12:32:40] (03CR) 10Muehlenhoff: [C: 03+1] "Ship it" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/992898 (owner: 10Slyngshede) [12:33:02] topranks: yeah, I just drained those hosts [12:33:18] hnowlan: all good, I acked the alert there [12:33:22] thanks [12:34:18] (03CR) 10Slyngshede: [C: 03+2] Enable debmonitor service on installation [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/992898 (owner: 10Slyngshede) [12:34:40] (KubernetesRsyslogDown) firing: (3) rsyslog on mw2267:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:35:04] (03CR) 10Fabfur: hiera: add acls for heavy ratelimiting abusing ip from list (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/989968 (https://phabricator.wikimedia.org/T353910) (owner: 10Fabfur) [12:35:08] me also, acked [12:35:46] (03CR) 10Hnowlan: [C: 03+2] installserver: fix disk profiles for new k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/992900 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [12:37:25] (03Merged) 10jenkins-bot: Enable debmonitor service on installation [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/992898 (owner: 10Slyngshede) [12:38:18] (03Abandoned) 10Muehlenhoff: Bump standards version [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981293 (owner: 10Muehlenhoff) [12:39:40] (03CR) 10Btullis: 
[C: 03+2] varnish: enrich X-Analytics for browser prefetch / prerender / preview [puppet] - 10https://gerrit.wikimedia.org/r/992782 (https://phabricator.wikimedia.org/T346463) (owner: 10Btullis) [12:41:15] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2357.codfw.wmnet with OS bullseye [12:41:30] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2395.codfw.wmnet with OS bullseye [12:41:47] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2267.codfw.wmnet with OS bullseye [12:42:25] (03CR) 10Muehlenhoff: "Ack, I'll do that now." [puppet] - 10https://gerrit.wikimedia.org/r/989090 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [12:43:16] I will promote the wikis at 13:00 UTC (17 minutes from now) [12:47:23] (03CR) 10Muehlenhoff: [C: 03+2] eventschemas: Select the custom nginx provider with no additional modules [puppet] - 10https://gerrit.wikimedia.org/r/989090 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [12:53:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T354336)', diff saved to https://phabricator.wikimedia.org/P55669 and previous config saved to /var/cache/conftool/dbconfig/20240125-125353-marostegui.json [12:54:00] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [12:57:41] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2395.codfw.wmnet with reason: host reimage [12:58:29] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2357.codfw.wmnet with reason: host reimage [12:58:57] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2267.codfw.wmnet with reason: host reimage [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T1300) [13:01:57] (03CR) 10Majavah: "A rather large PCC run can be seen here: 
https://puppet-compiler.wmflabs.org/output/992888/1210/" [puppet] - 10https://gerrit.wikimedia.org/r/992888 (owner: 10Majavah) [13:02:13] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2395.codfw.wmnet with reason: host reimage [13:02:21] !log draining VMs from ganeti2021 ahead of codfw rack b5 maintenance T355549 [13:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:27] T355549: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 [13:02:53] !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2021.codfw.wmnet [13:04:12] (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992927 (https://phabricator.wikimedia.org/T354433) [13:04:14] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.42.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992927 (https://phabricator.wikimedia.org/T354433) (owner: 10TrainBranchBot) [13:04:58] (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992927 (https://phabricator.wikimedia.org/T354433) (owner: 10TrainBranchBot) [13:05:00] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2267.codfw.wmnet with reason: host reimage [13:08:18] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2357.codfw.wmnet with reason: host reimage [13:09:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P55670 and previous config saved to /var/cache/conftool/dbconfig/20240125-130900-marostegui.json [13:12:46] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.15 refs T354433 [13:13:11] T354433: 1.42.0-wmf.15 deployment blockers - 
https://phabricator.wikimedia.org/T354433 [13:14:38] PROBLEM - MariaDB Replica SQL: s6 #page on db2129 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1054, Errmsg: Error Unknown column user_is_temp in field list on query. Default database: jawiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:15:03] (03CR) 10Ayounsi: [C: 03+1] ganeti: Stop using transition package [puppet] - 10https://gerrit.wikimedia.org/r/992891 (owner: 10Muehlenhoff) [13:15:16] Amir1: ^ [13:15:24] mmmm [13:15:25] is it only 1 host? [13:15:28] scap failed to sync on mw2267 mw2395 and mw2357, I am assuming they are being reimaged [13:15:31] let me check [13:15:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2129', diff saved to https://phabricator.wikimedia.org/P55671 and previous config saved to /var/cache/conftool/dbconfig/20240125-131547-marostegui.json [13:15:50] both of those are listed in T354791 [13:15:51] depooled for now [13:15:56] thanks, marostegui [13:16:00] (reclaiming jobrunners for k8s) [13:16:01] if it is 1 host, no biggie [13:16:05] T354791: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 [13:16:18] it is the candidate master [13:16:23] uf [13:16:31] * topranks here [13:16:44] 10SRE-Sprint-Week-Sustainability-March2023, 10Beta-Cluster-Infrastructure, 10DBA, 10MediaWiki-libs-Rdbms, 10Epic: Enable MariaDB/MySQL's Strict Mode - https://phabricator.wikimedia.org/T108255 (10Reedy) [13:16:50] *all three of the hosts that hashar listed [13:16:53] I also promoted the wikis a few minutes which might cause various issues [13:16:53] * Lucas_WMDE shuts up while dbas talk [13:17:04] marostegui: Made T355885 [13:17:05] T355885: replication broken on db2129 - https://phabricator.wikimedia.org/T355885 [13:17:14] k [13:17:15] Lucas_WMDE: thank you to have verified! 
:) [13:17:24] ah, I know what's going on [13:17:35] I pinged Amir because he was the person most likely to know about it [13:17:35] I'll fix it [13:17:47] and it seems I wasn't wrong :-D [13:18:07] It might be a leftover of https://phabricator.wikimedia.org/T336886 [13:18:37] If mw looks sane we can de-escalate the response and let you handle it on the ticket, not in a rush? [13:18:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance [13:18:46] yup [13:18:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance [13:18:56] I'm sure I ran --check everywhere [13:19:09] The schema change is running [13:19:27] Amir1: can you check frwiki, ruwiki and labswiki as well on that host? [13:19:32] RECOVERY - MariaDB Replica SQL: s6 #page on db2129 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:19:42] marostegui: the script does that automatically [13:19:48] Amir1: ok [13:19:53] although discuss with hashar, as that could have been the trigger (deployment) [13:20:03] Amir1: all clean? Can I repool? [13:20:11] yes, the patch that starts writing to it got merged in this train [13:20:18] marostegui: yes, thanks! 
[13:20:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 10%: After T355885', diff saved to https://phabricator.wikimedia.org/P55672 and previous config saved to /var/cache/conftool/dbconfig/20240125-132043-root.json [13:21:16] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2395.codfw.wmnet with OS bullseye [13:21:17] I run check again [13:21:45] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:24:04] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2267.codfw.wmnet with OS bullseye [13:24:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P55673 and previous config saved to /var/cache/conftool/dbconfig/20240125-132407-marostegui.json [13:24:45] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 241, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:24:55] (03CR) 10Volans: [C: 03+2] Upstream release v0.3.4 [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/992788 (owner: 10Volans) [13:25:01] !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2021.codfw.wmnet [13:25:03] !log stopping logstash service on logstash2025 to faciliate VM migration T355549 [13:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:09] T355549: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 [13:26:18] (03PS1) 10Slyngshede: Bump version number for Debian package release to 0.4.0. 
[software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/992928 [13:26:21] !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2021.codfw.wmnet [13:26:25] !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2021.codfw.wmnet [13:26:58] !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2021.codfw.wmnet [13:27:03] !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2021.codfw.wmnet [13:27:12] hashar: for the sake of the train, I checked everything again (except s3 because checking every db in every replica gonna take at least six hours) and it was fine [13:27:33] <3 [13:27:41] Sorry for the mess :D [13:27:50] !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet [13:28:01] !log draining VMs from ganeti2022 ahead of codfw rack b5 maintenance T355549 [13:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:17] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2357.codfw.wmnet with OS bullseye [13:28:40] (03CR) 10Muehlenhoff: [C: 03+1] Bump version number for Debian package release to 0.4.0. 
[software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/992928 (owner: 10Slyngshede) [13:28:46] (03CR) 10Ayounsi: "Let's add it to the pile of things to check after upgrading Netbox :)" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/985113 (https://phabricator.wikimedia.org/T303529) (owner: 10Ayounsi) [13:29:27] Amir1: as long as the mess is handled by someone, I am all fine with eggs being broken [13:29:39] (something like that, I don't know how to translate the french idiom I have in mind) [13:29:41] Thanks <3 [13:29:53] we should invent our own idioms [13:30:09] "don't put the carrot in the fridge when your DBA have an umbrella" [13:30:17] (03CR) 10Majavah: [C: 03+1] "+1. This means that any tools doing something stupid like `var_dump( $_SERVER );` will now leak their database credentials but that feels " [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/988498 (https://phabricator.wikimedia.org/T354320) (owner: 10David Caro) [13:32:05] !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet [13:32:49] (03CR) 10David Caro: [C: 03+2] lighthttpd: don't remove environment vars [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/988498 (https://phabricator.wikimedia.org/T354320) (owner: 10David Caro) [13:32:59] * hashar whistles at LiquidThreads [13:33:21] ¡log Undeploying LiquidThreads [13:33:42] ooh, that'd be nice :D [13:33:45] if only [13:33:52] or even Flow [13:34:15] they are both in the pipes as I got it [13:34:43] I was almost going to file my Annual Planning Santa wishlist asking for both to be prioritized for decommissioned [13:34:49] decommissionment [13:34:55] well something like that [13:35:24] and we have: https://phabricator.wikimedia.org/T350164 `[Spike] Investigate Undeploying LiquidThreads` [13:35:38] and https://phabricator.wikimedia.org/T332022 `[Epic] Undeploying StructuredDiscussions (Flow)` [13:35:48] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 25%: After T355885', diff saved to https://phabricator.wikimedia.org/P55674 and previous config saved to /var/cache/conftool/dbconfig/20240125-133547-root.json [13:36:01] T355885: replication broken on db2129 - https://phabricator.wikimedia.org/T355885 [13:38:51] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [13:39:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T354336)', diff saved to https://phabricator.wikimedia.org/P55675 and previous config saved to /var/cache/conftool/dbconfig/20240125-133913-marostegui.json [13:39:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2172.codfw.wmnet with reason: Maintenance [13:39:24] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [13:39:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2172.codfw.wmnet with reason: Maintenance [13:39:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T354336)', diff saved to https://phabricator.wikimedia.org/P55676 and previous config saved to /var/cache/conftool/dbconfig/20240125-133935-marostegui.json [13:40:06] (03Merged) 10jenkins-bot: Upstream release v0.3.4 [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/992788 (owner: 10Volans) [13:40:10] (03Merged) 10jenkins-bot: lighthttpd: don't remove environment vars [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/988498 (https://phabricator.wikimedia.org/T354320) (owner: 10David Caro) [13:41:48] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db2172 (T354336)', diff saved to https://phabricator.wikimedia.org/P55677 and previous config saved to /var/cache/conftool/dbconfig/20240125-134147-marostegui.json [13:43:56] !log btullis@cumin1002 START - Cookbook sre.druid.roll-restart-workers for Druid test cluster: Roll restart of Druid jvm daemons. [13:47:48] !log uploaded debmonitor-client_0.3.4 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia,bookworm-wikimedia [13:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:54] moritzm: ^^^ [13:48:56] thanks, will take care of the rollout in a bit [13:50:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 50%: After T355885', diff saved to https://phabricator.wikimedia.org/P55678 and previous config saved to /var/cache/conftool/dbconfig/20240125-135052-root.json [13:51:17] T355885: replication broken on db2129 - https://phabricator.wikimedia.org/T355885 [13:53:18] !log btullis@cumin1002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid test cluster: Roll restart of Druid jvm daemons. [13:56:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P55679 and previous config saved to /var/cache/conftool/dbconfig/20240125-135653-marostegui.json [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T1400). [14:00:05] anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:03:53] hm, I don’t see anything in https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T1400 [14:03:57] PROBLEM - Check systemd state on ml-serve2005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:04] ah, https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=prev&oldid=2142737 [14:05:40] !log btullis@cumin1002 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. [14:05:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 75%: After T355885', diff saved to https://phabricator.wikimedia.org/P55680 and previous config saved to /var/cache/conftool/dbconfig/20240125-140557-root.json [14:06:04] T355885: replication broken on db2129 - https://phabricator.wikimedia.org/T355885 [14:08:50] (PS1) Jdlrobson: Begin capturing errors for Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/992931 [14:12:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P55681 and previous config saved to /var/cache/conftool/dbconfig/20240125-141200-marostegui.json [14:15:30] !log failover ganeti master for codfw to ganeti2020 T355549 [14:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:35] T355549: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 [14:15:40] (PS3) Zabe: foreachwikiindblist: Return early when no arg is passed [puppet] - https://gerrit.wikimedia.org/r/992263 [14:17:05] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve2005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:18:04] !log
btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. [14:19:31] PROBLEM - ganeti-wconfd running on ganeti2022 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:21:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2129 (re)pooling @ 100%: After T355885', diff saved to https://phabricator.wikimedia.org/P55682 and previous config saved to /var/cache/conftool/dbconfig/20240125-142102-root.json [14:21:03] !log Draining kubernetes2031 - T355549 [14:21:09] T355885: replication broken on db2129 - https://phabricator.wikimedia.org/T355885 [14:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:15] T355549: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 [14:22:13] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:22:33] RECOVERY - BGP status on lsw1-b2-codfw.mgmt is OK: BGP OK - up: 2, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:23:51] !log Draining kubernetes2032 - T355549 [14:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:13] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:29] ^probably me, will relaunch once done [14:25:38] !log Draining kubernetes2033 - T355549 [14:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:55] !log Draining kubernetes2023 - T355549 [14:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:26:02] With the right node, better. [14:26:21] !log installing debmonitor-client 0.3.4 fleet-wide [14:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T354336)', diff saved to https://phabricator.wikimedia.org/P55683 and previous config saved to /var/cache/conftool/dbconfig/20240125-142706-marostegui.json [14:27:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2179.codfw.wmnet with reason: Maintenance [14:27:12] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [14:27:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2179.codfw.wmnet with reason: Maintenance [14:27:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2179 (T354336)', diff saved to https://phabricator.wikimedia.org/P55684 and previous config saved to /var/cache/conftool/dbconfig/20240125-142729-marostegui.json [14:28:53] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:55] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes2058.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2026.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2052.codfw.wmnet, kubernetes2028.codfw.wmnet, kubernetes2059.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2049.codfw.wmnet, kubernetes2040.codfw.wmnet, ku [14:29:55] 2041.codfw.wmnet, kubernetes2037.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2035.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:30:35] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - 
Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:30:55] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:34:38] !log Depooling parse2006 (setting inactive) - T355549 [14:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:47] T355549: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 [14:34:56] !log cgoubert@cumin2002 conftool action : set/pooled=inactive; selector: name=parse2006.codfw.wmnet [14:35:13] !log Depooling parse2007 (setting inactive) - T355549 [14:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:20] !log cgoubert@cumin2002 conftool action : set/pooled=inactive; selector: name=parse2007.codfw.wmnet [14:37:04] ops-codfw, DC-Ops, Data-Platform-SRE (2024.01.22 - 2024.02.11): Unable to reimage elastic2088 and elastic2094 to bullseye - https://phabricator.wikimedia.org/T355830 (bking) [14:39:21] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:50:14] SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Jclark-ctr) Submitted all new tsr reports along with smartctl data [14:54:33] (PS1) Ssingh: wikimedia.org: add DKIM selectors for store.wm.org [dns] - https://gerrit.wikimedia.org/r/992936 (https://phabricator.wikimedia.org/T355835) [14:59:22] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:05:34] RECOVERY - Check systemd state on ml-serve2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:22] SRE, Data Products (Data Products Sprint 08): Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (xcollazo) [15:10:05] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: wdqs::internal [15:10:14] RECOVERY - Disk space on stat1005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1005&var-datasource=eqiad+prometheus/ops [15:12:33] (PS1) Muehlenhoff: Switch wdqs::internal to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/992940 (https://phabricator.wikimedia.org/T349619) [15:12:49] (PS2) Muehlenhoff: Switch wdqs::internal to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/992940 (https://phabricator.wikimedia.org/T349619) [15:14:31] (CR) Hnowlan: [C: +2] tegola: temporarily disable maps2006 db [deployment-charts] - https://gerrit.wikimedia.org/r/992887 (https://phabricator.wikimedia.org/T355549) (owner: Hnowlan) [15:15:07] (PS1) Zabe: Start reading from af_actor/afh_actor in group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/992942 (https://phabricator.wikimedia.org/T355616) [15:15:09] (CR) Muehlenhoff: [C: +2] Switch wdqs::internal to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/992940 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [15:15:26] (Merged) jenkins-bot: tegola: temporarily disable maps2006 db [deployment-charts] - https://gerrit.wikimedia.org/r/992887 (https://phabricator.wikimedia.org/T355549) (owner: Hnowlan) [15:17:34] RECOVERY - Check whether ferm is active by checking
the default input chain on ml-serve2005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:18:57] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [15:19:04] (CR) Slyngshede: [C: +2] Bump version number for Debian package release to 0.4.0. [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/992928 (owner: Slyngshede) [15:19:17] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [15:20:33] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2006.cofw.wmnet [15:20:43] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (Jclark-ctr) updated system settings server is back up now [15:20:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wdqs::internal [15:21:48] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff) [15:22:03] (PS5) Ssingh: P:dns::auth: add support for depooling authdns via confd [puppet] - https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) [15:22:05] (Merged) jenkins-bot: Bump version number for Debian package release to 0.4.0.
[software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/992928 (owner: Slyngshede) [15:23:36] (CR) Ssingh: [V: +1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh) [15:25:38] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: wcqs::public [15:25:52] (CR) Ssingh: P:dns::auth: add support for depooling authdns via confd (3 comments) [puppet] - https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh) [15:27:07] (PS1) Muehlenhoff: Switch wcqs::public to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/992966 (https://phabricator.wikimedia.org/T349619) [15:28:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T354336)', diff saved to https://phabricator.wikimedia.org/P55687 and previous config saved to /var/cache/conftool/dbconfig/20240125-152801-marostegui.json [15:28:17] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [15:28:28] (CR) Muehlenhoff: [C: +2] Switch wcqs::public to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/992966 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [15:29:13] (CR) Slyngshede: Install debmonitor-server on bookworm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/992881 (https://phabricator.wikimedia.org/T241049) (owner: Muehlenhoff) [15:29:48] (CR) Slyngshede: [C: +1] "The Debian package has been updated, so the existing patch is good to go" [puppet] - https://gerrit.wikimedia.org/r/992881 (https://phabricator.wikimedia.org/T241049) (owner: Muehlenhoff) [15:33:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wcqs::public [15:38:38] (PS1)
Muehlenhoff: Remove obsolete Hiera entries [puppet] - https://gerrit.wikimedia.org/r/992967 (https://phabricator.wikimedia.org/T354959) [15:39:03] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff) [15:39:23] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (Andrew) thanks! Let's let this sit w/out workload for a week or so and see if stays up, then we can try giving it some work to do. [15:43:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P55688 and previous config saved to /var/cache/conftool/dbconfig/20240125-154307-marostegui.json [15:43:15] (CR) Muehlenhoff: [C: +2] Install debmonitor-server on bookworm [puppet] - https://gerrit.wikimedia.org/r/992881 (https://phabricator.wikimedia.org/T241049) (owner: Muehlenhoff) [15:46:04] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on asw-b-codfw,lsw1-b5-codfw.mgmt with reason: prepping for server uplink migration [15:46:31] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on asw-b-codfw,lsw1-b5-codfw.mgmt with reason: prepping for server uplink migration [15:46:40] SRE, ops-codfw, Data-Persistence, Infrastructure-Foundations, netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=34ae871a-7149-43dd-8180-02ddd5b8c983) set by...
[15:46:57] !log configuring lsw1-b5-codfw switch ports for servers to be moved T355549 [15:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:02] T355549: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 [15:47:42] (PS1) Stevemunene: Remove dummy-keytabs for decommissioned druid hosts [labs/private] - https://gerrit.wikimedia.org/r/992968 (https://phabricator.wikimedia.org/T336043) [15:49:50] (CR) Alexandros Kosiaris: [C: +1] deployment_server: add dummy oauth2-proxy secrets for jaeger [labs/private] - https://gerrit.wikimedia.org/r/992699 (https://phabricator.wikimedia.org/T320555) (owner: Filippo Giunchedi) [15:50:03] (CR) Alexandros Kosiaris: [C: +1] "Done" [deployment-charts] - https://gerrit.wikimedia.org/r/984143 (https://phabricator.wikimedia.org/T320555) (owner: Filippo Giunchedi) [15:50:57] SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977 (cmooney) Just an update here, the restriction still exists however I think I know how I went wrong. In order for the irb interface to be "up" the associated vlan ne...
[15:51:58] !log disabling puppet fleet-wide to allow for maintenance in codfw rack b5 which hosts puppetmaster2003 T355549 [15:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:12] T355549: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 [15:54:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'preparing to clone db2169 on db2196 as per TT343674', diff saved to https://phabricator.wikimedia.org/P55689 and previous config saved to /var/cache/conftool/dbconfig/20240125-155450-arnaudb.json [15:56:52] (CR) Muehlenhoff: [C: +1] "LGTM" [labs/private] - https://gerrit.wikimedia.org/r/992968 (https://phabricator.wikimedia.org/T336043) (owner: Stevemunene) [15:57:12] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:30:00 on cr[1-2]-codfw with reason: prepping for server uplink migration [15:57:28] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on cr[1-2]-codfw with reason: prepping for server uplink migration [15:57:37] SRE, ops-codfw, Data-Persistence, Infrastructure-Foundations, netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e2f0518c-1df7-4528-89a1-5f2b248a7520) set by...
[15:58:02] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:30:00 on 32 hosts with reason: Migrating servers in codfw rack B5 to lsw1-b5-codfw T355549 [15:58:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P55690 and previous config saved to /var/cache/conftool/dbconfig/20240125-155813-marostegui.json [15:58:17] T355549: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 [15:58:31] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on 32 hosts with reason: Migrating servers in codfw rack B5 to lsw1-b5-codfw T355549 [16:02:22] PROBLEM - Disk space on stat1005 is CRITICAL: DISK CRITICAL - free space: / 1959 MB (2% inode=83%): /tmp 1959 MB (2% inode=83%): /var/tmp 1959 MB (2% inode=83%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1005&var-datasource=eqiad+prometheus/ops [16:03:23] !log Network maintenance codfw rack b5 underway T355549 [16:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:48] T355549: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 [16:10:16] PROBLEM - Host ml-staging-ctrl2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:13:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T354336)', diff saved to https://phabricator.wikimedia.org/P55691 and previous config saved to /var/cache/conftool/dbconfig/20240125-161320-marostegui.json [16:13:49] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [16:14:09] (KubernetesCalicoDown) firing: ml-staging-ctrl2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlstaging&var-instance=ml-staging-ctrl2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:14:21] (ProbeDown) firing: (2) Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:14:22] (JobUnavailable) firing: Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:15:44] (PS1) Hnowlan: kubernetes: make 5 jobrunners kubernetes workers [puppet] - https://gerrit.wikimedia.org/r/992973 (https://phabricator.wikimedia.org/T354791) [16:16:48] (PS1) Ebernhardson: cirrus: Disable cloudelastic writes to testwiki and mw.org [mediawiki-config] - https://gerrit.wikimedia.org/r/992974 (https://phabricator.wikimedia.org/T352335) [16:17:28] (CR) CI reject: [V: -1] cirrus: Disable cloudelastic writes to testwiki and mw.org [mediawiki-config] - https://gerrit.wikimedia.org/r/992974 (https://phabricator.wikimedia.org/T352335) (owner: Ebernhardson) [16:19:09] (CR) Ebernhardson: [C: +2] cirrus updater: Align consumer-devnull with deployment [deployment-charts] - https://gerrit.wikimedia.org/r/992806 (owner: Ebernhardson) [16:19:54] (Abandoned) Ebernhardson: cirrus updater: Remove consumer start time override [deployment-charts] - https://gerrit.wikimedia.org/r/975321 (owner: Ebernhardson) [16:20:23] (Merged) jenkins-bot: cirrus updater: Align consumer-devnull with deployment [deployment-charts] - https://gerrit.wikimedia.org/r/992806 (owner: Ebernhardson) [16:24:07]
(PS1) Jgiannelos: mobileapps: Use core /page/html output in all envs [deployment-charts] - https://gerrit.wikimedia.org/r/992975 [16:26:53] (PS2) Jgiannelos: mobileapps: Use core /page/html output in all envs [deployment-charts] - https://gerrit.wikimedia.org/r/992975 (https://phabricator.wikimedia.org/T339865) [16:27:15] SRE, Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (xcollazo) [16:28:04] (PS1) Hnowlan: Revert "tegola: temporarily disable maps2006 db" [deployment-charts] - https://gerrit.wikimedia.org/r/992986 [16:28:11] (PS2) Ebernhardson: cirrus updater: Expand test deployment to prod+cloudelastic [deployment-charts] - https://gerrit.wikimedia.org/r/979147 (https://phabricator.wikimedia.org/T352335) [16:29:36] !log uncordoning kubernetes2031 - T355549 [16:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:54] T355549: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 [16:30:11] (CR) BCornwall: [C: +1] wikimedia.org: add DKIM selectors for store.wm.org [dns] - https://gerrit.wikimedia.org/r/992936 (https://phabricator.wikimedia.org/T355835) (owner: Ssingh) [16:31:50] RECOVERY - Host ml-staging-ctrl2001 is UP: PING OK - Packet loss = 0%, RTA = 56.51 ms [16:32:31] !log uncordoning kubernetes2032 - T355549 [16:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:41] !log uncordoning kubernetes2023 - T355549 [16:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:34] !log repooling parse2006 - T355549 [16:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:44] !log cgoubert@cumin2002 conftool action : set/pooled=yes; selector: name=parse2006.codfw.wmnet [16:34:09] (KubernetesCalicoDown) resolved: ml-staging-ctrl2001.codfw.wmnet
is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlstaging&var-instance=ml-staging-ctrl2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:34:15] !log repooling parse2007 - T355549 [16:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:20] (ProbeDown) resolved: (2) Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:34:21] (JobUnavailable) resolved: Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:23] !log cgoubert@cumin2002 conftool action : set/pooled=yes; selector: name=parse2007.codfw.wmnet [16:35:55] SRE, ops-codfw, Data-Persistence, Infrastructure-Foundations, netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (cmooney) Migration done! Serious props to @papaul and @Jhancock.wm for the smooth and super-fast execution!...
[16:36:12] (CR) Majavah: [C: +2] wikimediacloud.org: Move RabbitMQ traffic to cloudrabbit1003 [dns] - https://gerrit.wikimedia.org/r/992884 (https://phabricator.wikimedia.org/T345610) (owner: Majavah) [16:38:57] (PS5) Alexandros Kosiaris: Switch canaries to 0.1% OpenTelemetry sampling [puppet] - https://gerrit.wikimedia.org/r/984814 (https://phabricator.wikimedia.org/T351566) [16:40:16] (PS6) Alexandros Kosiaris: Switch canaries to 0.1% OpenTelemetry sampling [puppet] - https://gerrit.wikimedia.org/r/984814 (https://phabricator.wikimedia.org/T351566) [16:41:56] !log cmooney@cumin1002 START - Cookbook sre.hosts.remove-downtime for cr[1-2]-codfw [16:41:57] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cr[1-2]-codfw [16:42:49] !log cmooney@cumin1002 START - Cookbook sre.hosts.remove-downtime for 32 hosts [16:43:02] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 32 hosts [16:43:12] PROBLEM - cassandra-b SSL 10.192.16.83:7000 on restbase2013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:43:20] PROBLEM - cassandra-a service on restbase2013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:43:44] PROBLEM - cassandra-c service on restbase2013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:43:48] PROBLEM - cassandra-b CQL 10.192.16.83:9042 on restbase2013 is CRITICAL: connect to address 10.192.16.83 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:43:50] PROBLEM - cassandra-b service on restbase2013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:44:22] PROBLEM - cassandra-c SSL 10.192.16.84:7000 on restbase2013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:44:36] PROBLEM - cassandra-c CQL 10.192.16.84:9042 on restbase2013 is CRITICAL: connect to address 10.192.16.84 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:45:09] (Abandoned) Andrew Bogott: base: puppet_alert: don't advertise the disable file [puppet] - https://gerrit.wikimedia.org/r/868221 (owner: Majavah) [16:48:01] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase2013.codfw.wmnet with reason: Decommissioning — T352469 [16:48:17] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase2013.codfw.wmnet with reason: Decommissioning — T352469 [16:48:19] T352469: Decommission restbase20[13-20]) - https://phabricator.wikimedia.org/T352469 [16:48:34] !log taavi@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudrabbit[1001-1002].wikimedia.org [16:48:45] (CR) DCausse: [C: +1] cirrus updater: Expand test deployment to prod+cloudelastic [deployment-charts] - https://gerrit.wikimedia.org/r/979147 (https://phabricator.wikimedia.org/T352335) (owner: Ebernhardson) [16:49:08] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase2014.codfw.wmnet with reason: Decommissioning — T352469 [16:49:12] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase2014.codfw.wmnet with reason: Decommissioning — T352469 [16:51:23] (CR) Ssingh: [C: +2] wikimedia.org: add DKIM selectors for store.wm.org [dns] - https://gerrit.wikimedia.org/r/992936 (https://phabricator.wikimedia.org/T355835) (owner: Ssingh) [16:51:37] (PS2) Ssingh:
wikimedia.org: add DKIM selectors for store.wm.org [dns] - https://gerrit.wikimedia.org/r/992936 (https://phabricator.wikimedia.org/T355835) [16:52:33] (CR) Hnowlan: [C: +2] Revert "tegola: temporarily disable maps2006 db" [deployment-charts] - https://gerrit.wikimedia.org/r/992986 (owner: Hnowlan) [16:52:57] !log running authdns-update for CR 992936: T355835 [16:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:13] T355835: Ensure that store.wikimedia.org complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355835 [16:53:25] (Merged) jenkins-bot: Revert "tegola: temporarily disable maps2006 db" [deployment-charts] - https://gerrit.wikimedia.org/r/992986 (owner: Hnowlan) [16:56:01] SRE, serviceops, SecTeam-Processed, Security, Vuln-Misconfiguration: Helm Chart misconfigurations - https://phabricator.wikimedia.org/T355167 (sbassett) In progress→Resolved p:Triage→Low [16:56:30] SRE, serviceops, SecTeam-Processed, Security, Vuln-Misconfiguration: Helm Chart misconfigurations - https://phabricator.wikimedia.org/T355167 (sbassett) Resolved→In progress Whoops, I'll leave it in progress until the patches are actually merged/deployed. [16:56:51] !log taavi@cumin1002 START - Cookbook sre.dns.netbox [16:57:37] SRE, ops-codfw, Data-Persistence, Infrastructure-Foundations, netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (klausman) Nice work. On our machine (ml-serve2002), it was but four seconds: `[Thu Jan 25 16:09:14 2024] tg3...
[16:58:40] (03PS3) 10Ebernhardson: cirrus updater: Expand test deployment to prod+cloudelastic [deployment-charts] - 10https://gerrit.wikimedia.org/r/979147 (https://phabricator.wikimedia.org/T352335) [16:59:19] (03CR) 10Stevemunene: [V: 03+2 C: 03+2] Remove dummy-keytabs for decommissioned druid hosts [labs/private] - 10https://gerrit.wikimedia.org/r/992968 (https://phabricator.wikimedia.org/T336043) (owner: 10Stevemunene) [16:59:28] 10SRE-OnFire, 10Znuny, 10collaboration-services: ticket.wikimedia.org should page when down - https://phabricator.wikimedia.org/T354479 (10LSobanski) a:05Jelto→03LSobanski Claiming this as it's a process / SLA question for the time being. [17:00:05] jhathaway and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:22] (03CR) 10Clément Goubert: [C: 03+1] kubernetes: make 5 jobrunners kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/992973 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [17:00:53] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudrabbit[1001-1002].wikimedia.org decommissioned, removing all IPs except the asset tag one - taavi@cumin1002" [17:01:06] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [17:01:23] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [17:01:34] (03PS1) 10Btullis: Update the datahub containers to pick up new JRE [deployment-charts] - 10https://gerrit.wikimedia.org/r/992980 (https://phabricator.wikimedia.org/T354273) [17:03:22] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Expand test deployment to prod+cloudelastic [deployment-charts] - 10https://gerrit.wikimedia.org/r/979147 (https://phabricator.wikimedia.org/T352335) 
(owner: 10Ebernhardson) [17:04:15] (03Merged) 10jenkins-bot: cirrus updater: Expand test deployment to prod+cloudelastic [deployment-charts] - 10https://gerrit.wikimedia.org/r/979147 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [17:04:19] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudrabbit[1001-1002].wikimedia.org decommissioned, removing all IPs except the asset tag one - taavi@cumin1002" [17:04:19] !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:04:20] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudrabbit[1001-1002].wikimedia.org [17:04:33] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by taavi@cumin1002 for hosts: `cloudrabbit[1001-1002].wikimedia.org` - cloudrabbit100... [17:05:24] (03CR) 10Btullis: [C: 03+2] Update the datahub containers to pick up new JRE [deployment-charts] - 10https://gerrit.wikimedia.org/r/992980 (https://phabricator.wikimedia.org/T354273) (owner: 10Btullis) [17:05:40] 10SRE, 10DNS, 10Foundational Technology Requests, 10Traffic, 10Patch-For-Review: Ensure that store.wikimedia.org complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355835 (10ssingh) @bcampbell: The changes have been merged, please try the authenticate domain part now.... 
[17:05:43] !log taavi@cumin1002 START - Cookbook sre.dns.netbox [17:06:20] (03Merged) 10jenkins-bot: Update the datahub containers to pick up new JRE [deployment-charts] - 10https://gerrit.wikimedia.org/r/992980 (https://phabricator.wikimedia.org/T354273) (owner: 10Btullis) [17:06:34] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (10taavi) [17:07:07] !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:09:22] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:09:39] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:13:03] 10SRE, 10Cumin, 10Infrastructure-Foundations: Feature request: When cumin is running with -b (and -s), it should display the current host being affected - https://phabricator.wikimedia.org/T355811 (10Volans) p:05Triage→03Medium [17:14:32] jouncebot: nowandnext [17:14:32] For the next 0 hour(s) and 45 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T1700) [17:14:32] In 0 hour(s) and 45 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T1800) [17:14:32] In 0 hour(s) and 45 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T1800) [17:17:05] (03PS1) 10Ebernhardson: flink-operator: Add cirrus-streaming-updater to prod watched namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/992983 (https://phabricator.wikimedia.org/T352335) [17:17:29] !log btullis@deploy2002 helmfile [staging] START helmfile.d/services/datahub: apply on main [17:19:08] (03CR) 10DCausse: [C: 03+1] flink-operator: Add cirrus-streaming-updater to prod watched namespaces 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/992983 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [17:20:10] (03CR) 10Bking: [C: 03+1] flink-operator: Add cirrus-streaming-updater to prod watched namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/992983 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [17:21:39] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum6001.drmrs.wmnet with OS bookworm [17:22:11] ^ BGP alerts expected in drmrs [17:22:34] !log btullis@deploy2002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [17:22:47] !log btullis@deploy2002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [17:25:51] 10SRE, 10DNS, 10Foundational Technology Requests, 10Traffic, 10Patch-For-Review: Ensure that store.wikimedia.org complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355835 (10bcampbell) @ssingh Thank you, I just initiated the process, which Shopify says may take 24 hou... [17:26:09] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:26:11] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:27:47] 10SRE, 10DNS, 10Foundational Technology Requests, 10Traffic: Ensure that wikimediafoundation.myshopify.com complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355833 (10jhathaway) @bcampbell I assume the intent is to allow shopify to dkim sign their mail with keys we adv... 
[17:30:28] !log deploying new captchas (T141490) [17:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:48] T141490: Deploy improved FancyCaptcha - https://phabricator.wikimedia.org/T141490 [17:33:09] !log btullis@deploy2002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [17:34:07] !log btullis@deploy2002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [17:38:28] 10SRE, 10DNS, 10Foundational Technology Requests, 10Traffic: Ensure that wikimediafoundation.myshopify.com complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355833 (10ssingh) >>! In T355833#9489071, @jhathaway wrote: > @bcampbell I assume the intent is to allow shopify... [17:38:49] !log btullis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [17:38:51] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [17:40:17] (03PS2) 10Ladsgroup: mediawiki: Use the new captcha [puppet] - 10https://gerrit.wikimedia.org/r/990697 (https://phabricator.wikimedia.org/T141490) [17:40:57] (03PS3) 10Ladsgroup: mediawiki: Use the new captcha [puppet] - 10https://gerrit.wikimedia.org/r/990697 (https://phabricator.wikimedia.org/T141490) [17:43:59] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage [17:44:52] (03PS4) 10Ladsgroup: mediawiki: Use the new captcha [puppet] - 10https://gerrit.wikimedia.org/r/990697 (https://phabricator.wikimedia.org/T141490) [17:44:57] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mediawiki: Use the new captcha [puppet] - 10https://gerrit.wikimedia.org/r/990697 (https://phabricator.wikimedia.org/T141490) (owner: 
10Ladsgroup) [17:45:14] !log cmooney@cumin1002 START - Cookbook sre.hosts.remove-downtime for asw-b-codfw,lsw1-b5-codfw.mgmt [17:45:14] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for asw-b-codfw,lsw1-b5-codfw.mgmt [17:46:24] (03PS5) 10Reedy: mediawiki: Replace deprecated blacklist parameter in captchaloop [puppet] - 10https://gerrit.wikimedia.org/r/774940 (https://phabricator.wikimedia.org/T277936) [17:47:16] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage [17:48:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2159 (re)pooling @ 10%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55692 and previous config saved to /var/cache/conftool/dbconfig/20240125-174803-root.json [17:48:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2109 (re)pooling @ 10%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55693 and previous config saved to /var/cache/conftool/dbconfig/20240125-174819-root.json [17:48:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2107 (re)pooling @ 10%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55694 and previous config saved to /var/cache/conftool/dbconfig/20240125-174825-root.json [17:48:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137:3314 (re)pooling @ 10%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55695 and previous config saved to /var/cache/conftool/dbconfig/20240125-174833-root.json [17:48:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137:3315 (re)pooling @ 10%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55696 and previous config saved to /var/cache/conftool/dbconfig/20240125-174840-root.json [17:48:46] (03PS1) 10Ssingh: wikimedia.org: fix store.wm.org records [dns] - 10https://gerrit.wikimedia.org/r/993008 
(https://phabricator.wikimedia.org/T355835) [17:48:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2147 (re)pooling @ 10%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55697 and previous config saved to /var/cache/conftool/dbconfig/20240125-174846-root.json [17:48:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 10%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55698 and previous config saved to /var/cache/conftool/dbconfig/20240125-174851-root.json [17:48:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 10%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55699 and previous config saved to /var/cache/conftool/dbconfig/20240125-174857-root.json [17:49:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2188 (re)pooling @ 10%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55700 and previous config saved to /var/cache/conftool/dbconfig/20240125-174902-root.json [17:49:17] (03CR) 10JHathaway: "maybe add the condition both to the timer and the service? 
https://github.com/systemd/systemd/issues/3963" [puppet] - 10https://gerrit.wikimedia.org/r/992888 (owner: 10Majavah) [17:49:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2140.codfw.wmnet with reason: Maintenance [17:49:43] (03CR) 10CI reject: [V: 04-1] wikimedia.org: fix store.wm.org records [dns] - 10https://gerrit.wikimedia.org/r/993008 (https://phabricator.wikimedia.org/T355835) (owner: 10Ssingh) [17:49:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2140.codfw.wmnet with reason: Maintenance [17:51:19] (03PS2) 10Ssingh: wikimedia.org: fix store.wm.org records [dns] - 10https://gerrit.wikimedia.org/r/993008 (https://phabricator.wikimedia.org/T355835) [17:52:18] (03CR) 10CI reject: [V: 04-1] wikimedia.org: fix store.wm.org records [dns] - 10https://gerrit.wikimedia.org/r/993008 (https://phabricator.wikimedia.org/T355835) (owner: 10Ssingh) [17:54:28] (03PS2) 10Reedy: captchaloop: Generate old and new captchas [puppet] - 10https://gerrit.wikimedia.org/r/990715 [17:54:30] (03PS1) 10Reedy: mediawiki: Refactor and improve captchaloop [puppet] - 10https://gerrit.wikimedia.org/r/993010 [17:55:33] (03PS3) 10Ssingh: wikimedia.org: fix store.wm.org records [dns] - 10https://gerrit.wikimedia.org/r/993008 (https://phabricator.wikimedia.org/T355835) [17:58:18] (03PS1) 10Btullis: Update the spark-operator image name and version [deployment-charts] - 10https://gerrit.wikimedia.org/r/993012 (https://phabricator.wikimedia.org/T354273) [17:59:21] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mediawiki: Replace deprecated blacklist parameter in captchaloop [puppet] - 10https://gerrit.wikimedia.org/r/774940 (https://phabricator.wikimedia.org/T277936) (owner: 10Reedy) [17:59:37] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/774940 (https://phabricator.wikimedia.org/T277936) (owner: 10Reedy) [18:00:05] bd808: #bothumor Q:Why did functions stop calling each 
other? A:They had arguments. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T1800). [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T1800) [18:00:51] (03CR) 10Ssingh: [C: 03+2] wikimedia.org: fix store.wm.org records [dns] - 10https://gerrit.wikimedia.org/r/993008 (https://phabricator.wikimedia.org/T355835) (owner: 10Ssingh) [18:01:06] !log running authdns-update for CR 993008: T355835 [18:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:33] T355835: Ensure that store.wikimedia.org complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355835 [18:03:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2159 (re)pooling @ 25%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55701 and previous config saved to /var/cache/conftool/dbconfig/20240125-180308-root.json [18:03:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2109 (re)pooling @ 25%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55702 and previous config saved to /var/cache/conftool/dbconfig/20240125-180324-root.json [18:03:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2107 (re)pooling @ 25%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55703 and previous config saved to /var/cache/conftool/dbconfig/20240125-180330-root.json [18:03:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137:3314 (re)pooling @ 25%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55704 and previous config saved to /var/cache/conftool/dbconfig/20240125-180338-root.json [18:03:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137:3315 (re)pooling @ 25%: After network maintenance', diff saved to 
https://phabricator.wikimedia.org/P55705 and previous config saved to /var/cache/conftool/dbconfig/20240125-180345-root.json [18:03:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2147 (re)pooling @ 25%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55706 and previous config saved to /var/cache/conftool/dbconfig/20240125-180351-root.json [18:03:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 25%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55707 and previous config saved to /var/cache/conftool/dbconfig/20240125-180356-root.json [18:04:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 25%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55708 and previous config saved to /var/cache/conftool/dbconfig/20240125-180402-root.json [18:04:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2188 (re)pooling @ 25%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55709 and previous config saved to /var/cache/conftool/dbconfig/20240125-180407-root.json [18:05:16] 10SRE, 10ops-codfw, 10Data-Persistence, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (10Marostegui) @Jhancock.wm @papaul <3 [18:07:29] 10SRE, 10DNS, 10Foundational Technology Requests, 10Traffic, 10Patch-For-Review: Ensure that store.wikimedia.org complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355835 (10ssingh) ` $ dig n1j._domainkey.wikimedia.org +short dkim1.327bdf87d37c.p413.email.myshopify.co... 
[18:10:55] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:11:09] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:13:04] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6001.drmrs.wmnet with OS bookworm [18:18:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2159 (re)pooling @ 50%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55710 and previous config saved to /var/cache/conftool/dbconfig/20240125-181814-root.json [18:18:28] (03PS1) 10Bking: cloudelastic: add CNAME for migration canary [dns] - 10https://gerrit.wikimedia.org/r/993014 (https://phabricator.wikimedia.org/T355617) [18:18:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2109 (re)pooling @ 50%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55711 and previous config saved to /var/cache/conftool/dbconfig/20240125-181829-root.json [18:18:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2107 (re)pooling @ 50%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55712 and previous config saved to /var/cache/conftool/dbconfig/20240125-181835-root.json [18:18:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137:3314 (re)pooling @ 50%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55713 and previous config saved to /var/cache/conftool/dbconfig/20240125-181843-root.json [18:18:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137:3315 (re)pooling @ 50%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55714 and previous config saved to /var/cache/conftool/dbconfig/20240125-181850-root.json [18:18:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2147 (re)pooling @ 50%: After 
network maintenance', diff saved to https://phabricator.wikimedia.org/P55715 and previous config saved to /var/cache/conftool/dbconfig/20240125-181856-root.json [18:19:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 50%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55716 and previous config saved to /var/cache/conftool/dbconfig/20240125-181901-root.json [18:19:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 50%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55717 and previous config saved to /var/cache/conftool/dbconfig/20240125-181907-root.json [18:19:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2188 (re)pooling @ 50%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55718 and previous config saved to /var/cache/conftool/dbconfig/20240125-181912-root.json [18:21:00] (03CR) 10Ebernhardson: [C: 03+1] cloudelastic: add CNAME for migration canary [dns] - 10https://gerrit.wikimedia.org/r/993014 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [18:33:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2159 (re)pooling @ 75%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55719 and previous config saved to /var/cache/conftool/dbconfig/20240125-183318-root.json [18:33:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2109 (re)pooling @ 75%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55720 and previous config saved to /var/cache/conftool/dbconfig/20240125-183334-root.json [18:33:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2107 (re)pooling @ 75%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55721 and previous config saved to /var/cache/conftool/dbconfig/20240125-183340-root.json [18:33:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137:3314 (re)pooling @ 75%: After network 
maintenance', diff saved to https://phabricator.wikimedia.org/P55722 and previous config saved to /var/cache/conftool/dbconfig/20240125-183348-root.json [18:33:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137:3315 (re)pooling @ 75%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55723 and previous config saved to /var/cache/conftool/dbconfig/20240125-183355-root.json [18:34:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2147 (re)pooling @ 75%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55724 and previous config saved to /var/cache/conftool/dbconfig/20240125-183401-root.json [18:34:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 75%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55725 and previous config saved to /var/cache/conftool/dbconfig/20240125-183406-root.json [18:34:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 75%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55726 and previous config saved to /var/cache/conftool/dbconfig/20240125-183412-root.json [18:34:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2188 (re)pooling @ 75%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55727 and previous config saved to /var/cache/conftool/dbconfig/20240125-183417-root.json [18:42:54] (03PS2) 10Bking: cloudelastic: add CNAME for migration canary [dns] - 10https://gerrit.wikimedia.org/r/993014 (https://phabricator.wikimedia.org/T355617) [18:43:17] 10SRE, 10SRE-Access-Requests: Requesting analytics-privatedata-users access for amastilovic - https://phabricator.wikimedia.org/T355606 (10Ahoelzl) [18:45:35] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: reboot [18:46:00] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet 
with reason: reboot [18:47:00] !log phab2002 - rebooting [18:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:07] 10SRE, 10SRE-Access-Requests: Requesting analytics-privatedata-users access for amastilovic - https://phabricator.wikimedia.org/T355606 (10Ahoelzl) @RLazarus an ETA / update on the request would be very much appreciated. Cluster access is a key step for onboarding Aleksandar to my engineering team. Thank you! [18:48:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2159 (re)pooling @ 100%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55728 and previous config saved to /var/cache/conftool/dbconfig/20240125-184823-root.json [18:48:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2109 (re)pooling @ 100%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55729 and previous config saved to /var/cache/conftool/dbconfig/20240125-184839-root.json [18:48:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2107 (re)pooling @ 100%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55730 and previous config saved to /var/cache/conftool/dbconfig/20240125-184845-root.json [18:48:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137:3314 (re)pooling @ 100%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55731 and previous config saved to /var/cache/conftool/dbconfig/20240125-184853-root.json [18:49:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2137:3315 (re)pooling @ 100%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55732 and previous config saved to /var/cache/conftool/dbconfig/20240125-184900-root.json [18:49:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2147 (re)pooling @ 100%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55733 and previous config saved to 
/var/cache/conftool/dbconfig/20240125-184906-root.json [18:49:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 100%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55734 and previous config saved to /var/cache/conftool/dbconfig/20240125-184911-root.json [18:49:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2178 (re)pooling @ 100%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55735 and previous config saved to /var/cache/conftool/dbconfig/20240125-184917-root.json [18:49:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2188 (re)pooling @ 100%: After network maintenance', diff saved to https://phabricator.wikimedia.org/P55736 and previous config saved to /var/cache/conftool/dbconfig/20240125-184922-root.json [18:49:26] 10SRE, 10SRE-Access-Requests: Requesting analytics-privatedata-users access for amastilovic - https://phabricator.wikimedia.org/T355606 (10RLazarus) This week's clinic duty SRE is @Arnoldokoth. [18:49:31] RECOVERY - Check systemd state on phab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:52:09] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Aleksandar Mastilovic - https://phabricator.wikimedia.org/T355607 (10Ahoelzl) @Arnoldokoth would it be possible to get an update / ETA on the request? Ldap / wmf access is blocking onboarding Aleksandra to the engineering team. Thank you! [18:59:11] 10SRE, 10SRE-Access-Requests: Requesting analytics-privatedata-users access for amastilovic - https://phabricator.wikimedia.org/T355606 (10Arnoldokoth) Thanks @RLazarus Apologies @Ahoelzl This will be done as soon as @odimitrijevic / @Milimetric approve the request. 
[19:01:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:02:02] 10SRE, 10SRE-Access-Requests: Requesting analytics-privatedata-users access for amastilovic - https://phabricator.wikimedia.org/T355606 (10Arnoldokoth) a:03odimitrijevic [19:06:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:11:39] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Aleksandar Mastilovic - https://phabricator.wikimedia.org/T355607 (10Arnoldokoth) 05Open→03In progress [19:16:24] (03PS49) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [19:16:26] (03PS6) 10AOkoth: vrts: enable connection pooling [puppet] - 10https://gerrit.wikimedia.org/r/988679 [19:16:28] (03PS1) 10AOkoth: admin: add amastilovic to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/993019 (https://phabricator.wikimedia.org/T355607) [19:19:08] (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/993019 (https://phabricator.wikimedia.org/T355607) (owner: 10AOkoth) [19:19:23] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Aleksandar Mastilovic - https://phabricator.wikimedia.org/T355607 (10Arnoldokoth) @Ahoelzl This will be resolved shortly. 
[19:19:34] (03CR) 10Bking: [C: 03+2] flink-operator: Add cirrus-streaming-updater to prod watched namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/992983 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [19:20:39] (03PS2) 10AOkoth: admin: add amastilovic to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/993019 (https://phabricator.wikimedia.org/T355607) [19:22:19] (03Merged) 10jenkins-bot: flink-operator: Add cirrus-streaming-updater to prod watched namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/992983 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [19:22:30] (03CR) 10AOkoth: [C: 03+2] admin: add amastilovic to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/993019 (https://phabricator.wikimedia.org/T355607) (owner: 10AOkoth) [19:24:50] !log bking@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [19:25:55] !log bking@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [19:28:13] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Aleksandar Mastilovic - https://phabricator.wikimedia.org/T355607 (10Arnoldokoth) @amastilovic This should be okay now. [19:28:28] !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:28:34] !log ebernhardson@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:29:16] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [19:29:33] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[19:29:42] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (10VRiley-WMF) [19:29:47] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (10VRiley-WMF) [19:30:36] 10SRE, 10SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10Dzahn) >>! In T355591#9486355, @Arinaigu wrote: > There seems to be a problem with my developer account as well. Hi! It seems the problem is there is an account "Arinaugu" without the trailing... [19:31:35] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (10VRiley-WMF) cloudrabbit1002 is now in E4 U17 CableID 2M-20220016 Port 3 [19:41:19] 10SRE, 10SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10Arinaigu) Hi! I created the account Arinaigu for Meta Wikimedia and MediaWiki. Then I created a separate developer/Wikitech account arinaigum. I think I read somewhere in the documentation that t... [19:43:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:45:34] 10SRE, 10SRE-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10taavi) You do have a developer account, and the fact that you can log in to https://idm.wikimedia.org and https://idp.wikimedia.org confirms that. 
The problem with logging in to https://wikitech.... [19:48:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:56:39] !log taavi@cumin1002 START - Cookbook sre.dns.netbox [19:57:09] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (10VRiley-WMF) cloudrabbit1001 is now in C8 U19 CableID 5336 Port 21 [19:58:51] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add IPs for cloudrabbit1002 - taavi@cumin1002" [19:59:35] jouncebot: nowandnext [19:59:35] No deployments scheduled for the next 1 hour(s) and 0 minute(s) [19:59:35] In 1 hour(s) and 0 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T2100) [19:59:43] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add IPs for cloudrabbit1002 - taavi@cumin1002" [19:59:43] !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:00:34] !log taavi@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudrabbit1002 [20:01:21] !log taavi@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudrabbit1002 [20:02:09] !log taavi@cumin1002 START - Cookbook sre.dns.netbox [20:03:41] (03CR) 10Zabe: [C: 03+2] Start reading from af_actor/afh_actor in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992942 
(https://phabricator.wikimedia.org/T355616) (owner: 10Zabe) [20:04:20] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add IPs for cloudrabbit1001 - taavi@cumin1002" [20:04:25] (03Merged) 10jenkins-bot: Start reading from af_actor/afh_actor in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992942 (https://phabricator.wikimedia.org/T355616) (owner: 10Zabe) [20:05:14] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add IPs for cloudrabbit1001 - taavi@cumin1002" [20:05:14] !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:05:20] !log taavi@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudrabbit1001 [20:05:21] !log zabe@deploy2002 Started scap: Backport for [[gerrit:992942|Start reading from af_actor/afh_actor in group1 wikis (T355616)]] [20:05:29] 10SRE, 10Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (10Dzahn) Hi @xcollazo are you asking to add the list in addition to the current recipients or to entirely replace them (to remove the other recipients)? ` ops-du... 
[20:05:42] T355616: Start reading from af_actor/afh_actor - https://phabricator.wikimedia.org/T355616 [20:06:10] !log taavi@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudrabbit1001 [20:06:47] (03PS1) 10Majavah: Move cloudrabbit1001/2 to private vlan [puppet] - 10https://gerrit.wikimedia.org/r/993026 (https://phabricator.wikimedia.org/T345610) [20:08:09] (03CR) 10Majavah: [C: 03+2] Move cloudrabbit1001/2 to private vlan [puppet] - 10https://gerrit.wikimedia.org/r/993026 (https://phabricator.wikimedia.org/T345610) (owner: 10Majavah) [20:09:36] !log zabe@deploy2002 zabe: Backport for [[gerrit:992942|Start reading from af_actor/afh_actor in group1 wikis (T355616)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:10:05] !log zabe@deploy2002 zabe: Continuing with sync [20:10:56] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudrabbit1001.eqiad.wmnet with OS bookworm [20:11:30] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudrabbit1002.eqiad.wmnet with OS bookworm [20:14:23] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:14:34] 10SRE, 10Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (10xcollazo) > And another question, would it make sense if we move this to a Google group where your team becomes admin so in the future you can control it yourse... 
[20:14:53] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:15:07] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:15:37] !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:15:46] !log ebernhardson@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:16:48] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:992942|Start reading from af_actor/afh_actor in group1 wikis (T355616)]] (duration: 11m 27s) [20:16:53] T355616: Start reading from af_actor/afh_actor - https://phabricator.wikimedia.org/T355616 [20:19:33] !log taavi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudrabbit1002.eqiad.wmnet with OS bookworm [20:19:48] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudrabbit1002.eqiad.wmnet with OS bookworm [20:20:53] 10SRE, 10Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (10Dzahn) >>! In T355891#9489388, @xcollazo wrote: > Oh, that would be nice. How about we do that instead? > > Then I can take care of forwarding/figuring out if t... 
[20:25:30] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "set cloudrabbit1001/2 as active - taavi@cumin1002" [20:26:38] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "set cloudrabbit1001/2 as active - taavi@cumin1002" [20:27:30] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit1001.eqiad.wmnet with reason: host reimage [20:30:05] 10SRE, 10Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (10Dzahn) ITS request #99871 [20:30:51] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Aleksandar Mastilovic - https://phabricator.wikimedia.org/T355607 (10Dzahn) 05In progress→03Resolved a:03Dzahn [20:32:46] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit1001.eqiad.wmnet with reason: host reimage [20:32:53] (03PS1) 10Ebernhardson: cirrus-updater: Increase producer memory from 2g to 3g [deployment-charts] - 10https://gerrit.wikimedia.org/r/993028 (https://phabricator.wikimedia.org/T352335) [20:33:00] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit1002.eqiad.wmnet with reason: host reimage [20:33:48] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Aleksandar Mastilovic - https://phabricator.wikimedia.org/T355607 (10Dzahn) Also added to WMF-NDA group in Phabricator. (per https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMF_Group) You can now see non-public tickets. 
[20:33:52] !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:33:56] !log ebernhardson@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:34:46] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Aleksandar Mastilovic - https://phabricator.wikimedia.org/T355607 (10Dzahn) a:05Dzahn→03None [20:35:39] !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:35:44] !log ebernhardson@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:36:23] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit1002.eqiad.wmnet with reason: host reimage [20:37:02] !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:37:07] !log ebernhardson@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:44:40] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:45:12] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.248 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:45:30] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:50:40] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - taavi@cumin1002" [20:51:29] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - taavi@cumin1002" [20:51:30] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit1001.eqiad.wmnet with OS bookworm [20:54:24] (03PS1) 10Hashar: gerrit: use finer groups for commit commentlink [puppet] - 10https://gerrit.wikimedia.org/r/993029 (https://phabricator.wikimedia.org/T354886) [20:54:52] (03CR) 10Hashar: "Example https://gerrit.wikimedia.org/r/c/mediawiki/extensions/LiquidThreads/+/992939" [puppet] - 10https://gerrit.wikimedia.org/r/993029 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [20:54:57] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - taavi@cumin1002" [20:55:21] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Aleksandar Mastilovic - https://phabricator.wikimedia.org/T355607 (10amastilovic) @Arnoldokoth thank you! 
[20:55:47] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - taavi@cumin1002" [20:55:48] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit1002.eqiad.wmnet with OS bookworm [20:56:23] !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:56:28] !log ebernhardson@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:57:49] !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:57:53] !log ebernhardson@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:58:09] !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [20:58:11] !log ebernhardson@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:59:17] (03CR) 10Paladox: [C: 03+1] "Tested locally and works." [puppet] - 10https://gerrit.wikimedia.org/r/993029 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240125T2100). [21:00:05] No Gerrit patches in the queue for this window AFAICS. 
[21:00:53] (03CR) 10Cwhite: [C: 03+2] logstash: consume from mediawiki accesslog sampled topics [puppet] - 10https://gerrit.wikimedia.org/r/992656 (https://phabricator.wikimedia.org/T355836) (owner: 10Cwhite) [21:05:20] (03PS1) 10Ebernhardson: cirrus-updater: Normalize kafka configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/993032 (https://phabricator.wikimedia.org/T352335) [21:07:07] (03CR) 10Dzahn: [C: 03+2] gerrit: use finer groups for commit commentlink [puppet] - 10https://gerrit.wikimedia.org/r/993029 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [21:09:56] (03PS2) 10Ebernhardson: cirrus-updater: Normalize kafka configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/993032 (https://phabricator.wikimedia.org/T352335) [21:11:43] (03CR) 10Ebernhardson: [C: 03+2] cirrus-updater: Normalize kafka configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/993032 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [21:11:51] (03CR) 10Dzahn: [C: 03+2] "deployed and config reload." 
[puppet] - 10https://gerrit.wikimedia.org/r/993029 (https://phabricator.wikimedia.org/T354886) (owner: 10Hashar) [21:12:37] (03Merged) 10jenkins-bot: cirrus-updater: Normalize kafka configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/993032 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [21:13:39] !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:13:44] !log ebernhardson@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:14:09] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:14:16] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:19:00] !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:19:32] !log ebernhardson@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:35:11] (03PS1) 10Ryan Kemper: cloudelastic: remove old masters [puppet] - 10https://gerrit.wikimedia.org/r/993038 (https://phabricator.wikimedia.org/T351354) [21:36:01] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/993038 (https://phabricator.wikimedia.org/T351354) (owner: 10Ryan Kemper) [21:36:16] (03PS1) 10Ebernhardson: cirrus-updater: Update list of allowed wikis in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/993039 [21:37:55] (03CR) 10Bking: [C: 03+1] cloudelastic: remove old masters [puppet] - 10https://gerrit.wikimedia.org/r/993038 (https://phabricator.wikimedia.org/T351354) (owner: 10Ryan Kemper) [21:38:51] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [21:40:00] (03CR) 10Ebernhardson: [C: 03+2] cirrus-updater: Update list of allowed wikis in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/993039 (owner: 10Ebernhardson) [21:40:55] (03Merged) 10jenkins-bot: cirrus-updater: Update list of allowed wikis in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/993039 (owner: 10Ebernhardson) [21:44:06] !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [21:44:10] !log ebernhardson@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:44:23] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:44:41] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:55:19] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:55:26] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:07:39] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: cloudelastic maintenance [22:08:08] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: cloudelastic maintenance [22:08:10] (03PS2) 10Ryan Kemper: cloudelastic: remove old masters [puppet] - 10https://gerrit.wikimedia.org/r/993038 (https://phabricator.wikimedia.org/T351354) [22:08:57] !log T351354 Downtimed `cloudelastic*`; shortly will restart `cloudelastic100[1,2,4]` one host at a time to make them no longer masters [22:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:11] T351354: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 [22:09:13] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] 
cloudelastic: remove old masters [puppet] - 10https://gerrit.wikimedia.org/r/993038 (https://phabricator.wikimedia.org/T351354) (owner: 10Ryan Kemper) [22:11:25] !log T351354 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/993038; restarting `cloudelastic1001` following puppet run [22:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:07] !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release [22:13:23] (03PS1) 10Ebernhardson: cirrus updater: Configure http routes for prod clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/993045 (https://phabricator.wikimedia.org/T352335) [22:14:20] (03PS2) 10Ebernhardson: cirrus updater: Configure http routes for prod clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/993045 (https://phabricator.wikimedia.org/T352335) [22:15:37] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Configure http routes for prod clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/993045 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [22:15:50] !log T351354 Restarting `cloudelastic1004` following puppet run [22:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:55] T351354: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 [22:16:42] (03Merged) 10jenkins-bot: cirrus updater: Configure http routes for prod clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/993045 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [22:19:20] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:19:30] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:25:58] !log T351354 Restarting `cloudelastic1002` [22:26:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:03] T351354: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 [22:28:02] (03PS1) 10BCornwall: fixup! Add module for ncmonitor [puppet] - 10https://gerrit.wikimedia.org/r/993046 [22:28:18] (03Abandoned) 10BCornwall: fixup! Add module for ncmonitor [puppet] - 10https://gerrit.wikimedia.org/r/993046 (owner: 10BCornwall) [22:33:19] !log T351354 Now restarting new masters to keep configs in sync; restarting `cloudelastic1007` [22:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:24] T351354: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 [22:34:42] !log T351354 Now restarting new masters to keep configs in sync; restarting `cloudelastic1009` [22:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:06] !log T351354 Restarting `cloudelastic1006` (final restart for today) [22:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:12] T351354: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 [22:52:41] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1010 for use cloudelastic1010 as migration canary - bking@cumin2002 - T355617 [22:52:41] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: cloudelastic1010 for use cloudelastic1010 as migration canary - bking@cumin2002 - T355617 [22:52:46] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [22:53:47] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1010 for use cloudelastic1010 as migration canary - bking@cumin2002 - T355617 [22:53:48] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: cloudelastic1010 for use 
cloudelastic1010 as migration canary - bking@cumin2002 - T355617 [22:53:55] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1010.wikimedia.org for use cloudelastic1010 as migration canary - bking@cumin2002 - T355617 [22:53:58] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cloudelastic1010.wikimedia.org for use cloudelastic1010 as migration canary - bking@cumin2002 - T355617 [23:14:18] 10SRE, 10LDAP-Access-Requests: Grant Access to ops for swfrench - https://phabricator.wikimedia.org/T355912 (10Scott_French) [23:15:17] 10SRE, 10LDAP-Access-Requests: Grant Access to ops for swfrench - https://phabricator.wikimedia.org/T355912 (10Scott_French) 05Open→03In progress p:05Triage→03Medium [23:16:23] (03PS1) 10Scott French: admin: move swfrench from sre-admins to ops [puppet] - 10https://gerrit.wikimedia.org/r/993050 (https://phabricator.wikimedia.org/T355912) [23:17:41] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cloudelastic1010.wikimedia.org with reason: migration canary T355617 [23:17:55] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [23:17:56] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cloudelastic1010.wikimedia.org with reason: migration canary T355617 [23:21:00] (03CR) 10RLazarus: [C: 03+2] admin: move swfrench from sre-admins to ops [puppet] - 10https://gerrit.wikimedia.org/r/993050 (https://phabricator.wikimedia.org/T355912) (owner: 10Scott French) [23:22:35] 10SRE, 10Data Products: Forward ops-dumps@wikimedia.org to data-engineering-alerts@lists.wikimedia.org - https://phabricator.wikimedia.org/T355891 (10Dzahn) a:03Dzahn [23:29:08] !log zabe@mwmaint2002:/tmp/uploads$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user=Sturm . 
# T355485 [23:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:13] T355485: Server side upload for Sturm - https://phabricator.wikimedia.org/T355485 [23:41:43] (03PS3) 10Zabe: Setup namespace for 2025, 2026, enable subpages for 2023-2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961963 (https://phabricator.wikimedia.org/T347622) (owner: 10Robertsky) [23:44:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961963 (https://phabricator.wikimedia.org/T347622) (owner: 10Robertsky) [23:45:20] (03Merged) 10jenkins-bot: Setup namespace for 2025, 2026, enable subpages for 2023-2026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961963 (https://phabricator.wikimedia.org/T347622) (owner: 10Robertsky) [23:45:35] !log zabe@deploy2002 Started scap: Backport for [[gerrit:961963|Setup namespace for 2025, 2026, enable subpages for 2023-2026 (T347622)]] [23:45:40] T347622: wikimaniawiki: create namespace for 2025 and 2026 - https://phabricator.wikimedia.org/T347622 [23:46:59] !log zabe@deploy2002 robertsky and zabe: Backport for [[gerrit:961963|Setup namespace for 2025, 2026, enable subpages for 2023-2026 (T347622)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:47:29] !log zabe@deploy2002 robertsky and zabe: Continuing with sync [23:54:05] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:961963|Setup namespace for 2025, 2026, enable subpages for 2023-2026 (T347622)]] (duration: 08m 30s) [23:54:17] !log zabe@mwmaint2002:~$ mwscript namespaceDupes.php --wiki=wikimaniawiki --fix # T347622 [23:54:27] T347622: wikimaniawiki: create namespace for 2025 and 2026 - https://phabricator.wikimedia.org/T347622 [23:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log