[00:00:24] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1038789 (owner: 10TrainBranchBot) [00:04:25] (03PS1) 10Cwhite: logstash: add drop for php notice unedefined index issue [puppet] - 10https://gerrit.wikimedia.org/r/1038790 (https://phabricator.wikimedia.org/T366657) [00:08:04] (03CR) 10Cwhite: [C:03+2] logstash: add drop for php notice unedefined index issue [puppet] - 10https://gerrit.wikimedia.org/r/1038790 (https://phabricator.wikimedia.org/T366657) (owner: 10Cwhite) [00:20:28] (03PS1) 10JHathaway: dummy ssl key [labs/private] - 10https://gerrit.wikimedia.org/r/1038920 [00:22:38] (03CR) 10JHathaway: [C:03+2] dummy ssl key [labs/private] - 10https://gerrit.wikimedia.org/r/1038920 (owner: 10JHathaway) [00:22:41] (03CR) 10JHathaway: [V:03+2 C:03+2] dummy ssl key [labs/private] - 10https://gerrit.wikimedia.org/r/1038920 (owner: 10JHathaway) [00:25:09] (03CR) 10JHathaway: [V:03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [01:03:43] 06SRE, 10Wikimedia-Mailing-lists: Create a mailing list for Bangla Wikimoitree - https://phabricator.wikimedia.org/T365915#9862244 (10Bodhisattwa) Seeing the ESEAP mailing list, I think, it would be OK, if we get the name as wikimoitree@lists.wikimedia.org [01:08:45] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:10:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:08:45] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:10:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:34:02] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [02:34:15] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [02:34:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T352010)', diff saved to https://phabricator.wikimedia.org/P64041 and previous config saved to /var/cache/conftool/dbconfig/20240605-023423-ladsgroup.json [02:34:26] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [02:38:44] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:55:45] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:57:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:13:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T352010)', diff saved to https://phabricator.wikimedia.org/P64042 and previous config saved to /var/cache/conftool/dbconfig/20240605-031310-ladsgroup.json [03:13:13] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [03:27:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T364299)', diff saved to https://phabricator.wikimedia.org/P64043 and previous config saved to /var/cache/conftool/dbconfig/20240605-032704-marostegui.json [03:27:07] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [03:28:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P64044 and previous config saved to /var/cache/conftool/dbconfig/20240605-032817-ladsgroup.json [03:42:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P64045 and previous config saved to /var/cache/conftool/dbconfig/20240605-034212-marostegui.json [03:43:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P64046 and previous config saved to /var/cache/conftool/dbconfig/20240605-034326-ladsgroup.json [03:57:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P64047 and previous config saved to /var/cache/conftool/dbconfig/20240605-035719-marostegui.json [03:58:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T364069)', diff saved to https://phabricator.wikimedia.org/P64048 and previous config saved to /var/cache/conftool/dbconfig/20240605-035831-marostegui.json [03:58:33] !log 
ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T352010)', diff saved to https://phabricator.wikimedia.org/P64049 and previous config saved to /var/cache/conftool/dbconfig/20240605-035832-ladsgroup.json [03:58:35] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [03:58:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [03:58:37] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [03:58:48] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [03:58:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T352010)', diff saved to https://phabricator.wikimedia.org/P64050 and previous config saved to /var/cache/conftool/dbconfig/20240605-035855-ladsgroup.json [04:12:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T364299)', diff saved to https://phabricator.wikimedia.org/P64051 and previous config saved to /var/cache/conftool/dbconfig/20240605-041227-marostegui.json [04:12:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2155.codfw.wmnet with reason: Maintenance [04:12:31] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [04:12:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2155.codfw.wmnet with reason: Maintenance [04:12:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance [04:12:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance [04:13:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T364299)', diff saved to 
https://phabricator.wikimedia.org/P64052 and previous config saved to /var/cache/conftool/dbconfig/20240605-041306-marostegui.json [04:13:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P64053 and previous config saved to /var/cache/conftool/dbconfig/20240605-041339-marostegui.json [04:28:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P64054 and previous config saved to /var/cache/conftool/dbconfig/20240605-042847-marostegui.json [04:43:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T364069)', diff saved to https://phabricator.wikimedia.org/P64055 and previous config saved to /var/cache/conftool/dbconfig/20240605-044355-marostegui.json [04:43:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1242.eqiad.wmnet with reason: Maintenance [04:43:58] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [04:44:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1242.eqiad.wmnet with reason: Maintenance [04:44:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T364069)', diff saved to https://phabricator.wikimedia.org/P64056 and previous config saved to /var/cache/conftool/dbconfig/20240605-044418-marostegui.json [05:08:13] (03PS1) 10Marostegui: es6,es7: Add candidate masters [puppet] - 10https://gerrit.wikimedia.org/r/1038925 (https://phabricator.wikimedia.org/T365098) [05:09:09] (03CR) 10Marostegui: "This is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/1038925 (https://phabricator.wikimedia.org/T365098) (owner: 10Marostegui) [05:09:11] (03CR) 10Marostegui: [C:03+2] es6,es7: Add candidate masters [puppet] - 10https://gerrit.wikimedia.org/r/1038925 (https://phabricator.wikimedia.org/T365098) (owner: 
10Marostegui) [05:11:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:17:11] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 1457 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [05:19:59] (03CR) 10Sg912: [V:03+1 C:03+1] cassandra: create new commons_impact_analytics role [puppet] - 10https://gerrit.wikimedia.org/r/1038409 (https://phabricator.wikimedia.org/T361835) (owner: 10Eevans) [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T0600) [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:17:11] RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [06:55:13] (03CR) 10Hashar: [C:03+2] Use a wildcard TypeScript include for plugins [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038810 (owner: 
10Hashar) [06:55:43] (03Merged) 10jenkins-bot: Use a wildcard TypeScript include for plugins [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038810 (owner: 10Hashar) [06:57:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:05] Amir1 and Urbanecm: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T0700) [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:03:44] (03CR) 10Muehlenhoff: "Looks good, two nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/1038772 (owner: 10EoghanGaffney) [07:07:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s2 T366038 [07:07:53] (03CR) 10DCausse: "thanks for the fixes!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper) [07:07:54] T366038: Switchover s2 master (db2207 -> db2204) - https://phabricator.wikimedia.org/T366038 [07:07:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2204 with weight 0 T366038', diff saved to https://phabricator.wikimedia.org/P64057 and previous config saved to /var/cache/conftool/dbconfig/20240605-070758-root.json [07:08:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s2 T366038 [07:08:17] (03PS10) 10DCausse: wdqs.data-reload: fix regex escaping [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper) [07:08:19] (03CR) 10Muehlenhoff: [C:03+1] "Looks good (once approval for analytics-privatedata-users is in)" [puppet] - 10https://gerrit.wikimedia.org/r/1035545 (https://phabricator.wikimedia.org/T364715) (owner: 10Dzahn) [07:08:48] (03CR) 10DCausse: [C:03+1] wdqs.data-reload: fix regex escaping (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper) [07:09:47] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9862426 (10MoritzMuehlenhoff) >>! In T364715#9840315, @colewhite wrote: > Added Data Engineering tag fo... 
[07:10:10] (03PS1) 10Marostegui: mariadb: Promote db2204 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1039067 (https://phabricator.wikimedia.org/T366038) [07:10:14] (03Abandoned) 10Marostegui: mariadb: Promote db2204 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1035872 (https://phabricator.wikimedia.org/T366038) (owner: 10Gerrit maintenance bot) [07:10:37] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2204 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1039067 (https://phabricator.wikimedia.org/T366038) (owner: 10Marostegui) [07:16:53] (03CR) 10Ryan Kemper: [C:03+2] opensearch/roll-restart-reboot: fix usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1031063 (owner: 10Ryan Kemper) [07:18:06] (03CR) 10Jelto: [C:04-1] "typo, comment in line 😊" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038251 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [07:18:29] (03PS4) 10Sergio Gimeno: [Beta] Enable CommunityConfiguration extension in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) [07:19:39] (03PS5) 10Sergio Gimeno: [Beta] Enable CommunityConfiguration extension in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) [07:19:56] (03CR) 10Sergio Gimeno: [Beta] Enable CommunityConfiguration extension in all wikis (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) (owner: 10Sergio Gimeno) [07:20:56] (03Merged) 10jenkins-bot: opensearch/roll-restart-reboot: fix usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1031063 (owner: 10Ryan Kemper) [07:24:09] !log Starting s2 codfw failover from db2207 to db2204 - T366038 [07:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:12] T366038: Switchover s2 master (db2207 -> db2204) - https://phabricator.wikimedia.org/T366038 
[07:24:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2204 to s2 primary T366038', diff saved to https://phabricator.wikimedia.org/P64058 and previous config saved to /var/cache/conftool/dbconfig/20240605-072427-marostegui.json [07:25:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2207 T366038', diff saved to https://phabricator.wikimedia.org/P64059 and previous config saved to /var/cache/conftool/dbconfig/20240605-072509-root.json [07:25:15] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9862490 (10Reedy) [07:25:45] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:27:50] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on db2207.codfw.wmnet with reason: Long schema change [07:27:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2207.codfw.wmnet with reason: Long schema change [07:28:12] !log dbmaint codfw s2 deploy schema change on db2207 T364299 [07:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:14] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [07:29:25] (03PS1) 10Muehlenhoff: Failover URL downloaders for reboot [dns] - 10https://gerrit.wikimedia.org/r/1039166 [07:30:04] (03PS1) 10Marostegui: db1186: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1039167 (https://phabricator.wikimedia.org/T366556) [07:30:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1186', diff saved to https://phabricator.wikimedia.org/P64060 and previous config saved to 
/var/cache/conftool/dbconfig/20240605-073024-root.json [07:30:36] (03CR) 10Marostegui: [C:03+2] db1186: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1039167 (https://phabricator.wikimedia.org/T366556) (owner: 10Marostegui) [07:30:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install1004.wikimedia.org [07:30:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on db1186.eqiad.wmnet with reason: Reimage [07:31:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install2004.wikimedia.org [07:31:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1186.eqiad.wmnet with reason: Reimage [07:31:22] (03CR) 10Muehlenhoff: [C:03+2] Failover URL downloaders for reboot [dns] - 10https://gerrit.wikimedia.org/r/1039166 (owner: 10Muehlenhoff) [07:35:09] (03PS3) 10Clément Goubert: miscweb: Use a random miscweb image for default value [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038251 (https://phabricator.wikimedia.org/T362518) [07:35:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install1004.wikimedia.org [07:35:16] (03CR) 10Clément Goubert: miscweb: Use a random miscweb image for default value (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038251 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [07:35:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install2004.wikimedia.org [07:35:29] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1038703 (https://phabricator.wikimedia.org/T366565) (owner: 10Hashar) [07:36:47] (03CR) 10Jelto: [C:03+1] "lgtm, thanks for the preparation. Let me know when this should be merged." 
[puppet] - 10https://gerrit.wikimedia.org/r/1038703 (https://phabricator.wikimedia.org/T366565) (owner: 10Hashar) [07:37:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1186.eqiad.wmnet with OS bookworm [07:37:31] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host db1186.eqiad.wmnet with OS bookworm [07:37:32] (03CR) 10Muehlenhoff: [C:03+2] gerrit: remove mac algos no more supported by Mina SSHD [puppet] - 10https://gerrit.wikimedia.org/r/1038703 (https://phabricator.wikimedia.org/T366565) (owner: 10Hashar) [07:38:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1186.eqiad.wmnet with OS bookworm [07:38:19] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host db1186.eqiad.wmnet with OS bookworm [07:38:43] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1021.eqiad.wmnet [07:38:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1186.eqiad.wmnet with OS bookworm [07:40:18] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:40:22] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:45:07] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1021.eqiad.wmnet [07:47:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T364299)', diff saved to https://phabricator.wikimedia.org/P64061 and previous config saved to /var/cache/conftool/dbconfig/20240605-074739-marostegui.json [07:47:43] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [07:50:00] (03PS3) 10Ayounsi: Netbox deploy for 4.0.2 [software/netbox-deploy] (dev) - 
10https://gerrit.wikimedia.org/r/1038694 (https://phabricator.wikimedia.org/T336275) [07:50:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mirror1001.wikimedia.org [07:50:32] (03CR) 10Ayounsi: Netbox deploy for 4.0.2 (032 comments) [software/netbox-deploy] (dev) - 10https://gerrit.wikimedia.org/r/1038694 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [07:52:12] (03CR) 10Clément Goubert: "That would be great to add, but possibly would be more at home in the reboot function itself so it could be reused by all cookbooks." [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 (owner: 10Clément Goubert) [07:53:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1186.eqiad.wmnet with reason: host reimage [07:53:32] (03PS4) 10Ayounsi: Netbox deploy for 4.0.3 [software/netbox-deploy] (dev) - 10https://gerrit.wikimedia.org/r/1038694 (https://phabricator.wikimedia.org/T336275) [07:53:33] (03CR) 10Jelto: [C:03+1] "lgtm now, thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1038251 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [07:54:11] (03PS1) 10Giuseppe Lavagetto: Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) [07:54:31] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1025.eqiad.wmnet [07:54:43] (03CR) 10Volans: "question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 (owner: 10Clément Goubert) [07:55:09] (03CR) 10CI reject: [V:04-1] Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [07:56:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1186.eqiad.wmnet with reason: host reimage [07:57:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mirror1001.wikimedia.org [07:58:14] (03CR) 10Clément Goubert: sre.k8s.reboot-nodes: Add exclude option (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 (owner: 10Clément Goubert) [07:59:26] (03PS8) 10Clément Goubert: sre.k8s.reboot-nodes: Add exclude option [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 [07:59:40] (03CR) 10Volans: sre.k8s.reboot-nodes: Add exclude option (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 (owner: 10Clément Goubert) [08:00:08] !log cgoubert@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-codfw [08:00:43] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1025.eqiad.wmnet [08:01:11] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1026.eqiad.wmnet [08:02:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to 
https://phabricator.wikimedia.org/P64062 and previous config saved to /var/cache/conftool/dbconfig/20240605-080247-marostegui.json [08:04:08] (03CR) 10Clément Goubert: sre.k8s.reboot-nodes: Add exclude option (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 (owner: 10Clément Goubert) [08:05:59] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete thanos-query.discovery.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/1038818 (https://phabricator.wikimedia.org/T360414) (owner: 10Muehlenhoff) [08:07:15] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS [08:07:15] 6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:07:57] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1026.eqiad.wmnet [08:08:18] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1027.eqiad.wmnet [08:08:35] (03PS1) 10Clément Goubert: Revert "mw1358: Put back insetup::serviceops" [puppet] - 10https://gerrit.wikimedia.org/r/1038834 [08:09:13] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:11:35] (03CR) 10Alexandros Kosiaris: [C:03+1] kubernetes: rename and reimage 3 api appservers, 2 appservers [puppet] - 10https://gerrit.wikimedia.org/r/1038757 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [08:14:30] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1027.eqiad.wmnet [08:17:56] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P64063 and previous config saved to /var/cache/conftool/dbconfig/20240605-081755-marostegui.json [08:18:27] (03CR) 10Btullis: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1038771 (https://phabricator.wikimedia.org/T365503) (owner: 10Brouberol) [08:18:50] !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-codfw [08:19:17] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS6 [08:19:17] : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:19:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1186.eqiad.wmnet with OS bookworm [08:21:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 06serviceops: hw troubleshooting: firmware upgrade for mw1358.eqiad.wmnet - https://phabricator.wikimedia.org/T366583#9862589 (10Clement_Goubert) Thanks! [08:21:17] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:22:49] (03PS11) 10DCausse: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper) [08:23:14] (03CR) 10Volans: "This could be a nice addition to the parent class in `sre/__init__.py`. 
Spicerack has already an `uptime()` method and we could collect al" [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 (owner: 10Clément Goubert) [08:23:44] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:23:52] (03CR) 10Hnowlan: [C:03+1] Revert "mw1358: Put back insetup::serviceops" [puppet] - 10https://gerrit.wikimedia.org/r/1038834 (owner: 10Clément Goubert) [08:24:05] (03CR) 10Clément Goubert: [C:03+2] Revert "mw1358: Put back insetup::serviceops" [puppet] - 10https://gerrit.wikimedia.org/r/1038834 (owner: 10Clément Goubert) [08:24:20] (03CR) 10Brouberol: [V:03+1 C:03+2] analytics_test_cluster_coordinator: upgrade mariadb to version 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1038771 (https://phabricator.wikimedia.org/T365503) (owner: 10Brouberol) [08:27:23] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:27:25] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:27:42] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1358 to wikikube-worker1001 [08:27:47] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [08:29:59] (03CR) 10Hashar: [C:03+2] plugins: Add wm-schedule-deployment plugin (034 comments) [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512) (owner: 10BryanDavis) [08:30:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki2002.codfw.wmnet [08:30:33] (03Merged) 
10jenkins-bot: plugins: Add wm-schedule-deployment plugin [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512) (owner: 10BryanDavis) [08:30:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P64064 and previous config saved to /var/cache/conftool/dbconfig/20240605-083041-root.json [08:30:48] !log hashar@deploy1002 Started deploy [gerrit/gerrit@b91b3bd]: Use a wildcard TypeScript include for plugins [08:30:56] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@b91b3bd]: Use a wildcard TypeScript include for plugins (duration: 00m 08s) [08:31:17] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1028.eqiad.wmnet [08:31:21] !log hashar@deploy1002 Started deploy [gerrit/gerrit@7ea913b]: plugins: Add wm-schedule-deployment plugin - T366512 [08:31:29] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@7ea913b]: plugins: Add wm-schedule-deployment plugin - T366512 (duration: 00m 07s) [08:31:42] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1358 to wikikube-worker1001 - cgoubert@cumin1002" [08:33:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T364299)', diff saved to https://phabricator.wikimedia.org/P64065 and previous config saved to /var/cache/conftool/dbconfig/20240605-083304-marostegui.json [08:33:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2172.codfw.wmnet with reason: Maintenance [08:33:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1358 to wikikube-worker1001 - cgoubert@cumin1002" [08:33:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [08:33:18] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1001 [08:33:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2172.codfw.wmnet with reason: Maintenance [08:33:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T364299)', diff saved to https://phabricator.wikimedia.org/P64066 and previous config saved to /var/cache/conftool/dbconfig/20240605-083328-marostegui.json [08:33:42] (03PS1) 10Marostegui: Revert "db1186: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1038835 [08:34:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2002.codfw.wmnet [08:34:23] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:34:25] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:34:30] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1001 [08:34:31] (03PS1) 10Muehlenhoff: Remove obsolete stub certs [labs/private] - 10https://gerrit.wikimedia.org/r/1039173 (https://phabricator.wikimedia.org/T360414) [08:34:38] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1358 to wikikube-worker1001 [08:35:08] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9862621 (10darthmon_wmde) hereby I, as "direct supervisor" of Ricki's, aprove for Ricki to get access to analytics-privatedata-users. Since this is cru... 
[08:35:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet [08:37:05] (03CR) 10Marostegui: [C:03+2] Revert "db1186: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1038835 (owner: 10Marostegui) [08:37:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P64067 and previous config saved to /var/cache/conftool/dbconfig/20240605-083733-root.json [08:37:52] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1028.eqiad.wmnet [08:38:44] FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:38:51] (03PS1) 10Marostegui: db1186: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1039176 [08:39:14] (03CR) 10Marostegui: [C:03+2] db1186: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1039176 (owner: 10Marostegui) [08:39:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet [08:41:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on db1227.eqiad.wmnet with reason: Reimage [08:41:57] (03PS1) 10Marostegui: db1127: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1039178 (https://phabricator.wikimedia.org/T362745) [08:42:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1227.eqiad.wmnet with reason: Reimage [08:42:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1227', diff saved to https://phabricator.wikimedia.org/P64068 and previous config saved to /var/cache/conftool/dbconfig/20240605-084211-root.json [08:42:52] (03PS2) 10Marostegui: db1227: Disable 
notifications [puppet] - 10https://gerrit.wikimedia.org/r/1039178 (https://phabricator.wikimedia.org/T362745) [08:43:21] (03CR) 10Marostegui: [C:03+2] db1227: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1039178 (https://phabricator.wikimedia.org/T362745) (owner: 10Marostegui) [08:43:41] jouncebot: nowandnext [08:43:41] No deployments scheduled for the next 1 hour(s) and 16 minute(s) [08:43:41] In 1 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1000) [08:43:44] FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:43:55] (03PS5) 10Effie Mouzeli: ipoid: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031105 (https://phabricator.wikimedia.org/T346638) (owner: 10Scott French) [08:44:25] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64602/IPv4: Connect - kubernetes-co [08:44:25] 602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:44:25] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - 
kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw [08:44:25] /IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:44:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1227.eqiad.wmnet with OS bookworm [08:44:56] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4044.ulsfo.wmnet [08:45:17] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4052.ulsfo.wmnet [08:45:46] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1029.eqiad.wmnet [08:45:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P64069 and previous config saved to /var/cache/conftool/dbconfig/20240605-084547-root.json [08:45:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2002.codfw.wmnet [08:45:59] Dreamy_Jazz: Deployments may fail as I'm rebooting the whole k8s cluster in codfw, which means most of the nodes are cordoned off [08:46:20] Thanks for the heads up. [08:46:29] Any thoughts on when this might be complete? 
[08:47:44] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1053.eqiad.wmnet [08:47:56] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2052.codfw.wmnet [08:48:27] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:48:27] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:48:32] (03CR) 10Effie Mouzeli: [C:03+2] ipoid: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031105 (https://phabricator.wikimedia.org/T346638) (owner: 10Scott French) [08:49:19] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9862669 (10cmooney) [08:49:28] Dreamy_Jazz: I fear it's going to take most of the day, although we may be able to run deployments once we cross a certain threshold of rebooted nodes [08:49:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2002.codfw.wmnet [08:50:01] (03Merged) 10jenkins-bot: ipoid: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031105 (https://phabricator.wikimedia.org/T346638) (owner: 10Scott French) [08:50:23] !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp4044.ulsfo.wmnet [08:50:36] !log fabfur@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cp4044.ulsfo.wmnet [08:51:03] !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp4044.ulsfo.wmnet [08:51:25] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [08:51:32] !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp4052.ulsfo.wmnet [08:52:10] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host cloudcephosd1029.eqiad.wmnet [08:52:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P64070 and previous config saved to /var/cache/conftool/dbconfig/20240605-085239-root.json [08:52:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:53:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1004.wikimedia.org [08:53:36] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/ipoid: apply [08:54:02] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2052.codfw.wmnet [08:54:11] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1053.eqiad.wmnet [08:54:33] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [08:55:45] FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:57:22] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1054.eqiad.wmnet [08:57:27] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2053.codfw.wmnet [08:57:27] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw, [08:57:27] IPv4: Active - kubernetes-ml-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:57:29] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codf [08:57:29] 2/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:57:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1004.wikimedia.org [08:58:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1227.eqiad.wmnet with reason: host reimage [08:58:31] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [08:58:56] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [09:00:22] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4044.ulsfo.wmnet [09:00:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P64071 and previous config saved to /var/cache/conftool/dbconfig/20240605-090053-root.json [09:01:05] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4052.ulsfo.wmnet [09:01:30] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1001.eqiad.wmnet on all recursors [09:01:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1001.eqiad.wmnet on all recursors [09:02:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
2:00:00 on db1227.eqiad.wmnet with reason: host reimage [09:02:20] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1001.eqiad.wmnet with OS bullseye [09:02:30] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:03:26] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:03:44] RESOLVED: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:04:22] (03CR) 10Kamila Součková: [C:03+1] mw-web, mw-api-ext: Raise replicas for 90% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038732 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [09:06:10] !log brouberol@cumin2002 START - Cookbook sre.druid.roll-restart-workers for Druid test cluster: Roll restart of Druid jvm daemons. 
[09:06:32] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet [09:06:50] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4044.ulsfo.wmnet [09:07:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P64072 and previous config saved to /var/cache/conftool/dbconfig/20240605-090745-root.json [09:09:28] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS6 [09:09:28] : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:09:32] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS [09:09:32] 4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:11:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:11:27] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1054.eqiad.wmnet [09:11:28] RECOVERY - 
BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:11:32] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:11:48] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1055.eqiad.wmnet [09:12:38] (03PS1) 10Marostegui: Revert "db1227: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1038836 [09:13:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:15:38] !log brouberol@cumin2002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid test cluster: Roll restart of Druid jvm daemons. 
[09:16:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P64073 and previous config saved to /var/cache/conftool/dbconfig/20240605-091559-root.json [09:17:04] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1001.eqiad.wmnet with reason: host reimage [09:18:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:18:33] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1055.eqiad.wmnet [09:19:32] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw [09:19:32] /IPv6: Connect - kubernetes-ml-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:19:34] (03PS3) 10Hnowlan: kubernetes: rename and reimage 3 api appservers, 2 appservers [puppet] - 10https://gerrit.wikimedia.org/r/1038757 (https://phabricator.wikimedia.org/T362323) [09:19:34] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - 
kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64602/IPv6: Connect - kubernetes-codfw [09:19:34] /IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:19:48] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1056.eqiad.wmnet [09:19:50] PROBLEM - Host ms-be2053 is DOWN: PING CRITICAL - Packet loss = 100% [09:20:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1001.eqiad.wmnet with reason: host reimage [09:20:18] RECOVERY - Host ms-be2053 is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms [09:21:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:21:52] (03CR) 10Hnowlan: [C:03+2] kubernetes: rename and reimage 3 api appservers, 2 appservers [puppet] - 10https://gerrit.wikimedia.org/r/1038757 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [09:22:22] (03CR) 10Marostegui: [C:03+2] Revert "db1227: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1038836 (owner: 10Marostegui) [09:22:42] hnowlan: good to merge? [09:22:50] marostegui: yep, please do [09:22:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P64074 and previous config saved to /var/cache/conftool/dbconfig/20240605-092251-root.json [09:23:04] hnowlan: merging! [09:23:16] thanks! 
[09:23:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P64075 and previous config saved to /var/cache/conftool/dbconfig/20240605-092324-root.json [09:23:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1227.eqiad.wmnet with OS bookworm [09:23:35] (03CR) 10Hashar: [C:03+2] plugins: Add wm-schedule-deployment plugin (031 comment) [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512) (owner: 10BryanDavis) [09:24:12] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2053.codfw.wmnet [09:24:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:24:32] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:24:36] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:24:50] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2054.codfw.wmnet [09:24:55] (03PS1) 10Marostegui: db1227: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1039182 [09:25:34] (03CR) 10Marostegui: [C:03+2] db1227: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1039182 (owner: 10Marostegui) [09:26:02] (03Abandoned) 10Stevemunene: Change datahub service to use dse ingress [puppet] - 10https://gerrit.wikimedia.org/r/1032399 (https://phabricator.wikimedia.org/T363450) (owner: 10Stevemunene) [09:26:03] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
ms-be1056.eqiad.wmnet [09:26:08] (03PS4) 10Ayounsi: Fix lots of CI errors [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1038869 [09:26:08] (03PS16) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 [09:26:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:26:35] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1030.eqiad.wmnet [09:26:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:27:09] (03CR) 10CI reject: [V:04-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [09:29:16] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1400 to wikikube-worker1008.eqiad.wmnet [09:29:30] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from mw1400 to wikikube-worker1008.eqiad.wmnet [09:30:05] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1400 to wikikube-worker1008 [09:30:11] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [09:31:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P64076 and previous config saved to /var/cache/conftool/dbconfig/20240605-093105-root.json [09:31:15] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1401 to wikikube-worker1009 [09:31:25] FIRING: [4x] SystemdUnitFailed: 
docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:31:27] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1030.eqiad.wmnet [09:31:40] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1410 to wikikube-worker1010.eqiad.wmnet [09:31:44] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from mw1410 to wikikube-worker1010.eqiad.wmnet [09:31:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:32:19] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[labs/private] - 10https://gerrit.wikimedia.org/r/1039173 (https://phabricator.wikimedia.org/T360414) (owner: 10Muehlenhoff) [09:32:36] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, A [09:32:36] v6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:32:40] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, A [09:32:40] v6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:33:07] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1400 to wikikube-worker1008 - hnowlan@cumin1002" [09:33:19] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1410 to wikikube-worker1010 [09:33:22] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete stub certs [labs/private] - 10https://gerrit.wikimedia.org/r/1039173 (https://phabricator.wikimedia.org/T360414) (owner: 10Muehlenhoff) [09:34:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2003.wikimedia.org [09:34:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on 6 hosts with reason: Reimage x2 
eqiad master T366677 [09:34:24] T366677: Reimage x2 eqiad master - https://phabricator.wikimedia.org/T366677 [09:34:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:34:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 6 hosts with reason: Reimage x2 eqiad master T366677 [09:34:38] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:34:38] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:35:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1151 to temp x2 eqiad master T366677', diff saved to https://phabricator.wikimedia.org/P64077 and previous config saved to /var/cache/conftool/dbconfig/20240605-093507-root.json [09:35:16] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [09:35:22] (03PS5) 10Ayounsi: Fix lots of CI errors [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1038869 [09:35:22] (03PS17) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 [09:35:35] (03PS1) 10Gerrit maintenance bot: mariadb: Promote es1037 to es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1038792 (https://phabricator.wikimedia.org/T366678) [09:35:48] (03PS1) 10Marostegui: db1152: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1039183 (https://phabricator.wikimedia.org/T366677) [09:36:14] (03CR) 10Marostegui: [C:03+2] db1152: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1039183 (https://phabricator.wikimedia.org/T366677) (owner: 10Marostegui) [09:36:16] claime: sre.dns.netbox is asking 
me to set wikikube-worker1001 to failed - is that okay? [09:36:23] (03CR) 10CI reject: [V:04-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [09:36:29] hnowlan: huh what [09:36:31] it's not failed [09:36:41] wait no [09:36:41] sorry [09:36:51] it's setting `profile::netbox::host::status: active` [09:36:57] ah yeah that's fine [09:37:08] I'd run the cookbook after changing the status, weird [09:37:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1152.eqiad.wmnet with OS bookworm [09:37:56] (03PS1) 10Muehlenhoff: Remove obsolete LDAP stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1039184 [09:37:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P64078 and previous config saved to /var/cache/conftool/dbconfig/20240605-093757-root.json [09:38:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P64079 and previous config saved to /var/cache/conftool/dbconfig/20240605-093830-root.json [09:38:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2003.wikimedia.org [09:38:40] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1400 to wikikube-worker1008 - hnowlan@cumin1002" [09:38:40] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:38:40] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1008 [09:38:46] (03CR) 10Alexandros Kosiaris: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta 
Harlan) [09:38:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2054.codfw.wmnet [09:39:40] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:39:42] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:39:45] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1401 to wikikube-worker1009 - hnowlan@cumin1002" [09:40:06] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1008 [09:40:14] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1400 to wikikube-worker1008 [09:41:00] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [09:41:15] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1401 to wikikube-worker1009 - hnowlan@cumin1002" [09:41:15] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:41:16] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1009 [09:41:41] (03CR) 10Muehlenhoff: [C:03+2] Switch maps/eqiad to PKI as well [puppet] - 10https://gerrit.wikimedia.org/r/1038815 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [09:42:15] (03PS1) 10Marostegui: wmnet: Add CNAMEs for es6 and es7 [dns] - 10https://gerrit.wikimedia.org/r/1039185 (https://phabricator.wikimedia.org/T365098) [09:42:26] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[09:42:26] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1010 [09:43:16] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1009 [09:43:24] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1401 to wikikube-worker1009 [09:43:34] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1001.eqiad.wmnet with OS bullseye [09:43:40] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:43:44] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:44:05] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1428 to wikikube-worker1011 [09:44:08] !log homer 'cr*eqiad*' commit 'T351074' [09:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:10] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [09:44:22] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [09:44:45] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1010 [09:44:51] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1456 to wikikube-worker1012 [09:44:53] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1410 to wikikube-worker1010 [09:45:23] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=99) from mw1456 to wikikube-worker1012 [09:45:38] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [09:45:47] (03PS12) 10DCausse: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 
(https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper) [09:46:01] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=99) from mw1428 to wikikube-worker1011 [09:46:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P64080 and previous config saved to /var/cache/conftool/dbconfig/20240605-094611-root.json [09:46:42] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1428 to wikikube-worker1011 [09:46:46] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS6 [09:46:46] : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:46:47] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [09:47:12] (03CR) 10Ladsgroup: [C:03+1] wmnet: Add CNAMEs for es6 and es7 [dns] - 10https://gerrit.wikimedia.org/r/1039185 (https://phabricator.wikimedia.org/T365098) (owner: 10Marostegui) [09:47:41] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9862852 (10akosiaris) [09:47:44] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS [09:47:44] 4: Connect - 
kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:48:51] (03PS1) 10Muehlenhoff: profile::maps::tlsproxy: Unconditionally use PKI [puppet] - 10https://gerrit.wikimedia.org/r/1039188 (https://phabricator.wikimedia.org/T360778) [09:49:13] (03Abandoned) 10Ladsgroup: mariadb: Promote es1037 to es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1038792 (https://phabricator.wikimedia.org/T366678) (owner: 10Gerrit maintenance bot) [09:49:15] (03PS10) 10Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [09:49:30] (03Abandoned) 10Ladsgroup: mariadb: Promote db2114 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/941917 (https://phabricator.wikimedia.org/T342947) (owner: 10Gerrit maintenance bot) [09:49:36] (03Abandoned) 10Ladsgroup: mariadb: Promote db1183 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/942787 (https://phabricator.wikimedia.org/T343078) (owner: 10Gerrit maintenance bot) [09:49:39] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1428 to wikikube-worker1011 - hnowlan@cumin1002" [09:49:42] (03Abandoned) 10Ladsgroup: mariadb: Promote db1136 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/959966 (https://phabricator.wikimedia.org/T347140) (owner: 10Gerrit maintenance bot) [09:49:44] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:49:46] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:49:47] (03Abandoned) 10Ladsgroup: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/997489 (https://phabricator.wikimedia.org/T356650) 
(owner: 10Gerrit maintenance bot) [09:49:57] (03Abandoned) 10Ladsgroup: mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/997488 (https://phabricator.wikimedia.org/T356650) (owner: 10Gerrit maintenance bot) [09:50:14] (03Abandoned) 10Ladsgroup: mariadb: Promote db2204 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1016377 (https://phabricator.wikimedia.org/T361780) (owner: 10Gerrit maintenance bot) [09:50:45] (03CR) 10Marostegui: [C:03+2] wmnet: Add CNAMEs for es6 and es7 [dns] - 10https://gerrit.wikimedia.org/r/1039185 (https://phabricator.wikimedia.org/T365098) (owner: 10Marostegui) [09:50:54] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1428 to wikikube-worker1011 - hnowlan@cumin1002" [09:50:54] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:50:54] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1011 [09:51:18] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1456 to wikikube-worker1012 [09:51:24] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [09:51:32] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1031.eqiad.wmnet [09:51:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1152.eqiad.wmnet with reason: host reimage [09:51:44] (03PS1) 10Muehlenhoff: Remove tendril stub cert [labs/private] - 10https://gerrit.wikimedia.org/r/1039189 [09:51:48] (03PS13) 10DCausse: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper) [09:52:06] (03Abandoned) 10Ladsgroup: wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/1025917 (https://phabricator.wikimedia.org/T364067) (owner: 10Gerrit maintenance bot) [09:52:16] (03Abandoned) 
10Ladsgroup: mariadb: Promote db1192 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1028939 (https://phabricator.wikimedia.org/T364541) (owner: 10Gerrit maintenance bot) [09:52:21] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1011 [09:52:27] (03Abandoned) 10Ladsgroup: mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1024756 (https://phabricator.wikimedia.org/T363689) (owner: 10Gerrit maintenance bot) [09:52:29] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1428 to wikikube-worker1011 [09:52:35] (03Abandoned) 10Ladsgroup: mariadb: Promote db2204 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1036601 (https://phabricator.wikimedia.org/T366241) (owner: 10Gerrit maintenance bot) [09:52:58] (03Abandoned) 10Ladsgroup: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1024757 (https://phabricator.wikimedia.org/T363689) (owner: 10Gerrit maintenance bot) [09:53:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P64081 and previous config saved to /var/cache/conftool/dbconfig/20240605-095303-root.json [09:53:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P64082 and previous config saved to /var/cache/conftool/dbconfig/20240605-095336-root.json [09:53:39] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1456 to wikikube-worker1012 - hnowlan@cumin1002" [09:53:44] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:53:46] PROBLEM - BGP status on 
cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:54:21] !log jmm@cumin2002 START - Cookbook sre.netbox.restart-reboot rolling reboot on A:netbox [09:54:25] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. on all recursors [09:54:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. on all recursors [09:54:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1152.eqiad.wmnet with reason: host reimage [09:54:53] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1456 to wikikube-worker1012 - hnowlan@cumin1002" [09:54:53] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:54:53] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1012 [09:55:02] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1008.eqiad.wmnet wikikube-worker1009.eqiad.wmnet wikikube-worker1010.eqiad.wmnet wikikube-worker1011.eqiad.wmnet wikikube-worker1012.eqiad.wmnet on all recursors [09:55:06] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1008.eqiad.wmnet wikikube-worker1009.eqiad.wmnet wikikube-worker1010.eqiad.wmnet wikikube-worker1011.eqiad.wmnet wikikube-worker1012.eqiad.wmnet on all recursors [09:55:28] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1008.eqiad.wmnet with OS bullseye [09:55:42] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1009.eqiad.wmnet with OS bullseye [09:55:46] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:55:48] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:55:56] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1010.eqiad.wmnet with OS bullseye [09:56:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:56:41] (03PS1) 10Marostegui: Revert "db1152: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1038838 [09:56:50] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [09:57:06] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1012 [09:57:14] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1456 to wikikube-worker1012 [09:58:29] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: SREBatchBase: Don't require passing an alias if only one alias is possible - https://phabricator.wikimedia.org/T366680 (10MoritzMuehlenhoff) 03NEW [09:58:36] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: SREBatchBase: Don't require passing an alias if only one alias is possible - https://phabricator.wikimedia.org/T366680#9862914 (10MoritzMuehlenhoff) p:05Triage→03Medium [09:58:41] (03PS11) 10Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [09:58:51] !log pooling and uncordoning wikikube-worker1001 - T351074 [09:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:54] 
T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [09:59:01] !log cgoubert@cumin1002 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-worker1001.eqiad.wmnet,cluster=kubernetes,service=kubesvc [09:59:04] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. on all recursors [09:59:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. on all recursors [09:59:29] (03PS1) 10Gerrit maintenance bot: mariadb: Promote es1039 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1038793 (https://phabricator.wikimedia.org/T366682) [09:59:33] (03PS1) 10Gerrit maintenance bot: wmnet: Update es7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1038794 (https://phabricator.wikimedia.org/T366682) [09:59:48] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039188 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [10:00:05] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2055.codfw.wmnet [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1000) [10:00:08] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1011.eqiad.wmnet with OS bullseye [10:00:09] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1057.eqiad.wmnet [10:00:15] (03CR) 10CI reject: [V:04-1] wmnet: Update es7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1038794 (https://phabricator.wikimedia.org/T366682) (owner: 10Gerrit maintenance bot) [10:00:16] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1012.eqiad.wmnet with OS bullseye [10:00:20] !log disabling puppet on cp4037 to test Benthos performances (T358109) [10:00:27] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [10:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:31] T358109: Install new Benthos instance on cp hosts - https://phabricator.wikimedia.org/T358109 [10:00:45] FIRING: [2x] SystemdUnitFailed: prometheus-redis-exporter@6380.service on netbox2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:01:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P64083 and previous config saved to /var/cache/conftool/dbconfig/20240605-100117-root.json [10:01:48] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, A [10:01:48] v4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:01:50] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS [10:01:50] 4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:03:21] (03CR) 10Klausman: base 
functions: make sleep() output a bit friendlier (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759 (owner: 10Klausman) [10:03:44] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:03:48] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:03:49] (03CR) 10Alexandros Kosiaris: [C:04-1] "I 'd follow the same approach for puppetserver AND puppetmaster manifests. In this patch I am commenting on, the approach differs." [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [10:03:50] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:06:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:06:33] (03Abandoned) 10Ladsgroup: wmnet: Update es7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1038794 (https://phabricator.wikimedia.org/T366682) (owner: 10Gerrit maintenance bot) [10:06:52] (03Abandoned) 10Ladsgroup: mariadb: Promote es1039 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1038793 (https://phabricator.wikimedia.org/T366682) (owner: 10Gerrit maintenance bot) [10:07:02] PROBLEM - Host mw1401 is DOWN: PING CRITICAL - Packet loss = 100% [10:07:52] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:07:52] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:07:58] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:08:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P64084 and previous config saved to /var/cache/conftool/dbconfig/20240605-100810-root.json [10:08:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P64085 and previous config saved to /var/cache/conftool/dbconfig/20240605-100842-root.json [10:08:48] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:09:08] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1008.eqiad.wmnet with reason: host reimage [10:09:43] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1010.eqiad.wmnet with reason: host reimage [10:10:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1152 back to x2 eqiad master T366677', diff saved to https://phabricator.wikimedia.org/P64086 and previous config saved to /var/cache/conftool/dbconfig/20240605-101019-root.json [10:10:23] T366677: Reimage x2 eqiad master - https://phabricator.wikimedia.org/T366677 [10:11:45] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
wikikube-worker1008.eqiad.wmnet with reason: host reimage [10:11:47] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:11:56] (03CR) 10Marostegui: [C:03+2] Revert "db1152: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1038838 (owner: 10Marostegui) [10:11:57] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:12:05] RECOVERY - Host mw1401 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [10:13:17] (03PS12) 10Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [10:13:34] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1012.eqiad.wmnet with reason: host reimage [10:13:36] !log dcaro@cumin1002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cloudcephosd1031.eqiad.wmnet [10:13:45] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:13:59] (03CR) 10Effie Mouzeli: "done" [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [10:14:02] (03PS1) 10Muehlenhoff: Remove kartotherian.discovery.wmnet.crt cergen cert [puppet] - 10https://gerrit.wikimedia.org/r/1039190 (https://phabricator.wikimedia.org/T360778) [10:14:32] (03CR) 10Effie Mouzeli: [C:03+1] Remove kartotherian.discovery.wmnet.crt cergen cert [puppet] - 10https://gerrit.wikimedia.org/r/1039190 
(https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [10:14:33] 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure Security, and 3 others: June 2024 Bullseye database backups reboots - https://phabricator.wikimedia.org/T366684 (10Marostegui) 03NEW [10:15:00] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2203.codfw.wmnet with reason: Maintenance [10:15:02] (03PS1) 10Muehlenhoff: Remove kartotherian stub cert [labs/private] - 10https://gerrit.wikimedia.org/r/1039191 (https://phabricator.wikimedia.org/T360778) [10:15:06] (03CR) 10Klausman: base functions: make sleep() output a bit friendlier (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759 (owner: 10Klausman) [10:15:09] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1010.eqiad.wmnet with reason: host reimage [10:15:13] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2203.codfw.wmnet with reason: Maintenance [10:15:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2203 (T352010)', diff saved to https://phabricator.wikimedia.org/P64087 and previous config saved to /var/cache/conftool/dbconfig/20240605-101521-ladsgroup.json [10:15:24] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [10:15:39] (03PS4) 10Klausman: base functions: make sleep() output a bit friendlier [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759 [10:16:11] (03CR) 10Btullis: [V:03+1 C:03+2] Prepare stat100[4-7] for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1038329 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [10:16:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1152.eqiad.wmnet with OS bookworm [10:16:43] (03CR) 10Alexandros Kosiaris: [C:04-1] "Functionaly ok, 2 nitpicks and LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [10:17:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P64088 and previous config saved to /var/cache/conftool/dbconfig/20240605-101744-ladsgroup.json [10:18:48] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1012.eqiad.wmnet with reason: host reimage [10:19:32] (03CR) 10CI reject: [V:04-1] base functions: make sleep() output a bit friendlier [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759 (owner: 10Klausman) [10:20:49] FIRING: [4x] SystemdUnitFailed: prometheus-redis-exporter@6380.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:21:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:21:46] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1057.eqiad.wmnet [10:21:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2055.codfw.wmnet [10:21:53] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:21:53] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:22:33] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2056.codfw.wmnet [10:22:38] !log mvernon@cumin1002 START - Cookbook 
sre.hosts.reboot-single for host ms-be1058.eqiad.wmnet [10:22:43] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2207.codfw.wmnet with reason: Maintenance [10:22:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2207.codfw.wmnet with reason: Maintenance [10:22:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2207 (T352010)', diff saved to https://phabricator.wikimedia.org/P64090 and previous config saved to /var/cache/conftool/dbconfig/20240605-102252-ladsgroup.json [10:22:56] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [10:23:00] (03CR) 10Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [10:23:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P64091 and previous config saved to /var/cache/conftool/dbconfig/20240605-102348-root.json [10:23:54] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9863049 (10cmooney) @Jclark-ctr @VRiley-WMF unfortunately these switch upgrades require us to shift some cables around before/after the upgrade to avoid disrupting services.... 
[10:24:55] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, A [10:24:55] v6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:24:55] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, A [10:24:55] v4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:26:47] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [10:26:55] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:26:55] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:27:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.netbox.restart-reboot (exit_code=0) rolling reboot on A:netbox [10:28:34] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9863056 (10VRiley-WMF) @cmooney as it turns out, I will be out until June 10th. 
[10:29:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039190 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [10:29:23] (03PS2) 10Giuseppe Lavagetto: Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) [10:30:16] (03CR) 10CI reject: [V:04-1] Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [10:30:36] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1058.eqiad.wmnet [10:30:43] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1008.eqiad.wmnet with OS bullseye [10:31:04] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2056.codfw.wmnet [10:31:56] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:31:58] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:32:04] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1162 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1038795 (https://phabricator.wikimedia.org/T366687) [10:32:07] (03CR) 10Muehlenhoff: [C:03+2] Remove kartotherian.discovery.wmnet.crt cergen cert [puppet] - 10https://gerrit.wikimedia.org/r/1039190 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [10:32:08] (03PS1) 10Gerrit maintenance bot: wmnet: Update s2-master alias [dns] - 10https://gerrit.wikimedia.org/r/1038796 (https://phabricator.wikimedia.org/T366687) [10:32:11] !log 
mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2057.codfw.wmnet [10:32:14] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1059.eqiad.wmnet [10:32:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P64093 and previous config saved to /var/cache/conftool/dbconfig/20240605-103251-ladsgroup.json [10:33:06] (03CR) 10Giuseppe Lavagetto: [C:03+1] sextant cache: Allow defining mcrouter's clusterIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038858 (owner: 10Alexandros Kosiaris) [10:33:07] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove kartotherian stub cert [labs/private] - 10https://gerrit.wikimedia.org/r/1039191 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [10:34:02] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1010.eqiad.wmnet with OS bullseye [10:34:58] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:34:58] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:35:45] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-codfw [10:36:52] (03PS3) 10Giuseppe Lavagetto: Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) [10:37:15] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1012.eqiad.wmnet with OS bullseye [10:37:26] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1003.eqiad.wmnet [10:37:33] (03CR) 10CI reject: [V:04-1] Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 
(https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [10:38:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P64094 and previous config saved to /var/cache/conftool/dbconfig/20240605-103854-root.json [10:39:00] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS6 [10:39:00] : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:39:00] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS6 [10:39:00] : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:39:25] FIRING: [3x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:39:47] (03PS4) 10Giuseppe Lavagetto: Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) [10:39:51] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM 
ml-etcd1003.eqiad.wmnet [10:40:32] (03CR) 10CI reject: [V:04-1] Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [10:40:49] FIRING: [5x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:40:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2057.codfw.wmnet [10:41:06] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:41:47] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9863128 (10cmooney) >>! In T366361#9863056, @VRiley-WMF wrote: > @cmooney as it turns out, I will be out until June 10th. No probs, enjoy the time off. I'll see if maybe J... 
[10:41:58] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:42:17] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2058.codfw.wmnet [10:44:25] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:46:22] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1059.eqiad.wmnet [10:46:49] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1060.eqiad.wmnet [10:47:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P64096 and previous config saved to /var/cache/conftool/dbconfig/20240605-104757-ladsgroup.json [10:49:25] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:50:06] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2058.codfw.wmnet [10:50:28] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2059.codfw.wmnet [10:51:00] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, A [10:51:00] v4: 
Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:51:02] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS [10:51:02] 6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:52:03] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-codfw [10:52:12] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1009.eqiad.wmnet with OS bullseye [10:53:00] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:53:00] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:53:04] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1009.eqiad.wmnet with OS bullseye [10:53:38] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9863185 (10MoritzMuehlenhoff) [10:53:41] !log hnowlan@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1011.eqiad.wmnet with OS bullseye [10:53:54] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1011.eqiad.wmnet with OS bullseye [10:54:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 100%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P64097 and previous config saved to /var/cache/conftool/dbconfig/20240605-105400-root.json [10:54:25] FIRING: [3x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:54:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:54:43] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1060.eqiad.wmnet [10:55:02] PROBLEM - Host mw1401 is DOWN: PING CRITICAL - Packet loss = 100% [10:55:29] (03CR) 10Elukey: "Should we wait for the new docker image with the heavy-rev-id logic?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274) (owner: 10Ilias Sarantopoulos) [10:56:29] 06SRE, 10Maps, 06serviceops, 13Patch-For-Review: Move maps/karthoterian to PKI/cfssl - https://phabricator.wikimedia.org/T360778#9863179 (10MoritzMuehlenhoff) 05Open→03Resolved a:05jijiki→03MoritzMuehlenhoff maps is now using cfssl. 
[10:57:30] RECOVERY - Host mw1401 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [10:57:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:59:25] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:59:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-codfw [11:00:04] mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1100). [11:03:01] (03PS5) 10Giuseppe Lavagetto: Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) [11:03:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P64098 and previous config saved to /var/cache/conftool/dbconfig/20240605-110303-ladsgroup.json [11:03:06] !log restarted send_tile_invalidations.service on maps1009 [11:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:52] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1031.eqiad.wmnet with OS bullseye [11:04:03] PROBLEM - Host mw1401 is DOWN: PING CRITICAL - Packet loss = 100% [11:04:25] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:04:39] RECOVERY - Host mw1401 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [11:04:54] (03PS1) 10Dreamy Jazz: Follow-up: Don't run interact with block buttons if they don't exist [extensions/CheckUser] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038839 (https://phabricator.wikimedia.org/T329493) [11:05:05] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS [11:05:05] 4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:06:05] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS [11:06:05] 6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:06:15] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2059.codfw.wmnet [11:06:37] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1009.eqiad.wmnet with reason: host reimage [11:06:50] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1011.eqiad.wmnet with reason: host reimage [11:07:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on 
maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:08:09] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:09:05] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:09:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:09:42] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1009.eqiad.wmnet with reason: host reimage [11:09:55] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:11:17] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: SREBatchBase: Don't require passing an alias if only one alias is possible - https://phabricator.wikimedia.org/T366680#9863237 (10Volans) I'd like to know if there is a wider agreement on this before implementing it. It seems reasonable to me but it will affe... [11:12:41] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: SREBatchBase: Don't require passing an alias if only one alias is possible - https://phabricator.wikimedia.org/T366680#9863251 (10MoritzMuehlenhoff) Sure thing, but there's also no real impact, anyone who continues to pass the --alias for these kind of cookbo... 
[11:12:51] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1011.eqiad.wmnet with reason: host reimage [11:15:08] (03PS1) 10Urbanecm: [beta] Create frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039194 (https://phabricator.wikimedia.org/T366691) [11:16:40] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: SREBatchBase: Don't require passing an alias if only one alias is possible - https://phabricator.wikimedia.org/T366680#9863260 (10Volans) To add a dumb change that makes aliases and query optional and then checks for them later is easy. But at this point it w... [11:17:54] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1031.eqiad.wmnet with reason: host reimage [11:18:07] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS [11:18:07] 4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:18:09] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS [11:18:09] 4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:19:40] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:19:49] (03CR) 10Urbanecm: [C:03+2] [beta] Create frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039194 (https://phabricator.wikimedia.org/T366691) (owner: 10Urbanecm) [11:20:05] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:20:09] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:20:26] (03Merged) 10jenkins-bot: [beta] Create frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039194 (https://phabricator.wikimedia.org/T366691) (owner: 10Urbanecm) [11:21:09] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1031.eqiad.wmnet with reason: host reimage [11:23:05] 06SRE, 10Wikimedia-Mailing-lists: Create a mailing list for Bangla Wikimoitree - https://phabricator.wikimedia.org/T365915#9863270 (10Ladsgroup) >>! In T365915#9862244, @Bodhisattwa wrote: > Seeing the ESEAP mailing list, I think, it would be OK, if we get the name as wikimoitree@lists.wikimedia.org It is no... 
[11:23:07] PROBLEM - Host mw1401 is DOWN: PING CRITICAL - Packet loss = 100% [11:24:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:24:55] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:25:37] RECOVERY - Host mw1401 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [11:25:40] (03CR) 10Clément Goubert: Add new chart statsd-exporter (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [11:27:10] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1009.eqiad.wmnet with OS bullseye [11:29:39] (03PS6) 10Ayounsi: Fix lots of CI errors [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1038869 [11:29:39] (03PS18) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 [11:30:04] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1061.eqiad.wmnet [11:30:11] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2060.codfw.wmnet [11:31:37] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1011.eqiad.wmnet with OS bullseye [11:31:48] !log running homer to configure bgp on 5 new k8s workers [11:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:09] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - 
kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:32:15] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:32:19] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:34:09] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:34:15] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:34:21] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:34:40] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:36:35] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-eqiad [11:37:27] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1031.eqiad.wmnet with OS bullseye 
[11:38:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2060.codfw.wmnet [11:38:25] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1061.eqiad.wmnet [11:38:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox-dev2002.codfw.wmnet [11:38:44] FIRING: [5x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:39:13] !log hnowlan@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker1008.eqiad.wmnet|wikikube-worker1009.eqiad.wmnet|wikikube-worker1010.eqiad.wmnet|wikikube-worker1011.eqiad.wmnet|wikikube-worker1012.eqiad.wmnet),cluster=kubernetes,service=kubesvc [11:39:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2002.codfw.wmnet [11:39:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:40:55] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:41:06] (03CR) 10Urbanecm: [C:04-1] "function-wise, lgtm, but i think we either want to remove both of the notes, or none of them, and this patch removes just one. 
-1 for visi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) (owner: 10Sergio Gimeno) [11:41:25] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2061.codfw.wmnet [11:41:28] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1062.eqiad.wmnet [11:41:41] (03PS2) 10Urbanecm: Growth: Use `growthexperiments` DB list for enabling GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038882 (https://phabricator.wikimedia.org/T364892) [11:41:48] (03PS6) 10Sergio Gimeno: [Beta] Enable CommunityConfiguration extension in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) [11:43:20] (03PS1) 10Hnowlan: mw-web, mw-api-ext: Raise replicas for 95% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039196 (https://phabricator.wikimedia.org/T362323) [11:44:05] (03PS1) 10Effie Mouzeli: mc.php: if $_SERVER['MCROUTER_SERVER'] is set, resolve it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039197 (https://phabricator.wikimedia.org/T363186) [11:44:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-eqiad [11:44:41] (03CR) 10CI reject: [V:04-1] mc.php: if $_SERVER['MCROUTER_SERVER'] is set, resolve it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039197 (https://phabricator.wikimedia.org/T363186) (owner: 10Effie Mouzeli) [11:45:04] (03PS2) 10Effie Mouzeli: mc.php: if $MCROUTER_SERVER is set, resolve it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039197 (https://phabricator.wikimedia.org/T363186) [11:45:41] (03CR) 10CI reject: [V:04-1] mc.php: if $MCROUTER_SERVER is set, resolve it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039197 (https://phabricator.wikimedia.org/T363186) (owner: 10Effie Mouzeli) [11:45:45] (03CR) 10Ilias Sarantopoulos: "yes! 
I created this before the new patch, so I'll w8 to update the image as well." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274) (owner: 10Ilias Sarantopoulos) [11:46:11] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_ [11:46:11] g%23BGP_status [11:46:23] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_ [11:46:23] g%23BGP_status [11:47:21] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:48:21] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:48:31] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2061.codfw.wmnet [11:49:06] 06SRE, 06Infrastructure-Foundations: Move the ping* servers to Bookworm - https://phabricator.wikimedia.org/T366695 (10MoritzMuehlenhoff) 03NEW [11:49:13] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:49:15] 06SRE, 06Infrastructure-Foundations: Move the ping* servers to Bookworm - https://phabricator.wikimedia.org/T366695#9863378 (10MoritzMuehlenhoff) p:05Triage→03Medium [11:49:21] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:49:40] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:49:42] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2062.codfw.wmnet [11:50:25] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1062.eqiad.wmnet [11:52:04] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1063.eqiad.wmnet [11:52:48] (03PS1) 10Muehlenhoff: Add new ping servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1039199 (https://phabricator.wikimedia.org/T366695) [11:53:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:54:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:55:55] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:56:49] (03CR) 10Muehlenhoff: [C:03+2] Add new ping servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1039199 (https://phabricator.wikimedia.org/T366695) (owner: 10Muehlenhoff) [11:57:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2062.codfw.wmnet [11:58:02] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2063.codfw.wmnet [11:58:05] (03PS5) 10Klausman: base functions: make sleep() output a bit friendlier [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759 [11:58:17] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, A [11:58:17] v6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:58:25] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, A [11:58:25] v6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:00:15] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:00:20] !log 
mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1063.eqiad.wmnet [12:00:25] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:00:31] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1064.eqiad.wmnet [12:00:55] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:01:03] (03PS1) 10Muehlenhoff: Switch statistics::explorer to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1039200 (https://phabricator.wikimedia.org/T349619) [12:03:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6003.drmrs.wmnet [12:03:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:04:40] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:04:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report [12:04:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0) [12:05:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ping2004.codfw.wmnet [12:05:33] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:05:48] FIRING: PuppetDisabled: Puppet disabled on mc2049:9100 - 
https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=memcached&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [12:05:56] (03CR) 10Muehlenhoff: [C:03+2] Switch statistics::explorer to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1039200 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:08:28] (03CR) 10Kosta Harlan: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [12:08:32] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1064.eqiad.wmnet [12:28:55] Ok I'll stop the cookbook and uncordon for the backport window and restart afterwards [12:29:05] Okay. Thanks. [12:29:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2065.codfw.wmnet [12:29:25] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:29:35] It's a little messy as I need to uncordon the nodes manually [12:29:40] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1066.eqiad.wmnet [12:29:45] I'll stop it once that batch of 5 is done [12:31:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T364299)', diff saved to https://phabricator.wikimedia.org/P64099 and previous config saved to /var/cache/conftool/dbconfig/20240605-123059-marostegui.json [12:31:05] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [12:31:37] There is also https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1630 which 
has been scheduled and I think also needs use of `scap backport` [12:31:58] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1067.eqiad.wmnet [12:32:03] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet [12:32:48] I hope I'll be done by 1630 UTC [12:33:00] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ping2004.codfw.wmnet with reason: host reimage [12:33:05] 👍 [12:33:17] (even with stopping for the backport window) [12:33:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6004.drmrs.wmnet [12:33:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6004.drmrs.wmnet [12:33:47] If I'm not, I'll do the same and stop it then restart it later, I don't like letting cookbooks like this one run after the end of my day anyways [12:33:57] (03CR) 10Fabfur: [C:03+2] hiera: enable IPIP for high-traffic1@magru for text services [puppet] - 10https://gerrit.wikimedia.org/r/1038698 (https://phabricator.wikimedia.org/T366466) (owner: 10Fabfur) [12:34:20] (What I really need to do is fix the rollback for that cookbook to uncordon the nodes so I don't have to do it manually) [12:34:25] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:34:40] (03PS1) 10Muehlenhoff: Configure memcached on idp-test hosts to run as 'memcache' [puppet] - 10https://gerrit.wikimedia.org/r/1039206 (https://phabricator.wikimedia.org/T273950) [12:35:33] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: 
Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS6 [12:35:33] : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:35:53] !log disabling puppet on A:cp-text to test IPIP encapsulation on magru (T366466) [12:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:56] T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466 [12:36:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ping2004.codfw.wmnet with reason: host reimage [12:36:30] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS [12:36:30] 6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:37:30] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:37:34] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:38:28] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet [12:38:43] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2067.codfw.wmnet [12:39:26] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1067.eqiad.wmnet [12:39:38] (03CR) 
10Fabfur: [C:03+2] cache:hiera: enable IPIP on text@magru [puppet] - 10https://gerrit.wikimedia.org/r/1038744 (https://phabricator.wikimedia.org/T366466) (owner: 10Fabfur) [12:39:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:39:48] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1068.eqiad.wmnet [12:40:45] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:wikikube-worker-codfw [12:40:55] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:42:53] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039206 (https://phabricator.wikimedia.org/T273950) (owner: 10Muehlenhoff) [12:43:31] !log failover ganeti masters in drmrs [12:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:55] (03PS14) 10DCausse: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper) [12:44:07] (03CR) 10Volans: base functions: make sleep() output a bit friendlier (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759 (owner: 10Klausman) [12:45:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2067.codfw.wmnet [12:45:17] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2068.codfw.wmnet [12:45:24] Dreamy_Jazz: ok, cookbook stopped and nodes uncordoned you should be g2g [12:45:39] Thanks. 
[12:45:50] PROBLEM - ganeti-wconfd running on ganeti6002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [12:45:53] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1068.eqiad.wmnet [12:45:56] I will start my patches earlier so that you can start up again quicker. [12:45:56] (03CR) 10Clément Goubert: [C:03+1] sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - 10https://gerrit.wikimedia.org/r/1038782 (owner: 10Elukey) [12:46:02] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1069.eqiad.wmnet [12:46:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P64100 and previous config saved to /var/cache/conftool/dbconfig/20240605-124607-marostegui.json [12:46:45] (03CR) 10Elukey: [C:03+2] sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - 10https://gerrit.wikimedia.org/r/1038782 (owner: 10Elukey) [12:46:54] Dreamy_Jazz: <3 [12:47:11] ping me if you run into any issues, it should be ok though [12:47:16] PROBLEM - ganeti-wconfd running on ganeti6001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [12:48:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1002 using scap backport" [extensions/CheckUser] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038839 (https://phabricator.wikimedia.org/T329493) (owner: 10Dreamy Jazz) [12:48:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) (owner: 10Dreamy Jazz) [12:48:28] (03PS9) 10Dreamy Jazz: [CheckUser] Stop writing old for event table migration on testwiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) [12:48:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1002 using scap backport" [extensions/CheckUser] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038839 (https://phabricator.wikimedia.org/T329493) (owner: 10Dreamy Jazz) [12:48:37] (03CR) 10TrainBranchBot: "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) (owner: 10Dreamy Jazz) [12:49:12] (03Merged) 10jenkins-bot: [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) (owner: 10Dreamy Jazz) [12:49:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool db1246 T363119', diff saved to https://phabricator.wikimedia.org/P64101 and previous config saved to /var/cache/conftool/dbconfig/20240605-124918-arnaudb.json [12:49:22] T363119: db1246 crashed - https://phabricator.wikimedia.org/T363119 [12:49:32] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1246.eqiad.wmnet with reason: maintenance [12:49:40] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:49:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1246.eqiad.wmnet with reason: maintenance [12:50:37] (03Merged) 10jenkins-bot: sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - 10https://gerrit.wikimedia.org/r/1038782 (owner: 10Elukey) [12:51:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2068.codfw.wmnet [12:51:32] (03PS1) 10Phedenskog: 
wmftest: Add new Graphite instance for performance test data. [dns] - 10https://gerrit.wikimedia.org/r/1039207 (https://phabricator.wikimedia.org/T366669) [12:51:47] (03PS15) 10DCausse: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper) [12:52:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ping2004.codfw.wmnet with OS bookworm [12:52:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ping2004.codfw.wmnet [12:52:48] 06SRE, 06Infrastructure-Foundations: Move the ping* servers to Bookworm - https://phabricator.wikimedia.org/T366695#9863550 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ping2004.codfw.wmnet with OS bookworm completed: - ping2004 (**PASS**) - Removed from Puppe... [12:53:56] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1069.eqiad.wmnet [12:53:57] !log dreamyjazz@deploy1002 Started scap: Backport for [[gerrit:1013386|[CheckUser] Stop writing old for event table migration on testwiki (T360686)]] [12:54:00] T360686: Stop writing old on testwiki - https://phabricator.wikimedia.org/T360686 [12:54:12] Proceeding with my config change first as the wmf.8 backport is likely to take a while in gate-and-submit-wmf [12:54:25] FIRING: [3x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:54:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:55:19] !log mvernon@cumin2002 
START - Cookbook sre.hosts.reboot-single for host ms-be2069.codfw.wmnet [12:55:24] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1070.eqiad.wmnet [12:56:24] !log elukey@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:aux-worker [12:59:25] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1300). [13:00:04] Dreamy_Jazz and duesen: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] \o [13:00:20] I am currently deploying my config change [13:00:31] My other change is in gate-and-submit-wmf [13:01:09] (03PS1) 10Cwhite: Revert "multiversion: Add tests for MWMultiVersion::getMediaWiki()" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038840 [13:01:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P64102 and previous config saved to /var/cache/conftool/dbconfig/20240605-130115-marostegui.json [13:01:26] Got a warning that `check_testservers_baremetal` exceeded the 120s timeout. 
[13:01:40] It is asking me to retry or continue [13:01:50] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed - https://phabricator.wikimedia.org/T363119#9863599 (10Jclark-ctr) replaced broken cable server went 2 weeks with out fault returning [13:02:01] o/ [13:02:31] Going to continue as I think it should be fine [13:02:33] !log dreamyjazz@deploy1002 dreamyjazz: Backport for [[gerrit:1013386|[CheckUser] Stop writing old for event table migration on testwiki (T360686)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:02:40] Dreamy_Jazz: let me know when you are done. [13:02:45] T360686: Stop writing old on testwiki - https://phabricator.wikimedia.org/T360686 [13:02:46] Sure [13:02:57] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2069.codfw.wmnet [13:03:01] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1070.eqiad.wmnet [13:03:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:04:18] (03PS1) 10Elukey: profile::maps::tlsproxy: add SAN to CFSSL TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/1039211 [13:04:19] !log dreamyjazz@deploy1002 dreamyjazz: Continuing with sync [13:04:25] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:04:46] (03CR) 10CI reject: [V:04-1] profile::maps::tlsproxy: add SAN to CFSSL TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/1039211 (owner: 10Elukey) [13:05:53] (03PS2) 10Elukey: profile::maps::tlsproxy: add SAN to CFSSL TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/1039211 [13:06:01] 
!log restarting pybal on lvs7001/lvs7003 to appy IPIP conf (T366466) [13:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:11] T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466 [13:07:41] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2754/co" [puppet] - 10https://gerrit.wikimedia.org/r/1039211 (owner: 10Elukey) [13:08:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [13:08:33] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1039211 (owner: 10Elukey) [13:08:37] Dreamy_Jazz: Did it tell you what server exceeded that timeout? [13:08:46] No [13:09:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:09:55] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:10:14] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:aux-worker [13:10:17] (03CR) 10Elukey: [V:03+1 C:03+2] profile::maps::tlsproxy: add SAN to CFSSL TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/1039211 (owner: 10Elukey) [13:10:23] (03CR) 10Daimona Eaytoy: "I seem to remember from past deployments that it's generally better to do one file per patch, as files can only be synced to the servers i" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038862 (https://phabricator.wikimedia.org/T363199) (owner: 10Mhorsey) [13:10:28] (03PS1) 10Bartosz Dziewoński: MWMultiVersion: Fix "Undefined index: PATH_INFO" warnings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039212 (https://phabricator.wikimedia.org/T366657) [13:11:58] (03Merged) 10jenkins-bot: Follow-up: Don't run interact with block buttons if they don't exist [extensions/CheckUser] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038839 (https://phabricator.wikimedia.org/T329493) (owner: 10Dreamy Jazz) [13:13:10] !log dreamyjazz@deploy1002 Finished scap: Backport for [[gerrit:1013386|[CheckUser] Stop writing old for event table migration on testwiki (T360686)]] (duration: 19m 13s) [13:13:13] T360686: Stop writing old on testwiki - https://phabricator.wikimedia.org/T360686 [13:13:36] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1071.eqiad.wmnet [13:13:41] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2070.codfw.wmnet [13:14:14] (03PS1) 10Ebrahim: Enable numeric sorting for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039213 (https://phabricator.wikimedia.org/T366703) [13:14:20] !log dreamyjazz@deploy1002 Started scap: Backport for [[gerrit:1038839|Follow-up: Don't run interact with block buttons if they don't exist (T329493)]] [13:14:22] T329493: Replace Special:CheckUser's 'get users' block form with a usage of Special:InvestigateBlock - https://phabricator.wikimedia.org/T329493 [13:14:40] (03PS3) 10Ilias Sarantopoulos: ml-services: use multi-processing for viwiki in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274) [13:16:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T364299)', diff saved to https://phabricator.wikimedia.org/P64103 and previous config saved to 
/var/cache/conftool/dbconfig/20240605-131623-marostegui.json [13:16:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2179.codfw.wmnet with reason: Maintenance [13:16:27] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [13:16:28] cwhite: i proposed https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1039212 as an alternative to your revert [13:16:33] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed - https://phabricator.wikimedia.org/T363119#9863672 (10ABran-WMF) leaving the host depooled until tomorrow to see if it stays stable, will close the task upon repool. [13:16:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2179.codfw.wmnet with reason: Maintenance [13:16:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2179 (T364299)', diff saved to https://phabricator.wikimedia.org/P64104 and previous config saved to /var/cache/conftool/dbconfig/20240605-131647-marostegui.json [13:17:00] !log dreamyjazz@deploy1002 dreamyjazz: Backport for [[gerrit:1038839|Follow-up: Don't run interact with block buttons if they don't exist (T329493)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:17:29] (03CR) 10Kosta Harlan: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [13:17:38] !log dreamyjazz@deploy1002 dreamyjazz: Continuing with sync [13:17:58] (03CR) 10Elukey: [C:03+1] ml-services: use multi-processing for viwiki in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274) (owner: 10Ilias Sarantopoulos) [13:18:07] (03PS1) 10Fabfur: Revert "depool text@magru before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1038841 [13:18:40] (03CR) 10Elukey: ml-services: use 
multi-processing for viwiki in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274) (owner: 10Ilias Sarantopoulos) [13:18:46] Dreamy_Jazz: I haven't done a config deployment in a while... Remind me please... can I just use scap backport, and it knows what to do? [13:19:06] (03PS2) 10Fabfur: Revert "depool text@magru before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1038841 (https://phabricator.wikimedia.org/T366466) [13:19:10] (03CR) 10Elukey: ml-services: use multi-processing for viwiki in ml-staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274) (owner: 10Ilias Sarantopoulos) [13:19:33] duesen: Yes [13:19:37] (03CR) 10Vgutierrez: [C:03+1] Revert "depool text@magru before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1038841 (https://phabricator.wikimedia.org/T366466) (owner: 10Fabfur) [13:19:40] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:19:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2070.codfw.wmnet [13:20:08] (03PS4) 10Ilias Sarantopoulos: ml-services: use multi-processing for viwiki in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274) [13:20:19] (03CR) 10Fabfur: [C:03+2] Revert "depool text@magru before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1038841 (https://phabricator.wikimedia.org/T366466) (owner: 10Fabfur) [13:20:52] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2071.codfw.wmnet [13:21:10] !log enable magru DC after applying IPIP encapsulation patches (T366466) 
[13:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:12] T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466 [13:21:26] (03CR) 10Ilias Sarantopoulos: ml-services: use multi-processing for viwiki in ml-staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274) (owner: 10Ilias Sarantopoulos) [13:21:30] (03CR) 10Bartosz Dziewoński: [C:03+1] "Alternative: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1039212" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038840 (owner: 10Cwhite) [13:22:43] (03CR) 10Elukey: [C:03+1] ml-services: use multi-processing for viwiki in ml-staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274) (owner: 10Ilias Sarantopoulos) [13:23:46] Dreamy_Jazz: cool thanks. Are you still deploying? [13:23:54] duesen: Yes [13:24:08] k [13:24:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:24:55] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:25:23] (03CR) 10JMeybohm: "I don't really like the fact that this creates an implicit dependency to the mw-script namespace being created, but I think its the most s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035070 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [13:25:31] jouncebot: next [13:25:31] In 0 hour(s) and 34 minute(s): Wikifunction Services UTC Afternoon 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1400) [13:25:43] I'll sneak a graphite1005 reboot now [13:25:47] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: use multi-processing for viwiki in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274) (owner: 10Ilias Sarantopoulos) [13:25:56] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host graphite1005.eqiad.wmnet [13:25:59] !log dreamyjazz@deploy1002 Finished scap: Backport for [[gerrit:1038839|Follow-up: Don't run interact with block buttons if they don't exist (T329493)]] (duration: 11m 39s) [13:26:02] T329493: Replace Special:CheckUser's 'get users' block form with a usage of Special:InvestigateBlock - https://phabricator.wikimedia.org/T329493 [13:26:07] duesen: I'm done with my patch. [13:26:14] You can proceed with your config change. [13:26:17] Dreamy_Jazz: excellent, thank you! [13:26:28] I'll go ahead with my config patch, then [13:26:34] I would recommend running the command in screen / tmux in case your connection drops [13:26:42] ouch I didn't realize a deployment was in progress, my bad! anyways graphite1005 will be back soon btw [13:26:48] (03Merged) 10jenkins-bot: ml-services: use multi-processing for viwiki in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274) (owner: 10Ilias Sarantopoulos) [13:26:56] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1071.eqiad.wmnet [13:26:56] jouncebot: now and next [13:26:56] For the next 0 hour(s) and 33 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1300) [13:27:04] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1072.eqiad.wmnet [13:27:09] that's what I wanted [13:27:10] MatmaRex: we can try your alternate proposal first. 
[13:27:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038688 (https://phabricator.wikimedia.org/T361013) (owner: 10Daniel Kinzler) [13:27:25] thanks [13:27:43] Chances are it will solve the issue, but leave the bug as-is. [13:27:57] !log systemctl reset-failed prometheus-redis-exporter@6380.service redis-instance-tcp_6380.service on netbox[12]002 + apt-get purge of redis-server and prometheus-redis-exporter packages to clean up stale configs (no local redis is used) [13:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:23] (03CR) 10Andrew Bogott: [C:03+1] openstack: wmfkeystonehooks: Use project name for Wikitech page [puppet] - 10https://gerrit.wikimedia.org/r/1039204 (https://phabricator.wikimedia.org/T343158) (owner: 10Majavah) [13:28:35] grrr, merge conflict [13:28:44] FIRING: [4x] SystemdUnitFailed: prometheus-redis-exporter@6380.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:28:59] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2071.codfw.wmnet [13:29:21] (03PS2) 10Daniel Kinzler: Set LinterParseOnDerivedDataUpdate to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038688 (https://phabricator.wikimedia.org/T361013) [13:29:21] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2072.codfw.wmnet [13:29:36] (03CR) 10TrainBranchBot: "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038688 (https://phabricator.wikimedia.org/T361013) (owner: 10Daniel Kinzler) [13:30:11] (03Merged) 10jenkins-bot: Set LinterParseOnDerivedDataUpdate to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038688 
(https://phabricator.wikimedia.org/T361013) (owner: 10Daniel Kinzler) [13:30:27] (03CR) 10Majavah: [C:03+2] openstack: wmfkeystonehooks: Use project name for Wikitech page [puppet] - 10https://gerrit.wikimedia.org/r/1039204 (https://phabricator.wikimedia.org/T343158) (owner: 10Majavah) [13:30:43] !log daniel@deploy1002 Started scap: Backport for [[gerrit:1038688|Set LinterParseOnDerivedDataUpdate to false (T361013)]] [13:30:47] T361013: Update lint tables independently of changeprop/restbase - https://phabricator.wikimedia.org/T361013 [13:33:44] RESOLVED: [4x] SystemdUnitFailed: prometheus-redis-exporter@6380.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:34:16] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:34:32] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:34:33] !log daniel@deploy1002 daniel: Backport for [[gerrit:1038688|Set LinterParseOnDerivedDataUpdate to false (T361013)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:34:40] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:35:12] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1072.eqiad.wmnet [13:35:49] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcontrol2001-dev.codfw.wmnet - 
https://phabricator.wikimedia.org/T364577#9863782 (10Jhancock.wm) 05Open→03Resolved [13:35:59] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1073.eqiad.wmnet [13:37:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2072.codfw.wmnet [13:37:18] (03PS1) 10Jelto: aptrepo::staging: use gitlab client to download file, fix get_all [puppet] - 10https://gerrit.wikimedia.org/r/1039217 (https://phabricator.wikimedia.org/T347004) [13:37:21] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus7001.magru.wmnet [13:37:40] (03CR) 10CI reject: [V:04-1] aptrepo::staging: use gitlab client to download file, fix get_all [puppet] - 10https://gerrit.wikimedia.org/r/1039217 (https://phabricator.wikimedia.org/T347004) (owner: 10Jelto) [13:37:41] !log filippo@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host graphite1005.eqiad.wmnet [13:37:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6001.drmrs.wmnet [13:38:59] (03PS2) 10Jelto: aptrepo::staging: use gitlab client to download file, fix get_all [puppet] - 10https://gerrit.wikimedia.org/r/1039217 (https://phabricator.wikimedia.org/T347004) [13:39:29] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus6002.drmrs.wmnet [13:39:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:40:08] (03PS3) 10Jelto: aptrepo::staging: use gitlab client to download file, fix get_all [puppet] - 10https://gerrit.wikimedia.org/r/1039217 (https://phabricator.wikimedia.org/T347004) [13:40:08] !log daniel@deploy1002 daniel: Continuing with sync [13:42:29] (03CR) 10JMeybohm: deployment_server: Add a mwscript-k8s cleanup script (033 
comments) [puppet] - 10https://gerrit.wikimedia.org/r/1037868 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [13:42:44] (03PS1) 10Majavah: openstack: wmfkeystonehooks: Add missing self argument [puppet] - 10https://gerrit.wikimedia.org/r/1039218 [13:42:47] (03PS1) 10Alexandros Kosiaris: mediawiki-image-download: Support pct based aborted runs [puppet] - 10https://gerrit.wikimedia.org/r/1039219 [13:43:23] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus7001.magru.wmnet [13:43:28] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus6002.drmrs.wmnet [13:43:29] (03CR) 10Majavah: [C:03+2] openstack: wmfkeystonehooks: Add missing self argument [puppet] - 10https://gerrit.wikimedia.org/r/1039218 (owner: 10Majavah) [13:43:46] (03CR) 10Urbanecm: [C:03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) (owner: 10Sergio Gimeno) [13:43:54] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1073.eqiad.wmnet [13:44:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:45:31] !log bking@an-db1001 install acl pkg T363001 [13:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:34] T363001: Create a helm chart for airflow that is appropriate to our needs - https://phabricator.wikimedia.org/T363001 [13:46:14] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1074.eqiad.wmnet [13:46:15] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus5002.eqsin.wmnet [13:46:19] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host 
ms-be2073.codfw.wmnet [13:46:21] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus4002.ulsfo.wmnet [13:46:35] !log factory reset for sretest1001 to test the new provision cookbook - T365372 [13:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:38] T365372: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372 [13:46:45] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus3003.esams.wmnet [13:46:46] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-05-23-164021 to 2024-06-05-003919 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039220 (https://phabricator.wikimedia.org/T340561) [13:46:48] (03PS2) 10Urbanecm: testwiki: Enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038701 (https://phabricator.wikimedia.org/T360954) [13:47:03] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-05-28-185827 to 2024-05-31-163732 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039221 (https://phabricator.wikimedia.org/T360676) [13:47:20] (03CR) 10Alexandros Kosiaris: [C:04-1] "LGTM, minor nitpicks" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [13:47:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report [13:47:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0) [13:48:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ping1004.eqiad.wmnet [13:48:28] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:48:33] !log daniel@deploy1002 Finished scap: Backport for [[gerrit:1038688|Set LinterParseOnDerivedDataUpdate to false (T361013)]] (duration: 17m 50s) [13:48:36] T361013: Update lint tables independently of changeprop/restbase - https://phabricator.wikimedia.org/T361013 [13:48:56] !log 
bking@an-db1001 install python3-psycopg2 pkg T363001 [13:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:40] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:50:28] duesen: all clear for another backport? [13:50:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6001.drmrs.wmnet [13:51:27] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [13:52:21] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus4002.ulsfo.wmnet [13:52:22] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus5002.eqsin.wmnet [13:52:24] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2073.codfw.wmnet [13:52:31] (03PS13) 10Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [13:52:44] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2074.codfw.wmnet [13:52:46] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus3003.esams.wmnet [13:53:33] (03CR) 10Clément Goubert: mediawiki-image-download: Support pct based aborted runs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1039219 (owner: 10Alexandros Kosiaris) [13:54:09] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1074.eqiad.wmnet [13:54:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:55:35] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ping1004.eqiad.wmnet - jmm@cumin2002" [13:55:55] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:56:19] (03CR) 10Muehlenhoff: [C:03+2] Configure memcached on idp-test hosts to run as 'memcache' [puppet] - 10https://gerrit.wikimedia.org/r/1039206 (https://phabricator.wikimedia.org/T273950) (owner: 10Muehlenhoff) [13:56:33] (03PS1) 10Majavah: openstack: wmfkeystonehooks: Project is a dict, not an object [puppet] - 10https://gerrit.wikimedia.org/r/1039222 [13:56:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6001.drmrs.wmnet [13:57:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6001.drmrs.wmnet [13:57:51] (03CR) 10Majavah: [C:03+2] openstack: wmfkeystonehooks: Project is a dict, not an object [puppet] - 10https://gerrit.wikimedia.org/r/1039222 (owner: 10Majavah) [14:00:04] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1400) [14:00:19] Hey hey. 
[14:00:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ping1004.eqiad.wmnet - jmm@cumin2002" [14:00:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:00:22] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ping1004.eqiad.wmnet on all recursors [14:00:22] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-05-23-164021 to 2024-06-05-003919 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039220 (https://phabricator.wikimedia.org/T340561) (owner: 10Jforrester) [14:00:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ping1004.eqiad.wmnet on all recursors [14:00:29] cwhite: Are you deploying then? [14:00:34] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2074.codfw.wmnet [14:00:41] James_F: I'll wait until you're done to restart the reboots I guess :p [14:00:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6002.drmrs.wmnet [14:00:52] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ping1004.eqiad.wmnet - jmm@cumin2002" [14:01:04] claime: Oh, sorry! What are you rebooting? Might be OK to go in parallel. [14:01:20] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-05-23-164021 to 2024-06-05-003919 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039220 (https://phabricator.wikimedia.org/T340561) (owner: 10Jforrester) [14:01:24] James_F: Not really, I'm rebooting all k8s codfw [14:01:39] claime: Hmm. Maybe not ideal if I'm deploying to k8s, fair. [14:01:46] I still have ~60 nodes to go, and that will cordon them all, making deployments a little difficult [14:01:58] I'll be fast! 
[14:01:59] Although for wf it should fit [14:02:01] no worries [14:02:04] (He says, waiting for the git update to land on deploy1002.) [14:02:14] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:02:21] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:02:32] Oh dear. [14:02:51] Deploy failed. [14:02:58] are we still doing the backport window? [14:02:59] claime: I want to deploy. Was waiting for the all clear though. [14:03:06] "Error: UPGRADE FAILED: an error occurred while rolling back the release. original upgrade error: cannot patch "function-orchestrator-main-orchestrator-tls-proxy-certs" with kind Certificate: Internal error occurred
" [14:03:24] claime: Does this mean you need to reboot first, or is it a different issue? [14:03:33] different issue [14:03:44] Hmm. [14:03:45] especially in staging, I'm rebooting the prod cluster [14:03:48] Ack. [14:04:01] Well, if I can't deploy even to staging I can't validate. [14:04:09] So I suppose I should revert and give up? [14:04:11] (03PS1) 10Vgutierrez: depool text@eqsin before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1039223 (https://phabricator.wikimedia.org/T366466) [14:04:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ping1004.eqiad.wmnet - jmm@cumin2002" [14:04:29] MatmaRex: patches were deployed as part of the backkport window yes [14:04:40] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:04:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ping1004.eqiad.wmnet with OS bookworm [14:04:50] cwhite is waiting for the go ahead from duesen that his patch deployed correctly and he can proceed [14:04:53] 06SRE, 06Infrastructure-Foundations: Move the ping* servers to Bookworm - https://phabricator.wikimedia.org/T366695#9863878 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ping1004.eqiad.wmnet with OS bookworm [14:05:13] James_F: let me check the diff [14:05:19] claime: the patches i was interested in weren't [14:05:27] so i'm wondering if the window is done or in progress [14:05:42] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1075.eqiad.wmnet [14:05:47] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2075.codfw.wmnet [14:05:51] 
(https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1038840 / https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1039212) [14:05:56] MatmaRex: It's over, I'm now meant to be deploying my services. [14:06:02] alright [14:06:03] MatmaRex: what were those patches? because I think everything that was in the deployment calendar except cwhite's patch were deployed [14:06:16] (Though k8s service deploys and MW deploys don't really interact.) [14:07:16] claime: they're the same patches [14:07:23] ah [14:07:25] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host sretest1001.mgmt.eqiad.wmnet with reboot policy FORCED [14:07:46] (03PS1) 10Vgutierrez: hiera: Enable IPIP on high-traffic1@eqsin for text services [puppet] - 10https://gerrit.wikimedia.org/r/1039224 (https://phabricator.wikimedia.org/T366466) [14:07:48] (03PS1) 10Vgutierrez: hiera: enable IPIP on text@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1039225 (https://phabricator.wikimedia.org/T366466) [14:08:01] (03CR) 10Tchanders: [C:03+1] [CheckUser] Stop writing old for event tables migration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038740 (https://phabricator.wikimedia.org/T360685) (owner: 10Dreamy Jazz) [14:08:17] (03CR) 10Tchanders: [C:03+1] [CheckUser] Stop writing old for event tables migration on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038741 (https://phabricator.wikimedia.org/T360685) (owner: 10Dreamy Jazz) [14:08:26] (03CR) 10Tchanders: [C:03+1] [CheckUser] Stop writing old for event tables migration on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038742 (https://phabricator.wikimedia.org/T360685) (owner: 10Dreamy Jazz) [14:09:21] (03CR) 10Alexandros Kosiaris: mediawiki-image-download: Support pct based aborted runs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1039219 (owner: 10Alexandros Kosiaris) [14:09:25] FIRING: [3x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:09:35] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1039224 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [14:09:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:10:27] James_F: I'm going to try to deploy your wf patch to see what happens [14:10:30] Ack. [14:10:46] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:10:52] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:12:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T352010)', diff saved to https://phabricator.wikimedia.org/P64105 and previous config saved to /var/cache/conftool/dbconfig/20240605-141210-ladsgroup.json [14:12:14] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:13:25] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2075.codfw.wmnet [14:13:28] James_F: erm. [14:13:32] PROBLEM - Host mw1377 is DOWN: PING CRITICAL - Packet loss = 100% [14:13:33] It did pull your change [14:13:37] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1075.eqiad.wmnet [14:13:56] the function-orchestrator pod runs docker-registry.wikimedia.org/repos/abstract-wiki/wikifunctions/function-orchestrator:2024-06-05-003919 [14:14:06] Yes. 
[14:14:09] FIRING: HelmReleaseBadStatus: Helm release wikifunctions/main-orchestrator on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:14:19] But it seems to be in a failed state? [14:14:25] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:14:26] but it fails at redeploying the tls proxy for some reason [14:15:13] (03PS1) 10Muehlenhoff: Fix Hiera option name [puppet] - 10https://gerrit.wikimedia.org/r/1039226 (https://phabricator.wikimedia.org/T273950) [14:15:18] (03PS14) 10Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [14:15:26] A helm framework issue? 
[14:15:30] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2756/console" [puppet] - 10https://gerrit.wikimedia.org/r/1039225 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [14:15:43] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1076.eqiad.wmnet [14:15:48] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2076.codfw.wmnet [14:15:49] (03PS6) 10Effie Mouzeli: mc.php: store mcrouter location in apcu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039197 (https://phabricator.wikimedia.org/T363186) [14:15:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039226 (https://phabricator.wikimedia.org/T273950) (owner: 10Muehlenhoff) [14:16:04] RECOVERY - Host mw1377 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [14:16:29] James_F: looks more like a certificate issue which is... strange [14:17:01] jayme: did something change recently for tls on staging-eqiad? [14:17:01] Yeah. [14:17:23] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2757/co" [puppet] - 10https://gerrit.wikimedia.org/r/1039225 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [14:17:27] (03PS15) 10Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [14:17:40] claime: not that I know of...let me read backlog [14:18:03] jayme: UPGRADE FAILED: an error occurred while rolling back the release. 
original upgrade error: cannot patch "function-orchestrator-main-orchestrator-tls-proxy-certs" with kind Cert [14:18:05] ificate: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": [14:18:07] x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "cert-manager-webhook-ca") [14:18:41] oh, interesting... [14:18:58] isn't it [14:19:23] that's probably the apiserver failing to call the cert-manager webhook [14:19:25] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:19:27] I can take a look [14:20:19] (03CR) 10Clément Goubert: [C:03+1] mediawiki-image-download: Support pct based aborted runs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1039219 (owner: 10Alexandros Kosiaris) [14:20:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T364069)', diff saved to https://phabricator.wikimedia.org/P64106 and previous config saved to /var/cache/conftool/dbconfig/20240605-142018-marostegui.json [14:20:23] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [14:21:29] (03CR) 10Muehlenhoff: [C:03+2] Fix Hiera option name [puppet] - 10https://gerrit.wikimedia.org/r/1039226 (https://phabricator.wikimedia.org/T273950) (owner: 10Muehlenhoff) [14:21:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:21:49] jayme: thanks [14:23:24] !log 
jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6002.drmrs.wmnet [14:23:45] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2076.codfw.wmnet [14:23:50] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1076.eqiad.wmnet [14:24:27] (03CR) 10Fabfur: [C:03+1] depool text@eqsin before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1039223 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [14:24:32] (03PS1) 10Muehlenhoff: Remove obsolete virt-star stub secret [labs/private] - 10https://gerrit.wikimedia.org/r/1039227 [14:24:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:26:04] (03CR) 10Vgutierrez: [C:03+2] depool text@eqsin before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1039223 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [14:26:29] (03CR) 10Fabfur: [C:03+1] hiera: Enable IPIP on high-traffic1@eqsin for text services [puppet] - 10https://gerrit.wikimedia.org/r/1039224 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [14:27:05] (03CR) 10Fabfur: [C:03+1] hiera: enable IPIP on text@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1039225 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [14:27:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P64107 and previous config saved to /var/cache/conftool/dbconfig/20240605-142718-ladsgroup.json [14:28:00] !log depool text@eqsin before enabling IPIP encapsulation - T366466 [14:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:03] T366466: Use IPIP 
encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466 [14:29:03] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [14:29:08] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9863961 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq... [14:29:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6002.drmrs.wmnet [14:29:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:29:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6002.drmrs.wmnet [14:31:56] (03PS1) 10Muehlenhoff: Configure memcached on idp hosts to run as 'memcache' [puppet] - 10https://gerrit.wikimedia.org/r/1039229 (https://phabricator.wikimedia.org/T273950) [14:32:34] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9863988 (10cmooney) [14:33:34] 06SRE, 10Cloud-Services, 06serviceops, 13Patch-For-Review: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9863991 (10MoritzMuehlenhoff) [14:34:40] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:35:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff 
saved to https://phabricator.wikimedia.org/P64108 and previous config saved to /var/cache/conftool/dbconfig/20240605-143526-marostegui.json [14:36:49] (03PS7) 10JHathaway: phab: query for inbound mail servers [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) [14:38:44] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:47] (03CR) 10Effie Mouzeli: mc.php: store mcrouter location in apcu (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039197 (https://phabricator.wikimedia.org/T363186) (owner: 10Effie Mouzeli) [14:39:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:42:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P64109 and previous config saved to /var/cache/conftool/dbconfig/20240605-144227-ladsgroup.json [14:42:54] (03CR) 10JHathaway: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [14:43:26] claime: just so you know I'm still around and hoping to get this backport deployed [14:43:39] cwhite: yeah, I haven't restarted the reboots [14:43:49] I think I'll give up on them for today and will finish tomorrow [14:44:05] tbh I think you should go ahead with your backport [14:44:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host 
serpens.wikimedia.org [14:44:09] (03PS7) 10Effie Mouzeli: mc.php: store mcrouter location in apcu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039197 (https://phabricator.wikimedia.org/T363186) [14:44:32] ok, thank you :) [14:44:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:45:17] (03CR) 10Effie Mouzeli: mc.php: store mcrouter location in apcu (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039197 (https://phabricator.wikimedia.org/T363186) (owner: 10Effie Mouzeli) [14:45:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cwhite@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039212 (https://phabricator.wikimedia.org/T366657) (owner: 10Bartosz Dziewoński) [14:45:25] (03CR) 10JHathaway: [V:03+1] "I think your concerns have been addressed, please take another look." 
[puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [14:45:43] (03CR) 10Cyndywikime: [C:04-1] "For visibility, needs rebase :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038701 (https://phabricator.wikimedia.org/T360954) (owner: 10Urbanecm) [14:46:00] (03PS3) 10Urbanecm: Growth: Use `growthexperiments` DB list for enabling GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038882 (https://phabricator.wikimedia.org/T364892) [14:46:03] (03PS7) 10Sergio Gimeno: [Beta] Enable CommunityConfiguration extension in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) [14:46:06] (03Merged) 10jenkins-bot: MWMultiVersion: Fix "Undefined index: PATH_INFO" warnings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039212 (https://phabricator.wikimedia.org/T366657) (owner: 10Bartosz Dziewoński) [14:46:35] !log cwhite@deploy1002 Started scap: Backport for [[gerrit:1039212|MWMultiVersion: Fix "Undefined index: PATH_INFO" warnings (T366657)]] [14:46:38] T366657: Lots of logs: "PHP Notice: Undefined Index: PATH_INFO" - https://phabricator.wikimedia.org/T366657 [14:47:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host serpens.wikimedia.org [14:47:56] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on high-traffic1@eqsin for text services [puppet] - 10https://gerrit.wikimedia.org/r/1039224 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [14:48:24] (03PS3) 10Urbanecm: testwiki: Enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038701 (https://phabricator.wikimedia.org/T360954) [14:48:30] (03PS4) 10Urbanecm: Growth: Use `growthexperiments` DB list for enabling GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038882 (https://phabricator.wikimedia.org/T364892) [14:48:34] (03PS8) 10Sergio Gimeno: 
[Beta] Enable CommunityConfiguration extension in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) [14:48:38] (03PS4) 10Urbanecm: testwiki: Enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038701 (https://phabricator.wikimedia.org/T360954) [14:49:05] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: enable IPIP on text@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1039225 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [14:49:08] !log cwhite@deploy1002 matmarex and cwhite: Backport for [[gerrit:1039212|MWMultiVersion: Fix "Undefined index: PATH_INFO" warnings (T366657)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:49:40] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:50:26] wiki still renders - continuing [14:50:28] !log cwhite@deploy1002 matmarex and cwhite: Continuing with sync [14:50:32] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039229 (https://phabricator.wikimedia.org/T273950) (owner: 10Muehlenhoff) [14:50:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P64110 and previous config saved to /var/cache/conftool/dbconfig/20240605-145034-marostegui.json [14:50:44] (03PS1) 10Majavah: openstack: wmfkeystonehooks: Use project name in created DNS zone names [puppet] - 10https://gerrit.wikimedia.org/r/1039231 (https://phabricator.wikimedia.org/T343158) [14:51:08] (03CR) 10CI reject: [V:04-1] openstack: wmfkeystonehooks: Use project name in created DNS zone names [puppet] - 10https://gerrit.wikimedia.org/r/1039231 (https://phabricator.wikimedia.org/T343158) (owner: 
10Majavah) [14:51:14] (03CR) 10Klausman: [C:03+2] base functions: make sleep() output a bit friendlier (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759 (owner: 10Klausman) [14:52:03] (03PS6) 10Giuseppe Lavagetto: Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) [14:52:06] (03PS2) 10Majavah: openstack: wmfkeystonehooks: Use project name in created DNS zone names [puppet] - 10https://gerrit.wikimedia.org/r/1039231 (https://phabricator.wikimedia.org/T343158) [14:52:44] (03CR) 10CI reject: [V:04-1] Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [14:53:44] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:55:29] !log rolling restart of pybal on lvs5006 and lvs5004 - T366466 [14:55:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb2002.codfw.wmnet [14:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:32] T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466 [14:55:34] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [14:55:55] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:55:57] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [14:56:07] jayme: Should I give up on the deployment and revert my deployment-charts patch? [14:57:08] James_F: I need a bit more time, but if it's fine by you, you can leave the change intact and I can deploy to staging as soon as I've figured out what's wrong [14:57:15] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864036 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad.... [14:57:17] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864037 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq... [14:57:17] Ack. That'd be great. [14:57:18] and ping you after for prod deployments [14:57:24] jayme: Thank you! 
[14:57:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T352010)', diff saved to https://phabricator.wikimedia.org/P64111 and previous config saved to /var/cache/conftool/dbconfig/20240605-145735-ladsgroup.json [14:57:37] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [14:57:38] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:57:50] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [14:57:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T352010)', diff saved to https://phabricator.wikimedia.org/P64112 and previous config saved to /var/cache/conftool/dbconfig/20240605-145757-ladsgroup.json [14:59:07] !log cwhite@deploy1002 Finished scap: Backport for [[gerrit:1039212|MWMultiVersion: Fix "Undefined index: PATH_INFO" warnings (T366657)]] (duration: 12m 32s) [14:59:10] T366657: Lots of logs: "PHP Notice: Undefined Index: PATH_INFO" - https://phabricator.wikimedia.org/T366657 [15:00:30] (03PS1) 10Vgutierrez: Revert "depool text@eqsin before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1038842 (https://phabricator.wikimedia.org/T366466) [15:00:55] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:00:55] (03Abandoned) 10Cwhite: Revert "multiversion: Add tests for MWMultiVersion::getMediaWiki()" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038840 (owner: 10Cwhite) [15:01:11] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[15:01:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb2002.codfw.wmnet [15:02:02] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1001.mgmt.eqiad.wmnet with reboot policy FORCED [15:04:40] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:04:43] (03CR) 10Vgutierrez: [C:03+2] Revert "depool text@eqsin before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1038842 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [15:04:50] !log repool text@eqsin with IPIP encapsulation enabled - T366466 [15:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:54] T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466 [15:04:55] (03CR) 10JHathaway: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [15:05:03] bblack, urandom, claime, Emperor: ^^ [15:05:21] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [15:05:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T364069)', diff saved to https://phabricator.wikimedia.org/P64113 and previous config saved to /var/cache/conftool/dbconfig/20240605-150542-marostegui.json [15:05:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1243.eqiad.wmnet with reason: Maintenance [15:05:45] T364069: Rebuild pagelinks tables - 
https://phabricator.wikimedia.org/T364069 [15:05:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1243.eqiad.wmnet with reason: Maintenance [15:05:58] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864091 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad.... [15:05:59] jouncebot: nowandnext [15:05:59] No deployments scheduled for the next 1 hour(s) and 24 minute(s) [15:05:59] In 1 hour(s) and 24 minute(s): One-off deployment for T365155 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1630) [15:06:00] T365155: Text id verification makes dumps skip many good rows - https://phabricator.wikimedia.org/T365155 [15:06:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T364069)', diff saved to https://phabricator.wikimedia.org/P64114 and previous config saved to /var/cache/conftool/dbconfig/20240605-150605-marostegui.json [15:07:23] !log jnuche@deploy1002 Installing scap version "4.86.0" for 286 hosts [15:08:36] !log jnuche@deploy1002 Installing scap version "4.86.0" for 285 hosts [15:09:06] (03CR) 10Elukey: [C:03+2] "To keep archives happy: after resetting bios and factory reset via idrac, the cookbook worked nicely." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [15:09:15] !log jnuche@deploy1002 Installation of scap version "4.86.0" completed for 285 hosts [15:09:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:09:51] !log kamila@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1001'] [15:10:36] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wikikube-ctrl1001'] [15:10:40] !log kamila@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1001'] [15:10:55] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:12:14] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2077.codfw.wmnet [15:13:02] (03Merged) 10jenkins-bot: sre.host.provision: no-op refactor to highlight DELL-specific confs [cookbooks] - 10https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [15:13:07] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1077.eqiad.wmnet [15:13:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb1001.eqiad.wmnet [15:17:21] (03PS1) 10Giuseppe Lavagetto: statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) [15:17:22] (03PS1) 10Giuseppe Lavagetto: mw-debug: add statsd service everywhere 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1039234 (https://phabricator.wikimedia.org/T365265) [15:18:13] (03CR) 10CI reject: [V:04-1] mw-debug: add statsd service everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039234 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [15:18:28] (03CR) 10CI reject: [V:04-1] statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [15:18:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1001.eqiad.wmnet [15:19:40] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:19:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2077.codfw.wmnet [15:19:52] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2078.codfw.wmnet [15:20:55] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1077.eqiad.wmnet [15:21:03] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1078.eqiad.wmnet [15:24:18] !log sukhe@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM pybal-test2003.codfw.wmnet [15:24:25] FIRING: [3x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:24:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status 
- https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:25:05] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ping1004.eqiad.wmnet with OS bookworm [15:25:06] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ping1004.eqiad.wmnet [15:25:15] 06SRE, 06Infrastructure-Foundations: Move the ping* servers to Bookworm - https://phabricator.wikimedia.org/T366695#9864158 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ping1004.eqiad.wmnet with OS bookworm executed with errors: - ping1004 (**FAIL**) - Removed... [15:26:11] (03CR) 10Giuseppe Lavagetto: Add new chart statsd-exporter (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [15:26:21] (03PS7) 10Giuseppe Lavagetto: Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) [15:26:21] (03PS2) 10Giuseppe Lavagetto: statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) [15:26:21] (03PS2) 10Giuseppe Lavagetto: mw-debug: add statsd service everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039234 (https://phabricator.wikimedia.org/T365265) [15:26:28] !log sukhe@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM pybal-test2003.codfw.wmnet [15:27:02] (03CR) 10CI reject: [V:04-1] statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [15:27:05] (03CR) 10CI reject: [V:04-1] Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [15:27:07] (03CR) 
10CI reject: [V:04-1] mw-debug: add statsd service everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039234 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [15:27:31] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2078.codfw.wmnet [15:28:10] (03PS1) 10Andrea Denisse: traffic: Add discovery entries for the pyrra, slo, and slos domains [puppet] - 10https://gerrit.wikimedia.org/r/1039236 (https://phabricator.wikimedia.org/T356386) [15:28:11] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2079.codfw.wmnet [15:28:37] (03CR) 10Filippo Giunchedi: [C:03+1] traffic: Add discovery entries for the pyrra, slo, and slos domains [puppet] - 10https://gerrit.wikimedia.org/r/1039236 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [15:28:39] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1078.eqiad.wmnet [15:28:41] (03CR) 10Andrea Denisse: [C:03+2] traffic: Add discovery entries for the pyrra, slo, and slos domains [puppet] - 10https://gerrit.wikimedia.org/r/1039236 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [15:29:25] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:29:29] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1079.eqiad.wmnet [15:30:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-ctrl1001'] [15:32:59] !log rebalancing drmrs Ganeti clusters [15:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb2002.codfw.wmnet 
[15:34:25] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:34:41] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [15:34:54] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864232 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq... [15:36:02] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2079.codfw.wmnet [15:36:13] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2080.codfw.wmnet [15:37:11] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1079.eqiad.wmnet [15:37:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb2002.codfw.wmnet [15:37:29] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1080.eqiad.wmnet [15:39:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:39:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb1002.eqiad.wmnet [15:40:55] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:43:33] !log mvernon@cumin2002 
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2080.codfw.wmnet [15:43:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb1002.eqiad.wmnet [15:43:59] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1080.eqiad.wmnet [15:44:20] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1081.eqiad.wmnet [15:46:03] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [15:46:10] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864265 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad.... [15:46:48] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9864270 (10ssingh) [15:49:40] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:50:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T352010)', diff saved to https://phabricator.wikimedia.org/P64115 and previous config saved to /var/cache/conftool/dbconfig/20240605-155023-ladsgroup.json [15:50:26] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:50:38] (03PS1) 10Sergio Gimeno: Improve navigation link handling in CommunityConfiguration [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038843 (https://phabricator.wikimedia.org/T364938) [15:51:07] !log kamila@cumin1002 START - 
Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [15:51:47] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1081.eqiad.wmnet [15:51:50] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1082.eqiad.wmnet [15:52:32] (03PS8) 10Giuseppe Lavagetto: Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) [15:52:36] (03PS3) 10Giuseppe Lavagetto: statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) [15:52:40] (03PS3) 10Giuseppe Lavagetto: mw-debug: add statsd service everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039234 (https://phabricator.wikimedia.org/T365265) [15:52:44] (03PS1) 10AOkoth: miscweb: update security-landing-page [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039237 (https://phabricator.wikimedia.org/T350796) [15:52:53] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864316 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq... 
[15:53:15] (03CR) 10AOkoth: [C:03+2] miscweb: update security-landing-page [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039237 (https://phabricator.wikimedia.org/T350796) (owner: 10AOkoth) [15:53:19] (03CR) 10Jelto: [C:03+1] miscweb: update security-landing-page [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039237 (https://phabricator.wikimedia.org/T350796) (owner: 10AOkoth) [15:53:29] (03CR) 10AOkoth: [V:03+2 C:03+2] miscweb: update security-landing-page [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039237 (https://phabricator.wikimedia.org/T350796) (owner: 10AOkoth) [15:54:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:55:55] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:56:05] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9864343 (10elukey) First roadblock: https://www.supermicro.com/en/support/BMC_Unique_Password It seems that every s... 
[15:56:47] !log aokoth@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [15:57:01] (03CR) 10Majavah: [C:03+2] openstack: wmfkeystonehooks: Use project name in created DNS zone names [puppet] - 10https://gerrit.wikimedia.org/r/1039231 (https://phabricator.wikimedia.org/T343158) (owner: 10Majavah) [15:57:07] !log aokoth@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [15:57:17] (03PS2) 10Ebrahim: Enable numeric sorting for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039213 (https://phabricator.wikimedia.org/T366703) [15:57:25] (03PS4) 10Giuseppe Lavagetto: statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) [15:57:25] (03PS4) 10Giuseppe Lavagetto: mw-debug: add statsd service everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039234 (https://phabricator.wikimedia.org/T365265) [15:58:23] !log aokoth@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [15:58:44] !log aokoth@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [15:59:31] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1082.eqiad.wmnet [15:59:34] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [15:59:47] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [15:59:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T352010)', diff saved to https://phabricator.wikimedia.org/P64116 and previous config saved to /var/cache/conftool/dbconfig/20240605-155955-ladsgroup.json [15:59:58] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [16:01:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 25%: Maint 
over', diff saved to https://phabricator.wikimedia.org/P64117 and previous config saved to /var/cache/conftool/dbconfig/20240605-160116-ladsgroup.json [16:01:20] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [16:01:26] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864355 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad.... [16:01:26] !log aokoth@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [16:01:46] !log aokoth@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [16:04:40] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:05:09] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:05:49] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:06:25] James_F: ^ [16:06:54] Aha. [16:06:57] And it worked? [16:07:02] oh, sorry...yes :D [16:07:07] :-D [16:07:11] Excellent, thank you! [16:07:24] yw [16:08:02] (03CR) 10Urbanecm: "That was actually not true, fwiw :). 
operations/mediawiki-config is very aggressive about rebase warnings, and it shows them even when the" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038701 (https://phabricator.wikimedia.org/T360954) (owner: 10Urbanecm) [16:08:49] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [16:09:09] RESOLVED: HelmReleaseBadStatus: Helm release wikifunctions/main-orchestrator on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:09:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:10:00] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [16:10:16] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [16:10:29] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [16:10:42] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864398 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq... 
[16:10:55] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:11:47] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [16:12:16] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2024-05-28-185827 to 2024-05-31-163732 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039221 (https://phabricator.wikimedia.org/T360676) (owner: 10Jforrester) [16:12:29] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1032.eqiad.wmnet [16:13:07] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-05-28-185827 to 2024-05-31-163732 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039221 (https://phabricator.wikimedia.org/T360676) (owner: 10Jforrester) [16:13:26] (03PS13) 10Gergő Tisza: [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) [16:14:09] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:15:39] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:15:48] FIRING: [2x] PuppetDisabled: Puppet disabled on mc1049:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=memcached&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [16:16:14] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [16:16:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 50%: Maint over', diff saved to https://phabricator.wikimedia.org/P64118 and previous config saved to /var/cache/conftool/dbconfig/20240605-161622-ladsgroup.json 
[16:18:51] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [16:18:53] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1032.eqiad.wmnet [16:18:58] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [16:19:40] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:20:57] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [16:23:13] (03PS1) 10JHathaway: mw1365: Move outbound email to mx-out{1001,2001}.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1039245 (https://phabricator.wikimedia.org/T365395) [16:24:16] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039245 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [16:24:25] FIRING: [3x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:24:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:25:59] (03PS1) 10Milimetric: dumps/other: remove unused links [puppet] - 10https://gerrit.wikimedia.org/r/1039246 [16:26:57] (03PS2) 10Milimetric: dumps/other: remove unused links [puppet] - 10https://gerrit.wikimedia.org/r/1039246 [16:29:25] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:29:35] (03CR) 10Dzahn: [V:03+1 C:03+2] "compiling works now and shows noop on active prod host, https://puppet-compiler.wmflabs.org/output/1037621/2761/phab2002.codfw.wmnet/index" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [16:30:02] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/1039245 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [16:30:03] (03PS1) 10Hnowlan: kask: add mesh configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T36399) [16:30:04] Amir1: I, the Bot under the Fountain, call upon thee, The Deployer, to do One-off deployment for T365155 deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1630). [16:30:04] dr0ptp4kt: A patch you scheduled for One-off deployment for T365155 is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:30:05] T365155: Text id verification makes dumps skip many good rows - https://phabricator.wikimedia.org/T365155 [16:30:54] o/ [16:30:58] I deploy now [16:31:04] +1 [16:31:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P64119 and previous config saved to /var/cache/conftool/dbconfig/20240605-163129-ladsgroup.json [16:32:08] (03PS3) 10Dr0ptp4kt: Bump XML dump schema to version 0.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038392 (https://phabricator.wikimedia.org/T365155) [16:32:55] !log jayme@cumin1002 START - Cookbook sre.hosts.reboot-single for host kubestage1003.eqiad.wmnet [16:32:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038392 (https://phabricator.wikimedia.org/T365155) (owner: 10Dr0ptp4kt) [16:33:36] (03Merged) 10jenkins-bot: Bump XML dump schema to version 0.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038392 (https://phabricator.wikimedia.org/T365155) (owner: 10Dr0ptp4kt) [16:34:00] (03PS2) 10CDanis: kask: add mesh configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [16:34:05] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1038392|Bump XML dump schema to version 0.11 (T365155)]] [16:34:25] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:34:44] (03PS3) 10Milimetric: dumps/other: remove unused links [puppet] - 10https://gerrit.wikimedia.org/r/1039246 [16:34:55] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [16:35:00] 
10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864582 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad.... [16:36:34] !log ladsgroup@deploy1002 ladsgroup and dr0ptp4kt: Backport for [[gerrit:1038392|Bump XML dump schema to version 0.11 (T365155)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:36:37] T365155: Text id verification makes dumps skip many good rows - https://phabricator.wikimedia.org/T365155 [16:37:22] (03PS2) 10Dreamy Jazz: [CheckUser] Stop writing old for event tables migration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038740 (https://phabricator.wikimedia.org/T360685) [16:37:27] (03PS2) 10Dreamy Jazz: [CheckUser] Stop writing old for event tables migration on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038741 (https://phabricator.wikimedia.org/T360685) [16:37:59] (03CR) 10Dzahn: [V:03+1 C:03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [16:38:52] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [16:38:56] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864607 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq... 
[16:39:40] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:40:41] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage1003.eqiad.wmnet [16:40:55] FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:42:13] ^ this has been alerting every 5 minutes for like 6 hours [16:42:41] would be nice if we could reduce the noise by a downtime or something [16:43:26] is it notifying someone in other ways? [16:43:46] !log ladsgroup@deploy1002 ladsgroup and dr0ptp4kt: Continuing with sync [16:45:19] (03CR) 10Dzahn: "We have an alert about this every 5 minutes:" [puppet] - 10https://gerrit.wikimedia.org/r/1038329 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [16:45:50] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [16:46:05] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad.... 
[16:46:13] !log kamila@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1001'] [16:46:20] (03PS1) 10JHathaway: phab: fix ferm ensure [puppet] - 10https://gerrit.wikimedia.org/r/1039248 (https://phabricator.wikimedia.org/T365395) [16:46:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P64120 and previous config saved to /var/cache/conftool/dbconfig/20240605-164635-ladsgroup.json [16:46:38] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039248 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [16:46:46] (03CR) 10Dzahn: [C:03+1] phab: fix ferm ensure [puppet] - 10https://gerrit.wikimedia.org/r/1039248 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [16:47:46] (03PS16) 10DCausse: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper) [16:48:50] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [16:49:40] RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:51:52] (03CR) 10Dzahn: [C:03+2] phab: fix ferm ensure [puppet] - 10https://gerrit.wikimedia.org/r/1039248 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [16:52:28] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1038392|Bump XML dump schema to version 0.11 (T365155)]] (duration: 18m 23s) [16:52:30] 10ops-codfw, 06DC-Ops: PowerSupplyFailure - 
https://phabricator.wikimedia.org/T366724 (10phaultfinder) 03NEW [16:52:31] T365155: Text id verification makes dumps skip many good rows - https://phabricator.wikimedia.org/T365155 [16:53:35] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on stat1004.eqiad.wmnet with reason: decom T353785 [16:53:38] T353785: Decom EOL stats servers stat100[4-7] - https://phabricator.wikimedia.org/T353785 [16:53:48] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on stat1004.eqiad.wmnet with reason: decom T353785 [16:54:13] !log downtimed stat1004 for 10 days to avoid alerting spam during decom process - T353785 [16:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:40] FIRING: [3x] SystemdUnitFailed: rsync-published.service on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:55:00] (03PS1) 10JHathaway: Revert "Revert "Revert "Revert "phabricator: Move outbound email to mx-out{1001,2001}.wikimedia.org"""" [puppet] - 10https://gerrit.wikimedia.org/r/1038844 [16:55:10] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038844 (owner: 10JHathaway) [16:55:11] (03CR) 10Dzahn: [C:03+2] "noop on prod server, removed new firewall rule on failover server, all good" [puppet] - 10https://gerrit.wikimedia.org/r/1039248 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [16:55:55] FIRING: [3x] SystemdUnitFailed: rsync-published.service on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:56:09] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on stat1005.eqiad.wmnet with reason: decom 
T353785 [16:56:15] (03CR) 10Hnowlan: services: add data-gateway service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [16:56:22] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on stat1005.eqiad.wmnet with reason: decom T353785 [16:56:51] !log kamila@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-ctrl1001'] [17:00:04] Amir1: Your horoscope predicts another One-off deployment for T365155 deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1630). [17:00:04] dr0ptp4kt: A patch you scheduled for One-off deployment for T365155 is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1700) [17:00:05] T365155: Text id verification makes dumps skip many good rows - https://phabricator.wikimedia.org/T365155 [17:02:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T364299)', diff saved to https://phabricator.wikimedia.org/P64121 and previous config saved to /var/cache/conftool/dbconfig/20240605-170200-marostegui.json [17:02:05] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [17:02:55] (03PS2) 10JHathaway: Revert "Revert "Revert "Revert "phabricator: Move outbound email to mx-out{1001,2001}.wikimedia.org"""" [puppet] - 10https://gerrit.wikimedia.org/r/1038844 (https://phabricator.wikimedia.org/T365395) [17:03:02] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038844 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [17:04:29] !log kamila@cumin1002 START - Cookbook 
sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [17:04:35] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864820 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq... [17:04:40] RESOLVED: [2x] SystemdUnitFailed: rsync-published.service on stat1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:05:45] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on stat1006.eqiad.wmnet with reason: decom T353785 [17:05:48] T353785: Decom EOL stats servers stat100[4-7] - https://phabricator.wikimedia.org/T353785 [17:05:58] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on stat1006.eqiad.wmnet with reason: decom T353785 [17:06:34] every 5 min for 5 hosts is a lot of noise [17:06:41] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on stat1007.eqiad.wmnet with reason: decom T353785 [17:06:46] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1033.eqiad.wmnet [17:06:53] (03CR) 10JHathaway: [C:03+2] Revert "Revert "Revert "Revert "phabricator: Move outbound email to mx-out{1001,2001}.wikimedia.org"""" [puppet] - 10https://gerrit.wikimedia.org/r/1038844 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [17:06:54] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on stat1007.eqiad.wmnet with reason: decom T353785 [17:09:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T352010)', diff saved to https://phabricator.wikimedia.org/P64122 and previous config saved to 
/var/cache/conftool/dbconfig/20240605-170938-ladsgroup.json [17:09:42] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:10:17] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1033.eqiad.wmnet [17:10:41] !log phabricator email now egressing via mx-out{1001,2001}.wikimedia.org, which should solve the SPF warnings in your inbox [17:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:15] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, 13Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9864875 (10jhathaway) [17:12:49] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [17:12:55] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864877 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad.... 
[17:13:09] !log kamila@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1001'] [17:13:50] (03PS1) 10Ladsgroup: Stop writing to pagelinks old columns in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039256 (https://phabricator.wikimedia.org/T352010) [17:17:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P64123 and previous config saved to /var/cache/conftool/dbconfig/20240605-171708-marostegui.json [17:17:57] (03PS1) 10Btullis: Revert "Temporarily disable XML dumps on snapshot hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1038845 (https://phabricator.wikimedia.org/T365155) [17:18:19] (03CR) 10CI reject: [V:04-1] Revert "Temporarily disable XML dumps on snapshot hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1038845 (https://phabricator.wikimedia.org/T365155) (owner: 10Btullis) [17:19:05] (03PS2) 10Btullis: Revert "Temporarily disable XML dumps on snapshot hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1038845 (https://phabricator.wikimedia.org/T365155) [17:19:31] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038845 (https://phabricator.wikimedia.org/T365155) (owner: 10Btullis) [17:24:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P64124 and previous config saved to /var/cache/conftool/dbconfig/20240605-172446-ladsgroup.json [17:24:48] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [17:25:03] (03Abandoned) 10Ebrahim: Enable numeric sorting for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039213 (https://phabricator.wikimedia.org/T366703) (owner: 
10Ebrahim) [17:25:21] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9864928 (10colewhite) The correct link to the docs for setting up kerberos: https://wikitech.wikimedia.o... [17:27:13] !log kamila@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-ctrl1001'] [17:27:52] jouncebot: nowandnext [17:27:52] For the next 0 hour(s) and 2 minute(s): One-off deployment for T365155 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1630) [17:27:52] For the next 0 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1700) [17:27:52] In 0 hour(s) and 32 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1800) [17:27:52] T365155: Text id verification makes dumps skip many good rows - https://phabricator.wikimedia.org/T365155 [17:28:15] (03CR) 10Ladsgroup: [C:03+2] Stop writing to pagelinks old columns in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039256 (https://phabricator.wikimedia.org/T352010) (owner: 10Ladsgroup) [17:28:56] (03Merged) 10jenkins-bot: Stop writing to pagelinks old columns in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039256 (https://phabricator.wikimedia.org/T352010) (owner: 10Ladsgroup) [17:29:04] \o/ [17:29:23] And then just s2 to do? Nice. 
[17:29:48] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1039256|Stop writing to pagelinks old columns in enwiki (T352010)]] [17:29:51] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:30:19] (03CR) 10Btullis: [C:03+2] Revert "Temporarily disable XML dumps on snapshot hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1038845 (https://phabricator.wikimedia.org/T365155) (owner: 10Btullis) [17:30:40] (03CR) 10Xcollazo: [C:03+1] Revert "Temporarily disable XML dumps on snapshot hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1038845 (https://phabricator.wikimedia.org/T365155) (owner: 10Btullis) [17:30:43] (03PS17) 10Ryan Kemper: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) [17:31:52] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [17:32:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P64125 and previous config saved to /var/cache/conftool/dbconfig/20240605-173216-marostegui.json [17:32:32] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1039256|Stop writing to pagelinks old columns in enwiki (T352010)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:33:38] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [17:34:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [17:36:27] (03CR) 10Scott French: "Thank you all for the reviews. I'll aim to get this merged today and the service turned up in staging." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [17:38:17] (03PS1) 10Bking: an-db1001: add `airflow_test_k8s` user and db [puppet] - 10https://gerrit.wikimedia.org/r/1039260 (https://phabricator.wikimedia.org/T363001) [17:38:49] (03CR) 10Scott French: [C:03+2] services: add data-gateway service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [17:39:00] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039260 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [17:39:45] (03Merged) 10jenkins-bot: services: add data-gateway service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [17:39:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P64126 and previous config saved to /var/cache/conftool/dbconfig/20240605-173954-ladsgroup.json [17:42:07] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1039256|Stop writing to pagelinks old columns in enwiki (T352010)]] (duration: 12m 19s) [17:42:12] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:46:04] PROBLEM - Host logging-hd1001 is DOWN: PING CRITICAL - Packet loss = 100% [17:47:09] (03CR) 10Btullis: [C:03+1] "Nice, thanks Bking." 
[puppet] - 10https://gerrit.wikimedia.org/r/1039260 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [17:47:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T364299)', diff saved to https://phabricator.wikimedia.org/P64127 and previous config saved to /var/cache/conftool/dbconfig/20240605-174724-marostegui.json [17:47:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2199.codfw.wmnet with reason: Maintenance [17:47:28] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [17:47:32] RECOVERY - Host logging-hd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [17:47:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2199.codfw.wmnet with reason: Maintenance [17:48:40] PROBLEM - OpenSearch health check for shards on 9200 on logging-hd1001 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f0e245b8210: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech [17:48:40] a.org/wiki/Search%23Administration [17:50:40] RECOVERY - OpenSearch health check for shards on 9200 on logging-hd1001 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: yellow, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 755, active_shards: 1525, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 247, delayed_unassigne [17:50:40] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.06094808126412 https://wikitech.wikimedia.org/wiki/Search%23Administration 
[17:50:58] !log kamila@cumin1002 START - Cookbook sre.hosts.dhcp for host wikikube-ctrl1001.eqiad.wmnet [17:55:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T352010)', diff saved to https://phabricator.wikimedia.org/P64128 and previous config saved to /var/cache/conftool/dbconfig/20240605-175503-ladsgroup.json [17:55:07] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:57:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T352010)', diff saved to https://phabricator.wikimedia.org/P64129 and previous config saved to /var/cache/conftool/dbconfig/20240605-175725-ladsgroup.json [18:00:00] PROBLEM - Host logging-hd1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:00:04] dduvall and dancy: Time to snap out of that daydream and deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1800). [18:01:34] RECOVERY - Host logging-hd1002 is UP: PING OK - Packet loss = 0%, RTA = 5.10 ms [18:02:38] PROBLEM - OpenSearch health check for shards on 9200 on logging-hd1002 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7facc93c6e10: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech [18:02:38] a.org/wiki/Search%23Administration [18:04:02] o/ [18:04:38] RECOVERY - OpenSearch health check for shards on 9200 on logging-hd1002 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: yellow, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 755, active_shards: 1524, relocating_shards: 0, initializing_shards: 0, 
unassigned_shards: 248, delayed_unassigne [18:04:38] 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.00451467268623 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:06:44] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [18:07:36] !log aokoth@cumin1002 START - Cookbook sre.hosts.reboot-single for host vrts1001.eqiad.wmnet [18:11:41] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts1001.eqiad.wmnet [18:12:16] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/data-gateway: apply [18:12:19] (03CR) 10Xcollazo: "Are these dead links then?" [puppet] - 10https://gerrit.wikimedia.org/r/1039246 (owner: 10Milimetric) [18:12:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P64130 and previous config saved to /var/cache/conftool/dbconfig/20240605-181234-ladsgroup.json [18:13:02] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [18:13:06] (03PS18) 10Ryan Kemper: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) [18:18:42] (03PS19) 10Ryan Kemper: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) [18:21:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:26:52] (03PS20) 10Ryan Kemper: 
wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) [18:27:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P64131 and previous config saved to /var/cache/conftool/dbconfig/20240605-182742-ladsgroup.json [18:30:27] (03CR) 10CI reject: [V:04-1] wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper) [18:32:31] (03CR) 10BCornwall: [C:03+2] Move ncmonitor credentials to its own profile [labs/private] - 10https://gerrit.wikimedia.org/r/1037857 (owner: 10BCornwall) [18:32:33] (03CR) 10BCornwall: [V:03+2 C:03+2] Move ncmonitor credentials to its own profile [labs/private] - 10https://gerrit.wikimedia.org/r/1037857 (owner: 10BCornwall) [18:39:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [18:40:16] PROBLEM - Host logging-hd1003 is DOWN: PING CRITICAL - Packet loss = 100% [18:41:42] (03PS1) 10TrainBranchBot: group1 wikis to 1.43.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039272 (https://phabricator.wikimedia.org/T361402) [18:41:44] (03CR) 10TrainBranchBot: [C:03+2] group1 wikis to 1.43.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039272 (https://phabricator.wikimedia.org/T361402) (owner: 10TrainBranchBot) [18:41:53] (03PS21) 10Ryan Kemper: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) [18:42:24] (03Merged) 
10jenkins-bot: group1 wikis to 1.43.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039272 (https://phabricator.wikimedia.org/T361402) (owner: 10TrainBranchBot) [18:42:48] (03PS22) 10Ryan Kemper: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) [18:42:50] RECOVERY - Host logging-hd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [18:42:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T352010)', diff saved to https://phabricator.wikimedia.org/P64132 and previous config saved to /var/cache/conftool/dbconfig/20240605-184250-ladsgroup.json [18:42:55] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:44:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [18:49:25] (03CR) 10Bking: [C:03+2] an-db1001: add `airflow_test_k8s` user and db [puppet] - 10https://gerrit.wikimedia.org/r/1039260 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [18:51:05] (03PS1) 10BCornwall: ncmonitor: Reformat credentials [labs/private] - 10https://gerrit.wikimedia.org/r/1039275 [18:51:17] (03CR) 10BCornwall: [V:03+2 C:03+2] ncmonitor: Reformat credentials [labs/private] - 10https://gerrit.wikimedia.org/r/1039275 (owner: 10BCornwall) [18:53:03] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [18:53:18] !log dduvall@deploy1002 rebuilt and 
synchronized wikiversions files: group1 wikis to 1.43.0-wmf.8 refs T361402 [18:53:21] T361402: 1.43.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T361402 [18:53:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [18:57:15] FIRING: [2x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [18:57:40] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: eqiad, codfw 2 VM request for postfix mx-out - https://phabricator.wikimedia.org/T361750#9865257 (10jhathaway) [18:57:41] 06SRE, 06Infrastructure-Foundations, 10Mail: Provision mx-out - https://phabricator.wikimedia.org/T325407#9865258 (10jhathaway) [18:58:36] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/data-gateway: apply [18:58:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [18:58:56] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9865254 (10cmooney) a:05MatthewVernon→03cmooney [19:01:22] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9865262 (10cmooney) p:05Triage→03Medium 
a:05MatthewVernon→03cmooney [19:02:00] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e6-eqiad - https://phabricator.wikimedia.org/T365987#9865284 (10cmooney) p:05Triage→03Medium a:05ABran-WMF→03cmooney [19:02:15] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [19:03:36] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9865298 (10cmooney) p:05Triage→03Medium a:05MatthewVernon→03cmooney [19:03:39] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9865304 (10cmooney) [19:03:45] RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [19:04:03] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9865307 (10cmooney) >>! In T365988#9837257, @MatthewVernon wrote: > From the swift POV, this is just checking the cluster is hap... 
[19:06:38] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9865316 (10cmooney) p:05Triage→03Medium a:05ABran-WMF→03cmooney [19:06:50] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9865331 (10cmooney) [19:08:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9865335 (10wiki_willy) Hi @dcaro - just following up on this. Can you provide the racking information for us, to start this install? Thanks, Willy [19:08:30] FIRING: [2x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [19:08:59] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9865360 (10cmooney) [19:09:39] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [19:09:51] (03PS3) 10BCornwall: ncmonitor: Add SSH credentials support [puppet] - 10https://gerrit.wikimedia.org/r/1038890 (https://phabricator.wikimedia.org/T355189) [19:10:01] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9865354 (10cmooney) p:05Triage→03Medium a:05ABran-WMF→03cmooney [19:11:31] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9865362 (10cmooney) 
p:05Triage→03Medium a:05ABran-WMF→03cmooney [19:12:07] 06SRE, 06Infrastructure-Foundations, 10vm-requests: Site: eqiad, codfw 2 VM request for postfix mx-in - https://phabricator.wikimedia.org/T366744 (10jhathaway) 03NEW [19:12:30] 06SRE, 06Infrastructure-Foundations, 10vm-requests: Site: eqiad, codfw 2 VM request for postfix mx-in - https://phabricator.wikimedia.org/T366744#9865408 (10jhathaway) p:05Triage→03Medium a:03jhathaway [19:12:39] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9865379 (10cmooney) p:05Triage→03Medium a:05ABran-WMF→03cmooney [19:13:15] RESOLVED: [2x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [19:13:17] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9865417 (10cmooney) [19:13:27] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9865412 (10cmooney) p:05Triage→03Medium a:05ABran-WMF→03cmooney [19:16:57] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9865429 (10cmooney) p:05Triage→03Medium a:05ABran-WMF→03cmooney [19:17:04] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9865435 (10cmooney) I spoke to @Jclark-ctr earlier, we will do this commencing at 12:00 UTC tomorrow Thurs 6th Jun. 
[19:21:28] 06SRE, 06Infrastructure-Foundations, 10Mail: Provision mx-in - https://phabricator.wikimedia.org/T325406#9865447 (10jhathaway) [19:22:13] (03PS1) 10JHathaway: email: add node definitions for mx-in boxen [puppet] - 10https://gerrit.wikimedia.org/r/1039280 (https://phabricator.wikimedia.org/T325406) [19:24:50] (03CR) 10JHathaway: [C:03+2] email: add node definitions for mx-in boxen [puppet] - 10https://gerrit.wikimedia.org/r/1039280 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [19:26:13] (03PS4) 10BCornwall: ncmonitor: Add SSH credentials support [puppet] - 10https://gerrit.wikimedia.org/r/1038890 (https://phabricator.wikimedia.org/T355189) [19:27:27] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [19:28:12] (03PS5) 10BCornwall: ncmonitor: Add SSH credentials support [puppet] - 10https://gerrit.wikimedia.org/r/1038890 (https://phabricator.wikimedia.org/T355189) [19:28:13] (03CR) 10Cathal Mooney: Include vlans with an IRB int in device vlans even if not on L2 port (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal Mooney) [19:29:06] (03PS1) 10BCornwall: ncmonitor: Move ssh key block to end of the file [labs/private] - 10https://gerrit.wikimedia.org/r/1039281 [19:29:21] (03CR) 10BCornwall: [V:03+2 C:03+2] ncmonitor: Move ssh key block to end of the file [labs/private] - 10https://gerrit.wikimedia.org/r/1039281 (owner: 10BCornwall) [19:36:51] !log jhathaway@cumin1002 START - Cookbook sre.ganeti.makevm for new host mx-in1001.wikimedia.org [19:36:53] !log jhathaway@cumin1002 START - Cookbook sre.dns.netbox [19:38:58] !log jhathaway@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Add records for VM mx-in1001.wikimedia.org - jhathaway@cumin1002" [19:43:54] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM mx-in1001.wikimedia.org - jhathaway@cumin1002" [19:43:54] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:43:55] !log jhathaway@cumin1002 START - Cookbook sre.dns.wipe-cache mx-in1001.wikimedia.org on all recursors [19:43:58] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mx-in1001.wikimedia.org on all recursors [19:44:11] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9865522 (10cmooney) >>! In T360789#9855905, @Papaul wrote: > @cmooney all good on lsw1-d4, lsw1-c2 and lsw1-d8 Thanks! Confirmed all looks good. What was... [19:44:24] !log jhathaway@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM mx-in1001.wikimedia.org - jhathaway@cumin1002" [19:45:10] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM mx-in1001.wikimedia.org - jhathaway@cumin1002" [19:46:42] (03PS1) 10BCornwall: ncmonitor: Temporary removal of passwords lookup [puppet] - 10https://gerrit.wikimedia.org/r/1039285 [19:47:01] (03CR) 10Ssingh: [C:03+1] ncmonitor: Temporary removal of passwords lookup [puppet] - 10https://gerrit.wikimedia.org/r/1039285 (owner: 10BCornwall) [19:47:04] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host mx-in1001.wikimedia.org with OS bookworm [19:47:11] 06SRE, 06Infrastructure-Foundations, 10vm-requests: Site: eqiad, codfw 2 VM request for postfix mx-in - 
https://phabricator.wikimedia.org/T366744#9865533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin1002 for host mx-in1001.wikimedia.org with OS bookworm [19:47:13] (03CR) 10BCornwall: [V:03+2 C:03+2] ncmonitor: Temporary removal of passwords lookup [puppet] - 10https://gerrit.wikimedia.org/r/1039285 (owner: 10BCornwall) [19:50:02] (03PS6) 10BCornwall: ncmonitor: Add SSH credentials support [puppet] - 10https://gerrit.wikimedia.org/r/1038890 (https://phabricator.wikimedia.org/T355189) [19:52:05] (03PS1) 10Urbanecm: Add throttle exception for an upcoming workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039287 (https://phabricator.wikimedia.org/T366748) [19:52:43] (03CR) 10CI reject: [V:04-1] Add throttle exception for an upcoming workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039287 (https://phabricator.wikimedia.org/T366748) (owner: 10Urbanecm) [19:54:04] (03PS2) 10Urbanecm: Add throttle exception for an upcoming workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039287 (https://phabricator.wikimedia.org/T366748) [19:57:02] the fancy scheduling tool doesn't seem to be doing anything :( [19:57:14] !log jhathaway@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mx-in1001.wikimedia.org with reason: host reimage [19:58:10] (03PS1) 10BCornwall: ncmonitor: Add test key to solve pcc error [labs/private] - 10https://gerrit.wikimedia.org/r/1039288 [19:59:07] (03CR) 10BCornwall: [V:03+2 C:03+2] ncmonitor: Add test key to solve pcc error [labs/private] - 10https://gerrit.wikimedia.org/r/1039288 (owner: 10BCornwall) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T2000). [20:00:05] Dreamy_Jazz and sergi0: A patch you scheduled for UTC late backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:09] i can deploy today [20:00:15] hello [20:00:15] \o [20:00:15] hi Dreamy_Jazz and sergi0! [20:00:22] Hi there. [20:00:30] sergi0: do you want to do the backports for the testwiki as well? [20:00:36] or just beta today? [20:01:00] (03PS3) 10Dreamy Jazz: [CheckUser] Stop writing old for event tables migration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038740 (https://phabricator.wikimedia.org/T360685) [20:01:01] (03PS7) 10BCornwall: ncmonitor: Add SSH credentials support [puppet] - 10https://gerrit.wikimedia.org/r/1038890 (https://phabricator.wikimedia.org/T355189) [20:01:04] (03CR) 10Urbanecm: [C:03+2] [CheckUser] Stop writing old for event tables migration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038740 (https://phabricator.wikimedia.org/T360685) (owner: 10Dreamy Jazz) [20:01:11] (03PS5) 10Urbanecm: Growth: Use `growthexperiments` DB list for enabling GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038882 (https://phabricator.wikimedia.org/T364892) [20:01:14] (03CR) 10Urbanecm: [C:03+2] Growth: Use `growthexperiments` DB list for enabling GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038882 (https://phabricator.wikimedia.org/T364892) (owner: 10Urbanecm) [20:01:17] urbanecm: I'd prefer just beta, so we can accumulate other possible backports to testwiki [20:01:46] (03Merged) 10jenkins-bot: [CheckUser] Stop writing old for event tables migration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038740 (https://phabricator.wikimedia.org/T360685) (owner: 10Dreamy Jazz) [20:01:51] sergi0: ack. 
we could do just backports to save time tomorrow, or we can get everything tomorrow too [20:01:57] (03Merged) 10jenkins-bot: Growth: Use `growthexperiments` DB list for enabling GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038882 (https://phabricator.wikimedia.org/T364892) (owner: 10Urbanecm) [20:01:58] (03PS9) 10Sergio Gimeno: [Beta] Enable CommunityConfiguration extension in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) [20:02:01] (03CR) 10Urbanecm: [C:03+2] [Beta] Enable CommunityConfiguration extension in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) (owner: 10Sergio Gimeno) [20:02:16] I'll be able to test my config patch by inspecting the DB after performing a log action. [20:02:38] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mx-in1001.wikimedia.org with reason: host reimage [20:02:40] (03Merged) 10jenkins-bot: [Beta] Enable CommunityConfiguration extension in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) (owner: 10Sergio Gimeno) [20:02:45] Dreamy_Jazz: ack [20:03:38] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1038740|[CheckUser] Stop writing old for event tables migration on group0 (T360685)]], [[gerrit:1038882|Growth: Use `growthexperiments` DB list for enabling GrowthExperiments (T364892)]], [[gerrit:1035473|[Beta] Enable CommunityConfiguration extension in all wikis (T364892)]] [20:03:42] T360685: Stop writing old for event table migration on WMF wikis - https://phabricator.wikimedia.org/T360685 [20:03:42] T364892: Enable CommunityConfiguration on all beta wikis with GrowthExperiments - https://phabricator.wikimedia.org/T364892 [20:03:52] (03PS8) 10BCornwall: ncmonitor: Add SSH credentials support [puppet] - 10https://gerrit.wikimedia.org/r/1038890 
(https://phabricator.wikimedia.org/T355189) [20:04:40] (03PS1) 10CDanis: otelcol: filter out sessionstore user IDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039292 (https://phabricator.wikimedia.org/T366750) [20:05:50] (03CR) 10Sergio Gimeno: [C:03+1] Drop logging level for unsupported providers to DEBUG [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038714 (https://phabricator.wikimedia.org/T366519) (owner: 10Urbanecm) [20:06:09] (03CR) 10Sergio Gimeno: [C:03+1] Improve navigation link handling in CommunityConfiguration [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038843 (https://phabricator.wikimedia.org/T364938) (owner: 10Sergio Gimeno) [20:06:18] !log urbanecm@deploy1002 urbanecm and sgimeno and dreamyjazz: Backport for [[gerrit:1038740|[CheckUser] Stop writing old for event tables migration on group0 (T360685)]], [[gerrit:1038882|Growth: Use `growthexperiments` DB list for enabling GrowthExperiments (T364892)]], [[gerrit:1035473|[Beta] Enable CommunityConfiguration extension in all wikis (T364892)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/M [20:06:18] wdebug) [20:06:22] (03CR) 10Kosta Harlan: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [20:06:26] Dreamy_Jazz: can you do the testing, please? [20:06:31] Sure. 
[20:06:41] (03PS1) 10BCornwall: ncmonitor: Fix namespacing of keys [labs/private] - 10https://gerrit.wikimedia.org/r/1039293 [20:07:11] sergi0: and let's spot test the growthexperiments list part as well (just that Growth features don't disappear, i guess) [20:07:19] (03PS2) 10BCornwall: ncmonitor: Fix namespacing of keys [labs/private] - 10https://gerrit.wikimedia.org/r/1039293 [20:07:48] (03CR) 10BCornwall: [V:03+2 C:03+2] ncmonitor: Fix namespacing of keys [labs/private] - 10https://gerrit.wikimedia.org/r/1039293 (owner: 10BCornwall) [20:08:05] urbanecm: alright [20:08:20] urbanecm: Test successful. [20:08:21] (03PS16) 10Kosta Harlan: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) [20:08:25] Dreamy_Jazz: great! [20:10:01] urbanecm: as per running the migration script, how should we proceed? Within this window? At least for testwiki and some betas? [20:10:18] (03PS9) 10BCornwall: ncmonitor: Add SSH credentials support [puppet] - 10https://gerrit.wikimedia.org/r/1038890 (https://phabricator.wikimedia.org/T355189) [20:10:41] sergi0: feel free to run the script in beta at any time (during the window or after it, up2you). for testwiki, i think the script cannot be executed until we enable the feature there? [20:10:55] (I'm OK with doing that today, but you seemed like you want to wait) [20:11:19] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2783/co" [puppet] - 10https://gerrit.wikimedia.org/r/1038890 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall) [20:12:42] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/data-gateway: apply [20:14:07] sergi0: i checked a couple of wikis, GE features appear available. ok to sync from your side? [20:15:02] urbanecm: Yes, I'll start running the script after. 
For testwiki I prefer to wait. [20:15:08] ok [20:15:53] Interesting that the logmsgbot message ended up having the URL truncated in the on-wiki phab message [20:16:04] FIRING: [2x] PuppetDisabled: Puppet disabled on mc1049:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=memcached&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [20:16:14] i.e. https://wikitech.wikimedia.org/wiki/M being the URL [20:16:15] Dreamy_Jazz: the mwdebug one? [20:16:22] yeah, that's for irc max length constraints [20:16:39] i know that for ~3 patches, the URL is the only thing that gets cut out sometimes (depending on commit messages) [20:16:42] Ah yeah, that would explain it. [20:16:59] !log urbanecm@deploy1002 urbanecm and sgimeno and dreamyjazz: Continuing with sync [20:18:21] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mx-in1001.wikimedia.org with OS bookworm [20:18:21] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host mx-in1001.wikimedia.org [20:18:27] 06SRE, 06Infrastructure-Foundations, 10vm-requests: Site: eqiad, codfw 2 VM request for postfix mx-in - https://phabricator.wikimedia.org/T366744#9865636 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin1002 for host mx-in1001.wikimedia.org with OS bookworm completed: - m... 
[20:19:09] (03PS1) 10CDanis: otelcol: filter common healthcheck spans [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039297 (https://phabricator.wikimedia.org/T366750) [20:21:54] !log jhathaway@cumin1002 START - Cookbook sre.ganeti.makevm for new host mx-in2001.wikimedia.org [20:21:55] !log jhathaway@cumin1002 START - Cookbook sre.dns.netbox [20:22:57] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [20:23:43] (03PS2) 10CDanis: otelcol: filter out sessionstore user IDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039292 (https://phabricator.wikimedia.org/T366750) [20:23:43] (03PS2) 10CDanis: otelcol: filter common healthcheck spans [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039297 (https://phabricator.wikimedia.org/T366750) [20:24:01] !log jhathaway@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM mx-in2001.wikimedia.org - jhathaway@cumin1002" [20:25:08] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM mx-in2001.wikimedia.org - jhathaway@cumin1002" [20:25:08] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:25:08] !log jhathaway@cumin1002 START - Cookbook sre.dns.wipe-cache mx-in2001.wikimedia.org on all recursors [20:25:12] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mx-in2001.wikimedia.org on all recursors [20:25:34] urbanecm: It seems that GrowthExperiments complains about "Invalid suggested edits configuration". Does that mean that for prod wikis we should split each enabling and run the script in between? 
[20:25:38] !log jhathaway@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM mx-in2001.wikimedia.org - jhathaway@cumin1002" [20:25:43] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1038740|[CheckUser] Stop writing old for event tables migration on group0 (T360685)]], [[gerrit:1038882|Growth: Use `growthexperiments` DB list for enabling GrowthExperiments (T364892)]], [[gerrit:1035473|[Beta] Enable CommunityConfiguration extension in all wikis (T364892)]] (duration: 22m 04s) [20:25:50] sergi0: complains where? [20:25:51] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, 13Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9865652 (10Dwisehaupt) @jhathaway Question about the routing of mail with these hosts. Currently the civicrm host receives mail... [20:25:52] T360685: Stop writing old for event table migration on WMF wikis - https://phabricator.wikimedia.org/T360685 [20:25:52] T364892: Enable CommunityConfiguration on all beta wikis with GrowthExperiments - https://phabricator.wikimedia.org/T364892 [20:26:07] Thanks! 
[20:26:11] no problem Dreamy_Jazz [20:26:30] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM mx-in2001.wikimedia.org - jhathaway@cumin1002" [20:26:31] urbanecm: eg: https://beta-logs.wmcloud.org/goto/d72e84a040b77bbeba4f9670e75fb0a1 [20:26:45] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host mx-in2001.wikimedia.org with OS bookworm [20:27:03] 06SRE, 06Infrastructure-Foundations, 10vm-requests: Site: eqiad, codfw 2 VM request for postfix mx-in - https://phabricator.wikimedia.org/T366744#9865669 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin1002 for host mx-in2001.wikimedia.org with OS bookworm [20:29:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2206.codfw.wmnet with reason: Maintenance [20:29:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2206.codfw.wmnet with reason: Maintenance [20:29:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2206 (T364299)', diff saved to https://phabricator.wikimedia.org/P64133 and previous config saved to /var/cache/conftool/dbconfig/20240605-202949-marostegui.json [20:29:56] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [20:30:26] (03CR) 10BCornwall: [V:03+1 C:03+2] ncmonitor: Add SSH credentials support [puppet] - 10https://gerrit.wikimedia.org/r/1038890 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall) [20:30:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... 
[20:30:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [20:32:15] sergi0: good question...we probably should enable wmgUseCommunityConfiguration first, run script second and then enable the GE flag [20:32:21] that way, this should not happen [20:33:11] urbanecm: ack [20:34:57] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, 13Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9865703 (10jhathaway) [20:35:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... [20:35:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [20:36:15] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... [20:36:15] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [20:38:25] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, 13Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9865710 (10jhathaway) >>! 
In T365395#9865652, @Dwisehaupt wrote: > @jhathaway Question about the routing of mail with these host... [20:42:54] !log jhathaway@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mx-in2001.wikimedia.org with reason: host reimage [20:45:14] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mx-in2001.wikimedia.org with reason: host reimage [20:46:00] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... [20:46:00] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [20:51:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:56:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T2100) [21:02:19] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mx-in2001.wikimedia.org with OS bookworm [21:02:19] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host mx-in2001.wikimedia.org [21:02:27] 06SRE, 06Infrastructure-Foundations, 10vm-requests: Site: 
eqiad, codfw 2 VM request for postfix mx-in - https://phabricator.wikimedia.org/T366744#9865741 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin1002 for host mx-in2001.wikimedia.org with OS bookworm completed: - m... [21:04:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... [21:04:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [21:07:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [21:08:29] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/data-gateway: apply [21:09:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... [21:09:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [21:10:15] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... 
[21:10:15] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [21:18:39] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [21:30:00] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... [21:30:00] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [21:33:23] (03PS12) 10Dzahn: peopleweb: introduce script to warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 [21:34:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:36:35] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin2002 - T366555 [21:37:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [21:41:59] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin2002 - T366555 [21:42:23] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin2002 - T366555 [21:43:09] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:cassandra-dev: Hail mary - eevans@cumin1002 [21:46:54] (03CR) 10Dzahn: [C:03+2] peopleweb: introduce script to warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 (owner: 10Dzahn) [21:51:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... [21:51:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [21:56:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... 
[21:56:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [21:57:10] (03PS1) 10Dzahn: peopleweb: fix file permission and typo in script config [puppet] - 10https://gerrit.wikimedia.org/r/1039303 (https://phabricator.wikimedia.org/T343364) [21:59:56] (03CR) 10Dzahn: [C:03+2] "forgot to link https://phabricator.wikimedia.org/T343364" [puppet] - 10https://gerrit.wikimedia.org/r/989577 (owner: 10Dzahn) [22:00:26] (03CR) 10Dzahn: [C:03+2] peopleweb: fix file permission and typo in script config [puppet] - 10https://gerrit.wikimedia.org/r/1039303 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [22:03:13] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:cassandra-dev: Hail mary - eevans@cumin1002 [22:13:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 13.29% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:13:39] (03PS1) 10Dzahn: peopleweb: set warning threshold for home dirs to 2GB [puppet] - 10https://gerrit.wikimedia.org/r/1039305 (https://phabricator.wikimedia.org/T343364) [22:15:45] (03CR) 10Dzahn: [C:03+2] peopleweb: set warning threshold for home dirs to 2GB [puppet] - 10https://gerrit.wikimedia.org/r/1039305 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [22:16:15] FIRING: MediaWikiLatencyExceeded: Average latency high: eqiad appserver GET/200: 0.42000034901261735s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:16:15] FIRING: MediaWikiLatencyExceeded: Average latency high: eqiad api_appserver GET/200: 0.21520063012904037s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyE [22:16:21] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.223s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:18:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 14.86% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:19:28] (03PS1) 10BryanDavis: wikitech: Update Phabricator Conduit calls to disable/enable users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039307 (https://phabricator.wikimedia.org/T366587) [22:21:15] RESOLVED: MediaWikiLatencyExceeded: Average latency high: eqiad appserver GET/200: 0.42000034901261735s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceede [22:21:15] RESOLVED: MediaWikiLatencyExceeded: Average latency high: eqiad api_appserver GET/200: ... [22:21:15] 0.21520063012904037s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:21:21] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.223s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:21:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:21:45] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [22:24:15] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1010 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:26:45] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - 
https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [22:27:55] FIRING: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:28:15] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1010 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:37:55] FIRING: [12x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:38:15] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1010 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:42:55] FIRING: [12x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:43:53] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:44:12] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/data-gateway: apply [22:44:15] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1010 is OK: OK: Status of the systemd unit 
push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:47:55] RESOLVED: [8x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:48:27] (03PS2) 10BryanDavis: wikitech: Replace OSM class in Gerrit blocking hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038749 (https://phabricator.wikimedia.org/T161553) (owner: 10Majavah) [22:48:27] (03PS3) 10BryanDavis: wikitech: Stop loading OpenStackManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038750 (https://phabricator.wikimedia.org/T161553) (owner: 10Majavah) [22:49:04] 06SRE, 10Cassandra, 06Data Products, 06serviceops, and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9866001 (10Scott_French) [22:50:11] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin2002 - T366555 [22:52:55] FIRING: [14x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:53:53] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1009 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:53:57] PROBLEM - Host logstash1026 is DOWN: PING CRITICAL - Packet loss = 100% [22:54:21] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [22:55:27] RECOVERY - 
Host logstash1026 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [23:02:55] RESOLVED: [7x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:09:29] 06SRE, 10Cassandra, 06Data Products, 06serviceops, and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9866023 (10Scott_French) The service is turned up in staging and was verified against the commons impact metrics dataset present in cassandra staging a... [23:11:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:14:41] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:29:06] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [23:29:19] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [23:29:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T352010)', diff saved to https://phabricator.wikimedia.org/P64134 and previous config saved to /var/cache/conftool/dbconfig/20240605-232926-ladsgroup.json [23:29:30] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:29:50] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: 
Maintenance [23:30:14] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance [23:38:26] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1038799 [23:38:26] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1038799 (owner: 10TrainBranchBot) [23:44:07] (03PS1) 10Stoyofuku-wmf: Refine list of pages where font size controls are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039310 (https://phabricator.wikimedia.org/T366334) [23:45:32] (03PS2) 10Stoyofuku-wmf: Disable font size options on specified pages for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038876 (https://phabricator.wikimedia.org/T366625) [23:45:57] (03CR) 10BryanDavis: [C:03+1] "I will roll this out along with I8aa283b88ed7896e8dddd16fd9c3fe4588e2e51e, probably on 2024-06-06" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038749 (https://phabricator.wikimedia.org/T161553) (owner: 10Majavah) [23:46:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T364299)', diff saved to https://phabricator.wikimedia.org/P64135 and previous config saved to /var/cache/conftool/dbconfig/20240605-234643-marostegui.json [23:46:46] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [23:59:32] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1038799 (owner: 10TrainBranchBot)