[00:00:24] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1038789 (owner: TrainBranchBot)
[00:04:25] <wikibugs>	 (PS1) Cwhite: logstash: add drop for php notice unedefined index issue [puppet] - https://gerrit.wikimedia.org/r/1038790 (https://phabricator.wikimedia.org/T366657)
[00:08:04] <wikibugs>	 (CR) Cwhite: [C:+2] logstash: add drop for php notice unedefined index issue [puppet] - https://gerrit.wikimedia.org/r/1038790 (https://phabricator.wikimedia.org/T366657) (owner: Cwhite)
[00:20:28] <wikibugs>	 (PS1) JHathaway: dummy ssl key [labs/private] - https://gerrit.wikimedia.org/r/1038920
[00:22:38] <wikibugs>	 (CR) JHathaway: [C:+2] dummy ssl key [labs/private] - https://gerrit.wikimedia.org/r/1038920 (owner: JHathaway)
[00:22:41] <wikibugs>	 (CR) JHathaway: [V:+2 C:+2] dummy ssl key [labs/private] - https://gerrit.wikimedia.org/r/1038920 (owner: JHathaway)
[00:25:09] <wikibugs>	 (CR) JHathaway: [V:+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: JHathaway)
[01:03:43] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Create a mailing list for Bangla Wikimoitree - https://phabricator.wikimedia.org/T365915#9862244 (Bodhisattwa) Seeing the ESEAP mailing list, I think, it would be OK, if we get the name as wikimoitree@lists.wikimedia.org
[01:08:45] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:10:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:08:45] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:10:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:34:02] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[02:34:15] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[02:34:23] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T352010)', diff saved to https://phabricator.wikimedia.org/P64041 and previous config saved to /var/cache/conftool/dbconfig/20240605-023423-ladsgroup.json
[02:34:26] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[02:38:44] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:55:45] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:57:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:13:10] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T352010)', diff saved to https://phabricator.wikimedia.org/P64042 and previous config saved to /var/cache/conftool/dbconfig/20240605-031310-ladsgroup.json
[03:13:13] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[03:27:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T364299)', diff saved to https://phabricator.wikimedia.org/P64043 and previous config saved to /var/cache/conftool/dbconfig/20240605-032704-marostegui.json
[03:27:07] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[03:28:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P64044 and previous config saved to /var/cache/conftool/dbconfig/20240605-032817-ladsgroup.json
[03:42:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P64045 and previous config saved to /var/cache/conftool/dbconfig/20240605-034212-marostegui.json
[03:43:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P64046 and previous config saved to /var/cache/conftool/dbconfig/20240605-034326-ladsgroup.json
[03:57:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P64047 and previous config saved to /var/cache/conftool/dbconfig/20240605-035719-marostegui.json
[03:58:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T364069)', diff saved to https://phabricator.wikimedia.org/P64048 and previous config saved to /var/cache/conftool/dbconfig/20240605-035831-marostegui.json
[03:58:33] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T352010)', diff saved to https://phabricator.wikimedia.org/P64049 and previous config saved to /var/cache/conftool/dbconfig/20240605-035832-ladsgroup.json
[03:58:35] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[03:58:35] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[03:58:37] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[03:58:48] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[03:58:56] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T352010)', diff saved to https://phabricator.wikimedia.org/P64050 and previous config saved to /var/cache/conftool/dbconfig/20240605-035855-ladsgroup.json
[04:12:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T364299)', diff saved to https://phabricator.wikimedia.org/P64051 and previous config saved to /var/cache/conftool/dbconfig/20240605-041227-marostegui.json
[04:12:31] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2155.codfw.wmnet with reason: Maintenance
[04:12:31] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[04:12:44] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2155.codfw.wmnet with reason: Maintenance
[04:12:46] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance
[04:12:59] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance
[04:13:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T364299)', diff saved to https://phabricator.wikimedia.org/P64052 and previous config saved to /var/cache/conftool/dbconfig/20240605-041306-marostegui.json
[04:13:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P64053 and previous config saved to /var/cache/conftool/dbconfig/20240605-041339-marostegui.json
[04:28:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P64054 and previous config saved to /var/cache/conftool/dbconfig/20240605-042847-marostegui.json
[04:43:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T364069)', diff saved to https://phabricator.wikimedia.org/P64055 and previous config saved to /var/cache/conftool/dbconfig/20240605-044355-marostegui.json
[04:43:57] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1242.eqiad.wmnet with reason: Maintenance
[04:43:58] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[04:44:11] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1242.eqiad.wmnet with reason: Maintenance
[04:44:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T364069)', diff saved to https://phabricator.wikimedia.org/P64056 and previous config saved to /var/cache/conftool/dbconfig/20240605-044418-marostegui.json
[05:08:13] <wikibugs>	 (PS1) Marostegui: es6,es7: Add candidate masters [puppet] - https://gerrit.wikimedia.org/r/1038925 (https://phabricator.wikimedia.org/T365098)
[05:09:09] <wikibugs>	 (CR) Marostegui: "This is a noop" [puppet] - https://gerrit.wikimedia.org/r/1038925 (https://phabricator.wikimedia.org/T365098) (owner: Marostegui)
[05:09:11] <wikibugs>	 (CR) Marostegui: [C:+2] es6,es7: Add candidate masters [puppet] - https://gerrit.wikimedia.org/r/1038925 (https://phabricator.wikimedia.org/T365098) (owner: Marostegui)
[05:11:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:17:11] <icinga-wm_>	 PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 1457 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[05:19:59] <wikibugs>	 (CR) Sg912: [V:+1 C:+1] cassandra: create new commons_impact_analytics role [puppet] - https://gerrit.wikimedia.org/r/1038409 (https://phabricator.wikimedia.org/T361835) (owner: Eevans)
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T0600)
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:17:11] <icinga-wm_>	 RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[06:55:13] <wikibugs>	 (CR) Hashar: [C:+2] Use a wildcard TypeScript include for plugins [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1038810 (owner: Hashar)
[06:55:43] <wikibugs>	 (Merged) jenkins-bot: Use a wildcard TypeScript include for plugins [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1038810 (owner: Hashar)
[06:57:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T0700)
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:03:44] <wikibugs>	 (CR) Muehlenhoff: "Looks good, two nits inline" [puppet] - https://gerrit.wikimedia.org/r/1038772 (owner: EoghanGaffney)
[07:07:51] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s2 T366038
[07:07:53] <wikibugs>	 (CR) DCausse: "thanks for the fixes!" [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: Ryan Kemper)
[07:07:54] <stashbot>	 T366038: Switchover s2 master (db2207 -> db2204) - https://phabricator.wikimedia.org/T366038
[07:07:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2204 with weight 0 T366038', diff saved to https://phabricator.wikimedia.org/P64057 and previous config saved to /var/cache/conftool/dbconfig/20240605-070758-root.json
[07:08:15] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s2 T366038
[07:08:17] <wikibugs>	 (PS10) DCausse: wdqs.data-reload: fix regex escaping [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: Ryan Kemper)
[07:08:19] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good (once approval for analytics-privatedata-users is in)" [puppet] - https://gerrit.wikimedia.org/r/1035545 (https://phabricator.wikimedia.org/T364715) (owner: Dzahn)
[07:08:48] <wikibugs>	 (CR) DCausse: [C:+1] wdqs.data-reload: fix regex escaping (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: Ryan Kemper)
[07:09:47] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9862426 (MoritzMuehlenhoff) >>! In T364715#9840315, @colewhite wrote: > Added Data Engineering tag fo...
[07:10:10] <wikibugs>	 (PS1) Marostegui: mariadb: Promote db2204 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1039067 (https://phabricator.wikimedia.org/T366038)
[07:10:14] <wikibugs>	 (Abandoned) Marostegui: mariadb: Promote db2204 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1035872 (https://phabricator.wikimedia.org/T366038) (owner: Gerrit maintenance bot)
[07:10:37] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db2204 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1039067 (https://phabricator.wikimedia.org/T366038) (owner: Marostegui)
[07:16:53] <wikibugs>	 (CR) Ryan Kemper: [C:+2] opensearch/roll-restart-reboot: fix usage [cookbooks] - https://gerrit.wikimedia.org/r/1031063 (owner: Ryan Kemper)
[07:18:06] <wikibugs>	 (CR) Jelto: [C:-1] "typo, comment in line 😊" [deployment-charts] - https://gerrit.wikimedia.org/r/1038251 (https://phabricator.wikimedia.org/T362518) (owner: Clément Goubert)
[07:18:29] <wikibugs>	 (PS4) Sergio Gimeno: [Beta] Enable CommunityConfiguration extension in all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892)
[07:19:39] <wikibugs>	 (PS5) Sergio Gimeno: [Beta] Enable CommunityConfiguration extension in all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892)
[07:19:56] <wikibugs>	 (CR) Sergio Gimeno: [Beta] Enable CommunityConfiguration extension in all wikis (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) (owner: Sergio Gimeno)
[07:20:56] <wikibugs>	 (Merged) jenkins-bot: opensearch/roll-restart-reboot: fix usage [cookbooks] - https://gerrit.wikimedia.org/r/1031063 (owner: Ryan Kemper)
[07:24:09] <marostegui>	 !log Starting s2 codfw failover from db2207 to db2204 - T366038
[07:24:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:12] <stashbot>	 T366038: Switchover s2 master (db2207 -> db2204) - https://phabricator.wikimedia.org/T366038
[07:24:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2204 to s2 primary T366038', diff saved to https://phabricator.wikimedia.org/P64058 and previous config saved to /var/cache/conftool/dbconfig/20240605-072427-marostegui.json
[07:25:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2207 T366038', diff saved to https://phabricator.wikimedia.org/P64059 and previous config saved to /var/cache/conftool/dbconfig/20240605-072509-root.json
[07:25:15] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9862490 (Reedy)
[07:25:45] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:27:50] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on db2207.codfw.wmnet with reason: Long schema change
[07:27:53] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2207.codfw.wmnet with reason: Long schema change
[07:28:12] <marostegui>	 !log dbmaint codfw s2 deploy schema change on db2207 T364299
[07:28:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:14] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[07:29:25] <wikibugs>	 (PS1) Muehlenhoff: Failover URL downloaders for reboot [dns] - https://gerrit.wikimedia.org/r/1039166
[07:30:04] <wikibugs>	 (PS1) Marostegui: db1186: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1039167 (https://phabricator.wikimedia.org/T366556)
[07:30:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1186', diff saved to https://phabricator.wikimedia.org/P64060 and previous config saved to /var/cache/conftool/dbconfig/20240605-073024-root.json
[07:30:36] <wikibugs>	 (CR) Marostegui: [C:+2] db1186: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1039167 (https://phabricator.wikimedia.org/T366556) (owner: Marostegui)
[07:30:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install1004.wikimedia.org
[07:30:57] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on db1186.eqiad.wmnet with reason: Reimage
[07:31:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install2004.wikimedia.org
[07:31:10] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1186.eqiad.wmnet with reason: Reimage
[07:31:22] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Failover URL downloaders for reboot [dns] - https://gerrit.wikimedia.org/r/1039166 (owner: Muehlenhoff)
[07:35:09] <wikibugs>	 (PS3) Clément Goubert: miscweb: Use a random miscweb image for default value [deployment-charts] - https://gerrit.wikimedia.org/r/1038251 (https://phabricator.wikimedia.org/T362518)
[07:35:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install1004.wikimedia.org
[07:35:16] <wikibugs>	 (CR) Clément Goubert: miscweb: Use a random miscweb image for default value (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1038251 (https://phabricator.wikimedia.org/T362518) (owner: Clément Goubert)
[07:35:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install2004.wikimedia.org
[07:35:29] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1038703 (https://phabricator.wikimedia.org/T366565) (owner: Hashar)
[07:36:47] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm, thanks for the preparation. Let me know when this should be merged." [puppet] - https://gerrit.wikimedia.org/r/1038703 (https://phabricator.wikimedia.org/T366565) (owner: Hashar)
[07:37:30] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1186.eqiad.wmnet with OS bookworm
[07:37:31] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host db1186.eqiad.wmnet with OS bookworm
[07:37:32] <wikibugs>	 (CR) Muehlenhoff: [C:+2] gerrit: remove mac algos no more supported by Mina SSHD [puppet] - https://gerrit.wikimedia.org/r/1038703 (https://phabricator.wikimedia.org/T366565) (owner: Hashar)
[07:38:18] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1186.eqiad.wmnet with OS bookworm
[07:38:19] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host db1186.eqiad.wmnet with OS bookworm
[07:38:43] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1021.eqiad.wmnet
[07:38:53] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1186.eqiad.wmnet with OS bookworm
[07:40:18] <icinga-wm_>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:40:22] <icinga-wm_>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:45:07] <logmsgbot>	 !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1021.eqiad.wmnet
[07:47:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T364299)', diff saved to https://phabricator.wikimedia.org/P64061 and previous config saved to /var/cache/conftool/dbconfig/20240605-074739-marostegui.json
[07:47:43] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[07:50:00] <wikibugs>	 (PS3) Ayounsi: Netbox deploy for 4.0.2 [software/netbox-deploy] (dev) - https://gerrit.wikimedia.org/r/1038694 (https://phabricator.wikimedia.org/T336275)
[07:50:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mirror1001.wikimedia.org
[07:50:32] <wikibugs>	 (CR) Ayounsi: Netbox deploy for 4.0.2 (2 comments) [software/netbox-deploy] (dev) - https://gerrit.wikimedia.org/r/1038694 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[07:52:12] <wikibugs>	 (CR) Clément Goubert: "That would be great to add, but possibly would be more at home in the reboot function itself so it could be reused by all cookbooks." [cookbooks] - https://gerrit.wikimedia.org/r/1038865 (owner: Clément Goubert)
[07:53:05] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1186.eqiad.wmnet with reason: host reimage
[07:53:32] <wikibugs>	 (PS4) Ayounsi: Netbox deploy for 4.0.3 [software/netbox-deploy] (dev) - https://gerrit.wikimedia.org/r/1038694 (https://phabricator.wikimedia.org/T336275)
[07:53:33] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm now, thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/1038251 (https://phabricator.wikimedia.org/T362518) (owner: Clément Goubert)
[07:54:11] <wikibugs>	 (PS1) Giuseppe Lavagetto: Add new chart statsd-exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265)
[07:54:31] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1025.eqiad.wmnet
[07:54:43] <wikibugs>	 (CR) Volans: "question inline" [cookbooks] - https://gerrit.wikimedia.org/r/1038865 (owner: Clément Goubert)
[07:55:09] <wikibugs>	 (CR) CI reject: [V:-1] Add new chart statsd-exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: Giuseppe Lavagetto)
[07:56:31] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1186.eqiad.wmnet with reason: host reimage
[07:57:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mirror1001.wikimedia.org
[07:58:14] <wikibugs>	 (CR) Clément Goubert: sre.k8s.reboot-nodes: Add exclude option (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1038865 (owner: Clément Goubert)
[07:59:26] <wikibugs>	 (PS8) Clément Goubert: sre.k8s.reboot-nodes: Add exclude option [cookbooks] - https://gerrit.wikimedia.org/r/1038865
[07:59:40] <wikibugs>	 (CR) Volans: sre.k8s.reboot-nodes: Add exclude option (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1038865 (owner: Clément Goubert)
[08:00:08] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-codfw
[08:00:43] <logmsgbot>	 !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1025.eqiad.wmnet
[08:01:11] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1026.eqiad.wmnet
[08:02:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P64062 and previous config saved to /var/cache/conftool/dbconfig/20240605-080247-marostegui.json
[08:04:08] <wikibugs>	 (CR) Clément Goubert: sre.k8s.reboot-nodes: Add exclude option (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1038865 (owner: Clément Goubert)
[08:05:59] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove obsolete thanos-query.discovery.wmnet.crt [puppet] - https://gerrit.wikimedia.org/r/1038818 (https://phabricator.wikimedia.org/T360414) (owner: Muehlenhoff)
[08:07:15] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS
[08:07:15] <icinga-wm_>	 6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:07:57] <logmsgbot>	 !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1026.eqiad.wmnet
[08:08:18] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1027.eqiad.wmnet
[08:08:35] <wikibugs>	 (PS1) Clément Goubert: Revert "mw1358: Put back insetup::serviceops" [puppet] - https://gerrit.wikimedia.org/r/1038834
[08:09:13] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:11:35] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] kubernetes: rename and reimage 3 api appservers, 2 appservers [puppet] - https://gerrit.wikimedia.org/r/1038757 (https://phabricator.wikimedia.org/T362323) (owner: Hnowlan)
[08:14:30] <logmsgbot>	 !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1027.eqiad.wmnet
[08:17:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P64063 and previous config saved to /var/cache/conftool/dbconfig/20240605-081755-marostegui.json
[08:18:27] <wikibugs>	 (CR) Btullis: [C:+1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/1038771 (https://phabricator.wikimedia.org/T365503) (owner: Brouberol)
[08:18:50] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-codfw
[08:19:17] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS6
[08:19:17] <icinga-wm_>	 : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:19:38] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1186.eqiad.wmnet with OS bookworm
[08:21:01] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, serviceops: hw troubleshooting: firmware upgrade for mw1358.eqiad.wmnet - https://phabricator.wikimedia.org/T366583#9862589 (Clement_Goubert) Thanks!
[08:21:17] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:22:49] <wikibugs>	 (PS11) DCausse: wdqs.data-reload: various fixes [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: Ryan Kemper)
[08:23:14] <wikibugs>	 (CR) Volans: "This could be a nice addition to the parent class in `sre/__init__.py`. Spicerack has already an `uptime()` method and we could collect al" [cookbooks] - https://gerrit.wikimedia.org/r/1038865 (owner: Clément Goubert)
[08:23:44] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:23:52] <wikibugs>	 (CR) Hnowlan: [C:+1] Revert "mw1358: Put back insetup::serviceops" [puppet] - https://gerrit.wikimedia.org/r/1038834 (owner: Clément Goubert)
[08:24:05] <wikibugs>	 (CR) Clément Goubert: [C:+2] Revert "mw1358: Put back insetup::serviceops" [puppet] - https://gerrit.wikimedia.org/r/1038834 (owner: Clément Goubert)
[08:24:20] <wikibugs>	 (CR) Brouberol: [V:+1 C:+2] analytics_test_cluster_coordinator: upgrade mariadb to version 10.6 [puppet] - https://gerrit.wikimedia.org/r/1038771 (https://phabricator.wikimedia.org/T365503) (owner: Brouberol)
[08:27:23] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:27:25] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:27:42] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw1358 to wikikube-worker1001
[08:27:47] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[08:29:59] <wikibugs>	 (CR) Hashar: [C:+2] plugins: Add wm-schedule-deployment plugin (4 comments) [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512) (owner: BryanDavis)
[08:30:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki2002.codfw.wmnet
[08:30:33] <wikibugs>	 (Merged) jenkins-bot: plugins: Add wm-schedule-deployment plugin [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512) (owner: BryanDavis)
[08:30:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P64064 and previous config saved to /var/cache/conftool/dbconfig/20240605-083041-root.json
[08:30:48] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@b91b3bd]: Use a wildcard TypeScript include for plugins
[08:30:56] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@b91b3bd]: Use a wildcard TypeScript include for plugins (duration: 00m 08s)
[08:31:17] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1028.eqiad.wmnet
[08:31:21] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@7ea913b]: plugins: Add wm-schedule-deployment plugin - T366512
[08:31:29] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@7ea913b]: plugins: Add wm-schedule-deployment plugin - T366512 (duration: 00m 07s)
[08:31:42] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1358 to wikikube-worker1001 - cgoubert@cumin1002"
[08:33:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T364299)', diff saved to https://phabricator.wikimedia.org/P64065 and previous config saved to /var/cache/conftool/dbconfig/20240605-083304-marostegui.json
[08:33:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2172.codfw.wmnet with reason: Maintenance
[08:33:18] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1358 to wikikube-worker1001 - cgoubert@cumin1002"
[08:33:18] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:33:18] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1001
[08:33:20] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2172.codfw.wmnet with reason: Maintenance
[08:33:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T364299)', diff saved to https://phabricator.wikimedia.org/P64066 and previous config saved to /var/cache/conftool/dbconfig/20240605-083328-marostegui.json
[08:33:42] <wikibugs>	 (PS1) Marostegui: Revert "db1186: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1038835
[08:34:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2002.codfw.wmnet
[08:34:23] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:34:25] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:34:30] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1001
[08:34:31] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete stub certs [labs/private] - https://gerrit.wikimedia.org/r/1039173 (https://phabricator.wikimedia.org/T360414)
[08:34:38] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1358 to wikikube-worker1001
[08:35:08] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9862621 (darthmon_wmde) hereby I, as "direct supervisor" of Ricki's, approve for Ricki to get access to analytics-privatedata-users. Since this is cru...
[08:35:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet
[08:37:05] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db1186: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1038835 (owner: Marostegui)
[08:37:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P64067 and previous config saved to /var/cache/conftool/dbconfig/20240605-083733-root.json
[08:37:52] <logmsgbot>	 !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1028.eqiad.wmnet
[08:38:44] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:38:51] <wikibugs>	 (PS1) Marostegui: db1186: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1039176
[08:39:14] <wikibugs>	 (CR) Marostegui: [C:+2] db1186: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1039176 (owner: Marostegui)
[08:39:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet
[08:41:56] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on db1227.eqiad.wmnet with reason: Reimage
[08:41:57] <wikibugs>	 (PS1) Marostegui: db1127: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1039178 (https://phabricator.wikimedia.org/T362745)
[08:42:09] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1227.eqiad.wmnet with reason: Reimage
[08:42:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1227', diff saved to https://phabricator.wikimedia.org/P64068 and previous config saved to /var/cache/conftool/dbconfig/20240605-084211-root.json
[08:42:52] <wikibugs>	 (PS2) Marostegui: db1227: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1039178 (https://phabricator.wikimedia.org/T362745)
[08:43:21] <wikibugs>	 (CR) Marostegui: [C:+2] db1227: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1039178 (https://phabricator.wikimedia.org/T362745) (owner: Marostegui)
[08:43:41] <Dreamy_Jazz>	 jouncebot: nowandnext
[08:43:41] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 16 minute(s)
[08:43:41] <jouncebot>	 In 1 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1000)
[08:43:44] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:43:55] <wikibugs>	 (PS5) Effie Mouzeli: ipoid: ensure all containers have securityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1031105 (https://phabricator.wikimedia.org/T346638) (owner: Scott French)
[08:44:25] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64602/IPv4: Connect - kubernetes-co
[08:44:25] <icinga-wm_>	 602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:44:25] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw
[08:44:25] <icinga-wm_>	 /IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:44:33] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1227.eqiad.wmnet with OS bookworm
[08:44:56] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4044.ulsfo.wmnet
[08:45:17] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4052.ulsfo.wmnet
[08:45:46] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1029.eqiad.wmnet
[08:45:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P64069 and previous config saved to /var/cache/conftool/dbconfig/20240605-084547-root.json
[08:45:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2002.codfw.wmnet
[08:45:59] <claime>	 Dreamy_Jazz: Deployments may fail as I'm rebooting the whole k8s cluster in codfw, which means most of the nodes are cordoned off
[08:46:20] <Dreamy_Jazz>	 Thanks for the heads up.
[08:46:29] <Dreamy_Jazz>	 Any thoughts on when this might be complete?
[08:47:44] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1053.eqiad.wmnet
[08:47:56] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2052.codfw.wmnet
[08:48:27] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:48:27] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:48:32] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] ipoid: ensure all containers have securityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1031105 (https://phabricator.wikimedia.org/T346638) (owner: Scott French)
[08:49:19] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9862669 (cmooney)
[08:49:28] <claime>	 Dreamy_Jazz: I fear it's going to take most of the day, although we may be able to run deployments once we cross a certain threshold of rebooted nodes
[08:49:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2002.codfw.wmnet
[08:50:01] <wikibugs>	 (Merged) jenkins-bot: ipoid: ensure all containers have securityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1031105 (https://phabricator.wikimedia.org/T346638) (owner: Scott French)
[08:50:23] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp4044.ulsfo.wmnet
[08:50:36] <logmsgbot>	 !log fabfur@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cp4044.ulsfo.wmnet
[08:51:03] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp4044.ulsfo.wmnet
[08:51:25] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[08:51:32] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp4052.ulsfo.wmnet
[08:52:10] <logmsgbot>	 !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1029.eqiad.wmnet
[08:52:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P64070 and previous config saved to /var/cache/conftool/dbconfig/20240605-085239-root.json
[08:52:53] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:53:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1004.wikimedia.org
[08:53:36] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/ipoid: apply
[08:54:02] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2052.codfw.wmnet
[08:54:11] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1053.eqiad.wmnet
[08:54:33] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply
[08:55:45] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:57:22] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1054.eqiad.wmnet
[08:57:27] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2053.codfw.wmnet
[08:57:27] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw,
[08:57:27] <icinga-wm_>	 IPv4: Active - kubernetes-ml-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:57:29] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codf
[08:57:29] <icinga-wm_>	 2/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:57:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1004.wikimedia.org
[08:58:06] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1227.eqiad.wmnet with reason: host reimage
[08:58:31] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[08:58:56] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[09:00:22] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4044.ulsfo.wmnet
[09:00:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P64071 and previous config saved to /var/cache/conftool/dbconfig/20240605-090053-root.json
[09:01:05] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4052.ulsfo.wmnet
[09:01:30] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1001.eqiad.wmnet on all recursors
[09:01:33] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1001.eqiad.wmnet on all recursors
[09:02:16] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1227.eqiad.wmnet with reason: host reimage
[09:02:20] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1001.eqiad.wmnet with OS bullseye
[09:02:30] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:03:26] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:03:44] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:04:22] <wikibugs>	 (CR) Kamila Součková: [C:+1] mw-web, mw-api-ext: Raise replicas for 90% traffic [deployment-charts] - https://gerrit.wikimedia.org/r/1038732 (https://phabricator.wikimedia.org/T362323) (owner: Clément Goubert)
[09:06:10] <logmsgbot>	 !log brouberol@cumin2002 START - Cookbook sre.druid.roll-restart-workers for Druid test cluster: Roll restart of Druid jvm daemons.
[09:06:32] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet
[09:06:50] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4044.ulsfo.wmnet
[09:07:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P64072 and previous config saved to /var/cache/conftool/dbconfig/20240605-090745-root.json
[09:09:28] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS6
[09:09:28] <icinga-wm_>	 : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:09:32] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS
[09:09:32] <icinga-wm_>	 4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:11:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:11:27] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1054.eqiad.wmnet
[09:11:28] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:11:32] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:11:48] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1055.eqiad.wmnet
[09:12:38] <wikibugs>	 (PS1) Marostegui: Revert "db1227: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1038836
[09:13:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[09:15:38] <logmsgbot>	 !log brouberol@cumin2002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid test cluster: Roll restart of Druid jvm daemons.
[09:16:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P64073 and previous config saved to /var/cache/conftool/dbconfig/20240605-091559-root.json
[09:17:04] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1001.eqiad.wmnet with reason: host reimage
[09:18:15] <jinxer-wm>	 RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[09:18:33] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1055.eqiad.wmnet
[09:19:32] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw
[09:19:32] <icinga-wm_>	 /IPv6: Connect - kubernetes-ml-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:19:34] <wikibugs>	 (PS3) Hnowlan: kubernetes: rename and reimage 3 api appservers, 2 appservers [puppet] - https://gerrit.wikimedia.org/r/1038757 (https://phabricator.wikimedia.org/T362323)
[09:19:34] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64602/IPv6: Connect - kubernetes-codfw
[09:19:34] <icinga-wm_>	 /IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:19:48] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1056.eqiad.wmnet
[09:19:50] <icinga-wm_>	 PROBLEM - Host ms-be2053 is DOWN: PING CRITICAL - Packet loss = 100%
[09:20:13] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1001.eqiad.wmnet with reason: host reimage
[09:20:18] <icinga-wm_>	 RECOVERY - Host ms-be2053 is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms
[09:21:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:21:52] <wikibugs>	 (CR) Hnowlan: [C:+2] kubernetes: rename and reimage 3 api appservers, 2 appservers [puppet] - https://gerrit.wikimedia.org/r/1038757 (https://phabricator.wikimedia.org/T362323) (owner: Hnowlan)
[09:22:22] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db1227: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1038836 (owner: Marostegui)
[09:22:42] <marostegui>	 hnowlan: good to merge?
[09:22:50] <hnowlan>	 marostegui: yep, please do
[09:22:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P64074 and previous config saved to /var/cache/conftool/dbconfig/20240605-092251-root.json
[09:23:04] <marostegui>	 hnowlan: merging!
[09:23:16] <hnowlan>	 thanks!
[09:23:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P64075 and previous config saved to /var/cache/conftool/dbconfig/20240605-092324-root.json
[09:23:32] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1227.eqiad.wmnet with OS bookworm
[09:23:35] <wikibugs>	 (CR) Hashar: [C:+2] plugins: Add wm-schedule-deployment plugin (1 comment) [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512) (owner: BryanDavis)
[09:24:12] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2053.codfw.wmnet
[09:24:26] <jinxer-wm>	 FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[09:24:32] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:24:36] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:24:50] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2054.codfw.wmnet
[09:24:55] <wikibugs>	 (PS1) Marostegui: db1227: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1039182
[09:25:34] <wikibugs>	 (CR) Marostegui: [C:+2] db1227: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1039182 (owner: Marostegui)
[09:26:02] <wikibugs>	 (Abandoned) Stevemunene: Change datahub service to use dse ingress [puppet] - https://gerrit.wikimedia.org/r/1032399 (https://phabricator.wikimedia.org/T363450) (owner: Stevemunene)
[09:26:03] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1056.eqiad.wmnet
[09:26:08] <wikibugs>	 (PS4) Ayounsi: Fix lots of CI errors [software/netbox-extras] - https://gerrit.wikimedia.org/r/1038869
[09:26:08] <wikibugs>	 (PS16) Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570
[09:26:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:26:35] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1030.eqiad.wmnet
[09:26:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[09:27:09] <wikibugs>	 (CR) CI reject: [V:-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570 (owner: Ayounsi)
[09:29:16] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1400 to wikikube-worker1008.eqiad.wmnet
[09:29:30] <logmsgbot>	 !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from mw1400 to wikikube-worker1008.eqiad.wmnet
[09:30:05] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1400 to wikikube-worker1008
[09:30:11] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[09:31:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P64076 and previous config saved to /var/cache/conftool/dbconfig/20240605-093105-root.json
[09:31:15] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1401 to wikikube-worker1009
[09:31:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:31:27] <logmsgbot>	 !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1030.eqiad.wmnet
[09:31:40] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1410 to wikikube-worker1010.eqiad.wmnet
[09:31:44] <logmsgbot>	 !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from mw1410 to wikikube-worker1010.eqiad.wmnet
[09:31:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[09:32:19] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [labs/private] - https://gerrit.wikimedia.org/r/1039173 (https://phabricator.wikimedia.org/T360414) (owner: Muehlenhoff)
[09:32:36] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, A
[09:32:36] <icinga-wm_>	 v6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:32:40] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, A
[09:32:40] <icinga-wm_>	 v6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:33:07] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1400 to wikikube-worker1008 - hnowlan@cumin1002"
[09:33:19] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1410 to wikikube-worker1010
[09:33:22] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete stub certs [labs/private] - 10https://gerrit.wikimedia.org/r/1039173 (https://phabricator.wikimedia.org/T360414) (owner: 10Muehlenhoff)
[09:34:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2003.wikimedia.org
[09:34:20] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on 6 hosts with reason: Reimage x2 eqiad master T366677
[09:34:24] <stashbot>	 T366677: Reimage x2 eqiad master - https://phabricator.wikimedia.org/T366677
[09:34:26] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[09:34:38] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 6 hosts with reason: Reimage x2 eqiad master T366677
[09:34:38] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:34:38] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:35:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1151 to temp x2 eqiad master T366677', diff saved to https://phabricator.wikimedia.org/P64077 and previous config saved to /var/cache/conftool/dbconfig/20240605-093507-root.json
[09:35:16] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[09:35:22] <wikibugs>	 (03PS5) 10Ayounsi: Fix lots of CI errors [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1038869
[09:35:22] <wikibugs>	 (03PS17) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570
[09:35:35] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote es1037 to es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1038792 (https://phabricator.wikimedia.org/T366678)
[09:35:48] <wikibugs>	 (03PS1) 10Marostegui: db1152: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1039183 (https://phabricator.wikimedia.org/T366677)
[09:36:14] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1152: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1039183 (https://phabricator.wikimedia.org/T366677) (owner: 10Marostegui)
[09:36:16] <hnowlan>	 claime: sre.dns.netbox is asking me to set wikikube-worker1001 to failed - is that okay?
[09:36:23] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi)
[09:36:29] <claime>	 hnowlan: huh what
[09:36:31] <claime>	 it's not failed
[09:36:41] <hnowlan>	 wait no
[09:36:41] <hnowlan>	 sorry
[09:36:51] <hnowlan>	 it's setting `profile::netbox::host::status: active`
[09:36:57] <claime>	 ah yeah that's fine
[09:37:08] <claime>	 I'd run the cookbook after changing the status, weird
[09:37:45] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1152.eqiad.wmnet with OS bookworm
[09:37:56] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete LDAP stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1039184
[09:37:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P64078 and previous config saved to /var/cache/conftool/dbconfig/20240605-093757-root.json
[09:38:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P64079 and previous config saved to /var/cache/conftool/dbconfig/20240605-093830-root.json
[09:38:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2003.wikimedia.org
[09:38:40] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1400 to wikikube-worker1008 - hnowlan@cumin1002"
[09:38:40] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:38:40] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1008
[09:38:46] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan)
[09:38:52] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2054.codfw.wmnet
[09:39:40] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:39:42] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:39:45] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1401 to wikikube-worker1009 - hnowlan@cumin1002"
[09:40:06] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1008
[09:40:14] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1400 to wikikube-worker1008
[09:41:00] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[09:41:15] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1401 to wikikube-worker1009 - hnowlan@cumin1002"
[09:41:15] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:41:16] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1009
[09:41:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch maps/eqiad to PKI as well [puppet] - 10https://gerrit.wikimedia.org/r/1038815 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff)
[09:42:15] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Add CNAMEs for es6 and es7 [dns] - 10https://gerrit.wikimedia.org/r/1039185 (https://phabricator.wikimedia.org/T365098)
[09:42:26] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:42:26] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1010
[09:43:16] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1009
[09:43:24] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1401 to wikikube-worker1009
[09:43:34] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1001.eqiad.wmnet with OS bullseye
[09:43:40] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:43:44] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:44:05] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1428 to wikikube-worker1011
[09:44:08] <claime>	 !log homer 'cr*eqiad*' commit 'T351074'
[09:44:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:10] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[09:44:22] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[09:44:45] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1010
[09:44:51] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1456 to wikikube-worker1012
[09:44:53] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1410 to wikikube-worker1010
[09:45:23] <logmsgbot>	 !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=99) from mw1456 to wikikube-worker1012
[09:45:38] <logmsgbot>	 !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[09:45:47] <wikibugs>	 (03PS12) 10DCausse: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper)
[09:46:01] <logmsgbot>	 !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=99) from mw1428 to wikikube-worker1011
[09:46:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P64080 and previous config saved to /var/cache/conftool/dbconfig/20240605-094611-root.json
[09:46:42] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1428 to wikikube-worker1011
[09:46:46] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS6
[09:46:46] <icinga-wm_>	 : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:46:47] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[09:47:12] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] wmnet: Add CNAMEs for es6 and es7 [dns] - 10https://gerrit.wikimedia.org/r/1039185 (https://phabricator.wikimedia.org/T365098) (owner: 10Marostegui)
[09:47:41] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9862852 (10akosiaris)
[09:47:44] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS
[09:47:44] <icinga-wm_>	 4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:48:51] <wikibugs>	 (03PS1) 10Muehlenhoff: profile::maps::tlsproxy: Unconditionally use PKI [puppet] - 10https://gerrit.wikimedia.org/r/1039188 (https://phabricator.wikimedia.org/T360778)
[09:49:13] <wikibugs>	 (03Abandoned) 10Ladsgroup: mariadb: Promote es1037 to es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1038792 (https://phabricator.wikimedia.org/T366678) (owner: 10Gerrit maintenance bot)
[09:49:15] <wikibugs>	 (03PS10) 10Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan)
[09:49:30] <wikibugs>	 (03Abandoned) 10Ladsgroup: mariadb: Promote db2114 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/941917 (https://phabricator.wikimedia.org/T342947) (owner: 10Gerrit maintenance bot)
[09:49:36] <wikibugs>	 (03Abandoned) 10Ladsgroup: mariadb: Promote db1183 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/942787 (https://phabricator.wikimedia.org/T343078) (owner: 10Gerrit maintenance bot)
[09:49:39] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1428 to wikikube-worker1011 - hnowlan@cumin1002"
[09:49:42] <wikibugs>	 (03Abandoned) 10Ladsgroup: mariadb: Promote db1136 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/959966 (https://phabricator.wikimedia.org/T347140) (owner: 10Gerrit maintenance bot)
[09:49:44] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:49:46] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:49:47] <wikibugs>	 (03Abandoned) 10Ladsgroup: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/997489 (https://phabricator.wikimedia.org/T356650) (owner: 10Gerrit maintenance bot)
[09:49:57] <wikibugs>	 (03Abandoned) 10Ladsgroup: mariadb: Promote db2179 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/997488 (https://phabricator.wikimedia.org/T356650) (owner: 10Gerrit maintenance bot)
[09:50:14] <wikibugs>	 (03Abandoned) 10Ladsgroup: mariadb: Promote db2204 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1016377 (https://phabricator.wikimedia.org/T361780) (owner: 10Gerrit maintenance bot)
[09:50:45] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] wmnet: Add CNAMEs for es6 and es7 [dns] - 10https://gerrit.wikimedia.org/r/1039185 (https://phabricator.wikimedia.org/T365098) (owner: 10Marostegui)
[09:50:54] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1428 to wikikube-worker1011 - hnowlan@cumin1002"
[09:50:54] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:50:54] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1011
[09:51:18] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1456 to wikikube-worker1012
[09:51:24] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[09:51:32] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1031.eqiad.wmnet
[09:51:33] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1152.eqiad.wmnet with reason: host reimage
[09:51:44] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove tendril stub cert [labs/private] - 10https://gerrit.wikimedia.org/r/1039189
[09:51:48] <wikibugs>	 (03PS13) 10DCausse: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper)
[09:52:06] <wikibugs>	 (03Abandoned) 10Ladsgroup: wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/1025917 (https://phabricator.wikimedia.org/T364067) (owner: 10Gerrit maintenance bot)
[09:52:16] <wikibugs>	 (03Abandoned) 10Ladsgroup: mariadb: Promote db1192 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1028939 (https://phabricator.wikimedia.org/T364541) (owner: 10Gerrit maintenance bot)
[09:52:21] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1011
[09:52:27] <wikibugs>	 (03Abandoned) 10Ladsgroup: mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1024756 (https://phabricator.wikimedia.org/T363689) (owner: 10Gerrit maintenance bot)
[09:52:29] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1428 to wikikube-worker1011
[09:52:35] <wikibugs>	 (03Abandoned) 10Ladsgroup: mariadb: Promote db2204 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1036601 (https://phabricator.wikimedia.org/T366241) (owner: 10Gerrit maintenance bot)
[09:52:58] <wikibugs>	 (03Abandoned) 10Ladsgroup: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1024757 (https://phabricator.wikimedia.org/T363689) (owner: 10Gerrit maintenance bot)
[09:53:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P64081 and previous config saved to /var/cache/conftool/dbconfig/20240605-095303-root.json
[09:53:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P64082 and previous config saved to /var/cache/conftool/dbconfig/20240605-095336-root.json
[09:53:39] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1456 to wikikube-worker1012 - hnowlan@cumin1002"
[09:53:44] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:53:46] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:54:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.netbox.restart-reboot rolling reboot on A:netbox
[09:54:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. on all recursors
[09:54:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. on all recursors
[09:54:47] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1152.eqiad.wmnet with reason: host reimage
[09:54:53] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1456 to wikikube-worker1012 - hnowlan@cumin1002"
[09:54:53] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:54:53] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1012
[09:55:02] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1008.eqiad.wmnet wikikube-worker1009.eqiad.wmnet wikikube-worker1010.eqiad.wmnet wikikube-worker1011.eqiad.wmnet wikikube-worker1012.eqiad.wmnet on all recursors
[09:55:06] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1008.eqiad.wmnet wikikube-worker1009.eqiad.wmnet wikikube-worker1010.eqiad.wmnet wikikube-worker1011.eqiad.wmnet wikikube-worker1012.eqiad.wmnet on all recursors
[09:55:28] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1008.eqiad.wmnet with OS bullseye
[09:55:42] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1009.eqiad.wmnet with OS bullseye
[09:55:46] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:55:48] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:55:56] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1010.eqiad.wmnet with OS bullseye
[09:56:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:56:41] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1152: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1038838
[09:56:50] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[09:57:06] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1012
[09:57:14] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1456 to wikikube-worker1012
[09:58:29] <wikibugs>	 06SRE, 10SRE-tools, 06Infrastructure-Foundations: SREBatchBase: Don't require passing an alias if only one alias is possible - https://phabricator.wikimedia.org/T366680 (10MoritzMuehlenhoff) 03NEW
[09:58:36] <wikibugs>	 06SRE, 10SRE-tools, 06Infrastructure-Foundations: SREBatchBase: Don't require passing an alias if only one alias is possible - https://phabricator.wikimedia.org/T366680#9862914 (10MoritzMuehlenhoff) p:05Triage→03Medium
[09:58:41] <wikibugs>	 (03PS11) 10Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan)
[09:58:51] <claime>	 !log pooling and uncordoning wikikube-worker1001 - T351074
[09:58:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:54] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[09:59:01] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-worker1001.eqiad.wmnet,cluster=kubernetes,service=kubesvc
[09:59:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netbox.discovery.wmnet. on all recursors
[09:59:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox.discovery.wmnet. on all recursors
[09:59:29] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote es1039 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1038793 (https://phabricator.wikimedia.org/T366682)
[09:59:33] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: wmnet: Update es7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1038794 (https://phabricator.wikimedia.org/T366682)
[09:59:48] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039188 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff)
[10:00:05] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2055.codfw.wmnet
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1000)
[10:00:08] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1011.eqiad.wmnet with OS bullseye
[10:00:09] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1057.eqiad.wmnet
[10:00:15] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wmnet: Update es7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1038794 (https://phabricator.wikimedia.org/T366682) (owner: 10Gerrit maintenance bot)
[10:00:16] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1012.eqiad.wmnet with OS bullseye
[10:00:20] <fabfur>	 !log disabling puppet on cp4037 to test Benthos performance (T358109)
[10:00:27] <wikibugs>	 (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan)
[10:00:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:31] <stashbot>	 T358109: Install new Benthos instance on cp hosts - https://phabricator.wikimedia.org/T358109
[10:00:45] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-redis-exporter@6380.service on netbox2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:01:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P64083 and previous config saved to /var/cache/conftool/dbconfig/20240605-100117-root.json
[10:01:48] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, A
[10:01:48] <icinga-wm_>	 v4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:01:50] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS
[10:01:50] <icinga-wm_>	 4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:03:21] <wikibugs>	 (03CR) 10Klausman: base functions: make sleep() output a bit friendlier (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759 (owner: 10Klausman)
[10:03:44] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:03:48] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:03:49] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:04-1] "I 'd follow the same approach for puppetserver AND puppetmaster manifests. In this patch I am commenting on, the approach differs." [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan)
[10:03:50] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:06:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:06:33] <wikibugs>	 (03Abandoned) 10Ladsgroup: wmnet: Update es7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1038794 (https://phabricator.wikimedia.org/T366682) (owner: 10Gerrit maintenance bot)
[10:06:52] <wikibugs>	 (03Abandoned) 10Ladsgroup: mariadb: Promote es1039 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1038793 (https://phabricator.wikimedia.org/T366682) (owner: 10Gerrit maintenance bot)
[10:07:02] <icinga-wm_>	 PROBLEM - Host mw1401 is DOWN: PING CRITICAL - Packet loss = 100%
[10:07:52] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:07:52] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:07:58] <icinga-wm_>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:08:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P64084 and previous config saved to /var/cache/conftool/dbconfig/20240605-100810-root.json
[10:08:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P64085 and previous config saved to /var/cache/conftool/dbconfig/20240605-100842-root.json
[10:08:48] <icinga-wm_>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:09:08] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1008.eqiad.wmnet with reason: host reimage
[10:09:43] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1010.eqiad.wmnet with reason: host reimage
[10:10:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1152 back to x2 eqiad master T366677', diff saved to https://phabricator.wikimedia.org/P64086 and previous config saved to /var/cache/conftool/dbconfig/20240605-101019-root.json
[10:10:23] <stashbot>	 T366677: Reimage x2 eqiad master - https://phabricator.wikimedia.org/T366677
[10:11:45] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1008.eqiad.wmnet with reason: host reimage
[10:11:47] <icinga-wm_>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:11:56] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db1152: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1038838 (owner: 10Marostegui)
[10:11:57] <icinga-wm_>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:12:05] <icinga-wm_>	 RECOVERY - Host mw1401 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[10:13:17] <wikibugs>	 (03PS12) 10Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan)
[10:13:34] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1012.eqiad.wmnet with reason: host reimage
[10:13:36] <logmsgbot>	 !log dcaro@cumin1002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cloudcephosd1031.eqiad.wmnet
[10:13:45] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:13:59] <wikibugs>	 (03CR) 10Effie Mouzeli: "done" [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan)
[10:14:02] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove kartotherian.discovery.wmnet.crt cergen cert [puppet] - 10https://gerrit.wikimedia.org/r/1039190 (https://phabricator.wikimedia.org/T360778)
[10:14:32] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] Remove kartotherian.discovery.wmnet.crt cergen cert [puppet] - 10https://gerrit.wikimedia.org/r/1039190 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff)
[10:14:33] <wikibugs>	 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure Security, and 3 others: June 2024 Bullseye database backups reboots - https://phabricator.wikimedia.org/T366684 (10Marostegui) 03NEW
[10:15:00] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2203.codfw.wmnet with reason: Maintenance
[10:15:02] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove kartotherian stub cert [labs/private] - 10https://gerrit.wikimedia.org/r/1039191 (https://phabricator.wikimedia.org/T360778)
[10:15:06] <wikibugs>	 (03CR) 10Klausman: base functions: make sleep() output a bit friendlier (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759 (owner: 10Klausman)
[10:15:09] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1010.eqiad.wmnet with reason: host reimage
[10:15:13] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2203.codfw.wmnet with reason: Maintenance
[10:15:21] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2203 (T352010)', diff saved to https://phabricator.wikimedia.org/P64087 and previous config saved to /var/cache/conftool/dbconfig/20240605-101521-ladsgroup.json
[10:15:24] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[10:15:39] <wikibugs>	 (03PS4) 10Klausman: base functions: make sleep() output a bit friendlier [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759
[10:16:11] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Prepare stat100[4-7] for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1038329 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis)
[10:16:18] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1152.eqiad.wmnet with OS bookworm
[10:16:43] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:04-1] "Functionally ok, 2 nitpicks and LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan)
[10:17:45] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P64088 and previous config saved to /var/cache/conftool/dbconfig/20240605-101744-ladsgroup.json
[10:18:48] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1012.eqiad.wmnet with reason: host reimage
[10:19:32] <wikibugs>	 (03CR) 10CI reject: [V:04-1] base functions: make sleep() output a bit friendlier [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759 (owner: 10Klausman)
[10:20:49] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: prometheus-redis-exporter@6380.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:21:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:21:46] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1057.eqiad.wmnet
[10:21:52] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2055.codfw.wmnet
[10:21:53] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:21:53] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:22:33] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2056.codfw.wmnet
[10:22:38] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1058.eqiad.wmnet
[10:22:43] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2207.codfw.wmnet with reason: Maintenance
[10:22:45] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2207.codfw.wmnet with reason: Maintenance
[10:22:53] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2207 (T352010)', diff saved to https://phabricator.wikimedia.org/P64090 and previous config saved to /var/cache/conftool/dbconfig/20240605-102252-ladsgroup.json
[10:22:56] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[10:23:00] <wikibugs>	 (03CR) 10Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan)
[10:23:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P64091 and previous config saved to /var/cache/conftool/dbconfig/20240605-102348-root.json
[10:23:54] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9863049 (10cmooney) @Jclark-ctr @VRiley-WMF unfortunately these switch upgrades require us to shift some cables around before/after the upgrade to avoid disrupting services....
[10:24:55] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, A
[10:24:55] <icinga-wm_>	 v6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:24:55] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, A
[10:24:55] <icinga-wm_>	 v4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:26:47] <wikibugs>	 (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan)
[10:26:55] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:26:55] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:27:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.netbox.restart-reboot (exit_code=0) rolling reboot on A:netbox
[10:28:34] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9863056 (10VRiley-WMF) @cmooney as it turns out, I will be out until June 10th.
[10:29:18] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039190 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff)
[10:29:23] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265)
[10:30:16] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto)
[10:30:36] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1058.eqiad.wmnet
[10:30:43] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1008.eqiad.wmnet with OS bullseye
[10:31:04] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2056.codfw.wmnet
[10:31:56] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:31:58] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:32:04] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1162 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1038795 (https://phabricator.wikimedia.org/T366687)
[10:32:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove kartotherian.discovery.wmnet.crt cergen cert [puppet] - 10https://gerrit.wikimedia.org/r/1039190 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff)
[10:32:08] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: wmnet: Update s2-master alias [dns] - 10https://gerrit.wikimedia.org/r/1038796 (https://phabricator.wikimedia.org/T366687)
[10:32:11] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2057.codfw.wmnet
[10:32:14] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1059.eqiad.wmnet
[10:32:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P64093 and previous config saved to /var/cache/conftool/dbconfig/20240605-103251-ladsgroup.json
[10:33:06] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+1] sextant cache: Allow defining mcrouter's clusterIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038858 (owner: 10Alexandros Kosiaris)
[10:33:07] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove kartotherian stub cert [labs/private] - 10https://gerrit.wikimedia.org/r/1039191 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff)
[10:34:02] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1010.eqiad.wmnet with OS bullseye
[10:34:58] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:34:58] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:35:45] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-codfw
[10:36:52] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265)
[10:37:15] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1012.eqiad.wmnet with OS bullseye
[10:37:26] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1003.eqiad.wmnet
[10:37:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto)
[10:38:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P64094 and previous config saved to /var/cache/conftool/dbconfig/20240605-103854-root.json
[10:39:00] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS6
[10:39:00] <icinga-wm_>	 : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:39:00] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS6
[10:39:00] <icinga-wm_>	 : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:39:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:39:47] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265)
[10:39:51] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1003.eqiad.wmnet
[10:40:32] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto)
[10:40:49] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:40:51] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2057.codfw.wmnet
[10:41:06] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:41:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9863128 (10cmooney) >>! In T366361#9863056, @VRiley-WMF wrote: > @cmooney as it turns out, I will be out until June 10th.  No probs, enjoy the time off.  I'll see if maybe J...
[10:41:58] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:42:17] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2058.codfw.wmnet
[10:44:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:46:22] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1059.eqiad.wmnet
[10:46:49] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1060.eqiad.wmnet
[10:47:57] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P64096 and previous config saved to /var/cache/conftool/dbconfig/20240605-104757-ladsgroup.json
[10:49:25] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:50:06] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2058.codfw.wmnet
[10:50:28] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2059.codfw.wmnet
[10:51:00] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, A
[10:51:00] <icinga-wm_>	 v4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:51:02] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS
[10:51:02] <icinga-wm_>	 6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:52:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-codfw
[10:52:12] <logmsgbot>	 !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1009.eqiad.wmnet with OS bullseye
[10:53:00] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:53:00] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:53:04] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1009.eqiad.wmnet with OS bullseye
[10:53:38] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9863185 (10MoritzMuehlenhoff)
[10:53:41] <logmsgbot>	 !log hnowlan@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1011.eqiad.wmnet with OS bullseye
[10:53:54] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1011.eqiad.wmnet with OS bullseye
[10:54:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P64097 and previous config saved to /var/cache/conftool/dbconfig/20240605-105400-root.json
[10:54:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:54:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:54:43] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1060.eqiad.wmnet
[10:55:02] <icinga-wm_>	 PROBLEM - Host mw1401 is DOWN: PING CRITICAL - Packet loss = 100%
[10:55:29] <wikibugs>	 (03CR) 10Elukey: "Should we wait for the new docker image with the heavy-rev-id logic?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274) (owner: 10Ilias Sarantopoulos)
[10:56:29] <wikibugs>	 06SRE, 10Maps, 06serviceops, 13Patch-For-Review: Move maps/karthoterian to PKI/cfssl - https://phabricator.wikimedia.org/T360778#9863179 (10MoritzMuehlenhoff) 05Open→03Resolved a:05jijiki→03MoritzMuehlenhoff maps is now using cfssl.
[10:57:30] <icinga-wm_>	 RECOVERY - Host mw1401 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[10:57:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:59:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:59:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-codfw
[11:00:04] <jouncebot>	 mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1100).
[11:03:01] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265)
[11:03:05] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1184 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P64098 and previous config saved to /var/cache/conftool/dbconfig/20240605-110303-ladsgroup.json
[11:03:06] <claime>	 !log restarted send_tile_invalidations.service on maps1009
[11:03:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:52] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1031.eqiad.wmnet with OS bullseye
[11:04:03] <icinga-wm_>	 PROBLEM - Host mw1401 is DOWN: PING CRITICAL - Packet loss = 100%
[11:04:25] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:04:39] <icinga-wm_>	 RECOVERY - Host mw1401 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[11:04:54] <wikibugs>	 (03PS1) 10Dreamy Jazz: Follow-up: Don't run interact with block buttons if they don't exist [extensions/CheckUser] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038839 (https://phabricator.wikimedia.org/T329493)
[11:05:05] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS
[11:05:05] <icinga-wm_>	 4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:06:05] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS
[11:06:05] <icinga-wm_>	 6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:06:15] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2059.codfw.wmnet
[11:06:37] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1009.eqiad.wmnet with reason: host reimage
[11:06:50] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1011.eqiad.wmnet with reason: host reimage
[11:07:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:08:09] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:09:05] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:09:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:09:42] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1009.eqiad.wmnet with reason: host reimage
[11:09:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:11:17] <wikibugs>	 06SRE, 10SRE-tools, 06Infrastructure-Foundations: SREBatchBase: Don't require passing an alias if only one alias is possible - https://phabricator.wikimedia.org/T366680#9863237 (10Volans) I'd like to know if there is a wider agreement on this before implementing it. It seems reasonable to me but it will affe...
[11:12:41] <wikibugs>	 06SRE, 10SRE-tools, 06Infrastructure-Foundations: SREBatchBase: Don't require passing an alias if only one alias is possible - https://phabricator.wikimedia.org/T366680#9863251 (10MoritzMuehlenhoff) Sure thing, but there's also no real impact, anyone who continues to pass the --alias for these kind of cookbo...
[11:12:51] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1011.eqiad.wmnet with reason: host reimage
[11:15:08] <wikibugs>	 (03PS1) 10Urbanecm: [beta] Create frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039194 (https://phabricator.wikimedia.org/T366691)
[11:16:40] <wikibugs>	 06SRE, 10SRE-tools, 06Infrastructure-Foundations: SREBatchBase: Don't require passing an alias if only one alias is possible - https://phabricator.wikimedia.org/T366680#9863260 (10Volans) To add a dumb change that makes aliases and query optional and then checks for them later is easy. But at this point it w...
[11:17:54] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1031.eqiad.wmnet with reason: host reimage
[11:18:07] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS
[11:18:07] <icinga-wm_>	 4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:18:09] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS
[11:18:09] <icinga-wm_>	 4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:19:40] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:19:49] <wikibugs>	 (CR) Urbanecm: [C:+2] [beta] Create frwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1039194 (https://phabricator.wikimedia.org/T366691) (owner: Urbanecm)
[11:20:05] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:20:09] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:20:26] <wikibugs>	 (Merged) jenkins-bot: [beta] Create frwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1039194 (https://phabricator.wikimedia.org/T366691) (owner: Urbanecm)
[11:21:09] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1031.eqiad.wmnet with reason: host reimage
[11:23:05] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Create a mailing list for Bangla Wikimoitree - https://phabricator.wikimedia.org/T365915#9863270 (Ladsgroup) >>! In T365915#9862244, @Bodhisattwa wrote: > Seeing the ESEAP mailing list, I think, it would be OK, if we get the name as wikimoitree@lists.wikimedia.org   It is no...
[11:23:07] <icinga-wm_>	 PROBLEM - Host mw1401 is DOWN: PING CRITICAL - Packet loss = 100%
[11:24:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:24:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:25:37] <icinga-wm_>	 RECOVERY - Host mw1401 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[11:25:40] <wikibugs>	 (CR) Clément Goubert: Add new chart statsd-exporter (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: Giuseppe Lavagetto)
[11:27:10] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1009.eqiad.wmnet with OS bullseye
[11:29:39] <wikibugs>	 (PS6) Ayounsi: Fix lots of CI errors [software/netbox-extras] - https://gerrit.wikimedia.org/r/1038869
[11:29:39] <wikibugs>	 (PS18) Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570
[11:30:04] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1061.eqiad.wmnet
[11:30:11] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2060.codfw.wmnet
[11:31:37] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1011.eqiad.wmnet with OS bullseye
[11:31:48] <hnowlan>	 !log running homer to configure bgp on 5 new k8s workers
[11:31:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:09] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:32:15] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:32:19] <icinga-wm_>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:34:09] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:34:15] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:34:21] <icinga-wm_>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:34:40] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:36:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-eqiad
[11:37:27] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1031.eqiad.wmnet with OS bullseye
[11:38:03] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2060.codfw.wmnet
[11:38:25] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1061.eqiad.wmnet
[11:38:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox-dev2002.codfw.wmnet
[11:38:44] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:39:13] <logmsgbot>	 !log hnowlan@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker1008.eqiad.wmnet|wikikube-worker1009.eqiad.wmnet|wikikube-worker1010.eqiad.wmnet|wikikube-worker1011.eqiad.wmnet|wikikube-worker1012.eqiad.wmnet),cluster=kubernetes,service=kubesvc
[11:39:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2002.codfw.wmnet
[11:39:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:40:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:41:06] <wikibugs>	 (CR) Urbanecm: [C:-1] "function-wise, lgtm, but i think we either want to remove both of the notes, or none of them, and this patch removes just one. -1 for visi" [mediawiki-config] - https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) (owner: Sergio Gimeno)
[11:41:25] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2061.codfw.wmnet
[11:41:28] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1062.eqiad.wmnet
[11:41:41] <wikibugs>	 (PS2) Urbanecm: Growth: Use `growthexperiments` DB list for enabling GrowthExperiments [mediawiki-config] - https://gerrit.wikimedia.org/r/1038882 (https://phabricator.wikimedia.org/T364892)
[11:41:48] <wikibugs>	 (PS6) Sergio Gimeno: [Beta] Enable CommunityConfiguration extension in all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892)
[11:43:20] <wikibugs>	 (PS1) Hnowlan: mw-web, mw-api-ext: Raise replicas for 95% traffic [deployment-charts] - https://gerrit.wikimedia.org/r/1039196 (https://phabricator.wikimedia.org/T362323)
[11:44:05] <wikibugs>	 (PS1) Effie Mouzeli: mc.php: if $_SERVER['MCROUTER_SERVER'] is set, resolve it [mediawiki-config] - https://gerrit.wikimedia.org/r/1039197 (https://phabricator.wikimedia.org/T363186)
[11:44:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-eqiad
[11:44:41] <wikibugs>	 (CR) CI reject: [V:-1] mc.php: if $_SERVER['MCROUTER_SERVER'] is set, resolve it [mediawiki-config] - https://gerrit.wikimedia.org/r/1039197 (https://phabricator.wikimedia.org/T363186) (owner: Effie Mouzeli)
[11:45:04] <wikibugs>	 (PS2) Effie Mouzeli: mc.php: if $MCROUTER_SERVER is set, resolve it [mediawiki-config] - https://gerrit.wikimedia.org/r/1039197 (https://phabricator.wikimedia.org/T363186)
[11:45:41] <wikibugs>	 (CR) CI reject: [V:-1] mc.php: if $MCROUTER_SERVER is set, resolve it [mediawiki-config] - https://gerrit.wikimedia.org/r/1039197 (https://phabricator.wikimedia.org/T363186) (owner: Effie Mouzeli)
[11:45:45] <wikibugs>	 (CR) Ilias Sarantopoulos: "yes! I created this before the new patch, so I'll w8 to update the image as well." [deployment-charts] - https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274) (owner: Ilias Sarantopoulos)
[11:46:11] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_
[11:46:11] <icinga-wm_>	 g%23BGP_status
[11:46:23] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_
[11:46:23] <icinga-wm_>	 g%23BGP_status
[11:47:21] <icinga-wm_>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:48:21] <icinga-wm_>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:48:31] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2061.codfw.wmnet
[11:49:06] <wikibugs>	 SRE, Infrastructure-Foundations: Move the ping* servers to Bookworm - https://phabricator.wikimedia.org/T366695 (MoritzMuehlenhoff) NEW
[11:49:13] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:49:15] <wikibugs>	 SRE, Infrastructure-Foundations: Move the ping* servers to Bookworm - https://phabricator.wikimedia.org/T366695#9863378 (MoritzMuehlenhoff) p:Triage→Medium
[11:49:21] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:49:40] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:49:42] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2062.codfw.wmnet
[11:50:25] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1062.eqiad.wmnet
[11:52:04] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1063.eqiad.wmnet
[11:52:48] <wikibugs>	 (PS1) Muehlenhoff: Add new ping servers to site.pp [puppet] - https://gerrit.wikimedia.org/r/1039199 (https://phabricator.wikimedia.org/T366695)
[11:53:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:54:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:55:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:56:49] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add new ping servers to site.pp [puppet] - https://gerrit.wikimedia.org/r/1039199 (https://phabricator.wikimedia.org/T366695) (owner: Muehlenhoff)
[11:57:32] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2062.codfw.wmnet
[11:58:02] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2063.codfw.wmnet
[11:58:05] <wikibugs>	 (PS5) Klausman: base functions: make sleep() output a bit friendlier [cookbooks] - https://gerrit.wikimedia.org/r/1038759
[11:58:17] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, A
[11:58:17] <icinga-wm_>	 v6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:58:25] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, A
[11:58:25] <icinga-wm_>	 v6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:00:15] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:00:20] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1063.eqiad.wmnet
[12:00:25] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:00:31] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1064.eqiad.wmnet
[12:00:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:01:03] <wikibugs>	 (PS1) Muehlenhoff: Switch statistics::explorer to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1039200 (https://phabricator.wikimedia.org/T349619)
[12:03:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6003.drmrs.wmnet
[12:03:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:04:40] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:04:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report
[12:04:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0)
[12:05:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ping2004.codfw.wmnet
[12:05:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:05:48] <jinxer-wm>	 FIRING: PuppetDisabled: Puppet disabled on mc2049:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=memcached&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[12:05:56] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch statistics::explorer to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1039200 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[12:08:28] <wikibugs>	 (CR) Kosta Harlan: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan)
[12:08:32] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1064.eqiad.wmnet
[12:28:55] <claime>	 Ok I'll stop the cookbook and uncordon for the backport window and restart afterwards
[12:29:05] <Dreamy_Jazz>	 Okay. Thanks.
[12:29:18] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2065.codfw.wmnet
[12:29:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:29:35] <claime>	 It's a little messy as I need to uncordon the nodes manually
[12:29:40] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1066.eqiad.wmnet
[12:29:45] <claime>	 I'll stop it once that batch of 5 is done
[12:31:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T364299)', diff saved to https://phabricator.wikimedia.org/P64099 and previous config saved to /var/cache/conftool/dbconfig/20240605-123059-marostegui.json
[12:31:05] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[12:31:37] <Dreamy_Jazz>	 There is also https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1630 which has been scheduled and I think also needs use of `scap backport`
[12:31:58] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1067.eqiad.wmnet
[12:32:03] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet
[12:32:48] <claime>	 I hope I'll be done by 1630 UTC
[12:33:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ping2004.codfw.wmnet with reason: host reimage
[12:33:05] <Dreamy_Jazz>	 👍
[12:33:17] <claime>	 (even with stopping for the backport window)
[12:33:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6004.drmrs.wmnet
[12:33:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6004.drmrs.wmnet
[12:33:47] <claime>	 If I'm not, I'll do the same and stop it, then restart it later. I don't like letting cookbooks like this one run after the end of my day anyways
[12:33:57] <wikibugs>	 (CR) Fabfur: [C:+2] hiera: enable IPIP for high-traffic1@magru for text services [puppet] - https://gerrit.wikimedia.org/r/1038698 (https://phabricator.wikimedia.org/T366466) (owner: Fabfur)
[12:34:20] <claime>	 (What I really need to do is fix the rollback for that cookbook to uncordon the nodes so I don't have to do it manually)
[12:34:25] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:34:40] <wikibugs>	 (PS1) Muehlenhoff: Configure memcached on idp-test hosts to run as 'memcache' [puppet] - https://gerrit.wikimedia.org/r/1039206 (https://phabricator.wikimedia.org/T273950)
[12:35:33] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS6
[12:35:33] <icinga-wm_>	 : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:35:53] <fabfur>	 !log disabling puppet on A:cp-text to test IPIP encapsulation on magru (T366466)
[12:35:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:56] <stashbot>	 T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466
[12:36:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ping2004.codfw.wmnet with reason: host reimage
[12:36:30] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS
[12:36:30] <icinga-wm_>	 6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:37:30] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:37:34] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:38:28] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet
[12:38:43] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2067.codfw.wmnet
[12:39:26] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1067.eqiad.wmnet
[12:39:38] <wikibugs>	 (CR) Fabfur: [C:+2] cache:hiera: enable IPIP on text@magru [puppet] - https://gerrit.wikimedia.org/r/1038744 (https://phabricator.wikimedia.org/T366466) (owner: Fabfur)
[12:39:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:39:48] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1068.eqiad.wmnet
[12:40:45] <logmsgbot>	 !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:wikikube-worker-codfw
[12:40:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:42:53] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1039206 (https://phabricator.wikimedia.org/T273950) (owner: Muehlenhoff)
[12:43:31] <moritzm>	 !log failover ganeti masters in drmrs
[12:43:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:55] <wikibugs>	 (PS14) DCausse: wdqs.data-reload: various fixes [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: Ryan Kemper)
[12:44:07] <wikibugs>	 (CR) Volans: base functions: make sleep() output a bit friendlier (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1038759 (owner: Klausman)
[12:45:01] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2067.codfw.wmnet
[12:45:17] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2068.codfw.wmnet
[12:45:24] <claime>	 Dreamy_Jazz: ok, cookbook stopped and nodes uncordoned, you should be g2g
[12:45:39] <Dreamy_Jazz>	 Thanks.
[12:45:50] <icinga-wm_>	 PROBLEM - ganeti-wconfd running on ganeti6002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[12:45:53] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1068.eqiad.wmnet
[12:45:56] <Dreamy_Jazz>	 I will start my patches earlier so that you can start up again quicker.
[12:45:56] <wikibugs>	 (CR) Clément Goubert: [C:+1] sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - https://gerrit.wikimedia.org/r/1038782 (owner: Elukey)
[12:46:02] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1069.eqiad.wmnet
[12:46:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P64100 and previous config saved to /var/cache/conftool/dbconfig/20240605-124607-marostegui.json
[12:46:45] <wikibugs>	 (CR) Elukey: [C:+2] sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - https://gerrit.wikimedia.org/r/1038782 (owner: Elukey)
[12:46:54] <claime>	 Dreamy_Jazz: <3
[12:47:11] <claime>	 ping me if you run into any issues, it should be ok though
[12:47:16] <icinga-wm_>	 PROBLEM - ganeti-wconfd running on ganeti6001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[12:48:10] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1002 using scap backport" [extensions/CheckUser] (wmf/1.43.0-wmf.8) - https://gerrit.wikimedia.org/r/1038839 (https://phabricator.wikimedia.org/T329493) (owner: Dreamy Jazz)
[12:48:11] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) (owner: Dreamy Jazz)
[12:48:28] <wikibugs>	 (PS9) Dreamy Jazz: [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686)
[12:48:37] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy1002 using scap backport" [extensions/CheckUser] (wmf/1.43.0-wmf.8) - https://gerrit.wikimedia.org/r/1038839 (https://phabricator.wikimedia.org/T329493) (owner: Dreamy Jazz)
[12:48:37] <wikibugs>	 (CR) TrainBranchBot: "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) (owner: Dreamy Jazz)
[12:49:12] <wikibugs>	 (Merged) jenkins-bot: [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) (owner: Dreamy Jazz)
[12:49:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool db1246 T363119', diff saved to https://phabricator.wikimedia.org/P64101 and previous config saved to /var/cache/conftool/dbconfig/20240605-124918-arnaudb.json
[12:49:22] <stashbot>	 T363119: db1246 crashed - https://phabricator.wikimedia.org/T363119
[12:49:32] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1246.eqiad.wmnet with reason: maintenance
[12:49:40] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:49:46] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1246.eqiad.wmnet with reason: maintenance
[12:50:37] <wikibugs>	 (Merged) jenkins-bot: sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - https://gerrit.wikimedia.org/r/1038782 (owner: Elukey)
[12:51:16] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2068.codfw.wmnet
[12:51:32] <wikibugs>	 (PS1) Phedenskog: wmftest: Add new Graphite instance for performance test data. [dns] - https://gerrit.wikimedia.org/r/1039207 (https://phabricator.wikimedia.org/T366669)
[12:51:47] <wikibugs>	 (PS15) DCausse: wdqs.data-reload: various fixes [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: Ryan Kemper)
[12:52:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ping2004.codfw.wmnet with OS bookworm
[12:52:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ping2004.codfw.wmnet
[12:52:48] <wikibugs>	 SRE, Infrastructure-Foundations: Move the ping* servers to Bookworm - https://phabricator.wikimedia.org/T366695#9863550 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ping2004.codfw.wmnet with OS bookworm completed: - ping2004 (**PASS**)   - Removed from Puppe...
[12:53:56] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1069.eqiad.wmnet
[12:53:57] <logmsgbot>	 !log dreamyjazz@deploy1002 Started scap: Backport for [[gerrit:1013386|[CheckUser] Stop writing old for event table migration on testwiki (T360686)]]
[12:54:00] <stashbot>	 T360686: Stop writing old on testwiki - https://phabricator.wikimedia.org/T360686
[12:54:12] <Dreamy_Jazz>	 Proceeding with my config change first as the wmf.8 backport is likely to take a while in gate-and-submit-wmf
[12:54:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:54:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:55:19] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2069.codfw.wmnet
[12:55:24] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1070.eqiad.wmnet
[12:56:24] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:aux-worker
[12:59:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1300).
[13:00:04] <jouncebot>	 Dreamy_Jazz and duesen: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:13] <Dreamy_Jazz>	 \o
[13:00:20] <Dreamy_Jazz>	 I am currently deploying my config change
[13:00:31] <Dreamy_Jazz>	 My other change is in gate-and-submit-wmf
[13:01:09] <wikibugs>	 (PS1) Cwhite: Revert "multiversion: Add tests for MWMultiVersion::getMediaWiki()" [mediawiki-config] - https://gerrit.wikimedia.org/r/1038840
[13:01:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P64102 and previous config saved to /var/cache/conftool/dbconfig/20240605-130115-marostegui.json
[13:01:26] <Dreamy_Jazz>	 Got a warning that `check_testservers_baremetal` exceeded the 120s timeout.
[13:01:40] <Dreamy_Jazz>	 It is asking me to retry or continue
[13:01:50] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed - https://phabricator.wikimedia.org/T363119#9863599 (Jclark-ctr) replaced broken cable; server went 2 weeks without fault returning
[13:02:01] <duesen>	 o/
[13:02:31] <Dreamy_Jazz>	 Going to continue as I think it should be fine
[13:02:33] <logmsgbot>	 !log dreamyjazz@deploy1002 dreamyjazz: Backport for [[gerrit:1013386|[CheckUser] Stop writing old for event table migration on testwiki (T360686)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:02:40] <duesen>	 Dreamy_Jazz: let me know when you are done.
[13:02:45] <stashbot>	 T360686: Stop writing old on testwiki - https://phabricator.wikimedia.org/T360686
[13:02:46] <Dreamy_Jazz>	 Sure
[13:02:57] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2069.codfw.wmnet
[13:03:01] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1070.eqiad.wmnet
[13:03:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[13:04:18] <wikibugs>	 (PS1) Elukey: profile::maps::tlsproxy: add SAN to CFSSL TLS cert [puppet] - https://gerrit.wikimedia.org/r/1039211
[13:04:19] <logmsgbot>	 !log dreamyjazz@deploy1002 dreamyjazz: Continuing with sync
[13:04:25] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:04:46] <wikibugs>	 (CR) CI reject: [V:-1] profile::maps::tlsproxy: add SAN to CFSSL TLS cert [puppet] - https://gerrit.wikimedia.org/r/1039211 (owner: Elukey)
[13:05:53] <wikibugs>	 (PS2) Elukey: profile::maps::tlsproxy: add SAN to CFSSL TLS cert [puppet] - https://gerrit.wikimedia.org/r/1039211
[13:06:01] <fabfur>	 !log restarting pybal on lvs7001/lvs7003 to apply IPIP conf (T366466)
[13:06:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:11] <stashbot>	 T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466
[13:07:41] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2754/co" [puppet] - https://gerrit.wikimedia.org/r/1039211 (owner: Elukey)
[13:08:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[13:08:33] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1039211 (owner: Elukey)
[13:08:37] <claime>	 Dreamy_Jazz: Did it tell you what server exceeded that timeout?
[13:08:46] <Dreamy_Jazz>	 No
[13:09:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:09:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:10:14] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:aux-worker
[13:10:17] <wikibugs>	 (CR) Elukey: [V:+1 C:+2] profile::maps::tlsproxy: add SAN to CFSSL TLS cert [puppet] - https://gerrit.wikimedia.org/r/1039211 (owner: Elukey)
[13:10:23] <wikibugs>	 (CR) Daimona Eaytoy: "I seem to remember from past deployments that it's generally better to do one file per patch, as files can only be synced to the servers i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1038862 (https://phabricator.wikimedia.org/T363199) (owner: Mhorsey)
[13:10:28] <wikibugs>	 (PS1) Bartosz Dziewoński: MWMultiVersion: Fix "Undefined index: PATH_INFO" warnings [mediawiki-config] - https://gerrit.wikimedia.org/r/1039212 (https://phabricator.wikimedia.org/T366657)
[13:11:58] <wikibugs>	 (Merged) jenkins-bot: Follow-up: Don't run interact with block buttons if they don't exist [extensions/CheckUser] (wmf/1.43.0-wmf.8) - https://gerrit.wikimedia.org/r/1038839 (https://phabricator.wikimedia.org/T329493) (owner: Dreamy Jazz)
[13:13:10] <logmsgbot>	 !log dreamyjazz@deploy1002 Finished scap: Backport for [[gerrit:1013386|[CheckUser] Stop writing old for event table migration on testwiki (T360686)]] (duration: 19m 13s)
[13:13:13] <stashbot>	 T360686: Stop writing old on testwiki - https://phabricator.wikimedia.org/T360686
[13:13:36] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1071.eqiad.wmnet
[13:13:41] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2070.codfw.wmnet
[13:14:14] <wikibugs>	 (PS1) Ebrahim: Enable numeric sorting for Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/1039213 (https://phabricator.wikimedia.org/T366703)
[13:14:20] <logmsgbot>	 !log dreamyjazz@deploy1002 Started scap: Backport for [[gerrit:1038839|Follow-up: Don't run interact with block buttons if they don't exist (T329493)]]
[13:14:22] <stashbot>	 T329493: Replace Special:CheckUser's 'get users' block form with a usage of Special:InvestigateBlock - https://phabricator.wikimedia.org/T329493
[13:14:40] <wikibugs>	 (PS3) Ilias Sarantopoulos: ml-services: use multi-processing for viwiki in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274)
[13:16:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T364299)', diff saved to https://phabricator.wikimedia.org/P64103 and previous config saved to /var/cache/conftool/dbconfig/20240605-131623-marostegui.json
[13:16:26] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2179.codfw.wmnet with reason: Maintenance
[13:16:27] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[13:16:28] <MatmaRex>	 cwhite: i proposed https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1039212 as an alternative to your revert
[13:16:33] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed - https://phabricator.wikimedia.org/T363119#9863672 (ABran-WMF) leaving the host depooled until tomorrow to see if it stays stable, will close the task upon repool.
[13:16:40] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2179.codfw.wmnet with reason: Maintenance
[13:16:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2179 (T364299)', diff saved to https://phabricator.wikimedia.org/P64104 and previous config saved to /var/cache/conftool/dbconfig/20240605-131647-marostegui.json
[13:17:00] <logmsgbot>	 !log dreamyjazz@deploy1002 dreamyjazz: Backport for [[gerrit:1038839|Follow-up: Don't run interact with block buttons if they don't exist (T329493)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:17:29] <wikibugs>	 (CR) Kosta Harlan: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan)
[13:17:38] <logmsgbot>	 !log dreamyjazz@deploy1002 dreamyjazz: Continuing with sync
[13:17:58] <wikibugs>	 (CR) Elukey: [C:+1] ml-services: use multi-processing for viwiki in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274) (owner: Ilias Sarantopoulos)
[13:18:07] <wikibugs>	 (PS1) Fabfur: Revert "depool text@magru before enabling IPIP encapsulation" [dns] - https://gerrit.wikimedia.org/r/1038841
[13:18:40] <wikibugs>	 (CR) Elukey: ml-services: use multi-processing for viwiki in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274) (owner: Ilias Sarantopoulos)
[13:18:46] <duesen>	 Dreamy_Jazz: I haven't done a config deployment in a while... Remind me please... can I just use scap backport, and it knows what to do?
[13:19:06] <wikibugs>	 (PS2) Fabfur: Revert "depool text@magru before enabling IPIP encapsulation" [dns] - https://gerrit.wikimedia.org/r/1038841 (https://phabricator.wikimedia.org/T366466)
[13:19:10] <wikibugs>	 (CR) Elukey: ml-services: use multi-processing for viwiki in ml-staging (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274) (owner: Ilias Sarantopoulos)
[13:19:33] <Dreamy_Jazz>	 duesen: Yes
[13:19:37] <wikibugs>	 (CR) Vgutierrez: [C:+1] Revert "depool text@magru before enabling IPIP encapsulation" [dns] - https://gerrit.wikimedia.org/r/1038841 (https://phabricator.wikimedia.org/T366466) (owner: Fabfur)
[13:19:40] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:19:55] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2070.codfw.wmnet
[13:20:08] <wikibugs>	 (PS4) Ilias Sarantopoulos: ml-services: use multi-processing for viwiki in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274)
[13:20:19] <wikibugs>	 (CR) Fabfur: [C:+2] Revert "depool text@magru before enabling IPIP encapsulation" [dns] - https://gerrit.wikimedia.org/r/1038841 (https://phabricator.wikimedia.org/T366466) (owner: Fabfur)
[13:20:52] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2071.codfw.wmnet
[13:21:10] <fabfur>	 !log enable magru DC after applying IPIP encapsulation patches (T366466)
[13:21:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:12] <stashbot>	 T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466
[13:21:26] <wikibugs>	 (CR) Ilias Sarantopoulos: ml-services: use multi-processing for viwiki in ml-staging (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274) (owner: Ilias Sarantopoulos)
[13:21:30] <wikibugs>	 (CR) Bartosz Dziewoński: [C:+1] "Alternative: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1039212" [mediawiki-config] - https://gerrit.wikimedia.org/r/1038840 (owner: Cwhite)
[13:22:43] <wikibugs>	 (CR) Elukey: [C:+1] ml-services: use multi-processing for viwiki in ml-staging (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274) (owner: Ilias Sarantopoulos)
[13:23:46] <duesen>	 Dreamy_Jazz: cool thanks. Are you still deploying? 
[13:23:54] <Dreamy_Jazz>	 duesen: Yes
[13:24:08] <duesen>	 k
[13:24:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:24:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:25:23] <wikibugs>	 (CR) JMeybohm: "I don't really like the fact that this creates an implicit dependency to the mw-script namespace being created, but I think its the most s" [deployment-charts] - https://gerrit.wikimedia.org/r/1035070 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[13:25:31] <godog>	 jouncebot: next
[13:25:31] <jouncebot>	 In 0 hour(s) and 34 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1400)
[13:25:43] <godog>	 I'll sneak a graphite1005 reboot now
[13:25:47] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+2] ml-services: use multi-processing for viwiki in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274) (owner: Ilias Sarantopoulos)
[13:25:56] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host graphite1005.eqiad.wmnet
[13:25:59] <logmsgbot>	 !log dreamyjazz@deploy1002 Finished scap: Backport for [[gerrit:1038839|Follow-up: Don't run interact with block buttons if they don't exist (T329493)]] (duration: 11m 39s)
[13:26:02] <stashbot>	 T329493: Replace Special:CheckUser's 'get users' block form with a usage of Special:InvestigateBlock - https://phabricator.wikimedia.org/T329493
[13:26:07] <Dreamy_Jazz>	 duesen: I'm done with my patch.
[13:26:14] <Dreamy_Jazz>	 You can proceed with your config change.
[13:26:17] <duesen>	 Dreamy_Jazz: excellent, thank you!
[13:26:28] <duesen>	 I'll go ahead with my config patch, then
[13:26:34] <Dreamy_Jazz>	 I would recommend running the command in screen / tmux in case your connection drops
[13:26:42] <godog>	 ouch I didn't realize a deployment was in progress, my bad! anyways graphite1005 will be back soon btw
[13:26:48] <wikibugs>	 (Merged) jenkins-bot: ml-services: use multi-processing for viwiki in ml-staging [deployment-charts] - https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274) (owner: Ilias Sarantopoulos)
[13:26:56] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1071.eqiad.wmnet
[13:26:56] <godog>	 jouncebot: now and next
[13:26:56] <jouncebot>	 For the next 0 hour(s) and 33 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1300)
[13:27:04] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1072.eqiad.wmnet
[13:27:09] <godog>	 that's what I wanted
[13:27:10] <cwhite>	 MatmaRex: we can try your alternate proposal first.
[13:27:17] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1038688 (https://phabricator.wikimedia.org/T361013) (owner: Daniel Kinzler)
[13:27:25] <MatmaRex>	 thanks
[13:27:43] <cwhite>	 Chances are it will solve the issue, but leave the bug as-is.
[13:27:57] <elukey>	 !log systemctl reset-failed prometheus-redis-exporter@6380.service redis-instance-tcp_6380.service on netbox[12]002 + apt-get purge of redis-server and prometheus-redis-exporter packages to clean up stale configs (no local redis is used)
[13:27:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:23] <wikibugs>	 (CR) Andrew Bogott: [C:+1] openstack: wmfkeystonehooks: Use project name for Wikitech page [puppet] - https://gerrit.wikimedia.org/r/1039204 (https://phabricator.wikimedia.org/T343158) (owner: Majavah)
[13:28:35] <duesen>	 grrr, merge conflict
[13:28:44] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: prometheus-redis-exporter@6380.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:28:59] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2071.codfw.wmnet
[13:29:21] <wikibugs>	 (PS2) Daniel Kinzler: Set LinterParseOnDerivedDataUpdate to false [mediawiki-config] - https://gerrit.wikimedia.org/r/1038688 (https://phabricator.wikimedia.org/T361013)
[13:29:21] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2072.codfw.wmnet
[13:29:36] <wikibugs>	 (CR) TrainBranchBot: "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1038688 (https://phabricator.wikimedia.org/T361013) (owner: Daniel Kinzler)
[13:30:11] <wikibugs>	 (Merged) jenkins-bot: Set LinterParseOnDerivedDataUpdate to false [mediawiki-config] - https://gerrit.wikimedia.org/r/1038688 (https://phabricator.wikimedia.org/T361013) (owner: Daniel Kinzler)
[13:30:27] <wikibugs>	 (CR) Majavah: [C:+2] openstack: wmfkeystonehooks: Use project name for Wikitech page [puppet] - https://gerrit.wikimedia.org/r/1039204 (https://phabricator.wikimedia.org/T343158) (owner: Majavah)
[13:30:43] <logmsgbot>	 !log daniel@deploy1002 Started scap: Backport for [[gerrit:1038688|Set LinterParseOnDerivedDataUpdate to false (T361013)]]
[13:30:47] <stashbot>	 T361013: Update lint tables independently of changeprop/restbase - https://phabricator.wikimedia.org/T361013
[13:33:44] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: prometheus-redis-exporter@6380.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:34:16] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[13:34:32] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[13:34:33] <logmsgbot>	 !log daniel@deploy1002 daniel: Backport for [[gerrit:1038688|Set LinterParseOnDerivedDataUpdate to false (T361013)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:34:40] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:35:12] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1072.eqiad.wmnet
[13:35:49] <wikibugs>	 ops-codfw, SRE, cloud-services-team, DC-Ops, decommission-hardware: decommission cloudcontrol2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T364577#9863782 (Jhancock.wm) Open→Resolved
[13:35:59] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1073.eqiad.wmnet
[13:37:11] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2072.codfw.wmnet
[13:37:18] <wikibugs>	 (PS1) Jelto: aptrepo::staging: use gitlab client to download file, fix get_all [puppet] - https://gerrit.wikimedia.org/r/1039217 (https://phabricator.wikimedia.org/T347004)
[13:37:21] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus7001.magru.wmnet
[13:37:40] <wikibugs>	 (CR) CI reject: [V:-1] aptrepo::staging: use gitlab client to download file, fix get_all [puppet] - https://gerrit.wikimedia.org/r/1039217 (https://phabricator.wikimedia.org/T347004) (owner: Jelto)
[13:37:41] <logmsgbot>	 !log filippo@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host graphite1005.eqiad.wmnet
[13:37:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6001.drmrs.wmnet
[13:38:59] <wikibugs>	 (PS2) Jelto: aptrepo::staging: use gitlab client to download file, fix get_all [puppet] - https://gerrit.wikimedia.org/r/1039217 (https://phabricator.wikimedia.org/T347004)
[13:39:29] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus6002.drmrs.wmnet
[13:39:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:40:08] <wikibugs>	 (PS3) Jelto: aptrepo::staging: use gitlab client to download file, fix get_all [puppet] - https://gerrit.wikimedia.org/r/1039217 (https://phabricator.wikimedia.org/T347004)
[13:40:08] <logmsgbot>	 !log daniel@deploy1002 daniel: Continuing with sync
[13:42:29] <wikibugs>	 (CR) JMeybohm: deployment_server: Add a mwscript-k8s cleanup script (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1037868 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[13:42:44] <wikibugs>	 (PS1) Majavah: openstack: wmfkeystonehooks: Add missing self argument [puppet] - https://gerrit.wikimedia.org/r/1039218
[13:42:47] <wikibugs>	 (PS1) Alexandros Kosiaris: mediawiki-image-download: Support pct based aborted runs [puppet] - https://gerrit.wikimedia.org/r/1039219
[13:43:23] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus7001.magru.wmnet
[13:43:28] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus6002.drmrs.wmnet
[13:43:29] <wikibugs>	 (CR) Majavah: [C:+2] openstack: wmfkeystonehooks: Add missing self argument [puppet] - https://gerrit.wikimedia.org/r/1039218 (owner: Majavah)
[13:43:46] <wikibugs>	 (CR) Urbanecm: [C:+1] "lgtm" [mediawiki-config] - https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) (owner: Sergio Gimeno)
[13:43:54] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1073.eqiad.wmnet
[13:44:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:45:31] <inflatador>	 !log bking@an-db1001 install acl pkg T363001
[13:45:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:34] <stashbot>	 T363001:  Create a helm chart for airflow that is appropriate to our needs - https://phabricator.wikimedia.org/T363001
[13:46:14] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1074.eqiad.wmnet
[13:46:15] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus5002.eqsin.wmnet
[13:46:19] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2073.codfw.wmnet
[13:46:21] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus4002.ulsfo.wmnet
[13:46:35] <elukey>	 !log factory reset for sretest1001 to test the new provision cookbook - T365372
[13:46:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:38] <stashbot>	 T365372: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372
[13:46:45] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus3003.esams.wmnet
[13:46:46] <wikibugs>	 (PS1) Jforrester: wikifunctions: Upgrade orchestrator from 2024-05-23-164021 to 2024-06-05-003919 [deployment-charts] - https://gerrit.wikimedia.org/r/1039220 (https://phabricator.wikimedia.org/T340561)
[13:46:48] <wikibugs>	 (PS2) Urbanecm: testwiki: Enable CommunityConfiguration [mediawiki-config] - https://gerrit.wikimedia.org/r/1038701 (https://phabricator.wikimedia.org/T360954)
[13:47:03] <wikibugs>	 (PS1) Jforrester: wikifunctions: Upgrade evaluators from 2024-05-28-185827 to 2024-05-31-163732 [deployment-charts] - https://gerrit.wikimedia.org/r/1039221 (https://phabricator.wikimedia.org/T360676)
[13:47:20] <wikibugs>	 (CR) Alexandros Kosiaris: [C:-1] "LGTM, minor nitpicks" [deployment-charts] - https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: Giuseppe Lavagetto)
[13:47:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report
[13:47:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0)
[13:48:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ping1004.eqiad.wmnet
[13:48:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[13:48:33] <logmsgbot>	 !log daniel@deploy1002 Finished scap: Backport for [[gerrit:1038688|Set LinterParseOnDerivedDataUpdate to false (T361013)]] (duration: 17m 50s)
[13:48:36] <stashbot>	 T361013: Update lint tables independently of changeprop/restbase - https://phabricator.wikimedia.org/T361013
[13:48:56] <inflatador>	 !log bking@an-db1001 install python3-psycopg2 pkg T363001
[13:48:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:40] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:50:28] <cwhite>	 duesen: all clear for another backport?
[13:50:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6001.drmrs.wmnet
[13:51:27] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[13:52:21] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus4002.ulsfo.wmnet
[13:52:22] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus5002.eqsin.wmnet
[13:52:24] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2073.codfw.wmnet
[13:52:31] <wikibugs>	 (PS13) Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan)
[13:52:44] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2074.codfw.wmnet
[13:52:46] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus3003.esams.wmnet
[13:53:33] <wikibugs>	 (CR) Clément Goubert: mediawiki-image-download: Support pct based aborted runs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1039219 (owner: Alexandros Kosiaris)
[13:54:09] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1074.eqiad.wmnet
[13:54:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:55:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ping1004.eqiad.wmnet - jmm@cumin2002"
[13:55:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:56:19] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Configure memcached on idp-test hosts to run as 'memcache' [puppet] - https://gerrit.wikimedia.org/r/1039206 (https://phabricator.wikimedia.org/T273950) (owner: Muehlenhoff)
[13:56:33] <wikibugs>	 (PS1) Majavah: openstack: wmfkeystonehooks: Project is a dict, not an object [puppet] - https://gerrit.wikimedia.org/r/1039222
[13:56:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6001.drmrs.wmnet
[13:57:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6001.drmrs.wmnet
[13:57:51] <wikibugs>	 (03CR) 10Majavah: [C:03+2] openstack: wmfkeystonehooks: Project is a dict, not an object [puppet] - 10https://gerrit.wikimedia.org/r/1039222 (owner: 10Majavah)
[14:00:04] <jouncebot>	 Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1400)
[14:00:19] <James_F>	 Hey hey.
[14:00:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ping1004.eqiad.wmnet - jmm@cumin2002"
[14:00:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:00:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ping1004.eqiad.wmnet on all recursors
[14:00:22] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-05-23-164021 to 2024-06-05-003919 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039220 (https://phabricator.wikimedia.org/T340561) (owner: 10Jforrester)
[14:00:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ping1004.eqiad.wmnet on all recursors
[14:00:29] <claime>	 cwhite: Are you deploying then?
[14:00:34] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2074.codfw.wmnet
[14:00:41] <claime>	 James_F: I'll wait until you're done to restart the reboots I guess :p
[14:00:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6002.drmrs.wmnet
[14:00:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ping1004.eqiad.wmnet - jmm@cumin2002"
[14:01:04] <James_F>	 claime: Oh, sorry! What are you rebooting? Might be OK to go in parallel.
[14:01:20] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-05-23-164021 to 2024-06-05-003919 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039220 (https://phabricator.wikimedia.org/T340561) (owner: 10Jforrester)
[14:01:24] <claime>	 James_F: Not really, I'm rebooting all k8s codfw
[14:01:39] <James_F>	 claime: Hmm. Maybe not ideal if I'm deploying to k8s, fair.
[14:01:46] <claime>	 I still have ~60 nodes to go, and that will cordon them all, making deployments a little difficult 
[14:01:58] <James_F>	 I'll be fast!
[14:01:59] <claime>	 Although for wf it should fit
[14:02:01] <claime>	 no worries
[14:02:04] <James_F>	 (He says, waiting for the git update to land on deploy1002.)
[14:02:14] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:02:21] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:02:32] <James_F>	 Oh dear.
[14:02:51] <James_F>	 Deploy failed.
[14:02:58] <MatmaRex>	 are we still doing the backport window?
[14:02:59] <cwhite>	 claime: I want to deploy.  Was waiting for the all clear though.
[14:03:06] <James_F>	 "Error: UPGRADE FAILED: an error occurred while rolling back the release. original upgrade error: cannot patch "function-orchestrator-main-orchestrator-tls-proxy-certs" with kind Certificate: Internal error occurred…"
[14:03:24] <James_F>	 claime: Does this mean you need to reboot first, or is it a different issue?
[14:03:33] <claime>	 different issue 
[14:03:44] <James_F>	 Hmm.
[14:03:45] <claime>	 especially in staging, I'm rebooting the prod cluster
[14:03:48] <James_F>	 Ack.
[14:04:01] <James_F>	 Well, if I can't deploy even to staging I can't validate.
[14:04:09] <James_F>	 So I suppose I should revert and give up?
[14:04:11] <wikibugs>	 (03PS1) 10Vgutierrez: depool text@eqsin before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1039223 (https://phabricator.wikimedia.org/T366466)
[14:04:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ping1004.eqiad.wmnet - jmm@cumin2002"
[14:04:29] <claime>	 MatmaRex: patches were deployed as part of the backport window, yes
[14:04:40] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:04:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ping1004.eqiad.wmnet with OS bookworm
[14:04:50] <claime>	 cwhite is waiting for the go ahead from duesen that his patch deployed correctly and he can proceed
[14:04:53] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Move the ping* servers to Bookworm - https://phabricator.wikimedia.org/T366695#9863878 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ping1004.eqiad.wmnet with OS bookworm
[14:05:13] <claime>	 James_F: let me check the diff
[14:05:19] <MatmaRex>	 claime: the patches i was interested in weren't
[14:05:27] <MatmaRex>	 so i'm wondering if the window is done or in progress
[14:05:42] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1075.eqiad.wmnet
[14:05:47] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2075.codfw.wmnet
[14:05:51] <MatmaRex>	 (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1038840 / https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1039212)
[14:05:56] <James_F>	 MatmaRex: It's over, I'm now meant to be deploying my services.
[14:06:02] <MatmaRex>	 alright
[14:06:03] <claime>	 MatmaRex: what were those patches? because I think everything that was in the deployment calendar except cwhite's patch was deployed
[14:06:16] <James_F>	 (Though k8s service deploys and MW deploys don't really interact.)
[14:07:16] <cwhite>	 claime: they're the same patches
[14:07:23] <claime>	 ah
[14:07:25] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host sretest1001.mgmt.eqiad.wmnet with reboot policy FORCED
[14:07:46] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Enable IPIP on high-traffic1@eqsin for text services [puppet] - 10https://gerrit.wikimedia.org/r/1039224 (https://phabricator.wikimedia.org/T366466)
[14:07:48] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: enable IPIP on text@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1039225 (https://phabricator.wikimedia.org/T366466)
[14:08:01] <wikibugs>	 (03CR) 10Tchanders: [C:03+1] [CheckUser] Stop writing old for event tables migration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038740 (https://phabricator.wikimedia.org/T360685) (owner: 10Dreamy Jazz)
[14:08:17] <wikibugs>	 (03CR) 10Tchanders: [C:03+1] [CheckUser] Stop writing old for event tables migration on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038741 (https://phabricator.wikimedia.org/T360685) (owner: 10Dreamy Jazz)
[14:08:26] <wikibugs>	 (03CR) 10Tchanders: [C:03+1] [CheckUser] Stop writing old for event tables migration on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038742 (https://phabricator.wikimedia.org/T360685) (owner: 10Dreamy Jazz)
[14:09:21] <wikibugs>	 (03CR) 10Alexandros Kosiaris: mediawiki-image-download: Support pct based aborted runs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1039219 (owner: 10Alexandros Kosiaris)
[14:09:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:09:35] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1039224 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez)
[14:09:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:10:27] <claime>	 James_F: I'm going to try to deploy your wf patch to see what happens
[14:10:30] <James_F>	 Ack.
[14:10:46] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:10:52] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:12:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T352010)', diff saved to https://phabricator.wikimedia.org/P64105 and previous config saved to /var/cache/conftool/dbconfig/20240605-141210-ladsgroup.json
[14:12:14] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[14:13:25] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2075.codfw.wmnet
[14:13:28] <claime>	 James_F: erm.
[14:13:32] <icinga-wm_>	 PROBLEM - Host mw1377 is DOWN: PING CRITICAL - Packet loss = 100%
[14:13:33] <claime>	 It did pull your change
[14:13:37] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1075.eqiad.wmnet
[14:13:56] <claime>	 the function-orchestrator pod runs docker-registry.wikimedia.org/repos/abstract-wiki/wikifunctions/function-orchestrator:2024-06-05-003919
[14:14:06] <James_F>	 Yes.
[14:14:09] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release wikifunctions/main-orchestrator on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[14:14:19] <James_F>	 But it seems to be in a failed state?
[14:14:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:14:26] <claime>	 but it fails at redeploying the tls proxy for some reason
[14:15:13] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix Hiera option name [puppet] - 10https://gerrit.wikimedia.org/r/1039226 (https://phabricator.wikimedia.org/T273950)
[14:15:18] <wikibugs>	 (03PS14) 10Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan)
[14:15:26] <James_F>	 A helm framework issue?
[14:15:30] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2756/console" [puppet] - 10https://gerrit.wikimedia.org/r/1039225 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez)
[14:15:43] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1076.eqiad.wmnet
[14:15:48] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2076.codfw.wmnet
[14:15:49] <wikibugs>	 (03PS6) 10Effie Mouzeli: mc.php: store mcrouter location in apcu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039197 (https://phabricator.wikimedia.org/T363186)
[14:15:49] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039226 (https://phabricator.wikimedia.org/T273950) (owner: 10Muehlenhoff)
[14:16:04] <icinga-wm_>	 RECOVERY - Host mw1377 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[14:16:29] <claime>	 James_F: looks more like a certificate issue which is... strange
[14:17:01] <claime>	 jayme: did something change recently for tls on staging-eqiad?
[14:17:01] <James_F>	 Yeah.
[14:17:23] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2757/co" [puppet] - 10https://gerrit.wikimedia.org/r/1039225 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez)
[14:17:27] <wikibugs>	 (03PS15) 10Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan)
[14:17:40] <jayme>	 claime: not that I know of...let me read backlog
[14:18:03] <claime>	 jayme: UPGRADE FAILED: an error occurred while rolling back the release. original upgrade error: cannot patch "function-orchestrator-main-orchestrator-tls-proxy-certs" with kind Cert
[14:18:05] <claime>	 ificate: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s":
[14:18:07] <claime>	  x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "cert-manager-webhook-ca")
[14:18:41] <jayme>	 oh, interesting...
[14:18:58] <claime>	 isn't it
[14:19:23] <jayme>	 that's probably the apiserver failing to call the cert-manager webhook
[14:19:25] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:19:27] <jayme>	 I can take a look
[14:20:19] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] mediawiki-image-download: Support pct based aborted runs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1039219 (owner: 10Alexandros Kosiaris)
[14:20:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T364069)', diff saved to https://phabricator.wikimedia.org/P64106 and previous config saved to /var/cache/conftool/dbconfig/20240605-142018-marostegui.json
[14:20:23] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[14:21:29] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Fix Hiera option name [puppet] - 10https://gerrit.wikimedia.org/r/1039226 (https://phabricator.wikimedia.org/T273950) (owner: 10Muehlenhoff)
[14:21:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:21:49] <claime>	 jayme: thanks
[14:23:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6002.drmrs.wmnet
[14:23:45] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2076.codfw.wmnet
[14:23:50] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1076.eqiad.wmnet
[14:24:27] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] depool text@eqsin before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1039223 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez)
[14:24:32] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete virt-star stub secret [labs/private] - 10https://gerrit.wikimedia.org/r/1039227
[14:24:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:26:04] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] depool text@eqsin before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1039223 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez)
[14:26:29] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] hiera: Enable IPIP on high-traffic1@eqsin for text services [puppet] - 10https://gerrit.wikimedia.org/r/1039224 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez)
[14:27:05] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] hiera: enable IPIP on text@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1039225 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez)
[14:27:19] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P64107 and previous config saved to /var/cache/conftool/dbconfig/20240605-142718-ladsgroup.json
[14:28:00] <vgutierrez>	 !log depool text@eqsin before enabling IPIP encapsulation - T366466
[14:28:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:03] <stashbot>	 T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466
[14:29:03] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[14:29:08] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9863961 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq...
[14:29:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6002.drmrs.wmnet
[14:29:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:29:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6002.drmrs.wmnet
[14:31:56] <wikibugs>	 (03PS1) 10Muehlenhoff: Configure memcached on idp hosts to run as 'memcache' [puppet] - 10https://gerrit.wikimedia.org/r/1039229 (https://phabricator.wikimedia.org/T273950)
[14:32:34] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9863988 (10cmooney)
[14:33:34] <wikibugs>	 06SRE, 10Cloud-Services, 06serviceops, 13Patch-For-Review: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9863991 (10MoritzMuehlenhoff)
[14:34:40] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:35:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P64108 and previous config saved to /var/cache/conftool/dbconfig/20240605-143526-marostegui.json
[14:36:49] <wikibugs>	 (03PS7) 10JHathaway: phab: query for inbound mail servers [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395)
[14:38:44] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:47] <wikibugs>	 (03CR) 10Effie Mouzeli: mc.php: store mcrouter location in apcu (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039197 (https://phabricator.wikimedia.org/T363186) (owner: 10Effie Mouzeli)
[14:39:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:42:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P64109 and previous config saved to /var/cache/conftool/dbconfig/20240605-144227-ladsgroup.json
[14:42:54] <wikibugs>	 (03CR) 10JHathaway: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[14:43:26] <cwhite>	 claime: just so you know I'm still around and hoping to get this backport deployed
[14:43:39] <claime>	 cwhite: yeah, I haven't restarted the reboots
[14:43:49] <claime>	 I think I'll give up on them for today and will finish tomorrow
[14:44:05] <claime>	 tbh I think you should go ahead with your backport
[14:44:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host serpens.wikimedia.org
[14:44:09] <wikibugs>	 (03PS7) 10Effie Mouzeli: mc.php: store mcrouter location in apcu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039197 (https://phabricator.wikimedia.org/T363186)
[14:44:32] <cwhite>	 ok, thank you :)
[14:44:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:45:17] <wikibugs>	 (03CR) 10Effie Mouzeli: mc.php: store mcrouter location in apcu (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039197 (https://phabricator.wikimedia.org/T363186) (owner: 10Effie Mouzeli)
[14:45:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cwhite@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039212 (https://phabricator.wikimedia.org/T366657) (owner: 10Bartosz Dziewoński)
[14:45:25] <wikibugs>	 (03CR) 10JHathaway: [V:03+1] "I think your concerns have been addressed, please take another look." [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[14:45:43] <wikibugs>	 (03CR) 10Cyndywikime: [C:04-1] "For visibility, needs rebase :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038701 (https://phabricator.wikimedia.org/T360954) (owner: 10Urbanecm)
[14:46:00] <wikibugs>	 (03PS3) 10Urbanecm: Growth: Use `growthexperiments` DB list for enabling GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038882 (https://phabricator.wikimedia.org/T364892)
[14:46:03] <wikibugs>	 (03PS7) 10Sergio Gimeno: [Beta] Enable CommunityConfiguration extension in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892)
[14:46:06] <wikibugs>	 (03Merged) 10jenkins-bot: MWMultiVersion: Fix "Undefined index: PATH_INFO" warnings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039212 (https://phabricator.wikimedia.org/T366657) (owner: 10Bartosz Dziewoński)
[14:46:35] <logmsgbot>	 !log cwhite@deploy1002 Started scap: Backport for [[gerrit:1039212|MWMultiVersion: Fix "Undefined index: PATH_INFO" warnings (T366657)]]
[14:46:38] <stashbot>	 T366657: Lots of logs: "PHP Notice: Undefined Index: PATH_INFO" - https://phabricator.wikimedia.org/T366657
[14:47:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host serpens.wikimedia.org
[14:47:56] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on high-traffic1@eqsin for text services [puppet] - 10https://gerrit.wikimedia.org/r/1039224 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez)
[14:48:24] <wikibugs>	 (03PS3) 10Urbanecm: testwiki: Enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038701 (https://phabricator.wikimedia.org/T360954)
[14:48:30] <wikibugs>	 (03PS4) 10Urbanecm: Growth: Use `growthexperiments` DB list for enabling GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038882 (https://phabricator.wikimedia.org/T364892)
[14:48:34] <wikibugs>	 (03PS8) 10Sergio Gimeno: [Beta] Enable CommunityConfiguration extension in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892)
[14:48:38] <wikibugs>	 (03PS4) 10Urbanecm: testwiki: Enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038701 (https://phabricator.wikimedia.org/T360954)
[14:49:05] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: enable IPIP on text@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1039225 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez)
[14:49:08] <logmsgbot>	 !log cwhite@deploy1002 matmarex and cwhite: Backport for [[gerrit:1039212|MWMultiVersion: Fix "Undefined index: PATH_INFO" warnings (T366657)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:49:40] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:50:26] <cwhite>	 wiki still renders - continuing
[14:50:28] <logmsgbot>	 !log cwhite@deploy1002 matmarex and cwhite: Continuing with sync
[14:50:32] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039229 (https://phabricator.wikimedia.org/T273950) (owner: 10Muehlenhoff)
[14:50:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P64110 and previous config saved to /var/cache/conftool/dbconfig/20240605-145034-marostegui.json
[14:50:44] <wikibugs>	 (03PS1) 10Majavah: openstack: wmfkeystonehooks: Use project name in created DNS zone names [puppet] - 10https://gerrit.wikimedia.org/r/1039231 (https://phabricator.wikimedia.org/T343158)
[14:51:08] <wikibugs>	 (03CR) 10CI reject: [V:04-1] openstack: wmfkeystonehooks: Use project name in created DNS zone names [puppet] - 10https://gerrit.wikimedia.org/r/1039231 (https://phabricator.wikimedia.org/T343158) (owner: 10Majavah)
[14:51:14] <wikibugs>	 (03CR) 10Klausman: [C:03+2] base functions: make sleep() output a bit friendlier (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759 (owner: 10Klausman)
[14:52:03] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265)
[14:52:06] <wikibugs>	 (03PS2) 10Majavah: openstack: wmfkeystonehooks: Use project name in created DNS zone names [puppet] - 10https://gerrit.wikimedia.org/r/1039231 (https://phabricator.wikimedia.org/T343158)
[14:52:44] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto)
[14:53:44] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:54:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:55:29] <vgutierrez>	 !log rolling restart of pybal on lvs5006 and lvs5004 - T366466
[14:55:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb2002.codfw.wmnet
[14:55:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:32] <stashbot>	 T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466
[14:55:34] <logmsgbot>	 !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[14:55:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:55:57] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[14:56:07] <James_F>	 jayme: Should I give up on the deployment and revert my deployment-charts patch?
[14:57:08] <jayme>	 James_F: I need a bit more time, but if it's fine by you, you can leave the change intact and I can deploy to staging as soon as I've figured out what's wrong
[14:57:15] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864036 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad....
[14:57:17] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864037 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq...
[14:57:17] <James_F>	 Ack. That'd be great.
[14:57:18] <jayme>	 and ping you after for prod deployments
[14:57:24] <James_F>	 jayme: Thank you!
[14:57:35] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T352010)', diff saved to https://phabricator.wikimedia.org/P64111 and previous config saved to /var/cache/conftool/dbconfig/20240605-145735-ladsgroup.json
[14:57:37] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
[14:57:38] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[14:57:50] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
[14:57:58] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T352010)', diff saved to https://phabricator.wikimedia.org/P64112 and previous config saved to /var/cache/conftool/dbconfig/20240605-145757-ladsgroup.json
[14:59:07] <logmsgbot>	 !log cwhite@deploy1002 Finished scap: Backport for [[gerrit:1039212|MWMultiVersion: Fix "Undefined index: PATH_INFO" warnings (T366657)]] (duration: 12m 32s)
[14:59:10] <stashbot>	 T366657: Lots of logs: "PHP Notice: Undefined Index: PATH_INFO" - https://phabricator.wikimedia.org/T366657
[15:00:30] <wikibugs>	 (03PS1) 10Vgutierrez: Revert "depool text@eqsin before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1038842 (https://phabricator.wikimedia.org/T366466)
[15:00:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:00:55] <wikibugs>	 (03Abandoned) 10Cwhite: Revert "multiversion: Add tests for MWMultiVersion::getMediaWiki()" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038840 (owner: 10Cwhite)
[15:01:11] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[15:01:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb2002.codfw.wmnet
[15:02:02] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1001.mgmt.eqiad.wmnet with reboot policy FORCED
[15:04:40] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:04:43] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] Revert "depool text@eqsin before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1038842 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez)
[15:04:50] <vgutierrez>	 !log repool text@eqsin with IPIP encapsulation enabled - T366466
[15:04:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:54] <stashbot>	 T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466
[15:04:55] <wikibugs>	 (03CR) 10JHathaway: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[15:05:03] <vgutierrez>	 bblack, urandom, claime, Emperor: ^^
[15:05:21] <logmsgbot>	 !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[15:05:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T364069)', diff saved to https://phabricator.wikimedia.org/P64113 and previous config saved to /var/cache/conftool/dbconfig/20240605-150542-marostegui.json
[15:05:45] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1243.eqiad.wmnet with reason: Maintenance
[15:05:45] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[15:05:58] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1243.eqiad.wmnet with reason: Maintenance
[15:05:58] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864091 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad....
[15:05:59] <jnuche>	 jouncebot: nowandnext
[15:05:59] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 24 minute(s)
[15:05:59] <jouncebot>	 In 1 hour(s) and 24 minute(s): One-off deployment for T365155 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1630)
[15:06:00] <stashbot>	 T365155: Text id verification makes dumps skip many good rows - https://phabricator.wikimedia.org/T365155
[15:06:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T364069)', diff saved to https://phabricator.wikimedia.org/P64114 and previous config saved to /var/cache/conftool/dbconfig/20240605-150605-marostegui.json
[15:07:23] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.86.0" for 286 hosts
[15:08:36] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.86.0" for 285 hosts
[15:09:06] <wikibugs>	 (03CR) 10Elukey: [C:03+2] "To keep archives happy: after resetting bios and factory reset via idrac, the cookbook worked nicely." [cookbooks] - 10https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey)
[15:09:15] <logmsgbot>	 !log jnuche@deploy1002 Installation of scap version "4.86.0" completed for 285 hosts
[15:09:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:09:51] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1001']
[15:10:36] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wikikube-ctrl1001']
[15:10:40] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1001']
[15:10:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:12:14] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2077.codfw.wmnet
[15:13:02] <wikibugs>	 (03Merged) 10jenkins-bot: sre.host.provision: no-op refactor to highlight DELL-specific confs [cookbooks] - 10https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey)
[15:13:07] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1077.eqiad.wmnet
[15:13:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb1001.eqiad.wmnet
[15:17:21] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265)
[15:17:22] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mw-debug: add statsd service everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039234 (https://phabricator.wikimedia.org/T365265)
[15:18:13] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mw-debug: add statsd service everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039234 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto)
[15:18:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto)
[15:18:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1001.eqiad.wmnet
[15:19:40] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:19:43] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2077.codfw.wmnet
[15:19:52] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2078.codfw.wmnet
[15:20:55] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1077.eqiad.wmnet
[15:21:03] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1078.eqiad.wmnet
[15:24:18] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM pybal-test2003.codfw.wmnet
[15:24:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:24:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:25:05] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ping1004.eqiad.wmnet with OS bookworm
[15:25:06] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ping1004.eqiad.wmnet
[15:25:15] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Move the ping* servers to Bookworm - https://phabricator.wikimedia.org/T366695#9864158 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ping1004.eqiad.wmnet with OS bookworm executed with errors: - ping1004 (**FAIL**)   - Removed...
[15:26:11] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Add new chart statsd-exporter (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto)
[15:26:21] <wikibugs>	 (03PS7) 10Giuseppe Lavagetto: Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265)
[15:26:21] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265)
[15:26:21] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mw-debug: add statsd service everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039234 (https://phabricator.wikimedia.org/T365265)
[15:26:28] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM pybal-test2003.codfw.wmnet
[15:27:02] <wikibugs>	 (03CR) 10CI reject: [V:04-1] statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto)
[15:27:05] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto)
[15:27:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mw-debug: add statsd service everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039234 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto)
[15:27:31] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2078.codfw.wmnet
[15:28:10] <wikibugs>	 (03PS1) 10Andrea Denisse: traffic: Add discovery entries for the pyrra, slo, and slos domains [puppet] - 10https://gerrit.wikimedia.org/r/1039236 (https://phabricator.wikimedia.org/T356386)
[15:28:11] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2079.codfw.wmnet
[15:28:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] traffic: Add discovery entries for the pyrra, slo, and slos domains [puppet] - 10https://gerrit.wikimedia.org/r/1039236 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse)
[15:28:39] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1078.eqiad.wmnet
[15:28:41] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+2] traffic: Add discovery entries for the pyrra, slo, and slos domains [puppet] - 10https://gerrit.wikimedia.org/r/1039236 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse)
[15:29:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:29:29] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1079.eqiad.wmnet
[15:30:07] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-ctrl1001']
[15:32:59] <moritzm>	 !log rebalancing drmrs Ganeti clusters
[15:32:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb2002.codfw.wmnet
[15:34:25] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:34:41] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[15:34:54] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864232 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq...
[15:36:02] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2079.codfw.wmnet
[15:36:13] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2080.codfw.wmnet
[15:37:11] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1079.eqiad.wmnet
[15:37:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb2002.codfw.wmnet
[15:37:29] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1080.eqiad.wmnet
[15:39:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:39:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb1002.eqiad.wmnet
[15:40:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:43:33] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2080.codfw.wmnet
[15:43:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb1002.eqiad.wmnet
[15:43:59] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1080.eqiad.wmnet
[15:44:20] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1081.eqiad.wmnet
[15:46:03] <logmsgbot>	 !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[15:46:10] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864265 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad....
[15:46:48] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9864270 (10ssingh)
[15:49:40] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:50:23] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T352010)', diff saved to https://phabricator.wikimedia.org/P64115 and previous config saved to /var/cache/conftool/dbconfig/20240605-155023-ladsgroup.json
[15:50:26] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[15:50:38] <wikibugs>	 (03PS1) 10Sergio Gimeno: Improve navigation link handling in CommunityConfiguration [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038843 (https://phabricator.wikimedia.org/T364938)
[15:51:07] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[15:51:47] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1081.eqiad.wmnet
[15:51:50] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1082.eqiad.wmnet
[15:52:32] <wikibugs>	 (03PS8) 10Giuseppe Lavagetto: Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265)
[15:52:36] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265)
[15:52:40] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: mw-debug: add statsd service everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039234 (https://phabricator.wikimedia.org/T365265)
[15:52:44] <wikibugs>	 (03PS1) 10AOkoth: miscweb: update security-landing-page [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039237 (https://phabricator.wikimedia.org/T350796)
[15:52:53] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864316 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq...
[15:53:15] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] miscweb: update security-landing-page [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039237 (https://phabricator.wikimedia.org/T350796) (owner: 10AOkoth)
[15:53:19] <wikibugs>	 (03CR) 10Jelto: [C:03+1] miscweb: update security-landing-page [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039237 (https://phabricator.wikimedia.org/T350796) (owner: 10AOkoth)
[15:53:29] <wikibugs>	 (03CR) 10AOkoth: [V:03+2 C:03+2] miscweb: update security-landing-page [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039237 (https://phabricator.wikimedia.org/T350796) (owner: 10AOkoth)
[15:54:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:55:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:56:05] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9864343 (10elukey) First roadblock: https://www.supermicro.com/en/support/BMC_Unique_Password  It seems that every s...
[15:56:47] <logmsgbot>	 !log aokoth@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[15:57:01] <wikibugs>	 (03CR) 10Majavah: [C:03+2] openstack: wmfkeystonehooks: Use project name in created DNS zone names [puppet] - 10https://gerrit.wikimedia.org/r/1039231 (https://phabricator.wikimedia.org/T343158) (owner: 10Majavah)
[15:57:07] <logmsgbot>	 !log aokoth@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[15:57:17] <wikibugs>	 (03PS2) 10Ebrahim: Enable numeric sorting for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039213 (https://phabricator.wikimedia.org/T366703)
[15:57:25] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265)
[15:57:25] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: mw-debug: add statsd service everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039234 (https://phabricator.wikimedia.org/T365265)
[15:58:23] <logmsgbot>	 !log aokoth@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[15:58:44] <logmsgbot>	 !log aokoth@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[15:59:31] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1082.eqiad.wmnet
[15:59:34] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[15:59:47] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[15:59:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T352010)', diff saved to https://phabricator.wikimedia.org/P64116 and previous config saved to /var/cache/conftool/dbconfig/20240605-155955-ladsgroup.json
[15:59:58] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[16:01:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P64117 and previous config saved to /var/cache/conftool/dbconfig/20240605-160116-ladsgroup.json
[16:01:20] <logmsgbot>	 !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[16:01:26] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864355 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad....
[16:01:26] <logmsgbot>	 !log aokoth@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[16:01:46] <logmsgbot>	 !log aokoth@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[16:04:40] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:05:09] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[16:05:49] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[16:06:25] <jayme>	 James_F: ^
[16:06:54] <James_F>	 Aha.
[16:06:57] <James_F>	 And it worked?
[16:07:02] <jayme>	 oh, sorry...yes :D
[16:07:07] <James_F>	 :-D
[16:07:11] <James_F>	 Excellent, thank you!
[16:07:24] <jayme>	 yw
[16:08:02] <wikibugs>	 (03CR) 10Urbanecm: "That was actually not true, fwiw :). operations/mediawiki-config is very aggressive about rebase warnings, and it shows them even when the" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038701 (https://phabricator.wikimedia.org/T360954) (owner: 10Urbanecm)
[16:08:49] <logmsgbot>	 !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[16:09:09] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release wikifunctions/main-orchestrator on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[16:09:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:10:00] <logmsgbot>	 !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[16:10:16] <logmsgbot>	 !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[16:10:29] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[16:10:42] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864398 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq...
[16:10:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:11:47] <logmsgbot>	 !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[16:12:16] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2024-05-28-185827 to 2024-05-31-163732 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039221 (https://phabricator.wikimedia.org/T360676) (owner: 10Jforrester)
[16:12:29] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1032.eqiad.wmnet
[16:13:07] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-05-28-185827 to 2024-05-31-163732 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039221 (https://phabricator.wikimedia.org/T360676) (owner: 10Jforrester)
[16:13:26] <wikibugs>	 (03PS13) 10Gergő Tisza: [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162)
[16:14:09] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[16:15:39] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[16:15:48] <jinxer-wm>	 FIRING: [2x] PuppetDisabled: Puppet disabled on mc1049:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=memcached&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[16:16:14] <logmsgbot>	 !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[16:16:23] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 50%: Maint over', diff saved to https://phabricator.wikimedia.org/P64118 and previous config saved to /var/cache/conftool/dbconfig/20240605-161622-ladsgroup.json
[16:18:51] <logmsgbot>	 !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[16:18:53] <logmsgbot>	 !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1032.eqiad.wmnet
[16:18:58] <logmsgbot>	 !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[16:19:40] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:20:57] <logmsgbot>	 !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[16:23:13] <wikibugs>	 (03PS1) 10JHathaway: mw1365: Move outbound email to mx-out{1001,2001}.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1039245 (https://phabricator.wikimedia.org/T365395)
[16:24:16] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039245 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[16:24:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:24:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:25:59] <wikibugs>	 (03PS1) 10Milimetric: dumps/other: remove unused links [puppet] - 10https://gerrit.wikimedia.org/r/1039246
[16:26:57] <wikibugs>	 (03PS2) 10Milimetric: dumps/other: remove unused links [puppet] - 10https://gerrit.wikimedia.org/r/1039246
[16:29:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:29:35] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "compiling works now and shows noop on active prod host, https://puppet-compiler.wmflabs.org/output/1037621/2761/phab2002.codfw.wmnet/index" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[16:30:02] <wikibugs>	 (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/1039245 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[16:30:03] <wikibugs>	 (03PS1) 10Hnowlan: kask: add mesh configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T36399)
[16:30:04] <jouncebot>	 Amir1: I, the Bot under the Fountain, call upon thee, The Deployer, to do One-off deployment for T365155 deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1630).
[16:30:04] <jouncebot>	 dr0ptp4kt: A patch you scheduled for One-off deployment for T365155 is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:30:05] <stashbot>	 T365155: Text id verification makes dumps skip many good rows - https://phabricator.wikimedia.org/T365155
[16:30:54] <Amir1>	 o/
[16:30:58] <Amir1>	 I deploy now
[16:31:04] <James_F>	 +1
[16:31:29] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P64119 and previous config saved to /var/cache/conftool/dbconfig/20240605-163129-ladsgroup.json
[16:32:08] <wikibugs>	 (03PS3) 10Dr0ptp4kt: Bump XML dump schema to version 0.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038392 (https://phabricator.wikimedia.org/T365155)
[16:32:55] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reboot-single for host kubestage1003.eqiad.wmnet
[16:32:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038392 (https://phabricator.wikimedia.org/T365155) (owner: 10Dr0ptp4kt)
[16:33:36] <wikibugs>	 (03Merged) 10jenkins-bot: Bump XML dump schema to version 0.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038392 (https://phabricator.wikimedia.org/T365155) (owner: 10Dr0ptp4kt)
[16:34:00] <wikibugs>	 (03PS2) 10CDanis: kask: add mesh configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan)
[16:34:05] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1038392|Bump XML dump schema to version 0.11 (T365155)]]
[16:34:25] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:34:44] <wikibugs>	 (03PS3) 10Milimetric: dumps/other: remove unused links [puppet] - 10https://gerrit.wikimedia.org/r/1039246
[16:34:55] <logmsgbot>	 !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[16:35:00] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864582 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad....
[16:36:34] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup and dr0ptp4kt: Backport for [[gerrit:1038392|Bump XML dump schema to version 0.11 (T365155)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:36:37] <stashbot>	 T365155: Text id verification makes dumps skip many good rows - https://phabricator.wikimedia.org/T365155
[16:37:22] <wikibugs>	 (03PS2) 10Dreamy Jazz: [CheckUser] Stop writing old for event tables migration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038740 (https://phabricator.wikimedia.org/T360685)
[16:37:27] <wikibugs>	 (03PS2) 10Dreamy Jazz: [CheckUser] Stop writing old for event tables migration on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038741 (https://phabricator.wikimedia.org/T360685)
[16:37:59] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[16:38:52] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[16:38:56] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864607 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq...
[16:39:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:40:41] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestage1003.eqiad.wmnet
[16:40:55] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:42:13] <mutante>	 ^ this has been alerting every 5 minutes for like 6 hours
[16:42:41] <mutante>	 would be nice if we could reduce the noise by a downtime or something
[16:43:26] <mutante>	 is it notifying someone in other ways?
[16:43:46] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup and dr0ptp4kt: Continuing with sync
[16:45:19] <wikibugs>	 (03CR) 10Dzahn: "We have an alert about this every 5 minutes:" [puppet] - 10https://gerrit.wikimedia.org/r/1038329 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis)
[16:45:50] <logmsgbot>	 !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[16:46:05] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad....
[16:46:13] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1001']
[16:46:20] <wikibugs>	 (03PS1) 10JHathaway: phab: fix ferm ensure [puppet] - 10https://gerrit.wikimedia.org/r/1039248 (https://phabricator.wikimedia.org/T365395)
[16:46:35] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P64120 and previous config saved to /var/cache/conftool/dbconfig/20240605-164635-ladsgroup.json
[16:46:38] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039248 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[16:46:46] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] phab: fix ferm ensure [puppet] - 10https://gerrit.wikimedia.org/r/1039248 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[16:47:46] <wikibugs>	 (03PS16) 10DCausse: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper)
[16:48:50] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[16:49:40] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: rsync-published.service on stat1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:51:52] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] phab: fix ferm ensure [puppet] - 10https://gerrit.wikimedia.org/r/1039248 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[16:52:28] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1038392|Bump XML dump schema to version 0.11 (T365155)]] (duration: 18m 23s)
[16:52:30] <wikibugs>	 10ops-codfw, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T366724 (10phaultfinder) 03NEW
[16:52:31] <stashbot>	 T365155: Text id verification makes dumps skip many good rows - https://phabricator.wikimedia.org/T365155
[16:53:35] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on stat1004.eqiad.wmnet with reason: decom T353785
[16:53:38] <stashbot>	 T353785: Decom EOL stats servers stat100[4-7] - https://phabricator.wikimedia.org/T353785
[16:53:48] <logmsgbot>	 !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on stat1004.eqiad.wmnet with reason: decom T353785
[16:54:13] <mutante>	 !log downtimed stat1004 for 10 days to avoid alerting spam during decom process - T353785
[16:54:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:40] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: rsync-published.service on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:55:00] <wikibugs>	 (03PS1) 10JHathaway: Revert "Revert "Revert "Revert "phabricator: Move outbound email to mx-out{1001,2001}.wikimedia.org"""" [puppet] - 10https://gerrit.wikimedia.org/r/1038844
[16:55:10] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038844 (owner: 10JHathaway)
[16:55:11] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "noop on prod server, removed new firewall rule on failover server, all good" [puppet] - 10https://gerrit.wikimedia.org/r/1039248 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[16:55:55] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: rsync-published.service on stat1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:56:09] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on stat1005.eqiad.wmnet with reason: decom T353785
[16:56:15] <wikibugs>	 (03CR) 10Hnowlan: services: add data-gateway service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French)
[16:56:22] <logmsgbot>	 !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on stat1005.eqiad.wmnet with reason: decom T353785
[16:56:51] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-ctrl1001']
[17:00:04] <jouncebot>	 Amir1: Your horoscope predicts another One-off deployment for T365155 deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1630).
[17:00:04] <jouncebot>	 dr0ptp4kt: A patch you scheduled for One-off deployment for T365155 is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1700)
[17:00:05] <stashbot>	 T365155: Text id verification makes dumps skip many good rows - https://phabricator.wikimedia.org/T365155
[17:02:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T364299)', diff saved to https://phabricator.wikimedia.org/P64121 and previous config saved to /var/cache/conftool/dbconfig/20240605-170200-marostegui.json
[17:02:05] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[17:02:55] <wikibugs>	 (03PS2) 10JHathaway: Revert "Revert "Revert "Revert "phabricator: Move outbound email to mx-out{1001,2001}.wikimedia.org"""" [puppet] - 10https://gerrit.wikimedia.org/r/1038844 (https://phabricator.wikimedia.org/T365395)
[17:03:02] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038844 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[17:04:29] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[17:04:35] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864820 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq...
[17:04:40] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: rsync-published.service on stat1006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:05:45] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on stat1006.eqiad.wmnet with reason: decom T353785
[17:05:48] <stashbot>	 T353785: Decom EOL stats servers stat100[4-7] - https://phabricator.wikimedia.org/T353785
[17:05:58] <logmsgbot>	 !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on stat1006.eqiad.wmnet with reason: decom T353785
[17:06:34] <mutante>	 every 5 min for 5 hosts is a lot of noise
[17:06:41] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on stat1007.eqiad.wmnet with reason: decom T353785
[17:06:46] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1033.eqiad.wmnet
[17:06:53] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] Revert "Revert "Revert "Revert "phabricator: Move outbound email to mx-out{1001,2001}.wikimedia.org"""" [puppet] - 10https://gerrit.wikimedia.org/r/1038844 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[17:06:54] <logmsgbot>	 !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on stat1007.eqiad.wmnet with reason: decom T353785
[17:09:39] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T352010)', diff saved to https://phabricator.wikimedia.org/P64122 and previous config saved to /var/cache/conftool/dbconfig/20240605-170938-ladsgroup.json
[17:09:42] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[17:10:17] <logmsgbot>	 !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1033.eqiad.wmnet
[17:10:41] <jhathaway>	 !log phabricator email now egressing via mx-out{1001,2001}.wikimedia.org, which should solve the SPF warnings in your inbox
[17:10:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:15] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, 13Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9864875 (10jhathaway)
[17:12:49] <logmsgbot>	 !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[17:12:55] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9864877 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad....
[17:13:09] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1001']
[17:13:50] <wikibugs>	 (03PS1) 10Ladsgroup: Stop writing to pagelinks old columns in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039256 (https://phabricator.wikimedia.org/T352010)
[17:17:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P64123 and previous config saved to /var/cache/conftool/dbconfig/20240605-171708-marostegui.json
[17:17:57] <wikibugs>	 (03PS1) 10Btullis: Revert "Temporarily disable XML dumps on snapshot hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1038845 (https://phabricator.wikimedia.org/T365155)
[17:18:19] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Revert "Temporarily disable XML dumps on snapshot hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1038845 (https://phabricator.wikimedia.org/T365155) (owner: 10Btullis)
[17:19:05] <wikibugs>	 (03PS2) 10Btullis: Revert "Temporarily disable XML dumps on snapshot hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1038845 (https://phabricator.wikimedia.org/T365155)
[17:19:31] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038845 (https://phabricator.wikimedia.org/T365155) (owner: 10Btullis)
[17:24:47] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P64124 and previous config saved to /var/cache/conftool/dbconfig/20240605-172446-ladsgroup.json
[17:24:48] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[17:25:03] <wikibugs>	 (03Abandoned) 10Ebrahim: Enable numeric sorting for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039213 (https://phabricator.wikimedia.org/T366703) (owner: 10Ebrahim)
[17:25:21] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9864928 (10colewhite) The correct link to the docs for setting up kerberos: https://wikitech.wikimedia.o...
[17:27:13] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-ctrl1001']
[17:27:52] <Amir1>	 jouncebot: nowandnext
[17:27:52] <jouncebot>	 For the next 0 hour(s) and 2 minute(s): One-off deployment for T365155 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1630)
[17:27:52] <jouncebot>	 For the next 0 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1700)
[17:27:52] <jouncebot>	 In 0 hour(s) and 32 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1800)
[17:27:52] <stashbot>	 T365155: Text id verification makes dumps skip many good rows - https://phabricator.wikimedia.org/T365155
[17:28:15] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Stop writing to pagelinks old columns in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039256 (https://phabricator.wikimedia.org/T352010) (owner: 10Ladsgroup)
[17:28:56] <wikibugs>	 (03Merged) 10jenkins-bot: Stop writing to pagelinks old columns in enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039256 (https://phabricator.wikimedia.org/T352010) (owner: 10Ladsgroup)
[17:29:04] <James_F>	 \o/
[17:29:23] <James_F>	 And then just s2 to do? Nice.
[17:29:48] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1039256|Stop writing to pagelinks old columns in enwiki (T352010)]]
[17:29:51] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[17:30:19] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Revert "Temporarily disable XML dumps on snapshot hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1038845 (https://phabricator.wikimedia.org/T365155) (owner: 10Btullis)
[17:30:40] <wikibugs>	 (03CR) 10Xcollazo: [C:03+1] Revert "Temporarily disable XML dumps on snapshot hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1038845 (https://phabricator.wikimedia.org/T365155) (owner: 10Btullis)
[17:30:43] <wikibugs>	 (03PS17) 10Ryan Kemper: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069)
[17:31:52] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[17:32:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P64125 and previous config saved to /var/cache/conftool/dbconfig/20240605-173216-marostegui.json
[17:32:32] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1039256|Stop writing to pagelinks old columns in enwiki (T352010)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[17:33:38] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Continuing with sync
[17:34:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[17:36:27] <wikibugs>	 (03CR) 10Scott French: "Thank you all for the reviews. I'll aim to get this merged today and the service turned up in staging." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French)
[17:38:17] <wikibugs>	 (03PS1) 10Bking: an-db1001: add `airflow_test_k8s` user and db [puppet] - 10https://gerrit.wikimedia.org/r/1039260 (https://phabricator.wikimedia.org/T363001)
[17:38:49] <wikibugs>	 (03CR) 10Scott French: [C:03+2] services: add data-gateway service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French)
[17:39:00] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039260 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[17:39:45] <wikibugs>	 (03Merged) 10jenkins-bot: services: add data-gateway service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French)
[17:39:55] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P64126 and previous config saved to /var/cache/conftool/dbconfig/20240605-173954-ladsgroup.json
[17:42:07] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1039256|Stop writing to pagelinks old columns in enwiki (T352010)]] (duration: 12m 19s)
[17:42:12] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[17:46:04] <icinga-wm_>	 PROBLEM - Host logging-hd1001 is DOWN: PING CRITICAL - Packet loss = 100%
[17:47:09] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Nice, thanks Bking." [puppet] - 10https://gerrit.wikimedia.org/r/1039260 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[17:47:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T364299)', diff saved to https://phabricator.wikimedia.org/P64127 and previous config saved to /var/cache/conftool/dbconfig/20240605-174724-marostegui.json
[17:47:27] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2199.codfw.wmnet with reason: Maintenance
[17:47:28] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[17:47:32] <icinga-wm_>	 RECOVERY - Host logging-hd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[17:47:40] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2199.codfw.wmnet with reason: Maintenance
[17:48:40] <icinga-wm_>	 PROBLEM - OpenSearch health check for shards on 9200 on logging-hd1001 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f0e245b8210: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech
[17:48:40] <icinga-wm_>	 a.org/wiki/Search%23Administration
[17:50:40] <icinga-wm_>	 RECOVERY - OpenSearch health check for shards on 9200 on logging-hd1001 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: yellow, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 755, active_shards: 1525, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 247, delayed_unassigne
[17:50:40] <icinga-wm_>	  0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.06094808126412 https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:50:58] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.dhcp for host wikikube-ctrl1001.eqiad.wmnet
[17:55:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T352010)', diff saved to https://phabricator.wikimedia.org/P64128 and previous config saved to /var/cache/conftool/dbconfig/20240605-175503-ladsgroup.json
[17:55:07] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[17:57:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T352010)', diff saved to https://phabricator.wikimedia.org/P64129 and previous config saved to /var/cache/conftool/dbconfig/20240605-175725-ladsgroup.json
[18:00:00] <icinga-wm_>	 PROBLEM - Host logging-hd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[18:00:04] <jouncebot>	 dduvall and dancy: Time to snap out of that daydream and deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T1800).
[18:01:34] <icinga-wm_>	 RECOVERY - Host logging-hd1002 is UP: PING OK - Packet loss = 0%, RTA = 5.10 ms
[18:02:38] <icinga-wm_>	 PROBLEM - OpenSearch health check for shards on 9200 on logging-hd1002 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7facc93c6e10: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech
[18:02:38] <icinga-wm_>	 a.org/wiki/Search%23Administration
[18:04:02] <dancy>	 o/
[18:04:38] <icinga-wm_>	 RECOVERY - OpenSearch health check for shards on 9200 on logging-hd1002 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: yellow, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 755, active_shards: 1524, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 248, delayed_unassigne
[18:04:38] <icinga-wm_>	  0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.00451467268623 https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:06:44] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[18:07:36] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.reboot-single for host vrts1001.eqiad.wmnet
[18:11:41] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts1001.eqiad.wmnet
[18:12:16] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/data-gateway: apply
[18:12:19] <wikibugs>	 (03CR) 10Xcollazo: "Are these dead links then?" [puppet] - 10https://gerrit.wikimedia.org/r/1039246 (owner: 10Milimetric)
[18:12:34] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P64130 and previous config saved to /var/cache/conftool/dbconfig/20240605-181234-ladsgroup.json
[18:13:02] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply
[18:13:06] <wikibugs>	 (03PS18) 10Ryan Kemper: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069)
[18:18:42] <wikibugs>	 (03PS19) 10Ryan Kemper: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069)
[18:21:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:26:52] <wikibugs>	 (03PS20) 10Ryan Kemper: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069)
[18:27:42] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P64131 and previous config saved to /var/cache/conftool/dbconfig/20240605-182742-ladsgroup.json
[18:30:27] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper)
[18:32:31] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] Move ncmonitor credentials to its own profile [labs/private] - 10https://gerrit.wikimedia.org/r/1037857 (owner: 10BCornwall)
[18:32:33] <wikibugs>	 (03CR) 10BCornwall: [V:03+2 C:03+2] Move ncmonitor credentials to its own profile [labs/private] - 10https://gerrit.wikimedia.org/r/1037857 (owner: 10BCornwall)
[18:39:45] <jinxer-wm>	 FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[18:40:16] <icinga-wm_>	 PROBLEM - Host logging-hd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[18:41:42] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.43.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039272 (https://phabricator.wikimedia.org/T361402)
[18:41:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group1 wikis to 1.43.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039272 (https://phabricator.wikimedia.org/T361402) (owner: 10TrainBranchBot)
[18:41:53] <wikibugs>	 (03PS21) 10Ryan Kemper: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069)
[18:42:24] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.43.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039272 (https://phabricator.wikimedia.org/T361402) (owner: 10TrainBranchBot)
[18:42:48] <wikibugs>	 (03PS22) 10Ryan Kemper: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069)
[18:42:50] <icinga-wm_>	 RECOVERY - Host logging-hd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[18:42:50] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T352010)', diff saved to https://phabricator.wikimedia.org/P64132 and previous config saved to /var/cache/conftool/dbconfig/20240605-184250-ladsgroup.json
[18:42:55] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[18:44:45] <jinxer-wm>	 RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[18:49:25] <wikibugs>	 (03CR) 10Bking: [C:03+2] an-db1001: add `airflow_test_k8s` user and db [puppet] - 10https://gerrit.wikimedia.org/r/1039260 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[18:51:05] <wikibugs>	 (03PS1) 10BCornwall: ncmonitor: Reformat credentials [labs/private] - 10https://gerrit.wikimedia.org/r/1039275
[18:51:17] <wikibugs>	 (03CR) 10BCornwall: [V:03+2 C:03+2] ncmonitor: Reformat credentials [labs/private] - 10https://gerrit.wikimedia.org/r/1039275 (owner: 10BCornwall)
[18:53:03] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[18:53:18] <logmsgbot>	 !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.43.0-wmf.8  refs T361402
[18:53:21] <stashbot>	 T361402: 1.43.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T361402
[18:53:45] <jinxer-wm>	 FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[18:57:15] <jinxer-wm>	 FIRING: [2x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[18:57:40] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: eqiad, codfw 2 VM request for postfix mx-out - https://phabricator.wikimedia.org/T361750#9865257 (10jhathaway)
[18:57:41] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Mail: Provision mx-out - https://phabricator.wikimedia.org/T325407#9865258 (10jhathaway)
[18:58:36] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/data-gateway: apply
[18:58:45] <jinxer-wm>	 FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[18:58:56] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9865254 (10cmooney) a:05MatthewVernon→03cmooney
[19:01:22] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9865262 (10cmooney) p:05Triage→03Medium a:05MatthewVernon→03cmooney
[19:02:00] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e6-eqiad - https://phabricator.wikimedia.org/T365987#9865284 (10cmooney) p:05Triage→03Medium a:05ABran-WMF→03cmooney
[19:02:15] <jinxer-wm>	 RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[19:03:36] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9865298 (10cmooney) p:05Triage→03Medium a:05MatthewVernon→03cmooney
[19:03:39] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9865304 (10cmooney)
[19:03:45] <jinxer-wm>	 RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[19:04:03] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9865307 (10cmooney) >>! In T365988#9837257, @MatthewVernon wrote: > From the swift POV, this is just checking the cluster is hap...
[19:06:38] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9865316 (10cmooney) p:05Triage→03Medium a:05ABran-WMF→03cmooney
[19:06:50] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9865331 (10cmooney)
[19:08:07] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9865335 (10wiki_willy) Hi @dcaro - just following up on this.  Can you provide the racking information for us, to start this install?  Thanks, Willy
[19:08:30] <jinxer-wm>	 FIRING: [2x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[19:08:59] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9865360 (10cmooney)
[19:09:39] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply
[19:09:51] <wikibugs>	 (03PS3) 10BCornwall: ncmonitor: Add SSH credentials support [puppet] - 10https://gerrit.wikimedia.org/r/1038890 (https://phabricator.wikimedia.org/T355189)
[19:10:01] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9865354 (10cmooney) p:05Triage→03Medium a:05ABran-WMF→03cmooney
[19:11:31] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9865362 (10cmooney) p:05Triage→03Medium a:05ABran-WMF→03cmooney
[19:12:07] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: Site: eqiad, codfw 2 VM request for postfix mx-in - https://phabricator.wikimedia.org/T366744 (10jhathaway) 03NEW
[19:12:30] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: Site: eqiad, codfw 2 VM request for postfix mx-in - https://phabricator.wikimedia.org/T366744#9865408 (10jhathaway) p:05Triage→03Medium a:03jhathaway
[19:12:39] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9865379 (10cmooney) p:05Triage→03Medium a:05ABran-WMF→03cmooney
[19:13:15] <jinxer-wm>	 RESOLVED: [2x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[19:13:17] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9865417 (10cmooney)
[19:13:27] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad - https://phabricator.wikimedia.org/T365997#9865412 (10cmooney) p:05Triage→03Medium a:05ABran-WMF→03cmooney
[19:16:57] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9865429 (10cmooney) p:05Triage→03Medium a:05ABran-WMF→03cmooney
[19:17:04] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9865435 (10cmooney) I spoke to @Jclark-ctr earlier, we will do this commencing at 12:00 UTC tomorrow Thurs 6th Jun.
[19:21:28] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Mail: Provision mx-in - https://phabricator.wikimedia.org/T325406#9865447 (10jhathaway)
[19:22:13] <wikibugs>	 (03PS1) 10JHathaway: email: add node definitions for mx-in boxen [puppet] - 10https://gerrit.wikimedia.org/r/1039280 (https://phabricator.wikimedia.org/T325406)
[19:24:50] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] email: add node definitions for mx-in boxen [puppet] - 10https://gerrit.wikimedia.org/r/1039280 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[19:26:13] <wikibugs>	 (03PS4) 10BCornwall: ncmonitor: Add SSH credentials support [puppet] - 10https://gerrit.wikimedia.org/r/1038890 (https://phabricator.wikimedia.org/T355189)
[19:27:27] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[19:28:12] <wikibugs>	 (03PS5) 10BCornwall: ncmonitor: Add SSH credentials support [puppet] - 10https://gerrit.wikimedia.org/r/1038890 (https://phabricator.wikimedia.org/T355189)
[19:28:13] <wikibugs>	 (03CR) 10Cathal Mooney: Include vlans with an IRB int in device vlans even if not on L2 port (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal Mooney)
[19:29:06] <wikibugs>	 (03PS1) 10BCornwall: ncmonitor: Move ssh key block to end of the file [labs/private] - 10https://gerrit.wikimedia.org/r/1039281
[19:29:21] <wikibugs>	 (03CR) 10BCornwall: [V:03+2 C:03+2] ncmonitor: Move ssh key block to end of the file [labs/private] - 10https://gerrit.wikimedia.org/r/1039281 (owner: 10BCornwall)
[19:36:51] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.ganeti.makevm for new host mx-in1001.wikimedia.org
[19:36:53] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.dns.netbox
[19:38:58] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM mx-in1001.wikimedia.org - jhathaway@cumin1002"
[19:43:54] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM mx-in1001.wikimedia.org - jhathaway@cumin1002"
[19:43:54] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:43:55] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.dns.wipe-cache mx-in1001.wikimedia.org on all recursors
[19:43:58] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mx-in1001.wikimedia.org on all recursors
[19:44:11] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9865522 (10cmooney) >>! In T360789#9855905, @Papaul wrote: > @cmooney all good on lsw1-d4, lsw1-c2 and lsw1-d8  Thanks!  Confirmed all looks good.  What was...
[19:44:24] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM mx-in1001.wikimedia.org - jhathaway@cumin1002"
[19:45:10] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM mx-in1001.wikimedia.org - jhathaway@cumin1002"
[19:46:42] <wikibugs>	 (03PS1) 10BCornwall: ncmonitor: Temporary removal of passwords lookup [puppet] - 10https://gerrit.wikimedia.org/r/1039285
[19:47:01] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] ncmonitor: Temporary removal of passwords lookup [puppet] - 10https://gerrit.wikimedia.org/r/1039285 (owner: 10BCornwall)
[19:47:04] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host mx-in1001.wikimedia.org with OS bookworm
[19:47:11] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: Site: eqiad, codfw 2 VM request for postfix mx-in - https://phabricator.wikimedia.org/T366744#9865533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin1002 for host mx-in1001.wikimedia.org with OS bookworm
[19:47:13] <wikibugs>	 (03CR) 10BCornwall: [V:03+2 C:03+2] ncmonitor: Temporary removal of passwords lookup [puppet] - 10https://gerrit.wikimedia.org/r/1039285 (owner: 10BCornwall)
[19:50:02] <wikibugs>	 (03PS6) 10BCornwall: ncmonitor: Add SSH credentials support [puppet] - 10https://gerrit.wikimedia.org/r/1038890 (https://phabricator.wikimedia.org/T355189)
[19:52:05] <wikibugs>	 (03PS1) 10Urbanecm: Add throttle exception for an upcoming workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039287 (https://phabricator.wikimedia.org/T366748)
[19:52:43] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add throttle exception for an upcoming workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039287 (https://phabricator.wikimedia.org/T366748) (owner: 10Urbanecm)
[19:54:04] <wikibugs>	 (03PS2) 10Urbanecm: Add throttle exception for an upcoming workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039287 (https://phabricator.wikimedia.org/T366748)
[19:57:02] <urbanecm>	 the fancy scheduling tool doesn't seem to be doing anything :(
[19:57:14] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mx-in1001.wikimedia.org with reason: host reimage
[19:58:10] <wikibugs>	 (03PS1) 10BCornwall: ncmonitor: Add test key to solve pcc error [labs/private] - 10https://gerrit.wikimedia.org/r/1039288
[19:59:07] <wikibugs>	 (03CR) 10BCornwall: [V:03+2 C:03+2] ncmonitor: Add test key to solve pcc error [labs/private] - 10https://gerrit.wikimedia.org/r/1039288 (owner: 10BCornwall)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T2000).
[20:00:05] <jouncebot>	 Dreamy_Jazz and sergi0: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:09] <urbanecm>	 i can deploy today
[20:00:15] <sergi0>	 hello
[20:00:15] <Dreamy_Jazz>	 \o
[20:00:15] <urbanecm>	 hi Dreamy_Jazz and sergi0!
[20:00:22] <Dreamy_Jazz>	 Hi there.
[20:00:30] <urbanecm>	 sergi0: do you want to do the backports for the testwiki as well?
[20:00:36] <urbanecm>	 or just beta today?
[20:01:00] <wikibugs>	 (03PS3) 10Dreamy Jazz: [CheckUser] Stop writing old for event tables migration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038740 (https://phabricator.wikimedia.org/T360685)
[20:01:01] <wikibugs>	 (03PS7) 10BCornwall: ncmonitor: Add SSH credentials support [puppet] - 10https://gerrit.wikimedia.org/r/1038890 (https://phabricator.wikimedia.org/T355189)
[20:01:04] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] [CheckUser] Stop writing old for event tables migration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038740 (https://phabricator.wikimedia.org/T360685) (owner: 10Dreamy Jazz)
[20:01:11] <wikibugs>	 (03PS5) 10Urbanecm: Growth: Use `growthexperiments` DB list for enabling GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038882 (https://phabricator.wikimedia.org/T364892)
[20:01:14] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Growth: Use `growthexperiments` DB list for enabling GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038882 (https://phabricator.wikimedia.org/T364892) (owner: 10Urbanecm)
[20:01:17] <sergi0>	 urbanecm: I'd prefer just beta, so we can accumulate other possible backports to testwiki
[20:01:46] <wikibugs>	 (03Merged) 10jenkins-bot: [CheckUser] Stop writing old for event tables migration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038740 (https://phabricator.wikimedia.org/T360685) (owner: 10Dreamy Jazz)
[20:01:51] <urbanecm>	 sergi0: ack. we could do just backports to save time tomorrow, or we can get everything tomorrow too
[20:01:57] <wikibugs>	 (03Merged) 10jenkins-bot: Growth: Use `growthexperiments` DB list for enabling GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038882 (https://phabricator.wikimedia.org/T364892) (owner: 10Urbanecm)
[20:01:58] <wikibugs>	 (03PS9) 10Sergio Gimeno: [Beta] Enable CommunityConfiguration extension in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892)
[20:02:01] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] [Beta] Enable CommunityConfiguration extension in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) (owner: 10Sergio Gimeno)
[20:02:16] <Dreamy_Jazz>	 I'll be able to test my config patch by inspecting the DB after performing a log action.
[20:02:38] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mx-in1001.wikimedia.org with reason: host reimage
[20:02:40] <wikibugs>	 (03Merged) 10jenkins-bot: [Beta] Enable CommunityConfiguration extension in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) (owner: 10Sergio Gimeno)
[20:02:45] <urbanecm>	 Dreamy_Jazz: ack
[20:03:38] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1038740|[CheckUser] Stop writing old for event tables migration on group0 (T360685)]], [[gerrit:1038882|Growth: Use `growthexperiments` DB list for enabling GrowthExperiments (T364892)]], [[gerrit:1035473|[Beta] Enable CommunityConfiguration extension in all wikis (T364892)]]
[20:03:42] <stashbot>	 T360685: Stop writing old for event table migration on WMF wikis - https://phabricator.wikimedia.org/T360685
[20:03:42] <stashbot>	 T364892: Enable CommunityConfiguration on all beta wikis with GrowthExperiments - https://phabricator.wikimedia.org/T364892
[20:03:52] <wikibugs>	 (03PS8) 10BCornwall: ncmonitor: Add SSH credentials support [puppet] - 10https://gerrit.wikimedia.org/r/1038890 (https://phabricator.wikimedia.org/T355189)
[20:04:40] <wikibugs>	 (03PS1) 10CDanis: otelcol: filter out sessionstore user IDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039292 (https://phabricator.wikimedia.org/T366750)
[20:05:50] <wikibugs>	 (03CR) 10Sergio Gimeno: [C:03+1] Drop logging level for unsupported providers to DEBUG [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038714 (https://phabricator.wikimedia.org/T366519) (owner: 10Urbanecm)
[20:06:09] <wikibugs>	 (03CR) 10Sergio Gimeno: [C:03+1] Improve navigation link handling in CommunityConfiguration [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038843 (https://phabricator.wikimedia.org/T364938) (owner: 10Sergio Gimeno)
[20:06:18] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and sgimeno and dreamyjazz: Backport for [[gerrit:1038740|[CheckUser] Stop writing old for event tables migration on group0 (T360685)]], [[gerrit:1038882|Growth: Use `growthexperiments` DB list for enabling GrowthExperiments (T364892)]], [[gerrit:1035473|[Beta] Enable CommunityConfiguration extension in all wikis (T364892)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/M
[20:06:18] <logmsgbot>	 wdebug)
[20:06:22] <wikibugs>	 (03CR) 10Kosta Harlan: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan)
[20:06:26] <urbanecm>	 Dreamy_Jazz: can you do the testing, please?
[20:06:31] <Dreamy_Jazz>	 Sure.
[20:06:41] <wikibugs>	 (03PS1) 10BCornwall: ncmonitor: Fix namespacing of keys [labs/private] - 10https://gerrit.wikimedia.org/r/1039293
[20:07:11] <urbanecm>	 sergi0: and let's spot test the growthexperiments list part as well (just that Growth features don't disappear, i guess)
[20:07:19] <wikibugs>	 (03PS2) 10BCornwall: ncmonitor: Fix namespacing of keys [labs/private] - 10https://gerrit.wikimedia.org/r/1039293
[20:07:48] <wikibugs>	 (03CR) 10BCornwall: [V:03+2 C:03+2] ncmonitor: Fix namespacing of keys [labs/private] - 10https://gerrit.wikimedia.org/r/1039293 (owner: 10BCornwall)
[20:08:05] <sergi0>	 urbanecm: alright
[20:08:20] <Dreamy_Jazz>	 urbanecm: Test successful.
[20:08:21] <wikibugs>	 (03PS16) 10Kosta Harlan: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272)
[20:08:25] <urbanecm>	 Dreamy_Jazz: great!
[20:10:01] <sergi0>	 urbanecm: as for running the migration script, how should we proceed? Within this window? At least for testwiki and some betas?
[20:10:18] <wikibugs>	 (03PS9) 10BCornwall: ncmonitor: Add SSH credentials support [puppet] - 10https://gerrit.wikimedia.org/r/1038890 (https://phabricator.wikimedia.org/T355189)
[20:10:41] <urbanecm>	 sergi0: feel free to run the script in beta at any time (during the window or after it, up2you). for testwiki, i think the script cannot be executed until we enable the feature there?
[20:10:55] <urbanecm>	 (I'm OK with doing that today, but you seemed like you want to wait)
[20:11:19] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2783/co" [puppet] - 10https://gerrit.wikimedia.org/r/1038890 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall)
[20:12:42] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/data-gateway: apply
[20:14:07] <urbanecm>	 sergi0: i checked a couple of wikis, GE features appear available. ok to sync from your side?
[20:15:02] <sergi0>	 urbanecm: Yes, I'll start running the script after. For testwiki I prefer to wait.
[20:15:08] <urbanecm>	 ok
[20:15:53] <Dreamy_Jazz>	 Interesting that the logmsgbot message ended up having the URL truncated in the on-wiki phab message
[20:16:04] <jinxer-wm>	 FIRING: [2x] PuppetDisabled: Puppet disabled on mc1049:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=memcached&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[20:16:14] <Dreamy_Jazz>	 i.e. https://wikitech.wikimedia.org/wiki/M being the URL
[20:16:15] <urbanecm>	 Dreamy_Jazz: the mwdebug one?
[20:16:22] <urbanecm>	 yeah, that's for irc max length constraints
[20:16:39] <urbanecm>	 i know that for ~3 patches, the URL is the only thing that gets cut out sometimes (depending on commit messages)
[20:16:42] <Dreamy_Jazz>	 Ah yeah, that would explain it.
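Note on the length constraint discussed above: the IRC protocol caps each message at 512 bytes including the trailing CR-LF, so a long !log line has to be split or cut. The following is a minimal Python sketch of splitting on a byte budget. The function name split_for_irc and the budget values are assumptions for illustration only; this is not how logmsgbot is actually implemented.

```python
def split_for_irc(text: str, budget: int = 400) -> list[str]:
    """Split text into chunks whose UTF-8 encoding fits within `budget` bytes.

    `budget` is a hypothetical payload allowance: the 512-byte protocol
    limit minus room for the prefix, command, target and CR-LF.
    """
    if budget < 4:
        # A single UTF-8 character can take up to 4 bytes.
        raise ValueError("budget must hold at least one UTF-8 character")
    chunks: list[str] = []
    current = ""
    current_size = 0
    for char in text:
        size = len(char.encode("utf-8"))
        if current_size + size > budget:
            # Adding this character would overflow, so start a new chunk.
            chunks.append(current)
            current, current_size = "", 0
        current += char
        current_size += size
    if current:
        chunks.append(current)
    return chunks


# Illustrative tail of a long log message; the small budget forces a split.
message = "synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)"
for part in split_for_irc(message, budget=64):
    print(part)
```

Splitting on encoded bytes rather than characters keeps multi-byte characters whole, and a consumer that only records the first chunk would see a cut-off URL like the one mentioned above.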
[20:16:59] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and sgimeno and dreamyjazz: Continuing with sync
[20:18:21] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mx-in1001.wikimedia.org with OS bookworm
[20:18:21] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host mx-in1001.wikimedia.org
[20:18:27] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: Site: eqiad, codfw 2 VM request for postfix mx-in - https://phabricator.wikimedia.org/T366744#9865636 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin1002 for host mx-in1001.wikimedia.org with OS bookworm completed: - m...
[20:19:09] <wikibugs>	 (03PS1) 10CDanis: otelcol: filter common healthcheck spans [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039297 (https://phabricator.wikimedia.org/T366750)
[20:21:54] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.ganeti.makevm for new host mx-in2001.wikimedia.org
[20:21:55] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.dns.netbox
[20:22:57] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply
[20:23:43] <wikibugs>	 (03PS2) 10CDanis: otelcol: filter out sessionstore user IDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039292 (https://phabricator.wikimedia.org/T366750)
[20:23:43] <wikibugs>	 (03PS2) 10CDanis: otelcol: filter common healthcheck spans [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039297 (https://phabricator.wikimedia.org/T366750)
[20:24:01] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM mx-in2001.wikimedia.org - jhathaway@cumin1002"
[20:25:08] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM mx-in2001.wikimedia.org - jhathaway@cumin1002"
[20:25:08] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:25:08] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.dns.wipe-cache mx-in2001.wikimedia.org on all recursors
[20:25:12] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mx-in2001.wikimedia.org on all recursors
[20:25:34] <sergi0>	 urbanecm: It seems that GrowthExperiments complains about "Invalid suggested edits configuration". Does that mean that for prod wikis we should split each enabling and run the script in between?
[20:25:38] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM mx-in2001.wikimedia.org - jhathaway@cumin1002"
[20:25:43] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1038740|[CheckUser] Stop writing old for event tables migration on group0 (T360685)]], [[gerrit:1038882|Growth: Use `growthexperiments` DB list for enabling GrowthExperiments (T364892)]], [[gerrit:1035473|[Beta] Enable CommunityConfiguration extension in all wikis (T364892)]] (duration: 22m 04s)
[20:25:50] <urbanecm>	 sergi0: complains where?
[20:25:51] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, 13Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9865652 (10Dwisehaupt) @jhathaway Question about the routing of mail with these hosts. Currently the civicrm host receives mail...
[20:25:52] <stashbot>	 T360685: Stop writing old for event table migration on WMF wikis - https://phabricator.wikimedia.org/T360685
[20:25:52] <stashbot>	 T364892: Enable CommunityConfiguration on all beta wikis with GrowthExperiments - https://phabricator.wikimedia.org/T364892
[20:26:07] <Dreamy_Jazz>	 Thanks!
[20:26:11] <urbanecm>	 no problem Dreamy_Jazz 
[20:26:30] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM mx-in2001.wikimedia.org - jhathaway@cumin1002"
[20:26:31] <sergi0>	 urbanecm: eg: https://beta-logs.wmcloud.org/goto/d72e84a040b77bbeba4f9670e75fb0a1
[20:26:45] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host mx-in2001.wikimedia.org with OS bookworm
[20:27:03] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: Site: eqiad, codfw 2 VM request for postfix mx-in - https://phabricator.wikimedia.org/T366744#9865669 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin1002 for host mx-in2001.wikimedia.org with OS bookworm
[20:29:28] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2206.codfw.wmnet with reason: Maintenance
[20:29:41] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2206.codfw.wmnet with reason: Maintenance
[20:29:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2206 (T364299)', diff saved to https://phabricator.wikimedia.org/P64133 and previous config saved to /var/cache/conftool/dbconfig/20240605-202949-marostegui.json
[20:29:56] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[20:30:26] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] ncmonitor: Add SSH credentials support [puppet] - 10https://gerrit.wikimedia.org/r/1038890 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall)
[20:30:45] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ...
[20:30:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[20:32:15] <urbanecm>	 sergi0: good question...we probably should enable wmgUseCommunityConfiguration first, run script second and then enable the GE flag
[20:32:21] <urbanecm>	 that way, this should not happen
[20:33:11] <sergi0>	 urbanecm: ack
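The ordering proposed above can be sketched as a small check. This is a minimal Python sketch under stated assumptions: only wmgUseCommunityConfiguration is named in the discussion, while the step labels run_migration_script and enable_ge_flag and the helper order_is_safe are hypothetical placeholders, not real configuration keys or tooling.

```python
# Only wmgUseCommunityConfiguration comes from the discussion;
# the other two step labels are hypothetical placeholders.
SAFE_ORDER = [
    "enable wmgUseCommunityConfiguration",
    "run_migration_script",
    "enable_ge_flag",
]


def order_is_safe(plan: list[str]) -> bool:
    """Return True if plan contains every SAFE_ORDER step in that relative order."""
    try:
        positions = [plan.index(step) for step in SAFE_ORDER]
    except ValueError:
        # At least one required step is missing from the plan.
        return False
    return positions == sorted(positions)


staged = [
    "enable wmgUseCommunityConfiguration",
    "run_migration_script",
    "enable_ge_flag",
]
all_at_once = [
    "enable wmgUseCommunityConfiguration",
    "enable_ge_flag",
    "run_migration_script",
]
print(order_is_safe(staged))
print(order_is_safe(all_at_once))
```

The second plan models enabling both settings before the migration has run, which is the situation that produced the "Invalid suggested edits configuration" complaint on beta.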
[20:34:57] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, 13Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9865703 (10jhathaway)
[20:35:45] <jinxer-wm>	 RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ...
[20:35:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[20:36:15] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ...
[20:36:15] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[20:38:25] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, 13Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9865710 (10jhathaway) >>! In T365395#9865652, @Dwisehaupt wrote: > @jhathaway Question about the routing of mail with these host...
[20:42:54] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mx-in2001.wikimedia.org with reason: host reimage
[20:45:14] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mx-in2001.wikimedia.org with reason: host reimage
[20:46:00] <jinxer-wm>	 RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ...
[20:46:00] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[20:51:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:56:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T2100)
[21:02:19] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mx-in2001.wikimedia.org with OS bookworm
[21:02:19] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host mx-in2001.wikimedia.org
[21:02:27] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10vm-requests: Site: eqiad, codfw 2 VM request for postfix mx-in - https://phabricator.wikimedia.org/T366744#9865741 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin1002 for host mx-in2001.wikimedia.org with OS bookworm completed: - m...
[21:04:45] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ...
[21:04:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[21:07:45] <jinxer-wm>	 FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[21:08:29] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/data-gateway: apply
[21:09:45] <jinxer-wm>	 RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ...
[21:09:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[21:10:15] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ...
[21:10:15] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[21:18:39] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply
[21:30:00] <jinxer-wm>	 RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ...
[21:30:00] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[21:33:23] <wikibugs>	 (03PS12) 10Dzahn: peopleweb: introduce script to warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577
[21:34:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[21:36:35] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin2002 - T366555
[21:37:45] <jinxer-wm>	 RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[21:41:59] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin2002 - T366555
[21:42:23] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin2002 - T366555
[21:43:09] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:cassandra-dev: Hail mary - eevans@cumin1002
[21:46:54] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] peopleweb: introduce script to warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 (owner: 10Dzahn)
[21:51:45] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ...
[21:51:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[21:56:45] <jinxer-wm>	 RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ...
[21:56:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[21:57:10] <wikibugs>	 (03PS1) 10Dzahn: peopleweb: fix file permission and typo in script config [puppet] - 10https://gerrit.wikimedia.org/r/1039303 (https://phabricator.wikimedia.org/T343364)
[21:59:56] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "forgot to link https://phabricator.wikimedia.org/T343364" [puppet] - 10https://gerrit.wikimedia.org/r/989577 (owner: 10Dzahn)
[22:00:26] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] peopleweb: fix file permission and typo in script config [puppet] - 10https://gerrit.wikimedia.org/r/1039303 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn)
[22:03:13] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:cassandra-dev: Hail mary - eevans@cumin1002
[22:13:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 13.29% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[22:13:39] <wikibugs>	 (03PS1) 10Dzahn: peopleweb: set warning threshold for home dirs to 2GB [puppet] - 10https://gerrit.wikimedia.org/r/1039305 (https://phabricator.wikimedia.org/T343364)
[22:15:45] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] peopleweb: set warning threshold for home dirs to 2GB [puppet] - 10https://gerrit.wikimedia.org/r/1039305 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn)
[22:16:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: Average latency high: eqiad appserver GET/200: 0.42000034901261735s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:16:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: Average latency high: eqiad api_appserver GET/200: 0.21520063012904037s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyE
[22:16:21] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.223s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:18:15] <jinxer-wm>	 RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 14.86% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[22:19:28] <wikibugs>	 (03PS1) 10BryanDavis: wikitech: Update Phabricator Conduit calls to disable/enable users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039307 (https://phabricator.wikimedia.org/T366587)
[22:21:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: Average latency high: eqiad appserver GET/200: 0.42000034901261735s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceede
[22:21:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: Average latency high: eqiad api_appserver GET/200: ...
[22:21:15] <jinxer-wm>	 0.21520063012904037s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:21:21] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.223s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:21:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:21:45] <jinxer-wm>	 FIRING: Primary inbound port utilisation over 80%  #page: Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[22:24:15] <icinga-wm_>	 PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1010 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:26:45] <jinxer-wm>	 RESOLVED: Primary inbound port utilisation over 80%  #page: Device cr1-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[22:27:55] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:28:15] <icinga-wm_>	 PROBLEM - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1010 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:37:55] <jinxer-wm>	 FIRING: [12x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:38:15] <icinga-wm_>	 RECOVERY - Check unit status of push_cross_cluster_settings_9600 on cloudelastic1010 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:42:55] <jinxer-wm>	 FIRING: [12x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:43:53] <icinga-wm_>	 PROBLEM - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1009 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:44:12] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/data-gateway: apply
[22:44:15] <icinga-wm_>	 RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cloudelastic1010 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:47:55] <jinxer-wm>	 RESOLVED: [8x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:48:27] <wikibugs>	 (03PS2) 10BryanDavis: wikitech: Replace OSM class in Gerrit blocking hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038749 (https://phabricator.wikimedia.org/T161553) (owner: 10Majavah)
[22:48:27] <wikibugs>	 (03PS3) 10BryanDavis: wikitech: Stop loading OpenStackManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038750 (https://phabricator.wikimedia.org/T161553) (owner: 10Majavah)
[22:49:04] <wikibugs>	 06SRE, 10Cassandra, 06Data Products, 06serviceops, and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9866001 (10Scott_French)
[22:50:11] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin2002 - T366555
[22:52:55] <jinxer-wm>	 FIRING: [14x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:53:53] <icinga-wm_>	 RECOVERY - Check unit status of push_cross_cluster_settings_9200 on cloudelastic1009 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:53:57] <icinga-wm_>	 PROBLEM - Host logstash1026 is DOWN: PING CRITICAL - Packet loss = 100%
[22:54:21] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply
[22:55:27] <icinga-wm_>	 RECOVERY - Host logstash1026 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms
[23:02:55] <jinxer-wm>	 RESOLVED: [7x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on cloudelastic1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:09:29] <wikibugs>	 06SRE, 10Cassandra, 06Data Products, 06serviceops, and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9866023 (10Scott_French) The service is turned up in staging and was verified against the commons impact metrics dataset present in cassandra staging a...
[23:11:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:14:41] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:29:06] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[23:29:19] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[23:29:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T352010)', diff saved to https://phabricator.wikimedia.org/P64134 and previous config saved to /var/cache/conftool/dbconfig/20240605-232926-ladsgroup.json
[23:29:30] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[23:29:50] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance
[23:30:14] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance
[23:38:26] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1038799
[23:38:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1038799 (owner: 10TrainBranchBot)
[23:44:07] <wikibugs>	 (03PS1) 10Stoyofuku-wmf: Refine list of pages where font size controls are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039310 (https://phabricator.wikimedia.org/T366334)
[23:45:32] <wikibugs>	 (03PS2) 10Stoyofuku-wmf: Disable font size options on specified pages for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038876 (https://phabricator.wikimedia.org/T366625)
[23:45:57] <wikibugs>	 (03CR) 10BryanDavis: [C:03+1] "I will roll this out along with I8aa283b88ed7896e8dddd16fd9c3fe4588e2e51e, probably on 2024-06-06" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038749 (https://phabricator.wikimedia.org/T161553) (owner: 10Majavah)
[23:46:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T364299)', diff saved to https://phabricator.wikimedia.org/P64135 and previous config saved to /var/cache/conftool/dbconfig/20240605-234643-marostegui.json
[23:46:46] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[23:59:32] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1038799 (owner: 10TrainBranchBot)