[00:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:04] FIRING: PuppetDisabled: Puppet disabled on wikikube-worker2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=kubernetes&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [00:11:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:13:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:28:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:31:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[00:34:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T371742)', diff saved to https://phabricator.wikimedia.org/P68564 and previous config saved to /var/cache/conftool/dbconfig/20240903-003452-ladsgroup.json [00:34:55] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [00:41:45] (03PS3) 10Andrew Bogott: Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) [00:49:42] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [00:49:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P68565 and previous config saved to /var/cache/conftool/dbconfig/20240903-004959-ladsgroup.json [00:53:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:58:38] 06SRE, 10Dumps 2.0, 10Dumps-Generation, 13Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10111474 (10Bugreporter) >Dumps should be disabled until they no longer cause db lag. Or, we should introduce dedicated ap... 
[00:58:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:03:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:05:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P68566 and previous config saved to /var/cache/conftool/dbconfig/20240903-010506-ladsgroup.json [01:20:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T371742)', diff saved to https://phabricator.wikimedia.org/P68567 and previous config saved to /var/cache/conftool/dbconfig/20240903-012013-ladsgroup.json [01:20:15] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [01:20:17] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [01:20:28] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [01:28:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:33:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:40:25] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [01:56:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:58:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240903T0200) [02:03:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:07:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2125.codfw.wmnet with reason: Maintenance [02:07:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2125.codfw.wmnet with reason: Maintenance [02:07:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2125 (T371742)', diff saved to https://phabricator.wikimedia.org/P68568 and previous config saved to /var/cache/conftool/dbconfig/20240903-020730-ladsgroup.json [02:07:35] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [02:13:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:16:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:18:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [02:36:28] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:56:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:58:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240903T0300) [03:01:28] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:03:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:08:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:10:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T371742)', diff saved to https://phabricator.wikimedia.org/P68569 and previous config saved to /var/cache/conftool/dbconfig/20240903-031012-ladsgroup.json [03:10:16] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [03:13:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:18:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:25:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P68570 and previous config saved to /var/cache/conftool/dbconfig/20240903-032519-ladsgroup.json [03:40:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P68571 and previous config saved to /var/cache/conftool/dbconfig/20240903-034026-ladsgroup.json [03:55:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T371742)', diff saved to https://phabricator.wikimedia.org/P68572 and previous config saved to /var/cache/conftool/dbconfig/20240903-035534-ladsgroup.json [03:55:36] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2126.codfw.wmnet with reason: Maintenance [03:55:37] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [03:55:50] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2126.codfw.wmnet with reason: Maintenance [03:55:51] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [03:56:04] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [03:56:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2126 
(T371742)', diff saved to https://phabricator.wikimedia.org/P68573 and previous config saved to /var/cache/conftool/dbconfig/20240903-035610-ladsgroup.json [03:56:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:58:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240903T0400) [04:00:48] !log mwpresync@deploy1003 Pruned MediaWiki: 1.43.0-wmf.18 (duration: 00m 47s) [04:02:43] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/1070012 (owner: 10L10n-bot) [04:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:08:04] FIRING: PuppetDisabled: Puppet disabled on wikikube-worker2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=kubernetes&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [04:08:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:16:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:18:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T371742)', diff saved to https://phabricator.wikimedia.org/P68574 and previous config saved to /var/cache/conftool/dbconfig/20240903-041816-ladsgroup.json [04:18:19] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [04:18:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:31:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:33:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P68575 and previous config saved to /var/cache/conftool/dbconfig/20240903-043323-ladsgroup.json [04:36:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:48:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P68576 and previous config saved to /var/cache/conftool/dbconfig/20240903-044830-ladsgroup.json [04:49:42] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [04:58:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:03:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T371742)', diff saved to https://phabricator.wikimedia.org/P68577 and previous config saved 
to /var/cache/conftool/dbconfig/20240903-050338-ladsgroup.json [05:03:40] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance [05:03:41] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [05:03:53] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance [05:04:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2138 (T371742)', diff saved to https://phabricator.wikimedia.org/P68578 and previous config saved to /var/cache/conftool/dbconfig/20240903-050400-ladsgroup.json [05:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:11:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:16:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:21:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:31:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:36:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:40:25] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [05:56:39] (03PS2) 10Stevemunene: dns: remove wdqs experimental endpoints [dns] - 10https://gerrit.wikimedia.org/r/1064355 (https://phabricator.wikimedia.org/T371833) [05:58:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:58:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T371742)', diff saved to https://phabricator.wikimedia.org/P68579 and previous config saved to /var/cache/conftool/dbconfig/20240903-055855-ladsgroup.json [05:58:58] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240903T0600) [06:00:05] marostegui, Amir1, and arnaudb: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240903T0600). [06:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:04:36] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:09:36] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:11:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:14:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P68580 and previous config saved to /var/cache/conftool/dbconfig/20240903-061402-ladsgroup.json [06:15:17] (03CR) 10Stevemunene: [C:03+1] airflow: fix configuration checksum [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1070063 (owner: 10Brouberol) [06:16:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:18:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [06:18:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:22:54] (03CR) 10Slyngshede: [C:03+2] cloudidp-dev for Horizon OIDC test. [dns] - 10https://gerrit.wikimedia.org/r/1070001 (owner: 10Slyngshede) [06:24:50] (03CR) 10Slyngshede: [V:03+1 C:03+2] R:codfw1dev:cloudweb: Add cloudidp-dev TLS. 
[puppet] - 10https://gerrit.wikimedia.org/r/1070034 (owner: 10Slyngshede) [06:29:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P68581 and previous config saved to /var/cache/conftool/dbconfig/20240903-062909-ladsgroup.json [06:39:01] (03CR) 10Brouberol: [C:03+2] airflow: fix configuration checksum [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070063 (owner: 10Brouberol) [06:44:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T371742)', diff saved to https://phabricator.wikimedia.org/P68582 and previous config saved to /var/cache/conftool/dbconfig/20240903-064416-ladsgroup.json [06:44:19] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance [06:44:32] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance [06:44:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T371742)', diff saved to https://phabricator.wikimedia.org/P68583 and previous config saved to /var/cache/conftool/dbconfig/20240903-064438-ladsgroup.json [06:51:52] 10SRE-tools, 06Infrastructure-Foundations: sre.hosts.reimage fails when the node is already in puppet db but has no facts (puppet never ran) - https://phabricator.wikimedia.org/T373810#10111617 (10elukey) 05Open→03Resolved a:03elukey Ack let's close and re-open if you see it again :) [06:54:19] (03PS1) 10Muehlenhoff: Remove idp-build alias [puppet] - 10https://gerrit.wikimedia.org/r/1070135 [06:56:21] (03CR) 10Slyngshede: "Thanks, fix:" [puppet] - 10https://gerrit.wikimedia.org/r/1070030 (owner: 10Slyngshede) [06:56:50] (03CR) 10Slyngshede: [C:03+1] Remove idp-build alias [puppet] - 10https://gerrit.wikimedia.org/r/1070135 (owner: 10Muehlenhoff) [06:58:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:04] Amir1 and Urbanecm: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240903T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:02:57] (03CR) 10Muehlenhoff: [C:03+2] Remove idp-build alias [puppet] - 10https://gerrit.wikimedia.org/r/1070135 (owner: 10Muehlenhoff) [07:03:46] (03CR) 10Vgutierrez: [C:03+1] P:trafficserver::backend add cloudidp-dev. [puppet] - 10https://gerrit.wikimedia.org/r/1070030 (owner: 10Slyngshede) [07:05:46] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3817/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070025 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [07:06:39] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: adjust throttling threshold for GitLab [puppet] - 10https://gerrit.wikimedia.org/r/1070025 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [07:07:22] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10111627 (10MoritzMuehlenhoff) >>! In T373783#10110623, @elukey wrote: > I updated the bullseye and bookworm netinst images this morning Ah, nice. I'll mark the tasks as updated. > but didn't... 
[07:07:53] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10111628 (10MoritzMuehlenhoff) [07:08:14] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10111629 (10MoritzMuehlenhoff) [07:08:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:11:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:11:40] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:cp-upload_codfw for 9.2.5-1wm2 [07:12:49] (03CR) 10Slyngshede: [C:03+2] P:trafficserver::backend add cloudidp-dev. [puppet] - 10https://gerrit.wikimedia.org/r/1070030 (owner: 10Slyngshede) [07:15:28] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@4ca4744] (releasing): (no justification provided) [07:16:07] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@4ca4744] (releasing): (no justification provided) (duration: 00m 39s) [07:23:01] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Evaluate xbzrle and/or auto-converge in qemu - https://phabricator.wikimedia.org/T317406#10111632 (10MoritzMuehlenhoff) 05Open→03Declined This was intended as a workaround for VMs running on Ganeti servers with 1G memory and Java-based workloads which ha... 
[07:25:36] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.43.0-wmf.21 [core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070141 (https://phabricator.wikimedia.org/T373640) [07:25:38] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.43.0-wmf.21 [core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070141 (https://phabricator.wikimedia.org/T373640) (owner: 10TrainBranchBot) [07:29:20] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/nda for Southparkfan - https://phabricator.wikimedia.org/T373518#10111643 (10joanna_borun) Approved! [07:29:43] (03PS1) 10Stevemunene: Configure prometheus metrics on the cephosd cluster [puppet] - 10https://gerrit.wikimedia.org/r/1070142 (https://phabricator.wikimedia.org/T369583) [07:30:47] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [07:31:14] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [07:35:01] (03PS4) 10Brouberol: airflow: enable statsd metric reporting when monitoring is enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066756 (https://phabricator.wikimedia.org/T369098) [07:36:22] 06SRE, 06Infrastructure-Foundations: Updated java.security policy in OpenJDK 11.0.4 - https://phabricator.wikimedia.org/T299894#10111650 (10MoritzMuehlenhoff) 05Open→03Declined This got superceded by https://phabricator.wikimedia.org/T328331 [07:36:40] (03PS5) 10Brouberol: airflow: enable statsd metric reporting when monitoring is enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066756 (https://phabricator.wikimedia.org/T369098) [07:37:08] (03PS3) 10Stevemunene: dns: remove wdqs experimental endpoints [dns] - 10https://gerrit.wikimedia.org/r/1064355 (https://phabricator.wikimedia.org/T371833) [07:37:19] (03CR) 10Brouberol: [C:03+1] dns: remove wdqs experimental endpoints [dns] - 10https://gerrit.wikimedia.org/r/1064355 
(https://phabricator.wikimedia.org/T371833) (owner: 10Stevemunene) [07:40:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T371742)', diff saved to https://phabricator.wikimedia.org/P68584 and previous config saved to /var/cache/conftool/dbconfig/20240903-074055-ladsgroup.json [07:40:59] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [07:51:08] (03PS1) 10Stevemunene: wdqs: Remove experimental configuration [puppet] - 10https://gerrit.wikimedia.org/r/1070197 (https://phabricator.wikimedia.org/T371833) [07:54:18] (03Merged) 10jenkins-bot: Branch commit for wmf/1.43.0-wmf.21 [core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070141 (https://phabricator.wikimedia.org/T373640) (owner: 10TrainBranchBot) [07:56:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P68585 and previous config saved to /var/cache/conftool/dbconfig/20240903-075602-ladsgroup.json [07:57:24] !log move LDAP user cn=ncreasy from cn=nda to cn=wmf [07:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:10] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070197 (https://phabricator.wikimedia.org/T371833) (owner: 10Stevemunene) [08:01:58] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: revert_risk_model from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067933 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [08:02:40] (03PS1) 10Brouberol: airflow-test-k8s: deploy the data_platform_sre dags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070199 (https://phabricator.wikimedia.org/T373837) [08:04:10] (03CR) 10Kevin Bazira: [C:03+2] ml-services: revert_risk_model from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067933 
(https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [08:05:08] (03CR) 10Brouberol: "Sorry, I've run PCC without realizing that no `Hosts:` stanza was specified. I've cancelled the PCC run to avoid consuming resources for n" [puppet] - 10https://gerrit.wikimedia.org/r/1070197 (https://phabricator.wikimedia.org/T371833) (owner: 10Stevemunene) [08:05:25] (03Merged) 10jenkins-bot: ml-services: revert_risk_model from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067933 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [08:06:27] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:cp-upload_codfw for 9.2.5-1wm2 [08:07:22] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10111735 (10elukey) Updated the netinst image yesterday :) [08:07:54] (03PS2) 10Stevemunene: wdqs: Remove experimental configuration [puppet] - 10https://gerrit.wikimedia.org/r/1070197 (https://phabricator.wikimedia.org/T371833) [08:08:04] FIRING: PuppetDisabled: Puppet disabled on wikikube-worker2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=kubernetes&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [08:08:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:09:33] (03PS1) 10TrainBranchBot: testwikis to 1.43.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070200 (https://phabricator.wikimedia.org/T373640) [08:09:35] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.43.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070200 
(https://phabricator.wikimedia.org/T373640) (owner: 10TrainBranchBot) [08:10:25] (03Merged) 10jenkins-bot: testwikis to 1.43.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070200 (https://phabricator.wikimedia.org/T373640) (owner: 10TrainBranchBot) [08:10:30] (03CR) 10CI reject: [V:04-1] wdqs: Remove experimental configuration [puppet] - 10https://gerrit.wikimedia.org/r/1070197 (https://phabricator.wikimedia.org/T371833) (owner: 10Stevemunene) [08:10:42] !log jnuche@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.21 refs T373640 [08:10:44] T373640: 1.43.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T373640 [08:10:56] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [08:11:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P68586 and previous config saved to /var/cache/conftool/dbconfig/20240903-081110-ladsgroup.json [08:11:40] FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:11:47] (03PS3) 10Stevemunene: wdqs: Remove experimental configuration [puppet] - 10https://gerrit.wikimedia.org/r/1070197 (https://phabricator.wikimedia.org/T371833) [08:13:14] !log upgrade python3-nbconvert on various DE hosts for security upgrades [08:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:16:05] (03CR) 10Stevemunene: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1070197 (https://phabricator.wikimedia.org/T371833) (owner: 10Stevemunene) [08:17:29] (03PS1) 10Elukey: Revert "profile::puppetserver: set java_start_mem to 40g" [puppet] - 10https://gerrit.wikimedia.org/r/1070201 [08:18:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:18:58] !log upgrade spicerack to 8.12.0 on cumin1002 [08:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:48] (03CR) 10Elukey: [C:03+2] Revert "profile::puppetserver: set java_start_mem to 40g" [puppet] - 10https://gerrit.wikimedia.org/r/1070201 (owner: 10Elukey) [08:21:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:22:47] 06SRE, 06Infrastructure-Foundations: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527#10111789 (10elukey) >>! In T373527#10109603, @elukey wrote: > Next steps: > > * Wait some hours for puppetserver on puppetserver1002 to get to a steady state and... 
[08:23:39] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:cp-text_codfw for 9.2.5-1wm2 [08:26:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T371742)', diff saved to https://phabricator.wikimedia.org/P68588 and previous config saved to /var/cache/conftool/dbconfig/20240903-082617-ladsgroup.json [08:26:19] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2175.codfw.wmnet with reason: Maintenance [08:26:21] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [08:26:33] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2175.codfw.wmnet with reason: Maintenance [08:26:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2175 (T371742)', diff saved to https://phabricator.wikimedia.org/P68589 and previous config saved to /var/cache/conftool/dbconfig/20240903-082639-ladsgroup.json [08:26:45] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [08:28:04] (03PS1) 10Brouberol: airflow: add restrictedSecurityContext to the git-sync initcontainer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070202 (https://phabricator.wikimedia.org/T369492) [08:28:29] !log installing intel-microcode security updates [08:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:03] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[08:30:07] (03CR) 10Brouberol: [C:03+1] wdqs: Remove experimental configuration [puppet] - 10https://gerrit.wikimedia.org/r/1070197 (https://phabricator.wikimedia.org/T371833) (owner: 10Stevemunene) [08:31:40] RESOLVED: SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:33:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:36:22] !log remove intel-microcode 3.20240312.1~deb11u1 from apt.wikimedia.org (this was a temporary import for the last round of Bullseye reboots, not superceded by 3.20240813.1~deb11u1 from the 11.1 point release) T373795 [08:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:29] T373795: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795 [08:37:42] (03PS1) 10Slyngshede: R:codfw1dev:cloudweb: Enable memcache, disable CORS. 
[puppet] - 10https://gerrit.wikimedia.org/r/1070203 [08:37:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet [08:38:06] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=mw-api-ext-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [08:38:10] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:38:25] !incidents [08:38:26] 5131 (UNACKED) ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet eqiad) [08:38:26] 5130 (RESOLVED) db1206 (paged)/MariaDB Replica Lag: s1 (paged) [08:38:30] !ack 5131 [08:38:31] 5131 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet eqiad) [08:38:41] * akosiaris looking [08:38:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:39:32] it's around 12.5rps as errors [08:40:33] (03PS15) 10Elukey: WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) [08:42:16] (03PS2) 10Slyngshede: R:codfw1dev:cloudweb: Enable memcache, disable CORS. 
[puppet] - 10https://gerrit.wikimedia.org/r/1070203 [08:43:10] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:43:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet [08:44:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [08:45:01] (03CR) 10Stevemunene: [C:03+2] dns: remove wdqs experimental endpoints [dns] - 10https://gerrit.wikimedia.org/r/1064355 (https://phabricator.wikimedia.org/T371833) (owner: 10Stevemunene) [08:45:14] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:45:25] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:48:06] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-api-ext-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=mw-api-ext-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [08:48:52] (03PS3) 10Slyngshede: R:codfw1dev:cloudweb: Enable memcache, disable CORS. 
[puppet] - 10https://gerrit.wikimedia.org/r/1070203 [08:49:42] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:49:51] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3821/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070203 (owner: 10Slyngshede) [08:50:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [08:56:19] (03PS16) 10Elukey: WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) [08:56:25] !log jnuche@deploy1003 Finished scap sync-world: testwikis to 1.43.0-wmf.21 refs T373640 (duration: 45m 42s) [08:56:27] T373640: 1.43.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T373640 [08:56:57] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:57:08] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:58:12] (03PS2) 10DCausse: cirrus: run the sanitizer only for wikitech [puppet] - 10https://gerrit.wikimedia.org/r/1052136 [08:58:14] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:59:42] (03PS1) 10David Caro: prometheus::cloud: add maintaindbusers target [puppet] - 10https://gerrit.wikimedia.org/r/1070206 
(https://phabricator.wikimedia.org/T332955) [09:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:02:25] (03PS4) 10Slyngshede: R:codfw1dev:cloudweb: Enable memcache, disable CORS. [puppet] - 10https://gerrit.wikimedia.org/r/1070203 [09:03:10] RESOLVED: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:03:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:04:12] (03PS5) 10Slyngshede: R:codfw1dev:cloudweb: Enable memcache, disable CORS. 
[puppet] - 10https://gerrit.wikimedia.org/r/1070203 [09:05:40] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3824/console" [puppet] - 10https://gerrit.wikimedia.org/r/1070206 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [09:08:42] (03CR) 10Btullis: [C:03+1] airflow: add restrictedSecurityContext to the git-sync initcontainer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070202 (https://phabricator.wikimedia.org/T369492) (owner: 10Brouberol) [09:11:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:13:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:14:10] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2066.codfw.wmnet [09:14:12] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2066.codfw.wmnet [09:14:23] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2066.codfw.wmnet [09:14:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2066.codfw.wmnet [09:15:09] (03CR) 10Brouberol: [C:03+2] airflow: add restrictedSecurityContext to the git-sync initcontainer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070202 (https://phabricator.wikimedia.org/T369492) (owner: 10Brouberol) [09:15:10] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:15:50] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2066.codfw.wmnet with OS bullseye [09:18:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:19:23] (03PS1) 10Arturo Borrero Gonzalez: openstack: compute: increase size of conntrack table [puppet] - 10https://gerrit.wikimedia.org/r/1070209 (https://phabricator.wikimedia.org/T373816) [09:19:31] (03PS1) 10JMeybohm: cfssl-issuer: Remove version pinning [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070210 (https://phabricator.wikimedia.org/T337928) [09:20:23] (03CR) 10CI reject: [V:04-1] cfssl-issuer: Remove version pinning [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070210 (https://phabricator.wikimedia.org/T337928) (owner: 10JMeybohm) [09:20:59] (03PS2) 10JMeybohm: cfssl-issuer: Remove version pinning [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070210 (https://phabricator.wikimedia.org/T337928) [09:21:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:21:30] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:21:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T371742)', diff saved to https://phabricator.wikimedia.org/P68591 and previous config saved to /var/cache/conftool/dbconfig/20240903-092129-ladsgroup.json [09:21:33] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [09:21:38] !log 
vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:cp-text_codfw for 9.2.5-1wm2 [09:21:48] (03CR) 10David Caro: [C:03+1] openstack: compute: increase size of conntrack table [puppet] - 10https://gerrit.wikimedia.org/r/1070209 (https://phabricator.wikimedia.org/T373816) (owner: 10Arturo Borrero Gonzalez) [09:23:40] RESOLVED: SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:23:46] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:24:30] (03PS6) 10Brouberol: airflow: enable statsd metric reporting when monitoring is enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066756 (https://phabricator.wikimedia.org/T369098) [09:24:45] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:cp-magru for 9.2.5-1wm2 [09:25:13] (03PS7) 10Brouberol: airflow: enable statsd metric reporting when monitoring is enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066756 (https://phabricator.wikimedia.org/T369098) [09:27:54] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3825/console" [puppet] - 10https://gerrit.wikimedia.org/r/1070206 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [09:29:45] (03CR) 10FNegri: [C:03+1] openstack: compute: increase size of conntrack table [puppet] - 10https://gerrit.wikimedia.org/r/1070209 
(https://phabricator.wikimedia.org/T373816) (owner: 10Arturo Borrero Gonzalez) [09:29:46] (03PS1) 10Klausman: ml-services: Double CPU and Memory limits for RR namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070212 [09:30:07] (03PS2) 10Jelto: gitlab: enable throttling for all GitLab instances [puppet] - 10https://gerrit.wikimedia.org/r/1058608 (https://phabricator.wikimedia.org/T366882) [09:30:23] (03CR) 10Klausman: "A hand-edit shows this in kubectl diff:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070212 (owner: 10Klausman) [09:32:00] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3826/co" [puppet] - 10https://gerrit.wikimedia.org/r/1058608 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [09:33:16] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2066.codfw.wmnet with reason: host reimage [09:33:39] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Double CPU and Memory limits for RR namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070212 (owner: 10Klausman) [09:35:39] (03CR) 10JMeybohm: [C:03+2] cfssl-issuer: Remove version pinning [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070210 (https://phabricator.wikimedia.org/T337928) (owner: 10JMeybohm) [09:35:53] (03CR) 10Jelto: [V:03+1] gitlab: enable throttling for all GitLab instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1058608 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [09:36:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:36:32] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
wikikube-worker2066.codfw.wmnet with reason: host reimage [09:36:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P68592 and previous config saved to /var/cache/conftool/dbconfig/20240903-093637-ladsgroup.json [09:38:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:38:47] (03Merged) 10jenkins-bot: cfssl-issuer: Remove version pinning [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070210 (https://phabricator.wikimedia.org/T337928) (owner: 10JMeybohm) [09:40:04] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:40:06] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:40:14] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:40:25] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [09:40:35] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:41:14] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:41:16] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:41:26] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:42:03] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[09:42:43] (03CR) 10Alexandros Kosiaris: trafficserver: Fix /w/rest.php and /api/ regex_map (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1070032 (https://phabricator.wikimedia.org/T364400) (owner: 10Clément Goubert) [09:43:24] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:43:31] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:43:40] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:44:08] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:44:15] !log jayme@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [09:44:47] !log jayme@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [09:46:14] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [09:46:31] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [09:46:33] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [09:46:51] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [09:46:53] !log jayme@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [09:47:11] !log jayme@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [09:47:13] !log jayme@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [09:47:35] !log jayme@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [09:47:36] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:47:37] (03PS2) 10Klausman: ml-services: Double CPU and Memory limits (ResourceQuota) for RR namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070212 [09:47:50] !log jayme@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:47:51] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [09:48:21] !log jayme@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[09:48:54] (03CR) 10Klausman: "`console" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070212 (owner: 10Klausman) [09:50:08] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2001.codfw.wmnet [09:50:42] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2001.codfw.wmnet [09:50:53] (03PS2) 10Hashar: contint: add java jdk-17 packages in addition to jdk-11 [puppet] - 10https://gerrit.wikimedia.org/r/1069325 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [09:50:58] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1069325 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [09:51:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P68594 and previous config saved to /var/cache/conftool/dbconfig/20240903-095144-ladsgroup.json [09:51:52] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3828/console" [puppet] - 10https://gerrit.wikimedia.org/r/1070206 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [09:51:57] !log cgoubert@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2001.codfw.wmnet [09:52:02] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2001.codfw.wmnet [09:52:03] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2001.codfw.wmnet [09:52:08] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10112179 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node was started by cgoubert@cumin1002 
Renumb... [09:52:25] (03PS1) 10Elukey: redfish: introduce the AccountManager URI for DELL [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070217 (https://phabricator.wikimedia.org/T365372) [09:52:51] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2001.codfw.wmnet with OS bullseye [09:53:04] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10112181 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... [09:53:10] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2001.codfw.wmnet with OS bullseye [09:53:11] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2001.codfw.wmnet [09:53:20] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10112182 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... [09:53:21] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10112183 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.renumber-node started by cgoubert@cumin1002 Renumberin... 
[09:55:52] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2001.codfw.wmnet [09:55:54] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2001.codfw.wmnet [09:56:57] !log cgoubert@cumin1002 START - Cookbook sre.hosts.remove-downtime for wikikube-worker2001.codfw.wmnet [09:56:57] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-worker2001.codfw.wmnet [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240903T1000) [10:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:01:11] (03CR) 10AikoChou: [C:03+1] "the commit msg should also be changed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070212 (owner: 10Klausman) [10:02:12] (03PS3) 10Klausman: ml-services: Increase CPU and Memory limits (ResourceQuota) for RR namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070212 [10:02:21] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:02:32] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:02:32] (03CR) 10Klausman: "Done!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1070212 (owner: 10Klausman) [10:03:43] (03CR) 10Ilias Sarantopoulos: [C:03+1] "Copied votes on follow-up patch sets have been updated:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070212 (owner: 10Klausman) [10:03:55] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Increase CPU and Memory limits (ResourceQuota) for RR namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070212 (owner: 10Klausman) [10:04:03] (03CR) 10Klausman: [C:03+2] ml-services: Increase CPU and Memory limits (ResourceQuota) for RR namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070212 (owner: 10Klausman) [10:05:50] (03PS6) 10Slyngshede: R:codfw1dev:cloudweb: Enable memcache, disable CORS. [puppet] - 10https://gerrit.wikimedia.org/r/1070203 [10:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:06:46] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3829/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070203 (owner: 10Slyngshede) [10:06:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T371742)', diff saved to https://phabricator.wikimedia.org/P68596 and previous config saved to /var/cache/conftool/dbconfig/20240903-100651-ladsgroup.json [10:06:53] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2189.codfw.wmnet with reason: Maintenance [10:06:54] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [10:07:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2189.codfw.wmnet 
with reason: Maintenance [10:07:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2189 (T371742)', diff saved to https://phabricator.wikimedia.org/P68597 and previous config saved to /var/cache/conftool/dbconfig/20240903-100713-ladsgroup.json [10:07:15] (03Merged) 10jenkins-bot: ml-services: Increase CPU and Memory limits (ResourceQuota) for RR namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070212 (owner: 10Klausman) [10:08:48] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:09:10] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2066.codfw.wmnet with OS bullseye [10:09:20] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [10:10:50] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3830/console" [puppet] - 10https://gerrit.wikimedia.org/r/1070203 (owner: 10Slyngshede) [10:11:11] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [10:11:40] (03PS2) 10Arturo Borrero Gonzalez: openstack: compute: increase size of conntrack table [puppet] - 10https://gerrit.wikimedia.org/r/1070209 (https://phabricator.wikimedia.org/T373816) [10:12:45] !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [10:13:37] (03CR) 10FNegri: [C:03+1] openstack: compute: increase size of conntrack table [puppet] - 10https://gerrit.wikimedia.org/r/1070209 (https://phabricator.wikimedia.org/T373816) (owner: 10Arturo Borrero Gonzalez) [10:15:05] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: compute: increase size of conntrack table [puppet] - 10https://gerrit.wikimedia.org/r/1070209 (https://phabricator.wikimedia.org/T373816) (owner: 10Arturo Borrero Gonzalez) [10:15:52] (03PS7) 10Slyngshede: R:codfw1dev:cloudweb: Enable memcache, disable CORS. 
[puppet] - 10https://gerrit.wikimedia.org/r/1070203 [10:16:14] (03CR) 10Clément Goubert: [C:03+2] decommission mw226[1-2].codfw.wmnet mw22[68-77] [puppet] - 10https://gerrit.wikimedia.org/r/1069999 (https://phabricator.wikimedia.org/T371262) (owner: 10Clément Goubert) [10:16:47] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3831/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070203 (owner: 10Slyngshede) [10:18:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:20:12] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mw226[1-2].codfw.wmnet mw22[68-77].codfw.wmnet - https://phabricator.wikimedia.org/T371262#10112273 (10Clement_Goubert) [10:22:11] (03PS1) 10Arturo Borrero Gonzalez: cloud: refresh conntrack values for cloudgw/neutron [puppet] - 10https://gerrit.wikimedia.org/r/1070221 (https://phabricator.wikimedia.org/T373816) [10:22:37] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070217 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [10:22:57] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10112286 (10MoritzMuehlenhoff) [10:23:13] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10112287 (10MoritzMuehlenhoff) [10:23:22] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#10112289 (10MoritzMuehlenhoff) [10:26:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-lab1002.eqiad.wmnet [10:27:51] !log klausman@deploy1003 helmfile 
[ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [10:29:17] !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [10:29:59] !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [10:30:52] (03PS1) 10LSobanski: Filter out addresses that cannot be removed from VRTS [puppet] - 10https://gerrit.wikimedia.org/r/1070222 (https://phabricator.wikimedia.org/T368257) [10:31:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-lab1002.eqiad.wmnet [10:33:59] (03PS4) 10Stevemunene: wdqs: Remove experimental configuration [puppet] - 10https://gerrit.wikimedia.org/r/1070197 (https://phabricator.wikimedia.org/T371833) [10:34:13] (03CR) 10Stevemunene: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070197 (https://phabricator.wikimedia.org/T371833) (owner: 10Stevemunene) [10:36:41] (03CR) 10Elukey: [C:03+2] redfish: introduce the AccountManager URI for DELL [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070217 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [10:40:03] (03PS17) 10Elukey: WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) [10:40:45] (03PS2) 10Arturo Borrero Gonzalez: cloud: refresh conntrack values for cloudgw/neutron [puppet] - 10https://gerrit.wikimedia.org/r/1070221 (https://phabricator.wikimedia.org/T373816) [10:44:59] (03PS2) 10Hnowlan: k8s: rename kubernetes20(28|55) and mw242[23] to wikikube-workers [puppet] - 10https://gerrit.wikimedia.org/r/1070052 (https://phabricator.wikimedia.org/T372878) [10:46:21] (03PS5) 10Stevemunene: wdqs: Remove experimental configuration [puppet] - 10https://gerrit.wikimedia.org/r/1070197 (https://phabricator.wikimedia.org/T371833) [10:46:38] (03CR) 10Stevemunene: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070197 (https://phabricator.wikimedia.org/T371833) (owner: 10Stevemunene) [10:48:29] (03Merged) 10jenkins-bot: redfish: introduce the AccountManager URI for DELL [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070217 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [10:51:04] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059103 (https://phabricator.wikimedia.org/T371360) (owner: 10Effie Mouzeli) [10:57:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T371742)', diff saved to https://phabricator.wikimedia.org/P68598 and previous config saved to /var/cache/conftool/dbconfig/20240903-105710-ladsgroup.json [10:57:14] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [10:57:20] !log installing amd64-microcode security updates [10:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:01:50] (03PS1) 10Arturo Borrero Gonzalez: openstack: compute: set conntrack buckets values to a power of 2 [puppet] - 10https://gerrit.wikimedia.org/r/1070229 (https://phabricator.wikimedia.org/T373816) [11:05:42] (03PS15) 10Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) [11:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:07:38] (03CR) 10Muehlenhoff: [C:03+1] "LGTM, I remember that we ran into the same issue many years ago with ferm/iptables and connection tracking as well." [puppet] - 10https://gerrit.wikimedia.org/r/1070229 (https://phabricator.wikimedia.org/T373816) (owner: 10Arturo Borrero Gonzalez) [11:07:41] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3832/console" [puppet] - 10https://gerrit.wikimedia.org/r/1070203 (owner: 10Slyngshede) [11:08:46] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10112448 (10MoritzMuehlenhoff) [11:08:49] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10112449 (10MoritzMuehlenhoff) [11:09:17] (03PS8) 10Slyngshede: R:codfw1dev:cloudweb: Adapt config to cloud dev. [puppet] - 10https://gerrit.wikimedia.org/r/1070203 [11:09:59] (03CR) 10Slyngshede: R:codfw1dev:cloudweb: Adapt config to cloud dev. 
[puppet] - 10https://gerrit.wikimedia.org/r/1070203 (owner: 10Slyngshede) [11:10:43] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: compute: set conntrack buckets values to a power of 2 [puppet] - 10https://gerrit.wikimedia.org/r/1070229 (https://phabricator.wikimedia.org/T373816) (owner: 10Arturo Borrero Gonzalez) [11:12:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P68599 and previous config saved to /var/cache/conftool/dbconfig/20240903-111218-ladsgroup.json [11:12:37] (03PS3) 10Arturo Borrero Gonzalez: cloud: refresh conntrack values for cloudgw/neutron [puppet] - 10https://gerrit.wikimedia.org/r/1070221 (https://phabricator.wikimedia.org/T373816) [11:21:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:25:27] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T373755#10112530 (10phaultfinder) [11:25:55] (03CR) 10Hashar: [C:03+1] "I was wondering how `/usr/bin/java` would be handled. That is done by `alternatives::java` which sets the alternative to the first entry." 
[puppet] - 10https://gerrit.wikimedia.org/r/1069325 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [11:26:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:27:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P68600 and previous config saved to /var/cache/conftool/dbconfig/20240903-112725-ladsgroup.json [11:28:00] (03PS2) 10Hashar: contint: switch java_home from jdk-11 to jdk-17 [puppet] - 10https://gerrit.wikimedia.org/r/1069327 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [11:28:00] (03PS2) 10Hashar: contint: remove jdk-11 packages [puppet] - 10https://gerrit.wikimedia.org/r/1069328 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [11:29:25] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1069328 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [11:31:24] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:cp-magru for 9.2.5-1wm2 [11:31:46] (03CR) 10Hashar: "After Java 11 is removed, `/usr/bin/java` would point to Java 17. 
We would then need to reconnect the Jenkins agents on the two production " [puppet] - 10https://gerrit.wikimedia.org/r/1069328 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [11:33:57] (03CR) 10Clément Goubert: [C:03+1] k8s: rename kubernetes20(28|55) and mw242[23] to wikikube-workers [puppet] - 10https://gerrit.wikimedia.org/r/1070052 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan) [11:37:12] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1069327 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [11:38:08] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1069327 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [11:38:13] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1070222 (https://phabricator.wikimedia.org/T368257) (owner: 10LSobanski) [11:39:13] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10112599 (10Clement_Goubert) [11:42:29] 06SRE, 10Dumps 2.0, 10Dumps-Generation, 13Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10112610 (10Ladsgroup) They are both on dedicated servers and db replicas. We can't fully isolate the replica from the res... [11:42:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T371742)', diff saved to https://phabricator.wikimedia.org/P68601 and previous config saved to /var/cache/conftool/dbconfig/20240903-114232-ladsgroup.json [11:42:34] (03CR) 10Muehlenhoff: R:codfw1dev:cloudweb: Adapt config to cloud dev. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1070203 (owner: 10Slyngshede) [11:42:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2197.codfw.wmnet with reason: Maintenance [11:42:36] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [11:42:48] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2197.codfw.wmnet with reason: Maintenance [11:49:17] (03PS1) 10Effie Mouzeli: trafficserver: Allow XWD to be used for wikitech [puppet] - 10https://gerrit.wikimedia.org/r/1070233 (https://phabricator.wikimedia.org/T371537) [11:49:38] (03CR) 10CI reject: [V:04-1] trafficserver: Allow XWD to be used for wikitech [puppet] - 10https://gerrit.wikimedia.org/r/1070233 (https://phabricator.wikimedia.org/T371537) (owner: 10Effie Mouzeli) [11:50:06] (03PS2) 10Effie Mouzeli: trafficserver: Allow XWD to be used for wikitech [puppet] - 10https://gerrit.wikimedia.org/r/1070233 (https://phabricator.wikimedia.org/T371537) [11:51:23] (03PS9) 10Slyngshede: R:codfw1dev:cloudweb: Adapt config to cloud dev. [puppet] - 10https://gerrit.wikimedia.org/r/1070203 [11:51:24] (03CR) 10Ladsgroup: [C:03+1] trafficserver: Allow XWD to be used for wikitech [puppet] - 10https://gerrit.wikimedia.org/r/1070233 (https://phabricator.wikimedia.org/T371537) (owner: 10Effie Mouzeli) [11:51:48] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070233 (https://phabricator.wikimedia.org/T371537) (owner: 10Effie Mouzeli) [11:51:58] (03CR) 10Slyngshede: R:codfw1dev:cloudweb: Adapt config to cloud dev. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1070203 (owner: 10Slyngshede) [11:52:47] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3833/console" [puppet] - 10https://gerrit.wikimedia.org/r/1070203 (owner: 10Slyngshede) [11:54:41] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3834/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070203 (owner: 10Slyngshede) [11:59:51] (03CR) 10Slyngshede: [C:04-1] "Nope, memcache does not disable so easily." [puppet] - 10https://gerrit.wikimedia.org/r/1070203 (owner: 10Slyngshede) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240903T1200) [12:00:37] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1064050 (https://phabricator.wikimedia.org/T369205) (owner: 10Slyngshede) [12:00:52] (03CR) 10Muehlenhoff: [C:03+2] Readd profile::idp::build to idp-test [puppet] - 10https://gerrit.wikimedia.org/r/1069954 (owner: 10Muehlenhoff) [12:03:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:06:47] I'm working on T373703 and trying to figure how $wgDefaultUserOptions['math'] is currently configured. 
From the part of the config I found, I get the impression that some config code is hidden, or the defaults have changed and the current configuration is effectively equivalent to using the defaults given in the extension. If there is no hidden config, I would suggest removing the code. [12:06:47] T373703: Enable native mathml rendering by default on group0 and test wikis in production - https://phabricator.wikimedia.org/T373703 [12:06:54] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1069258 [12:11:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:11:29] (03PS10) 10Slyngshede: R:codfw1dev:cloudweb: Adapt config to cloud dev. [puppet] - 10https://gerrit.wikimedia.org/r/1070203 [12:11:50] (03CR) 10CI reject: [V:04-1] R:codfw1dev:cloudweb: Adapt config to cloud dev. [puppet] - 10https://gerrit.wikimedia.org/r/1070203 (owner: 10Slyngshede) [12:14:25] (03PS11) 10Slyngshede: R:codfw1dev:cloudweb: Adapt config to cloud dev. [puppet] - 10https://gerrit.wikimedia.org/r/1070203 [12:14:45] (03CR) 10CI reject: [V:04-1] R:codfw1dev:cloudweb: Adapt config to cloud dev. [puppet] - 10https://gerrit.wikimedia.org/r/1070203 (owner: 10Slyngshede) [12:15:46] (03PS12) 10Slyngshede: R:codfw1dev:cloudweb: Adapt config to cloud dev. [puppet] - 10https://gerrit.wikimedia.org/r/1070203 [12:16:08] (03CR) 10CI reject: [V:04-1] R:codfw1dev:cloudweb: Adapt config to cloud dev. 
[puppet] - 10https://gerrit.wikimedia.org/r/1070203 (owner: 10Slyngshede) [12:16:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:16:47] (03PS13) 10Slyngshede: R:codfw1dev:cloudweb: Adapt config to cloud dev. [puppet] - 10https://gerrit.wikimedia.org/r/1070203 [12:17:08] (03CR) 10CI reject: [V:04-1] R:codfw1dev:cloudweb: Adapt config to cloud dev. [puppet] - 10https://gerrit.wikimedia.org/r/1070203 (owner: 10Slyngshede) [12:18:22] (03PS1) 10Muehlenhoff: Temporarily disable stunnel for the Puppet 7 migration of deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/1070236 [12:19:32] (03PS14) 10Slyngshede: R:codfw1dev:cloudweb: Adapt config to cloud dev. [puppet] - 10https://gerrit.wikimedia.org/r/1070203 [12:20:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070236 (owner: 10Muehlenhoff) [12:20:38] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:21:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:23:02] (03PS1) 10Ladsgroup: Revert "Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070237 (https://phabricator.wikimedia.org/T359529) [12:24:03] (03PS2) 10Ladsgroup: Revert "Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070237 (https://phabricator.wikimedia.org/T359529) [12:24:13] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:24:48] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:26:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:26:28] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2207.codfw.wmnet with reason: Maintenance [12:26:41] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2207.codfw.wmnet with reason: Maintenance [12:26:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2207 (T371742)', diff saved to https://phabricator.wikimedia.org/P68602 and previous config saved to /var/cache/conftool/dbconfig/20240903-122647-ladsgroup.json [12:26:50] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [12:27:48] (03PS4) 10Andrew Bogott: Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) [12:27:48] (03PS1) 10Andrew Bogott: keystone/apache: fix OIDCRedirectURI setting [puppet] - 10https://gerrit.wikimedia.org/r/1070238 (https://phabricator.wikimedia.org/T359590) [12:36:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:38:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:43:09] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:43:11] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:49:42] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:51:04] (03PS1) 10Filippo Giunchedi: pontoon: improve git push experience [puppet] - 10https://gerrit.wikimedia.org/r/1070244 [12:51:54] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:52:41] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:00:04] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240903T1300). [13:00:04] No Gerrit patches in the queue for this window AFAICS. 
[13:00:15] o/ [13:01:37] (03PS1) 10Brouberol: airflow: fully generate airflow.cfg from helm values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070245 (https://phabricator.wikimedia.org/T368737) [13:05:34] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: improve git push experience [puppet] - 10https://gerrit.wikimedia.org/r/1070244 (owner: 10Filippo Giunchedi) [13:05:37] (03PS2) 10Brouberol: airflow: fully generate airflow.cfg from helm values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070245 (https://phabricator.wikimedia.org/T368737) [13:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:06:55] (03CR) 10JMeybohm: [C:03+2] eventgate-main: Disable end-to-end readinessProbe (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066719 (https://phabricator.wikimedia.org/T373192) (owner: 10JMeybohm) [13:07:13] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:08:02] (03Merged) 10jenkins-bot: eventgate-main: Disable end-to-end readinessProbe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066719 (https://phabricator.wikimedia.org/T373192) (owner: 10JMeybohm) [13:08:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:10:24] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:12:26] (03CR) 10AOkoth: [C:03+1] Filter out addresses that cannot be removed from VRTS [puppet] - 
10https://gerrit.wikimedia.org/r/1070222 (https://phabricator.wikimedia.org/T368257) (owner: 10LSobanski) [13:13:25] (03CR) 10Vgutierrez: [C:03+1] trafficserver: Allow XWD to be used for wikitech [puppet] - 10https://gerrit.wikimedia.org/r/1070233 (https://phabricator.wikimedia.org/T371537) (owner: 10Effie Mouzeli) [13:13:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:15:53] (03PS1) 10Jforrester: Drop old wikifunctions.ui event stream, replaced by ….wikifunctions_ui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070247 (https://phabricator.wikimedia.org/T369949) [13:16:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:16:49] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:17:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T371742)', diff saved to https://phabricator.wikimedia.org/P68604 and previous config saved to /var/cache/conftool/dbconfig/20240903-131704-ladsgroup.json [13:17:07] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [13:17:11] (03PS3) 10Brouberol: airflow: fully generate airflow.cfg from helm values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070245 (https://phabricator.wikimedia.org/T368737) [13:18:20] (03CR) 10CI reject: [V:04-1] airflow: fully generate airflow.cfg from helm values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070245 
(https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol) [13:20:00] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:20:51] (03PS4) 10Brouberol: airflow: fully generate airflow.cfg from helm values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070245 (https://phabricator.wikimedia.org/T368737) [13:23:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:25:30] (03PS5) 10Brouberol: airflow: fully generate airflow.cfg from helm values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070245 (https://phabricator.wikimedia.org/T368737) [13:25:58] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:26:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:27:47] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T373731#10112948 (10Papaul) 05Open→03Resolved a:03Papaul This is a duplicate of https://phabricator.wikimedia.org/T373727. 
Resolving [13:29:10] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:29:43] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-main: apply [13:30:04] !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [13:32:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P68605 and previous config saved to /var/cache/conftool/dbconfig/20240903-133211-ladsgroup.json [13:33:32] (03CR) 10JMeybohm: [C:03+1] "works for me" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069953 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [13:34:38] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:37:48] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:38:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:39:51] (03CR) 10Btullis: [C:03+1] "This looks really cool. Thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1070245 (https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol) [13:40:25] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:41:09] !log jayme@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [13:41:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:41:39] !log jayme@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [13:43:31] (03CR) 10Btullis: [C:03+1] airflow: fully generate airflow.cfg from helm values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070245 (https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol) [13:43:32] (03CR) 10Brouberol: [C:03+2] airflow: fully generate airflow.cfg from helm values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070245 (https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol) [13:43:46] (03CR) 10Btullis: [C:03+1] airflow: fully generate airflow.cfg from helm values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070245 (https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol) [13:44:00] (03CR) 10Brouberol: [C:03+2] airflow: fully generate airflow.cfg from helm values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070245 (https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol) [13:44:03] (03CR) 10Brouberol: [V:03+2 C:03+2] airflow: fully generate airflow.cfg from helm values 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1070245 (https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol) [13:45:21] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:45:58] (03PS2) 10Andrew Bogott: keystone/apache: fix OIDC settings [puppet] - 10https://gerrit.wikimedia.org/r/1070238 (https://phabricator.wikimedia.org/T359590) [13:45:59] (03PS5) 10Andrew Bogott: Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) [13:47:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P68606 and previous config saved to /var/cache/conftool/dbconfig/20240903-134719-ladsgroup.json [13:47:33] (03CR) 10Andrew Bogott: [C:03+2] keystone/apache: fix OIDC settings [puppet] - 10https://gerrit.wikimedia.org/r/1070238 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [13:48:31] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:49:04] (03CR) 10Andrea Denisse: [C:03+2] alert: Enable the alert[12]002 hosts as alertmanagers [puppet] - 10https://gerrit.wikimedia.org/r/1064806 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [13:49:13] !log jayme@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [13:49:48] !log jayme@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [13:51:16] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T373755#10113055 (10VRiley-WMF) a:03VRiley-WMF [13:51:31] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:51:32] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - 
https://phabricator.wikimedia.org/T373755#10113058 (10VRiley-WMF) 05Open→03Resolved Rebalanced power [13:51:54] (03PS1) 10Brouberol: global_config: define an external-services entry for mx[1-2]001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1070255 (https://phabricator.wikimedia.org/T368737) [13:52:25] (03CR) 10Brouberol: [V:03+2 C:03+2] airflow: fully generate airflow.cfg from helm values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070245 (https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol) [13:52:51] (03PS1) 10JMeybohm: eventgate: Disable end-to-end readinessProbe by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070257 (https://phabricator.wikimedia.org/T373192) [13:53:54] (03CR) 10JMeybohm: "I tried to clean up the type confusion around `test_events`, making it a list everywhere." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070257 (https://phabricator.wikimedia.org/T373192) (owner: 10JMeybohm) [13:54:43] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:55:27] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:57:05] (03CR) 10Effie Mouzeli: [C:03+2] trafficserver: Allow XWD to be used for wikitech [puppet] - 10https://gerrit.wikimedia.org/r/1070233 (https://phabricator.wikimedia.org/T371537) (owner: 10Effie Mouzeli) [13:58:48] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:00:05] denisse and godog: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Alert hosts failoverFailover to alert1002 deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240903T1400). [14:00:24] godog: Ready to go. 
:) [14:00:30] (03CR) 10WMDE-leszek: "+1, WMDE would find it easily with wikidata-query-gui name" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069953 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [14:00:43] !log Disabling meta-monitoring for the alert hosts [14:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:50] !log Disabling meta-monitoring for the alert hosts - T372418 [14:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:53] T372418: Put the alert1002 and alert2002 hosts in production - https://phabricator.wikimedia.org/T372418 [14:01:14] denisse: \o/ [14:02:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T371742)', diff saved to https://phabricator.wikimedia.org/P68609 and previous config saved to /var/cache/conftool/dbconfig/20240903-140226-ladsgroup.json [14:02:29] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [14:03:43] !log Stopping services in the alert1001 host - T372418 [14:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:51] (03CR) 10Andrea Denisse: [C:03+2] alert: Failover from alert1001 to alert2002 [puppet] - 10https://gerrit.wikimedia.org/r/1064826 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [14:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:06:42] !log Failing over to alert2002 - T372418 [14:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:45] T372418: Put the alert1002 and alert2002 hosts in production - https://phabricator.wikimedia.org/T372418 [14:07:21] (03PS2) 10Brouberol: 
airflow-test-k8s: deploy the data_platform_sre dags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070199 (https://phabricator.wikimedia.org/T373837) [14:09:05] (03PS3) 10Brouberol: airflow-test-k8s: deploy the data_platform_sre dags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070199 (https://phabricator.wikimedia.org/T373837) [14:10:16] (03PS3) 10Andrea Denisse: alert: Resolve alerts DNS queries to alert2002 [dns] - 10https://gerrit.wikimedia.org/r/1065258 (https://phabricator.wikimedia.org/T372418) [14:10:35] !log Resolve DNS queries to alert2002 - T372418 [14:10:35] (03PS1) 10Elukey: redfish: catch no-json-responses in change_user_password [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070263 (https://phabricator.wikimedia.org/T365372) [14:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:12:02] (03PS15) 10Slyngshede: R:codfw1dev:cloudweb: Adapt config to cloud dev. 
[puppet] - 10https://gerrit.wikimedia.org/r/1070203 [14:12:04] (03CR) 10Andrea Denisse: [C:03+2] alert: Resolve alerts DNS queries to alert2002 [dns] - 10https://gerrit.wikimedia.org/r/1065258 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [14:13:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:14:27] (03PS8) 10Brouberol: airflow: enable statsd metric reporting when monitoring is enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066756 (https://phabricator.wikimedia.org/T369098) [14:16:18] (03PS9) 10Brouberol: airflow: enable statsd metric reporting when monitoring is enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066756 (https://phabricator.wikimedia.org/T369098) [14:17:32] (03CR) 10Slyngshede: [V:03+1 C:03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3837/console" [puppet] - 10https://gerrit.wikimedia.org/r/1070034 (owner: 10Slyngshede) [14:18:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:18:32] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: deploy the data_platform_sre dags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070199 (https://phabricator.wikimedia.org/T373837) (owner: 10Brouberol) [14:18:43] (03CR) 10Hnowlan: [C:03+2] k8s: rename kubernetes20(28|55) and mw242[23] to wikikube-workers [puppet] - 10https://gerrit.wikimedia.org/r/1070052 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan) [14:19:57] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3838/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070203 (owner: 10Slyngshede) [14:20:18] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070255 (https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol) [14:20:30] hnowlan: Can I merge k8s: rename kubernetes20(28|55) and mw242[23] to wikikube-workers (7884bf2827) ? [14:20:38] denisse: was just writing you a message - please do [14:20:55] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:21:46] Merged. [14:22:45] (03PS2) 10Brouberol: global_config: define an external-services entry for mx[1-2]001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1070255 (https://phabricator.wikimedia.org/T368737) [14:23:04] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070255 (https://phabricator.wikimedia.org/T368737) (owner: 10Brouberol) [14:24:51] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host kubernetes2028.codfw.wmnet [14:24:56] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2055.codfw.wmnet [14:25:00] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3839/console" [puppet] - 10https://gerrit.wikimedia.org/r/1070203 (owner: 10Slyngshede) [14:25:01] PROBLEM - SSH on gitlab1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:01] PROBLEM - SSH on miscweb2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:08] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2422.codfw.wmnet [14:25:13] PROBLEM - NTP peers on dns3004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [14:25:13] PROBLEM - SSH on planet1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:15] PROBLEM - NTP peers on dns5003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [14:25:15] PROBLEM - SSH on aphlict2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:15] PROBLEM - SSH on doc2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:15] PROBLEM - NTP peers on dns6002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [14:25:16] PROBLEM - SSH on gitlab1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:22] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2423.codfw.wmnet [14:25:23] PROBLEM - NTP anycast VIP 10.3.0.6 ntp-b.anycast.wmnet on ntp-b.anycast.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [14:25:27] PROBLEM - SSH on etherpad1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:27] PROBLEM - SSH on lists2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:29] PROBLEM - NTP peers on dns3003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [14:25:29] PROBLEM - NTP peers on dns4004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/NTP [14:25:29] PROBLEM - NTP peers on dns6001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [14:25:29] PROBLEM - SSH on planet2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:33] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host kubernetes2055.codfw.wmnet [14:25:34] :o [14:25:37] PROBLEM - SSH on durum2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:37] PROBLEM - SSH on durum2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:37] PROBLEM - SSH on durum1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:37] PROBLEM - SSH on durum4001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:37] PROBLEM - SSH on durum1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:38] PROBLEM - NTP peers on dns7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [14:25:38] PROBLEM - SSH on durum3003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:39] PROBLEM - SSH on durum4002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:39] PROBLEM - SSH on durum5002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:40] PROBLEM - SSH on durum5001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:40] PROBLEM - SSH on durum6002 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:41] PROBLEM - SSH on durum7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:41] PROBLEM - SSH on durum3004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:42] PROBLEM - SSH on durum6001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:42] PROBLEM - SSH on durum7002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:42] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host mw2422.codfw.wmnet [14:25:43] PROBLEM - SSH on gerrit1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:43] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2422.codfw.wmnet [14:25:44] PROBLEM - SSH on gerrit1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:44] PROBLEM - SSH on gerrit2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:45] PROBLEM - SSH on gitlab2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:45] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host mw2422.codfw.wmnet [14:25:46] PROBLEM - SSH on miscweb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:46] PROBLEM - NTP anycast VIP 10.3.0.7 ntp-c.anycast.wmnet on ntp-c.anycast.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [14:25:47] PROBLEM - SSH on etherpad2002 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:49] PROBLEM - NTP peers on dns1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [14:25:49] PROBLEM - NTP peers on dns1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [14:25:49] PROBLEM - NTP peers on dns1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [14:25:49] PROBLEM - NTP peers on dns2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [14:25:49] PROBLEM - NTP peers on dns2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [14:25:50] PROBLEM - NTP peers on dns2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [14:25:50] PROBLEM - NTP peers on dns4003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [14:25:50] PROBLEM - NTP peers on dns5004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [14:25:51] PROBLEM - NTP peers on dns7001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP [14:25:52] PROBLEM - SSH on doc1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:53] PROBLEM - SSH on aphlict1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:25:56] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host mw2423.codfw.wmnet [14:26:05] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2423.codfw.wmnet [14:26:06] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host 
mw2423.codfw.wmnet [14:26:21] (03PS2) 10Physikerwelt: Remove redundandant setting of $wgDefaultUserOptions['math'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069258 (https://phabricator.wikimedia.org/T373703) [14:28:24] something is off with the SSH monitoring. I had alerts for doc1003 / gerrit1003 / doc2002 / gerrit2002 but they do respond to SSH [14:28:33] PROBLEM - NTP anycast VIP 10.3.0.5 ntp-a.anycast.wmnet on ntp-a.anycast.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [14:28:42] it's due to the alerting migration, hashar [14:28:52] hashar: Apologies for the noise. [14:29:15] hnowlan: Do you know if the NTP alerts have to do with the k8s patch? [14:29:37] denisse: I doubt it [14:29:50] hnowlan: You're right, my bad. [14:29:53] I'm taking a look. [14:30:24] or at least if they are related I have done something unprecedented :D [14:32:25] denisse: I'm looking into alerts.w.o not working in the meantime [14:32:58] FIRING: [7x] ProbeDown: Service puppetmaster1001:8141 has failed probes (http_puppetmaster1001_eqiad_wmnet_backend_https_ip6) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:33:01] godog: It seems to work for me, I tried logging in from an incognito window. 
[14:33:02] FIRING: [6x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:33:09] FIRING: [3x] RedisMemoryFull: Redis memory full on gitlab1003:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [14:33:18] FIRING: JobUnavailable: Reduced availability for job icinga-am in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:33:25] denisse: yes I just started prometheus-alertmanager on alert2002, it was stopped previously [14:36:11] arnaudb: denisse: thanks! My best wishes for the ongoing maintenance [14:36:28] hashar: Thanks! :) [14:36:32] <3 [14:37:56] FIRING: [8x] ProbeDown: Service puppetmaster1001:8141 has failed probes (http_puppetmaster1001_eqiad_wmnet_backend_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:38:00] FIRING: [6x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:38:12] FIRING: [2x] JobUnavailable: Reduced availability for job icinga-am in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:12] godog: Looking at the NTP peers Icinga alerts, clicking on 
any host shows that everything is fine. PING OK - Packet loss = 0% [14:38:25] FIRING: [6x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:39:01] godog: I wonder if this could be related to the `sync_check_icinga_contacts.service` issue where the keyholder doesn't seem to want to hand off the key. [14:39:14] denisse: definitely a different issue in this case [14:39:24] (03PS1) 10Effie Mouzeli: Revert "trafficserver: Allow XWD to be used for wikitech" [puppet] - 10https://gerrit.wikimedia.org/r/1070266 [14:39:40] FIRING: [6x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:40:05] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2028 to wikikube-worker2072 [14:40:20] godog: I'm looking to see if I missed something in the Puppet config. 
[14:40:25] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [14:41:52] PROBLEM - SSH on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:42:39] !log Restarting CI Jenkins [14:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:55] FIRING: [6x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:43:55] (03PS6) 10Andrew Bogott: Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) [14:43:55] (03PS1) 10Andrew Bogott: keystone/apache: fix OIDC settings again! [puppet] - 10https://gerrit.wikimedia.org/r/1070267 (https://phabricator.wikimedia.org/T359590) [14:44:02] denisse: ack [14:44:05] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2055 to wikikube-worker2073 [14:44:07] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070267 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [14:44:15] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2028 to wikikube-worker2072 - hnowlan@cumin1002" [14:44:24] (03CR) 10CI reject: [V:04-1] keystone/apache: fix OIDC settings again! 
[puppet] - 10https://gerrit.wikimedia.org/r/1070267 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [14:44:33] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2028 to wikikube-worker2072 - hnowlan@cumin1002" [14:44:33] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:44:34] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2072 [14:44:34] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [14:45:34] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2072 [14:46:13] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2028 to wikikube-worker2072 [14:46:20] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10113353 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from kubernetes2028 to wikikube-worker... [14:47:58] 10ops-eqiad, 06SRE, 06DC-Ops: puppetmaster1003: broken disk - https://phabricator.wikimedia.org/T373888 (10MoritzMuehlenhoff) 03NEW [14:48:55] godog: Everything looks good in the Puppet and DNS repositories. 
[14:48:59] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2055 to wikikube-worker2073 - hnowlan@cumin1002" [14:49:24] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2055 to wikikube-worker2073 - hnowlan@cumin1002" [14:49:24] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:49:25] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2073 [14:49:28] godog: I wonder if restarting icinga would trigger the checks again, what do you think? [14:49:36] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2073 [14:50:15] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2055 to wikikube-worker2073 [14:50:20] denisse: it will but I doubt it will fix anything, you can reproduce the problem with for example ssh lists1004.wikimedia.org from alert2002, which should work and doesn't [14:50:27] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10113386 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from kubernetes2055 to wikikube-worker... [14:50:33] 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#10113387 (10elukey) Happened again for: ` [2024-09-03T14:39:43] Unable to update host 'lvs3009.esams.wmnet' Traceback (most recent call l... 
[14:51:04] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2422 to wikikube-worker2074 [14:51:14] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2423 to wikikube-worker2075 [14:51:32] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [14:52:41] godog: You're right... I'm wondering if it could be a missing ferm rule... [14:52:43] (03PS2) 10Andrew Bogott: keystone/apache: fix OIDC settings again! [puppet] - 10https://gerrit.wikimedia.org/r/1070267 (https://phabricator.wikimedia.org/T359590) [14:52:43] (03PS7) 10Andrew Bogott: Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) [14:52:53] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070267 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [14:53:12] FIRING: [2x] JobUnavailable: Reduced availability for job icinga-am in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:53:55] godog: I think we may be missing a Puppet run on those hosts, I ran puppet on lists1004 and one of the changes is Ssh::Client/File[/etc/ssh/ssh_known_hosts. 
[14:55:39] denisse: I also doubt that's the problem, that file doesn't control access [14:55:49] but yes definitely sth firewall related [14:56:27] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2423 to wikikube-worker2075 - hnowlan@cumin1002" [14:56:39] (03CR) 10Elukey: "I tried to run the cookbook with test-cookbook and https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1070263 seems to be th" [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [14:57:04] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [14:57:10] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2423 to wikikube-worker2075 - hnowlan@cumin1002" [14:57:11] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:57:11] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2075 [14:57:30] ok here we go, the new alert hosts are not in /etc/nftables [14:57:49] That would explain it! ;o [14:58:15] Let me see how to add them. 
[14:58:25] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2075 [14:58:56] (03Abandoned) 10Effie Mouzeli: Revert "trafficserver: Allow XWD to be used for wikitech" [puppet] - 10https://gerrit.wikimedia.org/r/1070266 (owner: 10Effie Mouzeli) [14:59:04] (03CR) 10FNegri: [C:03+1] cloud: refresh conntrack values for cloudgw/neutron [puppet] - 10https://gerrit.wikimedia.org/r/1070221 (https://phabricator.wikimedia.org/T373816) (owner: 10Arturo Borrero Gonzalez) [14:59:04] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2423 to wikikube-worker2075 [14:59:12] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10113412 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2423 to wikikube-worker2075 com... [14:59:23] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:59:24] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2074 [14:59:52] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on backup2003 - https://phabricator.wikimedia.org/T372698#10113413 (10Jhancock.wm) 05Open→03Resolved drive has been replaced. [15:00:05] eoghan, jelto, arnoldokoth, and mutante: gettimeofday() says it's time for SRE Collaboration Services office hours. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240903T1500) [15:00:13] (03CR) 10David Caro: [C:03+1] cloud: refresh conntrack values for cloudgw/neutron [puppet] - 10https://gerrit.wikimedia.org/r/1070221 (https://phabricator.wikimedia.org/T373816) (owner: 10Arturo Borrero Gonzalez) [15:00:17] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2074 [15:00:55] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2422 to wikikube-worker2074 [15:01:03] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10113418 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2422 to wikikube-worker2074 com... [15:01:30] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2072.codfw.wmnet with OS bullseye [15:01:40] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2072 [15:01:44] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10113419 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker2072.codf... [15:01:46] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [15:02:07] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2073.codfw.wmnet with OS bullseye [15:02:17] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10113422 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker2073.codf... 
[15:03:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job icinga-am in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:03:29] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070272 [15:03:29] (03PS3) 10Andrew Bogott: keystone/apache: fix OIDC settings again! [puppet] - 10https://gerrit.wikimedia.org/r/1070267 (https://phabricator.wikimedia.org/T359590) [15:03:29] (03PS8) 10Andrew Bogott: Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) [15:04:01] (03PS1) 10Muehlenhoff: Don't uninstall libnet-dns-perl when moving from ferm to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1070273 (https://phabricator.wikimedia.org/T373637) [15:04:21] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: ganeti2009 - ManagementSSHDown - https://phabricator.wikimedia.org/T373727#10113428 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm server found with mgmt up. confirmed can login. must have corrected itself. 
[15:04:53] (03PS1) 10Alexandros Kosiaris: ats: Fix issue with /api/ pointing to /w/rest.php [puppet] - 10https://gerrit.wikimedia.org/r/1070274 (https://phabricator.wikimedia.org/T364400) [15:05:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070273 (https://phabricator.wikimedia.org/T373637) (owner: 10Muehlenhoff) [15:05:20] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070267 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [15:06:20] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2072 - hnowlan@cumin1002" [15:06:25] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2072 - hnowlan@cumin1002" [15:06:25] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:06:26] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2072.codfw.wmnet 89.0.192.10.in-addr.arpa 9.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:06:29] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2072.codfw.wmnet 89.0.192.10.in-addr.arpa 9.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:06:29] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2072 [15:06:41] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2072 [15:06:41] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2072 [15:06:53] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2073 
[15:07:09] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [15:07:55] FIRING: [6x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:08:09] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloud: refresh conntrack values for cloudgw/neutron [puppet] - 10https://gerrit.wikimedia.org/r/1070221 (https://phabricator.wikimedia.org/T373816) (owner: 10Arturo Borrero Gonzalez) [15:09:33] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:09:49] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:10:11] (03CR) 10Herron: [C:03+2] grafana: set thanos as default datasource [puppet] - 10https://gerrit.wikimedia.org/r/1069230 (https://phabricator.wikimedia.org/T269333) (owner: 10Herron) [15:10:49] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2073 - hnowlan@cumin1002" [15:10:53] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2073 - hnowlan@cumin1002" [15:10:53] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:10:53] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2073.codfw.wmnet 25.0.192.10.in-addr.arpa 5.2.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:10:56] !log 
hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2073.codfw.wmnet 25.0.192.10.in-addr.arpa 5.2.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:10:57] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2073 [15:11:48] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2073 [15:11:48] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2073 [15:12:55] FIRING: [6x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:17:02] (03CR) 10Vgutierrez: "please add the missing bits to write the .prom file on disk, script looking good overall" [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [15:21:43] PROBLEM - Host kubernetes2028 is DOWN: PING CRITICAL - Packet loss = 100% [15:25:06] FIRING: [6x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:27:55] FIRING: [6x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:29:39] (03PS1) 10Ebernhardson: cloudelastic: Increase heap by 2g in small clusters [puppet] - 10https://gerrit.wikimedia.org/r/1070277 [15:31:55] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make 
room for new Network devices - https://phabricator.wikimedia.org/T373893 (10Papaul) 03NEW [15:32:28] (03PS2) 10Bking: cloudelastic: Increase heap by 2g in small clusters [puppet] - 10https://gerrit.wikimedia.org/r/1070277 (owner: 10Ebernhardson) [15:32:33] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070277 (owner: 10Ebernhardson) [15:33:42] (03PS3) 10Bking: cloudelastic: Increase heap by 2g in small clusters [puppet] - 10https://gerrit.wikimedia.org/r/1070277 (owner: 10Ebernhardson) [15:33:48] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070277 (owner: 10Ebernhardson) [15:34:40] 10ops-codfw, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T373894 (10phaultfinder) 03NEW [15:35:32] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10113626 (10wiki_willy) Thanks @dcaro, sounds good. I'll bug them again about the drive number, if we don't hear bac... 
[15:35:40] (03CR) 10Bking: [C:03+2] cloudelastic: Increase heap by 2g in small clusters [puppet] - 10https://gerrit.wikimedia.org/r/1070277 (owner: 10Ebernhardson) [15:39:40] FIRING: [6x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:42:55] FIRING: [6x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:45:24] !log bounce icinga on alert2002 [15:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:38] (03CR) 10Dzahn: [C:03+1] "latest observations have been that there are no more IPs being affected with these values, let's go" [puppet] - 10https://gerrit.wikimedia.org/r/1058608 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [15:47:29] (03PS1) 10Ebernhardson: cirrus: Introduce an expensive query pool counter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070281 (https://phabricator.wikimedia.org/T369808) [15:47:31] (03PS1) 10Ebernhardson: cirrus: Remove unused Regex pool counter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070282 (https://phabricator.wikimedia.org/T369808) [15:47:41] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10113655 (10Jhancock.wm) Proposed: We can move frpig to U21 and use ports 20 on the new switches. We can move pay-lb2002 to U15 and use ports 14 on the new sw... 
[15:48:03] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on puppetmaster1003 - https://phabricator.wikimedia.org/T373235#10113665 (10VRiley-WMF) →14Duplicate dup:03T373888 [15:48:23] PROBLEM - Host kubernetes2028 is DOWN: PING CRITICAL - Packet loss = 100% [15:48:37] (03PS2) 10Ebernhardson: cirrus: Remove unused Regex pool counter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070282 (https://phabricator.wikimedia.org/T369808) [15:48:50] (03CR) 10Ebernhardson: [C:04-2] "related patch is not deployed yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070282 (https://phabricator.wikimedia.org/T369808) (owner: 10Ebernhardson) [15:49:28] 10ops-eqiad, 06SRE, 06DC-Ops: puppetmaster1003: broken disk - https://phabricator.wikimedia.org/T373888#10113662 (10VRiley-WMF) [15:49:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070281 (https://phabricator.wikimedia.org/T369808) (owner: 10Ebernhardson) [15:49:35] 10ops-eqiad, 06SRE, 06DC-Ops: puppetmaster1003: broken disk - https://phabricator.wikimedia.org/T373888#10113670 (10VRiley-WMF) Hey @MoritzMuehlenhoff , thanks for reaching out on this ticket. Thankfully, I have been able to locate a replacement disk for this unit. We can swap this disk at anytime. [15:49:52] 10ops-eqiad, 06SRE, 06DC-Ops: puppetmaster1003: broken disk - https://phabricator.wikimedia.org/T373888#10113671 (10VRiley-WMF) a:03VRiley-WMF [15:50:49] (03PS1) 10Andrea Denisse: Revert "alert: Resolve alerts DNS queries to alert2002" [dns] - 10https://gerrit.wikimedia.org/r/1070283 [15:51:04] (03PS1) 10Andrea Denisse: Revert "alert: Failover from alert1001 to alert2002" [puppet] - 10https://gerrit.wikimedia.org/r/1070284 [15:51:11] (03PS1) 10Kamila Součková: kubernetes: Rename mw240[267] to wikikube-worker... 
[puppet] - 10https://gerrit.wikimedia.org/r/1070285 (https://phabricator.wikimedia.org/T372878) [15:51:23] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10113695 (10Jhancock.wm) [15:51:52] 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Relocate servers in C8 to make room for new Network devices - https://phabricator.wikimedia.org/T373893#10113697 (10Jhancock.wm) [15:52:02] (03CR) 10Filippo Giunchedi: [C:03+1] Revert "alert: Resolve alerts DNS queries to alert2002" [dns] - 10https://gerrit.wikimedia.org/r/1070283 (owner: 10Andrea Denisse) [15:52:15] (03CR) 10Filippo Giunchedi: [C:03+1] Revert "alert: Failover from alert1001 to alert2002" [puppet] - 10https://gerrit.wikimedia.org/r/1070284 (owner: 10Andrea Denisse) [15:52:15] (03CR) 10Andrea Denisse: [C:03+2] Revert "alert: Resolve alerts DNS queries to alert2002" [dns] - 10https://gerrit.wikimedia.org/r/1070283 (owner: 10Andrea Denisse) [15:52:24] (03CR) 10Andrea Denisse: [C:03+2] Revert "alert: Failover from alert1001 to alert2002" [puppet] - 10https://gerrit.wikimedia.org/r/1070284 (owner: 10Andrea Denisse) [15:53:01] (03PS1) 10Andrea Denisse: Revert "alert: Enable the alert[12]002 hosts as alertmanagers" [puppet] - 10https://gerrit.wikimedia.org/r/1070286 [15:53:13] (03CR) 10Filippo Giunchedi: [C:03+1] Revert "alert: Enable the alert[12]002 hosts as alertmanagers" [puppet] - 10https://gerrit.wikimedia.org/r/1070286 (owner: 10Andrea Denisse) [15:53:23] (03CR) 10Andrea Denisse: [C:03+1] Revert "alert: Enable the alert[12]002 hosts as alertmanagers" [puppet] - 10https://gerrit.wikimedia.org/r/1070286 (owner: 10Andrea Denisse) [15:53:25] (03CR) 10Andrea Denisse: [C:03+2] Revert "alert: Enable the alert[12]002 hosts as alertmanagers" [puppet] - 10https://gerrit.wikimedia.org/r/1070286 (owner: 10Andrea Denisse) [15:58:22] PROBLEM - Host kubernetes2028 is DOWN: PING CRITICAL - Packet loss = 100% [15:58:22] RECOVERY - 
SSH on etherpad1004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:58:26] !log bking@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=wdqs-main,name=codfw [15:58:47] !log Reverting back to alert1001 - T372418 [15:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:49] T372418: Put the alert1002 and alert2002 hosts in production - https://phabricator.wikimedia.org/T372418 [15:59:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:59:06] !log Enabling meta monitoring for alert[12]001 - T372418 [15:59:06] RECOVERY - SSH on planet2003 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:27] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:00:04] jhathaway and rzl: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240903T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:00:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069334 (https://phabricator.wikimedia.org/T372757) (owner: 10Stoyofuku-wmf) [16:00:26] RECOVERY - SSH on aphlict2001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:01:28] RECOVERY - NTP peers on dns2004 is OK: NTP OK: Offset -0.000199836 secs https://wikitech.wikimedia.org/wiki/NTP [16:01:30] RECOVERY - NTP peers on dns3003 is OK: NTP OK: Offset -0.00022357 secs https://wikitech.wikimedia.org/wiki/NTP [16:01:30] RECOVERY - NTP peers on dns5003 is OK: NTP OK: Offset 0.00088032 secs https://wikitech.wikimedia.org/wiki/NTP [16:01:32] RECOVERY - SSH on durum1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:01:32] RECOVERY - SSH on durum4002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:01:32] RECOVERY - SSH on durum5002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:01:36] (03CR) 10Volans: [C:03+1] "Fair enough" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070263 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [16:01:45] (03PS1) 10Ottomata: refine_eventlogging_analytics - only refine MediaWikiPingback data [puppet] - 10https://gerrit.wikimedia.org/r/1070288 (https://phabricator.wikimedia.org/T238230) [16:01:48] RECOVERY - SSH on gerrit1003 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:01:48] RECOVERY - SSH on gitlab1003 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:01:58] !log bking@cumin2002 
START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply heap settings - bking@cumin2002 - T373895 [16:02:03] (03PS1) 10Hnowlan: site: fix role for reimaged kubernetes hosts [puppet] - 10https://gerrit.wikimedia.org/r/1070289 (https://phabricator.wikimedia.org/T372878) [16:02:09] T373895: Reduce frequency of garbage collection alerts on cloudelastic - https://phabricator.wikimedia.org/T373895 [16:03:31] (03PS1) 10Volans: setup.py: fix test dependency removed upstream [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1070291 [16:03:31] (03PS1) 10Volans: interactive: log the user input only if is valid [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1070292 [16:03:38] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp7015.magru.wmnet with reason: T371554 [16:03:50] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:03:53] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp7015.magru.wmnet with reason: T371554 [16:04:50] RECOVERY - NTP peers on dns7002 is OK: NTP OK: Offset -9.2865e-05 secs https://wikitech.wikimedia.org/wiki/NTP [16:04:52] RECOVERY - SSH on durum3004 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:04:52] (03CR) 10Scott French: [C:03+1] site: fix role for reimaged kubernetes hosts [puppet] - 10https://gerrit.wikimedia.org/r/1070289 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan) [16:04:53] (03CR) 10Ottomata: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3840/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070288 (https://phabricator.wikimedia.org/T238230) (owner: 
10Ottomata) [16:05:02] RECOVERY - SSH on etherpad2002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:06:12] FIRING: [2x] JobUnavailable: Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:07:06] (03PS1) 10Brouberol: datasets-config: replace hardcoded gitlab IPs by external_services entry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070294 [16:07:15] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:07:16] (03CR) 10Ottomata: refine_eventlogging_analytics - only refine MediaWikiPingback data [puppet] - 10https://gerrit.wikimedia.org/r/1070288 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [16:07:23] (03PS2) 10Brouberol: datasets-config: replace hardcoded gitlab IPs by external_services entry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070294 [16:07:28] (03PS1) 10Dzahn: gerrit: slightly raise throttling tresholds one last time [puppet] - 10https://gerrit.wikimedia.org/r/1070295 (https://phabricator.wikimedia.org/T365259) [16:07:56] (03CR) 10Dzahn: [C:03+2] gerrit: slightly raise throttling tresholds one last time [puppet] - 10https://gerrit.wikimedia.org/r/1070295 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [16:08:46] (03CR) 10Hnowlan: [C:03+2] site: fix role for reimaged kubernetes hosts [puppet] - 10https://gerrit.wikimedia.org/r/1070289 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan) [16:09:29] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (None, T373791) xfer wikidata_main from wdqs2022.codfw.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling neither afterwards [16:09:32] T373791: Transfer a sane journal (subgraph:main) 
to wdqs2021 from wdqs2022 - https://phabricator.wikimedia.org/T373791 [16:10:17] (03CR) 10Scott French: [C:03+1] kubernetes: Rename mw240[267] to wikikube-worker... (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1070285 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková) [16:10:55] (03CR) 10Ottomata: [C:03+2] "I tested this manually on the CLI with dry_run." [puppet] - 10https://gerrit.wikimedia.org/r/1070288 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [16:11:57] FIRING: ProbeDown: Service wdqs-main:443 has failed probes (http_wdqs-main_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-main:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:12:01] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:12:17] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:12:40] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for ihurbain - https://phabricator.wikimedia.org/T373811#10113806 (10andrea.denisse) a:03andrea.denisse [16:13:21] (03CR) 10Dzahn: [C:03+2] contint: add java jdk-17 packages in addition to jdk-11 [puppet] - 10https://gerrit.wikimedia.org/r/1069325 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [16:16:00] (03CR) 10Dzahn: [C:03+2] "Notice: /Stage[main]/Java/Java::Package[openjdk-jdk-17]/Package[openjdk-17-jdk]/ensure: created" [puppet] - 10https://gerrit.wikimedia.org/r/1069325 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [16:16:12] FIRING: [2x] JobUnavailable: Reduced availability for job alertmanager in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:17:00] !log contint - installed java jdk 17 packages - just installed, in parallel to existing jdk 11, no change to java_home / what is used by CI yet. T359795 [16:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:03] T359795: Switch Jenkins instances from Java 11 to Java 17 - https://phabricator.wikimedia.org/T359795 [16:17:24] * Emperor here, wasn't expecting a page at this time of day [16:17:33] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/nda for Southparkfan - https://phabricator.wikimedia.org/T373518#10113833 (10Southparkfan) For the record: there seem to be a few IdP-related issues in Netbox (T373702), but despite that, this LDAP access request is still valid. [16:17:59] Here too. [16:18:02] !incidents [16:18:02] 5133 (ACKED) ProbeDown sre (10.2.1.33 ip4 wdqs-main:443 probes/service http_wdqs-main_ip4 codfw) [16:18:03] 5131 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-api-ext-ro.discovery.wmnet eqiad) [16:18:03] 5130 (RESOLVED) db1206 (paged)/MariaDB Replica Lag: s1 (paged) [16:18:18] from -search channel there's a cloudelastic restart ongoing [16:18:41] ok [16:19:03] I can't imagine that affecting wdqs though - disregard [16:19:13] ryankemper: wdqs-main seems to have a problem [16:20:20] (03PS1) 10EoghanGaffney: mailman: Move /var/lib/mailman to /srv/mailman [puppet] - 10https://gerrit.wikimedia.org/r/1070297 (https://phabricator.wikimedia.org/T373846) [16:20:55] inflatador btullis brouberol ^^ [16:21:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable 
[16:21:14] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2073.codfw.wmnet with reason: host reimage [16:21:21] see above: Cookbook sre.wdqs.data-transfer was started right before wdqs had the issue [16:22:30] it might just be the depooling part, if they effectively depooled all the responsive nodes [16:22:30] it's new that this is an LVS service, probably it was just not expected that it would page for "down but pooled" [16:22:38] what bblack said [16:22:52] the cookbook probably did [16:22:58] but I mean, depooling all the responsive nodes does imply everything's borked, seems valid to page on [16:23:45] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2073.codfw.wmnet with reason: host reimage [16:24:23] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2072.codfw.wmnet with reason: host reimage [16:24:25] inflatador: ? [16:24:30] mutante bblack cwhite apologies for the confusion, been discussing it in security [16:24:46] TLDR is that wdqs and its services shouldn't ever p-age mainline SRE [16:25:15] that is now harder since every LVS service would do that by default [16:25:25] FIRING: [3x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:25:56] FIRING: [6x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:26:30] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply heap settings - 
bking@cumin2002 - T373895 [16:26:32] T373895: Reduce frequency of garbage collection alerts on cloudelastic - https://phabricator.wikimedia.org/T373895 [16:27:04] !log About to deploy analytics/refinery [16:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:16] (03CR) 10Ottomata: eventgate: Disable end-to-end readinessProbe by default (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070257 (https://phabricator.wikimedia.org/T373192) (owner: 10JMeybohm) [16:27:28] !log aqu@deploy1003 Started deploy [analytics/refinery@07fd127]: Regular analytics weekly train [analytics/refinery@07fd1275] [16:27:40] (03CR) 10TChin: [C:03+1] datasets-config: replace hardcoded gitlab IPs by external_services entry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070294 (owner: 10Brouberol) [16:27:44] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2072.codfw.wmnet with reason: host reimage [16:31:04] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/nda for Southparkfan - https://phabricator.wikimedia.org/T373518#10113935 (10andrea.denisse) a:03andrea.denisse [16:34:16] 06SRE, 10Dumps 2.0, 10Dumps-Generation, 13Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10113946 (10xcollazo) > We can't fully isolate the replica from the rest of the traffic but as you can see all alerts are... 
[16:34:43] (03CR) 10Brouberol: [C:03+2] datasets-config: replace hardcoded gitlab IPs by external_services entry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070294 (owner: 10Brouberol) [16:35:37] !log aqu@deploy1003 Finished deploy [analytics/refinery@07fd127]: Regular analytics weekly train [analytics/refinery@07fd1275] (duration: 08m 09s) [16:38:46] (03PS1) 10Bking: wdqs-main,wdqs-scholarly: disable paging [puppet] - 10https://gerrit.wikimedia.org/r/1070301 (https://phabricator.wikimedia.org/T373145) [16:39:55] (03PS2) 10Cwhite: loki: increase chunk flush interval [puppet] - 10https://gerrit.wikimedia.org/r/1069301 (https://phabricator.wikimedia.org/T335610) [16:40:02] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1070301 (https://phabricator.wikimedia.org/T373145) (owner: 10Bking) [16:40:12] (03CR) 10Dzahn: [C:03+1] wdqs-main,wdqs-scholarly: disable paging [puppet] - 10https://gerrit.wikimedia.org/r/1070301 (https://phabricator.wikimedia.org/T373145) (owner: 10Bking) [16:40:19] (03PS2) 10Kamila Součková: kubernetes: Rename mw240[267] to wikikube-worker... [puppet] - 10https://gerrit.wikimedia.org/r/1070285 (https://phabricator.wikimedia.org/T372878) [16:40:36] (03CR) 10BCornwall: [C:03+1] wdqs-main,wdqs-scholarly: disable paging [puppet] - 10https://gerrit.wikimedia.org/r/1070301 (https://phabricator.wikimedia.org/T373145) (owner: 10Bking) [16:41:29] 06SRE, 10Dumps 2.0, 10Dumps-Generation, 13Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10113986 (10xcollazo) >>! In T368098#10111017, @Ladsgroup wrote: > Hi, this has caused ~12 alerts just since this weekend... 
[16:41:55] (03PS7) 10Volans: sre.switchdc.databases: new cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) [16:41:57] (03PS2) 10Bking: wdqs-main,wdqs-scholarly: disable paging [puppet] - 10https://gerrit.wikimedia.org/r/1070301 (https://phabricator.wikimedia.org/T373145) [16:42:37] (03CR) 10Volans: "Addressed comments/issues, this should be the PS to merge. I'll send tomorrow a new PS with modifications to be able to test the cookbooks" [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [16:43:06] (03CR) 10Kamila Součková: kubernetes: Rename mw240[267] to wikikube-worker... (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1070285 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková) [16:43:27] (03CR) 10Dzahn: [C:03+1] wdqs-main,wdqs-scholarly: disable paging [puppet] - 10https://gerrit.wikimedia.org/r/1070301 (https://phabricator.wikimedia.org/T373145) (owner: 10Bking) [16:44:19] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2402.codfw.wmnet [16:44:35] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2406.codfw.wmnet [16:44:53] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2402.codfw.wmnet [16:45:08] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2406.codfw.wmnet [16:45:09] (03CR) 10Bking: [C:03+2] wdqs-main,wdqs-scholarly: disable paging [puppet] - 10https://gerrit.wikimedia.org/r/1070301 (https://phabricator.wikimedia.org/T373145) (owner: 10Bking) [16:45:17] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2407.codfw.wmnet [16:45:51] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2407.codfw.wmnet [16:45:51] !log hnowlan@cumin1002 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host wikikube-worker2072.codfw.wmnet with OS bullseye [16:46:11] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10114001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... [16:46:58] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2073.codfw.wmnet with OS bullseye [16:47:14] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10114007 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... [16:47:22] (03CR) 10Kamila Součková: [C:03+1] "copying +1 from previous, as it is a ~trivial rebase" [puppet] - 10https://gerrit.wikimedia.org/r/1070285 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková) [16:47:58] (03CR) 10Kamila Součková: [C:03+2] kubernetes: Rename mw240[267] to wikikube-worker... 
[puppet] - 10https://gerrit.wikimedia.org/r/1070285 (https://phabricator.wikimedia.org/T372878) (owner: 10Kamila Součková) [16:48:59] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373591#10114003 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:49:16] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2074.codfw.wmnet with OS bullseye [16:49:17] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2075.codfw.wmnet with OS bullseye [16:49:26] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2074 [16:49:31] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10114037 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... [16:49:34] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10114038 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wi... 
[16:51:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:51:50] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [16:52:54] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw2402 to wikikube-worker2076 [16:55:44] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:56:15] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2074 - hnowlan@cumin1002" [16:56:26] (03CR) 10Volans: "sorry for the random comment, I'm skimming CRs in my email backlog ;)" [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [16:56:55] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:57:56] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (None, T373791) xfer wikidata_main from wdqs2022.codfw.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling neither afterwards [16:57:58] T373791: Transfer a sane journal (subgraph:main) to wdqs2021 from wdqs2022 - https://phabricator.wikimedia.org/T373791 [16:58:04] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:58:51] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:59:24] !log hnowlan@cumin1002 
END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2074 - hnowlan@cumin1002" [16:59:24] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:59:24] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2074.codfw.wmnet 78.0.192.10.in-addr.arpa 8.7.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:59:25] FIRING: [4x] SystemdUnitFailed: wdqs-blazegraph.service on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:59:28] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2074.codfw.wmnet 78.0.192.10.in-addr.arpa 8.7.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:59:28] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2074 [17:00:04] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw2406 to wikikube-worker2077 [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240903T1700) [17:00:08] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2402 to wikikube-worker2076 - kamila@cumin1002" [17:00:12] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2402 to wikikube-worker2076 - kamila@cumin1002" [17:00:13] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:00:13] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host 
wikikube-worker2076 [17:00:21] FIRING: [9x] ProbeDown: Service puppetmaster1001:8141 has failed probes (http_puppetmaster1001_eqiad_wmnet_backend_https_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:00:21] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [17:00:25] RESOLVED: [4x] SystemdUnitFailed: wdqs-blazegraph.service on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:00:31] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2074 [17:00:31] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2074 [17:00:48] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2075 [17:00:57] RESOLVED: ProbeDown: Service wdqs-main:443 has failed probes (http_wdqs-main_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-main:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:02:16] (03CR) 10Bking: [C:03+2] wdqs: Remove experimental configuration [puppet] - 10https://gerrit.wikimedia.org/r/1070197 (https://phabricator.wikimedia.org/T371833) (owner: 10Stevemunene) [17:03:01] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2076 [17:03:39] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw2407 to wikikube-worker2078 [17:03:41] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2402 to wikikube-worker2076 [17:04:04] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2406 to wikikube-worker2077 - kamila@cumin1002" 
[17:04:28] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10114108 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by kamila@cumin1002 from mw2402 to wi... [17:04:32] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2076.codfw.wmnet with OS bullseye [17:04:50] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2406 to wikikube-worker2077 - kamila@cumin1002" [17:04:50] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:04:50] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2077 [17:05:11] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [17:05:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2077 [17:05:35] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10114119 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wik... [17:06:00] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2406 to wikikube-worker2077 [17:06:12] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10114129 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by kamila@cumin1002 from mw2406 to wi... 
[17:06:55] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2077.codfw.wmnet with OS bullseye [17:07:08] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10114134 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wik... [17:08:22] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2407 to wikikube-worker2078 - kamila@cumin1002" [17:08:26] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2407 to wikikube-worker2078 - kamila@cumin1002" [17:08:26] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:08:27] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2078 [17:08:40] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2078 [17:08:52] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [17:09:19] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2407 to wikikube-worker2078 [17:09:30] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10114147 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by kamila@cumin1002 from mw2407 to wi... 
[17:10:56] FIRING: [6x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:11:08] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:11:08] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2075.codfw.wmnet 80.0.192.10.in-addr.arpa 0.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:11:12] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2075.codfw.wmnet 80.0.192.10.in-addr.arpa 0.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:11:13] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2075 [17:12:28] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2075 [17:12:28] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2075 [17:12:35] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2077 [17:12:46] (03PS3) 10Majavah: P:toolforge::bastion: Re-install joe [puppet] - 10https://gerrit.wikimedia.org/r/1059451 (https://phabricator.wikimedia.org/T371556) [17:12:55] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [17:13:13] 10ops-eqiad, 06SRE, 06DC-Ops: puppetmaster1003: broken disk - https://phabricator.wikimedia.org/T373888#10114164 (10MoritzMuehlenhoff) Nice! I suppose the disk swap needs downtime? 
Then I'll take the server out of rotation Thursday morning (I'm off tomorrow) [17:15:25] FIRING: [6x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:15:42] (03CR) 10David Caro: [C:03+2] P:toolforge::bastion: Re-install joe [puppet] - 10https://gerrit.wikimedia.org/r/1059451 (https://phabricator.wikimedia.org/T371556) (owner: 10Majavah) [17:16:12] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2077 - kamila@cumin1002" [17:16:17] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2077 - kamila@cumin1002" [17:16:17] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:16:17] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2077.codfw.wmnet 71.0.192.10.in-addr.arpa 1.7.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:16:17] is there anyone around who can take a look at icinga passive checks? 
[17:16:20] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2077.codfw.wmnet 71.0.192.10.in-addr.arpa 1.7.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:16:21] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2077 [17:16:34] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2077 [17:16:34] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2077 [17:17:11] !log homer 'lsw1-a5-codfw*' && homer 'lsw1-a6-codfw*' commit [17:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:29] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2076 [17:17:49] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [17:19:33] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2078.codfw.wmnet with OS bullseye [17:19:39] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2074.codfw.wmnet with reason: host reimage [17:19:42] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10114188 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wik... [17:20:25] FIRING: [6x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:20:55] re. 
icinga: nevermind, I just saw the comment on T372418 that we've flipped back to alert1001 [17:20:56] T372418: Put the alert1002 and alert2002 hosts in production - https://phabricator.wikimedia.org/T372418 [17:21:05] PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Networ [17:21:05] ring%23BGP_status [17:21:05] (03CR) 10Dzahn: [C:03+2] Filter out addresses that cannot be removed from VRTS [puppet] - 10https://gerrit.wikimedia.org/r/1070222 (https://phabricator.wikimedia.org/T368257) (owner: 10LSobanski) [17:21:21] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:21:30] !log homer 'cr*codfw*' commit [17:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:40] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2076 - kamila@cumin1002" [17:21:44] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2076 - kamila@cumin1002" [17:21:44] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:21:44] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2076.codfw.wmnet 66.0.192.10.in-addr.arpa 6.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:21:47] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 
wikikube-worker2076.codfw.wmnet 66.0.192.10.in-addr.arpa 6.6.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:21:48] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2076 [17:21:50] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2076 [17:21:51] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2076 [17:22:03] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2078 [17:22:33] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [17:23:26] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2074.codfw.wmnet with reason: host reimage [17:25:24] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:25:25] FIRING: [6x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:26:03] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2078 - kamila@cumin1002" [17:26:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2078 - kamila@cumin1002" [17:26:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:26:07] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2078.codfw.wmnet 75.0.192.10.in-addr.arpa 5.7.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors 
[17:26:10] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2078.codfw.wmnet 75.0.192.10.in-addr.arpa 5.7.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:26:11] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2078 [17:26:23] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2078 [17:26:24] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2078 [17:29:51] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2075.codfw.wmnet with reason: host reimage [17:29:54] (03PS2) 10EoghanGaffney: mailman: Move /var/lib/mailman to /srv/mailman [puppet] - 10https://gerrit.wikimedia.org/r/1070297 (https://phabricator.wikimedia.org/T373846) [17:30:17] (03CR) 10CI reject: [V:04-1] mailman: Move /var/lib/mailman to /srv/mailman [puppet] - 10https://gerrit.wikimedia.org/r/1070297 (https://phabricator.wikimedia.org/T373846) (owner: 10EoghanGaffney) [17:31:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T373696#10114261 (10VRiley-WMF) [17:31:25] (03PS3) 10EoghanGaffney: mailman: Move /var/lib/mailman to /srv/mailman [puppet] - 10https://gerrit.wikimedia.org/r/1070297 (https://phabricator.wikimedia.org/T373846) [17:31:43] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [17:31:47] FIRING: [6x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:32:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - 
https://phabricator.wikimedia.org/T373696#10114244 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF This is completed. Thank you! [17:33:02] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2077.codfw.wmnet with reason: host reimage [17:33:05] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 395, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:33:14] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2075.codfw.wmnet with reason: host reimage [17:33:18] 10ops-eqiad, 06SRE, 06DC-Ops: puppetmaster1003: broken disk - https://phabricator.wikimedia.org/T373888#10114282 (10VRiley-WMF) Sure, that will work for us. We will plan for it then. Thank you! [17:34:24] FIRING: [2x] ProbeDown: Service miscweb2003:443 has failed probes (http_query_full_experimental_wikidata_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:34:54] looking at this one [17:35:04] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [17:36:09] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2077.codfw.wmnet with reason: host reimage [17:38:14] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2076.codfw.wmnet with reason: host reimage [17:39:24] FIRING: [6x] ProbeDown: Service miscweb1003:443 has failed probes (http_query_full_experimental_wikidata_org_collab_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:40:28] ^ this is a false positive caused by moving services to new backends.. 
names were removed from certs without removing monitoring. will ACK it and now need follow-up fix [17:40:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2076.codfw.wmnet with reason: host reimage [17:41:37] FIRING: [6x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:42:29] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2078.codfw.wmnet with reason: host reimage [17:43:40] (03CR) 10Scott French: [C:03+1] sre.k8s.renumber-node: Handle renamed host [cookbooks] - 10https://gerrit.wikimedia.org/r/1068779 (owner: 10Clément Goubert) [17:45:36] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2078.codfw.wmnet with reason: host reimage [17:48:21] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2074.codfw.wmnet with OS bullseye [17:48:35] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10114421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... 
[17:49:42] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 477, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:53:11] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2075.codfw.wmnet with OS bullseye [17:53:21] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10114457 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... [17:53:27] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2072.codfw.wmnet [17:53:29] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2072.codfw.wmnet [17:53:35] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2073.codfw.wmnet [17:53:37] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2073.codfw.wmnet [17:53:42] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2074.codfw.wmnet [17:53:44] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2074.codfw.wmnet [17:53:49] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2075.codfw.wmnet [17:53:51] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2075.codfw.wmnet [17:54:51] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373916 (10hnowlan) 03NEW [17:56:40] RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 26, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:57:37] 06SRE, 10Dumps 2.0, 10Dumps-Generation, 13Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10114489 (10Ladsgroup) >>! In T368098#10113946, @xcollazo wrote: >> We can't fully isolate the replica from the rest of th... [18:00:05] dancy and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240903T1800) [18:00:11] o/ [18:01:16] (03PS1) 10TrainBranchBot: group0 to 1.43.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070306 (https://phabricator.wikimedia.org/T373640) [18:01:17] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.43.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070306 (https://phabricator.wikimedia.org/T373640) (owner: 10TrainBranchBot) [18:02:02] (03Merged) 10jenkins-bot: group0 to 1.43.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070306 (https://phabricator.wikimedia.org/T373640) (owner: 10TrainBranchBot) [18:04:24] RESOLVED: [6x] ProbeDown: Service miscweb1003:443 has failed probes (http_query_full_experimental_wikidata_org_collab_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:06:17] (03Abandoned) 10Ahmon Dancy: logspam: Consolidate CurlFactory cURL errors [puppet] - 10https://gerrit.wikimedia.org/r/1056221 (owner: 10Ahmon Dancy) [18:10:56] FIRING: [5x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:13:43] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.21 refs T373640 [18:13:47] T373640: 
1.43.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T373640 [18:14:30] (03PS1) 10Scott French: kubernetes: re-name/IP mw2260 and mw2312 [puppet] - 10https://gerrit.wikimedia.org/r/1070309 (https://phabricator.wikimedia.org/T372878) [18:14:38] (03PS1) 10Jdlrobson: Disable lead paragraph transform on Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070310 [18:20:25] FIRING: [5x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:20:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:25:37] (03PS4) 10Jdlrobson: Roll out appearance menu and font size change to sister projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059393 (https://phabricator.wikimedia.org/T371020) [18:30:25] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:35:31] FIRING: [3x] RedisMemoryFull: Redis memory full on gitlab1003:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [18:49:15] (03CR) 10Stoyofuku-wmf: [C:03+1] "yolo lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070310 (owner: 10Jdlrobson) [18:49:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, September 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1070310 (owner: 10Jdlrobson) [18:57:03] (03PS1) 10TrainBranchBot: testwikis to 1.43.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070315 (https://phabricator.wikimedia.org/T373640) [18:57:05] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.43.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070315 (https://phabricator.wikimedia.org/T373640) (owner: 10TrainBranchBot) [18:57:47] (03Merged) 10jenkins-bot: testwikis to 1.43.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070315 (https://phabricator.wikimedia.org/T373640) (owner: 10TrainBranchBot) [18:58:07] !log dancy@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.21 refs T373640 [18:58:10] T373640: 1.43.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T373640 [18:59:22] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2078.codfw.wmnet with OS bullseye [18:59:32] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10114731 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikub... [18:59:52] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2077.codfw.wmnet with OS bullseye [19:00:09] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10114732 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikub... 
[19:01:53] jouncebot: nowandnext [19:01:53] For the next 0 hour(s) and 58 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240903T1800) [19:01:54] In 0 hour(s) and 58 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240903T2000) [19:02:18] Train rollback is in progress. [19:02:24] (03CR) 10RLazarus: [C:03+1] kubernetes: re-name/IP mw2260 and mw2312 [puppet] - 10https://gerrit.wikimedia.org/r/1070309 (https://phabricator.wikimedia.org/T372878) (owner: 10Scott French) [19:03:21] dancy: let me know if the window is free and I can deploy a couple of changes [19:03:29] if and when [19:03:32] no rush [19:03:49] I'll notify you when the rollback is complete (seems to take about 13 minutes in total) and then it's all yours. [19:04:00] thank you! [19:04:43] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2260.codfw.wmnet [19:05:01] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2076.codfw.wmnet with OS bullseye [19:05:13] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10114737 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikub... 
[19:05:17] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2260.codfw.wmnet [19:05:49] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2312.codfw.wmnet [19:06:14] !log ran homer on lsw1-a5-codfw for T372878 [19:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:17] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [19:06:19] !log dancy@deploy1003 Finished scap sync-world: testwikis to 1.43.0-wmf.21 refs T373640 (duration: 08m 11s) [19:06:21] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2312.codfw.wmnet [19:06:22] T373640: 1.43.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T373640 [19:06:27] Amir1: You're up [19:07:10] (03CR) 10Scott French: [C:03+2] kubernetes: re-name/IP mw2260 and mw2312 [puppet] - 10https://gerrit.wikimedia.org/r/1070309 (https://phabricator.wikimedia.org/T372878) (owner: 10Scott French) [19:07:40] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2076.codfw.wmnet [19:07:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2076.codfw.wmnet [19:07:52] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2077.codfw.wmnet [19:07:54] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2077.codfw.wmnet [19:08:02] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2078.codfw.wmnet [19:08:03] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2078.codfw.wmnet [19:08:13] thank you [19:09:29] (03PS1) 10Jforrester: Revert "Add missing cases to ParserOutput::collectMetadata()" 
[core] (wmf/1.43.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1070319 (https://phabricator.wikimedia.org/T373920) [19:09:52] (03CR) 10Ladsgroup: [C:03+2] Revert "Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070237 (https://phabricator.wikimedia.org/T359529) (owner: 10Ladsgroup) [19:10:41] (03Merged) 10jenkins-bot: Revert "Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070237 (https://phabricator.wikimedia.org/T359529) (owner: 10Ladsgroup) [19:10:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:11:23] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1070237|Revert "Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki" (T359529)]] [19:11:30] T359529: Future of flaggedtemplates feature - https://phabricator.wikimedia.org/T359529 [19:11:46] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373916#10114752 (10kamila) [19:13:04] !log swfrench@cumin2002 START - Cookbook sre.hosts.rename from mw2312 to wikikube-worker2080 [19:13:23] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [19:15:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:16:42] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1070237|Revert "Set wgFlaggedRevsHandleIncludes to 
FR_INCLUDES_CURRENT on ruwiki" (T359529)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:16:45] T359529: Future of flaggedtemplates feature - https://phabricator.wikimedia.org/T359529 [19:17:34] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [19:17:38] !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2312 to wikikube-worker2080 - swfrench@cumin2002" [19:17:49] (03PS1) 10Ladsgroup: Fix bug causing review form to disappear on unreviewed pages [extensions/FlaggedRevs] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1070324 (https://phabricator.wikimedia.org/T373582) [19:18:18] !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2312 to wikikube-worker2080 - swfrench@cumin2002" [19:18:18] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:18:20] !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2080 [19:18:39] !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2080 [19:19:18] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2312 to wikikube-worker2080 [19:19:29] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10114778 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by swfrench@cumin2002 from mw2312 to... 
[19:20:13] !log swfrench@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2080.codfw.wmnet on all recursors [19:20:16] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2080.codfw.wmnet on all recursors [19:21:15] !log swfrench@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2080.codfw.wmnet with OS bullseye [19:21:27] !log swfrench@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2080 [19:21:28] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10114781 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by swfrench@cumin2002 for host w... [19:21:44] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [19:21:49] 10SRE-Access-Requests, 10LDAP-Access-Requests: Offboard Manuel (WMDE) from WMF systems - https://phabricator.wikimedia.org/T373927 (10WMDE-leszek) 03NEW [19:22:14] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1070237|Revert "Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki" (T359529)]] (duration: 10m 50s) [19:22:17] T359529: Future of flaggedtemplates feature - https://phabricator.wikimedia.org/T359529 [19:23:49] !log swfrench@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [19:27:31] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [19:30:46] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt phab1005 - jclark@cumin1002" [19:31:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt phab1005 - jclark@cumin1002" [19:31:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[19:31:44] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host phab1005 [19:32:52] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [19:33:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host phab1005 [19:34:41] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Offboard Manuel (WMDE) from WMF systems - https://phabricator.wikimedia.org/T373927#10114850 (10Aklapper) [19:35:08] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:35:08] !log swfrench@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2080.codfw.wmnet 159.16.192.10.in-addr.arpa 9.5.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:35:11] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2080.codfw.wmnet 159.16.192.10.in-addr.arpa 9.5.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:35:12] !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2080 [19:35:34] !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2080 [19:35:35] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2080 [19:38:36] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:38:38] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:41:08] (03PS1) 10Cwhite: beta-logs: split curator actions into jobs [puppet] - 10https://gerrit.wikimedia.org/r/1070328 
(https://phabricator.wikimedia.org/T364190) [19:41:10] (03PS1) 10Cwhite: logstash: split curator actions into jobs [puppet] - 10https://gerrit.wikimedia.org/r/1070329 (https://phabricator.wikimedia.org/T364190) [19:42:11] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host gerrit1004.mgmt.eqiad.wmnet with reboot policy FORCED [19:42:29] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host gerrit1004.mgmt.eqiad.wmnet with reboot policy FORCED [19:43:25] FIRING: SystemdUnitFailed: wmf_auto_restart_systemd-timesyncd.service on wikikube-worker2076:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:44:38] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host gerrit1004.mgmt.eqiad.wmnet with reboot policy FORCED [19:46:50] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host gerrit1004.mgmt.eqiad.wmnet with reboot policy FORCED [19:48:29] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host gerrit1004.mgmt.eqiad.wmnet with reboot policy FORCED [19:52:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host gerrit1004.mgmt.eqiad.wmnet with reboot policy FORCED [19:52:21] !log swfrench@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2080.codfw.wmnet with reason: host reimage [19:53:39] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [19:55:46] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2080.codfw.wmnet with reason: host reimage [19:56:43] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt phab1005 - jclark@cumin1002" [19:56:47] !log jclark@cumin1002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt phab1005 - jclark@cumin1002" [19:56:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:57:28] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host phab1005.mgmt.eqiad.wmnet with reboot policy FORCED [19:58:35] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab1005.mgmt.eqiad.wmnet with reboot policy FORCED [19:58:39] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T373894#10114975 (10Jhancock.wm) this one must be flapping. just closed one for this exact server that I could login to the mgmt of. ping test. some variance, but all successful. replaced cable. waited for Prometheus rerun. s... [19:58:48] (03PS1) 10Cwhite: logstash: increase logstash index pattern delete timeout to 5m [puppet] - 10https://gerrit.wikimedia.org/r/1070332 (https://phabricator.wikimedia.org/T364190) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240903T2000). [20:00:05] srishakatux, ebernhardson, toyofuku, and jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:17] \o [20:01:18] o/ I can self deploy after everyone's done [20:01:26] I can also self deploy [20:01:33] I need to practice now that I've done the training [20:01:35] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [20:01:50] (03CR) 10Cwhite: [C:03+2] "PCC OK: https://puppet-compiler.wmflabs.org/output/1070332/3841/" [puppet] - 10https://gerrit.wikimedia.org/r/1070332 (https://phabricator.wikimedia.org/T364190) (owner: 10Cwhite) [20:03:38] !log jclark@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:04:14] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [20:07:19] (03PS2) 10Jdlrobson: Disable lead paragraph transform on Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070310 [20:07:21] (03PS2) 10Stoyofuku-wmf: Turn on donate link in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069334 (https://phabricator.wikimedia.org/T372757) [20:07:24] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Update iDRAC on mw2260.codfw.wmnet - https://phabricator.wikimedia.org/T373934 (10Scott_French) 03NEW [20:07:25] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt phab1005 - jclark@cumin1002" [20:07:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt phab1005 - jclark@cumin1002" [20:07:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:07:41] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host phab1005 [20:07:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host phab1005 [20:08:05] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host phab1005.mgmt.eqiad.wmnet with reboot policy FORCED [20:11:11] Any objections to me doing 
the self-deploy while we wait? [20:12:00] I'd offer to deploy everyone else's code too but it's my first time without adult supervision and I'd like to keep the risks as low as possible [20:13:01] jan_drewniak: ^ [20:15:25] FIRING: [3x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:16:57] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070336 (https://phabricator.wikimedia.org/T128546) [20:18:30] hi - sorry to be late -- lmk when self-deploys are done - i can do the rest [20:18:50] oh you can go ahead! [20:18:57] I'll go after you - I haven't started yet [20:20:20] ok - wait - so Steph and Jan are self-deploying? [20:20:31] Yep! [20:20:56] I believe it's srishakatux and ebernhardson who need their patches deployed [20:20:57] * jan_drewniak that's correct, I think srishakatux and ebernhardson have backport patches [20:21:03] jinx [20:21:09] lol [20:21:37] cool - ok - then ebernhardson: i'll do yours first since it's config [20:21:43] (03PS2) 10Ebernhardson: cirrus: Introduce an expensive query pool counter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070281 (https://phabricator.wikimedia.org/T369808) [20:22:04] srishakatux: are you around? [20:23:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070281 (https://phabricator.wikimedia.org/T369808) (owner: 10Ebernhardson) [20:23:59] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10115093 (10Papaul) ` papaul@fasw-c-codfw# show | compare [edit interfaces interface-range disabled] - member ge-0/0/39; - member ge-1/0/39;... 
[20:24:07] (03Merged) 10jenkins-bot: cirrus: Introduce an expensive query pool counter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070281 (https://phabricator.wikimedia.org/T369808) (owner: 10Ebernhardson) [20:24:28] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1070281|cirrus: Introduce an expensive query pool counter (T369808)]] [20:24:31] T369808: The Commons search "deepcategory" operator often does not work (Deep category query returned too many categories) - https://phabricator.wikimedia.org/T369808 [20:24:51] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2080.codfw.wmnet with OS bullseye [20:25:03] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10115098 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by swfrench@cumin2002 for host wikik... 
[20:25:23] cjming: nothing to really test in my patch, it introduces a new configuration that is not yet referenced [20:25:49] ebernhardson: 10-4 -- i will sync [20:26:42] !log cjming@deploy1003 ebernhardson, cjming: Backport for [[gerrit:1070281|cirrus: Introduce an expensive query pool counter (T369808)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:26:42] srishakatux: happy to do your patch next but if you're not here yet, I'll let the self-deployers take the reins and I can check back in with you after they're done [20:26:47] !log cjming@deploy1003 ebernhardson, cjming: Continuing with sync [20:29:03] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10115118 (10Papaul) [20:29:10] (03PS1) 10Papaul: Add DNS entries for payments2006 [dns] - 10https://gerrit.wikimedia.org/r/1070339 [20:30:57] (03CR) 10Papaul: [C:03+2] Add DNS entries for payments2006 [dns] - 10https://gerrit.wikimedia.org/r/1070339 (owner: 10Papaul) [20:31:15] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1070281|cirrus: Introduce an expensive query pool counter (T369808)]] (duration: 06m 47s) [20:31:23] ebernhardson: should be live! [20:31:30] T369808: The Commons search "deepcategory" operator often does not work (Deep category query returned too many categories) - https://phabricator.wikimedia.org/T369808 [20:32:03] toyofuku and jan_drewniak - feel free to self-deploy - i'll let you duke it out who goes first - just lmk when you're both done [20:32:05] cjming: thanks! [20:32:09] np! [20:32:11] ty!! [20:32:20] jan_drewniak: rock paper scissors? 
[20:33:33] PROBLEM - Host gerrit1004 is DOWN: PING CRITICAL - Packet loss = 100% [20:34:00] jk I'll go first as I believe he's still getting his together [20:35:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069334 (https://phabricator.wikimedia.org/T372757) (owner: 10Stoyofuku-wmf) [20:35:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070310 (owner: 10Jdlrobson) [20:35:38] toyofuku: yes go ahead! [20:35:46] (03Merged) 10jenkins-bot: Turn on donate link in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1069334 (https://phabricator.wikimedia.org/T372757) (owner: 10Stoyofuku-wmf) [20:35:48] (03Merged) 10jenkins-bot: Disable lead paragraph transform on Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070310 (owner: 10Jdlrobson) [20:35:52] tyty [20:36:07] !log toyofuku@deploy1003 Started scap sync-world: Backport for [[gerrit:1069334|Turn on donate link in beta (T372757)]], [[gerrit:1070310|Disable lead paragraph transform on Wikivoyages]] [20:36:10] T372757: Move donation entry point on Vector 2022 - https://phabricator.wikimedia.org/T372757 [20:36:35] (03CR) 10Dzahn: "We had some alerts from this like https://phabricator.wikimedia.org/T373909 but it was probably because of a race condition. 
Puppet had to" [puppet] - 10https://gerrit.wikimedia.org/r/1070197 (https://phabricator.wikimedia.org/T371833) (owner: 10Stevemunene) [20:37:51] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab1005.mgmt.eqiad.wmnet with reboot policy FORCED [20:38:19] !log toyofuku@deploy1003 jdlrobson, toyofuku: Backport for [[gerrit:1069334|Turn on donate link in beta (T372757)]], [[gerrit:1070310|Disable lead paragraph transform on Wikivoyages]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:39:29] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host phab1005.mgmt.eqiad.wmnet with reboot policy FORCED [20:40:17] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10115177 (10Papaul) a:05Papaul→03Dwisehaupt @Dwisehaupt hey this ready for you sorry didn't get on this since i was out all of last week. [20:41:22] !log toyofuku@deploy1003 jdlrobson, toyofuku: Continuing with sync [20:42:07] RECOVERY - Host gerrit1004 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [20:44:14] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host phab1005.mgmt.eqiad.wmnet with reboot policy FORCED [20:44:30] !log running homer 'lsw1-b3-codfw*' commit 'T372878' [20:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:34] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host phab1005.eqiad.wmnet with OS bookworm [20:44:34] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [20:44:44] 10ops-eqiad, 06collaboration-services, 06DC-Ops, 13Patch-For-Review: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10115188 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host phab1005.eqiad.wmnet with 
O... [20:45:50] !log toyofuku@deploy1003 Finished scap sync-world: Backport for [[gerrit:1069334|Turn on donate link in beta (T372757)]], [[gerrit:1070310|Disable lead paragraph transform on Wikivoyages]] (duration: 09m 43s) [20:45:53] T372757: Move donation entry point on Vector 2022 - https://phabricator.wikimedia.org/T372757 [20:46:37] PROBLEM - Host gerrit1004 is DOWN: PING CRITICAL - Packet loss = 100% [20:47:21] jan_drewniak: all yours! [20:47:55] toyofuku: thanks! [20:49:51] (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070336 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [20:50:07] RECOVERY - Host gerrit1004 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [20:50:13] gerrit1004 isn't production gerrit, fwiw [20:50:37] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070336 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [20:51:56] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [20:52:55] PROBLEM - Host mw2402 is DOWN: PING CRITICAL - Packet loss = 100% [20:53:39] PROBLEM - Host gerrit1004 is DOWN: PING CRITICAL - Packet loss = 100% [20:56:34] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host phab1005 [20:56:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host phab1005 [20:56:47] !log jdrewniak@deploy1003 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1070336| Bumping portals to master (T128546)]] (duration: 02m 21s) [20:56:50] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - 
https://phabricator.wikimedia.org/T128546 [20:59:07] !log jdrewniak@deploy1003 Synchronized portals: Wikimedia Portals Update: [[gerrit:1070336| Bumping portals to master (T128546)]] (duration: 02m 19s) [20:59:47] (03PS4) 10Andrew Bogott: keystone/apache: fix OIDC settings again! [puppet] - 10https://gerrit.wikimedia.org/r/1070267 (https://phabricator.wikimedia.org/T359590) [20:59:47] (03PS9) 10Andrew Bogott: Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) [21:00:41] PROBLEM - Host mw2406 is DOWN: PING CRITICAL - Packet loss = 100% [21:00:50] FIRING: [8x] ProbeDown: Service puppetmaster1001:8141 has failed probes (http_puppetmaster1001_eqiad_wmnet_backend_https_ip4) - https://wikitech.wikimedia.org/wiki/Puppet#Debugging - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:01:08] (03CR) 10EoghanGaffney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070297 (https://phabricator.wikimedia.org/T373846) (owner: 10EoghanGaffney) [21:02:59] swfrench-wmf: is the switch config change related to those mw hosts going down? 
[21:03:43] PROBLEM - Host mw2407 is DOWN: PING CRITICAL - Packet loss = 100% [21:04:02] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070267 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [21:04:18] mutante: no, I don't think so - they're on a different ToR switch [21:05:28] also, I aborted that before committing, as there are some unexplained config diffs I [21:05:36] 'd like to clarify first [21:06:53] swfrench-wmf: gotcha, well, I got an explanation for the one host in eqiad, but nothing for codfw [21:07:08] 10ops-eqiad, 06collaboration-services, 06DC-Ops, 13Patch-For-Review: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10115250 (10Jclark-ctr) a:03Jclark-ctr [21:07:13] RECOVERY - Host gerrit1004 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [21:07:37] 10ops-eqiad, 06collaboration-services, 06DC-Ops, 13Patch-For-Review: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10115253 (10Jclark-ctr) @Dzahn please update preseed.yaml file for sw raid for this server. Reimage fails without this [21:07:39] mutante: ah, wait, mw240[267] are among the hosts that were renamed earlier today [21:08:33] I wonder if the old names are still lingering somewhere [21:09:09] 10ops-eqiad, 06collaboration-services, 06DC-Ops, 13Patch-For-Review: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10115261 (10Dzahn) @Jclark-ctr It also fails because coordination was needed to change the role: https://gerrit.wikimedia.org/r/c/operat... [21:09:43] swfrench-wmf: phew, good, that explains it [21:09:54] (03PS5) 10Andrew Bogott: keystone/apache: fix OIDC settings again! 
[puppet] - 10https://gerrit.wikimedia.org/r/1070267 (https://phabricator.wikimedia.org/T359590) [21:09:55] (03PS10) 10Andrew Bogott: Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) [21:10:10] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070267 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [21:10:59] !log jdrewniak@deploy1003 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1046698| Bumping portals to master (T128546)]] (duration: 06m 25s) [21:11:02] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [21:11:08] I can't believe the crazy timing [21:11:18] random hosts in different DCs at the same time for unrelated reasons [21:11:42] heh, yeah - so I suspect the renamed ones were done in a batch, and a downtime just expired [21:12:02] I also suspect they need manually deactivated from puppet, which I'll try shortly [21:12:06] yea, probably that was just one of the slow puppet runs on alert* server too [21:12:12] when it realizes things changed at the same time [21:12:32] yep, thanks [21:13:18] !log jdrewniak@deploy1003 Synchronized portals: Wikimedia Portals Update: [[gerrit:1046698| Bumping portals to master (T128546)]] (duration: 02m 18s) [21:15:25] FIRING: [3x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:18:25] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install frqueue2003, pay-lb2001, pay-lb2002 - https://phabricator.wikimedia.org/T369566#10115289 (10Jgreen) [21:21:24] (03PS6) 10Andrew Bogott: keystone/apache: fix OIDC settings again! 
[puppet] - 10https://gerrit.wikimedia.org/r/1070267 (https://phabricator.wikimedia.org/T359590) [21:21:24] (03PS11) 10Andrew Bogott: Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) [21:21:31] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070267 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [21:25:26] (03CR) 10Andrew Bogott: [C:03+2] keystone/apache: fix OIDC settings again! [puppet] - 10https://gerrit.wikimedia.org/r/1070267 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [21:29:20] (03CR) 10Dzahn: [C:03+2] site: rename gerrit1004 to phab1005 [puppet] - 10https://gerrit.wikimedia.org/r/1063870 (https://phabricator.wikimedia.org/T372817) (owner: 10Dzahn) [21:31:19] (03PS2) 10Dzahn: site: rename gerrit1004 to phab1005 [puppet] - 10https://gerrit.wikimedia.org/r/1063870 (https://phabricator.wikimedia.org/T372817) [21:31:50] !log running homer 'lsw1-b3-codfw*' commit 'T372878' [21:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:52] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [21:32:26] (03CR) 10Dzahn: [C:03+2] site: rename gerrit1004 to phab1005 [puppet] - 10https://gerrit.wikimedia.org/r/1063870 (https://phabricator.wikimedia.org/T372817) (owner: 10Dzahn) [21:34:15] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2080.codfw.wmnet [21:34:17] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2080.codfw.wmnet [21:35:54] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, 13Patch-For-Review: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10115308 (10Dzahn) @Jclark-ctr I don't think it's preseed.yaml. 
Both existing gerrit1004 and phab* are set to standard/raid1-2d... [21:37:58] (03CR) 10Dzahn: [C:03+1] "thanks! this should unblock both vrts and mailman switching the firewall provider" [puppet] - 10https://gerrit.wikimedia.org/r/1070273 (https://phabricator.wikimedia.org/T373637) (owner: 10Muehlenhoff) [21:38:26] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373916#10115314 (10Scott_French) [21:42:02] !log running homer 'cr*codfw*' commit 'T372878' [21:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:06] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [21:42:44] (03CR) 10Dzahn: [C:03+2] "Yea, but, and this is different from the releases hosts, update-alternatives wasn't run by puppet with this change. This really just insta" [puppet] - 10https://gerrit.wikimedia.org/r/1069325 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [21:48:31] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 393, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:49:48] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3842/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070297 (https://phabricator.wikimedia.org/T373846) (owner: 10EoghanGaffney) [21:53:36] (03PS4) 10EoghanGaffney: mailman: Move /var/lib/mailman to /srv/mailman [puppet] - 10https://gerrit.wikimedia.org/r/1070297 (https://phabricator.wikimedia.org/T373846) [21:54:35] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 475, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:58:32] (03PS5) 10EoghanGaffney: mailman: Move /var/lib/mailman to /srv/mailman [puppet] - 10https://gerrit.wikimedia.org/r/1070297 
(https://phabricator.wikimedia.org/T373846) [21:58:54] (03CR) 10CI reject: [V:04-1] mailman: Move /var/lib/mailman to /srv/mailman [puppet] - 10https://gerrit.wikimedia.org/r/1070297 (https://phabricator.wikimedia.org/T373846) (owner: 10EoghanGaffney) [21:59:50] (03PS6) 10EoghanGaffney: mailman: Move /var/lib/mailman to /srv/mailman [puppet] - 10https://gerrit.wikimedia.org/r/1070297 (https://phabricator.wikimedia.org/T373846) [22:04:48] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host phab1005.eqiad.wmnet with OS bookworm [22:04:59] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, 13Patch-For-Review: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10115361 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host phab1005.eqiad.wmnet w... [22:06:07] (03CR) 10BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1055232 (owner: 10Ncmonitor) [22:06:09] (03PS1) 10Ebernhardson: NetworkSession: Only enable for private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070344 (https://phabricator.wikimedia.org/T373826) [22:06:24] (03Abandoned) 10BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1055232 (owner: 10Ncmonitor) [22:06:43] (03Abandoned) 10BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1069644 (owner: 10Ncmonitor) [22:18:54] (03CR) 10Bartosz Dziewoński: [C:03+1] NetworkSession: Only enable for private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070344 (https://phabricator.wikimedia.org/T373826) (owner: 10Ebernhardson) [22:19:55] (03PS7) 10EoghanGaffney: mailman: Move /var/lib/mailman to /srv/mailman [puppet] - 10https://gerrit.wikimedia.org/r/1070297 (https://phabricator.wikimedia.org/T373846) [22:20:26] FIRING: [2x] 
RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:21:06] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3845/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070297 (https://phabricator.wikimedia.org/T373846) (owner: 10EoghanGaffney) [22:36:01] FIRING: [3x] RedisMemoryFull: Redis memory full on gitlab1003:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:19:34] (03PS1) 10NMW03: Update wgSitename for tlywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070347 (https://phabricator.wikimedia.org/T367009) [23:38:39] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1070349 [23:38:40] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1070349 (owner: 10TrainBranchBot) [23:42:55] PROBLEM - MariaDB Replica SQL: s2 on db2197 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: nlwiki. 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:43:03] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 188 probes of 747 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:43:25] FIRING: SystemdUnitFailed: wmf_auto_restart_systemd-timesyncd.service on wikikube-worker2076:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:48:03] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 9 probes of 747 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:51:56] (03PS5) 10Jdlrobson: Roll out appearance menu and font size change to sister projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059393 (https://phabricator.wikimedia.org/T371020) [23:55:24] (03PS1) 10Cwhite: logstash: put logging-sd100[1-4] in service [puppet] - 10https://gerrit.wikimedia.org/r/1070352 (https://phabricator.wikimedia.org/T373651) [23:55:25] (03PS1) 10Cwhite: logstash: put logging-sd200[1-4] in service [puppet] - 10https://gerrit.wikimedia.org/r/1070353 (https://phabricator.wikimedia.org/T373651) [23:57:20] (03PS6) 10Jdlrobson: Roll out appearance menu and font size change to sister projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059393 (https://phabricator.wikimedia.org/T371020) [23:57:21] (03PS1) 10Jdlrobson: Enable appearance menu for all logged in users on all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070354 (https://phabricator.wikimedia.org/T371020)