[00:01:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T370903)', diff saved to https://phabricator.wikimedia.org/P68003 and previous config saved to /var/cache/conftool/dbconfig/20240828-000117-ladsgroup.json
[00:01:36] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[00:03:53] PROBLEM - SSH on puppetserver1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:06:45] RECOVERY - SSH on puppetserver1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:07:33] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1067450 (owner: TrainBranchBot)
[00:12:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T371742)', diff saved to https://phabricator.wikimedia.org/P68004 and previous config saved to /var/cache/conftool/dbconfig/20240828-001214-ladsgroup.json
[00:12:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[00:12:22] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[00:12:30] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[00:16:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P68005 and previous config saved to /var/cache/conftool/dbconfig/20240828-001625-ladsgroup.json
[00:31:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P68006 and previous config saved to /var/cache/conftool/dbconfig/20240828-003132-ladsgroup.json
[00:44:37] (PS1) Andrew Bogott: Keystone+apache 2gether again [puppet] - https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590)
[00:46:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T370903)', diff saved to https://phabricator.wikimedia.org/P68007 and previous config saved to /var/cache/conftool/dbconfig/20240828-004639-ladsgroup.json
[00:46:42] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2218.codfw.wmnet with reason: Maintenance
[00:46:44] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[00:46:45] (PS2) Andrew Bogott: Keystone+apache 2gether again [puppet] - https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590)
[00:46:55] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2218.codfw.wmnet with reason: Maintenance
[00:47:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2218 (T370903)', diff saved to https://phabricator.wikimedia.org/P68008 and previous config saved to /var/cache/conftool/dbconfig/20240828-004702-ladsgroup.json
[00:48:28] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590) (owner: Andrew Bogott)
[00:49:26] (CR) CI reject: [V:-1] Keystone+apache 2gether again [puppet] - https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590) (owner: Andrew Bogott)
[00:50:51] SRE, SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#10098491 (JJMC89) Not yet - waiting on a response from @JbuattiWMF.
[00:51:43] (PS3) Andrew Bogott: Keystone+apache 2gether again [puppet] - https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590)
[00:53:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T370903)', diff saved to https://phabricator.wikimedia.org/P68009 and previous config saved to /var/cache/conftool/dbconfig/20240828-005342-ladsgroup.json
[00:53:46] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[00:57:10] (PS4) Andrew Bogott: Keystone and Apache, 2gether again [puppet] - https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590)
[01:08:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P68010 and previous config saved to /var/cache/conftool/dbconfig/20240828-010849-ladsgroup.json
[01:23:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P68011 and previous config saved to /var/cache/conftool/dbconfig/20240828-012356-ladsgroup.json
[01:39:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T370903)', diff saved to https://phabricator.wikimedia.org/P68012 and previous config saved to /var/cache/conftool/dbconfig/20240828-013903-ladsgroup.json
[01:39:08] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[02:01:25] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2136.codfw.wmnet with reason: Maintenance
[02:01:38] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2136.codfw.wmnet with reason: Maintenance
[02:01:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2136 (T371742)', diff saved to https://phabricator.wikimedia.org/P68013 and previous config saved to /var/cache/conftool/dbconfig/20240828-020145-ladsgroup.json
[02:01:49] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[02:07:19] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:10:33] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 538, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:13:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:22:25] (CR) Krinkle: [C:+1] Revert "Enter deprecation trial for third-party cookie blocking" [mediawiki-config] - https://gerrit.wikimedia.org/r/1067390 (https://phabricator.wikimedia.org/T359957) (owner: Gergő Tisza)
[02:36:27] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:46:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T371742)', diff saved to https://phabricator.wikimedia.org/P68014 and previous config saved to /var/cache/conftool/dbconfig/20240828-024627-ladsgroup.json
[02:46:32] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[03:01:27] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P68015 and previous config saved to /var/cache/conftool/dbconfig/20240828-030135-ladsgroup.json
[03:03:40] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:16:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P68016 and previous config saved to /var/cache/conftool/dbconfig/20240828-031642-ladsgroup.json
[03:23:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:31:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T371742)', diff saved to https://phabricator.wikimedia.org/P68017 and previous config saved to /var/cache/conftool/dbconfig/20240828-033149-ladsgroup.json
[03:31:51] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance
[03:31:54] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[03:32:04] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance
[03:32:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2137 (T371742)', diff saved to https://phabricator.wikimedia.org/P68018 and previous config saved to /var/cache/conftool/dbconfig/20240828-033211-ladsgroup.json
[03:38:18] FIRING: [2x] KubernetesCalicoDown: mw2292.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[03:59:41] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[04:03:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:13:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:50:23] (PS1) Marostegui: installserver: Do not format db2230 [puppet] - https://gerrit.wikimedia.org/r/1067593
[04:53:22] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10098595 (Marostegui)
[04:54:10] (CR) Marostegui: [C:+2] installserver: Do not format db2230 [puppet] - https://gerrit.wikimedia.org/r/1067593 (owner: Marostegui)
[05:42:07] (PS2) Chlod Alejandro: kaawiktionary: re-add custom logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1064363 (https://phabricator.wikimedia.org/T368868)
[05:42:13] (PS2) Chlod Alejandro: kawikisource: re-add custom logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1064356 (https://phabricator.wikimedia.org/T368868)
[05:42:15] (PS2) Chlod Alejandro: bewwiki: add custom logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1063920 (https://phabricator.wikimedia.org/T368868)
[05:42:18] (PS2) Chlod Alejandro: kuswiki: add custom logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1063919 (https://phabricator.wikimedia.org/T368868)
[05:42:20] (PS2) Chlod Alejandro: mywikisource: add custom logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1063918 (https://phabricator.wikimedia.org/T368868)
[05:42:22] (PS2) Chlod Alejandro: iglwiki: add custom logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1063916 (https://phabricator.wikimedia.org/T368868)
[05:42:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T371742)', diff saved to https://phabricator.wikimedia.org/P68019 and previous config saved to /var/cache/conftool/dbconfig/20240828-054237-ladsgroup.json
[05:42:42] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[05:56:23] (CR) Ayounsi: [C:+2] Network report: remove wdqs from NO_V6_DEVICE_NAME_PREFIXES [software/netbox-extras] - https://gerrit.wikimedia.org/r/1067366 (https://phabricator.wikimedia.org/T312555) (owner: Ayounsi)
[05:57:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P68020 and previous config saved to /var/cache/conftool/dbconfig/20240828-055744-ladsgroup.json
[05:58:24] (Merged) jenkins-bot: Network report: remove wdqs from NO_V6_DEVICE_NAME_PREFIXES [software/netbox-extras] - https://gerrit.wikimedia.org/r/1067366 (https://phabricator.wikimedia.org/T312555) (owner: Ayounsi)
[05:59:37] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[05:59:50] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T0600)
[06:01:33] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[06:02:04] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[06:04:36] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:36] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:12:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P68021 and previous config saved to /var/cache/conftool/dbconfig/20240828-061252-ladsgroup.json
[06:13:27] (CR) Ayounsi: [C:+1] "Thanks!" [cookbooks] - https://gerrit.wikimedia.org/r/1067440 (owner: Scott French)
[06:28:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T371742)', diff saved to https://phabricator.wikimedia.org/P68022 and previous config saved to /var/cache/conftool/dbconfig/20240828-062759-ladsgroup.json
[06:28:01] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance
[06:28:04] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[06:28:15] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance
[06:42:32] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1067397 (https://phabricator.wikimedia.org/T373426) (owner: Ssingh)
[07:00:04] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T0700).
[07:00:05] srishakatux: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:23:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:27:31] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:28:20] (PS1) Marostegui: installserver: Do not format db2231 [puppet] - https://gerrit.wikimedia.org/r/1067766
[07:30:29] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2007.codfw.wmnet
[07:31:05] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2007.codfw.wmnet
[07:31:14] (CR) Marostegui: [C:+2] installserver: Do not format db2231 [puppet] - https://gerrit.wikimedia.org/r/1067766 (owner: Marostegui)
[07:35:19] (CR) Hashar: "I have noticed this was scheduled for this morning backport window, I would have done it unfortunately I have long forgot how to process c" [mediawiki-config] - https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: Srishakatux)
[07:35:48] (PS2) Arnaudb: mariadb: remove backup from replication thread counter alert critical [alerts] - https://gerrit.wikimedia.org/r/1067770 (https://phabricator.wikimedia.org/T372991)
[07:35:48] (CR) Arnaudb: "fix this morning's spam" [alerts] - https://gerrit.wikimedia.org/r/1067770 (https://phabricator.wikimedia.org/T372991) (owner: Arnaudb)
[07:36:53] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:37:06] (CR) Marostegui: "Please add a comment referencing why this is needed, like I asked on yesterday's patch" [alerts] - https://gerrit.wikimedia.org/r/1067770 (https://phabricator.wikimedia.org/T372991) (owner: Arnaudb)
[07:37:41] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[07:38:18] FIRING: [2x] KubernetesCalicoDown: mw2292.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:38:18] (PS3) Arnaudb: mariadb: remove backup from replication thread counter alert critical [alerts] - https://gerrit.wikimedia.org/r/1067770 (https://phabricator.wikimedia.org/T372991)
[07:38:54] (CR) Arnaudb: "done, but lets try to also use git blame to avoid duplicating information" [alerts] - https://gerrit.wikimedia.org/r/1067770 (https://phabricator.wikimedia.org/T372991) (owner: Arnaudb)
[07:38:55] (CR) Marostegui: [C:+1] mariadb: remove backup from replication thread counter alert critical [alerts] - https://gerrit.wikimedia.org/r/1067770 (https://phabricator.wikimedia.org/T372991) (owner: Arnaudb)
[07:40:18] (CR) Arnaudb: [C:+2] mariadb: remove backup from replication thread counter alert critical [alerts] - https://gerrit.wikimedia.org/r/1067770 (https://phabricator.wikimedia.org/T372991) (owner: Arnaudb)
[07:40:53] (PS1) JMeybohm: Rename/Re-IP kubernetes2007 as wikikube-worker2047 [puppet] - https://gerrit.wikimedia.org/r/1067873 (https://phabricator.wikimedia.org/T372878)
[07:42:18] (CR) JMeybohm: [C:+2] Rename/Re-IP kubernetes2007 as wikikube-worker2047 [puppet] - https://gerrit.wikimedia.org/r/1067873 (https://phabricator.wikimedia.org/T372878) (owner: JMeybohm)
[07:43:56] SRE, iPoid-Service, Trust and Safety Product Team, Patch-For-Review, Trust and Safety Product Sprint (Sprint Theremin (Aug 26 - Sept. 6)): IPoid imports are failing after the daily-updates container stalled - https://phabricator.wikimedia.org/T373427#10098740 (kostajh) Open→Resolve...
[07:44:34] !log jayme@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2007 to wikikube-worker2047
[07:44:51] !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[07:50:33] (CR) Arnaudb: sre.switchdc.databases: new cookbooks (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[07:51:10] (CR) Arnaudb: sre.switchdc.databases: new cookbooks (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[07:51:47] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2007 to wikikube-worker2047 - jayme@cumin1002"
[07:52:45] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2007 to wikikube-worker2047 - jayme@cumin1002"
[07:52:45] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:52:46] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2047
[07:52:58] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2047
[07:53:36] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2007 to wikikube-worker2047
[07:53:50] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10098761 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by jayme@cumin1002 from kubernetes200...
[07:54:04] !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2047.codfw.wmnet on all recursors
[07:54:07] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2047.codfw.wmnet on all recursors
[07:54:35] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2047.codfw.wmnet with OS bullseye
[07:54:46] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host
[07:54:46] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10098762 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki...
[07:58:18] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch
[07:58:20] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.switchdc.databases.prepare (exit_code=97) for the (test) switch
[07:58:26] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch
[07:59:41] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[07:59:55] !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[08:00:04] hashar and andre: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T0800)
[08:00:18] eh eh
[08:00:29] (CR) Arnaudb: sre.switchdc.databases: new cookbooks (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[08:00:32] not right now cause I am in the middle of some other things still :/
[08:01:18] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:04:15] (CR) Marostegui: sre.switchdc.databases: new cookbooks (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[08:04:27] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2047 - jayme@cumin1002"
[08:04:31] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2047 - jayme@cumin1002"
[08:04:31] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:04:31] !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2047.codfw.wmnet 196.0.192.10.in-addr.arpa 6.9.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:04:34] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2047.codfw.wmnet 196.0.192.10.in-addr.arpa 6.9.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:04:35] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2047
[08:05:10] PROBLEM - SSH on wdqs2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:05:53] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2047
[08:05:53] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host
[08:06:27] (CR) Arnaudb: sre.switchdc.databases: new cookbooks (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[08:07:53] (CR) Marostegui: sre.switchdc.databases: new cookbooks (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[08:08:40] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:08:44] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:08:52] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:09:54] (PS1) KartikMistry: Enable Section Translation in btm/dtp Wikpedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1067898 (https://phabricator.wikimedia.org/T371420)
[08:09:58] andre: good morning :)
[08:10:07] hej hej hashar
[08:10:08] so yeah I am bit rusty this week, I am back from vacations!
[08:10:24] and I ended up filing wayy toooo manyyyy buuuggggs yesterday
[08:10:26] so I got lost
[08:10:32] hashar, my respect for dealing with yesterday. At some point I was just like "I'm not gonna be of any help" :-/
[08:11:02] no :erit
[08:11:04] no merit
[08:11:20] just a shit ton of years and years of context being hidden somewhere in my brain cells
[08:11:21] :D
[08:11:26] last week was so smooth, I was sure this week will blow up
[08:11:34] remembers me I need to file a task to get rid of that chmod 777
[08:11:53] or how rebuildLocalisationCache should really disappear
[08:11:54] anyway
[08:11:56] lets train
[08:13:18] hashar: hehe, go ahead (but if you want me to join in a call or such, just say)
[08:14:23] (CR) Arnaudb: sre.switchdc.databases: new cookbooks (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: Volans)
[08:15:18] ah yeah
[08:15:19] hmm
[08:16:41] Hi, when it comes to mediawiki-config changes for LabsSettings (beta) does it go in the regular deployment-window for mediawiki-config?
[08:17:41] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[08:17:54] PROBLEM - SSH on wdqs1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:18:55] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1065266 (https://phabricator.wikimedia.org/T370460) (owner: Eevans)
[08:18:58] nemo-yiannis: usually yes, or at least sync up here :)
[08:19:11] that theorically should NOT affect prod, but one never knows
[08:19:16] ok, thanks, i added the patch for the next window
[08:19:23] we can do it now :)
[08:19:26] which change is it ?
[08:20:12] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1065266
[08:21:40] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.switchdc.databases.prepare (exit_code=97) for the (test) switch
[08:21:45] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch
[08:21:48] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the (test) switch
[08:22:02] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:22:04] nemo-yiannis: that needs rebase according to Gerrit :)
[08:22:29] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2047.codfw.wmnet with reason: host reimage
[08:22:30] (PS2) Eevans: Replace deployment-restbase04 w/ deployment-restbase05 [mediawiki-config] - https://gerrit.wikimedia.org/r/1065266 (https://phabricator.wikimedia.org/T370460)
[08:23:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:24:33] (CR) Hashar: [C:+2] Replace deployment-restbase04 w/ deployment-restbase05 [mediawiki-config] - https://gerrit.wikimedia.org/r/1065266 (https://phabricator.wikimedia.org/T370460) (owner: Eevans)
[08:24:37] thanks :)
[08:24:40] FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs2024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:25:17] (Merged) jenkins-bot: Replace deployment-restbase04 w/ deployment-restbase05 [mediawiki-config] - https://gerrit.wikimedia.org/r/1065266 (https://phabricator.wikimedia.org/T370460) (owner: Eevans)
[08:26:03] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2047.codfw.wmnet with reason: host reimage
[08:26:27] (CR) Vgutierrez: prometheus: add script to check TCP MSS clamping value (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: CDobbins)
[08:27:05] nemo-yiannis: the beta update job already triggered before the change got merged
[08:27:23] (PS1) TrainBranchBot: group1 to 1.43.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/1067909 (https://phabricator.wikimedia.org/T366965)
[08:27:25] (CR) TrainBranchBot: [C:+2] group1 to 1.43.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/1067909 (https://phabricator.wikimedia.org/T366965) (owner: TrainBranchBot)
[08:27:31] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:28:04] (Merged) jenkins-bot: group1 to 1.43.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/1067909 (https://phabricator.wikimedia.org/T366965) (owner: TrainBranchBot)
[08:28:09] hashar: so now the change should be live ?
[08:28:22] not yet [08:28:27] the update job started before the change merged [08:28:33] I'll retrigger it [08:30:16] ah ok got it [08:30:42] nemo-yiannis: https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/510658/console :) [08:30:47] that is the git pulls [08:31:01] then it will trigger another job to run the deployment (using scap, like in prod) [08:34:40] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:37:12] !log hashar@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.43.0-wmf.20 refs T366965 [08:37:16] T366965: 1.43.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T366965 [08:40:17] nemo-yiannis: your change should be live on the beta cluster now [08:40:25] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2147.codfw.wmnet with reason: Maintenance [08:40:38] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2147.codfw.wmnet with reason: Maintenance [08:40:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T371742)', diff saved to https://phabricator.wikimedia.org/P68023 and previous config saved to /var/cache/conftool/dbconfig/20240828-084045-ladsgroup.json [08:40:49] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [08:41:25] RECOVERY - SSH on wdqs2024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:41:54] hashar: thanks, checking [08:41:55] (03PS1) 10Brouberol: airflow-test-k8s: revert back to using an-db1001 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067913 (https://phabricator.wikimedia.org/T373503) [08:44:00] (03CR) 10JMeybohm: 
[C:03+1] "As from yesterdays discussion, maybe change the maintainer to your team." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048) (owner: 10Klausman) [08:44:31] PROBLEM - SSH on wdqs2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:44:57] (03PS1) 10Marostegui: installserver: Remove db2238 [puppet] - 10https://gerrit.wikimedia.org/r/1067914 [08:45:31] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2047.codfw.wmnet with OS bullseye [08:45:46] (03PS1) 10Brouberol: Define helmfile for a test postgresql cluster, to experiment with [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067915 (https://phabricator.wikimedia.org/T373503) [08:45:54] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch [08:46:10] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10098906 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube... 
[08:46:46] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the (test) switch [08:46:50] (03PS1) 10Brouberol: deployment_server: define postgresql-test read/write usernames [puppet] - 10https://gerrit.wikimedia.org/r/1067916 (https://phabricator.wikimedia.org/T373503) [08:47:33] (03CR) 10JMeybohm: icinga: remove check_etcd_mw_config_lastindex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060769 (https://phabricator.wikimedia.org/T322523) (owner: 10Filippo Giunchedi) [08:47:34] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1067914 (owner: 10Marostegui) [08:48:10] (03CR) 10Marostegui: [C:03+2] installserver: Remove db2238 [puppet] - 10https://gerrit.wikimedia.org/r/1067914 (owner: 10Marostegui) [08:48:56] !log running homer commit on on lsw1-a6-codfw* - T372878 [08:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:00] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [08:49:40] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:49:58] (03PS6) 10Klausman: kserve: Bump version to 0.13 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048) [08:50:55] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2047.codfw.wmnet [08:50:56] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2047.codfw.wmnet [08:52:45] !log running homer commit on on cr*codfw* - T372878 [08:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:23] RECOVERY - SSH on wdqs1023 is OK: SSH OK - 
OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:53:24] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373505 (10JMeybohm) 03NEW [08:56:33] RECOVERY - SSH on wdqs2024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:59:41] PROBLEM - SSH on wdqs2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:59:42] (03CR) 10Ayounsi: [C:03+1] Add function to wmf-netbox plugin to provide QoS config data (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [09:00:31] RECOVERY - SSH on wdqs2024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:02:42] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch [09:04:40] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:06:04] (03CR) 10Ayounsi: [C:03+1] "Nice!" 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1060909 (https://phabricator.wikimedia.org/T369351) (owner: 10Cathal Mooney) [09:06:10] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:06:18] (03PS7) 10Klausman: kserve: Bump version to 0.13 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048) [09:08:55] (03PS8) 10Klausman: kserve: Bump version to 0.13 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048) [09:09:13] (03CR) 10Klausman: "Done" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048) (owner: 10Klausman) [09:09:40] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:10:00] (03PS1) 10Slyngshede: Management command for importing TOTP tokens from MediaWiki. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1067918 [09:10:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) for the (test) switch [09:13:45] PROBLEM - SSH on wdqs2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:14:40] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:15:25] (03PS2) 10Slyngshede: Management command for importing TOTP tokens from MediaWiki. [software/bitu] - 10https://gerrit.wikimedia.org/r/1067918 [09:15:35] RECOVERY - SSH on wdqs2024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:15:56] (03CR) 10Stevemunene: [C:03+1] deployment_server: define postgresql-test read/write usernames [puppet] - 10https://gerrit.wikimedia.org/r/1067916 (https://phabricator.wikimedia.org/T373503) (owner: 10Brouberol) [09:16:46] (03CR) 10Stevemunene: [C:03+1] Define helmfile for a test postgresql cluster, to experiment with [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067915 (https://phabricator.wikimedia.org/T373503) (owner: 10Brouberol) [09:17:48] (03CR) 10Stevemunene: "lgtm!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1067913 (https://phabricator.wikimedia.org/T373503) (owner: 10Brouberol) [09:19:40] RESOLVED: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:21:39] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:22:17] (03PS2) 10David Caro: toolforge:prometheus: limit ingress-nginx scrapes [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143) [09:22:54] (03CR) 10CI reject: [V:04-1] toolforge:prometheus: limit ingress-nginx scrapes [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143) (owner: 10David Caro) [09:23:14] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3764/co" [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143) (owner: 10David Caro) [09:24:19] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373505#10099044 (10Clement_Goubert) →14Duplicate dup:03T373457 [09:24:20] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373491#10099045 (10Clement_Goubert) →14Duplicate dup:03T373457 [09:24:37] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373457#10099040 (10Clement_Goubert) [09:27:02] (03PS3) 10David Caro: toolforge:prometheus: limit ingress-nginx scrapes [puppet] - 10https://gerrit.wikimedia.org/r/1067332 
(https://phabricator.wikimedia.org/T370143) [09:28:41] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:28:41] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373457#10099072 (10Clement_Goubert) [09:31:07] (03CR) 10Stevemunene: [C:03+1] Upgrade airflow to 2.10.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067352 (https://phabricator.wikimedia.org/T372284) (owner: 10Brouberol) [09:33:58] !log homer 'lsw1-a3-codfw*' commit T372878 [09:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:03] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [09:34:47] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1063239 (https://phabricator.wikimedia.org/T372664) (owner: 10JHathaway) [09:35:16] !log pooling wikikube-worker2043.codfw.wmnet - T372878 [09:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:33] (03PS4) 10David Caro: toolforge:prometheus: limit ingress-nginx scrapes [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143) [09:35:34] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2043.codfw.wmnet [09:35:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2043.codfw.wmnet [09:36:46] !log homer 'cr*codfw*' commit 'T372878' [09:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:59] (03PS5) 10David Caro: toolforge:prometheus: limit ingress-nginx scrapes [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143) [09:37:28] (03CR) 10CI reject: [V:04-1] toolforge:prometheus: limit ingress-nginx scrapes [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143) (owner: 10David Caro) [09:37:43] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:37:48] RESOLVED: [2x] KubernetesCalicoDown: mw2292.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:38:58] ^expected [09:39:47] (03CR) 10Marostegui: sre.switchdc.databases: new cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [09:40:08] !log start prometheus1005 bookworm upgrade - T326657 [09:40:11] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:12] T326657: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657 [09:42:58] (03CR) 10Ayounsi: [C:03+1] sre.hosts.decommission: ask on failure [cookbooks] - 10https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306) (owner: 10Volans) [09:43:41] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:43:48] (03PS6) 10David Caro: toolforge:prometheus: limit ingress-nginx scrapes [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143) [09:43:49] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 455, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:44:41] RECOVERY - SSH on wdqs1021 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:46:53] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 537, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:47:02] (03PS7) 10David Caro: toolforge:prometheus: limit ingress-nginx scrapes [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143) [09:48:37] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch [09:49:04] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the (test) switch [09:49:15] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet [09:49:39] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1067415 (https://phabricator.wikimedia.org/T373136) (owner: 10Dzahn) [09:50:53] RECOVERY - SSH on wdqs1024 is OK: SSH OK - 
OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:53:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:54:35] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch [09:57:06] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the (test) switch [09:57:06] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the (test) switch [09:57:27] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.switchdc.databases.prepare (exit_code=97) for the (test) switch [09:58:04] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [09:58:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T1000) [10:01:11] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet [10:05:19] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s1 to test-s1 [10:05:21] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.switchdc.databases.prepare (exit_code=97) for the switch from test-s1 to test-s1 [10:06:21] (03PS1) 10Ladsgroup: Set ruwiki to non simple UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067930 (https://phabricator.wikimedia.org/T372694) [10:07:24] jouncebot: nowandnext [10:07:25] For the next 0 hour(s) and 52 minute(s): MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T1000) [10:07:25] In 0 hour(s) and 52 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T1100) [10:07:25] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s1 to test-s1 [10:07:40] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.switchdc.databases.prepare (exit_code=97) for the switch from test-s1 to test-s1 [10:07:44] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1157.eqiad.wmnet with reason: Maintenance [10:07:56] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1157.eqiad.wmnet with reason: Maintenance [10:08:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1157 (T370903)', diff saved to https://phabricator.wikimedia.org/P68024 and previous config saved to /var/cache/conftool/dbconfig/20240828-100803-ladsgroup.json [10:08:08] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [10:08:30] (03CR) 10Ayounsi: "I left a bunch of comments here and there." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/983268 (https://phabricator.wikimedia.org/T346428) (owner: 10Cathal Mooney) [10:10:35] (03PS1) 10Kevin Bazira: ml-services: revert_risk_model from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067933 (https://phabricator.wikimedia.org/T369344) [10:11:10] (03CR) 10Cathal Mooney: [C:03+2] Expose Netbox tunnel data to config templates [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1060909 (https://phabricator.wikimedia.org/T369351) (owner: 10Cathal Mooney) [10:11:36] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s1 to test-s1 [10:11:40] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.switchdc.databases.prepare (exit_code=97) for the switch from test-s1 to test-s1 [10:12:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T370903)', diff saved to https://phabricator.wikimedia.org/P68025 and previous config saved to /var/cache/conftool/dbconfig/20240828-101214-ladsgroup.json [10:12:41] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s1 to test-s1 [10:12:43] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the switch from test-s1 to test-s1 [10:12:54] (03PS16) 10Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) [10:13:37] (03PS1) 10Hnowlan: shellbox: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067934 (https://phabricator.wikimedia.org/T373128) [10:18:12] (03CR) 10Cathal Mooney: [C:03+2] Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [10:18:58] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes 
- https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [10:22:00] (03CR) 10Clément Goubert: [C:03+1] shellbox: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067934 (https://phabricator.wikimedia.org/T373128) (owner: 10Hnowlan) [10:24:42] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw [10:25:13] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:27:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P68026 and previous config saved to /var/cache/conftool/dbconfig/20240828-102721-ladsgroup.json [10:27:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw [10:27:58] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet [10:28:58] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [10:30:13] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:31:45] (03CR) 10Filippo Giunchedi: icinga: remove check_etcd_mw_config_lastindex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060769 (https://phabricator.wikimedia.org/T322523) (owner: 10Filippo Giunchedi) [10:31:57] (03PS2) 10Filippo Giunchedi: icinga: remove check_etcd_mw_config_lastindex [puppet] - 10https://gerrit.wikimedia.org/r/1060769 (https://phabricator.wikimedia.org/T322523) [10:33:41] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:36:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067930 (https://phabricator.wikimedia.org/T372694) (owner: 10Ladsgroup) [10:37:33] (03Merged) 10jenkins-bot: Set ruwiki to non simple UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067930 (https://phabricator.wikimedia.org/T372694) (owner: 10Ladsgroup) [10:38:06] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet [10:38:08] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1067930|Set ruwiki to non simple UI (T372694)]] [10:38:12] T372694: Switch ruwiki to use FlaggedRevs detailed interface mode - https://phabricator.wikimedia.org/T372694 [10:40:00] !log cmooney@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Relase v0.7.0 with updated plugin - cmooney@cumin1002 [10:41:32] !log start prometheus2005 bookworm upgrade - T326657 [10:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:36] T326657: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657 
[10:42:09] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1067930|Set ruwiki to non simple UI (T372694)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:42:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P68027 and previous config saved to /var/cache/conftool/dbconfig/20240828-104228-ladsgroup.json [10:42:36] (03CR) 10Ayounsi: [C:03+1] Enable Redis and TOTP support. [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1064354 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [10:44:22] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [10:48:56] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067930|Set ruwiki to non simple UI (T372694)]] (duration: 10m 48s) [10:49:00] T372694: Switch ruwiki to use FlaggedRevs detailed interface mode - https://phabricator.wikimedia.org/T372694 [10:50:58] !log cmooney@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Relase v0.7.0 with updated plugin - cmooney@cumin1002 [10:51:07] (03CR) 10Hnowlan: [C:03+2] shellbox: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067934 (https://phabricator.wikimedia.org/T373128) (owner: 10Hnowlan) [10:52:55] (03Merged) 10jenkins-bot: shellbox: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067934 (https://phabricator.wikimedia.org/T373128) (owner: 10Hnowlan) [10:53:09] (03PS1) 10Dreamy Jazz: Maintain ranked order of candidates in STV vote summary [extensions/SecurePoll] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067938 (https://phabricator.wikimedia.org/T373499) [10:54:58] (03PS3) 10Cathal Mooney: Use Netbox data to build tunnel configuration on CRs [homer/public] - 10https://gerrit.wikimedia.org/r/1060911 (https://phabricator.wikimedia.org/T369351) [10:56:06] (03CR) 
10Cathal Mooney: [C:03+2] Use Netbox data to build tunnel configuration on CRs [homer/public] - 10https://gerrit.wikimedia.org/r/1060911 (https://phabricator.wikimedia.org/T369351) (owner: 10Cathal Mooney) [10:56:39] (03Merged) 10jenkins-bot: Use Netbox data to build tunnel configuration on CRs [homer/public] - 10https://gerrit.wikimedia.org/r/1060911 (https://phabricator.wikimedia.org/T369351) (owner: 10Cathal Mooney) [10:57:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T370903)', diff saved to https://phabricator.wikimedia.org/P68028 and previous config saved to /var/cache/conftool/dbconfig/20240828-105735-ladsgroup.json [10:57:37] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance [10:57:40] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [10:57:50] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance [10:57:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T370903)', diff saved to https://phabricator.wikimedia.org/P68029 and previous config saved to /var/cache/conftool/dbconfig/20240828-105757-ladsgroup.json [10:58:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/SecurePoll] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067938 (https://phabricator.wikimedia.org/T373499) (owner: 10Dreamy Jazz) [10:58:46] jouncebot: nowandnext [10:58:46] For the next 0 hour(s) and 1 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T1000) [10:58:46] In 0 hour(s) and 1 minute(s): Services – Citoid / Zotero 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T1100) [10:59:12] Going to deploy now if that's okay. [11:00:05] mvolz: It is that lovely time of the day again! You are hereby commanded to deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T1100). [11:00:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/SecurePoll] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067938 (https://phabricator.wikimedia.org/T373499) (owner: 10Dreamy Jazz) [11:02:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T370903)', diff saved to https://phabricator.wikimedia.org/P68030 and previous config saved to /var/cache/conftool/dbconfig/20240828-110200-ladsgroup.json [11:02:11] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [11:02:35] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [11:03:31] (03Merged) 10jenkins-bot: Maintain ranked order of candidates in STV vote summary [extensions/SecurePoll] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1067938 (https://phabricator.wikimedia.org/T373499) (owner: 10Dreamy Jazz) [11:03:49] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1067938|Maintain ranked order of candidates in STV vote summary (T373499)]] [11:03:53] T373499: Vote summaries for STV should display user ranked order instead of alphabetical candidate order - https://phabricator.wikimedia.org/T373499 [11:05:26] (03PS1) 10Ayounsi: Provision script: don't ask the user for v6 AAAA [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1067940 [11:05:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T371742)', diff saved to https://phabricator.wikimedia.org/P68031 and previous config saved to 
/var/cache/conftool/dbconfig/20240828-110535-ladsgroup.json [11:05:40] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [11:06:02] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1067938|Maintain ranked order of candidates in STV vote summary (T373499)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:06:04] (03PS2) 10Ayounsi: Provision script: don't ask the user for v6 AAAA [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1067940 [11:06:04] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [11:06:50] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [11:07:25] (03CR) 10Ayounsi: Refactor server provision script to select params based on profile (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/983268 (https://phabricator.wikimedia.org/T346428) (owner: 10Cathal Mooney) [11:07:51] (03CR) 10CI reject: [V:04-1] Provision script: don't ask the user for v6 AAAA [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1067940 (owner: 10Ayounsi) [11:09:09] (03PS3) 10Ayounsi: Provision script: don't ask the user for v6 AAAA [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1067940 [11:10:33] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067938|Maintain ranked order of candidates in STV vote summary (T373499)]] (duration: 06m 44s) [11:10:37] T373499: Vote summaries for STV should display user ranked order instead of alphabetical candidate order - https://phabricator.wikimedia.org/T373499 [11:12:50] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. 
[11:12:58] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:14:04] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:17:03] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [11:17:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P68032 and previous config saved to /var/cache/conftool/dbconfig/20240828-111708-ladsgroup.json [11:18:13] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:20:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P68033 and previous config saved to /var/cache/conftool/dbconfig/20240828-112042-ladsgroup.json [11:22:58] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:22:59] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [11:23:13] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1014:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:23:53] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [11:25:48] (03CR) 10Cathal Mooney: [C:03+1] "Overall LGTM. I suspect for the bigger mistakes we are probably gonna need to go to the backup, but it's a good approach, code is clean (" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1066687 (https://phabricator.wikimedia.org/T310589) (owner: 10Ayounsi) [11:29:05] (03PS1) 10Ayounsi: ProvisionServer: add types [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1067960 [11:32:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P68034 and previous config saved to /var/cache/conftool/dbconfig/20240828-113215-ladsgroup.json [11:34:18] (03PS1) 10Hnowlan: changeprop-jobqueue: retry once on videoscaling jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067963 (https://phabricator.wikimedia.org/T356241) [11:35:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P68035 and previous config saved to /var/cache/conftool/dbconfig/20240828-113549-ladsgroup.json [11:38:04] (03CR) 10Clément Goubert: [C:03+1] changeprop-jobqueue: retry once on videoscaling jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067963 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [11:39:09] (03CR) 10Hnowlan: [C:03+2] changeprop-jobqueue: retry once on videoscaling jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067963 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [11:40:07] (03Merged) 10jenkins-bot: changeprop-jobqueue: retry once on videoscaling jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067963 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [11:40:50] !log hnowlan@deploy1003 helmfile 
[staging] START helmfile.d/services/changeprop-jobqueue: apply [11:41:05] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [11:41:09] (03CR) 10Cathal Mooney: [C:03+1] RPKI: replace rpki2002 with rpki2003 [homer/public] - 10https://gerrit.wikimedia.org/r/1067356 (https://phabricator.wikimedia.org/T372909) (owner: 10Ayounsi) [11:41:55] (03PS2) 10Cathal Mooney: Add class-of-service scheduler and classifiers plus var to control [homer/public] - 10https://gerrit.wikimedia.org/r/1052167 (https://phabricator.wikimedia.org/T339850) [11:42:08] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [11:42:57] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [11:42:58] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [11:43:40] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [11:43:58] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [11:44:44] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [11:45:55] (03CR) 10Slyngshede: [V:03+2 C:03+2] Enable Redis and TOTP support. 
[software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1064354 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [11:46:12] (03CR) 10David Caro: [C:03+2] toolforge:prometheus: limit ingress-nginx scrapes [puppet] - 10https://gerrit.wikimedia.org/r/1067332 (https://phabricator.wikimedia.org/T370143) (owner: 10David Caro) [11:47:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T370903)', diff saved to https://phabricator.wikimedia.org/P68036 and previous config saved to /var/cache/conftool/dbconfig/20240828-114722-ladsgroup.json [11:47:25] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance [11:47:27] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [11:47:38] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance [11:47:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T370903)', diff saved to https://phabricator.wikimedia.org/P68037 and previous config saved to /var/cache/conftool/dbconfig/20240828-114745-ladsgroup.json [11:48:05] (03CR) 10Slyngshede: "@jhathaway@wikimedia.org would you mind doing another review on this. I had to add a feature to lookup UIDs in LDAP." 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1065166 (owner: 10Slyngshede) [11:48:41] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:50:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T371742)', diff saved to https://phabricator.wikimedia.org/P68038 and previous config saved to /var/cache/conftool/dbconfig/20240828-115057-ladsgroup.json [11:51:00] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2155.codfw.wmnet with reason: Maintenance [11:51:03] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [11:51:13] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2155.codfw.wmnet with reason: Maintenance [11:51:14] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [11:51:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [11:51:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T371742)', diff saved to https://phabricator.wikimedia.org/P68039 and previous config saved to /var/cache/conftool/dbconfig/20240828-115123-ladsgroup.json [11:55:01] (03PS2) 10KartikMistry: Enable Section Translation in bdr, btm and dtp Wikpedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067898 (https://phabricator.wikimedia.org/T371420) [11:58:41] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:59:41] 
FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [12:01:43] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for Máté Szabó - https://phabricator.wikimedia.org/T373426#10099466 (10JayCano) Approved as well! [12:10:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [12:13:54] (03CR) 10David Caro: [C:03+2] aptrepo: upgrade k8s components for 1.26 [puppet] - 10https://gerrit.wikimedia.org/r/1058560 (https://phabricator.wikimedia.org/T370246) (owner: 10Slavina Stefanova) [12:15:28] (03CR) 10David Caro: [C:03+2] aptrepo: upgrade k8s components for 1.26 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1058560 (https://phabricator.wikimedia.org/T370246) (owner: 10Slavina Stefanova) [12:17:27] (03PS4) 10David Caro: maintain_dbusers: add prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) [12:17:54] (03CR) 10David Caro: "rebased on top of production branch, will do the refactor some other day" [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [12:19:17] (03CR) 10Hnowlan: [C:03+1] mobileapps: Use IPs instead of hostnames for cassandra hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066711 (https://phabricator.wikimedia.org/T373314) (owner: 10Jgiannelos) [12:19:40] Hi, we'll be running some CommunityConfiguration/GrowthExperiments maint scripts [12:19:48] They are not expected to disrupt anything. 
[12:19:50] !log T371228 running mwscript --wiki testwiki ./extensions/CommunityConfiguration/maintenance/setVersionData.php HelpPanel 1.0.0 [12:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:54] T371228: Page title component makes it easy to unintentionally blank page title - https://phabricator.wikimedia.org/T371228 [12:19:55] (we being Michael and myself) [12:20:09] (03CR) 10CI reject: [V:04-1] maintain_dbusers: add prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [12:22:00] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from test-s1 to test-s1 [12:22:01] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.switchdc.databases.finalize (exit_code=97) for the switch from test-s1 to test-s1 [12:23:50] !log T371228 running foreachwikiindblist growthexperiments ./extensions/CommunityConfiguration/maintenance/setVersionData.php HelpPanel 1.0.0 [12:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:10] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:24:12] (03PS1) 10David Caro: updates: fix k8s 1.26 url [puppet] - 10https://gerrit.wikimedia.org/r/1067985 (https://phabricator.wikimedia.org/T370246) [12:24:36] (03CR) 10David Caro: [C:03+2] updates: fix k8s 1.26 url [puppet] - 10https://gerrit.wikimedia.org/r/1067985 (https://phabricator.wikimedia.org/T370246) (owner: 10David Caro) [12:24:37] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Use IPs instead of hostnames for cassandra hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066711 (https://phabricator.wikimedia.org/T373314) (owner: 10Jgiannelos) [12:25:56] (03Merged) 10jenkins-bot: 
mobileapps: Use IPs instead of hostnames for cassandra hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066711 (https://phabricator.wikimedia.org/T373314) (owner: 10Jgiannelos) [12:26:19] (03PS1) 10Elukey: profile::docker::reporter: exclude dcl-puppet-pki from base rules [puppet] - 10https://gerrit.wikimedia.org/r/1067986 (https://phabricator.wikimedia.org/T372472) [12:27:00] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from test-s1 to test-s1 [12:27:02] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.finalize (exit_code=99) for the switch from test-s1 to test-s1 [12:28:10] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from test-s1 to test-s1 [12:28:12] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.finalize (exit_code=99) for the switch from test-s1 to test-s1 [12:29:13] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from test-s1 to test-s1 [12:29:32] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from test-s1 to test-s1 [12:30:19] PROBLEM - SSH on puppetserver1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:32:03] (03PS5) 10Arnaudb: sre.switchdc.databases: new cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [12:32:29] All done from our side [12:32:57] (03CR) 10Arnaudb: "good catch!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [12:33:37] (03PS3) 10KartikMistry: Enable Section Translation in bdr, btm, and dtp Wikpedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067898 (https://phabricator.wikimedia.org/T371420) [12:37:15] RECOVERY - SSH on puppetserver1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:37:24] (03CR) 10Cathal Mooney: [C:03+2] Add class-of-service scheduler and classifiers plus var to control [homer/public] - 10https://gerrit.wikimedia.org/r/1052167 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [12:38:01] (03Merged) 10jenkins-bot: Add class-of-service scheduler and classifiers plus var to control [homer/public] - 10https://gerrit.wikimedia.org/r/1052167 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [12:39:20] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:39:22] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:40:06] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [12:41:27] PROBLEM - SSH on puppetserver1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:41:56] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [12:42:45] (03PS1) 10Slyngshede: Fix syntax error [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1067988 [12:43:18] (03CR) 10Cathal Mooney: [C:03+2] RPKI: replace rpki2002 with rpki2003 [homer/public] - 10https://gerrit.wikimedia.org/r/1067356 (https://phabricator.wikimedia.org/T372909) (owner: 10Ayounsi) [12:43:52] (03Merged) 10jenkins-bot: RPKI: replace rpki2002 with rpki2003 [homer/public] - 10https://gerrit.wikimedia.org/r/1067356 (https://phabricator.wikimedia.org/T372909) 
(owner: 10Ayounsi) [12:44:55] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [12:45:20] (03CR) 10CI reject: [V:04-1] sre.switchdc.databases: new cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) (owner: 10Volans) [12:45:41] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [12:48:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T370903)', diff saved to https://phabricator.wikimedia.org/P68040 and previous config saved to /var/cache/conftool/dbconfig/20240828-124801-ladsgroup.json [12:48:06] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [12:49:00] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:51:00] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [12:53:25] (03PS1) 10Jgiannelos: mobileapps: Enable caching in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067990 [12:54:25] (03CR) 10CI reject: [V:04-1] mobileapps: Enable caching in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067990 (owner: 10Jgiannelos) [12:55:21] hehe. 
puppetserver1002 is not well [12:55:48] https://grafana.wikimedia.org/goto/nTxGvo3IR?orgId=1 [12:55:50] (03CR) 10Clément Goubert: [C:03+1] profile::docker::reporter: exclude dcl-puppet-pki from base rules [puppet] - 10https://gerrit.wikimedia.org/r/1067986 (https://phabricator.wikimedia.org/T372472) (owner: 10Elukey) [12:56:00] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [12:56:09] (03PS1) 10Jgiannelos: mobileapps: Enable caching in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067991 [12:56:23] (03Abandoned) 10Jgiannelos: mobileapps: Enable caching in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067990 (owner: 10Jgiannelos) [12:57:17] going to power cycle it. 
[12:57:52] ack thanks, I cannot connect over ssh either (only mgmt) [12:57:59] !log sudo ipmitool -I lanplus -H "puppetserver1002.mgmt.eqiad.wmnet" -U root -E chassis power cycle [12:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:05] yeah, it's thrashing, clearly [12:58:36] let's see if it comes back properly after the reboot [12:58:58] !log Started MediaModeration scan on enwiki, time limited to 24hrs - https://wikitech.wikimedia.org/wiki/MediaModeration [12:58:59] this is also the reason for the widespread puppet failures https://puppetboard.wikimedia.org/ [12:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:04] so that should resolve as well [12:59:54] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:59:59] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for Máté Szabó - https://phabricator.wikimedia.org/T373426#10099544 (10ssingh) [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T1300). [13:00:05] Gerges, nemo-yiannis, and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:10] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Southparkfan - https://phabricator.wikimedia.org/T373518 (10Southparkfan) 03NEW [13:00:11] (03CR) 10Ssingh: [C:03+2] admin: add mszabo to deployment and move from ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1067397 (https://phabricator.wikimedia.org/T373426) (owner: 10Ssingh) [13:00:25] \o My patch is already done, so don't need to use the deployment window [13:00:30] I think my patch is already deployed too [13:01:31] RECOVERY - SSH on puppetserver1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:01:51] and back [13:02:03] ssh works again for me [13:02:23] nice. and we can let the failed agent runs run organically so nothing to do there [13:02:45] metrics in prometheus back as well. +1 ^ [13:03:00] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Southparkfan - https://phabricator.wikimedia.org/T373518#10099570 (10Southparkfan) [13:03:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P68041 and previous config saved to /var/cache/conftool/dbconfig/20240828-130308-ladsgroup.json [13:03:31] !log delete 2023 5m blocks from thanos - T351927 [13:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:35] T351927: Decide and tweak Thanos retention - https://phabricator.wikimedia.org/T351927 [13:04:34] !log rolling out config additions of qos schedulers and policers to all network devices T339850 [13:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:37] T339850: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850 [13:06:10] FIRING: [2x] SystemdUnitFailed: sync-puppet-ca.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:06:50] yeah, this most certainly needs a network-online.target [13:06:59] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for Máté Szabó - https://phabricator.wikimedia.org/T373426#10099573 (10ssingh) 05Open→03Resolved a:03ssingh @mszabo: Your request has been merged, also added to Gerrit group wmf-deployment. Please try in ~30 mins. Tha... [13:07:05] which now reminds me that this is the second time puppetserver1002 failed [13:07:08] (03CR) 10Elukey: [C:03+2] ldap: fix add-ldap-group script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1057814 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [13:07:10] because I think it failed last week as well [13:07:20] (03CR) 10Elukey: [V:03+1 C:03+2] profile::puppetmaster::frontend: allow puppetservers via ssh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1055964 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [13:07:25] indeed, on 22 Aug as well [13:07:28] (03CR) 10Elukey: [V:03+1 C:03+2] Add safe directory settings to the prod private repo's git config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1053272 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [13:07:30] ok, I will file a task [13:07:48] (03CR) 10Elukey: [C:03+2] services: update Thumbor Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067382 (https://phabricator.wikimedia.org/T373363) (owner: 10Elukey) [13:08:08] (03PS1) 10Hnowlan: shellbox-video: healthcheck every 1s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067992 (https://phabricator.wikimedia.org/T373517) [13:08:20] thank you!
[13:08:34] (03CR) 10AOkoth: [C:03+2] vrts: add yearly ticket count [puppet] - 10https://gerrit.wikimedia.org/r/1067360 (https://phabricator.wikimedia.org/T373419) (owner: 10AOkoth) [13:09:00] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:09:37] PROBLEM - Check unit status of sync-puppet-ca on puppetserver1002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-ca https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:09:43] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:09:50] ^ fixing since we rebooted the host [13:09:59] then a proper fix is to add network-online.target, which I will do later [13:10:01] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: sync [13:10:06] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: sync [13:10:12] (03CR) 10Hnowlan: [C:03+2] Remove role::common::core_platform, s/Core Platform/ServiceOps/g [puppet] - 10https://gerrit.wikimedia.org/r/1064725 (owner: 10Hnowlan) [13:11:40] sukhe: o/ thanks for the puppetserver1002 fix, I didn't notice it, did it happen before? 
It is not great :( [13:11:40] Here [13:12:00] elukey: it did happen yep, same issue (thrashing) on Aug 22 [13:12:12] I will file a task for that later as well so don't worry [13:12:39] okok thanks, I'll try to check as well [13:13:40] RESOLVED: [2x] SystemdUnitFailed: sync-puppet-ca.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:15:07] elukey: I will assign to you :P [13:15:11] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:15:15] (03CR) 10CDobbins: prometheus: add script to check TCP MSS clamping value (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [13:15:16] sukhe: fair enough :D [13:18:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P68042 and previous config saved to /var/cache/conftool/dbconfig/20240828-131815-ladsgroup.json [13:19:37] RECOVERY - Check unit status of sync-puppet-ca on puppetserver1002 is OK: OK: Status of the systemd unit sync-puppet-ca https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:19:43] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:20:09] (03PS1) 10AOkoth: Revert "vrts: add yearly ticket count" [puppet] - 10https://gerrit.wikimedia.org/r/1067995 [13:21:24] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Southparkfan - https://phabricator.wikimedia.org/T373518#10099627 (10ssingh) Hi @Southparkfan! 
We need two things for this to move forward, otherwise it's a simple addition. 1. Approval from your manager/point of contact. I am going to assume that th... [13:22:17] (03CR) 10Subramanya Sastry: [C:03+1] mobileapps: Enable caching in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067991 (owner: 10Jgiannelos) [13:23:45] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Seanleong-WMDE - https://phabricator.wikimedia.org/T371694#10099632 (10ssingh) 05Resolved→03Open [13:24:08] Who will deploy this backport patches? [13:25:36] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Seanleong-WMDE - https://phabricator.wikimedia.org/T371694#10099634 (10ssingh) daily_account_consistency_check reports that: ` seanleong-wmde present in privileged LDAP group (nda),but not present in data.yaml seanleong-wmde present in pri... [13:27:10] (03PS1) 10Ssingh: admin: add seanleong-wmde to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1067998 (https://phabricator.wikimedia.org/T371694) [13:27:16] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Enable caching in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067991 (owner: 10Jgiannelos) [13:28:19] (03Merged) 10jenkins-bot: mobileapps: Enable caching in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067991 (owner: 10Jgiannelos) [13:28:26] (03CR) 10Ssingh: [C:03+2] admin: add seanleong-wmde to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1067998 (https://phabricator.wikimedia.org/T371694) (owner: 10Ssingh) [13:28:53] (03CR) 10Elukey: [C:03+2] profile::docker::reporter: exclude dcl-puppet-pki from base rules [puppet] - 10https://gerrit.wikimedia.org/r/1067986 (https://phabricator.wikimedia.org/T372472) (owner: 10Elukey) [13:31:20] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: sync [13:31:53] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s1 to test-s1 
[13:31:55] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the switch from test-s1 to test-s1 [13:32:21] (03CR) 10Vgutierrez: prometheus: add script to check TCP MSS clamping value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [13:32:32] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [13:33:05] Hi Lucas_WMDE and Urbanecm, awight, TheresNoTime, Who will deploy this backport patches? [13:33:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T370903)', diff saved to https://phabricator.wikimedia.org/P68043 and previous config saved to /var/cache/conftool/dbconfig/20240828-133323-ladsgroup.json [13:33:25] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1198.eqiad.wmnet with reason: Maintenance [13:33:27] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [13:33:38] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1198.eqiad.wmnet with reason: Maintenance [13:33:39] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde, ldap/nda for Seanleong-WMDE - https://phabricator.wikimedia.org/T371694#10099650 (10ssingh) 05Open→03Resolved Added to data.yaml, closing this. Thanks! 
[13:33:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:33:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T370903)', diff saved to https://phabricator.wikimedia.org/P68044 and previous config saved to /var/cache/conftool/dbconfig/20240828-133346-ladsgroup.json [13:34:54] (03PS5) 10David Caro: maintain_dbusers: add prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) [13:36:34] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [13:36:39] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s1 to test-s1 [13:36:40] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.prepare (exit_code=99) for the switch from test-s1 to test-s1 [13:37:37] (03CR) 10CI reject: [V:04-1] maintain_dbusers: add prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [13:37:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T370903)', diff saved to https://phabricator.wikimedia.org/P68045 and previous config saved to /var/cache/conftool/dbconfig/20240828-133753-ladsgroup.json [13:38:20] !log arnaudb@cumin1002 START - Cookbook sre.switchdc.databases.prepare for the switch from test-s1 to test-s1 [13:38:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:39:15] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.prepare (exit_code=0) 
for the switch from test-s1 to test-s1 [13:39:21] (03CR) 10Ayounsi: P:idp Clean up CAS 6.6 and Tomcat 9 (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1066708 (https://phabricator.wikimedia.org/T372997) (owner: 10Slyngshede) [13:39:39] jouncebot: [13:39:41] 10SRE-Access-Requests: abi uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T373522 (10ssingh) 03NEW [13:39:50] jouncebot next [13:39:50] In 0 hour(s) and 20 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T1400) [13:39:58] 10SRE-Access-Requests: abi uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T373522#10099694 (10ssingh) p:05Triage→03High [13:40:26] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: sync [13:41:55] (03PS3) 10Ayounsi: Add basic "revert" Netbox script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1066687 (https://phabricator.wikimedia.org/T310589) [13:42:19] (03PS5) 10Andrew Bogott: Keystone and Apache, 2gether again [puppet] - 10https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590) [13:42:19] (03PS1) 10Andrew Bogott: Add apache to codfw1dev cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/1068000 (https://phabricator.wikimedia.org/T359590) [13:42:30] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1068000 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [13:45:32] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [13:45:34] (03PS1) 10Jelto: gerrit: lower thresholds for gerrit, remove gerrit1004 config [puppet] - 10https://gerrit.wikimedia.org/r/1068001 (https://phabricator.wikimedia.org/T365259) [13:46:31] (03CR) 10CDobbins: prometheus: add script to check TCP MSS clamping value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1062457 
(https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [13:48:07] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3765/co" [puppet] - 10https://gerrit.wikimedia.org/r/1068001 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [13:48:23] (03CR) 10Andrew Bogott: [C:03+2] Add apache to codfw1dev cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/1068000 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [13:49:50] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2009.codfw.wmnet [13:50:08] (03CR) 10JMeybohm: [C:03+1] kserve: Bump version to 0.13 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048) (owner: 10Klausman) [13:50:27] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2009.codfw.wmnet [13:51:04] (03PS9) 10Klausman: kserve: Bump version to 0.13 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048) [13:51:43] (03CR) 10Klausman: [C:03+2] kserve: Bump version to 0.13 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048) (owner: 10Klausman) [13:52:16] (03CR) 10Klausman: [V:03+2 C:03+2] kserve: Bump version to 0.13 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048) (owner: 10Klausman) [13:52:54] (03PS6) 10David Caro: maintain_dbusers: add prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) [13:53:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P68046 and previous config saved to 
/var/cache/conftool/dbconfig/20240828-135300-ladsgroup.json [13:53:30] (03CR) 10AOkoth: [C:03+2] Revert "vrts: add yearly ticket count" [puppet] - 10https://gerrit.wikimedia.org/r/1067995 (owner: 10AOkoth) [13:55:32] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1010.eqiad.wmnet with OS bookworm [13:55:46] (03CR) 10CI reject: [V:04-1] maintain_dbusers: add prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [13:55:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10099735 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum... [13:57:05] (03CR) 10Elukey: kserve: Bump version to 0.13 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1046617 (https://phabricator.wikimedia.org/T367048) (owner: 10Klausman) [13:57:39] (03PS1) 10Elukey: role::deployment_server::kubernetes: upgrade nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/1068004 (https://phabricator.wikimedia.org/T368366) [13:58:56] (03CR) 10Hnowlan: [C:03+2] shellbox-video: healthcheck every 1s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067992 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [13:59:14] bd808: o/ - is it ok to deploy toolhub to pick up a new version of mcrouter for https://phabricator.wikimedia.org/T368366 ? 
[13:59:20] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [13:59:22] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [13:59:59] (03Merged) 10jenkins-bot: shellbox-video: healthcheck every 1s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067992 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T1400) [14:00:30] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [14:00:56] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [14:02:31] (03PS1) 10Ayounsi: Provision script: Assign the mgmt IP as oob_ip [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1068008 [14:03:41] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:05:52] !log ayounsi@cumin1002 START - Cookbook sre.hosts.dhcp for host ml-serve1009.eqiad.wmnet [14:06:59] Elukey: That should be fine, yes. Thanks for taking care of that. 
[14:08:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P68047 and previous config saved to /var/cache/conftool/dbconfig/20240828-140807-ladsgroup.json [14:08:35] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host ml-serve1009.eqiad.wmnet [14:11:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T371742)', diff saved to https://phabricator.wikimedia.org/P68048 and previous config saved to /var/cache/conftool/dbconfig/20240828-141108-ladsgroup.json [14:12:27] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:12:29] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:13:03] (03CR) 10JMeybohm: [C:03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1068004 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [14:13:45] (03CR) 10JMeybohm: [C:03+1] icinga: remove check_etcd_mw_config_lastindex [puppet] - 10https://gerrit.wikimedia.org/r/1060769 (https://phabricator.wikimedia.org/T322523) (owner: 10Filippo Giunchedi) [14:14:49] (03CR) 10JMeybohm: [V:03+2 C:03+2] Merge upstream v0.4.0 commit 'a15c162' into v0.4.0 [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1060843 (https://phabricator.wikimedia.org/T337928) (owner: 10JMeybohm) [14:14:53] (03CR) 10JMeybohm: [V:03+2 C:03+2] Update simple-cfssl to use wmf packages [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/1060844 (https://phabricator.wikimedia.org/T337928) (owner: 10JMeybohm) [14:18:10] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1010.eqiad.wmnet with reason: host reimage [14:18:41] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - 
https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:18:41] 06SRE: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527 (10ssingh) 03NEW [14:18:51] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:18:53] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:19:02] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:19:04] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:19:41] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [14:20:26] (03PS6) 10Andrew Bogott: Keystone and Apache, 2gether again [puppet] - 10https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590) [14:20:26] (03PS1) 10Andrew Bogott: keystone/apache.conf: fix listen ports [puppet] - 10https://gerrit.wikimedia.org/r/1068014 (https://phabricator.wikimedia.org/T359590) [14:20:35] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [14:21:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1010.eqiad.wmnet with reason: host reimage [14:21:46] (03PS1) 10Jgiannelos: Revert "mobileapps: Enable caching in prod" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068015 [14:22:54] (03CR) 10Jgiannelos: [C:03+2] Revert "mobileapps: Enable caching in prod" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068015 (owner: 10Jgiannelos) [14:23:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T370903)', diff saved to https://phabricator.wikimedia.org/P68049 and previous config saved to /var/cache/conftool/dbconfig/20240828-142315-ladsgroup.json [14:23:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on 
db1212.eqiad.wmnet with reason: Maintenance [14:23:20] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [14:23:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1212.eqiad.wmnet with reason: Maintenance [14:23:32] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:23:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:23:48] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:23:52] (03Merged) 10jenkins-bot: Revert "mobileapps: Enable caching in prod" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068015 (owner: 10Jgiannelos) [14:23:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T370903)', diff saved to https://phabricator.wikimedia.org/P68050 and previous config saved to /var/cache/conftool/dbconfig/20240828-142355-ladsgroup.json [14:24:11] XioNoX: topranks: are these routinator errors known? 
I have seen them fire up more recently this week than before [14:24:32] (03CR) 10Andrew Bogott: [C:03+2] keystone/apache.conf: fix listen ports [puppet] - 10https://gerrit.wikimedia.org/r/1068014 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [14:24:59] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [14:25:01] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-lab1001.eqiad.wmnet with OS bookworm [14:25:04] (03PS7) 10David Caro: maintain_dbusers: add prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) [14:25:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10099847 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum... [14:25:21] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [14:26:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P68051 and previous config saved to /var/cache/conftool/dbconfig/20240828-142615-ladsgroup.json [14:26:19] sukhe: yeah... 
it's a bit of a pain, it's something we don't have control over, but we want to have alerts if there is a massive issue [14:26:36] I think the more people deploy RPKI, the more external fetches are going to fail [14:26:38] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:26:40] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:26:50] ah so that is what it is saying [14:27:49] (03CR) 10CI reject: [V:04-1] maintain_dbusers: add prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [14:28:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T370903)', diff saved to https://phabricator.wikimedia.org/P68052 and previous config saved to /var/cache/conftool/dbconfig/20240828-142821-ladsgroup.json [14:28:26] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [14:28:35] sukhe: I'm going to bump the threshold significantly. Or I should figure out how to have it fail after a certain percentage [14:28:41] and not an absolute value [14:28:57] no worries on the alerts I guess (non-paging) but I was mostly curious what's up [14:29:24] sukhe: I hate alerting noise, so I should clean up "mine" first :) [14:29:32] haha [14:29:48] well if we want to go down that path of cleaning up alerting noise... 
:P [14:31:09] (03CR) 10Stevemunene: [C:03+1] airflow-test-k8s: revert back to using an-db1001 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067913 (https://phabricator.wikimedia.org/T373503) (owner: 10Brouberol) [14:35:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067433 (https://phabricator.wikimedia.org/T373468) (owner: 10GergesShamon) [14:35:56] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:36:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:36:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1010.eqiad.wmnet with OS bookworm [14:36:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10099860 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10... [14:36:27] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:00] sukhe: actually they are moving away from rsync, so that's why only the failed ones are staying around, so we're always above 50% failure rate [14:38:08] anyway, I'll remove the alerting for that [14:38:18] thanks <3 [14:39:06] for your contributions for reducing alert fatigue as well. 
only a 100 more to go across all SRE :P [14:41:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P68053 and previous config saved to /var/cache/conftool/dbconfig/20240828-144122-ladsgroup.json [14:43:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P68054 and previous config saved to /var/cache/conftool/dbconfig/20240828-144328-ladsgroup.json [14:43:41] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:48:52] (03CR) 10David Caro: [C:03+1] Put cloudcephosd1036 into service [puppet] - 10https://gerrit.wikimedia.org/r/1063861 (https://phabricator.wikimedia.org/T363344) (owner: 10Andrew Bogott) [14:50:19] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1011.eqiad.wmnet with OS bookworm [14:50:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10099888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum... 
[14:50:54] (03CR) 10David Caro: Make cloudcephosd1039-1041 into ceph osd nodes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1063892 (https://phabricator.wikimedia.org/T372814) (owner: 10Andrew Bogott) [14:53:13] (03PS1) 10Ayounsi: Remove RPKI rsync alerting [alerts] - 10https://gerrit.wikimedia.org/r/1068019 [14:54:33] !log cgoubert@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2009.codfw.wmnet [14:54:35] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2009.codfw.wmnet [14:54:37] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2009.codfw.wmnet [14:54:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10099910 (10Jclark-ctr) [14:55:03] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2009.codfw.wmnet with OS bullseye [14:55:17] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10099911 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... 
[14:55:28] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [14:55:42] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [14:56:12] (03CR) 10Ssingh: [C:03+1] "whatever that's worth 😊" [alerts] - 10https://gerrit.wikimedia.org/r/1068019 (owner: 10Ayounsi) [14:56:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T371742)', diff saved to https://phabricator.wikimedia.org/P68056 and previous config saved to /var/cache/conftool/dbconfig/20240828-145629-ladsgroup.json [14:56:32] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2172.codfw.wmnet with reason: Maintenance [14:56:34] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [14:56:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2172.codfw.wmnet with reason: Maintenance [14:56:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T371742)', diff saved to https://phabricator.wikimedia.org/P68057 and previous config saved to /var/cache/conftool/dbconfig/20240828-145651-ladsgroup.json [14:58:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P68058 and previous config saved to /var/cache/conftool/dbconfig/20240828-145835-ladsgroup.json [14:59:13] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2009 - cgoubert@cumin1002" [14:59:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2009 - cgoubert@cumin1002" [14:59:18] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:59:18] !log cgoubert@cumin1002 START 
- Cookbook sre.dns.wipe-cache wikikube-worker2009.codfw.wmnet 197.16.192.10.in-addr.arpa 7.9.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:59:21] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2009.codfw.wmnet 197.16.192.10.in-addr.arpa 7.9.1.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [14:59:22] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2009 [14:59:40] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2009 [14:59:40] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [15:00:58] (03PS1) 10Jgiannelos: mobileapps: Re-enabling caching in prod after adding missing credentials [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068024 [15:01:27] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:59] (03PS2) 10Jgiannelos: mobileapps: Re-enabling caching after adding missing credentials [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068024 (https://phabricator.wikimedia.org/T319365) [15:02:03] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:02:10] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1011.eqiad.wmnet with reason: host reimage [15:03:01] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:05:00] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [15:05:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1011.eqiad.wmnet with reason: host reimage [15:07:00] (03CR) 10Subramanya Sastry: [C:03+1] mobileapps: Re-enabling caching after adding missing credentials [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068024 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos) [15:07:50] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Re-enabling caching after adding missing credentials [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068024 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos) [15:08:49] (03Merged) 10jenkins-bot: mobileapps: Re-enabling caching after adding missing credentials [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068024 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos) [15:09:08] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 453, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:10:00] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [15:10:41] FIRING: RoutinatorRsyncErrors: 
Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:10:47] (03PS8) 10David Caro: maintain_dbusers: add prometheus stats [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) [15:11:12] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-lab1002 [15:11:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-lab1002 [15:11:32] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [15:13:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T370903)', diff saved to https://phabricator.wikimedia.org/P68059 and previous config saved to /var/cache/conftool/dbconfig/20240828-151342-ladsgroup.json [15:13:45] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1223.eqiad.wmnet with reason: Maintenance [15:13:47] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [15:13:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:13:50] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:13:52] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:13:58] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1223.eqiad.wmnet with reason: Maintenance [15:14:00] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:14:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1223 (T370903)', diff saved to https://phabricator.wikimedia.org/P68060 and previous config saved to /var/cache/conftool/dbconfig/20240828-151404-ladsgroup.json [15:14:47] 
!log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:16:32] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2009.codfw.wmnet with reason: host reimage [15:17:50] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:18:31] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:18:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T370903)', diff saved to https://phabricator.wikimedia.org/P68061 and previous config saved to /var/cache/conftool/dbconfig/20240828-151831-ladsgroup.json [15:18:52] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2009.codfw.wmnet with reason: host reimage [15:20:26] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:22:04] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/toolhub: sync [15:22:17] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 535, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:22:43] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/toolhub: sync [15:23:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:23:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1011.eqiad.wmnet with OS bookworm [15:23:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100078 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage 
started by jclark@cumin10... [15:23:41] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/toolhub: sync [15:23:53] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/toolhub: sync [15:27:27] PROBLEM - Check whether ferm is active by checking the default input chain on mw1379 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:30:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:32:11] FIRING: [2x] RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [15:33:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P68062 and previous config saved to /var/cache/conftool/dbconfig/20240828-153338-ladsgroup.json [15:33:55] (03PS1) 10Hnowlan: timedmediahandler: revert using shellbox for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068025 (https://phabricator.wikimedia.org/T373517) [15:34:02] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:34:04] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:34:16] (03PS1) 10JMeybohm: Update cfssl-issuer to v0.4.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1068026 (https://phabricator.wikimedia.org/T337928) [15:37:56] (03CR) 10Giuseppe Lavagetto: [C:03+1] timedmediahandler: revert using shellbox for commonswiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068025 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [15:38:43] (03PS1) 10JMeybohm: Pin cfssl-issuer and CRDs chart versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068027 (https://phabricator.wikimedia.org/T337928) [15:38:44] (03PS1) 10JMeybohm: Update cfss-issuer charts to v0.4.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068028 (https://phabricator.wikimedia.org/T337928) [15:40:33] (03CR) 10David Caro: "just needed rebasing, essentially, the click change it was depending on, just rebased on top of production to get the stats in before the " [puppet] - 10https://gerrit.wikimedia.org/r/908843 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [15:40:37] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2009.codfw.wmnet with OS bullseye [15:40:51] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10100199 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... 
[15:40:52] !log homer cr*codfw* commit 'T372878' [15:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:56] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [15:41:11] (03PS1) 10Hnowlan: videoscaler: use ffmpeg from component [puppet] - 10https://gerrit.wikimedia.org/r/1068030 (https://phabricator.wikimedia.org/T373128) [15:41:35] (03CR) 10CI reject: [V:04-1] videoscaler: use ffmpeg from component [puppet] - 10https://gerrit.wikimedia.org/r/1068030 (https://phabricator.wikimedia.org/T373128) (owner: 10Hnowlan) [15:42:47] PROBLEM - Host asw2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [15:43:11] (03CR) 10JMeybohm: Update cfss-issuer charts to v0.4.0 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068028 (https://phabricator.wikimedia.org/T337928) (owner: 10JMeybohm) [15:43:35] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:07] PROBLEM - Host ps1-22-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [15:44:13] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:44:15] PROBLEM - Host ps1-23-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [15:44:21] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:44:41] (03PS2) 10Hnowlan: videoscaler: use ffmpeg from component [puppet] - 10https://gerrit.wikimedia.org/r/1068030 (https://phabricator.wikimedia.org/T373128) [15:45:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100206 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10... 
[15:45:35] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3767/co" [puppet] - 10https://gerrit.wikimedia.org/r/1068030 (https://phabricator.wikimedia.org/T373128) (owner: 10Hnowlan) [15:45:36] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-lab1001.eqiad.wmnet with OS bookworm [15:46:51] PROBLEM - Host mr1-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:47:34] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@0b23c91]: Test Refine through Airflow [15:47:42] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1009.eqiad.wmnet with OS bookworm [15:47:45] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@0b23c91]: Test Refine through Airflow (duration: 00m 11s) [15:47:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100223 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum... 
[15:47:57] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:48:28] (03CR) 10Clément Goubert: [C:03+1] videoscaler: use ffmpeg from component [puppet] - 10https://gerrit.wikimedia.org/r/1068030 (https://phabricator.wikimedia.org/T373128) (owner: 10Hnowlan) [15:48:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P68063 and previous config saved to /var/cache/conftool/dbconfig/20240828-154846-ladsgroup.json [15:49:11] !log homer lsw1-b6-codfw* commit 'T372878' [15:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:15] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [15:49:24] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1009.eqiad.wmnet with OS bookworm [15:49:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100228 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum... 
[15:49:59] RECOVERY - Host ps1-23-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 72.13 ms [15:50:03] RECOVERY - Host ps1-22-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 74.41 ms [15:50:09] RECOVERY - Host asw2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.72 ms [15:50:13] hmm [15:50:19] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:50:23] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:51:27] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:51:53] RECOVERY - Host mr1-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 71.57 ms [15:52:41] !log TRUNCATE-ing RESTBase tables (`{commons,enwiki,others,wikipedia}_T_mobileoZCBVtILw5eSrwi0VIGaFVSr2jY`) — T342148 [15:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:45] T342148: restbase: high storage utilization - https://phabricator.wikimedia.org/T342148 [15:53:17] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Cephadm doesn't find the correct image to run a shell - https://phabricator.wikimedia.org/T373185#10100246 (10MatthewVernon) For reference - [[ https://github.com/ceph/ceph/pull/59485 | upstream MR to make cephadm more helpful ]] [15:53:49] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 73.47 ms [15:54:27] (03CR) 10Hnowlan: [V:03+1 C:03+2] videoscaler: use ffmpeg from component [puppet] - 10https://gerrit.wikimedia.org/r/1068030 (https://phabricator.wikimedia.org/T373128) (owner: 10Hnowlan) [15:57:21] RECOVERY - Check whether ferm is active by checking the default input chain on mw1379 is OK: OK ferm input default 
policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:57:22] (03PS1) 10Jgiannelos: Revert "mobileapps: Re-enabling caching after adding missing credentials" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068032 [15:57:43] (03PS2) 10Jgiannelos: Revert "mobileapps: Re-enabling caching after adding missing credentials" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068032 [15:59:05] (03CR) 10Jgiannelos: [C:03+2] Revert "mobileapps: Re-enabling caching after adding missing credentials" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068032 (owner: 10Jgiannelos) [15:59:36] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1009.eqiad.wmnet with reason: host reimage [16:00:04] (03Merged) 10jenkins-bot: Revert "mobileapps: Re-enabling caching after adding missing credentials" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068032 (owner: 10Jgiannelos) [16:00:35] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-lab1001.eqiad.wmnet with OS bookworm [16:00:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100254 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum... 
[16:01:01] (03PS1) 10Elukey: jaeger: add securityContext configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068034 (https://phabricator.wikimedia.org/T369491) [16:01:09] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [16:01:10] (03PS28) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [16:01:39] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [16:02:39] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [16:02:52] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1009.eqiad.wmnet with reason: host reimage [16:03:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T370903)', diff saved to https://phabricator.wikimedia.org/P68065 and previous config saved to /var/cache/conftool/dbconfig/20240828-160354-ladsgroup.json [16:03:57] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1240.eqiad.wmnet with reason: Maintenance [16:04:10] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1240.eqiad.wmnet with reason: Maintenance [16:04:12] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [16:05:52] jouncebot: nowandnext [16:05:52] No deployments scheduled for the next 0 hour(s) and 54 minute(s) [16:05:52] In 0 hour(s) and 54 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T1700) [16:06:24] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [16:07:54] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host 
wikikube-worker2009.codfw.wmnet [16:08:05] (03PS1) 10Hashar: archiva: allow trailing slash for top directories [puppet] - 10https://gerrit.wikimedia.org/r/1068036 (https://phabricator.wikimedia.org/T359031) [16:09:41] (03CR) 10Hashar: "https://archiva.wikimedia.org/repository/mirrored yields a 404 not found since it lacks a trailing slash and that confused me :]" [puppet] - 10https://gerrit.wikimedia.org/r/1068036 (https://phabricator.wikimedia.org/T359031) (owner: 10Hashar) [16:13:56] I need to do an out of step deployment to address some error rate issues in videoscaling [16:14:46] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:14:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:16:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hnowlan@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068025 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [16:17:12] (03Merged) 10jenkins-bot: timedmediahandler: revert using shellbox for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068025 (https://phabricator.wikimedia.org/T373517) (owner: 10Hnowlan) [16:17:18] hnowlan: please do ! 
[16:17:28] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:17:31] hnowlan: I ran the MediaWiki train earlier today (roughly 8 hours ago) [16:17:36] !log hnowlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1068025|timedmediahandler: revert using shellbox for commonswiki (T373517)]] [16:17:41] T373517: shellbox-video pods being restarted prematurely - https://phabricator.wikimedia.org/T373517 [16:17:47] ah it is happening already \o/ [16:18:42] PROBLEM - Check whether ferm is active by checking the default input chain on mw1370 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:19:16] (03PS1) 10Bartosz Dziewoński: CentralAuthApiSessionProvider: Avoid error in internal API requests [extensions/CentralAuth] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068041 (https://phabricator.wikimedia.org/T373507) [16:19:43] (03CR) 10Bartosz Dziewoński: "I can backport in the evening if I get a +1." 
[extensions/CentralAuth] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068041 (https://phabricator.wikimedia.org/T373507) (owner: 10Bartosz Dziewoński) [16:19:50] hashar: fortunately/unfortunately the errors are definitely unrelated to the train :) [16:20:01] !log hnowlan@deploy1003 hnowlan: Backport for [[gerrit:1068025|timedmediahandler: revert using shellbox for commonswiki (T373517)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:20:08] !log hnowlan@deploy1003 hnowlan: Continuing with sync [16:20:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:20:12] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1009.eqiad.wmnet with OS bookworm [16:20:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100330 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10... 
[16:20:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100333 (10Jclark-ctr) [16:22:19] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2127.codfw.wmnet with reason: Maintenance [16:22:32] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2127.codfw.wmnet with reason: Maintenance [16:22:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2127 (T370903)', diff saved to https://phabricator.wikimedia.org/P68066 and previous config saved to /var/cache/conftool/dbconfig/20240828-162239-ladsgroup.json [16:22:45] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [16:24:50] !log hnowlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1068025|timedmediahandler: revert using shellbox for commonswiki (T373517)]] (duration: 07m 13s) [16:24:54] 10ops-magru: Sao Paulo, Brazil, South America POP tracking task - https://phabricator.wikimedia.org/T346722#10100341 (10RobH) 05Open→03Resolved a:03RobH All that remains off this #ops-magru tracking task is the traffic ramp up via T359054 and the geo maps update via T363722. Since those are only #traf... 
[16:24:54] T373517: shellbox-video pods being restarted prematurely - https://phabricator.wikimedia.org/T373517 [16:25:56] (03PS2) 10Bartosz Dziewoński: CentralAuthApiSessionProvider: Avoid error in internal API requests [extensions/CentralAuth] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068041 (https://phabricator.wikimedia.org/T373507) [16:26:08] all done [16:26:47] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:26:49] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:27:19] (03CR) 10Scott French: [C:03+1] role::deployment_server::kubernetes: upgrade nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/1068004 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [16:29:39] (03CR) 10Scott French: [C:03+1] Update cfssl-issuer to v0.4.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1068026 (https://phabricator.wikimedia.org/T337928) (owner: 10JMeybohm) [16:30:10] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2009.codfw.wmnet [16:30:11] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2009.codfw.wmnet [16:32:34] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2009.codfw.wmnet [16:32:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2009.codfw.wmnet [16:32:49] (03CR) 10Elukey: "Tried to come up with a configuration for Jaeger, with the following assumptions:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068034 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey) [16:32:54] (03PS1) 10Ssingh: admin: fix typo in data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1068042 [16:33:55] (03CR) 10Elukey: jaeger: add securityContext configuration (031 
comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068034 (https://phabricator.wikimedia.org/T369491) (owner: 10Elukey) [16:34:16] (03CR) 10Ssingh: [C:03+2] admin: fix typo in data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1068042 (owner: 10Ssingh) [16:35:08] (03CR) 10Dzahn: [C:03+2] prometheus/gerrit: also add size of tracking list to exporter [puppet] - 10https://gerrit.wikimedia.org/r/1067415 (https://phabricator.wikimedia.org/T373136) (owner: 10Dzahn) [16:35:13] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1009.eqiad.wmnet with OS bookworm [16:35:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100388 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10... [16:35:59] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2009.codfw.wmnet [16:36:00] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2009.codfw.wmnet [16:36:11] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10100393 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 pool fo... 
[16:36:13] (03CR) 10Scott French: [C:03+1] Pin cfssl-issuer and CRDs chart versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068027 (https://phabricator.wikimedia.org/T337928) (owner: 10JMeybohm) [16:36:19] (03CR) 10CI reject: [V:04-1] CentralAuthApiSessionProvider: Avoid error in internal API requests [extensions/CentralAuth] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068041 (https://phabricator.wikimedia.org/T373507) (owner: 10Bartosz Dziewoński) [16:38:25] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-lab1001.eqiad.wmnet with OS bookworm [16:38:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100413 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10... [16:41:27] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373457#10100422 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:41:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T370903)', diff saved to https://phabricator.wikimedia.org/P68067 and previous config saved to /var/cache/conftool/dbconfig/20240828-164131-ladsgroup.json [16:41:36] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [16:44:24] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1009.eqiad.wmnet with OS bookworm [16:44:26] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-lab1001.eqiad.wmnet with OS bookworm [16:44:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - 
https://phabricator.wikimedia.org/T372432#10100481 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum... [16:44:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100483 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cum... [16:44:51] (03CR) 10Dzahn: [C:03+2] "yes to lowering the values, also tested "2000 without burst" and it still had like 2 IPs affected. the values for gerrit1004 were only her" [puppet] - 10https://gerrit.wikimedia.org/r/1068001 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [16:46:52] (03CR) 10Scott French: [C:03+1] Update cfss-issuer charts to v0.4.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068028 (https://phabricator.wikimedia.org/T337928) (owner: 10JMeybohm) [16:47:53] (03CR) 10Dzahn: "just want to clarify my comments aren't a -1 or anything. I'd say just address comments by Eoghan and merge it and try it out. 
Then follow" [puppet] - 10https://gerrit.wikimedia.org/r/1063733 (owner: 10AOkoth) [16:48:36] RECOVERY - Check whether ferm is active by checking the default input chain on mw1370 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:49:41] FIRING: [2x] RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:51:03] !log add qos config to management firewalls T339850 [16:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:07] T339850: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850 [16:52:06] (03CR) 10Scott French: "Thanks for the review!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1067440 (owner: 10Scott French) [16:52:07] (03CR) 10Scott French: [C:03+2] sre.hosts.move-vlan: use name property in runtime_description [cookbooks] - 10https://gerrit.wikimedia.org/r/1067440 (owner: 10Scott French) [16:56:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P68068 and previous config saved to /var/cache/conftool/dbconfig/20240828-165638-ladsgroup.json [16:59:47] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:59:49] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:59:54] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:00:41] 
FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [17:00:43] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:01:38] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:02:22] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:02:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T371742)', diff saved to https://phabricator.wikimedia.org/P68069 and previous config saved to /var/cache/conftool/dbconfig/20240828-170228-ladsgroup.json [17:02:34] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [17:02:41] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:03:13] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:04:28] (03Merged) 10jenkins-bot: sre.hosts.move-vlan: use name property in runtime_description [cookbooks] - 10https://gerrit.wikimedia.org/r/1067440 (owner: 10Scott French) [17:04:30] (03PS7) 10Andrew Bogott: Keystone and Apache, 2gether again [puppet] - 10https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590) [17:05:01] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:09:00] (03CR) 10Bking: [C:03+2] airflow-test-k8s: revert back to using an-db1001 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067913 (https://phabricator.wikimedia.org/T373503) (owner: 10Brouberol) [17:09:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install 
ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100575 (10Jclark-ctr) [17:09:59] (03Merged) 10jenkins-bot: airflow-test-k8s: revert back to using an-db1001 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067913 (https://phabricator.wikimedia.org/T373503) (owner: 10Brouberol) [17:10:13] (03CR) 10Bking: [C:03+2] Define helmfile for a test postgresql cluster, to experiment with [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067915 (https://phabricator.wikimedia.org/T373503) (owner: 10Brouberol) [17:11:07] (03Merged) 10jenkins-bot: Define helmfile for a test postgresql cluster, to experiment with [deployment-charts] - 10https://gerrit.wikimedia.org/r/1067915 (https://phabricator.wikimedia.org/T373503) (owner: 10Brouberol) [17:11:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P68070 and previous config saved to /var/cache/conftool/dbconfig/20240828-171146-ladsgroup.json [17:14:19] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:15:24] 10ops-magru: Sao Paulo, Brazil, South America POP tracking task - https://phabricator.wikimedia.org/T346722#10100596 (10ssingh) Yes that's fair, the tasks left are on Traffic. Thanks! 
[17:15:41] RESOLVED: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [17:16:34] (03PS1) 10Cathal Mooney: Apply qos interface config in ulsfo and on lsw1-c6-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1068050 (https://phabricator.wikimedia.org/T339850) [17:17:22] (03CR) 10CI reject: [V:04-1] Apply qos interface config in ulsfo and on lsw1-c6-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1068050 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [17:17:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P68071 and previous config saved to /var/cache/conftool/dbconfig/20240828-171735-ladsgroup.json [17:17:52] (03PS2) 10Cathal Mooney: Apply qos interface config in ulsfo and on lsw1-c6-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1068050 (https://phabricator.wikimedia.org/T339850) [17:17:57] (03CR) 10Bartosz Dziewoński: "Failure unrelated, T282893" [extensions/CentralAuth] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068041 (https://phabricator.wikimedia.org/T373507) (owner: 10Bartosz Dziewoński) [17:18:18] (03PS1) 10Bartosz Dziewoński: auth: Relax AuthManager session state check while cde00b55 is deployed [core] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068051 (https://phabricator.wikimedia.org/T373504) [17:18:45] (03CR) 10Bartosz Dziewoński: "recheck" [extensions/CentralAuth] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068041 (https://phabricator.wikimedia.org/T373507) (owner: 10Bartosz Dziewoński) [17:19:16] (03PS1) 10Bartosz Dziewoński: Fix missing definition of setSaveErrorMessage too [extensions/DiscussionTools] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068052 
(https://phabricator.wikimedia.org/T373288) [17:19:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100609 (10Jclark-ctr) [17:22:16] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:22:20] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:22:25] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:22:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100613 (10Jclark-ctr) a:03klausman @klausman. If you can update preseed.yaml file for thes... [17:23:16] (03CR) 10DLynch: [C:03+1] Fix missing definition of setSaveErrorMessage too [extensions/DiscussionTools] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068052 (https://phabricator.wikimedia.org/T373288) (owner: 10Bartosz Dziewoński) [17:24:41] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2045.codfw.wmnet with OS bullseye [17:24:57] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10100621 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host... 
[17:26:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T370903)', diff saved to https://phabricator.wikimedia.org/P68072 and previous config saved to /var/cache/conftool/dbconfig/20240828-172653-ladsgroup.json [17:26:55] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance [17:26:57] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [17:26:58] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance [17:27:23] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:29:51] !log kcvelaga@deploy1003 Started deploy [airflow-dags/analytics_product@cb0bc4d]: (no justification provided) [17:30:10] !log kcvelaga@deploy1003 Finished deploy [airflow-dags/analytics_product@cb0bc4d]: (no justification provided) (duration: 00m 18s) [17:31:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066902 (owner: 10Bartosz Dziewoński) [17:32:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068051 (https://phabricator.wikimedia.org/T373504) (owner: 10Bartosz Dziewoński) [17:32:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/DiscussionTools] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068052 (https://phabricator.wikimedia.org/T373288) (owner: 
10Bartosz Dziewoński) [17:32:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068041 (https://phabricator.wikimedia.org/T373507) (owner: 10Bartosz Dziewoński) [17:32:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P68073 and previous config saved to /var/cache/conftool/dbconfig/20240828-173242-ladsgroup.json [17:34:25] (03PS1) 10Bvibber: Disable HLS VP9 video tracks in TimedMediaHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068054 (https://phabricator.wikimedia.org/T373546) [17:35:04] FIRING: HelmReleaseBadStatus: Helm release airflow-test-k8s/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-test-k8s - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:35:10] (03PS2) 10Bvibber: Disable HLS VP9 video tracks in TimedMediaHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068054 (https://phabricator.wikimedia.org/T373546) [17:35:38] inflatador: is the above known? [17:35:43] k8s-dse alert. known/expected [17:36:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068054 (https://phabricator.wikimedia.org/T373546) (owner: 10Bvibber) [17:37:15] sukhe it's known...not sure why there's a monitor on a non-prod service but I will suppress. 
Thanks for reaching out [17:37:22] thanks <3 [17:38:36] (03CR) 10Gergő Tisza: [C:03+1] CentralAuthApiSessionProvider: Avoid error in internal API requests [extensions/CentralAuth] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068041 (https://phabricator.wikimedia.org/T373507) (owner: 10Bartosz Dziewoński) [17:39:10] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:42:08] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2045.codfw.wmnet with reason: host reimage [17:42:32] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1009.eqiad.wmnet with OS bookworm [17:43:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100691 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10... [17:43:26] (03CR) 10Gergő Tisza: "Thanks!" 
[core] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068051 (https://phabricator.wikimedia.org/T373504) (owner: 10Bartosz Dziewoński) [17:44:44] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2149.codfw.wmnet with reason: Maintenance [17:45:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2149.codfw.wmnet with reason: Maintenance [17:45:12] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2045.codfw.wmnet with reason: host reimage [17:45:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T370903)', diff saved to https://phabricator.wikimedia.org/P68074 and previous config saved to /var/cache/conftool/dbconfig/20240828-174514-ladsgroup.json [17:45:19] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [17:47:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T371742)', diff saved to https://phabricator.wikimedia.org/P68075 and previous config saved to /var/cache/conftool/dbconfig/20240828-174749-ladsgroup.json [17:47:52] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2179.codfw.wmnet with reason: Maintenance [17:47:54] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [17:48:05] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2179.codfw.wmnet with reason: Maintenance [17:48:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2179 (T371742)', diff saved to https://phabricator.wikimedia.org/P68076 and previous config saved to /var/cache/conftool/dbconfig/20240828-174811-ladsgroup.json [17:52:41] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - 
https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [17:57:04] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2294.codfw.wmnet [17:57:34] RESOLVED: HelmReleaseBadStatus: Helm release airflow-test-k8s/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-test-k8s - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:57:40] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2294.codfw.wmnet [17:57:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [17:59:51] (03PS1) 10Alexandros Kosiaris: Rename mw2295 to wikikube-worker2048 [puppet] - 10https://gerrit.wikimedia.org/r/1068059 (https://phabricator.wikimedia.org/T372878) [18:00:44] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-lab1001.eqiad.wmnet with OS bookworm [18:00:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10100775 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin10... 
[18:01:25] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 24, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:03:35] (03CR) 10Alexandros Kosiaris: [C:03+2] Rename mw2295 to wikikube-worker2048 [puppet] - 10https://gerrit.wikimedia.org/r/1068059 (https://phabricator.wikimedia.org/T372878) (owner: 10Alexandros Kosiaris) [18:04:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T370903)', diff saved to https://phabricator.wikimedia.org/P68077 and previous config saved to /var/cache/conftool/dbconfig/20240828-180401-ladsgroup.json [18:04:06] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [18:04:26] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2045.codfw.wmnet with OS bullseye [18:04:38] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Southparkfan - https://phabricator.wikimedia.org/T373518#10100786 (10KFrancis) Hello @Southparkfan, please send your full name, mailing address, and email address to kfrancis@wikimedia.org and I will send the NDA agreement to you. Thanks! [18:04:39] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10100785 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wiki... [18:04:39] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from mw2294 to wikikube-worker2048 [18:04:56] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [18:06:40] (03PS1) 10RLazarus: deployment: Add undocumented flag --helmfile to mwscript-k8s. 
[puppet] - 10https://gerrit.wikimedia.org/r/1068061 [18:08:06] (03PS1) 10Ssingh: admin: update keys for abi [puppet] - 10https://gerrit.wikimedia.org/r/1068062 (https://phabricator.wikimedia.org/T373522) [18:08:13] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2294 to wikikube-worker2048 - akosiaris@cumin1002" [18:08:47] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2294 to wikikube-worker2048 - akosiaris@cumin1002" [18:08:47] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:08:48] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2048 [18:09:13] (03CR) 10CI reject: [V:04-1] deployment: Add undocumented flag --helmfile to mwscript-k8s. [puppet] - 10https://gerrit.wikimedia.org/r/1068061 (owner: 10RLazarus) [18:09:48] (03CR) 10Ssingh: [C:03+2] admin: update keys for abi [puppet] - 10https://gerrit.wikimedia.org/r/1068062 (https://phabricator.wikimedia.org/T373522) (owner: 10Ssingh) [18:10:06] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2048 [18:10:45] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2294 to wikikube-worker2048 [18:10:55] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10100821 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from mw2294 to... [18:11:34] (03CR) 10Bartosz Dziewoński: "I like having all the patches in master, even if they're intended temporary. 
The only reason I didn't do that with the other patch is beca" [core] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068051 (https://phabricator.wikimedia.org/T373504) (owner: 10Bartosz Dziewoński) [18:13:32] 10SRE-Access-Requests, 13Patch-For-Review: abi uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T373522#10100826 (10ssingh) 05Open→03Resolved New key updated for shell access. Thanks @abi_ for the quick response! [18:14:35] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2048.codfw.wmnet with OS bullseye [18:14:45] !log akosiaris@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2048 [18:15:08] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [18:15:12] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10100846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host... [18:16:36] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2045.codfw.wmnet [18:16:37] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2045.codfw.wmnet [18:16:50] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Southparkfan - https://phabricator.wikimedia.org/T373518#10100848 (10Southparkfan) >>! In T373518#10100786, @KFrancis wrote: > Hello @Southparkfan, please send your full name, mailing address, and email address to kfrancis@wikimedia.org and I will sen... 
[18:18:18] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2048 - akosiaris@cumin1002" [18:18:22] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2048 - akosiaris@cumin1002" [18:18:23] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:18:23] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2048.codfw.wmnet 164.0.192.10.in-addr.arpa 4.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [18:18:26] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2048.codfw.wmnet 164.0.192.10.in-addr.arpa 4.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [18:18:27] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2048 [18:19:02] (03CR) 10Andrew Bogott: [C:03+2] Keystone and Apache, 2gether again [puppet] - 10https://gerrit.wikimedia.org/r/1067461 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [18:19:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P68078 and previous config saved to /var/cache/conftool/dbconfig/20240828-181908-ladsgroup.json [18:19:32] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2048 [18:19:32] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2048 [18:22:49] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:22:53] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:23:11] I am assuming this is related to wikikube-worker2048 [18:24:02] (03PS2) 10RLazarus: deployment: Add undocumented flag --helmfile to mwscript-k8s. [puppet] - 10https://gerrit.wikimedia.org/r/1068061 [18:27:52] (03PS1) 10RLazarus: mediawiki: Disable livenessProbe for maintenance scripts. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068065 [18:28:49] (03CR) 10CI reject: [V:04-1] mediawiki: Disable livenessProbe for maintenance scripts. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068065 (owner: 10RLazarus) [18:28:55] sukhe: I’m afk but that ASN is an internal one so that’s likely yeah [18:29:03] RECOVERY - Disk space on restbase2022 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2022&var-datasource=codfw+prometheus/ops [18:29:13] topranks: go offline, we are here :P [18:29:31] haha [18:30:14] (03PS1) 10Andrew Bogott: keystone service module: replace https-socket with uwsgi-socket [puppet] - 10https://gerrit.wikimedia.org/r/1068066 (https://phabricator.wikimedia.org/T359590) [18:30:19] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for Southparkfan - https://phabricator.wikimedia.org/T373518#10100888 (10KFrancis) Thank you! The NDA has been sent via DocuSign. I'll confirm when it's complete. [18:31:06] (03CR) 10Andrew Bogott: [C:03+2] keystone service module: replace https-socket with uwsgi-socket [puppet] - 10https://gerrit.wikimedia.org/r/1068066 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [18:33:03] (03PS2) 10RLazarus: mediawiki: Disable livenessProbe for maintenance scripts. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1068065 [18:34:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P68079 and previous config saved to /var/cache/conftool/dbconfig/20240828-183416-ladsgroup.json [18:36:21] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2048.codfw.wmnet with reason: host reimage [18:39:20] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2048.codfw.wmnet with reason: host reimage [18:48:07] (03PS1) 10Ssingh: admin: add southparkfan to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1068073 (https://phabricator.wikimedia.org/T373518) [18:48:36] (03CR) 10Ssingh: "Pending manager/sponsor approval." [puppet] - 10https://gerrit.wikimedia.org/r/1068073 (https://phabricator.wikimedia.org/T373518) (owner: 10Ssingh) [18:49:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T370903)', diff saved to https://phabricator.wikimedia.org/P68080 and previous config saved to /var/cache/conftool/dbconfig/20240828-184923-ladsgroup.json [18:49:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2156.codfw.wmnet with reason: Maintenance [18:49:28] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [18:49:40] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2156.codfw.wmnet with reason: Maintenance [18:49:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance [18:49:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2186.codfw.wmnet with reason: Maintenance [18:49:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling 
db2156 (T370903)', diff saved to https://phabricator.wikimedia.org/P68081 and previous config saved to /var/cache/conftool/dbconfig/20240828-184950-ladsgroup.json [18:53:03] (03CR) 10Ottomata: "+1 generally but:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066718 (https://phabricator.wikimedia.org/T373192) (owner: 10JMeybohm) [18:53:48] (03CR) 10Ottomata: [C:03+1] eventgate-main: Disable end-to-end readinessProbe (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066719 (https://phabricator.wikimedia.org/T373192) (owner: 10JMeybohm) [18:54:07] (03CR) 10Ottomata: [C:03+1] "If we do this for eventgate-main, we should do it for all the other eventgate service too." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066719 (https://phabricator.wikimedia.org/T373192) (owner: 10JMeybohm) [18:54:20] (03CR) 10Ottomata: [C:03+1] "(unresolving)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1066719 (https://phabricator.wikimedia.org/T373192) (owner: 10JMeybohm) [18:59:40] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2048.codfw.wmnet with OS bullseye [18:59:57] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10100972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wiki... 
[19:02:17] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [19:08:02] (03CR) 10CDobbins: prometheus: add script to check TCP MSS clamping value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [19:08:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T370903)', diff saved to https://phabricator.wikimedia.org/P68082 and previous config saved to /var/cache/conftool/dbconfig/20240828-190817-ladsgroup.json [19:08:22] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [19:08:32] (03PS29) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [19:09:11] (03CR) 10CI reject: [V:04-1] prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) (owner: 10CDobbins) [19:09:54] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:23:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P68083 and previous config saved to /var/cache/conftool/dbconfig/20240828-192325-ladsgroup.json [19:24:10] RECOVERY - Disk space on thanos-be2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2002&var-datasource=codfw+prometheus/ops [19:24:21] FIRING: 
MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [19:28:12] (03CR) 10Scott French: [C:03+1] deployment: Add undocumented flag --helmfile to mwscript-k8s. [puppet] - 10https://gerrit.wikimedia.org/r/1068061 (owner: 10RLazarus) [19:29:21] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [19:32:50] RECOVERY - Disk space on thanos-be1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1002&var-datasource=eqiad+prometheus/ops [19:34:19] (03CR) 10RLazarus: [C:03+2] deployment: Add undocumented flag --helmfile to mwscript-k8s. 
[puppet] - 10https://gerrit.wikimedia.org/r/1068061 (owner: 10RLazarus) [19:36:30] RECOVERY - Disk space on thanos-be1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1001&var-datasource=eqiad+prometheus/ops [19:38:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P68084 and previous config saved to /var/cache/conftool/dbconfig/20240828-193832-ladsgroup.json [19:39:20] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Idle - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:42:37] (03PS1) 10Scott French: kubernetes: re-name/IP kubernetes2029 as wikikube-worker2049 [puppet] - 10https://gerrit.wikimedia.org/r/1068081 (https://phabricator.wikimedia.org/T372878) [19:43:30] RECOVERY - Disk space on thanos-be1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1004&var-datasource=eqiad+prometheus/ops [19:43:42] RECOVERY - Disk space on thanos-be1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops [19:44:44] (03PS6) 10Srishakatux: Add site entry for mnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) [19:45:32] (03CR) 10Scott French: [C:03+1] mediawiki: Disable livenessProbe for maintenance scripts. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1068065 (owner: 10RLazarus) [19:45:32] RECOVERY - Disk space on thanos-be2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops [19:49:20] RECOVERY - Disk space on thanos-be2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2003&var-datasource=codfw+prometheus/ops [19:49:27] (03CR) 10Dzahn: [C:03+2] "I added an annotation in grafana for the merge time of this. In the following 3 hours we still had 1 IP pop up a few times." [puppet] - 10https://gerrit.wikimedia.org/r/1068001 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:51:14] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [19:51:23] jouncebot: next [19:51:23] In 0 hour(s) and 8 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T2000) [19:52:04] RECOVERY - Disk space on thanos-be2004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2004&var-datasource=codfw+prometheus/ops [19:53:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T370903)', diff saved to https://phabricator.wikimedia.org/P68085 and previous config saved to /var/cache/conftool/dbconfig/20240828-195339-ladsgroup.json [19:53:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2177.codfw.wmnet with reason: Maintenance [19:53:44] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [19:53:54] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
8:00:00 on db2177.codfw.wmnet with reason: Maintenance [19:54:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T370903)', diff saved to https://phabricator.wikimedia.org/P68086 and previous config saved to /var/cache/conftool/dbconfig/20240828-195401-ladsgroup.json [19:54:12] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [19:56:33] (03CR) 10Srishakatux: "@hashar@free.fr As per @dziewonski@fastmail.fm the only extra step needed is to run the `namespaceDupes.php` maintenance script. Instructi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: 10Srishakatux) [19:57:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: 10Srishakatux) [19:58:10] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [19:59:27] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T2000). [20:00:04] Gerges, MatmaRex, and bvibber: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] o/ [20:00:29] hi. i have a couple of patches, they're all independent from each other [20:00:54] "you have my bug." "and my task." "and my patch!" 
[20:01:42] hi i can deploy [20:01:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T371742)', diff saved to https://phabricator.wikimedia.org/P68087 and previous config saved to /var/cache/conftool/dbconfig/20240828-200154-ladsgroup.json [20:01:59] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [20:02:06] whee [20:02:37] lol [20:02:40] (03PS30) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [20:02:54] i'll go in order - is Gerges around? [20:03:17] otherwise i'll start with yours MatmaRex [20:03:22] Here [20:03:29] good timing! [20:03:38] ok i'll start with yours Gerges [20:03:57] (03PS3) 10GergesShamon: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067433 (https://phabricator.wikimedia.org/T373468) [20:04:17] MatmaRex: can your backports go out together? 
[20:04:44] cjming: yep [20:05:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067433 (https://phabricator.wikimedia.org/T373468) (owner: 10GergesShamon) [20:05:55] (03Merged) 10jenkins-bot: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1067433 (https://phabricator.wikimedia.org/T373468) (owner: 10GergesShamon) [20:06:05] (03CR) 10Clare Ming: [C:03+2] auth: Relax AuthManager session state check while cde00b55 is deployed [core] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068051 (https://phabricator.wikimedia.org/T373504) (owner: 10Bartosz Dziewoński) [20:06:11] (03CR) 10Clare Ming: [C:03+2] Fix missing definition of setSaveErrorMessage too [extensions/DiscussionTools] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068052 (https://phabricator.wikimedia.org/T373288) (owner: 10Bartosz Dziewoński) [20:06:15] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1067433|Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata (T373468)]] [20:06:20] T373468: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T373468 [20:06:20] (03CR) 10Clare Ming: [C:03+2] CentralAuthApiSessionProvider: Avoid error in internal API requests [extensions/CentralAuth] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068041 (https://phabricator.wikimedia.org/T373507) (owner: 10Bartosz Dziewoński) [20:08:33] MatmaRex: since your backports are averaging 28 minutes to merge, i'll do your config patch next, then bvibber's config patch, then come back to your backports [20:08:45] ok [20:09:41] thanks [20:09:54] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc 
excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:09:58] !log cjming@deploy1003 cjming, gergesshamon: Backport for [[gerrit:1067433|Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata (T373468)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:10:21] Gerges: your patch is ready to test - lmk if/when to sync [20:10:31] (03CR) 10RLazarus: [C:03+2] mediawiki: Disable livenessProbe for maintenance scripts. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1068065 (owner: 10RLazarus) [20:11:14] cjming: How can I test this patch? [20:11:18] You can't :) [20:11:49] lol - i guess we sync and hope for the best? [20:12:21] If it's to this point, it's syntactically valid etc [20:12:23] Yup :) [20:12:24] (03Merged) 10jenkins-bot: mediawiki: Disable livenessProbe for maintenance scripts. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1068065 (owner: 10RLazarus) [20:12:29] alrighty [20:12:37] !log cjming@deploy1003 cjming, gergesshamon: Continuing with sync [20:12:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T370903)', diff saved to https://phabricator.wikimedia.org/P68088 and previous config saved to /var/cache/conftool/dbconfig/20240828-201250-ladsgroup.json [20:12:55] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [20:13:27] (03PS2) 10Bartosz Dziewoński: logging: Use '??=' operator to reduce repetition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066902 [20:17:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P68089 and previous config saved to /var/cache/conftool/dbconfig/20240828-201701-ladsgroup.json [20:17:18] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067433|Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata (T373468)]] (duration: 11m 02s) [20:17:21] T373468: Lift IP cap on this dates 10/09, 17/09, 24/09 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T373468 [20:17:40] Gerges: your patch should be live! [20:18:00] MatmaRex: doing your config patch now - assuming it's not really testable either? [20:18:18] other than maybe not breaking things [20:18:20] Thanks :) [20:18:26] cjming: yeah. 
it should work exactly the same as before [20:18:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066902 (owner: 10Bartosz Dziewoński) [20:19:26] (03Merged) 10jenkins-bot: logging: Use '??=' operator to reduce repetition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1066902 (owner: 10Bartosz Dziewoński) [20:19:39] MatmaRex: do you want to check on mwdebug when it's ready or should i just go head and sync? [20:19:44] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1066902|logging: Use '??=' operator to reduce repetition]] [20:20:11] cjming: i think it can be synced directly. CI checks for syntax errors, right? ;) [20:20:18] presumably [20:21:51] !log cjming@deploy1003 cjming, matmarex: Backport for [[gerrit:1066902|logging: Use '??=' operator to reduce repetition]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:21:52] !log cjming@deploy1003 cjming, matmarex: Continuing with sync [20:24:31] (03CR) 10Amire80: "Actually, anoop is probably right: we need to add the current namespaces as aliases for backwards compatibility, so as not to break the li" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: 10Srishakatux) [20:25:02] (03PS3) 10Bvibber: Disable HLS VP9 video tracks in TimedMediaHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068054 (https://phabricator.wikimedia.org/T373546) [20:25:10] \o/ [20:26:23] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1066902|logging: Use '??=' operator to reduce repetition]] (duration: 06m 39s) [20:26:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068054 (https://phabricator.wikimedia.org/T373546) (owner: 10Bvibber) [20:27:02] MatmaRex: config patch should be live - moving onto 
bvibber's patch while we wait for your backports to merge [20:27:17] 👍 [20:27:21] (03Merged) 10jenkins-bot: Disable HLS VP9 video tracks in TimedMediaHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068054 (https://phabricator.wikimedia.org/T373546) (owner: 10Bvibber) [20:27:38] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1068054|Disable HLS VP9 video tracks in TimedMediaHandler (T373546)]] [20:27:42] T373546: Migrate off HLS mov/mp4 experiment to a flat mov back-compat with WebM and MPEG-DASH - https://phabricator.wikimedia.org/T373546 [20:27:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P68090 and previous config saved to /var/cache/conftool/dbconfig/20240828-202757-ladsgroup.json [20:29:48] !log cjming@deploy1003 bvibber, cjming: Backport for [[gerrit:1068054|Disable HLS VP9 video tracks in TimedMediaHandler (T373546)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:29:53] bvibber: is your patch testable? up on mwdebug if so - lmk if/when to sync [20:30:00] yeah lemme check it [20:31:17] cjming: confirmed updated correctly :D [20:31:22] go ahead and sync [20:31:23] nice - syncing! [20:31:25] !log cjming@deploy1003 bvibber, cjming: Continuing with sync [20:32:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P68091 and previous config saved to /var/cache/conftool/dbconfig/20240828-203208-ladsgroup.json [20:35:49] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1068054|Disable HLS VP9 video tracks in TimedMediaHandler (T373546)]] (duration: 08m 10s) [20:35:55] T373546: Migrate off HLS mov/mp4 experiment to a flat mov back-compat with WebM and MPEG-DASH - https://phabricator.wikimedia.org/T373546 [20:35:57] bvibber: should be live! 
[20:36:05] \o/ [20:36:26] (03Merged) 10jenkins-bot: auth: Relax AuthManager session state check while cde00b55 is deployed [core] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068051 (https://phabricator.wikimedia.org/T373504) (owner: 10Bartosz Dziewoński) [20:36:29] (03Merged) 10jenkins-bot: Fix missing definition of setSaveErrorMessage too [extensions/DiscussionTools] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068052 (https://phabricator.wikimedia.org/T373288) (owner: 10Bartosz Dziewoński) [20:36:30] (03Merged) 10jenkins-bot: CentralAuthApiSessionProvider: Avoid error in internal API requests [extensions/CentralAuth] (wmf/1.43.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1068041 (https://phabricator.wikimedia.org/T373507) (owner: 10Bartosz Dziewoński) [20:36:37] cjming: looks good, thanks! [20:36:42] yw! [20:37:41] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1068051|auth: Relax AuthManager session state check while cde00b55 is deployed (T373504)]], [[gerrit:1068052|Fix missing definition of setSaveErrorMessage too (T373288)]], [[gerrit:1068041|CentralAuthApiSessionProvider: Avoid error in internal API requests (T373507)]] [20:37:47] T373504: Wikimedia\NormalizedException\NormalizedException: Authentication failed because of inconsistent provider array - https://phabricator.wikimedia.org/T373504 [20:37:48] T373288: Show error message when a shortened URL prevents user from adding a topic or comment - https://phabricator.wikimedia.org/T373288 [20:37:48] T373507: TypeError: Argument 1 passed to MediaWiki\Extension\CentralAuth\CentralAuthTokenManager::consume() must be of the type string, null given - https://phabricator.wikimedia.org/T373507 [20:39:45] !log cjming@deploy1003 matmarex, cjming: Backport for [[gerrit:1068051|auth: Relax AuthManager session state check while cde00b55 is deployed (T373504)]], [[gerrit:1068052|Fix missing definition of setSaveErrorMessage too (T373288)]], 
[[gerrit:1068041|CentralAuthApiSessionProvider: Avoid error in internal API requests (T373507)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:39:49] MatmaRex: if they're testable, all your backports are up on test servers - lmk when to sync [20:40:18] looking [20:43:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P68092 and previous config saved to /var/cache/conftool/dbconfig/20240828-204305-ladsgroup.json [20:44:37] cjming: looks good. i verified the DiscussionTools fix. the other two are not easily testable, but we have logging which will show whether they're fixed. [20:44:46] awesome - syncing! [20:44:50] !log cjming@deploy1003 matmarex, cjming: Continuing with sync [20:47:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T371742)', diff saved to https://phabricator.wikimedia.org/P68093 and previous config saved to /var/cache/conftool/dbconfig/20240828-204715-ladsgroup.json [20:47:18] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2199.codfw.wmnet with reason: Maintenance [20:47:20] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [20:47:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2199.codfw.wmnet with reason: Maintenance [20:49:13] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1068051|auth: Relax AuthManager session state check while cde00b55 is deployed (T373504)]], [[gerrit:1068052|Fix missing definition of setSaveErrorMessage too (T373288)]], [[gerrit:1068041|CentralAuthApiSessionProvider: Avoid error in internal API requests (T373507)]] (duration: 11m 31s) [20:49:19] T373504: Wikimedia\NormalizedException\NormalizedException: Authentication failed because of inconsistent provider array - 
https://phabricator.wikimedia.org/T373504 [20:49:20] T373288: Show error message when a shortened URL prevents user from adding a topic or comment - https://phabricator.wikimedia.org/T373288 [20:49:20] T373507: TypeError: Argument 1 passed to MediaWiki\Extension\CentralAuth\CentralAuthTokenManager::consume() must be of the type string, null given - https://phabricator.wikimedia.org/T373507 [20:49:36] MatmaRex: everything should be live [20:49:41] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [20:50:33] thanks cjming. very smooth deployment today :) [20:51:09] nice! [20:51:15] !log end of UTC late backport window [20:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:45] FIRING: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:53:36] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [20:54:02] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [20:57:45] RESOLVED: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:58:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T370903)', diff saved to 
https://phabricator.wikimedia.org/P68094 and previous config saved to /var/cache/conftool/dbconfig/20240828-205812-ladsgroup.json [20:58:14] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2190.codfw.wmnet with reason: Maintenance [20:58:16] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [20:58:27] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2190.codfw.wmnet with reason: Maintenance [20:58:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T370903)', diff saved to https://phabricator.wikimedia.org/P68095 and previous config saved to /var/cache/conftool/dbconfig/20240828-205834-ladsgroup.json [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240828T2100) [21:07:34] (03CR) 10RLazarus: [C:03+1] kubernetes: re-name/IP kubernetes2029 as wikikube-worker2049 [puppet] - 10https://gerrit.wikimedia.org/r/1068081 (https://phabricator.wikimedia.org/T372878) (owner: 10Scott French) [21:10:19] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2029.codfw.wmnet [21:10:56] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2029.codfw.wmnet [21:12:34] (03PS1) 10Reedy: Use more use statements rather than inline FQN [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068107 [21:13:13] (03CR) 10Scott French: [C:03+2] kubernetes: re-name/IP kubernetes2029 as wikikube-worker2049 [puppet] - 10https://gerrit.wikimedia.org/r/1068081 (https://phabricator.wikimedia.org/T372878) (owner: 10Scott French) [21:13:17] (03CR) 10CI reject: [V:04-1] Use more use statements rather than inline FQN [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068107 (owner: 10Reedy) [21:13:40] FIRING: SystemdUnitFailed: 
wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:15:52] !log swfrench@cumin2002 START - Cookbook sre.hosts.rename from kubernetes2029 to wikikube-worker2049 [21:16:11] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [21:17:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T370903)', diff saved to https://phabricator.wikimedia.org/P68096 and previous config saved to /var/cache/conftool/dbconfig/20240828-211734-ladsgroup.json [21:17:39] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [21:20:02] !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2029 to wikikube-worker2049 - swfrench@cumin2002" [21:20:31] !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2029 to wikikube-worker2049 - swfrench@cumin2002" [21:20:31] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:20:33] !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2049 [21:20:52] !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2049 [21:21:33] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2029 to wikikube-worker2049 [21:21:44] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10101297 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.rename started by swfrench@cumin2002 from kubernetes... [21:22:37] !log swfrench@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2049.codfw.wmnet on all recursors [21:22:41] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2049.codfw.wmnet on all recursors [21:23:29] !log swfrench@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2049.codfw.wmnet with OS bullseye [21:23:41] !log swfrench@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2049 [21:23:44] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10101298 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by swfrench@cumin2002 for host w... [21:24:19] !log swfrench@cumin2002 START - Cookbook sre.dns.netbox [21:25:57] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [21:25:59] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [21:26:13] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [21:26:29] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [21:27:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:28:29] !log swfrench@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2049 - swfrench@cumin2002" [21:28:34] !log swfrench@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) 
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2049 - swfrench@cumin2002" [21:28:34] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:28:35] !log swfrench@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2049.codfw.wmnet 59.16.192.10.in-addr.arpa 9.5.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:28:38] !log swfrench@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2049.codfw.wmnet 59.16.192.10.in-addr.arpa 9.5.0.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:28:39] !log swfrench@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2049 [21:29:00] !log swfrench@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2049 [21:29:00] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2049 [21:30:59] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [21:31:18] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [21:32:14] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [21:32:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P68097 and previous config saved to /var/cache/conftool/dbconfig/20240828-213242-ladsgroup.json [21:32:43] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [21:33:54] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [21:33:57] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply 
[21:39:10] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:39:12] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [21:39:38] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [21:43:04] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [21:43:35] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [21:46:59] !log swfrench@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2049.codfw.wmnet with reason: host reimage [21:47:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P68098 and previous config saved to /var/cache/conftool/dbconfig/20240828-214749-ladsgroup.json [21:49:22] (03PS1) 10Ladsgroup: Remove the "powered by mediawiki" override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068120 [21:50:43] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2049.codfw.wmnet with reason: host reimage [21:51:40] PROBLEM - Host kubernetes2029 is DOWN: PING CRITICAL - Packet loss = 100% [22:02:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T370903)', diff saved to https://phabricator.wikimedia.org/P68099 and previous config saved to /var/cache/conftool/dbconfig/20240828-220256-ladsgroup.json [22:02:58] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2194.codfw.wmnet with reason: Maintenance [22:03:01] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on 
WMF wikis - https://phabricator.wikimedia.org/T370903 [22:03:11] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2194.codfw.wmnet with reason: Maintenance [22:03:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2194 (T370903)', diff saved to https://phabricator.wikimedia.org/P68100 and previous config saved to /var/cache/conftool/dbconfig/20240828-220318-ladsgroup.json [22:05:48] FIRING: KubernetesCalicoDown: mw2294.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2294.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:09:16] (03PS8) 10Jdlrobson: Enable the dark mode in Portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: 10Ebrahim) [22:11:42] !log swfrench@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2049.codfw.wmnet with OS bullseye [22:11:50] (03CR) 10Jdlrobson: [C:04-1] "I think it's okay to do this for Commons, but we got feedback from English Wikipedia specifically that since portals are not maintained th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 (https://phabricator.wikimedia.org/T366380) (owner: 10Ebrahim) [22:11:53] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10101358 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by swfrench@cumin2002 for host wikik... 
[22:13:58] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [22:14:19] !log running homer 'lsw1-b3-codfw*' commit 'T372878' [22:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:23] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [22:17:17] !log swfrench@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2049.codfw.wmnet [22:17:18] !log swfrench@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2049.codfw.wmnet [22:17:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:18:14] (03CR) 10Bartosz Dziewoński: "There is already an alias for the 'Wikipedia' namespace on every Wikipedia: https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/d" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) (owner: 10Srishakatux) [22:19:51] (03PS7) 10Srishakatux: Add project talk aliases for mnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060893 (https://phabricator.wikimedia.org/T366271) [22:20:30] inflatador: ryankemper: I don't know if you're aware but wdqs is lagging so much the maxlag in wikidata is at 10 basically stopping all bots [22:20:36] https://www.wikidata.org/w/api.php?action=query&format=json&titles=Main%20Page&maxlag=-1 [22:20:41] yeah just saw it 30s ago actually [22:20:44] looking at graphs rn [22:20:49] it's wdqs1015 [22:22:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2194 (T370903)', diff saved to https://phabricator.wikimedia.org/P68101 and previous config saved to /var/cache/conftool/dbconfig/20240828-222204-ladsgroup.json [22:22:09] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [22:22:25] !log [WDQS] `ryankemper@wdqs1015:~$ sudo systemctl restart wdqs-blazegraph` [22:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:20] !log running homer 'cr*codfw*' commit 'T372878' [22:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:24] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [22:23:58] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [22:30:20] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 449, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:33:05] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2206.codfw.wmnet with reason: Maintenance [22:33:18] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2206.codfw.wmnet with reason: Maintenance [22:33:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2206 (T371742)', diff saved to https://phabricator.wikimedia.org/P68102 and previous config saved to /var/cache/conftool/dbconfig/20240828-223325-ladsgroup.json [22:33:29] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [22:37:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to 
https://phabricator.wikimedia.org/P68103 and previous config saved to /var/cache/conftool/dbconfig/20240828-223711-ladsgroup.json [22:37:30] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 531, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:52:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P68104 and previous config saved to /var/cache/conftool/dbconfig/20240828-225218-ladsgroup.json [23:04:54] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:07:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T370903)', diff saved to https://phabricator.wikimedia.org/P68105 and previous config saved to /var/cache/conftool/dbconfig/20240828-230726-ladsgroup.json [23:07:28] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2205.codfw.wmnet with reason: Maintenance [23:07:31] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [23:07:41] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2205.codfw.wmnet with reason: Maintenance [23:07:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2205 (T370903)', diff saved to https://phabricator.wikimedia.org/P68106 and previous config saved to /var/cache/conftool/dbconfig/20240828-230748-ladsgroup.json [23:26:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T370903)', diff saved to https://phabricator.wikimedia.org/P68107 and previous config 
saved to /var/cache/conftool/dbconfig/20240828-232653-ladsgroup.json [23:26:58] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [23:38:51] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1068175 [23:38:52] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1068175 (owner: 10TrainBranchBot) [23:42:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P68108 and previous config saved to /var/cache/conftool/dbconfig/20240828-234201-ladsgroup.json [23:57:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P68109 and previous config saved to /var/cache/conftool/dbconfig/20240828-235708-ladsgroup.json [23:57:24] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/nda for Southparkfan - https://phabricator.wikimedia.org/T373518#10101487 (10KFrancis) Hi all, I'm confirming the NDA is signed. Please proceed with next steps.