[00:00:02] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1191197 (owner: 10TrainBranchBot) [00:00:20] (03PS5) 10Krinkle: [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) [00:01:18] (03CR) 10Krinkle: [WIP] varnish: Invert unified_mobile_domains logic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [00:02:22] (03CR) 10CI reject: [V:04-1] [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [00:03:26] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [00:03:42] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [00:04:50] FIRING: DiskSpace: Disk space deploy1003:9100:/srv 2.825% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=deploy1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [00:08:37] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1191540 [00:08:37] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1191540 (owner: 10TrainBranchBot) [00:10:25] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply [00:10:53] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [00:23:07] (03PS6) 10Krinkle: [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) [00:23:07] (03PS1) 10Krinkle: varnish: Enable Vary:User-Agent on all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1191543 
(https://phabricator.wikimedia.org/T403866) [00:30:00] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1191540 (owner: 10TrainBranchBot) [00:32:56] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [00:33:04] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [00:33:27] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191543 (https://phabricator.wikimedia.org/T403866) (owner: 10Krinkle) [00:36:19] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [00:36:29] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [00:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:38:45] FIRING: Emergency syslog message: Alert for device pfw1-codfw.wikimedia.org - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [00:38:45] FIRING: Emergency syslog message: Alert for device pfw1-codfw.wikimedia.org - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [00:41:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [00:42:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [00:43:45] RESOLVED: Emergency syslog message: Device pfw1-codfw.wikimedia.org recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [00:43:45] RESOLVED: Emergency syslog message: Device pfw1-codfw.wikimedia.org recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [00:49:08] (03PS1) 10Krinkle: [WIP] Disable MF more [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191546 [00:49:37] (03PS2) 10Krinkle: varnish: Enable Vary:User-Agent on all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1191543 (https://phabricator.wikimedia.org/T403866) [00:49:45] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191543 (https://phabricator.wikimedia.org/T403866) (owner: 10Krinkle) [01:01:10] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:04:35] RESOLVED: DiskSpace: Disk space deploy1003:9100:/srv 2.825% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=deploy1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [01:15:04] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 53s) [01:15:51] 06SRE, 10DNS, 06MediaWiki-Platform-Team, 06Traffic, and 3 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11217836 (10Krinkle) [01:16:12] (03PS3) 10Krinkle: varnish: Enable Vary:User-Agent on all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1191543 (https://phabricator.wikimedia.org/T403866) [01:16:23] (03PS7) 10Krinkle: [WIP] varnish: Invert unified_mobile_domains logic [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) [01:21:35] (03PS1) 
10Krinkle: Disable inert MobileFrontend on wikimedia.org wikis lacking DNS (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191553 (https://phabricator.wikimedia.org/T152882) [01:22:54] 06SRE, 10DNS, 06MediaWiki-Platform-Team, 06Traffic, and 3 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11217854 (10Krinkle) [01:27:03] (03PS8) 10Krinkle: Disable wmgUseMdotRouting on misc wikimedia.org wikis (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191504 (https://phabricator.wikimedia.org/T403510) [01:28:06] (03PS9) 10Krinkle: Disable wmgUseMdotRouting on misc wikimedia.org wikis (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191504 (https://phabricator.wikimedia.org/T403510) [01:37:50] (03PS8) 10Krinkle: varnish: Enable unified mobile routing on misc wikimedia.org wikis [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) [01:38:45] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1232.eqiad.wmnet with OS bullseye [01:39:36] (03PS10) 10Krinkle: Disable wmgUseMdotRouting on misc wikimedia.org wikis (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191504 (https://phabricator.wikimedia.org/T403510) [01:41:00] (03CR) 10Krinkle: varnish: Enable unified mobile routing on misc wikimedia.org wikis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [01:51:07] (03Abandoned) 10Krinkle: [WIP] Disable MF more [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191546 (owner: 10Krinkle) [01:51:55] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1231.eqiad.wmnet with OS bullseye [01:52:10] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11217884 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
jclark@cumin1002 for host an-worker1231.eqiad.wmnet with OS bullseye [01:55:26] jclark@cumin1002 reimage (PID 3226282) is awaiting input [01:58:02] (03CR) 10Krinkle: "Beta run-puppet-agent on deployment-cache-text:" [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [01:58:43] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510) (owner: 10Krinkle) [01:59:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191553 (https://phabricator.wikimedia.org/T152882) (owner: 10Krinkle) [01:59:55] (03Merged) 10jenkins-bot: Disable inert MobileFrontend on wikimedia.org wikis lacking DNS (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191553 (https://phabricator.wikimedia.org/T152882) (owner: 10Krinkle) [02:00:44] !log krinkle@deploy2002 Started scap sync-world: Backport for [[gerrit:1191553|Disable inert MobileFrontend on wikimedia.org wikis lacking DNS (part 2) (T152882)]] [02:00:50] T152882: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 [02:07:12] !log krinkle@deploy2002 krinkle: Backport for [[gerrit:1191553|Disable inert MobileFrontend on wikimedia.org wikis lacking DNS (part 2) (T152882)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[02:07:20] T152882: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 [02:08:41] !log krinkle@deploy2002 krinkle: Continuing with sync [02:10:18] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1231.eqiad.wmnet with reason: host reimage [02:13:30] 06SRE, 10DNS, 06MediaWiki-Platform-Team, 06Traffic, and 3 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11217909 (10Krinkle) [02:13:45] 06SRE, 10DNS, 06MediaWiki-Platform-Team, 06Traffic, and 3 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11217910 (10Krinkle) [02:13:53] !log krinkle@deploy2002 Finished scap sync-world: Backport for [[gerrit:1191553|Disable inert MobileFrontend on wikimedia.org wikis lacking DNS (part 2) (T152882)]] (duration: 13m 09s) [02:14:00] T152882: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 [02:14:08] 06SRE, 10DNS, 06MediaWiki-Platform-Team, 06Traffic, and 3 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11217911 (10Krinkle) [02:14:31] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1231.eqiad.wmnet with reason: host reimage [02:14:32] 06SRE, 10DNS, 06MediaWiki-Platform-Team, 06Traffic, and 3 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11217913 (10Krinkle) [02:14:44] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1232.eqiad.wmnet with OS bullseye [02:14:50] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [02:16:16] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1230.eqiad.wmnet with OS bullseye [02:16:30] 
!log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1232.eqiad.wmnet with OS bullseye [02:16:31] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11217916 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1230.eqiad.wmnet with OS bullseye [02:16:34] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1229.eqiad.wmnet with OS bullseye [02:16:39] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11217917 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1232.eqiad.wmnet with OS bullseye [02:16:42] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11217918 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1229.eqiad.wmnet with OS bullseye [02:19:12] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1228.eqiad.wmnet with OS bullseye [02:19:25] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11217932 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1228.eqiad.wmnet with OS bullseye [02:19:38] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1227.eqiad.wmnet with OS bullseye [02:19:49] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11217933 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1227.eqiad.wmnet with OS bullseye [02:19:49] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for 
host an-worker1226.eqiad.wmnet with OS bullseye [02:19:58] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11217934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1226.eqiad.wmnet with OS bullseye [02:20:16] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1225.eqiad.wmnet with OS bullseye [02:20:30] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11217935 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1225.eqiad.wmnet with OS bullseye [02:21:45] 06SRE, 10DNS, 06MediaWiki-Platform-Team, 06Traffic, and 3 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11217938 (10Krinkle) [02:24:50] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [02:31:09] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1230.eqiad.wmnet with reason: host reimage [02:31:14] 06SRE, 10DNS, 06MediaWiki-Platform-Team, 06Traffic, and 3 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11217946 (10Krinkle) [02:31:52] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1232.eqiad.wmnet with reason: host reimage [02:33:04] jclark@cumin1002 reimage (PID 3263690) is awaiting input [02:33:21] jclark@cumin1002 reimage (PID 3263938) is awaiting input [02:33:46] jclark@cumin1002 reimage (PID 3263858) is awaiting input [02:34:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1230.eqiad.wmnet with reason: host 
reimage [02:36:02] jclark@cumin1002 reimage (PID 3260807) is awaiting input [02:36:27] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1227.eqiad.wmnet with OS bullseye [02:36:35] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11217954 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1227.eqiad.wmnet with OS bullseye executed with errors: - an-worker... [02:36:41] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1225.eqiad.wmnet with OS bullseye [02:36:48] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11217955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1225.eqiad.wmnet with OS bullseye executed with errors: - an-worker... [02:36:53] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1226.eqiad.wmnet with OS bullseye [02:37:00] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11217956 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1226.eqiad.wmnet with OS bullseye executed with errors: - an-worker... [02:37:06] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1229.eqiad.wmnet with OS bullseye [02:37:15] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11217957 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1229.eqiad.wmnet with OS bullseye executed with errors: - an-worker... 
[02:38:03] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [02:38:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1232.eqiad.wmnet with reason: host reimage [02:38:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [02:38:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1231.eqiad.wmnet with OS bullseye [02:38:37] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11217958 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1231.eqiad.wmnet with OS bullseye completed: - an-worker1231 (**WAR... 
[02:52:39] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11217960 (10Papaul) [02:53:54] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw:frack:rack/install/configuration new switches in rack F5 - https://phabricator.wikimedia.org/T405618#11217962 (10Papaul) [02:55:45] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [02:58:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:58:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [02:58:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1232.eqiad.wmnet with OS bullseye [02:59:03] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11217963 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1232.eqiad.wmnet with OS bullseye completed: - an-worker1232 (**PAS... 
[02:59:25] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (98.52%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [02:59:31] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [02:59:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [02:59:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1230.eqiad.wmnet with OS bullseye [03:00:00] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11217964 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1230.eqiad.wmnet with OS bullseye completed: - an-worker1230 (**WAR... [03:00:03] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11217965 (10Jclark-ctr) [03:00:52] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11217967 (10Jclark-ctr) found a number of servers not imaging believe this is that cause 'an-worker1209|an-worker121[0-9]an-worker122[0-9]|an-worker123[0-2]' missing a pipe betwe... 
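Note on the host pattern quoted in the 03:00:52 comment above: a minimal sketch of why a missing `|` would exclude most of the range. It assumes the pattern is applied as a whole-hostname regular expression on short hostnames; where and how it is actually used is not stated in the log. `FIXED` is a hypothetical correction that places the pipe between the `121[0-9]` and `122[0-9]` alternatives. The helper name `matching_hosts` and the host list are illustrative only.

```python
import re

# Pattern exactly as quoted in the log comment (no "|" between the
# 121x and 122x alternatives).
QUOTED = r"an-worker1209|an-worker121[0-9]an-worker122[0-9]|an-worker123[0-2]"

# Hypothetical correction (assumption): the missing pipe goes between
# the 121x and 122x alternatives.
FIXED = r"an-worker1209|an-worker121[0-9]|an-worker122[0-9]|an-worker123[0-2]"


def matching_hosts(pattern, hosts):
    """Return the hosts whose full name matches the pattern."""
    compiled = re.compile(pattern)
    return [h for h in hosts if compiled.fullmatch(h)]


# Short hostnames for an-worker1209 through an-worker1232 (the range in the task title).
HOSTS = [f"an-worker{n}" for n in range(1209, 1233)]

quoted_hits = matching_hosts(QUOTED, HOSTS)
fixed_hits = matching_hosts(FIXED, HOSTS)

print("matched by quoted pattern:", quoted_hits)
print("matched by fixed pattern: ", len(fixed_hits), "of", len(HOSTS))
```

Without the pipe, the middle alternative only matches two hostnames written back to back, so no single host in the 1210-1229 range can match it. Under these assumptions the quoted pattern covers only an-worker1209 and an-worker1230 through an-worker1232.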
[03:15:03] (03CR) 10Ottomata: [C:03+1] [eventgate_*] Bump eventgate to v1.25.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191506 (https://phabricator.wikimedia.org/T403169) (owner: 10TChin) [03:15:08] 06SRE, 06Infrastructure-Foundations: Request additional access for Dcops group - https://phabricator.wikimedia.org/T395939#11217983 (10Jclark-ctr) @elukey I finally have servers to test this on do we not have storcli as part of install? i was hoping to do this rather then create x288 VD manually for my in... [03:16:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:16:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:26:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:26:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:44:50] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:16:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:42:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [05:01:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:01:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:06:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:06:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:07:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:09:09] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:12:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:39:09] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250926T0600) [06:03:45] (03PS1) 10Jelto: aptrepo: upgrade gitlab-ce and gitlab-runner to 18.2 [puppet] - 10https://gerrit.wikimedia.org/r/1191572 (https://phabricator.wikimedia.org/T405699) [06:14:50] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [06:17:50] (03CR) 10Arnaudb: [C:03+1] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.2 [puppet] - 10https://gerrit.wikimedia.org/r/1191572 (https://phabricator.wikimedia.org/T405699) (owner: 10Jelto) [06:24:50] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:26:11] (03CR) 10Jelto: [C:03+2] aptrepo: upgrade gitlab-ce and gitlab-runner to 18.2 [puppet] - 10https://gerrit.wikimedia.org/r/1191572 (https://phabricator.wikimedia.org/T405699) (owner: 10Jelto) [06:58:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:59:25] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (98.47%) on ganeti1036:9100 
- https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250926T0700) [07:00:49] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica [07:02:22] 06SRE, 06Infrastructure-Foundations: Request additional access for Dcops group - https://phabricator.wikimedia.org/T395939#11218160 (10elukey) @Jclark-ctr o/ that server is a Dell, so you'll have to use `/usr/bin/perccli64` (it may differ a bit from storcli's syntax but it should do what you need). Lemme know! [07:04:01] (03CR) 10Slyngshede: [C:03+2] P:openldap::management add netbox-readonly-access to offboarding [puppet] - 10https://gerrit.wikimedia.org/r/1191363 (https://phabricator.wikimedia.org/T404494) (owner: 10Slyngshede) [07:06:50] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 3 (gerrit1003, ...), Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:09:09] FIRING: SystemdUnitFailed: load-dcatap-weekly.service on wdqs2010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:10:06] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab replica [07:10:14] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1191507 (https://phabricator.wikimedia.org/T404359) (owner: 10Dzahn) [07:12:25] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica 
[07:14:09] FIRING: [2x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1025:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:16:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:16:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:16:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:21:30] (03PS4) 10Stevemunene: admin/data: add the analytics-wikidata system user and user groups [puppet] - 10https://gerrit.wikimedia.org/r/1191349 (https://phabricator.wikimedia.org/T404073) [07:21:30] (03PS1) 10Stevemunene: airflow-wikidata: define ATS mapping rules and cache settings [puppet] - 10https://gerrit.wikimedia.org/r/1191578 (https://phabricator.wikimedia.org/T404073) [07:21:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:21:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:21:53] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab replica [07:24:08] !log jelto@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab 
[07:24:09] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1025:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:29:09] FIRING: [8x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:29:41] !log start deploying new backup grants T403166 [07:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:49] T403166: Setup dbprov1007 & dbprov2007; prepare for decommission dbprov1003 & dbprov2003 - https://phabricator.wikimedia.org/T403166 [07:30:58] (03CR) 10Filippo Giunchedi: "I'm with you re: not loving the approach. I did try systemd-networkd drop-in files, however I could not get that to work reliably. 
My unde" [puppet] - 10https://gerrit.wikimedia.org/r/1191326 (https://phabricator.wikimedia.org/T347681) (owner: 10Filippo Giunchedi) [07:34:09] FIRING: [12x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:39:09] FIRING: [15x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:44:09] FIRING: [18x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:44:50] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:49:09] FIRING: [20x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:52:24] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Use custom error pages [puppet] - 10https://gerrit.wikimedia.org/r/1191582 (https://phabricator.wikimedia.org/T283948) [07:52:26] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Move per-tool rate limiting to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1191583 (https://phabricator.wikimedia.org/T283948) [07:52:29] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Fix tool address in CSP header [puppet] - 
10https://gerrit.wikimedia.org/r/1191584 (https://phabricator.wikimedia.org/T283948) [07:53:02] (03PS2) 10Jcrespo: mariadb: Add new grants for dbprov1007 & dbprov2007 backups [puppet] - 10https://gerrit.wikimedia.org/r/1191451 (https://phabricator.wikimedia.org/T403166) [07:53:02] (03PS2) 10Jcrespo: dbbackups: Migrate dbprov[12]003 database backups to dbprov[12]007 [puppet] - 10https://gerrit.wikimedia.org/r/1191455 (https://phabricator.wikimedia.org/T403166) [07:53:02] (03PS1) 10Jcrespo: site.pp: Fix incorrect and missleading comment: db2201 hasn't s2 [puppet] - 10https://gerrit.wikimedia.org/r/1191585 (https://phabricator.wikimedia.org/T403166) [07:53:38] (03PS2) 10Jcrespo: site.pp: Fix missleading comment: db2201 does not have s2 [puppet] - 10https://gerrit.wikimedia.org/r/1191585 (https://phabricator.wikimedia.org/T403166) [07:53:50] (03PS3) 10Jcrespo: site.pp: Fix missleading comment: db2201 does not have s2 [puppet] - 10https://gerrit.wikimedia.org/r/1191585 (https://phabricator.wikimedia.org/T403166) [07:54:09] FIRING: [23x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:56:44] (03PS4) 10Jcrespo: site.pp: Fix misleading comment: db2201 does not have s2 [puppet] - 10https://gerrit.wikimedia.org/r/1191585 (https://phabricator.wikimedia.org/T403166) [07:56:58] (03PS5) 10Jcrespo: site.pp: Fix misleading comment: db2201 does not have s2 [puppet] - 10https://gerrit.wikimedia.org/r/1191585 (https://phabricator.wikimedia.org/T403166) [07:59:09] FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:59:22] (03CR) 10Brouberol: [C:03+1] "Approved 
on my part but don't merge yet as security must new sudo permissions." [puppet] - 10https://gerrit.wikimedia.org/r/1191349 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [08:01:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:01:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:02:37] (03PS2) 10Majavah: P:toolforge::k8s::haproxy: Move per-tool rate limiting to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1191583 (https://phabricator.wikimedia.org/T283948) [08:02:37] (03PS2) 10Majavah: P:toolforge::k8s::haproxy: Fix tool address in CSP header [puppet] - 10https://gerrit.wikimedia.org/r/1191584 (https://phabricator.wikimedia.org/T283948) [08:06:50] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [08:07:27] (03CR) 10MVernon: [C:03+1] "One very nitty nit, which you are free to disregard. If you apply it, I don't need to review again." 
[puppet] - 10https://gerrit.wikimedia.org/r/1191585 (https://phabricator.wikimedia.org/T403166) (owner: 10Jcrespo) [08:08:42] (03PS3) 10Majavah: P:toolforge::k8s::haproxy: Move per-tool rate limiting to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1191583 (https://phabricator.wikimedia.org/T283948) [08:08:42] (03PS3) 10Majavah: P:toolforge::k8s::haproxy: Fix tool address in CSP header [puppet] - 10https://gerrit.wikimedia.org/r/1191584 (https://phabricator.wikimedia.org/T283948) [08:11:40] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 6.501 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:11:40] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 6.608 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:12:52] (03PS6) 10Jcrespo: site.pp: Fix misleading comment: db2201 does not have s2 [puppet] - 10https://gerrit.wikimedia.org/r/1191585 (https://phabricator.wikimedia.org/T403166) [08:12:59] (03CR) 10Jcrespo: site.pp: Fix misleading comment: db2201 does not have s2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191585 (https://phabricator.wikimedia.org/T403166) (owner: 10Jcrespo) [08:14:31] (03CR) 10Jcrespo: [C:03+2] site.pp: Fix misleading comment: db2201 does not have s2 [puppet] - 10https://gerrit.wikimedia.org/r/1191585 (https://phabricator.wikimedia.org/T403166) (owner: 10Jcrespo) [08:17:59] (03PS4) 10Majavah: P:toolforge::k8s::haproxy: Move per-tool rate limiting to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1191583 (https://phabricator.wikimedia.org/T283948) [08:17:59] (03PS4) 10Majavah: P:toolforge::k8s::haproxy: Fix tool address in CSP header [puppet] - 10https://gerrit.wikimedia.org/r/1191584 (https://phabricator.wikimedia.org/T283948) [08:19:57] 10SRE-tools, 06Infrastructure-Foundations, 10GitLab (Infrastructure): CI error on operations/cookbooks - https://phabricator.wikimedia.org/T405706 
(10ABran-WMF) 03NEW [08:21:09] (03CR) 10Brouberol: [C:03+1] Remove the existing spark-operator release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191136 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [08:21:16] (03CR) 10Brouberol: [C:03+1] Remove our custom spark-operator helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191137 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [08:21:36] 10SRE-tools, 06Infrastructure-Foundations: CI error on operations/cookbooks - https://phabricator.wikimedia.org/T405706#11218316 (10taavi) [08:21:46] (03CR) 10Brouberol: [C:03+1] Add the spark-operator CRDs for version 2.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191138 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [08:24:02] (03CR) 10Brouberol: "You might want to have the previous chart removed, to guarantee that we're installing the upstream/modified one." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191139 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [08:24:30] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: DeprecationWarning: datetime.datetime.utcnow() is deprecated - https://phabricator.wikimedia.org/T401581#11218318 (10A_smart_kitten) [08:25:45] (03CR) 10Brouberol: [C:03+1] Create a helmfile release for the updated spark-operator-crds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191141 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [08:28:34] (03CR) 10Brouberol: Customise the imported spark-operator chart for deployment to WMF (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [08:28:57] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[08:36:11] (03PS1) 10Jelto: admin_ng: Change eqiad pod ip range to 10.67.128.0/17 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191647 (https://phabricator.wikimedia.org/T375845) [08:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:36:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:36:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:39:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:42:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [08:46:34] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54827 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:46:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:49:43] (03PS1) 10Jelto: Update eqiad pod ip range [puppet] - 10https://gerrit.wikimedia.org/r/1191652 (https://phabricator.wikimedia.org/T375845) [08:49:48] (03PS1) 10Jelto: Update eqiad to kubernetes 1.31, 
calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1191653 (https://phabricator.wikimedia.org/T405703) [08:52:00] (03CR) 10Fabfur: "Sorry, ran these for a text and upload hosts but forgot to add it here, anyway" [puppet] - 10https://gerrit.wikimedia.org/r/1191010 (https://phabricator.wikimedia.org/T392880) (owner: 10Fabfur) [08:55:35] (03PS2) 10Jasmine: wikikube: Add wikikube-worker2[248-330] [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) [08:56:04] (03CR) 10CI reject: [V:04-1] wikikube: Add wikikube-worker2[248-330] [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) (owner: 10Jasmine) [09:04:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:17:43] (03PS1) 10Jelto: Update eqiad to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191656 (https://phabricator.wikimedia.org/T405703) [09:19:26] (03PS2) 10Majavah: P:toolforge::k8s::haproxy: Use custom error pages [puppet] - 10https://gerrit.wikimedia.org/r/1191582 (https://phabricator.wikimedia.org/T283948) [09:19:26] (03PS5) 10Majavah: P:toolforge::k8s::haproxy: Move per-tool rate limiting to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1191583 (https://phabricator.wikimedia.org/T283948) [09:19:26] (03PS5) 10Majavah: P:toolforge::k8s::haproxy: Fix tool address in CSP header [puppet] - 10https://gerrit.wikimedia.org/r/1191584 (https://phabricator.wikimedia.org/T283948) [09:19:27] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Make connection limit a parameter [puppet] - 10https://gerrit.wikimedia.org/r/1191657 [09:20:30] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7060/co" [puppet] - 10https://gerrit.wikimedia.org/r/1191657 (owner: 10Majavah) [09:24:06] (03CR) 10Jelto: Update eqiad to k8s 1.31 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191656 (https://phabricator.wikimedia.org/T405703) (owner: 10Jelto) [09:26:56] (03CR) 10FNegri: [C:03+1] "Nice, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1191657 (owner: 10Majavah) [09:27:22] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge::k8s::haproxy: Make connection limit a parameter [puppet] - 10https://gerrit.wikimedia.org/r/1191657 (owner: 10Majavah) [09:29:54] (03PS3) 10Majavah: P:toolforge::k8s::haproxy: Use custom error pages [puppet] - 10https://gerrit.wikimedia.org/r/1191582 (https://phabricator.wikimedia.org/T283948) [09:29:54] (03PS6) 10Majavah: P:toolforge::k8s::haproxy: Move per-tool rate limiting to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1191583 (https://phabricator.wikimedia.org/T283948) [09:29:54] (03PS6) 10Majavah: P:toolforge::k8s::haproxy: Fix tool address in CSP header [puppet] - 10https://gerrit.wikimedia.org/r/1191584 (https://phabricator.wikimedia.org/T283948) [09:40:56] (03PS3) 10Jcrespo: mariadb: Add new grants for dbprov1007 & dbprov2007 backups [puppet] - 10https://gerrit.wikimedia.org/r/1191451 (https://phabricator.wikimedia.org/T403166) [09:41:40] 06SRE, 10DNS, 06MediaWiki-Platform-Team, 06Traffic, and 3 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11218562 (10Tgr) >>! In T152882#11217659, @Krinkle wrote: > That means MobileFrontend on loginwiki, in theory, provides just two things: > * Allowing calls to `M... 
[09:43:28] (03CR) 10Jcrespo: [C:03+2] mariadb: Add new grants for dbprov1007 & dbprov2007 backups [puppet] - 10https://gerrit.wikimedia.org/r/1191451 (https://phabricator.wikimedia.org/T403166) (owner: 10Jcrespo) [09:44:12] !log finished deploying new grants T403166 [09:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:19] T403166: Setup dbprov1007 & dbprov2007; prepare for decommission dbprov1003 & dbprov2003 - https://phabricator.wikimedia.org/T403166 [09:46:36] (03CR) 10Filippo Giunchedi: [C:03+1] "Untested on my end, LGTM though" [puppet] - 10https://gerrit.wikimedia.org/r/1191583 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [09:47:32] (03PS3) 10Jcrespo: dbbackups: Migrate dbprov[12]003 database backups to dbprov[12]007 [puppet] - 10https://gerrit.wikimedia.org/r/1191455 (https://phabricator.wikimedia.org/T403166) [09:50:20] (03CR) 10Majavah: [C:03+2] P:toolforge::k8s::haproxy: Use custom error pages [puppet] - 10https://gerrit.wikimedia.org/r/1191582 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [09:50:22] (03PS4) 10Jcrespo: dbbackups: Migrate dbprov[12]003 database backups to dbprov[12]007 [puppet] - 10https://gerrit.wikimedia.org/r/1191455 (https://phabricator.wikimedia.org/T403166) [09:50:30] (03CR) 10Majavah: [C:03+2] P:toolforge::k8s::haproxy: Move per-tool rate limiting to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1191583 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [09:51:16] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for gengh - https://phabricator.wikimedia.org/T405713 (10gengh) 03NEW [09:51:33] (03PS7) 10Majavah: P:toolforge::k8s::haproxy: Move per-tool rate limiting to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1191583 (https://phabricator.wikimedia.org/T283948) [09:51:33] (03PS7) 10Majavah: P:toolforge::k8s::haproxy: Fix tool address in CSP header [puppet] - 10https://gerrit.wikimedia.org/r/1191584 
(https://phabricator.wikimedia.org/T283948) [09:51:51] (03PS5) 10Jcrespo: dbbackups: Migrate dbprov[12]003 database backups to dbprov[12]007 [puppet] - 10https://gerrit.wikimedia.org/r/1191455 (https://phabricator.wikimedia.org/T403166) [09:53:02] (03PS6) 10Jcrespo: dbbackups: Migrate dbprov[12]003 database backups to dbprov[12]007 [puppet] - 10https://gerrit.wikimedia.org/r/1191455 (https://phabricator.wikimedia.org/T403166) [09:55:32] (03CR) 10Majavah: [C:03+2] P:toolforge::k8s::haproxy: Move per-tool rate limiting to HAProxy [puppet] - 10https://gerrit.wikimedia.org/r/1191583 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [10:00:24] (03PS7) 10Jcrespo: dbbackups: Migrate dbprov[12]003 database backups to dbprov[12]007 [puppet] - 10https://gerrit.wikimedia.org/r/1191455 (https://phabricator.wikimedia.org/T403166) [10:00:33] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191455 (https://phabricator.wikimedia.org/T403166) (owner: 10Jcrespo) [10:05:30] (03PS1) 10Majavah: haproxy::cloud: Add an admin-level socket [puppet] - 10https://gerrit.wikimedia.org/r/1191662 [10:05:52] (03PS8) 10Jcrespo: dbbackups: Migrate dbprov[12]003 database backups to dbprov[12]007 [puppet] - 10https://gerrit.wikimedia.org/r/1191455 (https://phabricator.wikimedia.org/T403166) [10:07:24] (03PS9) 10Jcrespo: dbbackups: Migrate dbprov[12]003 database backups to dbprov[12]007 [puppet] - 10https://gerrit.wikimedia.org/r/1191455 (https://phabricator.wikimedia.org/T403166) [10:07:47] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1191455 (https://phabricator.wikimedia.org/T403166) (owner: 10Jcrespo) [10:14:07] (03PS2) 10Majavah: haproxy::cloud: Add an admin-level socket [puppet] - 10https://gerrit.wikimedia.org/r/1191662 [10:14:07] (03PS1) 10Majavah: haproxy::cloud::base: Do not duplicate main haproxy class [puppet] - 10https://gerrit.wikimedia.org/r/1191664 [10:14:50] FIRING: OsmSynchronisationLag: Maps - OSM 
synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [10:15:13] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7061/co" [puppet] - 10https://gerrit.wikimedia.org/r/1191664 (owner: 10Majavah) [10:19:30] (03PS1) 10Btullis: Correct the preseed value for the new an-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1191667 (https://phabricator.wikimedia.org/T399964) [10:20:40] (03CR) 10MVernon: [C:03+1] "Looks reasonable to me, assuming you're happy with the CI output." [puppet] - 10https://gerrit.wikimedia.org/r/1191455 (https://phabricator.wikimedia.org/T403166) (owner: 10Jcrespo) [10:21:27] (03CR) 10Jcrespo: dbbackups: Migrate dbprov[12]003 database backups to dbprov[12]007 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191455 (https://phabricator.wikimedia.org/T403166) (owner: 10Jcrespo) [10:22:26] (03PS2) 10Majavah: haproxy::cloud::base: Do not duplicate main haproxy class [puppet] - 10https://gerrit.wikimedia.org/r/1191664 [10:23:27] (03PS10) 10Jcrespo: dbbackups: Migrate dbprov[12]003 database backups to dbprov[12]007 [puppet] - 10https://gerrit.wikimedia.org/r/1191455 (https://phabricator.wikimedia.org/T403166) [10:23:57] (03CR) 10Btullis: [C:03+2] Correct the preseed value for the new an-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1191667 (https://phabricator.wikimedia.org/T399964) (owner: 10Btullis) [10:24:06] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7062/co" [puppet] - 10https://gerrit.wikimedia.org/r/1191664 (owner: 10Majavah) [10:24:49] (03PS3) 10Majavah: haproxy::cloud::base: Do not duplicate main haproxy class [puppet] - 
10https://gerrit.wikimedia.org/r/1191664 [10:24:50] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:26:14] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7063/co" [puppet] - 10https://gerrit.wikimedia.org/r/1191664 (owner: 10Majavah) [10:27:40] (03CR) 10Clément Goubert: [C:03+1] Update eqiad pod ip range [puppet] - 10https://gerrit.wikimedia.org/r/1191652 (https://phabricator.wikimedia.org/T375845) (owner: 10Jelto) [10:27:54] (03CR) 10Clément Goubert: [C:03+1] admin_ng: Change eqiad pod ip range to 10.67.128.0/17 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191647 (https://phabricator.wikimedia.org/T375845) (owner: 10Jelto) [10:29:04] (03PS4) 10Majavah: haproxy::cloud: Do not duplicate main haproxy class [puppet] - 10https://gerrit.wikimedia.org/r/1191664 [10:29:35] (03CR) 10Jcrespo: [C:03+2] "Change amended." 
[puppet] - 10https://gerrit.wikimedia.org/r/1191455 (https://phabricator.wikimedia.org/T403166) (owner: 10Jcrespo) [10:29:43] (03CR) 10Clément Goubert: [C:03+1] Update eqiad to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1191653 (https://phabricator.wikimedia.org/T405703) (owner: 10Jelto) [10:29:44] (03PS5) 10Majavah: haproxy::cloud: Do not duplicate main haproxy class [puppet] - 10https://gerrit.wikimedia.org/r/1191664 [10:29:54] (03CR) 10Jcrespo: [C:03+2] dbbackups: Migrate dbprov[12]003 database backups to dbprov[12]007 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191455 (https://phabricator.wikimedia.org/T403166) (owner: 10Jcrespo) [10:30:15] (03CR) 10Clément Goubert: Update eqiad to k8s 1.31 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191656 (https://phabricator.wikimedia.org/T405703) (owner: 10Jelto) [10:31:04] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7065/co" [puppet] - 10https://gerrit.wikimedia.org/r/1191664 (owner: 10Majavah) [10:33:24] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[10:35:45] (03CR) 10MVernon: [C:03+1] dbbackups: Migrate dbprov[12]003 database backups to dbprov[12]007 [puppet] - 10https://gerrit.wikimedia.org/r/1191455 (https://phabricator.wikimedia.org/T403166) (owner: 10Jcrespo) [10:36:06] (03PS3) 10Jasmine: wikikube: Add wikikube-worker2[248-330] [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) [10:36:42] (03CR) 10CI reject: [V:04-1] wikikube: Add wikikube-worker2[248-330] [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) (owner: 10Jasmine) [10:44:31] (03PS1) 10Clément Goubert: taskgen: Update calico IPPool check [puppet] - 10https://gerrit.wikimedia.org/r/1191671 (https://phabricator.wikimedia.org/T375845) [10:45:07] (03PS4) 10Jasmine: wikikube: Add wikikube-worker2[248-330] [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) [10:47:19] (03CR) 10CI reject: [V:04-1] taskgen: Update calico IPPool check [puppet] - 10https://gerrit.wikimedia.org/r/1191671 (https://phabricator.wikimedia.org/T375845) (owner: 10Clément Goubert) [10:47:29] *grmbl* [10:52:55] 06SRE, 10Wikimedia-Mailing-lists: Request for a mailing list for Moore Wikimedians - https://phabricator.wikimedia.org/T405164#11218785 (10Hasslaebetch) Thank you for your support and assistance. [10:54:10] (03CR) 10Clément Goubert: "Latest CI failure seems unrelated, reported in https://phabricator.wikimedia.org/T401383#11218788" [puppet] - 10https://gerrit.wikimedia.org/r/1191671 (https://phabricator.wikimedia.org/T375845) (owner: 10Clément Goubert) [10:54:23] 06SRE, 06Infrastructure-Foundations: Request additional access for Dcops group - https://phabricator.wikimedia.org/T395939#11218796 (10BTullis) What about `megacli` as well? There are still quite a few older servers that use this. 
[10:54:50] 06SRE, 06Infrastructure-Foundations: Request additional access for Dcops group - https://phabricator.wikimedia.org/T395939#11218800 (10Jclark-ctr) @elukey Thanks for confirming. I did try that command first, but I was getting failures since it won’t show any controllers without sudo. Here’s my output compared... [10:55:08] (03CR) 10Clément Goubert: [C:03+1] (api|rest)-gateway: Add option to disable CSP, disable for rest.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191344 (https://phabricator.wikimedia.org/T405368) (owner: 10Hnowlan) [10:58:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:58:39] !log testing backups after new config deploy T403166 [10:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:46] T403166: Setup dbprov1007 & dbprov2007; prepare for decommission dbprov1003 & dbprov2003 - https://phabricator.wikimedia.org/T403166 [10:59:25] FIRING: GanetiMemoryPressure: Ganeti: High memory usage (98.88%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250926T0700) [11:00:05] jelto, arnoldokoth, and mutante: That opportune time for a GitLab version upgrades deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250926T1100). 
[11:17:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:33:04] (03CR) 10Filippo Giunchedi: [C:03+1] haproxy::cloud: Do not duplicate main haproxy class [puppet] - 10https://gerrit.wikimedia.org/r/1191664 (owner: 10Majavah) [11:35:16] (03CR) 10Btullis: [C:03+2] Remove the existing spark-operator release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191136 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [11:43:30] (03Merged) 10jenkins-bot: Remove the existing spark-operator release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191136 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [11:44:50] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:53:54] 06SRE, 06Infrastructure-Foundations: Request additional access for Dcops group - https://phabricator.wikimedia.org/T395939#11218988 (10MoritzMuehlenhoff) >>! In T395939#11218796, @BTullis wrote: > What about `megacli` as well? There are still quite a few older servers that use this. megacli is already covered... 
[11:55:26] (03PS1) 10Majavah: P:wmcs: maintain-dbusers: Remove accounts names from Prometheus labels [puppet] - 10https://gerrit.wikimedia.org/r/1191675 (https://phabricator.wikimedia.org/T405728) [11:57:31] (03CR) 10Muehlenhoff: "I'll add this to the agenda for Monday's meeting" [puppet] - 10https://gerrit.wikimedia.org/r/1191349 (https://phabricator.wikimedia.org/T404073) (owner: 10Stevemunene) [11:59:50] FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:04:07] jouncebot nowandnext [12:04:07] For the next 18 hour(s) and 55 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250926T0700) [12:04:07] In 18 hour(s) and 55 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250927T0700) [12:08:59] Hello. There's a JavaScript warning being triggered on every pageview at the moment. An analytics instrument that was deployed as part of the train is calling a method that wasn't [12:09:13] The change that introduced the missing method was backported to -wmf.19 but not to -wmf.20 [12:09:34] Can I deploy the backport to -wmf.20? [12:09:46] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MetricsPlatform/+/1190648 [12:21:06] (03PS1) 10Muehlenhoff: osm_master: Store kartotherian and tegola passwords [puppet] - 10https://gerrit.wikimedia.org/r/1191680 (https://phabricator.wikimedia.org/T381565) [12:21:30] (03CR) 10Jforrester: [C:03+1] "Nice fix." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191514 (https://phabricator.wikimedia.org/T152882) (owner: 10Krinkle) [12:23:52] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11219036 (10MoritzMuehlenhoff) [12:26:45] (03Abandoned) 10Majavah: ldap::client::sssd: use strongly typed parameters [puppet] - 10https://gerrit.wikimedia.org/r/924985 (owner: 10Majavah) [12:28:44] phuedx: I believe you should ask for an emergency deployment, see https://wikitech.wikimedia.org/wiki/Deployments/Emergencies for who to ping [12:29:25] (in my experience there’s a pretty good chance to get JS fixes deployed… though I now see that backport also touches PHP files) [12:31:56] 06SRE, 06Infrastructure-Foundations: Request additional access for Dcops group - https://phabricator.wikimedia.org/T395939#11219044 (10Jclark-ctr) B Because no controller found i am unable to create Raids using perccli65 ` jclark@an-worker1230:~$ perccli64 /c0 add vd each r0 wb ra CLI Version = 007.1910.0000... [12:32:51] 06SRE, 06Infrastructure-Foundations: Request additional access for Dcops group - https://phabricator.wikimedia.org/T395939#11219052 (10Jclark-ctr) ` ` jclark@an-worker1230:~$ sudo perccli64 /c0 show We trust you have received the usual lecture from the local System Administrator. It usually boils down to thes... [12:33:50] I need an emergency deploy for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MetricsPlatform/+/1190648 -- context is that there's a JS warning being triggered on every pageview at the moment. An analytics instrument that was deployed as part of the train is calling a method that wasn't. Are SRE ok with a deployment? (cc: thcipriani brennen). I can deploy [12:33:57] Lucas_WMDE: Many thanks for the pointer [12:35:12] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[12:35:43] !log created cn=airflow-wikidata-ops group T405557 [12:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:50] T405557: Request for airflow-wikidata-ops primary group - https://phabricator.wikimedia.org/T405557 [12:36:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:36:44] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:36:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:36:59] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1229.eqiad.wmnet with OS bullseye [12:37:09] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11219063 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1229.eqiad.wmnet with OS bullseye [12:37:45] phuedx: I don't think there's an issue with that, heads-up on-call: arnoldokoth tappof slyngs bblack [12:38:11] Please test thoroughly at the testservers phase though [12:38:17] +1 [12:41:38] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54829 bytes in 4.626 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:41:38] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 4.792 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:41:46] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit 
httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:42:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [12:45:14] 10SRE-tools, 06Infrastructure-Foundations: CI error on operations/cookbooks - https://phabricator.wikimedia.org/T405706#11219110 (10LSobanski) Adding @ltoscano as this is likely to be related to a Dell firmware change. [12:46:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:46:55] jelto@cumin1003 jelto: The backup on gitlab1004 is complete, ready to proceed with upgrade. [12:47:57] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1228.eqiad.wmnet with OS bullseye [12:48:05] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11219119 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1228.eqiad.wmnet with OS bullseye [12:48:31] (03PS1) 10Phuedx: lib: Update metrics-platform to fc7678c10a1f [extensions/EventLogging] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191683 (https://phabricator.wikimedia.org/T401380) [12:49:00] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1227.eqiad.wmnet with OS bullseye [12:49:14] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11219147 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host 
an-worker1227.eqiad.wmnet with OS bullseye [12:49:55] jelto@cumin1003 upgrade (PID 47404) is awaiting input [12:50:37] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1226.eqiad.wmnet with OS bullseye [12:50:46] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11219155 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1226.eqiad.wmnet with OS bullseye [12:51:21] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1225.eqiad.wmnet with OS bullseye [12:51:41] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11219163 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1225.eqiad.wmnet with OS bullseye [12:52:03] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1229.eqiad.wmnet with reason: host reimage [12:52:54] Just waiting for CI to complete on the dependent patch [12:55:01] (03CR) 10Filippo Giunchedi: [C:03+1] P:wmcs: maintain-dbusers: Remove accounts names from Prometheus labels [puppet] - 10https://gerrit.wikimedia.org/r/1191675 (https://phabricator.wikimedia.org/T405728) (owner: 10Majavah) [12:55:51] (03CR) 10Majavah: [C:03+2] P:wmcs: maintain-dbusers: Remove accounts names from Prometheus labels [puppet] - 10https://gerrit.wikimedia.org/r/1191675 (https://phabricator.wikimedia.org/T405728) (owner: 10Majavah) [12:57:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1229.eqiad.wmnet with reason: host reimage [13:00:38] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 2353 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [13:00:49] ^ expected because of 
the maintenance [13:01:38] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 115466 bytes in 0.477 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [13:03:24] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1228.eqiad.wmnet with reason: host reimage [13:03:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy2002 using scap backport" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1190648 (https://phabricator.wikimedia.org/T401380) (owner: 10Phuedx) [13:03:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy2002 using scap backport" [extensions/EventLogging] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191683 (https://phabricator.wikimedia.org/T401380) (owner: 10Phuedx) [13:04:02] (03Merged) 10jenkins-bot: lib: Update metrics-platform to fc7678c10a1f [extensions/EventLogging] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1191683 (https://phabricator.wikimedia.org/T401380) (owner: 10Phuedx) [13:04:05] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab [13:04:14] (03CR) 10Btullis: [C:03+2] Remove our custom spark-operator helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191137 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [13:04:29] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1227.eqiad.wmnet with reason: host reimage [13:05:23] (03Merged) 10jenkins-bot: Remove our custom spark-operator helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191137 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [13:05:31] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1226.eqiad.wmnet with reason: host reimage [13:06:16] (03Merged) 10jenkins-bot: ext.xLab: Add 
mw.xLab.getInstrument() [extensions/MetricsPlatform] (wmf/1.45.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1190648 (https://phabricator.wikimedia.org/T401380) (owner: 10Phuedx) [13:06:40] !log phuedx@deploy2002 Started scap sync-world: Backport for [[gerrit:1190648|ext.xLab: Add mw.xLab.getInstrument() (T401380 T404851)]], [[gerrit:1191683|lib: Update metrics-platform to fc7678c10a1f (T401380)]] [13:06:48] T401380: MetricsPlatform: Initialize MetricsClient with instrument configs fetched from xLab - https://phabricator.wikimedia.org/T401380 [13:06:48] T404851: MetricsPlatform should parse sample_rate value to number - https://phabricator.wikimedia.org/T404851 [13:07:32] jclark@cumin1002 reimage (PID 3858126) is awaiting input [13:09:13] (03PS3) 10Btullis: Add the spark-operator CRDs for version 2.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191138 (https://phabricator.wikimedia.org/T405490) [13:09:31] (03PS4) 10Btullis: Import the upstream spark-operator chart version 2.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191139 (https://phabricator.wikimedia.org/T405490) [13:09:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1228.eqiad.wmnet with reason: host reimage [13:09:38] (03PS6) 10Btullis: Customise the imported spark-operator chart for deployment to WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) [13:09:41] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1225.eqiad.wmnet with OS bullseye [13:09:47] (03PS6) 10Btullis: Create a helmfile release for the updated spark-operator-crds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191141 (https://phabricator.wikimedia.org/T405490) [13:09:49] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11219360 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1225.eqiad.wmnet with OS bullseye executed with errors: - an-worker... [13:09:54] (03PS6) 10Btullis: Add a helmfile release for the updated spark-operator version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490) [13:10:51] (03CR) 10Btullis: Customise the imported spark-operator chart for deployment to WMF (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [13:12:57] !log phuedx@deploy2002 phuedx: Backport for [[gerrit:1190648|ext.xLab: Add mw.xLab.getInstrument() (T401380 T404851)]], [[gerrit:1191683|lib: Update metrics-platform to fc7678c10a1f (T401380)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:13:05] T401380: MetricsPlatform: Initialize MetricsClient with instrument configs fetched from xLab - https://phabricator.wikimedia.org/T401380 [13:13:06] T404851: MetricsPlatform should parse sample_rate value to number - https://phabricator.wikimedia.org/T404851 [13:13:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1227.eqiad.wmnet with reason: host reimage [13:14:14] (03CR) 10FNegri: [C:03+1] P:toolforge::k8s::haproxy: Fix tool address in CSP header [puppet] - 10https://gerrit.wikimedia.org/r/1191584 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [13:14:30] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1225.eqiad.wmnet with OS bullseye [13:14:45] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11219374 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1225.eqiad.wmnet with OS bullseye [13:14:50] (03CR) 10FNegri: [C:03+1] haproxy::cloud: Add 
an admin-level socket [puppet] - 10https://gerrit.wikimedia.org/r/1191662 (owner: 10Majavah) [13:18:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1226.eqiad.wmnet with reason: host reimage [13:18:38] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis tokwiki in section s5 [13:20:51] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Checking sanitization for wikis tokwiki in section s5 [13:21:10] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [13:21:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [13:21:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1229.eqiad.wmnet with OS bullseye [13:21:40] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11219396 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1229.eqiad.wmnet with OS bullseye completed: - an-worker1229 (**WAR... [13:23:43] Right. milimetric and I have tested on the test servers thoroughly. The warning has gone and I have seen the analytics instrument send data. The logs look clear. I was able to login [13:24:22] +1, I also tested editing, some special pages, view history, etc. 
[13:25:32] !log phuedx@deploy2002 phuedx: Continuing with sync [13:27:21] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [13:27:49] Cool, thanks [13:28:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [13:28:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1228.eqiad.wmnet with OS bullseye [13:28:43] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11219425 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1228.eqiad.wmnet with OS bullseye completed: - an-worker1228 (**PAS... [13:29:52] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1225.eqiad.wmnet with reason: host reimage [13:30:12] (03PS1) 10DDesouza: Undeploy Design Research participant recruitment survey on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191688 (https://phabricator.wikimedia.org/T405577) [13:30:35] !log phuedx@deploy2002 Finished scap sync-world: Backport for [[gerrit:1190648|ext.xLab: Add mw.xLab.getInstrument() (T401380 T404851)]], [[gerrit:1191683|lib: Update metrics-platform to fc7678c10a1f (T401380)]] (duration: 23m 55s) [13:30:43] T401380: MetricsPlatform: Initialize MetricsClient with instrument configs fetched from xLab - https://phabricator.wikimedia.org/T401380 [13:30:43] T404851: MetricsPlatform should parse sample_rate value to number - https://phabricator.wikimedia.org/T404851 [13:31:50] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis tokwiki in section s5 [13:31:50] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki 
(exit_code=99) Checking sanitization for wikis tokwiki in section s5 [13:32:12] (03PS1) 10DDesouza: Remove reader foundational survey on enwiki (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191691 (https://phabricator.wikimedia.org/T405410) [13:32:47] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis tokwiki in section s5 [13:32:47] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Checking sanitization for wikis tokwiki in section s5 [13:32:52] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis tokwiki in section s5 [13:32:52] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Checking sanitization for wikis tokwiki in section s5 [13:32:57] (03PS2) 10DDesouza: Undeploy Design Research participant recruitment survey on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191688 (https://phabricator.wikimedia.org/T405577) [13:33:53] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1225.eqiad.wmnet with reason: host reimage [13:34:28] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis tokwiki in section s5 [13:35:12] !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.sanitize-wiki (exit_code=97) Checking sanitization for wikis tokwiki in section s5 [13:35:16] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitize-wiki Checking sanitization for wikis tokwiki in section s5 [13:35:24] (03PS1) 10Lucas Werkmeister (WMDE): wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191692 (https://phabricator.wikimedia.org/T405720) [13:36:03] (03CR) 10Lucas Werkmeister (WMDE): "I’ll try to deploy this on Monday – should be low-risk but also not urgent enough to justify a Friday deploy." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1191692 (https://phabricator.wikimedia.org/T405720) (owner: 10Lucas Werkmeister (WMDE)) [13:36:25] (03PS1) 10Muehlenhoff: imposm-initial-import: Set service passwords [puppet] - 10https://gerrit.wikimedia.org/r/1191693 (https://phabricator.wikimedia.org/T381565) [13:37:17] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [13:37:21] I've watched the logs for a while and they haven't changed (backend and frontend). There's been a slight uptick in event production rate in EventGate, which is to be expected. There are no new event validation errors [13:37:26] So all good [13:37:39] Thanks claime and slyngs [13:39:19] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitize-wiki (exit_code=99) Checking sanitization for wikis tokwiki in section s5 [13:39:25] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d5-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405384#11219478 (10Jhancock.wm) 05Open→03Resolved this one will require a server to be moved out or decommed. updated tracking sheet. 
[13:39:28] (03PS1) 10Aklapper: Punish exponentially when removing all subscribers or project tags [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1191694 [13:39:57] (03PS1) 10Muehlenhoff: Track airflow-wikidata-ops for offboarding [puppet] - 10https://gerrit.wikimedia.org/r/1191695 (https://phabricator.wikimedia.org/T405557) [13:40:23] jclark@cumin1002 reimage (PID 3855911) is awaiting input [13:40:41] (03CR) 10Jelto: [C:03+1] "lgtm, I'm wondering if we have to add the new gateway also to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/head" [puppet] - 10https://gerrit.wikimedia.org/r/1191486 (https://phabricator.wikimedia.org/T405651) (owner: 10Dduvall) [13:41:17] (03PS2) 10Muehlenhoff: Track airflow-wikidata-ops for offboarding [puppet] - 10https://gerrit.wikimedia.org/r/1191695 (https://phabricator.wikimedia.org/T405557) [13:41:18] (03CR) 10Ssingh: [C:03+1] "Looks good, let's merge Monday!" [puppet] - 10https://gerrit.wikimedia.org/r/1191010 (https://phabricator.wikimedia.org/T392880) (owner: 10Fabfur) [13:41:23] (03CR) 10Aklapper: [V:03+2 C:03+2] Punish exponentially when removing all subscribers or project tags [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1191694 (owner: 10Aklapper) [13:41:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:41:46] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:42:35] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - 
jclark@cumin1002" [13:42:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [13:42:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1226.eqiad.wmnet with OS bullseye [13:43:04] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11219486 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1226.eqiad.wmnet with OS bullseye completed: - an-worker1226 (**WAR... [13:44:10] RESOLVED: GanetiMemoryPressure: Ganeti: High memory usage (98.66%) on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Ganeti#Memory_pressure - https://grafana.wikimedia.org/d/gd6vep5Iz/ganeti-memory-pressure?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DGanetiMemoryPressure [13:48:46] (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge::k8s::haproxy: Fix tool address in CSP header [puppet] - 10https://gerrit.wikimedia.org/r/1191584 (https://phabricator.wikimedia.org/T283948) (owner: 10Majavah) [13:49:01] (03CR) 10Filippo Giunchedi: [C:03+1] haproxy::cloud: Add an admin-level socket [puppet] - 10https://gerrit.wikimedia.org/r/1191662 (owner: 10Majavah) [13:49:34] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1191695 (https://phabricator.wikimedia.org/T405557) (owner: 10Muehlenhoff) [13:53:04] 10SRE-SLO, 10observability, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11219585 (10Gehel) [13:53:49] 07sre-alert-triage, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Alert in need of triage: Dell PowerEdge or Supermicro Broadcom RAID Controller (instance an-worker1187) - 
https://phabricator.wikimedia.org/T405217#11219611 (10Gehel) [13:54:27] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Requesting Kerberos access for sd - https://phabricator.wikimedia.org/T405219#11219613 (10Gehel) [13:54:56] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11219633 (10Gehel) [13:55:48] 06SRE, 10envoy, 06serviceops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Upgrade Envoy to v1.29.12 on wcqs and wdqs hosts - https://phabricator.wikimedia.org/T404867#11219649 (10Gehel) [13:57:16] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [13:58:51] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [13:58:52] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1227.eqiad.wmnet with OS bullseye [13:59:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [13:59:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1225.eqiad.wmnet with OS bullseye [13:59:20] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1191698 [13:59:46] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11219761 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1227.eqiad.wmnet with OS bullseye completed: - an-worker1227 (**WAR... 
[13:59:50] (03PS7) 10Btullis: Customise the imported spark-operator chart for deployment to WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) [13:59:54] (03PS7) 10Btullis: Create a helmfile release for the updated spark-operator-crds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191141 (https://phabricator.wikimedia.org/T405490) [13:59:58] (03PS7) 10Btullis: Add a helmfile release for the updated spark-operator version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490) [14:00:04] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11219762 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1225.eqiad.wmnet with OS bullseye completed: - an-worker1225 (**WAR... [14:00:08] (03CR) 10CI reject: [V:04-1] Customise the imported spark-operator chart for deployment to WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [14:00:12] (03CR) 10CI reject: [V:04-1] Create a helmfile release for the updated spark-operator-crds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191141 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [14:00:16] (03CR) 10CI reject: [V:04-1] Add a helmfile release for the updated spark-operator version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [14:00:36] 06SRE, 06Data-Engineering, 06Data-Engineering-Icebox, 06Product Safety and Integrity, 06serviceops: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464#11219764 (10OKryva-WMF) [14:01:22] (03PS2) 10CDanis: haproxy: use Lua 5.3 for CI tests, for 
utf8 [puppet] - 10https://gerrit.wikimedia.org/r/1191698 (https://phabricator.wikimedia.org/T401383) [14:09:34] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11219896 (10MoritzMuehlenhoff) [14:09:57] 06SRE, 06Infrastructure-Foundations, 10Mail, 06Product Safety and Integrity, and 3 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733#11219899 (10OKryva-WMF) [14:10:23] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17): Q2:rack/setup/install dse-k8s-worker200[45] - https://phabricator.wikimedia.org/T405406#11219910 (10Gehel) [14:12:30] (03PS3) 10CDanis: haproxy: use Lua 5.3 for CI tests, for utf8 [puppet] - 10https://gerrit.wikimedia.org/r/1191698 (https://phabricator.wikimedia.org/T401383) [14:12:30] (03PS1) 10CDanis: taskgen: add haproxy Lua tests [puppet] - 10https://gerrit.wikimedia.org/r/1191703 [14:12:50] (03CR) 10Muehlenhoff: [C:03+2] Track airflow-wikidata-ops for offboarding [puppet] - 10https://gerrit.wikimedia.org/r/1191695 (https://phabricator.wikimedia.org/T405557) (owner: 10Muehlenhoff) [14:14:11] (03CR) 10Ssingh: [C:03+1] taskgen: add haproxy Lua tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191703 (owner: 10CDanis) [14:14:50] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [14:15:01] (03CR) 10CI reject: [V:04-1] haproxy: use Lua 5.3 for CI tests, for utf8 [puppet] - 10https://gerrit.wikimedia.org/r/1191698 (https://phabricator.wikimedia.org/T401383) (owner: 10CDanis) [14:15:28] (03CR) 10CI reject: [V:04-1] taskgen: add haproxy Lua tests [puppet] - 10https://gerrit.wikimedia.org/r/1191703 (owner: 10CDanis) [14:18:54] (03PS4) 10CDanis: haproxy: use Lua 5.3 for Docker tests, 
for utf8 [puppet] - 10https://gerrit.wikimedia.org/r/1191698 (https://phabricator.wikimedia.org/T401383) [14:21:05] (03CR) 10Btullis: Import the upstream spark-operator chart version 2.2.1 (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191139 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [14:21:15] (03PS8) 10Btullis: Customise the imported spark-operator chart for deployment to WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) [14:21:18] (03CR) 10CI reject: [V:04-1] haproxy: use Lua 5.3 for Docker tests, for utf8 [puppet] - 10https://gerrit.wikimedia.org/r/1191698 (https://phabricator.wikimedia.org/T401383) (owner: 10CDanis) [14:21:33] (03PS9) 10Btullis: Customise the imported spark-operator chart for deployment to WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) [14:21:33] (03PS8) 10Btullis: Create a helmfile release for the updated spark-operator-crds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191141 (https://phabricator.wikimedia.org/T405490) [14:21:33] (03PS8) 10Btullis: Add a helmfile release for the updated spark-operator version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490) [14:21:46] (03CR) 10CI reject: [V:04-1] Customise the imported spark-operator chart for deployment to WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [14:21:53] (03CR) 10CI reject: [V:04-1] Create a helmfile release for the updated spark-operator-crds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191141 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [14:21:59] (03CR) 10CI reject: [V:04-1] Add a helmfile release for the updated spark-operator version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191142 
(https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [14:22:50] (03PS10) 10Btullis: Customise the imported spark-operator chart for deployment to WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) [14:23:04] (03PS9) 10Btullis: Create a helmfile release for the updated spark-operator-crds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191141 (https://phabricator.wikimedia.org/T405490) [14:23:12] (03PS9) 10Btullis: Add a helmfile release for the updated spark-operator version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191142 (https://phabricator.wikimedia.org/T405490) [14:24:20] (03CR) 10Btullis: Customise the imported spark-operator chart for deployment to WMF (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191140 (https://phabricator.wikimedia.org/T405490) (owner: 10Btullis) [14:24:51] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:28:37] (03CR) 10TChin: [C:03+2] [eventgate_*] Bump eventgate to v1.25.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191506 (https://phabricator.wikimedia.org/T403169) (owner: 10TChin) [14:30:34] (03Merged) 10jenkins-bot: [eventgate_*] Bump eventgate to v1.25.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191506 (https://phabricator.wikimedia.org/T403169) (owner: 10TChin) [14:33:19] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 07Puppet (Puppet 7.0): Review/cleanup content of /srv/git/private/modules/secret/secrets/ssl in the private repo - https://phabricator.wikimedia.org/T364622#11220060 (10MoritzMuehlenhoff) > I was able to remove it with `sudo puppet ssl clean kib... 
[14:34:23] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply
[14:35:10] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply
[14:46:55] (PS1) CDanis: WIP [puppet] - https://gerrit.wikimedia.org/r/1191708
[14:47:12] phuedx: thanks for the ping and the deploy, looks like you got what you need.
[14:48:32] (CR) Ahmon Dancy: Update eqiad to k8s 1.31 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1191656 (https://phabricator.wikimedia.org/T405703) (owner: Jelto)
[14:52:16] !log Ran `scap clean-images` on deploy1003. Trimmed /srv down to 48% usage. (T401647)
[14:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:23] T401647: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647
[14:58:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:59:59] ops-codfw, DC-Ops: Alert for device ps1-b3-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T405755 (phaultfinder) NEW
[15:02:19] (PS2) CDanis: WIP [puppet] - https://gerrit.wikimedia.org/r/1191708
[15:04:37] (CR) Dzahn: [C:+2] admin: upgrade elishacohenwmde to privatedata-users, no shell access [puppet] - https://gerrit.wikimedia.org/r/1191507 (https://phabricator.wikimedia.org/T404359) (owner: Dzahn)
[15:07:00] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[15:07:27] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11220228 (Dzahn) Hello @ECohen_WMDE re: > I need access to my team's analytics data/dashboards (Wikibase Reuse Team) You have been added to the requested grou...
[15:07:43] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11220230 (Dzahn) a:Dzahn→ECohen_WMDE
[15:08:23] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11220233 (Dzahn) p:High→Medium
[15:08:46] SRE, Infrastructure-Foundations, Mail, Trust-and-Safety, and 3 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733#11220234 (OKryva-WMF)
[15:09:39] (PS3) CDanis: WIP [puppet] - https://gerrit.wikimedia.org/r/1191708
[15:09:51] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[15:13:13] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[15:15:18] SRE, serviceops: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11220272 (Dzahn) aha! thanks dancy. seems like maybe the `scap clean-images` command could be added to a systemd timer to close this out permanently?
[15:16:33] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[15:17:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[15:23:02] (CR) Ssingh: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1191698 (https://phabricator.wikimedia.org/T401383) (owner: CDanis)
[15:24:39] SRE, Data-Engineering, Data-Engineering-Icebox, serviceops, Trust-and-Safety: Disable GeoIP Legacy Download / Identify all users of legacy (v1) GeoIP datasets and inform them of the need to switch to GeoIP2 dataset - https://phabricator.wikimedia.org/T303464#11220338 (OKryva-WMF)
[15:26:15] (CR) Ssingh: "11:23:31 haproxylua: OK (0.61=setup[0.47]+cmd[0.14] seconds)" [puppet] - https://gerrit.wikimedia.org/r/1191698 (https://phabricator.wikimedia.org/T401383) (owner: CDanis)
[15:26:27] (CR) Ssingh: [C:+1] "recheck" [puppet] - https://gerrit.wikimedia.org/r/1191703 (owner: CDanis)
[15:31:26] (CR) Ssingh: [C:+1] "This seems to be unrelated to our change, I think." [puppet] - https://gerrit.wikimedia.org/r/1191703 (owner: CDanis)
[15:35:38] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply
[15:35:48] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply
[15:38:42] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[15:39:20] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[15:41:03] (CR) CDanis: "I asked in IRC #-cloud-admin earlier but didn't get a response.
Weirdly, those same tests passed in CI in the patch that made the last ed" [puppet] - https://gerrit.wikimedia.org/r/1191703 (owner: CDanis)
[15:41:55] (CR) Ssingh: [C:+1] "Yeah! And I don't see any Lua invocation for these anyway so I wonder what changed." [puppet] - https://gerrit.wikimedia.org/r/1191703 (owner: CDanis)
[15:42:23] (CR) CDanis: "Ah -- editing taskgen.rb triggers *all* tasks to be run :)" [puppet] - https://gerrit.wikimedia.org/r/1191703 (owner: CDanis)
[15:43:02] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[15:44:36] (CR) Ssingh: [C:+1] "But that should have been in the previous case as well? And that time only the haproxylua bits failed." [puppet] - https://gerrit.wikimedia.org/r/1191703 (owner: CDanis)
[15:44:51] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[15:47:29] (CR) Scott French: [C:+2] dnsdisc: set a timeout on udp_with_fallback [software/spicerack] - https://gerrit.wikimedia.org/r/1190770 (https://phabricator.wikimedia.org/T405397) (owner: Scott French)
[15:49:16] SRE, serviceops: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11220422 (Clement_Goubert) We have a systemd timer that does `/usr/bin/docker image prune --all --force --filter until=72h` every day, maybe that could be added there (`modules/profile/manifests/docker/prune...
[15:50:27] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[15:51:05] SRE, Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11220430 (MoritzMuehlenhoff)
[15:51:11] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[15:55:45] (Merged) jenkins-bot: dnsdisc: set a timeout on udp_with_fallback [software/spicerack] - https://gerrit.wikimedia.org/r/1190770 (https://phabricator.wikimedia.org/T405397) (owner: Scott French)
[15:56:38] SRE-tools, Infrastructure-Foundations, serviceops, Spicerack, Patch-For-Review: Spicerack's `Discovery.resolve_with_client_ip` should set a timeout on `udp_with_fallback` - https://phabricator.wikimedia.org/T405397#11220443 (Scott_French) I've gone ahead and merged https://gerrit.wikimedia.or...
[15:59:50] FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:00:11] (PS1) Majavah: tox: wmcs: Pin older mypy / types-pymysql [puppet] - https://gerrit.wikimedia.org/r/1191719
[16:00:58] (CR) CDanis: [C:+1] tox: wmcs: Pin older mypy / types-pymysql [puppet] - https://gerrit.wikimedia.org/r/1191719 (owner: Majavah)
[16:01:09] SRE, collaboration-services, Patch-For-Review: setup gerrit2003 with gerrit service (gerrit on bookworm) - https://phabricator.wikimedia.org/T372804#11220463 (Dzahn) Open→In progress p:Medium→High I would argue that this is already resolved because clearly a gerrit exists on gerrit200...
[16:05:03] (CR) Majavah: [C:+2] tox: wmcs: Pin older mypy / types-pymysql [puppet] - https://gerrit.wikimedia.org/r/1191719 (owner: Majavah)
[16:06:15] (PS1) Majavah: P:wmcs: maintain-dbusers: Fix remaining labels [puppet] - https://gerrit.wikimedia.org/r/1191720
[16:06:38] (CR) Dzahn: "Yea, I would say the answer to Jelto's question is yes." [puppet] - https://gerrit.wikimedia.org/r/1191486 (https://phabricator.wikimedia.org/T405651) (owner: Dduvall)
[16:07:50] (PS2) RLazarus: mw-*: Upgrade to Envoy 1.29.12 [deployment-charts] - https://gerrit.wikimedia.org/r/1191522 (https://phabricator.wikimedia.org/T403663)
[16:07:50] (PS2) RLazarus: mw-videoscaler: Upgrade to Envoy 1.29.12 [deployment-charts] - https://gerrit.wikimedia.org/r/1191523 (https://phabricator.wikimedia.org/T403663)
[16:07:50] (PS1) RLazarus: mesh: Copy configuration_1.14.1 to 1.14.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1191721 (https://phabricator.wikimedia.org/T404036)
[16:07:52] (PS1) RLazarus: mesh.configuration: Envoy config updates for 1.29 [deployment-charts] - https://gerrit.wikimedia.org/r/1191722 (https://phabricator.wikimedia.org/T404036)
[16:08:59] (CR) Majavah: [C:+2] P:wmcs: maintain-dbusers: Fix remaining labels [puppet] - https://gerrit.wikimedia.org/r/1191720 (owner: Majavah)
[16:09:27] (CR) BCornwall: [V:+2 C:+1] "0 tests failed, 0 tests skipped, 40 tests passed" [puppet] - https://gerrit.wikimedia.org/r/1191543 (https://phabricator.wikimedia.org/T403866) (owner: Krinkle)
[16:10:40] (CR) CI reject: [V:-1] mesh: Copy configuration_1.14.1 to 1.14.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1191721 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[16:10:54] (PS5) Daniel Kinzler: apigw chart: for rest call ratelimit only when x-wmf-user-class header is present [deployment-charts] - https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574)
(owner: Pmiazga)
[16:11:51] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[16:12:01] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
[16:12:56] (PS2) Lucas Werkmeister (WMDE): wikidata-query-gui: Bump query-gui image version [deployment-charts] - https://gerrit.wikimedia.org/r/1191692 (https://phabricator.wikimedia.org/T405720)
[16:14:15] (PS3) Lucas Werkmeister (WMDE): wikidata-query-gui: Bump query-gui image version [deployment-charts] - https://gerrit.wikimedia.org/r/1191692 (https://phabricator.wikimedia.org/T405720)
[16:16:52] (PS2) RLazarus: mesh: Copy configuration_1.14.1 to 1.14.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1191721 (https://phabricator.wikimedia.org/T404036)
[16:16:52] (PS2) RLazarus: mesh.configuration: Envoy config updates for 1.29 [deployment-charts] - https://gerrit.wikimedia.org/r/1191722 (https://phabricator.wikimedia.org/T404036)
[16:16:52] (PS1) RLazarus: ### try in kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1191723
[16:17:11] (Abandoned) RLazarus: ### try in kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1191723 (owner: RLazarus)
[16:21:30] SRE, collaboration-services, Infrastructure-Foundations, Patch-For-Review: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725#11220520 (Dzahn) Assuming the subtask T400994 being resolved means the checkbox "everything on WMCS/toolserver" here can be checked...
[16:22:55] SRE, SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11220533 (WMDE-leszek) servus @Dzahn , thank you for the above > Is this only for private data access in superset and/or other web UIs / dashboards? this would be sufficient for now I...
[16:25:09] (CR) BCornwall: [V:+2 C:+2] varnish: Enable Vary:User-Agent on all wikis [puppet] - https://gerrit.wikimedia.org/r/1191543 (https://phabricator.wikimedia.org/T403866) (owner: Krinkle)
[16:26:01] SRE, collaboration-services, Infrastructure-Foundations, Patch-For-Review: Extend NEL headers to sites not fronted by CDN - https://phabricator.wikimedia.org/T303725#11220563 (Dzahn) That would leave only GitLab which isn't as straightforward as the other services because the webserver is ngin...
[16:28:09] (CR) Scott French: [C:+1] mw-*: Upgrade to Envoy 1.29.12 [deployment-charts] - https://gerrit.wikimedia.org/r/1191522 (https://phabricator.wikimedia.org/T403663) (owner: RLazarus)
[16:28:22] SRE, SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11220565 (Dzahn) @WMDE-leszek Great! Thanks for confirming. In that case I would optimistically say this ticket is resolved. And we didn't actually need the SSH key. I am going to cla...
[16:29:11] SRE, SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11220566 (Dzahn)
[16:29:17] (CR) Scott French: [C:+1] mw-videoscaler: Upgrade to Envoy 1.29.12 [deployment-charts] - https://gerrit.wikimedia.org/r/1191523 (https://phabricator.wikimedia.org/T403663) (owner: RLazarus)
[16:29:23] SRE, SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11220567 (Dzahn) In progress→Resolved
[16:29:39] (CR) Scott French: [C:+1] kubernetes: Set default Envoy version to 1.29.12 [puppet] - https://gerrit.wikimedia.org/r/1191526 (https://phabricator.wikimedia.org/T403663) (owner: RLazarus)
[16:30:05] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[16:30:51] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa - https://phabricator.wikimedia.org/T405129#11220569 (Dzahn)
[16:33:05] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for gengh - https://phabricator.wikimedia.org/T405713#11220582 (Dzahn) Hi @gengh the ticket looks good and I think we know what type of access you need. Could you please take a look and sign L3 while we start working on the rest...
[16:33:23] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[16:33:48] (CR) Ssingh: [C:+1] "recheck" [puppet] - https://gerrit.wikimedia.org/r/1191703 (owner: CDanis)
[16:34:15] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for gengh - https://phabricator.wikimedia.org/T405713#11220590 (Dzahn) I take that back, I saw now that you did this back in 2023. Disregard :)
[16:34:28] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for gengh - https://phabricator.wikimedia.org/T405713#11220593 (Dzahn)
[16:34:56] (CR) Krinkle: "`" [puppet] - https://gerrit.wikimedia.org/r/1191543 (https://phabricator.wikimedia.org/T403866) (owner: Krinkle)
[16:36:38] (CR) Ssingh: [C:+1] "Thanks to taavi for the fixes in I6a6a6964ee61c66acad6c6f84957850400f3c40a and I6a6a6964e09a3b6c594640f87fdb56b052b7b002" [puppet] - https://gerrit.wikimedia.org/r/1191703 (owner: CDanis)
[16:38:29] !incidents
[16:38:29] (PS9) Krinkle: varnish: Enable unified mobile routing on misc wikimedia.org wikis [puppet] - https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510)
[16:38:29] No incidents occurred in the past 24 hours for team SRE
[16:38:37] what
[16:38:38] this is the old alert for deployment
[16:38:57] what does this mean? ulsfo -> deployment server but only ulsfo?
[16:39:10] did this happen yesterday during the switch, and then never resolved?
[16:39:12] oh, "old" sounds good in this context
[16:39:38] swfrench-wmf: I *think* so but I only saw this in passing
[16:39:46] alert received: 2025-09-25
[16:39:55] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[16:40:00] I ACKed it via SMS
[16:40:02] mutante: thanks
[16:40:05] I check graphs of ulsfo just to double check
[16:40:09] (PS4) CDanis: WIP [puppet] - https://gerrit.wikimedia.org/r/1191708
[16:40:23] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[16:40:31] ack is not enough I think if it's not resolved?
[16:40:35] (after 24 hours)
[16:41:00] yeah it will fire again after 24 hours if just ACKed, and if it's not resolved (either manually or by itself)
[16:41:12] not sure, but either way it should be the first step so that others not at laptop know it's known
[16:41:18] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[16:41:30] I am hesitant to mark it as resolved since at least I don't have the full context
[16:41:44] let me take a look at what did happen and whether it's still happening
[16:41:45] I don't know enough yet to determine if it's resolved. leaving that to others for now.
[16:41:47] this seems rather odd
[16:41:50] same, I go back to dinner. Let me know if I can help on anything
[16:41:58] the site is up and ulsfo seems up too
[16:42:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[16:43:02] okay, so deployment.eqiad.wmnet is the ATS backend for spiderpig.wikimedia.org
[16:43:14] yeah, wasn't clear what the problem was from yesterday... Saw spiderpig mentioned.
[16:43:20] what it seems to say is "when traffic servers in ulsfo want to talk to deployment server" but not any other DC.. right?
[16:43:23] hieradata/common/profile/trafficserver/backend.yaml
[16:43:25] 314: replacement: https://deployment.eqiad.wmnet
[16:43:34] right, exactly
[16:44:09] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[16:44:26] so, two things: 1. we should probably silence this during the next deployment switchover, and 2. this alert isn't firing, so it's puzzling that it never resolved
[16:44:52] i.e., #1 is a (new) process issue and #2 seems to be some sort of monitoring weirdness
[16:44:53] it seems now that during switchover this was downtimed for 24 hours and that downtime expired now
[16:44:55] should there be a deployment.discovery.wmnet ?
[16:45:19] cdanis: no, by convention we always just use the eqiad CNAME
[16:45:21] this question definitely came up before
[16:45:48] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply
[16:45:48] it's arguably less confusing than putting discovery.wmnet on there, since that has a specific semantic meaning
[16:45:54] hah
[16:46:05] I see
[16:46:10] maybe it was downtimed as part of the procedure but 24 hours is just not long enough for it to forget about elevated errors
[16:46:12] does ATS re-resolve its backends
[16:46:36] presumably? since discovery works?
[16:46:39] (CR) Ahmon Dancy: "I considered leaving in the line that tries to read from the old filenames but I was worried about creating a confusing situation where th" [deployment-charts] - https://gerrit.wikimedia.org/r/1189905 (https://phabricator.wikimedia.org/T405110) (owner: Ahmon Dancy)
[16:46:51] 300 seconds ttl on discovery.eqiad.wmnet CNAME
[16:46:56] which is long enough to make this alert fire
[16:47:01] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply
[16:48:08] swfrench-wmf: https://grafana.wikimedia.org/goto/77EkSV3Hg?orgId=1
[16:48:34] from ATS's point of view, it served errors for about 20 minutes
[16:48:54] and, we have seen "resolved" events get dropped going to VO before, right?
[16:50:22] cdanis: yup, that's expected, since the new switch procedure is effectively a downtime (not in the icinga sense) for spiderpig. meaning, we should create a silence for this :)
[16:50:35] ah right
[16:50:43] https://logstash.wikimedia.org/goto/b04f5dcb224a58fb93e1446a73a257d4 - alertmanager never created a resolve event?
[16:51:08] per the logs, it _only_ fired
[16:51:40] is it odd that this is limited to just ulsfo, like a "random" POP? or is that just coincidence or monitoring itself reporting it this way
[16:51:58] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-main: apply
[16:52:08] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply
[16:52:08] mutante: it probably means someone located near ulsfo tried to use spiderpig during the switch :)
[16:52:24] ah, that makes sense. yep
[16:53:53] alright, unless anyone else has, I might go resolve this in VO, since everything appears to be WAI with the exception of alertmanager
[16:54:09] (and the fact that we need to update our procedure to silence this)
[16:54:30] +1
[16:54:58] +1
[16:55:03] {{done}}
[16:55:23] still puzzled by alertmanager ...
[16:56:07] or I guess it could also be prom if it never generated the event
[16:58:36] I didn't see anything untoward in the alertmanager logs (just a webhook that has been failing for a long time, failing for a long time)
[16:58:42] I'm stopping looking now though
[16:59:04] thanks for looking! and thanks, all, who responded :)
[17:00:33] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply
[17:00:47] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply
[17:00:57] (PS5) CDanis: WIP [puppet] - https://gerrit.wikimedia.org/r/1191708
[17:01:17] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11220679 (jcrespo) Resolved→Open This was not installed as per instructions. Re opening and will provide details next week.
[17:04:55] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on dbprov1007.eqiad.wmnet with reason: needs reinstall
[17:05:11] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install dbprov1007 - https://phabricator.wikimedia.org/T400412#11220703 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=41bb8747-414e-4f04-90cb-2a3f9ada4358) set by jynus@cumin1003 for 3 days, 0:00:00 on 1 host(s) and their...
[17:06:28] (CR) Dzahn: [C:+2] "reminder to self: we need to Hiera'ize this value so that we can set it differently for the test instance..
or we can't re-enable puppet t" [puppet] - https://gerrit.wikimedia.org/r/1180643 (https://phabricator.wikimedia.org/T401157) (owner: Brennen Bearnes)
[17:07:24] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: apply
[17:08:03] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply
[17:13:20] (PS1) Jcrespo: dbbackups: Partial revert of dbprov1007 setup, back to dbprov1003 [puppet] - https://gerrit.wikimedia.org/r/1191731 (https://phabricator.wikimedia.org/T403166)
[17:13:59] (PS6) CDanis: WIP [puppet] - https://gerrit.wikimedia.org/r/1191708
[17:16:04] (PS2) Jcrespo: dbbackups: Partial revert of dbprov1007 setup, back to dbprov1003 [puppet] - https://gerrit.wikimedia.org/r/1191731 (https://phabricator.wikimedia.org/T403166)
[17:18:59] (CR) Jcrespo: [C:+2] dbbackups: Partial revert of dbprov1007 setup, back to dbprov1003 [puppet] - https://gerrit.wikimedia.org/r/1191731 (https://phabricator.wikimedia.org/T403166) (owner: Jcrespo)
[17:21:37] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for gengh - https://phabricator.wikimedia.org/T405713#11220762 (Dzahn)
[17:22:45] (PS1) Jcrespo: Revert "dbbackups: Partial revert of dbprov1007 setup, back to dbprov1003" [puppet] - https://gerrit.wikimedia.org/r/1191734
[17:23:01] (CR) Jcrespo: [C:-2] "Not until disks are fixed." [puppet] - https://gerrit.wikimedia.org/r/1191734 (owner: Jcrespo)
[17:23:48] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for gengh - https://phabricator.wikimedia.org/T405713#11220765 (Dzahn) @gengh We just need one thing. An approval from your manager. @DSantamaria Do you approve of this request? Could you please leave a quick comment here? tha...
[17:24:03] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for gengh - https://phabricator.wikimedia.org/T405713#11220766 (Dzahn) a:DSantamaria
[17:24:35] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for gengh - https://phabricator.wikimedia.org/T405713#11220771 (Dzahn) Open→In progress p:Triage→Medium
[17:41:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:45:12] (PS11) Krinkle: Disable wmgUseMdotRouting on misc wikimedia.org wikis (group1) [mediawiki-config] - https://gerrit.wikimedia.org/r/1191504 (https://phabricator.wikimedia.org/T403510)
[17:47:34] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1224.eqiad.wmnet with OS bullseye
[17:47:43] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1223.eqiad.wmnet with OS bullseye
[17:47:48] ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11220865 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1224.eqiad.wmnet with OS bullseye
[17:47:51] ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11220866 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1223.eqiad.wmnet with OS bullseye
[17:48:03] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1222.eqiad.wmnet with OS bullseye
[17:48:05] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1221.eqiad.wmnet with OS bullseye
[17:48:14] ops-eqiad, SRE, DC-Ops:
Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11220867 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1222.eqiad.wmnet with OS bullseye
[17:48:16] ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11220868 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1221.eqiad.wmnet with OS bullseye
[17:48:18] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1220.eqiad.wmnet with OS bullseye
[17:48:26] ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11220869 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1220.eqiad.wmnet with OS bullseye
[17:55:47] (PS12) Krinkle: Disable wmgUseMdotRouting on wikimedia.org wikis (group1) [mediawiki-config] - https://gerrit.wikimedia.org/r/1191504 (https://phabricator.wikimedia.org/T403510)
[17:57:03] (PS10) Krinkle: varnish: Enable unified mobile routing on wikimedia.org wikis [puppet] - https://gerrit.wikimedia.org/r/1191497 (https://phabricator.wikimedia.org/T403510)
[17:57:31] (CR) Krinkle: "Removed exception for loginwiki, cleared by Gergo at https://phabricator.wikimedia.org/T152882#11218562" [mediawiki-config] - https://gerrit.wikimedia.org/r/1191504 (https://phabricator.wikimedia.org/T403510) (owner: Krinkle)
[17:59:49] (CR) Dzahn: "still waiting for the approval. if it doesn't come in today but we get it next week then next clinic duty person can feel free to merge" [puppet] - https://gerrit.wikimedia.org/r/1191462 (https://phabricator.wikimedia.org/T405129) (owner: Dzahn)
[18:01:30] (CR) Scott French: "Thanks as always for the handy links to the docs!
Two questions, but otherwise looks good." [deployment-charts] - https://gerrit.wikimedia.org/r/1191722 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[18:02:18] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1224.eqiad.wmnet with reason: host reimage
[18:02:41] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1223.eqiad.wmnet with reason: host reimage
[18:03:11] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1222.eqiad.wmnet with reason: host reimage
[18:03:17] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1221.eqiad.wmnet with reason: host reimage
[18:03:32] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1220.eqiad.wmnet with reason: host reimage
[18:06:49] (PS7) CDanis: WIP [puppet] - https://gerrit.wikimedia.org/r/1191708
[18:09:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1224.eqiad.wmnet with reason: host reimage
[18:12:02] (PS1) Dzahn: admin: upgrade ebomani from ldap_only to deployers [puppet] - https://gerrit.wikimedia.org/r/1191742 (https://phabricator.wikimedia.org/T405124)
[18:13:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1222.eqiad.wmnet with reason: host reimage
[18:14:50] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[18:15:57] (PS1) Krinkle: gerrit: fix "key :host is duplicated" warning [puppet] - https://gerrit.wikimedia.org/r/1191743
[18:16:08] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users for tais-lessa -
https://phabricator.wikimedia.org/T405129#11220918 (Dzahn) Thank you as well! All looks good. Only pending on the approval. Please note the people handling these request...
[18:16:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1221.eqiad.wmnet with reason: host reimage
[18:20:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1223.eqiad.wmnet with reason: host reimage
[18:23:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1220.eqiad.wmnet with reason: host reimage
[18:24:51] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[18:25:56] (CR) RLazarus: mesh.configuration: Envoy config updates for 1.29 (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1191722 (https://phabricator.wikimedia.org/T404036) (owner: RLazarus)
[18:27:24] (PS8) CDanis: WMF-Uniq -> analytics: better stats & privacy [puppet] - https://gerrit.wikimedia.org/r/1191708 (https://phabricator.wikimedia.org/T405783)
[18:33:10] (CR) Scott French: [C:+1] "Wow, I don't know how I missed that last time around. Good catch!"
[puppet] - 10https://gerrit.wikimedia.org/r/1190274 (https://phabricator.wikimedia.org/T402014) (owner: 10JMeybohm) [18:33:18] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:33:29] (03CR) 10Dzahn: [C:03+1] gerrit: fix "key :host is duplicated" warning [puppet] - 10https://gerrit.wikimedia.org/r/1191743 (owner: 10Krinkle) [18:33:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:33:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1224.eqiad.wmnet with OS bullseye [18:33:52] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221003 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1224.eqiad.wmnet with OS bullseye completed: - an-worker1224 (**WAR... [18:33:56] (03CR) 10Dzahn: [C:03+2] gerrit: fix "key :host is duplicated" warning [puppet] - 10https://gerrit.wikimedia.org/r/1191743 (owner: 10Krinkle) [18:34:07] (03CR) 10Dzahn: [C:03+2] ""spec-only"" [puppet] - 10https://gerrit.wikimedia.org/r/1191743 (owner: 10Krinkle) [18:35:43] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1222.eqiad.wmnet with OS bullseye [18:35:57] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221016 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1222.eqiad.wmnet with OS bullseye executed with errors: - an-worker... 
[18:39:37] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:41:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:41:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1221.eqiad.wmnet with OS bullseye [18:41:31] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221034 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1221.eqiad.wmnet with OS bullseye completed: - an-worker1221 (**WAR... [18:43:24] (03PS1) 10Dzahn: phabricator: hiera'ize the apc_shm_size variable [puppet] - 10https://gerrit.wikimedia.org/r/1191747 [18:44:42] (03CR) 10Scott French: [C:03+1] "Thanks for the follow-up, and indeed that makes sense. 
Having both files specified is not a state we'd want to leave things in - i.e., cle" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1189905 (https://phabricator.wikimedia.org/T405110) (owner: 10Ahmon Dancy) [18:45:06] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:45:25] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:45:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1223.eqiad.wmnet with OS bullseye [18:45:34] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221044 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1223.eqiad.wmnet with OS bullseye completed: - an-worker1223 (**WAR... 
[18:45:42] (CR) CI reject: [V:-1] phabricator: hiera'ize the apc_shm_size variable [puppet] - https://gerrit.wikimedia.org/r/1191747 (owner: Dzahn)
[18:46:40] (PS2) Dzahn: phabricator: hiera'ize the apc_shm_size variable [puppet] - https://gerrit.wikimedia.org/r/1191747 (https://phabricator.wikimedia.org/T401157)
[18:46:42] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[18:49:47] jclark@cumin1002 reimage (PID 29873) is awaiting input
[18:51:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[18:51:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1220.eqiad.wmnet with OS bullseye
[18:51:54] ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221060 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1220.eqiad.wmnet with OS bullseye completed: - an-worker1220 (**WAR...
[18:52:49] (03PS3) 10Dzahn: phabricator: hiera'ize the apc_shm_size variable [puppet] - 10https://gerrit.wikimedia.org/r/1191747 (https://phabricator.wikimedia.org/T401157) [18:53:18] (03PS1) 10Snwachukwu: Replace old sqoop wiki list file with new autoupdated file [puppet] - 10https://gerrit.wikimedia.org/r/1191750 [18:55:33] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1219.eqiad.wmnet with OS bullseye [18:55:46] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221064 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1219.eqiad.wmnet with OS bullseye [18:56:13] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1218.eqiad.wmnet with OS bullseye [18:56:27] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221065 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1218.eqiad.wmnet with OS bullseye [18:58:09] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1217.eqiad.wmnet with OS bullseye [18:58:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:58:21] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221070 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1217.eqiad.wmnet with OS bullseye [18:59:59] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1191747/7079/" [puppet] - 10https://gerrit.wikimedia.org/r/1191747 
(https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [19:00:17] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1191747/7079/phabricator-bullseye.devtools.eqiad1.wikimedia.cloud/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1191747 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [19:01:18] jclark@cumin1002 reimage (PID 105104) is awaiting input [19:02:19] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1215.eqiad.wmnet with OS bullseye [19:02:33] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221081 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1215.eqiad.wmnet with OS bullseye [19:02:55] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1216.eqiad.wmnet with OS bullseye [19:03:07] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221082 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1216.eqiad.wmnet with OS bullseye [19:07:24] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1214.eqiad.wmnet with OS bullseye [19:07:35] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221089 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1214.eqiad.wmnet with OS bullseye [19:07:39] (03CR) 10Dzahn: [V:03+1] phabricator: hiera'ize the apc_shm_size variable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191747 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [19:07:56] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1213.eqiad.wmnet with OS bullseye [19:08:08] 10ops-eqiad, 06SRE, 
06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221090 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1213.eqiad.wmnet with OS bullseye [19:08:18] (03CR) 10Dzahn: [V:03+1] phabricator: hiera'ize the apc_shm_size variable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1191747 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [19:10:52] 06SRE, 06Commons, 10TimedMediaHandler: Videos on Commons take long to load - https://phabricator.wikimedia.org/T405760#11221098 (10A_smart_kitten) (tagging with #sre & #timedmediahandler as an initial triage, feel free to retag as appropriate!) [19:10:55] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1219.eqiad.wmnet with reason: host reimage [19:11:23] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1218.eqiad.wmnet with reason: host reimage [19:13:10] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1217.eqiad.wmnet with reason: host reimage [19:14:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1219.eqiad.wmnet with reason: host reimage [19:17:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:17:39] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1215.eqiad.wmnet with reason: host reimage [19:17:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
an-worker1217.eqiad.wmnet with reason: host reimage [19:17:56] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1216.eqiad.wmnet with reason: host reimage [19:20:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:21:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1215.eqiad.wmnet with reason: host reimage [19:22:09] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1214.eqiad.wmnet with OS bullseye [19:22:14] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1213.eqiad.wmnet with OS bullseye [19:22:17] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221124 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1214.eqiad.wmnet with OS bullseye executed with errors: - an-worker... [19:22:21] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221125 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1213.eqiad.wmnet with OS bullseye executed with errors: - an-worker... 
[19:24:08] !incidents [19:24:09] No incidents occurred in the past 24 hours for team SRE [19:25:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1218.eqiad.wmnet with reason: host reimage [19:25:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:27:31] (03PS9) 10CDanis: WMF-Uniq -> analytics: better stats & privacy [puppet] - 10https://gerrit.wikimedia.org/r/1191708 (https://phabricator.wikimedia.org/T405783) [19:29:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1216.eqiad.wmnet with reason: host reimage [19:33:26] (03CR) 10Scott French: [C:03+1] "Looks good, pending Chris' confirmation about `service_name`. Thanks, Reuven!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191722 (https://phabricator.wikimedia.org/T404036) (owner: 10RLazarus) [19:33:39] (03PS2) 10Scott French: deployment_server: support environment in release values file name [puppet] - 10https://gerrit.wikimedia.org/r/1191751 (https://phabricator.wikimedia.org/T405110) [19:34:16] (03CR) 10BBlack: [C:03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/1191708 (https://phabricator.wikimedia.org/T405783) (owner: 10CDanis) [19:37:47] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [19:38:09] (03CR) 10Dzahn: "I am not sure what to do with this now. I consider it stalled. comments welcome." 
[puppet] - 10https://gerrit.wikimedia.org/r/1167306 (https://phabricator.wikimedia.org/T119846) (owner: 10Dzahn) [19:40:23] (03CR) 10Dzahn: [V:03+1] "I got distracted from this for a bit and looking at things in my gerrit queue again. I see now the last comment was a question to Brett. a" [puppet] - 10https://gerrit.wikimedia.org/r/1171263 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [19:40:53] jclark@cumin1002 reimage (PID 103268) is awaiting input [19:41:42] (03CR) 10Dzahn: [V:03+1] "also renaming slightly because the "apply to phab" part isn't true anymore. this is supposed to do nothing more than add a new option to u" [puppet] - 10https://gerrit.wikimedia.org/r/1171263 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [19:42:52] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [19:43:05] (03CR) 10Dzahn: [V:03+1] "disregard the last comment. it DOES apply it to phab." [puppet] - 10https://gerrit.wikimedia.org/r/1171263 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [19:43:12] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [19:43:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1217.eqiad.wmnet with OS bullseye [19:43:25] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221167 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1217.eqiad.wmnet with OS bullseye completed: - an-worker1217 (**WAR... 
[19:44:51] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[19:44:54] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[19:45:01] (CR) Dzahn: "Tyler: no urgency but let's talk about this between our teams when we have a moment" [puppet] - https://gerrit.wikimedia.org/r/1173468 (https://phabricator.wikimedia.org/T177826) (owner: Dzahn)
[19:46:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[19:46:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1215.eqiad.wmnet with OS bullseye
[19:46:12] ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221169 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1215.eqiad.wmnet with OS bullseye completed: - an-worker1215 (**WAR...
[19:46:36] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[19:47:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[19:47:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1216.eqiad.wmnet with OS bullseye
[19:47:34] ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1216.eqiad.wmnet with OS bullseye completed: - an-worker1216 (**PAS...
[19:49:54] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[19:50:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[19:50:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1218.eqiad.wmnet with OS bullseye
[19:50:35] ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221172 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1218.eqiad.wmnet with OS bullseye completed: - an-worker1218 (**WAR...
[19:51:48] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [19:51:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1219.eqiad.wmnet with OS bullseye [19:51:57] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221173 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1219.eqiad.wmnet with OS bullseye completed: - an-worker1219 (**WAR... [19:55:06] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1214.eqiad.wmnet with OS bullseye [19:55:15] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221178 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1214.eqiad.wmnet with OS bullseye [19:55:47] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1213.eqiad.wmnet with OS bullseye [19:55:55] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1213.eqiad.wmnet with OS bullseye [19:56:23] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1212.eqiad.wmnet with OS bullseye [19:56:35] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221180 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1212.eqiad.wmnet with OS bullseye [19:57:02] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host 
an-worker1211.eqiad.wmnet with OS bullseye [19:57:15] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221181 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1211.eqiad.wmnet with OS bullseye [19:57:57] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1209.eqiad.wmnet with OS bullseye [19:57:58] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1210.eqiad.wmnet with OS bullseye [19:58:13] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221182 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1209.eqiad.wmnet with OS bullseye [19:58:14] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221183 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1210.eqiad.wmnet with OS bullseye [19:59:51] FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:10:48] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1213.eqiad.wmnet with reason: host reimage [20:11:24] jclark@cumin1002 reimage (PID 167722) is awaiting input [20:11:51] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1212.eqiad.wmnet with reason: host reimage [20:12:29] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1211.eqiad.wmnet with reason: host reimage [20:13:02] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
an-worker1209.eqiad.wmnet with reason: host reimage [20:13:08] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1210.eqiad.wmnet with reason: host reimage [20:13:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1213.eqiad.wmnet with reason: host reimage [20:15:49] (03CR) 10RLazarus: [C:03+1] "Thanks for coordinating this! Agree it doesn't need a three-step "make before break" rollout." [puppet] - 10https://gerrit.wikimedia.org/r/1191751 (https://phabricator.wikimedia.org/T405110) (owner: 10Scott French) [20:16:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1209.eqiad.wmnet with reason: host reimage [20:20:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1212.eqiad.wmnet with reason: host reimage [20:23:15] jclark@cumin1002 reimage (PID 167722) is awaiting input [20:23:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1211.eqiad.wmnet with reason: host reimage [20:27:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1210.eqiad.wmnet with reason: host reimage [20:31:16] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:33:59] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:34:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1213.eqiad.wmnet with OS bullseye [20:34:08] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221233 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1213.eqiad.wmnet with OS bullseye completed: - an-worker1213 (**PAS... [20:34:34] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:35:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:35:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1209.eqiad.wmnet with OS bullseye [20:35:19] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221234 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1209.eqiad.wmnet with OS bullseye completed: - an-worker1209 (**PAS... [20:41:04] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:41:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:41:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1211.eqiad.wmnet with OS bullseye [20:41:31] 10ops-eqiad, 06SRE, 06DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221273 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1211.eqiad.wmnet with OS bullseye completed: - an-worker1211 (**PAS... 
[20:42:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[20:44:38] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[20:44:59] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[20:45:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1212.eqiad.wmnet with OS bullseye
[20:45:16] ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221292 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1212.eqiad.wmnet with OS bullseye completed: - an-worker1212 (**WAR...
[20:46:12] SRE, SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796 (jrbs) NEW
[20:50:21] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[20:50:37] SRE, SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11221314 (jrbs)
[20:52:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[20:52:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1210.eqiad.wmnet with OS bullseye
[20:52:18] ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221325 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1210.eqiad.wmnet with OS bullseye completed: - an-worker1210 (**WAR...
[21:01:50] SRE, SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11221363 (MKopec)
[21:08:40] ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221371 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1214.eqiad.wmnet with OS bullseye executed with errors: - an-worker...
[21:12:40] SRE, SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11221379 (Dzahn) Hello! this sounds like it is about running maintenance commands on maintenance servers (mwmaint*). Is that right? Or could you add a little detail what the "change...
[21:13:17] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1214.eqiad.wmnet with OS bullseye
[21:13:28] ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221382 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1214.eqiad.wmnet with OS bullseye
[21:17:23] SRE, SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11221398 (jrbs) Hi @Dzahn, sorry for the vagueness of the request. >>! In T405796#11221379, @Dzahn wrote: > this sounds like it is about running maintenance commands on maintenance serv...
[21:21:10] SRE, SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11221413 (Dzahn) Thank you! That does clarify it. deployment server and the restricted group should work, as far as I see right now. expiry_contact means an email address to ask around...
[21:24:17] PROBLEM - MD RAID on dbproxy1024 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[21:24:19] ACKNOWLEDGEMENT - MD RAID on dbproxy1024 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T405804 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[21:24:30] ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on dbproxy1024 - https://phabricator.wikimedia.org/T405804 (ops-monitoring-bot) NEW
[21:26:20] SRE, SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11221423 (Dzahn) To get this kicked off; here are other things we will need: - an approval from @thcipriani - an approval from the direct manager (but if that's you then creating the t...
[21:28:30] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1214.eqiad.wmnet with reason: host reimage
[21:30:04] SRE, SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11221442 (Dzahn)
[21:30:51] SRE, SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11221455 (Dzahn) Open→In progress
[21:31:01] SRE, SRE-Access-Requests: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11221456 (Dzahn) p:Triage→Medium
[21:32:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1214.eqiad.wmnet with reason: host reimage
[21:33:39] (CR) Krinkle: "Suggestion 1:" [puppet] - https://gerrit.wikimedia.org/r/1167306 (https://phabricator.wikimedia.org/T119846) (owner: Dzahn)
[21:34:33] (PS3) Scott French: deployment_server: support environment in release values file name [puppet] - https://gerrit.wikimedia.org/r/1191751 (https://phabricator.wikimedia.org/T405110)
[21:34:57] SRE, DNS, MediaWiki-Platform-Team, Traffic, and 3 others: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#11221462 (Krinkle) @Tgr Thanks, I'll include loginwiki in the next batch of rollouts on Monday 29 Sep (T403510).
[21:41:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:42:58] (CR) Scott French: "Thanks, Reuven!" [puppet] - https://gerrit.wikimedia.org/r/1191751 (https://phabricator.wikimedia.org/T405110) (owner: Scott French)
[21:43:47] (PS15) Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869)
[21:44:14] (CR) CI reject: [V:-1] Introduce v1 xLab / MPIC SLOs [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: Dr0ptp4kt)
[21:53:34] (CR) Dr0ptp4kt: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: Dr0ptp4kt)
[21:56:08] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[21:56:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[21:56:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1214.eqiad.wmnet with OS bullseye
[21:56:43] ops-eqiad, SRE, DC-Ops: Q1:rack/setup/install an-worker12[09-32].eqiad.wmnet - https://phabricator.wikimedia.org/T399964#11221487 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1214.eqiad.wmnet with OS bullseye completed: - an-worker1214 (**WAR...
[21:58:49] (CR) RLazarus: [C:+1] deployment_server: support environment in release values file name (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1191751 (https://phabricator.wikimedia.org/T405110) (owner: Scott French)
[22:10:27] (PS8) Dr0ptp4kt: thanos: Add recording rules for xlab SLOs [puppet] - https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: Vgutierrez)
[22:13:05] (PS16) Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869)
[22:13:32] (CR) CI reject: [V:-1] Introduce v1 xLab / MPIC SLOs [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: Dr0ptp4kt)
[22:14:51] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[22:18:19] (CR) Dr0ptp4kt: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: Dr0ptp4kt)
[22:22:41] SRE, envoy, serviceops: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 (RLazarus) NEW
[22:24:51] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate restbase.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[22:31:43] (CR) Brennen Bearnes: [C:+1] "LGTM. Ought to be plenty for the dev instance. Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1191747 (https://phabricator.wikimedia.org/T401157) (owner: Dzahn)
[22:34:35] (PS1) RLazarus: Update to v1.32.12 [debs/envoyproxy] (v1.32) - https://gerrit.wikimedia.org/r/1191768 (https://phabricator.wikimedia.org/T405808)
[22:35:19] (CR) RLazarus: [C:+2] Update to v1.32.12 [debs/envoyproxy] (v1.32) - https://gerrit.wikimedia.org/r/1191768 (https://phabricator.wikimedia.org/T405808) (owner: RLazarus)
[22:41:29] !log rzl@apt1002:~$ sudo -i reprepro -C component/envoy-future include bullseye-wikimedia /home/rzl/envoyproxy/envoyproxy_1.32.12-1_amd64.changes # T405808
[22:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:35] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808
[22:58:10] FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:07:14] (PS1) RLazarus: envoy-future: Update to v1.32.12 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1191772 (https://phabricator.wikimedia.org/T405808)
[23:09:31] (CR) RLazarus: [V:+2] "`" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1191772 (https://phabricator.wikimedia.org/T405808) (owner: RLazarus)
[23:17:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:0 (Peering: DE-CIX (PF-AP-DAL5-1677062 MAC filter, via PacketFabric) {#12243_12334-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[23:32:53] anyone available to help me with https://wikitech.wikimedia.org/wiki/Maintenance_scripts#Output_to_a_file ?
[23:33:43] I've gotten as far as creating the file in /tmp, but in the other session `kubectl cp mw-script.codfw.ax8lssng:/tmp/FA-media-formats.xml FA-media-formats.xml` errors with `Error from server (NotFound): pods "mw-script.codfw.ax8lssng" not found`
[23:34:37] I can see the pod listed in `kubectl get pods`, so it's there… Maybe I need to specify the namespace, or something? I'm not sure what that should be
[23:36:24] musikanimal: it looks like `mw-script.codfw.ax8lssng` is the name of the job, while `mw-script.codfw.ax8lssng-2sk9j` would be the name of the pod
[23:36:45] maybe try the latter in your `kubectl cp`?
[23:38:02] ahh! I see now. That worked \o/
[23:38:05] thank you!!!
[23:38:24] glad that did it :)
[23:38:36] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1191775
[23:38:36] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1191775 (owner: TrainBranchBot)
[23:38:52] so I need to cross reference with `get pods` to find out the pod name? I didn't see `mw-script.codfw.ax8lssng-2sk9j` printed in the output of the session where I created the file
[23:40:34] precisely, yeah - when you start your job with `mwscript-k8s` it displays the name of the job object, rather than the name of the pod object that's created under the hood
[23:41:13] okay cool! Also FYI, I only discovered I needed to use `kube_env mw-script-deploy …` from search Phab and seeing your comment https://phabricator.wikimedia.org/T401252#11066941 . I will update the "Output to a file" docs with a working example
[23:42:16] ah, glad you found that, and thank you for improving docs! :)
[23:44:51] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[23:45:01] gladly! and thank you for the prompt assistance :) I even checked to see if you were online and I thought I didn't see your name, hehe
[23:45:13] (CR) Thcipriani: admin: upgrade ebomani from ldap_only to deployers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1191742 (https://phabricator.wikimedia.org/T405124) (owner: Dzahn)
[23:54:53] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1191775 (owner: TrainBranchBot)
[23:57:08] (CR) Scott French: [C:+1] "Thanks for the context surrounding the edit to the basic config." [docker-images/production-images] - https://gerrit.wikimedia.org/r/1191772 (https://phabricator.wikimedia.org/T405808) (owner: RLazarus)
[23:59:51] FIRING: [24x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
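Note on the 23:33-23:40 exchange: `mwscript-k8s` prints the job name, while `kubectl cp` needs the pod name, which is the job name plus a generated suffix. A minimal sketch of the lookup follows. `pods_for_job` is a hypothetical helper, not part of `kubectl` or `mwscript-k8s`, and it assumes pod names have the form `<job-name>-<suffix>`, as in the `mw-script.codfw.ax8lssng-2sk9j` example quoted in the discussion.

```shell
# Hypothetical helper (an assumption for illustration, not an existing tool):
# read pod names from stdin, one per line, and print those that belong to
# the given job, assuming pods are named "<job-name>-<suffix>".
pods_for_job() {
    job="$1"
    # The "|| [ -n "$pod" ]" part keeps a final line that lacks a newline.
    while IFS= read -r pod || [ -n "$pod" ]; do
        case "$pod" in
            # Quoting "$job" makes the dots in the job name match literally.
            "$job"-*) printf '%s\n' "$pod" ;;
        esac
    done
}

# Usage sketch (needs cluster access, so it is left as a comment):
#   kubectl get pods -o name | sed 's|^pod/||' | pods_for_job mw-script.codfw.ax8lssng
#   kubectl cp <pod-name>:/tmp/FA-media-formats.xml FA-media-formats.xml
```

Pods created by a Kubernetes Job normally also carry a `job-name` label, so `kubectl get pods -l job-name=<job-name> -o name` may be a shorter route. Whether that label is present on these pods is an assumption to check against the cluster.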