[00:15:25] RESOLVED: ProbeDown: Service aqs1024-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#aqs1024-b:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:21:03] (03PS1) 10Dzahn: zuul::executor: remove mounting of /etc/cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1269082 [00:21:44] (03PS2) 10Dzahn: zuul::executor: remove mounting of /etc/cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1269082 (https://phabricator.wikimedia.org/T395938) [00:22:23] !log eevans@cumin1003 START - Cookbook sre.hosts.remove-downtime for aqs1024.eqiad.wmnet [00:22:23] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1024.eqiad.wmnet [00:28:21] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11802627 (10Eevans) [00:30:07] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11802628 (10Eevans) [00:34:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:39:21] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:41:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [00:41:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:47:22] jouncebot: nowandnext [00:47:23] No deployments scheduled for the next 5 hour(s) and 12 minute(s) [00:47:23] In 5 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T0600) [00:47:23] In 5 hour(s) and 12 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T0600) [00:47:27] (03CR) 10Zabe: [C:03+2] Start reading from new file tables everwhere except enwiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268574 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [00:48:22] (03Merged) 10jenkins-bot: Start reading from new file tables everwhere except enwiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268574 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [00:49:45] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1268574|Start reading from new file tables everwhere except enwiki and commons (T416548)]] [00:49:48] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548 [00:51:37] !log zabe@deploy1003 zabe: Backport for [[gerrit:1268574|Start reading from new file tables everwhere except enwiki and commons (T416548)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[00:53:34] !log zabe@deploy1003 zabe: Continuing with sync [00:57:25] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268574|Start reading from new file tables everwhere except enwiki and commons (T416548)]] (duration: 07m 40s) [00:57:28] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548 [00:58:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1103:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:03:30] (03PS1) 10Zabe: Start reading from new file tables on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269086 (https://phabricator.wikimedia.org/T416548) [01:09:58] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1269088 [01:09:58] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1269088 (owner: 10TrainBranchBot) [01:22:55] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1269088 (owner: 10TrainBranchBot) [01:23:30] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [01:28:30] RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [01:30:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [01:31:27] RECOVERY - PyBal backends health check on 
lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:00:54] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:07:06] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 11s) [02:09:15] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:09:22] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [02:25:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:30:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:31:23] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[02:34:15] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:47:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:52:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:54:57] (03PS1) 10Aaron Schulz: Add Wikimedia REST API ?spec route for *.wikinews.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269099 (https://phabricator.wikimedia.org/T418318) [03:00:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:00:37] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:02:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are 
healthy https://wikitech.wikimedia.org/wiki/PyBal [03:02:37] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:04:15] (03PS2) 10Bernard Wang: Enable reading list beta feature for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269063 [03:05:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:06:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:09:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:09:37] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:17:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:20:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:23:27] 
RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:23:35] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:27:09] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1022 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:27:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:28:09] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1022 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:31:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:48:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1103:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:34:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:34:35] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:35:27] RECOVERY - PyBal backends 
health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:35:35] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:38:37] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:39:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:41:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:41:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:41:35] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:09:30] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool pc1016: Reimage [05:09:30] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [05:09:38] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [05:09:38] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1016: Reimage [05:10:13] (03PS1) 10Marostegui: pc2010: 
Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1269223 (https://phabricator.wikimedia.org/T422368) [05:10:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Reimage to Debian Trixie [05:11:15] (03PS2) 10Marostegui: pc2016: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1269223 (https://phabricator.wikimedia.org/T422368) [05:11:49] (03CR) 10Marostegui: [C:03+2] pc2016: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1269223 (https://phabricator.wikimedia.org/T422368) (owner: 10Marostegui) [05:13:45] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host pc2016.codfw.wmnet with OS trixie [05:26:44] (03PS1) 10Marostegui: installservers: Do not format /srv on an-redacteddb1001 [puppet] - 10https://gerrit.wikimedia.org/r/1269227 (https://phabricator.wikimedia.org/T422778) [05:31:44] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2016.codfw.wmnet with reason: host reimage [05:33:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:35:26] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2016.codfw.wmnet with reason: host reimage [05:48:22] (03CR) 10Filippo Giunchedi: [C:03+1] Revert "P:toolforge::prometheus: Disable istio-gateway scrape for now" [puppet] - 10https://gerrit.wikimedia.org/r/1268981 (https://phabricator.wikimedia.org/T421386) (owner: 10Majavah) [05:48:43] (03CR) 10Filippo Giunchedi: [C:03+1] dumps: web: Remove plaintext HTTP server [puppet] - 10https://gerrit.wikimedia.org/r/1268985 (https://phabricator.wikimedia.org/T422672) (owner: 10Majavah) [05:54:27] PROBLEM - PyBal backends health 
check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:55:37] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:56:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:56:37] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:59:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2016.codfw.wmnet with OS trixie [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T0600) [06:00:05] marostegui, Amir1, and federico3: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T0600). [06:05:26] (03CR) 10Ayounsi: "Thanks for the data. Unfortunately we're a bit blind about YT." 
[dns] - 10https://gerrit.wikimedia.org/r/1267042 (owner: 10Ayounsi) [06:10:13] (03PS1) 10Marostegui: Revert "pc2016: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1269238 [06:26:40] (03PS1) 10Muehlenhoff: Record LDAP access for lbecker [puppet] - 10https://gerrit.wikimedia.org/r/1269241 (https://phabricator.wikimedia.org/T422537) [06:29:01] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for lbecker [puppet] - 10https://gerrit.wikimedia.org/r/1269241 (https://phabricator.wikimedia.org/T422537) (owner: 10Muehlenhoff) [06:37:48] (03PS1) 10Marostegui: pc1016: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1269243 (https://phabricator.wikimedia.org/T422368) [06:38:22] (03CR) 10Marostegui: [C:03+2] pc1016: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1269243 (https://phabricator.wikimedia.org/T422368) (owner: 10Marostegui) [06:38:45] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host pc1016.eqiad.wmnet with OS trixie [06:38:52] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host pc1016.eqiad.wmnet with OS trixie [06:39:36] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host pc1016.eqiad.wmnet with OS trixie [06:40:58] (03PS1) 10Muehlenhoff: Record LDAP access for bliviero [puppet] - 10https://gerrit.wikimedia.org/r/1269244 [06:41:52] (03PS1) 10Ryan Kemper: growthbook: Add API key placeholders for automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269245 (https://phabricator.wikimedia.org/T420696) [06:43:07] 06SRE, 10LDAP-Access-Requests: Grant Access to Turnilo and Superset for MMigurski-WMF - https://phabricator.wikimedia.org/T422537#11802981 (10MoritzMuehlenhoff) 05Open→03Resolved p:05Triage→03Medium a:03MoritzMuehlenhoff >>! In T422537#11802156, @MMigurski-WMF wrote: > I have updated my email to... 
[06:46:27] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11802986 (10ayounsi) [06:46:53] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for bliviero [puppet] - 10https://gerrit.wikimedia.org/r/1269244 (owner: 10Muehlenhoff) [06:49:15] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11802989 (10ayounsi) [06:49:16] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11802988 (10ayounsi) [06:54:15] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:55:08] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1016.eqiad.wmnet with reason: host reimage [06:59:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1016.eqiad.wmnet with reason: host reimage [07:00:05] Amir1, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:02:35] (03CR) 10JMeybohm: [C:03+1] "Oh, that makes total sense! 
Sorry I did not catch this when reviewing the initial patch" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269069 (https://phabricator.wikimedia.org/T367880) (owner: 10RLazarus) [07:03:59] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on pc2016.codfw.wmnet with reason: Maintenance [07:04:10] (03CR) 10Marostegui: [C:03+2] Revert "pc2016: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1269238 (owner: 10Marostegui) [07:10:26] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:13:06] (03PS1) 10Marostegui: Revert "pc1016: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1269252 [07:13:55] (03CR) 10Majavah: [C:03+2] Revert "P:toolforge::prometheus: Disable istio-gateway scrape for now" [puppet] - 10https://gerrit.wikimedia.org/r/1268981 (https://phabricator.wikimedia.org/T421386) (owner: 10Majavah) [07:14:02] (03CR) 10Majavah: [C:03+2] dumps: web: Remove plaintext HTTP server [puppet] - 10https://gerrit.wikimedia.org/r/1268985 (https://phabricator.wikimedia.org/T422672) (owner: 10Majavah) [07:15:29] (03CR) 10Marostegui: [C:03+2] Revert "pc1016: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1269252 (owner: 10Marostegui) [07:16:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1016.eqiad.wmnet with OS trixie [07:17:15] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool pc1016: After reimage [07:17:15] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [07:17:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [07:17:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1016: After reimage [07:34:38] (03CR) 10Brouberol: "Looks good! 
You also need to bump the `growthbook-backend` subchart version, as well as the `growthbook` chart version, to ensure the chan" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269245 (https://phabricator.wikimedia.org/T420696) (owner: 10Ryan Kemper) [07:43:58] (03PS2) 10Elukey: service: allow k8s-ingress-aux-rw to be active/active [puppet] - 10https://gerrit.wikimedia.org/r/1268902 (https://phabricator.wikimedia.org/T414486) [07:43:58] (03PS1) 10Elukey: kubernetes: move aux-k8s-eqiad to 1.31 [puppet] - 10https://gerrit.wikimedia.org/r/1269259 (https://phabricator.wikimedia.org/T414486) [07:48:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1103:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:54:55] (03PS1) 10Elukey: admin_ng: Upgrade aux-k8s-eqiad to 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269329 (https://phabricator.wikimedia.org/T414486) [07:56:06] PROBLEM - SSH on logstash1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:56:15] (03CR) 10Filippo Giunchedi: [C:03+1] dumps: web: Use 429 for connection limit issues [puppet] - 10https://gerrit.wikimedia.org/r/1269021 (owner: 10Majavah) [07:56:31] FIRING: [2x] ProbeDown: Service logstash1032:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1032:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:56:48] (03CR) 10Majavah: [C:03+2] dumps: web: Use 429 for connection limit issues [puppet] - 10https://gerrit.wikimedia.org/r/1269021 (owner: 10Majavah) [07:58:56] RECOVERY - SSH on logstash1032 is OK: SSH OK - OpenSSH_9.2p1 
Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:59:06] PROBLEM - SSH on logstash1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:00:05] dancy and jnuche: gettimeofday() says it's time for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T0800) [08:00:26] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:00:29] (03PS1) 10Filippo Giunchedi: memcached: install liburi-perl [puppet] - 10https://gerrit.wikimedia.org/r/1269333 [08:01:28] (03PS1) 10Mszwarc: Fix BackfillInterwikiRightsLog wrt. cyclic renames [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269334 (https://phabricator.wikimedia.org/T6055) [08:01:31] FIRING: [4x] ProbeDown: Service logstash1023:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:03:56] RECOVERY - SSH on logstash1023 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:06:31] RESOLVED: [4x] ProbeDown: Service logstash1023:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:08:52] (03PS3) 10Fabfur: hiera: upgrade haproxy to version 3.2 on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1262064 (https://phabricator.wikimedia.org/T421402) [08:08:55] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1269333 (owner: 10Filippo Giunchedi) [08:08:55] (03CR) 10Fabfur: "check experimental" 
[puppet] - 10https://gerrit.wikimedia.org/r/1262064 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [08:10:00] (03PS3) 10Fabfur: hiera: upgrade haproxy to version 3.2 on esams [puppet] - 10https://gerrit.wikimedia.org/r/1262065 (https://phabricator.wikimedia.org/T421402) [08:11:51] !log upgrading eqiad to haproxy 3.2 (T421402) [08:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:54] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [08:11:58] (03CR) 10Filippo Giunchedi: [C:03+2] memcached: install liburi-perl [puppet] - 10https://gerrit.wikimedia.org/r/1269333 (owner: 10Filippo Giunchedi) [08:12:06] (03CR) 10Fabfur: [C:03+2] hiera: upgrade haproxy to version 3.2 on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1262064 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [08:14:29] (03CR) 10CI reject: [V:04-1] Fix BackfillInterwikiRightsLog wrt. cyclic renames [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269334 (https://phabricator.wikimedia.org/T6055) (owner: 10Mszwarc) [08:14:36] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad - 3.2 upgrade (T421402) [08:14:37] (03CR) 10Mszwarc: "recheck" [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269334 (https://phabricator.wikimedia.org/T6055) (owner: 10Mszwarc) [08:14:38] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad - 3.2 upgrade (T421402) [08:16:12] (03CR) 10Elukey: [C:03+2] service: allow k8s-ingress-aux-rw to be active/active [puppet] - 10https://gerrit.wikimedia.org/r/1268902 (https://phabricator.wikimedia.org/T414486) (owner: 10Elukey) [08:17:07] (03CR) 10Muehlenhoff: [C:03+1] "os-reports is only static content, looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1268902 (https://phabricator.wikimedia.org/T414486) (owner: 10Elukey) [08:20:37] !log 
mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe1013.eqiad.wmnet with OS bullseye [08:20:44] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11803154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe1013.eq... [08:21:06] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster pool all services in codfw/aux-codfw: maintenance [08:21:09] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-fe1013 [08:21:19] PROBLEM - gdnsd checkconf #page on dns2006 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [08:21:21] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [08:21:29] !incidents [08:21:30] 7818 (UNACKED) dns2006/gdnsd checkconf (paged) [08:21:38] !ack 7818 [08:21:38] 7818 (ACKED) dns2006/gdnsd checkconf (paged) [08:21:59] here too [08:22:17] oh my, is it related to my patch? 
[08:22:19] PROBLEM - gdnsd checkconf #page on dns7002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [08:22:21] PROBLEM - gdnsd checkconf #page on dns5003 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [08:22:27] !incidents [08:22:28] 7818 (ACKED) dns2006/gdnsd checkconf (paged) [08:22:28] 7819 (UNACKED) dns5003/gdnsd checkconf (paged) [08:22:28] 7820 (UNACKED) dns7002/gdnsd checkconf (paged) [08:22:30] !ack [08:22:30] 7819 (ACKED) dns5003/gdnsd checkconf (paged) [08:22:30] 7820 (ACKED) dns7002/gdnsd checkconf (paged) [08:22:35] (03PS1) 10Hashar: browser-tests: hide Cypress tests from CI [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269339 (https://phabricator.wikimedia.org/T419574) [08:22:38] elukey: not sure, checking [08:22:56] for context: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1268902/3/hieradata/common/service.yaml [08:22:59] (03PS1) 10Hashar: browser-tests: hide Cypress tests from CI [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269340 (https://phabricator.wikimedia.org/T419574) [08:23:04] checking as well [08:23:26] !log elukey@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-cluster (exit_code=93) pool all services in codfw/aux-codfw: maintenance [08:23:26] <_joe_> elukey: yes I think you need puppet to run everywhere before you change things [08:23:52] Invalid resource name disc-k8s-ingress-aux-rw detected from zonefile lookup [08:23:52] (03CR) 10Michael Große: [C:03+1] browser-tests: hide Cypress tests from CI [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269339 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [08:23:53] elukey: looks like it [08:23:56] error: plugin_metafo: Invalid resource name 'disc-k8s-ingress-aux-rw' detected from zonefile lookup error: Name 
'k8s-ingress-aux-rw.discovery.wmnet.': resolver plugin 'metafo' rejected resource name 'disc-k8s-ingress-aux-rw' [08:24:03] <_joe_> so run puppet everywhere [08:24:05] (03CR) 10Michael Große: [C:03+1] browser-tests: hide Cypress tests from CI [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269340 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [08:24:08] Apr 09 08:18:57 dns2006 gdnsd[2640858]: Name 'k8s-ingress-aux-rw.discovery.wmnet.': resolver plugin 'metafo' rejected resource name 'disc-k8s-ingress-aux-rw' [08:24:23] yeah sorry folks [08:24:32] running puppet [08:24:32] so yes gdnsd is complaining about the aux ingress [08:25:23] okay thanks, let's see if this calms gdnsd [08:26:16] I'm in the middle of a re-image and new-vlan cookbook and it's also unhappy [08:26:26] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [08:26:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2157 (T419635)', diff saved to https://phabricator.wikimedia.org/P90334 and previous config saved to /var/cache/conftool/dbconfig/20260409-082633-fceratto.json [08:26:37] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:26:44] because of the DNS check failures [08:27:07] running Puppet is not enough, gdnsd still fails [08:27:13] tested on dns5003 [08:27:40] dns[2006,5003,7002] all still unhappy AFAICT [08:27:41] FIRING: [3x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-k8s-ingress-aux-rw.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:28:13] so what I tried to do was to run the k8s pool cookbook [08:28:38] for aux-k8s-codfw, that tried to set k8s ingress rw to active active [08:28:50] PROBLEM - Recursive DNS on
2001:df2:e500:1:103:102:166:10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [08:28:50] PROBLEM - Recursive DNS on 103.102.166.10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [08:28:56] PROBLEM - gdnsd daemon runs exactly once on dns5003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 496 (gdnsd), args /usr/sbin/gdnsd https://wikitech.wikimedia.org/wiki/DNS [08:28:58] PROBLEM - Auth DNS on dns5003 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [08:29:02] PROBLEM - AuthDNS-over-TLS Works on dns5003 is CRITICAL: CRITICAL: ns[012] kdig DoTLS check failure https://wikitech.wikimedia.org/wiki/DNS [08:29:05] <_joe_> uh damn [08:29:15] <_joe_> revert the patch now I guess [08:29:50] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:29:57] (03PS1) 10Elukey: Revert "service: allow k8s-ingress-aux-rw to be active/active" [puppet] - 10https://gerrit.wikimedia.org/r/1269341 [08:30:01] <_joe_> actually it's the opposite puppet makes dns fail worse [08:30:11] <_joe_> elukey: merge with +2 +2 please [08:30:19] PROBLEM - gdnsd checkconf #page on dns1005 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [08:30:22] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1269341 (owner: 10Elukey) [08:30:23] (03CR) 10Elukey: [V:03+2 C:03+2] Revert "service: allow k8s-ingress-aux-rw to be active/active" [puppet] - 10https://gerrit.wikimedia.org/r/1269341 (owner: 10Elukey) [08:30:24] mvernon@cumin2002 reimage (PID 990280) is awaiting input [08:30:27] !ack [08:30:28] 7821 (ACKED) dns1005/gdnsd checkconf (paged) [08:30:34] PROBLEM - PyBal backends health check on lvs2014 is 
CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:30:43] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1269341 (owner: 10Elukey) [08:30:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:30:50] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:30:51] [that reimage awaiting input is me waiting to retry the DNS changes once this is fixed] [08:31:44] <_joe_> uhm so I think I understand what the problem is [08:32:06] I mean there must be a stale setting somewhere that makes gdnsd rightfully upset [08:32:22] <_joe_> you added a resource as ./discovery-geo-resources and not ./discovery-metafo-resources [08:32:30] <_joe_> elukey: did you puppet-merge? [08:32:34] yeah [08:32:35] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11803193 (10ayounsi) > @jcrespo > We don't reimage backups hosts. Let us know the alternative method (we can put them out... [08:32:41] FIRING: [13x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-k8s-ingress-aux-rw.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:32:48] (03PS2) 10Mszwarc: Fix BackfillInterwikiRightsLog wrt. 
cyclic renames [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269334 (https://phabricator.wikimedia.org/T6055) [08:32:54] <_joe_> running puppet on dns5003 [08:33:03] <_joe_> to confirm if gdnsd comes back [08:33:08] <_joe_> we might need more reverts [08:33:15] okok I was about to do the same [08:33:17] already running puppet on 5003 [08:33:52] <_joe_> otherwise I'll design a way to fix all this [08:34:14] (03CR) 10CI reject: [V:04-1] browser-tests: hide Cypress tests from CI [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269339 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [08:34:15] FIRING: [2x] JobUnavailable: Reduced availability for job pdnsrec in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:34:16] and gdnsd now properly starts on 5003 [08:34:18] PROBLEM - gdnsd checkconf #page on dns2005 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [08:34:19] RECOVERY - gdnsd checkconf #page on dns2006 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [08:34:20] RECOVERY - gdnsd checkconf #page on dns5003 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [08:34:34] !incidents [08:34:34] 7820 (ACKED) dns7002/gdnsd checkconf (paged) [08:34:34] 7821 (ACKED) dns1005/gdnsd checkconf (paged) [08:34:34] !incidents [08:34:34] 7822 (UNACKED) dns2005/gdnsd checkconf (paged) [08:34:35] 7818 (RESOLVED) dns2006/gdnsd checkconf (paged) [08:34:35] 7819 (RESOLVED) dns5003/gdnsd checkconf (paged) [08:34:35] 7820 (ACKED) dns7002/gdnsd checkconf (paged) [08:34:35] 7821 (ACKED) dns1005/gdnsd checkconf (paged) [08:34:35] 7822 (UNACKED) dns2005/gdnsd checkconf (paged) [08:34:36] 7818 (RESOLVED) dns2006/gdnsd 
checkconf (paged) [08:34:36] 7819 (RESOLVED) dns5003/gdnsd checkconf (paged) [08:34:40] dns5003 is happy now [08:34:46] !ack 7822 [08:34:46] 7822 (ACKED) dns2005/gdnsd checkconf (paged) [08:34:47] 2005 might need another puppet run? [08:34:48] RECOVERY - Recursive DNS on 2001:df2:e500:1:103:102:166:10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [08:34:48] RECOVERY - Recursive DNS on 103.102.166.10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [08:34:50] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:34:50] <_joe_> ok let's run puppet on all servers [08:34:54] RECOVERY - gdnsd daemon runs exactly once on dns5003 is OK: PROCS OK: 1 process with UID = 496 (gdnsd), args /usr/sbin/gdnsd https://wikitech.wikimedia.org/wiki/DNS [08:34:56] RECOVERY - Auth DNS on dns5003 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [08:34:57] going to do it [08:34:57] <_joe_> dns servers I mean [08:35:02] RECOVERY - AuthDNS-over-TLS Works on dns5003 is OK: OK: ns[012] kdig DoTLS check success https://wikitech.wikimedia.org/wiki/DNS [08:35:09] <_joe_> now I need to understand what you did wrong [08:35:12] ok thanks elukey [08:35:57] <_joe_> I think we don't support at the moment the transition from active/active to active/passive, but I need to understand what you did and what the consequences are again, I've last looked at this stuff eons ago [08:36:27] I can take care of it later on if you want, I made the mess so I can try to dig into it [08:36:45] (03PS1) 10Gkyziridis: ml-services: Remove models from experimental staging that were not deployed corrextly. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1269343 [08:36:50] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:36:56] My cookbook is still failing to get happyness from dns[1005,2005,7002] at least [08:37:10] (03CR) 10CI reject: [V:04-1] browser-tests: hide Cypress tests from CI [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269340 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [08:37:11] 2005 was still missing a puppet run [08:37:19] RECOVERY - gdnsd checkconf #page on dns1005 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [08:37:25] !incidents [08:37:25] 7820 (ACKED) dns7002/gdnsd checkconf (paged) [08:37:25] 7822 (ACKED) dns2005/gdnsd checkconf (paged) [08:37:25] 7821 (RESOLVED) dns1005/gdnsd checkconf (paged) [08:37:26] 7818 (RESOLVED) dns2006/gdnsd checkconf (paged) [08:37:26] 7819 (RESOLVED) dns5003/gdnsd checkconf (paged) [08:37:32] ok perfect [08:37:41] FIRING: [13x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-k8s-ingress-aux-rw.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:38:06] (03PS1) 10Dpogorzelski: amg-gpu: Set up explicit GPU partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) [08:38:19] RECOVERY - gdnsd checkconf #page on dns2005 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [08:38:34] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:39:15] FIRING: [2x] JobUnavailable: Reduced availability for job pdnsrec in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:39:50] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:39:52] mvernon@cumin2002 reimage (PID 990280) is awaiting input [08:40:06] halfway through the puppet runs [08:40:18] RECOVERY - gdnsd checkconf #page on dns7002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [08:40:20] yeah only 7002 is missing I think [08:40:24] ah there it is [08:40:25] I just ran puppet there [08:40:28] !incitents [08:40:32] !incidents [08:40:33] 7820 (RESOLVED) dns7002/gdnsd checkconf (paged) [08:40:33] 7822 (RESOLVED) dns2005/gdnsd checkconf (paged) [08:40:33] 7821 (RESOLVED) dns1005/gdnsd checkconf (paged) [08:40:33] 7818 (RESOLVED) dns2006/gdnsd checkconf (paged) [08:40:33] 7819 (RESOLVED) dns5003/gdnsd checkconf (paged) [08:40:50] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:41:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:41:07] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-fe1013 - mvernon@cumin2002" [08:41:12] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host 
ms-fe1013 - mvernon@cumin2002" [08:41:13] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:41:13] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-fe1013.eqiad.wmnet 149.48.64.10.in-addr.arpa 9.4.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:41:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-fe1013.eqiad.wmnet 149.48.64.10.in-addr.arpa 9.4.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:41:18] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-fe1013 [08:41:26] OK, DNS run in my cookbook went through OK now, thanks :) [08:41:39] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe1013 [08:41:39] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-fe1013 [08:42:03] Emperor: next time please wait the green light from others, I am still completing the puppet runs [08:42:56] elukey: sorry, I figured asking it to retry again was likely harmless [08:45:28] Emperor: no problem, but if you see the logs above the cookbook worked with DNS, while we are still cleaning up. It can backfire in a lot of ways :D [08:45:33] (03PS1) 10Mszwarc: Fix BackfillInterwikiRightsLog wrt. cyclic renames [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269345 (https://phabricator.wikimedia.org/T6055) [08:45:51] ok runs completed [08:46:44] great thank you elukey. Then I'll let you figure out the correct approach for active-active pooling and resume my work ? or do you need anything from oncallers? :) [08:47:25] (03CR) 10Michael Große: [C:03+1] "This needs to first backport If4ff840dc65aef61029a12fe554364d38fa277b1 and then have this changed behind it." 
[extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269339 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [08:47:51] <_joe_> elukey: do you have a pcc for your original change? [08:48:18] _joe_ I don't, no, I really thought it was harmless to be honest, my bad [08:48:38] jelto: the last bit is https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed, but I can find what's wrong don't worry [08:48:41] sorry for the noise [08:48:59] <_joe_> elukey: I'm also not sure why you wanted to make a -rw endpoint active-active [08:49:24] <_joe_> elukey: anyways, what are you trying to fix rn? [08:49:52] <_joe_> I can help I think [08:50:11] _joe_ the only thing that pointed to it was os-reports, I thought it was a leftover from when we had only aux-k8s-eqiad [08:50:27] if it is supposed to be always active/passive ok, I'll leave it there [08:50:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:50:48] <_joe_> usually we put something under -rw that needs to be written to a specific location, yes [08:51:21] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad - 3.2 upgrade (T421402) [08:51:24] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [08:52:18] the only thing to clean up is /var/lib/gdnsd/discovery-k8s-ingress-aux-rw.state on the dns servers, since afaics it points to two IPs with UP [08:52:28] (03CR) 10Mszwarc: "recheck" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269340 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [08:52:33] my understanding is that the check fails because of that [08:52:36] 10SRE-swift-storage, 10Ceph,
06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11803296 (10jcrespo) >>! In T421719#11803193, @ayounsi wrote: > We can work together on that, the process is a bit more man... [08:53:21] (03CR) 10Hashar: "recheck due to T422469 - *ReviseTone.cy.ts flakily fails to find ve-ui-editCheck-gutter-action-warning*" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269340 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [08:53:32] <_joe_> elukey: very probable yes but uhm wait [08:54:09] (03CR) 10Volans: "Thanks for porting the work started with pywmflib to spicerack!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [08:54:25] elukey, jelto: os-reports is entirely static content, generated on the puppetdb hosts, it can simply be removed from the rw endpoint? [08:54:28] <_joe_> so yes, I don't think currently there's an entry in confd for discovery-k8s-ingress-aux-rw [08:54:34] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:54:50] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:55:06] or is that used for rsync for the data sync? 
[08:55:36] yes that's also my understanding, os-reports is static and read only, the syncing happens on rsync level (the k8s pod fetches it from the puppetserver) and not over https [08:55:36] (03CR) 10Hashar: "The second `recheck` I have done is because I had it prepared in a window and did not sent it immediately because I was replying on anothe" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269340 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [08:55:48] <_joe_> elukey: which was pooled before? eqiad or codfw? [08:56:09] eqiad [08:56:16] <_joe_> ok [08:56:19] (03CR) 10Hashar: "Oh nice, thank you. I will propose the backport and rebase this change on top of it." [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269339 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [08:56:21] <_joe_> running confctl --object-type discovery select 'dnsdisc=k8s-ingress-aux-rw,name=codfw' set/pooled=false [08:56:23] !log oblivian@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=k8s-ingress-aux-rw,name=codfw [08:56:43] (03PS1) 10Hashar: fix: adjust to return type changed by upstream [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269351 [08:56:50] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:56:59] (03PS2) 10Hashar: browser-tests: hide Cypress tests from CI [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269339 (https://phabricator.wikimedia.org/T419574) [08:57:11] (03CR) 10CI reject: [V:04-1] Fix BackfillInterwikiRightsLog wrt. 
cyclic renames [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269345 (https://phabricator.wikimedia.org/T6055) (owner: 10Mszwarc) [08:57:34] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:57:42] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2157: Pooling in [08:58:10] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1013.eqiad.wmnet with reason: host reimage [08:59:07] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad - 3.2 upgrade (T421402) [08:59:10] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [08:59:39] <_joe_> elukey: I have to hop into a meeting, but AFAICT most things are ok now, there's just one problem left but it should not have to do with confd [08:59:54] <_joe_> there's probably some error files leftover, but I can't check rn [08:59:54] okok thanks [09:04:10] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1013.eqiad.wmnet with reason: host reimage [09:04:19] 06SRE, 10Wikimedia-Mailing-lists, 07Upstream: improve new mailing list admin notifications - https://phabricator.wikimedia.org/T281987#11803370 (10Effeietsanders) @Legoktm Thanks again for this! Is there progress on the more high level task? I have another few lists where I was trying to add this, and just f... [09:05:05] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11803372 (10jcrespo) >>! In T420623#11801010, @Papaul wrote: > @jcrespo we can do this next week Wednesday April 15th at 10am CT .... [09:06:11] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. 
[09:07:17] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [09:07:35] (03CR) 10Gkyziridis: [C:03+2] ml-services: Remove models from experimental staging that were not deployed corrextly. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269343 (owner: 10Gkyziridis) [09:09:30] (03Merged) 10jenkins-bot: ml-services: Remove models from experimental staging that were not deployed corrextly. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269343 (owner: 10Gkyziridis) [09:09:34] 10SRE-swift-storage, 06Commons: Commons file not found - https://phabricator.wikimedia.org/T413507#11803391 (10TheDJ) > error on line 18 at column 39: xmlns:ns0: '&#38;#38;ns_ai;' is not a valid URI" The file uses a xml namespace that we do not recognize and do not allow. It's apparently a bug in a ve... [09:09:37] (03CR) 10CI reject: [V:04-1] fix: adjust to return type changed by upstream [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269351 (owner: 10Hashar) [09:09:40] (03PS1) 10Gmodena: RSU: increase parallelism for staging deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269354 (https://phabricator.wikimedia.org/T422791) [09:09:53] (03PS4) 10Fabfur: hiera: upgrade haproxy to version 3.2 on esams [puppet] - 10https://gerrit.wikimedia.org/r/1262065 (https://phabricator.wikimedia.org/T421402) [09:09:57] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1262065 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [09:10:38] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[09:11:06] ah right I found it [09:11:24] there are /var/run/confd-template/_var_lib_gdnsd_discovery-k8s-ingress-aux-rw.state.err in the various dns hosts, I was looking on one that didn't have them [09:11:41] !log upgrading esams to haproxy 3.2 (T421402) [09:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:44] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [09:12:09] !log remove /var/run/confd-template/_var_lib_gdnsd_discovery-k8s-ingress-aux-rw.state.err on dns5004 and restart confd [09:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:53] 10SRE-swift-storage, 06Commons: Commons file not found - https://phabricator.wikimedia.org/T413507#11803397 (10MatthewVernon) @Jeff_G please open new tickets when reporting new issues, unless you're really 100% sure you've got a recurrence of exactly the same issue again - it's really easy to merge tickets... [09:13:09] logs are good now on 5004 but it takes time for the alert to clear [09:13:12] (probably) [09:13:58] (03CR) 10Fabfur: [C:03+2] hiera: upgrade haproxy to version 3.2 on esams [puppet] - 10https://gerrit.wikimedia.org/r/1262065 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [09:14:05] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11803425 (10ABran-WMF) >>! In T420623#11803372, @jcrespo wrote: > CC @ABran-WMF @Jelto that backups from gerrit & gitlab (and attem... 
[09:15:51] !log remove /var/run/confd-template/_var_lib_gdnsd_discovery-k8s-ingress-aux-rw.state.err on affected dns servers and restart confd [09:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:18] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams - 3.2 upgrade (T421402) [09:16:25] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams - 3.2 upgrade (T421402) [09:17:41] FIRING: [11x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-k8s-ingress-aux-rw.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [09:19:53] this is not true, only two remaining --^ [09:19:57] should be cleared soon [09:20:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:22:06] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1013.eqiad.wmnet with OS bullseye [09:22:16] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11803454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe1013.eqiad.... 
[09:22:41] RESOLVED: [11x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-k8s-ingress-aux-rw.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [09:23:19] 06SRE, 06Infrastructure-Foundations, 10netops: cr1-esams failed upgrade - https://phabricator.wikimedia.org/T422525#11803455 (10cmooney) [09:23:48] (03PS1) 10Fabfur: hiera: cleanup custom haproxy version [puppet] - 10https://gerrit.wikimedia.org/r/1269368 (https://phabricator.wikimedia.org/T421402) [09:25:03] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on P{ms-fe[1009-1012,1014-1024].eqiad.wmnet} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [09:25:58] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11803462 (10MatthewVernon) [09:29:12] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1269368 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [09:29:43] (03CR) 10Elukey: Move linting to Ruff and apply code fixes (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [09:32:33] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) (owner: 10Dpogorzelski) [09:32:34] (03CR) 10Clément Goubert: [C:03+1] Add Wikimedia REST API ?spec route for *.wikinews.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269099 (https://phabricator.wikimedia.org/T418318) (owner: 10Aaron Schulz) [09:33:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on 
P{ms-fe[1009-1012,1014-1024].eqiad.wmnet} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [09:33:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:35:43] (03PS2) 10Fabfur: hiera: cleanup custom haproxy version [puppet] - 10https://gerrit.wikimedia.org/r/1269368 (https://phabricator.wikimedia.org/T421402) [09:35:55] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1269368 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [09:36:53] (03PS1) 10Gkyziridis: ml-services: Deploy again edit-check model on experimental staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269369 [09:39:13] (03CR) 10Elukey: "Left a couple of comments but overall it looks very good. Have you already tested it on a node? amd-smi commands, the unit, etc.." [puppet] - 10https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) (owner: 10Dpogorzelski) [09:40:34] (03Abandoned) 10Daniel Kinzler: rest gateway: use IP as rate limit key for compliant bots [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268520 (https://phabricator.wikimedia.org/T422471) (owner: 10Daniel Kinzler) [09:43:12] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2157: Pooling in [09:44:43] (03CR) 10Kevin Bazira: ml-services: Deploy again edit-check model on experimental staging. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269369 (owner: 10Gkyziridis) [09:46:20] (03CR) 10Clément Goubert: [C:03+1] rest gateway: introduce policy for Abstract Wikipedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler) [09:46:57] (03PS1) 10Arnaudb: gerrit: disable connection reuse on Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1269364 (https://phabricator.wikimedia.org/T421827) [09:46:57] (03CR) 10Arnaudb: [C:03+2] "that change has been manually applied and has not negatively impacted production, merging this to let the test run its course." [puppet] - 10https://gerrit.wikimedia.org/r/1269364 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [09:48:18] (03PS2) 10Gkyziridis: ml-services: Deploy again edit-check model on experimental staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269369 [09:48:53] (03CR) 10Gkyziridis: ml-services: Deploy again edit-check model on experimental staging. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269369 (owner: 10Gkyziridis) [09:52:33] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Deploy again edit-check model on experimental staging. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269369 (owner: 10Gkyziridis) [09:53:23] (03CR) 10Clément Goubert: [C:03+1] rest gateway: avoid re-defining routes for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269024 (owner: 10Daniel Kinzler) [09:53:29] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy again edit-check model on experimental staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269369 (owner: 10Gkyziridis) [09:55:40] (03Merged) 10jenkins-bot: ml-services: Deploy again edit-check model on experimental staging. 
[deployment-charts] - https://gerrit.wikimedia.org/r/1269369 (owner: Gkyziridis)
[09:57:37] <_joe_> elukey: so, I'm not entirely sure what went wrong with your change. I thought that running puppet should've solved the alerts on the configuration, actually
[09:57:49] <_joe_> so that's something we need to understand better
[09:58:28] <_joe_> but also - not sure what that change was intended to achieve in the context of the linked task
[09:58:56] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T1000)
[10:00:57] SRE, ServiceOps new, Datacenter-Switchover: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11803595 (Blake) a:Blake
[10:01:49] (PS1) Volans: debdeploy: fix variable name after refactor [debs/debdeploy] - https://gerrit.wikimedia.org/r/1269375
[10:04:57] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams - 3.2 upgrade (T421402)
[10:05:00] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402
[10:07:55] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe1019.eqiad.wmnet with OS bullseye
[10:07:59] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams - 3.2 upgrade (T421402)
[10:08:07] SRE-swift-storage, Ceph, Data-Persistence, DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11803601 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe1019.eq...
[10:08:20] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-fe1019
[10:08:34] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox
[10:10:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:11:42] (CR) Fabfur: [C:+2] hiera: cleanup custom haproxy version [puppet] - https://gerrit.wikimedia.org/r/1269368 (https://phabricator.wikimedia.org/T421402) (owner: Fabfur)
[10:13:07] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-fe1019 - mvernon@cumin2002"
[10:13:12] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-fe1019 - mvernon@cumin2002"
[10:13:12] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:13:13] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-fe1019.eqiad.wmnet 92.32.64.10.in-addr.arpa 2.9.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[10:13:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-fe1019.eqiad.wmnet 92.32.64.10.in-addr.arpa 2.9.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[10:13:17] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-fe1019
[10:13:41] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe1019
[10:13:42] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-fe1019
[10:20:47] (CR) Muehlenhoff: [C:+1] "Patch looks good and tested on cumin2002"
[debs/debdeploy] - https://gerrit.wikimedia.org/r/1269375 (owner: Volans)
[10:21:31] (CR) Volans: [V:+2 C:+2] debdeploy: fix variable name after refactor [debs/debdeploy] - https://gerrit.wikimedia.org/r/1269375 (owner: Volans)
[10:22:02] (PS1) Fabfur: hiera: cleanup on cp2041 for haproxy 3.2 upgrade [puppet] - https://gerrit.wikimedia.org/r/1269387 (https://phabricator.wikimedia.org/T421402)
[10:29:49] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1019.eqiad.wmnet with reason: host reimage
[10:29:59] (CR) Trueg: [C:+2] RSU: increase parallelism for staging deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1269354 (https://phabricator.wikimedia.org/T422791) (owner: Gmodena)
[10:30:45] (CR) Volans: "replies inline" [software/spicerack] - https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: Elukey)
[10:31:06] (CR) Lucas Werkmeister (WMDE): "Is there a Phabricator task for this? Would be nice to include it in the commit message and/or a PHP comment."
[mediawiki-config] - https://gerrit.wikimedia.org/r/1269063 (owner: Bernard Wang)
[10:32:24] (Merged) jenkins-bot: RSU: increase parallelism for staging deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1269354 (https://phabricator.wikimedia.org/T422791) (owner: Gmodena)
[10:33:28] (CR) Vgutierrez: [C:-1] aptrepo,haproxy: add haproxy-awslc component/package (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1262068 (https://phabricator.wikimedia.org/T419825) (owner: Fabfur)
[10:34:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1019.eqiad.wmnet with reason: host reimage
[10:38:06] (PS1) Fabfur: Revert "haproxy: temporary removing haproxy3.2 specific conf" [puppet] - https://gerrit.wikimedia.org/r/1269389
[10:39:27] (CR) Vgutierrez: [C:+1] "please amend the commit to add a phab task reference using `Bug:`" [puppet] - https://gerrit.wikimedia.org/r/1269389 (owner: Fabfur)
[10:40:02] (PS2) Fabfur: Revert "haproxy: temporary removing haproxy3.2 specific conf" [puppet] - https://gerrit.wikimedia.org/r/1269389 (https://phabricator.wikimedia.org/T422030)
[10:41:58] (PS3) Fabfur: Revert "haproxy: temporary removing haproxy3.2 specific conf" [puppet] - https://gerrit.wikimedia.org/r/1269389 (https://phabricator.wikimedia.org/T421402)
[10:44:20] (CR) Fabfur: [C:+2] Revert "haproxy: temporary removing haproxy3.2 specific conf" [puppet] - https://gerrit.wikimedia.org/r/1269389 (https://phabricator.wikimedia.org/T421402) (owner: Fabfur)
[10:45:33] !log upgrade debdeploy-server on cumin2002 to 0.0.99.14-1+deb12u1+exp2 (temporary build with Cumin 6 compat before we have Cumin 6 universally)
[10:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:25] !log installing openssl security updates
[10:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:42] (PS1) Muehlenhoff:
Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1269394
[10:49:42] (CR) Fabfur: [C:+2] hiera: cleanup on cp2041 for haproxy 3.2 upgrade [puppet] - https://gerrit.wikimedia.org/r/1269387 (https://phabricator.wikimedia.org/T421402) (owner: Fabfur)
[10:54:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1019.eqiad.wmnet with OS bullseye
[10:54:27] SRE-swift-storage, Ceph, Data-Persistence, DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11803791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe1019.eqiad....
[10:54:31] (PS3) Daniel Kinzler: rest gateway: prevent abuse of exempt api modules [deployment-charts] - https://gerrit.wikimedia.org/r/1255731 (https://phabricator.wikimedia.org/T419130)
[10:55:33] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on P{ms-fe[1009-1018,1020-1024].eqiad.wmnet} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[10:57:52] SRE-swift-storage, Ceph, Data-Persistence, DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11803797 (MatthewVernon)
[11:01:02] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[11:01:11] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[11:01:26] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[11:01:42] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE
helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[11:02:48] (PS1) Clément Goubert: service-catalog: Add recommendation-api-ng [puppet] - https://gerrit.wikimedia.org/r/1269398 (https://phabricator.wikimedia.org/T422804)
[11:03:03] (PS1) Vgutierrez: cache::haproxy,aptrepo: Clean-up old haproxy versions [puppet] - https://gerrit.wikimedia.org/r/1269399
[11:04:15] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on P{ms-fe[1009-1018,1020-1024].eqiad.wmnet} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[11:04:18] (PS1) Clément Goubert: rest-gateway: Add liftwing listeners and network policies [deployment-charts] - https://gerrit.wikimedia.org/r/1269401 (https://phabricator.wikimedia.org/T422804)
[11:04:54] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[11:05:00] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[11:05:54] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1269399 (owner: Vgutierrez)
[11:14:21] (PS3) Aude: Enable reading list beta feature for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1269063 (https://phabricator.wikimedia.org/T420878) (owner: Bernard Wang)
[11:14:44] (PS1) Muehlenhoff: sre.cdn.roll-restart-reboot-ncredir: Fix aliases [cookbooks] - https://gerrit.wikimedia.org/r/1269402
[11:16:53] !log installing tiff security updates
[11:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:59] (PS4) Aude: Enable reading list beta feature for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1269063 (https://phabricator.wikimedia.org/T420878) (owner: Bernard Wang)
[11:20:29] (CR) Aude: "added the
Phabricator task in the commit message and as a comment" [mediawiki-config] - https://gerrit.wikimedia.org/r/1269063 (https://phabricator.wikimedia.org/T420878) (owner: Bernard Wang)
[11:25:20] (CR) Muehlenhoff: [C:+2] Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1269394 (owner: Muehlenhoff)
[11:25:52] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe1020.eqiad.wmnet with OS bullseye
[11:26:07] SRE-swift-storage, Ceph, Data-Persistence, DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11803855 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe1020.eq...
[11:26:23] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-fe1020
[11:27:04] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox
[11:27:28] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling restart_daemons on A:docker-registry
[11:28:49] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[11:29:00] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[11:29:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling restart_daemons on A:docker-registry
[11:30:55] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-fe1020 - mvernon@cumin2002"
[11:31:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-fe1020 - mvernon@cumin2002"
[11:31:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:31:01] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-fe1020.eqiad.wmnet 113.48.64.10.in-addr.arpa 3.1.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[11:31:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-fe1020.eqiad.wmnet 113.48.64.10.in-addr.arpa 3.1.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[11:31:06] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-fe1020
[11:32:53] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 80%, RTA = 8243.14 ms
[11:33:19] RECOVERY - Host titan1002 is UP: PING WARNING - Packet loss = 0%, RTA = 1810.96 ms
[11:35:03] FIRING: ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:40:03] RESOLVED: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:44:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe1020
[11:44:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-fe1020
[11:48:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1103:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:51:30] !log installing nginx security updates
[11:51:31] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:43] (PS1) Muehlenhoff: thumbor: Update service image to latest rebuild [deployment-charts] - https://gerrit.wikimedia.org/r/1269419
[11:56:35] (CR) A smart kitten: throttle: Allow for overriding temp account creation limits (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1008112 (https://phabricator.wikimedia.org/T357777) (owner: Kosta Harlan)
[11:58:30] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1269351 (owner: Hashar)
[11:58:53] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1269339 (https://phabricator.wikimedia.org/T419574) (owner: Hashar)
[11:59:35] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1269340 (https://phabricator.wikimedia.org/T419574) (owner: Hashar)
[12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T1200)
[12:00:51] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1020.eqiad.wmnet with reason: host reimage
[12:01:04] PROBLEM - Host cirrussearch1103 is DOWN: PING CRITICAL - Packet loss = 100%
[12:02:04] !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[12:03:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1103:9100 -
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:03:42] RECOVERY - Host cirrussearch1103 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[12:04:34] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1020.eqiad.wmnet with reason: host reimage
[12:05:08] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [core] (wmf/1.46.0-wmf.22) - https://gerrit.wikimedia.org/r/1269345 (https://phabricator.wikimedia.org/T6055) (owner: Mszwarc)
[12:05:23] (PS1) Clément Goubert: services_proxy: Bump swift timeout [puppet] - https://gerrit.wikimedia.org/r/1269420 (https://phabricator.wikimedia.org/T328872)
[12:05:30] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [core] (wmf/1.46.0-wmf.23) - https://gerrit.wikimedia.org/r/1269334 (https://phabricator.wikimedia.org/T6055) (owner: Mszwarc)
[12:05:43] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding apus-be1005 to eqiad - jclark@cumin1003"
[12:05:48] (CR) Ladsgroup: [C:+1] services_proxy: Bump swift timeout [puppet] - https://gerrit.wikimedia.org/r/1269420 (https://phabricator.wikimedia.org/T328872) (owner: Clément Goubert)
[12:05:49] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding apus-be1005 to eqiad - jclark@cumin1003"
[12:05:49] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:06:06] PROBLEM - Host clouddb1019 is DOWN: PING CRITICAL - Packet loss
= 100%
[12:06:24] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host apus-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:06:25] dhinus: ^
[12:06:27] is that you?
[12:06:48] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host apus-be1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:07:50] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[12:08:07] (CR) Tiziano Fogli: [C:+2] thanos/compact: add support for instance-based partitioning [puppet] - https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) (owner: Tiziano Fogli)
[12:08:14] (CR) Tiziano Fogli: [C:+2] pontoon: override promethues_instances designated_compactor [puppet] - https://gerrit.wikimedia.org/r/1260651 (https://phabricator.wikimedia.org/T386911) (owner: Tiziano Fogli)
[12:08:24] !log restarting Postfix on mx-out to pick up OpenSSL updates
[12:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:37] marostegui: nope, I did not touch it
[12:09:43] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[12:10:18] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[12:11:19] ops-eqiad, cloud-services-team, Data-Services, DBA, DC-Ops: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11804005 (Marostegui) #ops-eqiad can you check on site? The above errors seem HW related.
[12:12:13] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[12:13:40] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet
[12:16:47] (CR) MVernon: "Hi," [puppet] - https://gerrit.wikimedia.org/r/1269420 (https://phabricator.wikimedia.org/T328872) (owner: Clément Goubert)
[12:18:06] !log restarting Postfix on mx-in to pick up OpenSSL updates
[12:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:37] (CR) Elukey: [C:+1] service-catalog: Add recommendation-api-ng [puppet] - https://gerrit.wikimedia.org/r/1269398 (https://phabricator.wikimedia.org/T422804) (owner: Clément Goubert)
[12:21:35] SRE, Infrastructure-Foundations, netops: Observability: Re-IP codfw private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T422816 (ayounsi) NEW
[12:21:36] (CR) Clément Goubert: "Hmm, right, this is a standard timeout, we can currently define either that or an `idle_timeout` for service proxy listeners. I think we c" [puppet] - https://gerrit.wikimedia.org/r/1269420 (https://phabricator.wikimedia.org/T328872) (owner: Clément Goubert)
[12:23:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1020.eqiad.wmnet with OS bullseye
[12:24:02] SRE-swift-storage, Ceph, Data-Persistence, DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11804060 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe1020.eqiad....
[12:24:05] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on P{ms-fe[1009-1019,1021-1024].eqiad.wmnet} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[12:27:08] (CR) Fabfur: [C:+1] "lgtm!"
[puppet] - https://gerrit.wikimedia.org/r/1269399 (owner: Vgutierrez)
[12:27:10] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host apus-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:28:15] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host apus-be1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:30:31] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[12:30:40] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply
[12:32:22] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on P{ms-fe[1009-1019,1021-1024].eqiad.wmnet} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[12:33:10] !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[12:33:25] SRE-swift-storage, Ceph, Data-Persistence, DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11804102 (MatthewVernon)
[12:33:25] (PS2) Tiziano Fogli: thanos/compact: assign prometheus instances to compactors [puppet] - https://gerrit.wikimedia.org/r/1265429 (https://phabricator.wikimedia.org/T386911)
[12:33:26] (PS1) Tiziano Fogli: thanos/compact: run compactor on ruler-block host even without other instances [puppet] - https://gerrit.wikimedia.org/r/1269439 (https://phabricator.wikimedia.org/T386911)
[12:34:19] (CR) Tiziano Fogli: [C:+2] thanos/compact: run compactor on ruler-block host even without other instances [puppet] - https://gerrit.wikimedia.org/r/1269439 (https://phabricator.wikimedia.org/T386911) (owner: Tiziano Fogli)
[12:34:39] !log
mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-fe1007.eqiad.wmnet with OS bullseye
[12:34:45] SRE-swift-storage, Ceph, Data-Persistence, DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11804107 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-fe100...
[12:35:10] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host thanos-fe1007
[12:35:37] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q3:rack/setup/install apus-be100[56] - https://phabricator.wikimedia.org/T418901#11804113 (Jclark-ctr) a:Jclark-ctr
[12:38:25] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[12:38:26] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox
[12:38:52] jclark@cumin1003 netbox (PID 4138077) is awaiting input
[12:38:56] (CR) Daniel Kinzler: rest gateway: prevent abuse of exempt api modules (8 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1255731 (https://phabricator.wikimedia.org/T419130) (owner: Daniel Kinzler)
[12:39:15] (CR) JMeybohm: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1269259 (https://phabricator.wikimedia.org/T414486) (owner: Elukey)
[12:39:15] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:39:31] (CR) JMeybohm: [C:+1] admin_ng: Upgrade aux-k8s-eqiad to 1.31 [deployment-charts] - https://gerrit.wikimedia.org/r/1269329 (https://phabricator.wikimedia.org/T414486) (owner: Elukey)
[12:40:10] jclark-ctr: looks like my DNS update cookbook took a lock from yours.
[12:40:16] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[12:40:23] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[12:40:35] (PS2) Clément Goubert: services_proxy: Bump swift timeout [puppet] - https://gerrit.wikimedia.org/r/1269420 (https://phabricator.wikimedia.org/T328872)
[12:40:39] ops-eqiad, SRE, DC-Ops: Alert for device ps1-e1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T422748#11804123 (Jclark-ctr) #1: Sensor: Line, BA:L1, Current Value: 12.04 A (current) Thresholds: High: 12 moved cable from L1-L3 to L2-L3
[12:41:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[12:41:24] (CR) Ladsgroup: [C:+1] services_proxy: Bump swift timeout [puppet] - https://gerrit.wikimedia.org/r/1269420 (https://phabricator.wikimedia.org/T328872) (owner: Clément Goubert)
[12:41:59] ops-eqiad, SRE, DC-Ops: Alert for device ps1-e1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T422748#11804124 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[12:42:02] (CR) Clément Goubert: "Although the default `stream_idle_timeout` for envoy is 5 minutes, and is a connection manager-wide setting, so I think we should keep tha" [puppet] - https://gerrit.wikimedia.org/r/1269420 (https://phabricator.wikimedia.org/T328872) (owner: Clément Goubert)
[12:42:09] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding phab1006 to eqiad - jclark@cumin1003"
[12:42:09] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster depool all services in
eqiad/aux-eqiad: maintenance
[12:42:14] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding phab1006 to eqiad - jclark@cumin1003"
[12:42:14] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:42:15] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[12:42:25] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[12:42:38] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) depool all services in eqiad/aux-eqiad: maintenance
[12:42:39] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox
[12:42:42] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[12:43:01] (PS1) Krinkle: Enable wgTrackMediaRequestProvenance on wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1269440 (https://phabricator.wikimedia.org/T414338)
[12:43:03] (PS1) Krinkle: Enable wgTrackMediaRequestProvenance on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/1269441 (https://phabricator.wikimedia.org/T414338)
[12:43:06] (PS1) Krinkle: Enable wgTrackMediaRequestProvenance on remaining Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1269442 (https://phabricator.wikimedia.org/T414338)
[12:44:57] ops-eqiad, SRE, cloud-services-team, Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11804137 (Jclark-ctr) a:Jclark-ctr
[12:44:58] ops-eqiad, SRE, collaboration-services, DC-Ops: Q3:rack/setup/install phab1006 - https://phabricator.wikimedia.org/T418905#11804138 (Jclark-ctr)
[12:46:09] !log elukey@cumin1003 START - Cookbook sre.k8s.wipe-cluster Wipe the K8s cluster aux-eqiad: Kubernetes upgrade
[12:46:25] !log mvernon@cumin2002 START - Cookbook
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host thanos-fe1007 - mvernon@cumin2002"
[12:46:30] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host thanos-fe1007 - mvernon@cumin2002"
[12:46:30] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:46:31] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache thanos-fe1007.eqiad.wmnet 186.48.64.10.in-addr.arpa 6.8.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:46:34] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) thanos-fe1007.eqiad.wmnet 186.48.64.10.in-addr.arpa 6.8.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:46:35] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host thanos-fe1007
[12:47:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host thanos-fe1007
[12:47:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host thanos-fe1007
[12:47:44] jclark@cumin1003 provision (PID 4146975) is awaiting input
[12:48:51] (CR) Elukey: [C:+2] admin_ng: Upgrade aux-k8s-eqiad to 1.31 [deployment-charts] - https://gerrit.wikimedia.org/r/1269329 (https://phabricator.wikimedia.org/T414486) (owner: Elukey)
[12:49:12] (CR) Elukey: [C:+2] kubernetes: move aux-k8s-eqiad to 1.31 [puppet] - https://gerrit.wikimedia.org/r/1269259 (https://phabricator.wikimedia.org/T414486) (owner: Elukey)
[12:49:30] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - aux-k8s-ctrl_6443: Servers aux-k8s-ctrl1003.eqiad.wmnet are marked down but pooled: k8s-ingress-aux_30443: Servers aux-k8s-worker1006.eqiad.wmnet,
aux-k8s-worker1008.eqiad.wmnet, aux-k8s-worker1002.eqiad.wmnet, aux-k8s-worker1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:51:16] FIRING: [2x] ProbeDown: Service people1005:30443 has failed probes (http_os_reports_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:53:48] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11804164 (10Marostegui) [12:53:58] !log Directly pushed GrowthExperiments wmf/1.46.0-wmf.22 patch https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1269351 due to a chicken-and-egg issue on that branch [12:53:59] jclark@cumin1003 provision (PID 4146975) is awaiting input [12:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:03] jouncebot: refresh [12:54:04] I refreshed my knowledge about deployments. 
[12:54:08] jouncebot: nowandnext [12:54:09] For the next 0 hour(s) and 5 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T1200) [12:54:09] In 0 hour(s) and 5 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T1300) [12:54:15] FIRING: [2x] JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:54:21] (03CR) 10Hashar: [C:03+2] browser-tests: hide Cypress tests from CI [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269339 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [12:54:26] (03CR) 10Hashar: [C:03+2] browser-tests: hide Cypress tests from CI [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269340 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [12:54:44] I am firing those in ahead of the start of the window [12:54:48] in the interest of time [12:56:47] (03CR) 10Muehlenhoff: [C:03+2] thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269419 (owner: 10Muehlenhoff) [12:56:51] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11804185 (10Jclark-ctr) this server is out of warranty i performed flea power drain and did come up. i am updating firmwares right now you might see it reboot a few times [12:57:54] RECOVERY - Host clouddb1019 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [12:58:07] (03CR) 10Hashar: "Why does this change depends on I9947ab95cbe5c404fbff0c81a1d085c3a84962f4? It is unrelated code wise. 
My understanding is that it was er" [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269345 (https://phabricator.wikimedia.org/T6055) (owner: 10Mszwarc) [12:58:32] PROBLEM - SSH on clouddb1019 is CRITICAL: connect to address 10.64.48.9 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:58:51] (03CR) 10Hashar: "See my comments on the wmf.22 backport at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1269345/comments/1134fa34_bf930d2d . The `Depe" [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269334 (https://phabricator.wikimedia.org/T6055) (owner: 10Mszwarc) [12:59:01] I am curious when the train is today for group2? [12:59:03] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [12:59:12] aude: see https://wikitech.wikimedia.org/wiki/Deployments ? [12:59:25] !log jmm@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [12:59:36] we have two window per day [12:59:44] !log jmm@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [12:59:51] one in the european morning which is run by European releng people (Andre, Jnuche and me [12:59:53] elukey@cumin1003 wipe-cluster (PID 4150633) is awaiting input [12:59:56] ok thanks, I did not see the second one [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T1300). [13:00:05] aude, hashar, and Msz2001: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:10] o/ [13:00:13] o/ [13:00:14] (03CR) 10Mszwarc: "It was due to CI failures trigerred by GrowthExperiments, see: https://integration.wikimedia.org/ci/job/quibble-with-GrowthExperiments-ext" [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269334 (https://phabricator.wikimedia.org/T6055) (owner: 10Mszwarc) [13:00:16] and a second window later which is run by releng USA people [13:00:16] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11804220 (10Marostegui) Thank you @Jclark-ctr - let us know when we can take over. Thankfully its replacement will arrive soon (famous last words) (T405296) [13:00:33] aude: so that will be later tonight because Ahmon is running the train this week -:] [13:00:40] ok thanks [13:00:42] o/ [13:00:48] Msz2001: I commented on your backport change for mediawiki/core [13:01:01] they have a Depends-On header that is unrelated to the code and seems to confuse CI / Zuul somehow [13:01:07] responded, I'll remove them [13:01:15] my understanding is you have added it in the master change to work around an ongoing deployment in CI [13:01:17] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [13:01:19] which was smart [13:01:23] but TOO smart for Zuul :b [13:01:30] (03PS2) 10Mszwarc: Fix BackfillInterwikiRightsLog wrt. cyclic renames [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269345 (https://phabricator.wikimedia.org/T6055) [13:01:37] (03CR) 10Mszwarc: "Done" [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269345 (https://phabricator.wikimedia.org/T6055) (owner: 10Mszwarc) [13:01:38] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:01:43] Msz2001: thanks!!!! :-] [13:01:46] (03PS3) 10Mszwarc: Fix BackfillInterwikiRightsLog wrt. 
cyclic renames [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269334 (https://phabricator.wikimedia.org/T6055) [13:02:06] I have proposed some patches for GrowthExperimens, one of them I have had to push it directly to the branch due to a chicken and egg problem [13:02:13] the other two I +2ed them ~ 8 minutes ago [13:02:27] there are the two mediawiki/core changes by Msz2001 [13:02:30] and a config change by aude [13:02:43] and I am not sure whether we should roll everything in one go ? [13:02:47] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe1007.eqiad.wmnet with reason: host reimage [13:03:07] !log jmm@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [13:03:12] I would probably do the config change separately [13:03:15] mine is ready [13:03:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:03:28] I can deploy mine after that [13:03:54] the “browser-tests” GrowthExperiments changes look like no-ops (not sure why we’re deploying them at all), so they should be safe to combine [13:04:00] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:04:01] yup [13:04:43] FIRING: OtelCollectorEnqueuedSpans: Some spans have been enqueued by exporter otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorEnqueuedSpans [13:04:55] (03Merged) 10jenkins-bot: browser-tests: hide Cypress tests from CI [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269339 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [13:04:57] (03Merged) 
10jenkins-bot: browser-tests: hide Cypress tests from CI [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269340 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [13:05:04] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: sync [13:05:16] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: sync [13:05:19] (03PS3) 10Mszwarc: Fix BackfillInterwikiRightsLog wrt. cyclic renames [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269345 (https://phabricator.wikimedia.org/T6055) [13:05:23] so either: 1) everything in one deploy 2) config then all the code changes 3) all the code changes then config [13:05:23] (03PS4) 10Mszwarc: Fix BackfillInterwikiRightsLog wrt. cyclic renames [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269334 (https://phabricator.wikimedia.org/T6055) [13:05:36] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: sync [13:05:37] I am not sure what is the preference, I haven't run backports in a while [13:05:38] the backports got merged, so they should be deployed [13:05:45] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: sync [13:05:50] otherwise they’ll just get included in whatever the next deploy is anyway [13:05:55] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/redioscope: sync [13:05:59] good point [13:06:03] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/redioscope: sync [13:06:10] so 3) all the code changes then config [13:06:10] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/sophroid: sync [13:06:18] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/sophroid: sync [13:06:33] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 
'zarcillo' for release 'main' . [13:06:35] that'll teach me to CR+2 things ahead of time :b [13:06:50] yeah, if you CR+2 a backport or config change then you’re committing to also deploying it [13:07:12] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: sync [13:07:18] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:07:28] !log elukey@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: sync [13:07:30] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:07:38] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:07:39] elukey@cumin1003 wipe-cluster (PID 4150633) is awaiting input [13:08:38] !log jmm@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [13:08:40] let me pick the change numbers in spiderpig [13:08:45] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.wipe-cluster (exit_code=0) Wipe the K8s cluster aux-eqiad: Kubernetes upgrade [13:08:48] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe1007.eqiad.wmnet with reason: host reimage [13:09:04] (03PS5) 10Mszwarc: Fix BackfillInterwikiRightsLog wrt. 
cyclic renames [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269334 (https://phabricator.wikimedia.org/T6055) [13:09:06] !log jmm@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [13:09:20] PROBLEM - Host clouddb1019 is DOWN: PING CRITICAL - Packet loss = 100% [13:09:31] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster pool all services in eqiad/aux-eqiad: maintenance [13:09:55] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) pool all services in eqiad/aux-eqiad: maintenance [13:10:18] PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Sat 25 Apr 2026 01:10:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [13:11:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269345 (https://phabricator.wikimedia.org/T6055) (owner: 10Mszwarc) [13:11:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269334 (https://phabricator.wikimedia.org/T6055) (owner: 10Mszwarc) [13:11:16] RESOLVED: [2x] ProbeDown: Service people1005:30443 has failed probes (http_os_reports_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:12:20] changes are in the pipelines [13:12:26] ack [13:13:08] aude: will you be able to self deploy your config change after the code changes have been deployed? [13:13:17] I got a meeting coming (though I can multitask) [13:13:29] If needed, I can deploy that [13:13:35] i can deploy mine [13:13:39] excellent! [13:13:55] I am so happy we have so many deployment people nowadays! 
[13:14:51] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host phab1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:15:20] spiderpig, spiderpig, does whatever spiderpig does :D (namely: make deployments less scary) [13:15:40] !log jmm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [13:16:38] !bash spiderpig, spiderpig, does whatever spiderpig does :D (namely: make deployments less scary) [13:16:38] Lucas_WMDE: Stored quip at https://bash.toolforge.org/quip/Tkxjcp0B8tZ8Ohr0OXOI [13:16:49] IT'S MY FIRST QUIP [13:17:03] (03PS1) 10Fabfur: admin: added fabfur yubikey backup pubkey [puppet] - 10https://gerrit.wikimedia.org/r/1269448 [13:17:11] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:19:16] omg really?? [13:19:17] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:19:38] yes! [13:19:43] lets QUIP it! 
[13:19:45] (03CR) 10Bking: [C:03+2] "self-merging in time for pairing/deployment" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268222 (https://phabricator.wikimedia.org/T422378) (owner: 10Bking) [13:20:12] do we need to !bash some of your Parsoid toots [13:20:59] (03PS3) 10Tiziano Fogli: thanos/compact: assign prometheus instances to compactors [puppet] - 10https://gerrit.wikimedia.org/r/1265429 (https://phabricator.wikimedia.org/T386911) [13:20:59] (03PS1) 10Tiziano Fogli: thanos/compact: adjust relabel config and systemd unit quoting [puppet] - 10https://gerrit.wikimedia.org/r/1269449 (https://phabricator.wikimedia.org/T386911) [13:21:17] (03CR) 10Vgutierrez: [C:03+1] "key validated OOB via Slack" [puppet] - 10https://gerrit.wikimedia.org/r/1269448 (owner: 10Fabfur) [13:21:36] (03CR) 10Fabfur: [C:03+2] admin: added fabfur yubikey backup pubkey [puppet] - 10https://gerrit.wikimedia.org/r/1269448 (owner: 10Fabfur) [13:21:48] i don't remember if my parsoid toots are funny :P [13:21:54] (03Merged) 10jenkins-bot: Fix BackfillInterwikiRightsLog wrt. cyclic renames [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269345 (https://phabricator.wikimedia.org/T6055) (owner: 10Mszwarc) [13:22:04] gallows humor *ducks* [13:22:06] (03CR) 10Tiziano Fogli: [C:03+2] thanos/compact: adjust relabel config and systemd unit quoting [puppet] - 10https://gerrit.wikimedia.org/r/1269449 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [13:22:45] (03CR) 10CI reject: [V:04-1] Fix BackfillInterwikiRightsLog wrt. 
cyclic renames [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269334 (https://phabricator.wikimedia.org/T6055) (owner: 10Mszwarc) [13:23:09] that probably happens every time i end up in table markup tbh [13:23:36] 06SRE, 10SRE-Access-Requests: Requesting access to stats hosts for Daniel Kinzler - https://phabricator.wikimedia.org/T422827 (10daniel) 03NEW [13:23:40] Joy some test failed in GrowthExperiments [13:23:47] but I think it is a flappy one ( GrowthExperiments\Tests\Integration\UncachedMenteeOverviewDataProviderTest::testGetFormattedDataForMentors ) [13:23:54] (03PS1) 10Jelto: wmnet: point os-reports to k8s-ingress-aux-ro [dns] - 10https://gerrit.wikimedia.org/r/1269452 (https://phabricator.wikimedia.org/T422819) [13:24:09] (03Merged) 10jenkins-bot: Fix BackfillInterwikiRightsLog wrt. cyclic renames [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269334 (https://phabricator.wikimedia.org/T6055) (owner: 10Mszwarc) [13:24:17] and the other one still merged, yay [13:24:20] (03CR) 10Kamila Součková: [C:03+1] rest gateway: introduce policy for Abstract Wikipedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler) [13:24:24] wait [13:24:36] scap says “The change '1269334' failed build tests and could not be merged” but that change was merged [13:24:45] gate-and-submit was fine, only main failed, we shouldn’t care about that [13:24:46] bad scap [13:25:04] hashar: retry? 
[13:25:53] that was on test I think [13:25:53] (03CR) 10Kamila Součková: [C:03+1] rest gateway: avoid re-defining routes for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269024 (owner: 10Daniel Kinzler) [13:25:53] https://phabricator.wikimedia.org/T422828 [13:26:06] but the gate-and-submit passed so the change got merged [13:26:27] I think the scap issue might be T420501 [13:26:27] T420501: scap backport fails too fast when trying to re-merge a patch that had failed CI - https://phabricator.wikimedia.org/T420501 [13:27:11] yeah [13:27:25] what I think happens is the `test` pipeline voted verified -1 [13:27:28] scap detects it and aborts [13:27:38] yes [13:27:40] `gate-and-submit` eventually V+1/merge the change [13:27:45] but scap had already aborted [13:27:49] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe1007.eqiad.wmnet with OS bullseye [13:27:50] keeping track of state is complicated [13:27:56] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11804385 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-fe1007.eq... 
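The race discussed between 13:27:25 and 13:27:50 (the `test` pipeline votes Verified -1, the watcher aborts, and `gate-and-submit` merges the change afterwards) can be sketched as a small decision function. This is only an illustration under stated assumptions and is not how scap is implemented: `ChangeState`, `gate_running`, and `next_action` are hypothetical names, and the only borrowed facts are Gerrit's change statuses (`NEW`, `MERGED`, `ABANDONED`) and the Verified vote.

```python
from dataclasses import dataclass


@dataclass
class ChangeState:
    """Hypothetical snapshot of a Gerrit change as a deploy tool might see it."""
    status: str         # Gerrit change status: "NEW", "MERGED" or "ABANDONED"
    verified: int       # latest Verified vote (-1, 0 or +1)
    gate_running: bool  # assumption: we can tell whether gate-and-submit is still busy


def next_action(state: ChangeState) -> str:
    """Return "proceed", "abort" or "wait" for one polling step."""
    if state.status == "MERGED":
        # The merge is the final word, whatever an earlier pipeline voted.
        return "proceed"
    if state.status == "ABANDONED":
        return "abort"
    if state.verified < 0 and not state.gate_running:
        # A failed vote with nothing left running cannot turn into a merge.
        return "abort"
    # Either no verdict yet, or a -1 from one pipeline while the gate still runs.
    return "wait"


# The sequence from the log: a -1 arrives first, the merge arrives later.
print(next_action(ChangeState("NEW", -1, gate_running=True)))
print(next_action(ChangeState("MERGED", -1, gate_running=False)))
```

The design choice is to treat a negative vote as final only once no pipeline can still merge the change, which is the state tracking the channel calls complicated.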
[13:28:17] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.3 point update - https://phabricator.wikimedia.org/T414179#11804387 (10MoritzMuehlenhoff) [13:28:42] * hashar retries [13:28:57] !log bking@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [13:29:07] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on P{thanos-fe[1004-1006].eqiad.wmnet} and (A:thanos-fe or A:thanos-fe-codfw or A:thanos-fe-eqiad) [13:29:27] 06SRE, 10SRE-Access-Requests: Requesting access to stats hosts for Daniel Kinzler - https://phabricator.wikimedia.org/T422827#11804392 (10daniel) I think in order to use Jupyter on the stats hosts I also need Kerberos access. Should I file a separate ticket for that? [13:29:58] RECOVERY - Host clouddb1019 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [13:30:13] (03CR) 10Muehlenhoff: [C:03+2] Switch our servers to use deb.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/1268522 (https://phabricator.wikimedia.org/T416707) (owner: 10Muehlenhoff) [13:30:13] “Change '1269339' has 1 Depends-On relationship(s) (1268559) but none were deemed relevant by the dependency analysis rules. 
This may be unexpected.” o_O [13:30:20] what’s dependency analysis rules my precious [13:30:38] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11804396 (10MatthewVernon) [13:30:49] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on P{thanos-fe[1004-1006].eqiad.wmnet} and (A:thanos-fe or A:thanos-fe-codfw or A:thanos-fe-eqiad) [13:31:13] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe-codfw [13:31:44] Apparently scap doesn't know that CI config doesn't have to be backported to wmf.XX branches :D [13:32:07] oh, “relevant” is about different branches, I see [13:32:20] (T365146) [13:32:21] T365146: Backport: implement new criteria for relevant dependencies - https://phabricator.wikimedia.org/T365146 [13:33:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-codfw [13:33:24] !log bking@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [13:34:25] 06SRE, 06ServiceOps new, 07Datacenter-Switchover, 13Patch-For-Review: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11804418 (10Blake) @Scott_French It sounds like it might be reasonable to exclude this service from the s... [13:34:27] hashar: Just making sure you're aware of it – Spiderpig is waiting for user interaction ;) [13:34:41] (03PS1) 10DCausse: search: add alt. completion indices to test keyword tokenizer (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269464 (https://phabricator.wikimedia.org/T420427) [13:34:43] (03PS1) 10DCausse: search: add alt. 
completion indices to test keyword tokenizer (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269465 (https://phabricator.wikimedia.org/T420427) [13:35:17] hashar: https://spiderpig.wikimedia.org/jobs/1724 is waiting for your input :) [13:35:24] ah, sorry, Msz2001 beat me to the ping ^^ [13:35:37] ahh yeah [13:35:38] (03PS1) 10Muehlenhoff: Remove obsolete Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1269466 [13:35:43] thank you for watching me [13:36:07] now playing: fetch ALL the submodules \o/ [13:36:36] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1269351|fix: adjust to return type changed by upstream]], [[gerrit:1269339|browser-tests: hide Cypress tests from CI (T419574)]], [[gerrit:1269340|browser-tests: hide Cypress tests from CI (T419574)]], [[gerrit:1269345|Fix BackfillInterwikiRightsLog wrt. cyclic renames (T6055)]], [[gerrit:1269334|Fix BackfillInterwikiRightsLog wrt. cyclic renames (T6055 [13:36:36] )]] [13:36:40] T419574: Create a separate CI job for GrowthExperiments cypress tests - https://phabricator.wikimedia.org/T419574 [13:36:41] T6055: Interwiki rights logs should be duplicated at related wikis - https://phabricator.wikimedia.org/T6055 [13:37:18] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [13:37:41] whoa, a four-digit task ID [13:37:56] Yeah, I went to kindergaden when it was created :D [13:38:33] !log hashar@deploy1003 mszwarc, hashar: Backport for [[gerrit:1269351|fix: adjust to return type changed by upstream]], [[gerrit:1269339|browser-tests: hide Cypress tests from CI (T419574)]], [[gerrit:1269340|browser-tests: hide Cypress tests from CI (T419574)]], [[gerrit:1269345|Fix BackfillInterwikiRightsLog wrt. cyclic renames (T6055)]], [[gerrit:1269334|Fix BackfillInterwikiRightsLog wrt. cyclic renames (T6055)]] sync [13:38:33] ed to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [13:39:06] There's nothing to verify now for my patch, I think – it's a change to maint. script [13:39:33] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host apus-fe1003.eqiad.wmnet with OS bookworm [13:39:41] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11804449 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host apus-fe1003.... [13:40:08] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host apus-fe1003 [13:40:16] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [13:40:29] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11804450 (10Jclark-ctr) @Marostegui the hardware error has cleared for now, but the system is reporting filesystem corruption and will need to be reimaged. Let’s keep the... [13:41:31] Msz2001: we had our first tasks in SourceForge and I am not sure they got ported to Bugzilla (and later Phabricator) [13:42:48] I didn't know about Sourceforge. 
But once I checked the open tasks with lowest numbers, and I was disappointed to see that a few lowest numbers were just seemingly created in 2010s before import from Bugzilla [13:44:15] FIRING: [2x] JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:44:17] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [13:44:19] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 2 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11804456 (10Marostegui) Thanks John - let me reimage it now [13:44:20] Such as T45, for example – the lowest still open task [13:44:22] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [13:44:29] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host apus-fe1003 - mvernon@cumin2002" [13:44:35] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host apus-fe1003 - mvernon@cumin2002" [13:44:35] Seemingly even bots don't recognize it :) [13:44:35] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:44:36] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache apus-fe1003.eqiad.wmnet 102.32.64.10.in-addr.arpa 2.0.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:44:39] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) apus-fe1003.eqiad.wmnet 102.32.64.10.in-addr.arpa 2.0.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa 
on all recursors [13:44:40] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host apus-fe1003 [13:44:58] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host apus-fe1003 [13:44:59] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host apus-fe1003 [13:45:33] hashar: Spiderpig is waiting for approval ;) – IMO there's nothing to check for my patch, as it's a maint script [13:45:34] !log bking@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [13:46:25] (03PS1) 10Marostegui: installserver: Wipe clouddb1019 entirely [puppet] - 10https://gerrit.wikimedia.org/r/1269467 (https://phabricator.wikimedia.org/T422813) [13:46:49] !log bking@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [13:47:01] (03CR) 10Marostegui: "This will be reverted once it is wiped." [puppet] - 10https://gerrit.wikimedia.org/r/1269467 (https://phabricator.wikimedia.org/T422813) (owner: 10Marostegui) [13:48:13] !log hashar@deploy1003 mszwarc, hashar: Continuing with sync [13:48:17] Msz2001: approved! 
:] [13:48:47] Msz2001: we started using Phabricator BEFORE migrating the tasks from Bugzilla [13:48:53] Msz2001: yeah, the bugzilla tasks start at T2001 (add 2000 to the bugzilla task ID to get the Phabricator task number) [13:48:54] T2001: [DO NOT USE] Documentation is out of date, incomplete (tracking) [superseded by #Documentation] - https://phabricator.wikimedia.org/T2001 [13:49:04] and that T2001 was Bug #1 [13:49:15] RESOLVED: [2x] JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:49:27] with its legendary task description [13:49:41] (03Abandoned) 10Kamila Součková: shellbox-icu72: Add ClusterIP to TLS cert SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1266250 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková) [13:51:00] I haven't seen that before. But yeah, very nice description :p [13:52:11] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269351|fix: adjust to return type changed by upstream]], [[gerrit:1269339|browser-tests: hide Cypress tests from CI (T419574)]], [[gerrit:1269340|browser-tests: hide Cypress tests from CI (T419574)]], [[gerrit:1269345|Fix BackfillInterwikiRightsLog wrt. cyclic renames (T6055)]], [[gerrit:1269334|Fix BackfillInterwikiRightsLog wrt. 
cyclic renames (T605 [13:52:11] 5)]] (duration: 15m 35s) [13:52:20] T419574: Create a separate CI job for GrowthExperiments cypress tests - https://phabricator.wikimedia.org/T419574 [13:52:20] T6055: Interwiki rights logs should be duplicated at related wikis - https://phabricator.wikimedia.org/T6055 [13:52:21] T605: October 22 Tech Talk: Design Research in Product Development - https://phabricator.wikimedia.org/T605 [13:52:23] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Data-Services, and 3 others: clouddb1019 down - https://phabricator.wikimedia.org/T422813#11804507 (10fnegri) [13:52:45] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [13:52:54] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [13:53:52] aude: Msz2001: patches deployed [13:54:01] thanks, I'll proceed with my config change [13:54:12] hashar: Thanks! [13:54:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aude@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269063 (https://phabricator.wikimedia.org/T420878) (owner: 10Bernard Wang) [13:55:20] (03Merged) 10jenkins-bot: Enable reading list beta feature for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269063 (https://phabricator.wikimedia.org/T420878) (owner: 10Bernard Wang) [13:55:42] !log aude@deploy1003 Started scap sync-world: Backport for [[gerrit:1269063|Enable reading list beta feature for pilot wikis (T420878)]] [13:55:45] T420878: [Reading list web beta] Deploy beta feature to 5 pilot wikis - https://phabricator.wikimedia.org/T420878 [13:57:34] !log aude@deploy1003 bwang, aude: Backport for [[gerrit:1269063|Enable reading list beta feature for pilot wikis (T420878)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
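The numbering rule given at 13:48:53 (add 2000 to the Bugzilla ID to get the Phabricator task number, so Bug #1 became T2001) is plain arithmetic. A minimal sketch follows; the function names and the constant name are made up for illustration, and the log does not say where migrated IDs end, so the reverse direction only rules out numbers that are too low.

```python
from typing import Optional

BUGZILLA_OFFSET = 2000  # offset stated in the channel at 13:48:53


def bugzilla_to_phab(bug_id: int) -> str:
    """Map a Bugzilla bug number to its Phabricator task ID."""
    if bug_id < 1:
        raise ValueError("Bugzilla bug numbers start at 1")
    return f"T{bug_id + BUGZILLA_OFFSET}"


def phab_to_bugzilla(task_id: str) -> Optional[int]:
    """Return the candidate Bugzilla number, or None if the task ID is too low.

    Tasks numbered T2000 or lower (T45, for example) cannot come from this
    rule. The upper end of the migrated range is not given in the log, so a
    non-None result is only a candidate, not proof of a Bugzilla origin.
    """
    number = int(task_id.removeprefix("T")) if hasattr(str, "removeprefix") else int(task_id[1:])
    if number <= BUGZILLA_OFFSET:
        return None
    return number - BUGZILLA_OFFSET


print(bugzilla_to_phab(1))       # Bug #1, as mentioned at 13:49:04
print(phab_to_bugzilla("T45"))   # below the migrated range
```

Under this rule, T6055 from the backport above would correspond to Bugzilla bug 4055, provided it was in fact one of the migrated tasks.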
[14:00:06] !log aude@deploy1003 bwang, aude: Continuing with sync [14:01:33] PROBLEM - Host clouddb1019 is DOWN: PING CRITICAL - Packet loss = 100% [14:01:43] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on apus-fe1003.eqiad.wmnet with reason: host reimage [14:02:07] (03CR) 10Vgutierrez: [C:03+2] cache::haproxy,aptrepo: Clean-up old haproxy versions [puppet] - 10https://gerrit.wikimedia.org/r/1269399 (owner: 10Vgutierrez) [14:04:22] !log aude@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269063|Enable reading list beta feature for pilot wikis (T420878)]] (duration: 08m 40s) [14:04:27] T420878: [Reading list web beta] Deploy beta feature to 5 pilot wikis - https://phabricator.wikimedia.org/T420878 [14:04:39] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 102482736 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:04:44] 10ops-eqiad, 06DC-Ops: Power Supply - PS Redundancy - issue on cirrussearch1103:9290 - https://phabricator.wikimedia.org/T422832 (10phaultfinder) 03NEW [14:05:42] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 80872 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:06:32] !log mszwarc@deploy1003 mwscript-k8s job started: foreachwikiindblist all backfillInterwikiRightsLog.php --remote-wiki metawiki 20260311190000 # T6055 (second attempt) [14:06:36] T6055: Interwiki rights logs should be duplicated at related wikis - https://phabricator.wikimedia.org/T6055 [14:09:29] (03PS1) 10Marostegui: clouddb1019: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1269474 (https://phabricator.wikimedia.org/T422813) [14:09:33] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apus-fe1003.eqiad.wmnet with reason: host reimage [14:09:34] 10ops-eqiad, 06DC-Ops: Power 
Supply - PS Redundancy - issue on cirrussearch1103:9290 - https://phabricator.wikimedia.org/T422832#11804596 (10Jclark-ctr) a:03Jclark-ctr [14:10:17] (03CR) 10Marostegui: [C:03+2] clouddb1019: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1269474 (https://phabricator.wikimedia.org/T422813) (owner: 10Marostegui) [14:13:27] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:14:02] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:14:32] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:14:33] 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Cookbook for rack depool - https://phabricator.wikimedia.org/T327300#11804625 (10ayounsi) [14:14:49] 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Cookbook for rack depool - https://phabricator.wikimedia.org/T327300#11804630 (10ayounsi) [14:21:08] (03PS1) 10Aude: Opt-in new accounts to ReadingLists beta feature on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269475 (https://phabricator.wikimedia.org/T422833) [14:23:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269475 (https://phabricator.wikimedia.org/T422833) (owner: 10Aude) [14:23:30] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:23:44] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:24:43] RESOLVED: OtelCollectorEnqueuedSpans: Some spans have been enqueued by exporter otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - 
https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorEnqueuedSpans [14:25:48] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:25:53] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:26:30] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:29:45] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:29:52] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T1430) [14:30:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apus-fe1003.eqiad.wmnet with OS bookworm [14:30:59] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11804735 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host apus-fe1003.eqia... [14:32:13] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11804759 (10MatthewVernon) [14:33:33] (03CR) 10Elukey: [C:03+1] "No idea what wmf-navigator is, but maybe same thing?" 
[dns] - 10https://gerrit.wikimedia.org/r/1269452 (https://phabricator.wikimedia.org/T422819) (owner: 10Jelto) [14:34:33] (03PS1) 10Arnaudb: gerrit: disable connection reuse on the httpd → jetty layer [puppet] - 10https://gerrit.wikimedia.org/r/1269479 (https://phabricator.wikimedia.org/T421827) [14:35:39] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:35:51] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:37:36] !log ceph orch host drain moss-be1002 --zap-osd-devices T421719 [14:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:39] T421719: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719 [14:38:53] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:39:53] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough [14:40:03] (03CR) 10Volans: Move linting to Ruff and apply code fixes (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [14:40:45] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:41:05] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:41:13] !log javiermonton@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:41:40] (03CR) 10AOkoth: [C:03+1] wmnet: point os-reports to k8s-ingress-aux-ro [dns] - 10https://gerrit.wikimedia.org/r/1269452 (https://phabricator.wikimedia.org/T422819) (owner: 10Jelto) [14:42:25] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:44:20] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:45:29] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:47:44] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox and (A:dnsbox) [14:49:34] (03CR) 10Stoyofuku-wmf: [C:03+1] "Love it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269475 (https://phabricator.wikimedia.org/T422833) (owner: 10Aude) [14:50:07] (03CR) 10Clément Goubert: [C:03+2] service-catalog: Add recommendation-api-ng [puppet] - 10https://gerrit.wikimedia.org/r/1269398 (https://phabricator.wikimedia.org/T422804) (owner: 10Clément Goubert) [14:53:22] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough [14:54:00] (03PS2) 10Muehlenhoff: mariadb::ferm: Rewrite ferm::rule as firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1265374 (https://phabricator.wikimedia.org/T421705) [14:54:05] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb::ferm: Rewrite ferm::rule as firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1265374 (https://phabricator.wikimedia.org/T421705) (owner: 10Muehlenhoff) [14:58:03] !log dancy@deploy1003 Installing scap version "4.245.0" for 2 host(s) [14:58:53] 06SRE, 
10LDAP-Access-Requests: Grant Access to Turnilo and Superset for MMigurski-WMF - https://phabricator.wikimedia.org/T422537#11804930 (10MMigurski-WMF) I’ve confirmed I can see both Turnilo & Superset 👍 [14:59:55] !log dancy@deploy1003 Installation of scap version "4.245.0" completed for 2 hosts [15:00:05] dancy and jnuche: It is that lovely time of the day again! You are hereby commanded to deploy Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T1500). [15:03:23] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [15:03:36] (03PS5) 10Bking: opensearch-semantic-search-test: Add to services proxy [puppet] - 10https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) [15:04:27] (03PS3) 10Clément Goubert: services_proxy: Bump swift timeout [puppet] - 10https://gerrit.wikimedia.org/r/1269420 (https://phabricator.wikimedia.org/T328872) [15:04:57] (03CR) 10CDanis: [C:03+1] services_proxy: Bump swift timeout [puppet] - 10https://gerrit.wikimedia.org/r/1269420 (https://phabricator.wikimedia.org/T328872) (owner: 10Clément Goubert) [15:09:02] (03CR) 10Scott French: [C:03+1] services_proxy: Bump swift timeout [puppet] - 10https://gerrit.wikimedia.org/r/1269420 (https://phabricator.wikimedia.org/T328872) (owner: 10Clément Goubert) [15:09:41] 10ops-eqiad, 06SRE, 06DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#11804979 (10VRiley-WMF) [15:10:18] (03PS1) 10Urbanecm: GrowthSuggestionToneCheck: flag as non-experimental [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269495 (https://phabricator.wikimedia.org/T422835) [15:10:31] (03PS1) 10Urbanecm: GrowthSuggestionToneCheck: flag as non-experimental [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269496 (https://phabricator.wikimedia.org/T422835) [15:11:45] 10SRE-swift-storage, 
10MediaWiki-libs-HTTP, 06MW-Interfaces-Team, 07Wikimedia-production-error: PHP Warning: Cannot modify header information - headers already sent by includes/libs/http/MultiHttpClient.php - https://phabricator.wikimedia.org/T369186#11805006 (10dancy) Fresh entry seen in the last day. =... [15:12:50] (03Abandoned) 10Jcrespo: mariadb: Add script to generate watchlist_count table on labs [puppet] - 10https://gerrit.wikimedia.org/r/375349 (https://phabricator.wikimedia.org/T59617) (owner: 10Jcrespo) [15:13:13] (03PS4) 10Jcrespo: mediabackup: Set ms-backup[12]00[12] as spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/1261453 (https://phabricator.wikimedia.org/T420464) [15:14:44] urbanecm: those GrowthExperiments patches are going to fail api-testing because we broke Quibble (I am rolling back the upgrade) [15:15:00] hashar: just noticed :) [15:15:12] managed to phabricatorize too, T422843 [15:15:14] T422843: GrowthExperiments CI fails with sh: 1: mocha: not found - https://phabricator.wikimedia.org/T422843 [15:15:17] AWESOME [15:15:27] I'll reuse that task to fix Quibble [15:15:27] (03PS5) 10Jcrespo: mediabackup: Set ms-backup[12]00[12] as spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/1261453 (https://phabricator.wikimedia.org/T420464) [15:15:34] (03CR) 10Bking: opensearch-semantic-search-test: Add to services proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [15:15:35] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1261453 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [15:15:37] thanks! 
[15:16:09] (03PS6) 10Jcrespo: mediabackup: Set ms-backup[12]00[12] as spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/1261453 (https://phabricator.wikimedia.org/T420464) [15:16:18] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1261453 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [15:17:06] (03PS4) 10Jon Harald Søby: Add new protection level (edituserprotected) for nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047441 (https://phabricator.wikimedia.org/T367943) [15:17:39] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:17:55] (03CR) 10Clément Goubert: [C:03+2] services_proxy: Bump swift timeout [puppet] - 10https://gerrit.wikimedia.org/r/1269420 (https://phabricator.wikimedia.org/T328872) (owner: 10Clément Goubert) [15:19:28] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:19:40] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:19:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047441 (https://phabricator.wikimedia.org/T367943) (owner: 10Jon Harald Søby) [15:19:46] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be1002.eqiad.wmnet with OS bookworm [15:19:54] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11805130 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-be1002.... 
[15:20:08] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host moss-be1002 [15:20:19] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [15:21:09] (03CR) 10Elukey: "Adding Clem since we were working on rec-api-ng earlier on, that should be a similar use case. The diff now looks good, let's see what Cle" [puppet] - 10https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [15:21:33] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:22:07] (03CR) 10CI reject: [V:04-1] GrowthSuggestionToneCheck: flag as non-experimental [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269495 (https://phabricator.wikimedia.org/T422835) (owner: 10Urbanecm) [15:22:31] (03CR) 10Jcrespo: [C:03+2] mediabackup: Set ms-backup[12]00[12] as spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/1261453 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [15:24:32] (03CR) 10CI reject: [V:04-1] GrowthSuggestionToneCheck: flag as non-experimental [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269496 (https://phabricator.wikimedia.org/T422835) (owner: 10Urbanecm) [15:25:23] (03CR) 10Hashar: "recheck after I rolled back Quibble due to T422843" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269496 (https://phabricator.wikimedia.org/T422835) (owner: 10Urbanecm) [15:25:24] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host moss-be1002 - mvernon@cumin2002" [15:25:25] (03CR) 10Hashar: "recheck after I rolled back Quibble due to T422843" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269495 (https://phabricator.wikimedia.org/T422835) (owner: 10Urbanecm) [15:25:29] !log mvernon@cumin2002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host moss-be1002 - mvernon@cumin2002" [15:25:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:25:30] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache moss-be1002.eqiad.wmnet 79.32.64.10.in-addr.arpa 9.7.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:25:38] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) moss-be1002.eqiad.wmnet 79.32.64.10.in-addr.arpa 9.7.0.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:25:38] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host moss-be1002 [15:25:58] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host moss-be1002 [15:25:58] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host moss-be1002 [15:28:27] (03CR) 10Clément Goubert: "I think that needs a service::catalog entry as well (cf https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress#Create_an_entry_in_the_serv" [puppet] - 10https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [15:29:39] jouncebot: nowandnext [15:29:39] For the next 0 hour(s) and 30 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T1500) [15:29:39] In 0 hour(s) and 30 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T1600) [15:29:47] quick redeploy of mw-on-k8s, no mediawiki code cahnge [15:31:02] !log cgoubert@deploy1003 Started scap sync-world: swift service proxy configuration cahnges [15:33:00] (03PS6) 10Bking: opensearch on k8s: Add semantic-search and ipoid to services proxy [puppet] - 10https://gerrit.wikimedia.org/r/1264739 
(https://phabricator.wikimedia.org/T421293) [15:34:46] (03PS7) 10Cwhite: smart: update smart_data_dump to support standalone disks too [puppet] - 10https://gerrit.wikimedia.org/r/1269054 (https://phabricator.wikimedia.org/T267664) [15:35:32] (03CR) 10Bking: "Thanks to you both! Note that I just added `opensearch-ipoid` to the change as well. My thought is that this will be non-disruptive to exi" [puppet] - 10https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) (owner: 10Bking) [15:36:07] !log cgoubert@deploy1003 Finished scap sync-world: swift service proxy configuration cahnges (duration: 05m 45s) [15:38:53] (03PS7) 10Bking: opensearch on k8s: Add semantic-search and ipoid to services proxy [puppet] - 10https://gerrit.wikimedia.org/r/1264739 (https://phabricator.wikimedia.org/T421293) [15:40:53] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:42:57] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [15:43:24] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [15:43:33] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [15:43:57] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:44:03] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [15:44:09] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be1002.eqiad.wmnet with reason: host reimage [15:46:39] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling restart_daemons on 
A:dnsbox and (A:dnsbox) [15:47:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be1002.eqiad.wmnet with reason: host reimage [15:48:31] (03PS1) 10Cwhite: beta-logs: provision ca on cluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/1269509 (https://phabricator.wikimedia.org/T350516) [15:50:58] (03CR) 10Hnowlan: prometheus: add recording rules for the appservers RED dashboard (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1259170 (https://phabricator.wikimedia.org/T249663) (owner: 10Hnowlan) [15:52:53] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:52:56] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2015 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:52:57] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:53:31] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [15:53:47] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2015 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:53:50] (03PS2) 10Hnowlan: prometheus: add recording rules for the appservers RED dashboard [puppet] - 10https://gerrit.wikimedia.org/r/1259170 (https://phabricator.wikimedia.org/T249663) [15:56:57] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:57:57] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [15:58:53] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [16:00:05] jhathaway and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:03:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1014:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:03:53] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [16:09:15] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:11:19] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-be1002.eqiad.wmnet with OS bookworm [16:11:28] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11805510 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-be1002.eqia... 
[16:12:27] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11805527 (10MatthewVernon) [16:18:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:22:19] (03CR) 10Dzahn: "On the gitlab test instance we are seeing this where 5 parameters are given to something that expects only 4. and the 5th one seems to be " [puppet] - 10https://gerrit.wikimedia.org/r/1234984 (owner: 10Jelto) [16:33:15] (03PS46) 10Tiziano Fogli: sre.o11y.thanos-compact-restart: add cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1261375 (https://phabricator.wikimedia.org/T386911) [16:34:15] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:35:55] (03CR) 10CI reject: [V:04-1] sre.o11y.thanos-compact-restart: add cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1261375 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [16:37:02] (03PS47) 10Tiziano Fogli: sre.o11y.thanos-compact-restart: add cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1261375 (https://phabricator.wikimedia.org/T386911) [16:38:47] jouncebot: nowandnext [16:38:47] For the next 0 hour(s) and 21 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T1600) [16:38:47] In 0 hour(s) and 21 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T1700) [16:38:47] In 0 hour(s) and 21 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T1700) [16:39:24] (03PS1) 10Ladsgroup: Revert^2 "Use envoy for swift inside mediawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269524 (https://phabricator.wikimedia.org/T328872) [16:39:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269524 (https://phabricator.wikimedia.org/T328872) (owner: 10Ladsgroup) [16:40:01] cdanis: claime pushing it again now ^ [16:41:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:41:06] (03Merged) 10jenkins-bot: Revert^2 "Use envoy for swift inside mediawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269524 (https://phabricator.wikimedia.org/T328872) (owner: 10Ladsgroup) [16:41:33] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1269524|Revert^2 "Use envoy for swift inside mediawiki" (T328872)]] [16:41:37] T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 [16:43:30] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1269524|Revert^2 "Use envoy for swift inside mediawiki" (T328872)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[16:44:47] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [16:45:14] (03PS1) 10Bking: cloudelastic: Prepare for opensearch 2 [puppet] - 10https://gerrit.wikimedia.org/r/1269531 (https://phabricator.wikimedia.org/T422860) [16:45:44] (03CR) 10CI reject: [V:04-1] cloudelastic: Prepare for opensearch 2 [puppet] - 10https://gerrit.wikimedia.org/r/1269531 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [16:45:45] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1269531 (https://phabricator.wikimedia.org/T422860) (owner: 10Bking) [16:46:49] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T422668 [16:48:22] (03CR) 10Dzahn: [C:03+2] zuul::executor: remove mounting of /etc/cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1269082 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [16:48:36] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269524|Revert^2 "Use envoy for swift inside mediawiki" (T328872)]] (duration: 07m 02s) [16:48:39] T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 [16:49:23] cdanis: claime I'm seeing a new error now https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-deploy-1-7.0.0-1-2026.04.09?id=UzYlc50BW6F3PnQfRlN7 but it might be transient? [16:49:41] > Shellbox server error: Failed to download input file "original.video": cURL error 7: Failed to connect to localhost port 6101: Connection refused (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for http://localhost:6101/v1/AUTH_mw/wikipedia-commons-local-public.56/ [16:50:14] No I know what's happening I think [16:50:43] should I revert or fixing it is easy? 
[16:50:56] revert because right now all transcodes will break [16:51:44] on it [16:51:48] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1269524|Revert^2 "Use envoy for swift inside mediawiki" (T328872)]] [16:51:52] I need to check videoscaler configuration basically [16:52:20] (03PS1) 10Ladsgroup: Revert^3 "Use envoy for swift inside mediawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269532 [16:54:18] (03PS1) 10BryanDavis: developer-portal: Bump version to 2026-04-09-122436-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269533 [16:56:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269532 (owner: 10Ladsgroup) [16:56:07] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Security Release - T422668 [16:56:59] (03Merged) 10jenkins-bot: Revert^3 "Use envoy for swift inside mediawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269532 (owner: 10Ladsgroup) [16:57:23] It's weird it seems to have the correct listener in envoy configuration [16:57:27] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1269532|Revert^3 "Use envoy for swift inside mediawiki"]] [16:58:51] Hmm POSTs to shellbox are what fail? Is it just bringing up the error from shellbox-video [16:59:19] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1269532|Revert^3 "Use envoy for swift inside mediawiki"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:59:44] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [17:00:04] bd808: gettimeofday() says it's time for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T1700) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T1700) [17:00:17] https://logstash.wikimedia.org/goto/b19728d504b989b12b1987dd3b87378a [17:00:22] o/ [17:00:43] I have a developer-portal build to push out [17:01:12] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Release - T422668 [17:01:31] (03CR) 10Kareid: [C:03+1] Test Kitchen UI: Deploy v1.2.8 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268897 (https://phabricator.wikimedia.org/T421972) (owner: 10Santiago Faci) [17:01:37] claime: I think it was transient :( https://logstash.wikimedia.org/goto/a917977ae39ef664ec2208a38d53f66a [17:01:41] (03CR) 10Kareid: [C:03+1] Test Kitchen UI: Deploy v1.2.8 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268898 (https://phabricator.wikimedia.org/T421972) (owner: 10Santiago Faci) [17:02:02] errors stopped ten minutes before my revert going out [17:02:09] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump version to 2026-04-09-122436-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269533 (owner: 10BryanDavis) [17:03:30] Amir1: and they're all for the same video aren't they? 
[17:03:38] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269532|Revert^3 "Use envoy for swift inside mediawiki"]] (duration: 06m 11s) [17:03:41] Ah no [17:04:18] (03Merged) 10jenkins-bot: developer-portal: Bump version to 2026-04-09-122436-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269533 (owner: 10BryanDavis) [17:07:46] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:08:30] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:08:43] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:08:59] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:09:10] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:09:33] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:10:16] (03CR) 10Dzahn: [C:03+1] "Yea, we should at least test this. Worst case it causes performance to go down but reliability should be improved. The "ttl 25" part seems" [puppet] - 10https://gerrit.wikimedia.org/r/1269479 (https://phabricator.wikimedia.org/T421827) (owner: 10Arnaudb) [17:10:33] (03PS2) 10Atsuko: airflow: dag filter helper function [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730) [17:10:58] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Security Release - T422668 [17:11:30] (03CR) 10Atsuko: "ready for review" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730) (owner: 10Atsuko) [17:13:52] claime: should we try again? 
😅 [17:15:42] (03PS1) 10Clément Goubert: shellbox-video: Add swift envoy listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269535 (https://phabricator.wikimedia.org/T328872) [17:15:42] Amir1: gimme a second [17:15:55] Need to check that patch works as intended [17:15:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268680 (https://phabricator.wikimedia.org/T422524) (owner: 10C. Scott Ananian) [17:16:23] sure [17:16:39] (03CR) 10C. Scott Ananian: "Removing C+2 since wmf.23 is going out today." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268680 (https://phabricator.wikimedia.org/T422524) (owner: 10C. Scott Ananian) [17:17:14] (03CR) 10Ladsgroup: [C:03+1] shellbox-video: Add swift envoy listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269535 (https://phabricator.wikimedia.org/T328872) (owner: 10Clément Goubert) [17:19:21] (03CR) 10FNegri: [C:03+1] installserver: Wipe clouddb1019 entirely [puppet] - 10https://gerrit.wikimedia.org/r/1269467 (https://phabricator.wikimedia.org/T422813) (owner: 10Marostegui) [17:19:57] (03CR) 10Clément Goubert: [C:03+2] shellbox-video: Add swift envoy listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269535 (https://phabricator.wikimedia.org/T328872) (owner: 10Clément Goubert) [17:20:37] Amir1: ok that looks like it does add the listener correctly, what I think happens is mw-videoscaler passes the URL as is to shellbox, which didn't have the listener, so can't actually use it [17:20:48] And what we see is the error mw-videoscaler gets back FROM shellbox [17:21:20] So your patch was fine, but shellbox needed a patch downstream as well [17:21:35] cool cool [17:21:45] thanks for fixing it [17:21:47] so in a second I'll deploy that patch to shellbox-video [17:21:58] * Amir1 sits quietly [17:22:02] I wonder 
if I need it for other shellboxens [17:22:11] (03Merged) 10jenkins-bot: shellbox-video: Add swift envoy listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269535 (https://phabricator.wikimedia.org/T328872) (owner: 10Clément Goubert) [17:22:18] like shellbox-media [17:22:35] let me see what shellbox media is even [17:23:27] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [17:24:10] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [17:24:19] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [17:25:05] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [17:28:29] it seems shellbox-media needs it too [17:28:36] from reading the code and how it works [17:29:09] ack, adding it then [17:29:45] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11805903 (10Dzahn) [17:30:01] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11805905 (10Dzahn) [17:30:05] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11805908 (10Dzahn) [17:30:11] Amir1: err it doesn't even have egress rules? [17:30:15] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11805910 (10Dzahn) [17:30:27] So I don't think it does tbh :D [17:30:45] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11805912 (10Dzahn) [17:30:45] maybe it falls back to default? 
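The diagnosis above (mw-videoscaler passes the Swift URL to shellbox as is, and shellbox had no matching envoy listener) can be modelled in a few lines. This is only a sketch under stated assumptions: the port number, the URL path and the `can_reach` helper are hypothetical and are not taken from the actual charts or MediaWiki config.

```python
from urllib.parse import urlsplit

# Hypothetical port for the local envoy listener that fronts Swift.
SWIFT_LISTENER_PORT = 6999


def can_reach(listeners, url):
    """Return True if a pod with these local envoy listener ports can use url.

    Only localhost URLs are modelled: they work only when the pod's own
    envoy sidecar listens on the port named in the URL.
    """
    parts = urlsplit(url)
    if parts.hostname not in ("localhost", "127.0.0.1"):
        raise ValueError("this sketch only models service-mesh (localhost) URLs")
    return parts.port in listeners


# Hypothetical URL that mw-videoscaler forwards unchanged to shellbox-video.
swift_url = f"http://localhost:{SWIFT_LISTENER_PORT}/v1/container/object"

mw_videoscaler = {SWIFT_LISTENER_PORT}   # has the listener after the config change
shellbox_before = set()                  # no Swift listener yet
shellbox_after = {SWIFT_LISTENER_PORT}   # after "shellbox-video: Add swift envoy listeners"

print(can_reach(mw_videoscaler, swift_url))
print(can_reach(shellbox_before, swift_url))
print(can_reach(shellbox_after, swift_url))
```

The same URL is valid in one pod and useless in another, which is why the MediaWiki-side change looked correct while the error came back from shellbox.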
[17:31:28] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11805917 (10Dzahn) 05Resolved→03In progress I was pinged to deploy the latest version which implements the last round of bug fixes. [17:31:46] (03PS1) 10Dzahn: wikipedia25: update to latest version of April 6th [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269541 (https://phabricator.wikimedia.org/T408592) [17:32:16] (03CR) 10Dzahn: [C:03+2] wikipedia25: update to latest version of April 6th [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269541 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [17:32:45] worst case, we add it after the switch :D [17:33:31] Amir1: Yeah, backport your patch again [17:34:09] jayme: no egress rule in a namespace means deny right? [17:34:12] (03PS1) 10Ladsgroup: Revert^4 "Use envoy for swift inside mediawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269542 [17:34:36] (03Merged) 10jenkins-bot: wikipedia25: update to latest version of April 6th [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269541 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [17:34:38] claime: yes, apart from the global rules [17:34:39] I wonder if I can break some revert tries record [17:34:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269542 (owner: 10Ladsgroup) [17:35:23] Amir1: If you find out it needs it, you need to add the same stuff from helmfile.d/services/shellbox-video/values.yaml L62 onwards to helmfile.d/services/shellbox-main/values.yaml [17:35:30] Then deploy it in eqiad and codfw [17:35:43] !log cgoubert@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [17:35:47] and staging :D [17:35:51] !log cgoubert@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply 
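The answer given above ("no egress rule in a namespace means deny", "apart from the global rules") amounts to a default-deny check. A toy sketch follows, assuming destinations can be reduced to plain names; real network policies match on selectors, CIDRs and ports, and the destination names and the `egress_allowed` helper here are hypothetical.

```python
def egress_allowed(dest, namespace_rules, global_rules):
    """Default-deny: traffic passes only if a global or namespace rule allows dest."""
    return dest in global_rules or dest in namespace_rules


# Hypothetical rule sets, for illustration only.
global_rules = {"dns"}
shellbox_media_rules = set()      # "it doesn't even have egress rules"
shellbox_video_rules = {"swift"}  # assumed to allow Swift after the listener patch

print(egress_allowed("swift", shellbox_media_rules, global_rules))
print(egress_allowed("swift", shellbox_video_rules, global_rules))
print(egress_allowed("dns", shellbox_media_rules, global_rules))
```

Under this model a namespace with no egress rules of its own can still reach whatever the global rules allow, and nothing else.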
[17:35:51] (03Merged) 10jenkins-bot: Revert^4 "Use envoy for swift inside mediawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269542 (owner: 10Ladsgroup) [17:36:19] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1269542|Revert^4 "Use envoy for swift inside mediawiki"]] [17:38:10] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1269542|Revert^4 "Use envoy for swift inside mediawiki"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:38:58] Amir1: not shellbox-main shellbox-media sorry [17:39:18] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [17:39:37] noted, thanks [17:42:57] ahaha [17:43:05] sigh [17:43:08] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269542|Revert^4 "Use envoy for swift inside mediawiki"]] (duration: 06m 49s) [17:45:21] no errors so far [17:45:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:47:42] Amir1: I checked the logs of the most recent mw-videoscaler pod and I saw a transcode succeeding [17:47:57] but maybe it wasn't a "big" transcode and didn't need to fetch by url [17:48:49] maybe we can manually launch a retry of one that failed earlier? [17:49:09] I have not seen any errors anywhere in anything related to files/upload/video/... 
[17:51:17] !log dzahn@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [17:51:34] !log dzahn@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [17:51:37] yeah mercurius seems happy now [17:51:53] !log dzahn@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [17:52:02] Amir1: https://grafana.wikimedia.org/goto/afilaqlru33swe?orgId=1 envoy telemetry looks ok [17:52:17] !log dzahn@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [17:52:31] !log dzahn@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [17:52:47] !log dzahn@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [17:53:15] at least compared to yesterday, no upstream errors [17:54:35] \o/ [17:55:36] yeah I think you're ~good now :D [17:55:41] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11805960 (10Dzahn) 05In progress→03Resolved The latest version has been deployed. [17:58:19] Amir1: https://grafana.wikimedia.org/goto/afilaqlru33swe?orgId=1 [17:58:51] And sure enough that's some upstream errors in envoy telemetry [17:59:13] soooo bump the timeout even more? [18:00:05] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11805976 (10RobH) Dell has confirmed case update and will dispatch a new mainboard and cpu bracket. Once they do, they'll email/update with tracking and then dispatch will reach back out to schedule... [18:00:05] dancy and jnuche: Time to do the MediaWiki train - Utc-7+Utc-0 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T1800). [18:00:12] o/ [18:00:16] Amir1: https://grafana.wikimedia.org/goto/ffilbhf8cr5dsc?orgId=1 [18:00:20] Hits the 5 min timeout [18:00:29] Anything I need to be aware of before rolling the train? 
[18:00:39] * dancy eyes... activity.. [18:01:04] dancy: Honestly not really, we moved mediawiki to use the service mesh to reach out to swift [18:01:24] OK! I will charge ahead [18:01:26] It's causing a couple of errors but nothing major, I don't think anything to hold up the train, wdyt Amir1 ? [18:01:36] Yeah agreed [18:01:42] (also it's mediawiki-config changes so...) [18:01:43] Thanks [18:01:50] It would have happened without envoy anyway [18:01:59] !log dancy@deploy1003 Installing scap version "4.246.0" for 2 host(s) [18:03:22] Amir1: well you know what buttons to tweak if you want to change these timeouts yes? [18:03:31] (it's getting late for me) [18:03:51] !log dancy@deploy1003 Installation of scap version "4.246.0" completed for 2 hosts [18:04:47] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [18:04:51] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [18:04:54] Yeah [18:04:56] Don't worry [18:05:15] cool good luck :D [18:05:24] I'm still on call for an hour or so [18:05:52] Amir1: MediaWiki\Upload\Exception\UploadChunkFileException: Error storing file in '{chunkPath}': backend-fail-internal; local-swift-codfw is back. I assume you know about that. [18:07:00] (03PS1) 10TrainBranchBot: group2 to 1.46.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269549 (https://phabricator.wikimedia.org/T420481) [18:07:03] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269549 (https://phabricator.wikimedia.org/T420481) (owner: 10TrainBranchBot) [18:07:59] (03Merged) 10jenkins-bot: group2 to 1.46.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269549 (https://phabricator.wikimedia.org/T420481) (owner: 10TrainBranchBot) [18:08:05] dancy: yeah but how often? [18:09:07] Low rate. 73 in the last 15 minutes. 
But top error, so noticeable [18:12:21] oh that is quite interesting. when we deploy it, initially it didn't have the exceptions [18:12:31] then it started to pile up. [18:12:44] and it's getting more common [18:13:32] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.23 refs T420481 [18:13:35] T420481: 1.46.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T420481 [18:13:59] deployment finished at :43, first error started at :56 [18:14:27] :57 actually [18:16:54] (03PS1) 10Aude: Make onboarding dialog a little less eager beaver 🦫 [extensions/ReadingLists] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269552 (https://phabricator.wikimedia.org/T421942) [18:17:21] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcephmon2005-dev.codfw.wmnet [18:17:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/ReadingLists] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269552 (https://phabricator.wikimedia.org/T421942) (owner: 10Aude) [18:17:40] cdanis: https://trace.wikimedia.org/ I think this is down :D [18:18:09] Train looks good so I'm out of the way [18:18:49] ha, the errors are back to zero [18:19:18] I have a feeling that new pods reset the timer, so in ten minutes errors should start to show up again (if my hunch is correct) [18:19:39] *15 min [18:23:25] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephmon2005-dev.codfw.wmnet [18:29:15] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [18:29:19] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [18:30:21] (03CR) 10Brouberol: airflow: dag filter helper function (035 comments) 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1268951 (https://phabricator.wikimedia.org/T420730) (owner: 10Atsuko) [18:31:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1160:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1160 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:32:17] UploadChunkFileException is creeping back up again, right on time [18:35:58] Amir1: should we revert this, I guess? ~ 40 UploadChunkFileException in the last 10m or so [18:37:22] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [18:37:26] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [18:40:15] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [18:40:19] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [18:41:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1160:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1160 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:44:25] swfrench-wmf: sigh. Yeah. 
Let's do it [18:46:53] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [18:46:57] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [18:47:31] Amir1: sounds good - lemme know if you need a hand with that [18:47:58] !log dancy@deploy1003 Installing scap version "4.247.0" for 2 host(s) [18:49:17] (03PS1) 10Ladsgroup: Revert^5 "Use envoy for swift inside mediawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269572 [18:49:20] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [18:49:25] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [18:49:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269572 (owner: 10Ladsgroup) [18:49:48] !log dancy@deploy1003 Installation of scap version "4.247.0" completed for 2 hosts [18:50:33] (03Merged) 10jenkins-bot: Revert^5 "Use envoy for swift inside mediawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269572 (owner: 10Ladsgroup) [18:50:50] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1269572|Revert^5 "Use envoy for swift inside mediawiki"]] [18:51:07] Amir1: oh maybe this is what Luca was telling me about re: aux cluster and certs, I can take a look tomorrow if not today [18:52:20] sure. no worries [18:52:35] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1269572|Revert^5 "Use envoy for swift inside mediawiki"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
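The "new pods reset the timer" hunch above can be checked roughly against the timestamps in this log. A small sketch, with assumptions labelled: the first error is only given as ":57", so the seconds are assumed to be zero, and the second interval uses the time of the "creeping back up" remark, which is an upper bound on when errors actually resumed. The `minutes_between` helper is illustrative.

```python
from datetime import datetime


def minutes_between(start, end):
    """Minutes between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


# Revert^4 sync finished at 17:43:08; first error reported at ":57" (seconds assumed).
first_gap = minutes_between("17:43:08", "17:57:00")

# wikiversions sync logged at 18:13:32; "creeping back up" remark at 18:32:17.
second_gap = minutes_between("18:13:32", "18:32:17")

# Error rates quoted in the channel, converted to errors per minute.
rate_early = 73 / 15   # "73 in the last 15 minutes"
rate_late = 40 / 10    # "~ 40 ... in the last 10m or so"

print(f"{first_gap:.1f} min, {second_gap:.1f} min")
print(f"{rate_early:.1f}/min, {rate_late:.1f}/min")
```

Both gaps land in the same rough range as the 15 minute guess, which is consistent with the hunch but does not prove it.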
[18:52:55] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [18:56:31] Amir1: https://phabricator.wikimedia.org/T422868 seems relevant [18:56:43] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269572|Revert^5 "Use envoy for swift inside mediawiki"]] (duration: 05m 53s) [18:57:18] oh, another revert [18:57:22] So I guess that should have fixed it [18:58:07] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [18:58:11] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [18:58:58] (03CR) 10Bartosz Dziewoński: [C:03+1] rest gateway: prevent abuse of exempt api modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255731 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler) [18:59:08] yeah [19:02:05] !log otto@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [19:02:09] !log otto@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [19:04:28] !log bking@apt1002 delete old haproxy pkgs P90343 [19:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:06] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [19:09:07] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:09:34] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1055 [19:10:40] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1055 [19:11:22] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:15:37] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for 
host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:16:04] 06SRE: Nakavo - Rate Limiting Query - https://phabricator.wikimedia.org/T422872#11806245 (10Reedy) [19:16:20] 06SRE, 06Traffic: Nakavo - Rate Limiting Query - https://phabricator.wikimedia.org/T422872#11806246 (10Reedy) [19:16:32] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:16:48] !log bking@apt1002 sudo -E reprepro -C thirdparty/opensearch2 copy trixie-wikimedia bookworm-wikimedia opensearch T422860 [19:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:51] T422860: Migrate Cloudelastic to OpenSearch 2.x - https://phabricator.wikimedia.org/T422860 [19:18:00] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:20:03] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:24:48] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:26:42] 06SRE, 06ServiceOps new, 07Datacenter-Switchover, 13Patch-For-Review: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw - https://phabricator.wikimedia.org/T422166#11806288 (10Scott_French) Thanks, @Blake. Yes, excluding the apus service from switchover day 1 seems lik... 
[19:27:29] 06SRE, 06Traffic: Nakavo - Rate Limiting Query - https://phabricator.wikimedia.org/T422872#11806289 (10Aklapper) > Two API keys under the same username (one for dev, one for prod) Hi, which exact type of "API keys" is this about? > and https://upload.wikimedia.org URIs for image downloads How exactly do you... [19:29:13] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:30:23] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:32:05] 06SRE, 06Traffic: Nakavo - Rate Limiting Query - https://phabricator.wikimedia.org/T422872#11806321 (10ssingh) In addition to what @Aklapper has mentioned above, please also include the full error message (that you see along with the 429). Thanks. [19:44:52] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:46:12] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1055.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:47:22] (03CR) 10Tiziano Fogli: [C:03+2] sre.o11y.thanos-compact-restart: add cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1261375 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [19:50:20] (03Merged) 10jenkins-bot: sre.o11y.thanos-compact-restart: add cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1261375 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [19:53:46] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [19:58:15] 06SRE, 06Traffic: Nakavo - Rate Limiting 
Query - https://phabricator.wikimedia.org/T422872#11806393 (10NakavoDev) Hey, > Hi, which exact type of "API keys" is this about? We've created the api keys in the app management portal {F75436999} > How exactly do you construct such image URIs? We do this in 3 steps... [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T2000). [20:00:06] aude, Jhs, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] o/ [20:00:11] hi [20:00:13] o/ [20:00:53] !log tappof@cumin1003 START - Cookbook sre.o11y.thanos-compact-restart cookbook test (patch id: 1260650) [20:02:09] aude: do you need a deployer? [20:02:24] i can do mine but maybe we batch the config changes first? [20:02:42] and i can then take care of my 1.46.0-wmf.23 backport [20:02:45] that was my next question: was there any ordering relationship between your wmf.23 backport and the config changes [20:02:55] no dependency [20:03:01] Jhs: ok to group your config change with the rest? [20:03:08] sure [20:03:13] !log tappof@cumin1003 END (PASS) - Cookbook sre.o11y.thanos-compact-restart (exit_code=0) cookbook test (patch id: 1260650) [20:03:28] aude: are you ok with being the spiderpig wrangler? 
[20:03:41] sure [20:04:10] proceeding [20:04:10] i appreciate the fact that you are testing how well spiderpig handles emojis :) [20:04:19] haha [20:04:23] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcephmon2006-dev.codfw.wmnet [20:05:32] i am take a glance at the config changes [20:06:22] (03CR) 10Tiziano Fogli: [C:03+2] thanos/compact: assign prometheus instances to compactors [puppet] - 10https://gerrit.wikimedia.org/r/1265429 (https://phabricator.wikimedia.org/T386911) (owner: 10Tiziano Fogli) [20:06:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aude@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269475 (https://phabricator.wikimedia.org/T422833) (owner: 10Aude) [20:06:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aude@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047441 (https://phabricator.wikimedia.org/T367943) (owner: 10Jon Harald Søby) [20:06:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aude@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268680 (https://phabricator.wikimedia.org/T422524) (owner: 10C. Scott Ananian) [20:07:28] (03Merged) 10jenkins-bot: Opt-in new accounts to ReadingLists beta feature on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269475 (https://phabricator.wikimedia.org/T422833) (owner: 10Aude) [20:07:33] (03Merged) 10jenkins-bot: Add new protection level (edituserprotected) for nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047441 (https://phabricator.wikimedia.org/T367943) (owner: 10Jon Harald Søby) [20:07:36] (03Merged) 10jenkins-bot: Turn on Parsoid Read Views for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268680 (https://phabricator.wikimedia.org/T422524) (owner: 10C. 
Scott Ananian) [20:07:52] !log aude@deploy1003 Started scap sync-world: Backport for [[gerrit:1269475|Opt-in new accounts to ReadingLists beta feature on testwiki (T422833)]], [[gerrit:1047441|Add new protection level (edituserprotected) for nowiki (T367943)]], [[gerrit:1268680|Turn on Parsoid Read Views for dewiki (T422524)]] [20:07:59] T422833: Start opting in new accounts on the pilot wikis (arwiki, frwiki, zhwiki, idwiki and viwiki) - https://phabricator.wikimedia.org/T422833 [20:07:59] T367943: Add new protection level for the Norwegian Bokmål Wikipedia - https://phabricator.wikimedia.org/T367943 [20:08:00] T422524: Parsoid Read Views to deploy ~2026-04-07 - https://phabricator.wikimedia.org/T422524 [20:09:35] !log aude@deploy1003 cscott, jhsoby, aude: Backport for [[gerrit:1269475|Opt-in new accounts to ReadingLists beta feature on testwiki (T422833)]], [[gerrit:1047441|Add new protection level (edituserprotected) for nowiki (T367943)]], [[gerrit:1268680|Turn on Parsoid Read Views for dewiki (T422524)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:09:49] please check cscott Jhs [20:10:43] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephmon2006-dev.codfw.wmnet [20:11:39] aude, mine works as expected afaict 👍 [20:11:42] thanks [20:12:56] mine looks good [20:13:01] thanks [20:13:05] !log aude@deploy1003 cscott, jhsoby, aude: Continuing with sync [20:15:35] !log tappof@cumin1003 START - Cookbook sre.o11y.thanos-compact-restart rebalance blocks across compactor instances (patch id: 1265429) [20:16:57] !log aude@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269475|Opt-in new accounts to ReadingLists beta feature on testwiki (T422833)]], [[gerrit:1047441|Add new protection level (edituserprotected) for nowiki (T367943)]], [[gerrit:1268680|Turn on Parsoid Read Views for dewiki (T422524)]] (duration: 09m 04s) [20:17:03] T422833: Start opting in new accounts on the pilot wikis (arwiki, frwiki, zhwiki, idwiki and viwiki) - https://phabricator.wikimedia.org/T422833 [20:17:03] T367943: Add new protection level for the Norwegian Bokmål Wikipedia - https://phabricator.wikimedia.org/T367943 [20:17:04] T422524: Parsoid Read Views to deploy ~2026-04-07 - https://phabricator.wikimedia.org/T422524 [20:17:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aude@deploy1003 using scap backport" [extensions/ReadingLists] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269552 (https://phabricator.wikimedia.org/T421942) (owner: 10Aude) [20:17:48] !log tappof@cumin1003 END (PASS) - Cookbook sre.o11y.thanos-compact-restart (exit_code=0) rebalance blocks across compactor instances (patch id: 1265429) [20:18:36] (03Merged) 10jenkins-bot: Make onboarding dialog a little less eager beaver 🦫 [extensions/ReadingLists] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269552 (https://phabricator.wikimedia.org/T421942) (owner: 10Aude) [20:18:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:18:51] !log aude@deploy1003 Started scap sync-world: Backport for [[gerrit:1269552|Make onboarding dialog a little less eager beaver 🦫 (T421942)]] [20:18:53] T421942: Improve onboarding dialog timing - https://phabricator.wikimedia.org/T421942 [20:19:15] FIRING: [2x] JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:20:03] (03PS1) 10Dzahn: admin: add backup yubikey to myself, dzahn [puppet] - 10https://gerrit.wikimedia.org/r/1269649 [20:20:32] !log aude@deploy1003 aude: Backport for [[gerrit:1269552|Make onboarding dialog a little less eager beaver 🦫 (T421942)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:24:09] spot checking [20:24:15] RESOLVED: [2x] JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:25:07] !log aude@deploy1003 aude: Continuing with sync [20:25:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdb1008 - https://phabricator.wikimedia.org/T414374#11806465 (10Dwisehaupt) [20:26:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdb1008 - https://phabricator.wikimedia.org/T414374#11806468 (10Dwisehaupt) Host is built and databases cloned. Closing. 
[20:27:15] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [20:28:04] (CR) Santiago Faci: [C:+2] Test Kitchen UI: Deploy v1.2.8 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1268897 (https://phabricator.wikimedia.org/T421972) (owner: Santiago Faci) [20:28:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:28:56] !log aude@deploy1003 Finished scap sync-world: Backport for [[gerrit:1269552|Make onboarding dialog a little less eager beaver 🦫 (T421942)]] (duration: 10m 05s) [20:28:59] T421942: Improve onboarding dialog timing - https://phabricator.wikimedia.org/T421942 [20:30:05] (Merged) jenkins-bot: Test Kitchen UI: Deploy v1.2.8 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1268897 (https://phabricator.wikimedia.org/T421972) (owner: Santiago Faci) [20:31:41] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding apus-be2005-6 and phab2003 to codfw - jhancock@cumin2002" [20:31:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding apus-be2005-6 and phab2003 to codfw - jhancock@cumin2002" [20:31:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:31:57] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host apus-be2005 [20:34:24] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frdb1008 - https://phabricator.wikimedia.org/T414374#11806507 (Dwisehaupt) Open→Resolved [20:37:49] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/test-kitchen-next: apply [20:38:08] aude: are you done? i might have one more patch to sneak into this window. [20:38:18] yes i am done [20:38:33] no issues with the emoji :) [20:38:49] squad goals now for me [20:39:34] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [20:45:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host apus-be2005 [20:45:10] (CR) Santiago Faci: [C:+2] Test Kitchen UI: Deploy v1.2.8 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1268898 (https://phabricator.wikimedia.org/T421972) (owner: Santiago Faci) [20:45:13] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host apus-be2006 [20:45:23] !log reprepro --noskipold --component thirdparty/opensearch2 update trixie-wikimedia T422860 [20:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:27] T422860: Migrate Cloudelastic to OpenSearch 2.x - https://phabricator.wikimedia.org/T422860 [20:47:08] (Merged) jenkins-bot: Test Kitchen UI: Deploy v1.2.8 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1268898 (https://phabricator.wikimedia.org/T421972) (owner: Santiago Faci) [20:50:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host apus-be2006 [20:50:24] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host phab2003 [20:50:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host phab2003 [20:50:48] (PS1) C. Scott Ananian: ParsoidLanguageConverter: Don't convert inside