[00:15:25] RESOLVED: ProbeDown: Service aqs1024-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#aqs1024-b:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:21:03] (03PS1) 10Dzahn: zuul::executor: remove mounting of /etc/cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1269082 [00:21:44] (03PS2) 10Dzahn: zuul::executor: remove mounting of /etc/cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1269082 (https://phabricator.wikimedia.org/T395938) [00:22:23] !log eevans@cumin1003 START - Cookbook sre.hosts.remove-downtime for aqs1024.eqiad.wmnet [00:22:23] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1024.eqiad.wmnet [00:28:21] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11802627 (10Eevans) [00:30:07] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11802628 (10Eevans) [00:34:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [00:39:21] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:41:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [00:41:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [00:47:22] jouncebot: nowandnext [00:47:23] No deployments scheduled for the next 5 hour(s) and 12 minute(s) [00:47:23] In 5 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T0600) [00:47:23] In 5 hour(s) and 12 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T0600) [00:47:27] (03CR) 10Zabe: [C:03+2] Start reading from new file tables everwhere except enwiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268574 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [00:48:22] (03Merged) 10jenkins-bot: Start reading from new file tables everwhere except enwiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1268574 (https://phabricator.wikimedia.org/T416548) (owner: 10Zabe) [00:49:45] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1268574|Start reading from new file tables everwhere except enwiki and commons (T416548)]] [00:49:48] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548 [00:51:37] !log zabe@deploy1003 zabe: Backport for [[gerrit:1268574|Start reading from new file tables everwhere except enwiki and commons (T416548)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[00:53:34] !log zabe@deploy1003 zabe: Continuing with sync [00:57:25] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1268574|Start reading from new file tables everwhere except enwiki and commons (T416548)]] (duration: 07m 40s) [00:57:28] T416548: Start reading from file table on wmf production - https://phabricator.wikimedia.org/T416548 [00:58:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1103:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:03:30] (03PS1) 10Zabe: Start reading from new file tables on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269086 (https://phabricator.wikimedia.org/T416548) [01:09:58] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1269088 [01:09:58] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1269088 (owner: 10TrainBranchBot) [01:22:55] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1269088 (owner: 10TrainBranchBot) [01:23:30] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [01:28:30] RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [01:30:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [01:31:27] RECOVERY - PyBal backends health check on 
lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:00:54] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:07:06] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 11s) [02:09:15] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:09:22] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [02:25:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:30:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:31:23] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[02:34:15] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:47:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:52:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:54:57] (03PS1) 10Aaron Schulz: Add Wikimedia REST API ?spec route for *.wikinews.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269099 (https://phabricator.wikimedia.org/T418318) [03:00:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:00:37] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:02:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are 
healthy https://wikitech.wikimedia.org/wiki/PyBal [03:02:37] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:04:15] (03PS2) 10Bernard Wang: Enable reading list beta feature for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1269063 [03:05:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:06:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:09:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:09:37] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:17:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:20:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:23:27] 
RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:23:35] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:27:09] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1022 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:27:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:28:09] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1022 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:31:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:48:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1103:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:34:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:34:35] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:35:27] RECOVERY - PyBal backends 
health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:35:35] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:38:37] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:39:27] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [04:41:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:41:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:41:35] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:09:30] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool depool pc1016: Reimage [05:09:30] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [05:09:38] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [05:09:38] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool pc1016: Reimage [05:10:13] (03PS1) 10Marostegui: pc2010: 
Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1269223 (https://phabricator.wikimedia.org/T422368) [05:10:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Reimage to Debian Trixie [05:11:15] (03PS2) 10Marostegui: pc2016: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1269223 (https://phabricator.wikimedia.org/T422368) [05:11:49] (03CR) 10Marostegui: [C:03+2] pc2016: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1269223 (https://phabricator.wikimedia.org/T422368) (owner: 10Marostegui) [05:13:45] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host pc2016.codfw.wmnet with OS trixie [05:26:44] (03PS1) 10Marostegui: installservers: Do not format /srv on an-redacteddb1001 [puppet] - 10https://gerrit.wikimedia.org/r/1269227 (https://phabricator.wikimedia.org/T422778) [05:31:44] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2016.codfw.wmnet with reason: host reimage [05:33:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:35:26] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2016.codfw.wmnet with reason: host reimage [05:48:22] (03CR) 10Filippo Giunchedi: [C:03+1] Revert "P:toolforge::prometheus: Disable istio-gateway scrape for now" [puppet] - 10https://gerrit.wikimedia.org/r/1268981 (https://phabricator.wikimedia.org/T421386) (owner: 10Majavah) [05:48:43] (03CR) 10Filippo Giunchedi: [C:03+1] dumps: web: Remove plaintext HTTP server [puppet] - 10https://gerrit.wikimedia.org/r/1268985 (https://phabricator.wikimedia.org/T422672) (owner: 10Majavah) [05:54:27] PROBLEM - PyBal backends health 
check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:55:37] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:56:27] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:56:37] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:59:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2016.codfw.wmnet with OS trixie [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T0600) [06:00:05] marostegui, Amir1, and federico3: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T0600). [06:05:26] (03CR) 10Ayounsi: "Thanks for the data. Unfortunately we're a bit blind about YT." 
[dns] - 10https://gerrit.wikimedia.org/r/1267042 (owner: 10Ayounsi) [06:10:13] (03PS1) 10Marostegui: Revert "pc2016: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1269238 [06:26:40] (03PS1) 10Muehlenhoff: Record LDAP access for lbecker [puppet] - 10https://gerrit.wikimedia.org/r/1269241 (https://phabricator.wikimedia.org/T422537) [06:29:01] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for lbecker [puppet] - 10https://gerrit.wikimedia.org/r/1269241 (https://phabricator.wikimedia.org/T422537) (owner: 10Muehlenhoff) [06:37:48] (03PS1) 10Marostegui: pc1016: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1269243 (https://phabricator.wikimedia.org/T422368) [06:38:22] (03CR) 10Marostegui: [C:03+2] pc1016: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1269243 (https://phabricator.wikimedia.org/T422368) (owner: 10Marostegui) [06:38:45] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host pc1016.eqiad.wmnet with OS trixie [06:38:52] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host pc1016.eqiad.wmnet with OS trixie [06:39:36] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host pc1016.eqiad.wmnet with OS trixie [06:40:58] (03PS1) 10Muehlenhoff: Record LDAP access for bliviero [puppet] - 10https://gerrit.wikimedia.org/r/1269244 [06:41:52] (03PS1) 10Ryan Kemper: growthbook: Add API key placeholders for automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269245 (https://phabricator.wikimedia.org/T420696) [06:43:07] 06SRE, 10LDAP-Access-Requests: Grant Access to Turnilo and Superset for MMigurski-WMF - https://phabricator.wikimedia.org/T422537#11802981 (10MoritzMuehlenhoff) 05Open→03Resolved p:05Triage→03Medium a:03MoritzMuehlenhoff >>! In T422537#11802156, @MMigurski-WMF wrote: > I have updated my email to... 
[06:46:27] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11802986 (10ayounsi) [06:46:53] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for bliviero [puppet] - 10https://gerrit.wikimedia.org/r/1269244 (owner: 10Muehlenhoff) [06:49:15] 06SRE, 06Infrastructure-Foundations: Failing Trixie VM installations on routed Ganeti - https://phabricator.wikimedia.org/T422596#11802989 (10ayounsi) [06:49:16] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating eqsin to routed Ganeti - https://phabricator.wikimedia.org/T421863#11802988 (10ayounsi) [06:54:15] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:55:08] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1016.eqiad.wmnet with reason: host reimage [06:59:10] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1016.eqiad.wmnet with reason: host reimage [07:00:05] Amir1, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:02:35] (03CR) 10JMeybohm: [C:03+1] "Oh, that makes total sense! 
Sorry I did not catch this when reviewing the initial patch" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269069 (https://phabricator.wikimedia.org/T367880) (owner: 10RLazarus) [07:03:59] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on pc2016.codfw.wmnet with reason: Maintenance [07:04:10] (03CR) 10Marostegui: [C:03+2] Revert "pc2016: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1269238 (owner: 10Marostegui) [07:10:26] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:13:06] (03PS1) 10Marostegui: Revert "pc1016: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1269252 [07:13:55] (03CR) 10Majavah: [C:03+2] Revert "P:toolforge::prometheus: Disable istio-gateway scrape for now" [puppet] - 10https://gerrit.wikimedia.org/r/1268981 (https://phabricator.wikimedia.org/T421386) (owner: 10Majavah) [07:14:02] (03CR) 10Majavah: [C:03+2] dumps: web: Remove plaintext HTTP server [puppet] - 10https://gerrit.wikimedia.org/r/1268985 (https://phabricator.wikimedia.org/T422672) (owner: 10Majavah) [07:15:29] (03CR) 10Marostegui: [C:03+2] Revert "pc1016: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1269252 (owner: 10Marostegui) [07:16:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1016.eqiad.wmnet with OS trixie [07:17:15] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool pool pc1016: After reimage [07:17:15] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [07:17:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [07:17:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool pc1016: After reimage [07:34:38] (03CR) 10Brouberol: "Looks good! 
You also need to bump the `growthbook-backend` subchart version, as well as the `growthbook` chart version, to ensure the chan" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269245 (https://phabricator.wikimedia.org/T420696) (owner: 10Ryan Kemper) [07:43:58] (03PS2) 10Elukey: service: allow k8s-ingress-aux-rw to be active/active [puppet] - 10https://gerrit.wikimedia.org/r/1268902 (https://phabricator.wikimedia.org/T414486) [07:43:58] (03PS1) 10Elukey: kubernetes: move aux-k8s-eqiad to 1.31 [puppet] - 10https://gerrit.wikimedia.org/r/1269259 (https://phabricator.wikimedia.org/T414486) [07:48:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9200.service on cirrussearch1103:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:54:55] (03PS1) 10Elukey: admin_ng: Upgrade aux-k8s-eqiad to 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269329 (https://phabricator.wikimedia.org/T414486) [07:56:06] PROBLEM - SSH on logstash1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:56:15] (03CR) 10Filippo Giunchedi: [C:03+1] dumps: web: Use 429 for connection limit issues [puppet] - 10https://gerrit.wikimedia.org/r/1269021 (owner: 10Majavah) [07:56:31] FIRING: [2x] ProbeDown: Service logstash1032:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash1032:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:56:48] (03CR) 10Majavah: [C:03+2] dumps: web: Use 429 for connection limit issues [puppet] - 10https://gerrit.wikimedia.org/r/1269021 (owner: 10Majavah) [07:58:56] RECOVERY - SSH on logstash1032 is OK: SSH OK - OpenSSH_9.2p1 
Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:59:06] PROBLEM - SSH on logstash1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:00:05] dancy and jnuche: gettimeofday() says it's time for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260409T0800) [08:00:26] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:00:29] (03PS1) 10Filippo Giunchedi: memcached: install liburi-perl [puppet] - 10https://gerrit.wikimedia.org/r/1269333 [08:01:28] (03PS1) 10Mszwarc: Fix BackfillInterwikiRightsLog wrt. cyclic renames [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269334 (https://phabricator.wikimedia.org/T6055) [08:01:31] FIRING: [4x] ProbeDown: Service logstash1023:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:03:56] RECOVERY - SSH on logstash1023 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:06:31] RESOLVED: [4x] ProbeDown: Service logstash1023:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:08:52] (03PS3) 10Fabfur: hiera: upgrade haproxy to version 3.2 on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1262064 (https://phabricator.wikimedia.org/T421402) [08:08:55] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1269333 (owner: 10Filippo Giunchedi) [08:08:55] (03CR) 10Fabfur: "check experimental" 
[puppet] - 10https://gerrit.wikimedia.org/r/1262064 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [08:10:00] (03PS3) 10Fabfur: hiera: upgrade haproxy to version 3.2 on esams [puppet] - 10https://gerrit.wikimedia.org/r/1262065 (https://phabricator.wikimedia.org/T421402) [08:11:51] !log upgrading eqiad to haproxy 3.2 (T421402) [08:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:54] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [08:11:58] (03CR) 10Filippo Giunchedi: [C:03+2] memcached: install liburi-perl [puppet] - 10https://gerrit.wikimedia.org/r/1269333 (owner: 10Filippo Giunchedi) [08:12:06] (03CR) 10Fabfur: [C:03+2] hiera: upgrade haproxy to version 3.2 on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1262064 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [08:14:29] (03CR) 10CI reject: [V:04-1] Fix BackfillInterwikiRightsLog wrt. cyclic renames [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269334 (https://phabricator.wikimedia.org/T6055) (owner: 10Mszwarc) [08:14:36] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad - 3.2 upgrade (T421402) [08:14:37] (03CR) 10Mszwarc: "recheck" [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269334 (https://phabricator.wikimedia.org/T6055) (owner: 10Mszwarc) [08:14:38] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad - 3.2 upgrade (T421402) [08:16:12] (03CR) 10Elukey: [C:03+2] service: allow k8s-ingress-aux-rw to be active/active [puppet] - 10https://gerrit.wikimedia.org/r/1268902 (https://phabricator.wikimedia.org/T414486) (owner: 10Elukey) [08:17:07] (03CR) 10Muehlenhoff: [C:03+1] "os-reports is only static content, looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1268902 (https://phabricator.wikimedia.org/T414486) (owner: 10Elukey) [08:20:37] !log 
mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe1013.eqiad.wmnet with OS bullseye [08:20:44] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11803154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe1013.eq... [08:21:06] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster pool all services in codfw/aux-codfw: maintenance [08:21:09] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-fe1013 [08:21:19] PROBLEM - gdnsd checkconf #page on dns2006 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [08:21:21] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [08:21:29] !incidents [08:21:30] 7818 (UNACKED) dns2006/gdnsd checkconf (paged) [08:21:38] !ack 7818 [08:21:38] 7818 (ACKED) dns2006/gdnsd checkconf (paged) [08:21:59] here too [08:22:17] oh my, is it related to my patch? 
[08:22:19] PROBLEM - gdnsd checkconf #page on dns7002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [08:22:21] PROBLEM - gdnsd checkconf #page on dns5003 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [08:22:27] !incidents [08:22:28] 7818 (ACKED) dns2006/gdnsd checkconf (paged) [08:22:28] 7819 (UNACKED) dns5003/gdnsd checkconf (paged) [08:22:28] 7820 (UNACKED) dns7002/gdnsd checkconf (paged) [08:22:30] !ack [08:22:30] 7819 (ACKED) dns5003/gdnsd checkconf (paged) [08:22:30] 7820 (ACKED) dns7002/gdnsd checkconf (paged) [08:22:35] (03PS1) 10Hashar: browser-tests: hide Cypress tests from CI [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269339 (https://phabricator.wikimedia.org/T419574) [08:22:38] elukey: not sure, checking [08:22:56] for context: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1268902/3/hieradata/common/service.yaml [08:22:59] (03PS1) 10Hashar: browser-tests: hide Cypress tests from CI [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269340 (https://phabricator.wikimedia.org/T419574) [08:23:04] checking as well [08:23:26] !log elukey@cumin1003 END (FAIL) - Cookbook sre.k8s.pool-depool-cluster (exit_code=93) pool all services in codfw/aux-codfw: maintenance [08:23:26] <_joe_> elukey: yes I think you need puppet to run everywhere before you change things [08:23:52] Invalid resource name disc-k8s-ingress-aux-rw detected from zonefile lookup [08:23:52] (03CR) 10Michael Große: [C:03+1] browser-tests: hide Cypress tests from CI [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269339 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [08:23:53] elukey: looks like it [08:23:56] error: plugin_metafo: Invalid resource name 'disc-k8s-ingress-aux-rw' detected from zonefile lookup error: Name 
'k8s-ingress-aux-rw.discovery.wmnet.': resolver plugin 'metafo' rejected resource name 'disc-k8s-ingress-aux-rw' [08:24:03] <_joe_> so run puppet everywhere [08:24:05] (03CR) 10Michael Große: [C:03+1] browser-tests: hide Cypress tests from CI [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269340 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [08:24:08] Apr 09 08:18:57 dns2006 gdnsd[2640858]: Name 'k8s-ingress-aux-rw.discovery.wmnet.': resolver plugin 'metafo' rejected resource name 'disc-k8s-ingress-aux-rw' [08:24:23] yeah sorry folks [08:24:32] running puppet [08:24:32] so yes gdnsd is complaining about the aux ingress [08:25:23] okay thanks, let's see if this calms gdnsd [08:26:16] I'm in the middle of a re-image and new-vlan cookbook and it's also unhappy [08:26:26] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [08:26:34] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depooling db2157 (T419635)', diff saved to https://phabricator.wikimedia.org/P90334 and previous config saved to /var/cache/conftool/dbconfig/20260409-082633-fceratto.json [08:26:37] T419635: Drop il_to column from imagelinks table in wmf production - https://phabricator.wikimedia.org/T419635 [08:26:44] because of the DNS check failures [08:27:07] running Puppet is not enough, gdnsd still fails [08:27:13] tested on dns5003 [08:27:40] dns[2006,5003,7002] all still unhappy AFAICT [08:27:41] FIRING: [3x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-k8s-ingress-aux-rw.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:28:13] so what I tried to do was to run the k8s pool cookbook [08:28:38] for aux-k8s-codfw, that tried to set k8s ingress rw to active active [08:28:50] PROBLEM - Recursive DNS on
2001:df2:e500:1:103:102:166:10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [08:28:50] PROBLEM - Recursive DNS on 103.102.166.10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [08:28:56] PROBLEM - gdnsd daemon runs exactly once on dns5003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 496 (gdnsd), args /usr/sbin/gdnsd https://wikitech.wikimedia.org/wiki/DNS [08:28:58] PROBLEM - Auth DNS on dns5003 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [08:29:02] PROBLEM - AuthDNS-over-TLS Works on dns5003 is CRITICAL: CRITICAL: ns[012] kdig DoTLS check failure https://wikitech.wikimedia.org/wiki/DNS [08:29:05] <_joe_> uh damn [08:29:15] <_joe_> revert the patch now I guess [08:29:50] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2007.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:29:57] (03PS1) 10Elukey: Revert "service: allow k8s-ingress-aux-rw to be active/active" [puppet] - 10https://gerrit.wikimedia.org/r/1269341 [08:30:01] <_joe_> actually it's the opposite puppet makes dns fail worse [08:30:11] <_joe_> elukey: merge with +2 +2 please [08:30:19] PROBLEM - gdnsd checkconf #page on dns1005 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [08:30:22] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1269341 (owner: 10Elukey) [08:30:23] (03CR) 10Elukey: [V:03+2 C:03+2] Revert "service: allow k8s-ingress-aux-rw to be active/active" [puppet] - 10https://gerrit.wikimedia.org/r/1269341 (owner: 10Elukey) [08:30:24] mvernon@cumin2002 reimage (PID 990280) is awaiting input [08:30:27] !ack [08:30:28] 7821 (ACKED) dns1005/gdnsd checkconf (paged) [08:30:34] PROBLEM - PyBal backends health check on lvs2014 is 
CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:30:43] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1269341 (owner: 10Elukey) [08:30:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:30:50] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:30:51] [that reimage awaiting input is me waiting to retry the DNS changes once this is fixed] [08:31:44] <_joe_> uhm so I think I understand what the problem is [08:32:06] I mean there must be a stale setting somewhere that makes gdnsd rightfully upset [08:32:22] <_joe_> you added a resource as ./discovery-geo-resources and not ./discovery-metafo-resources [08:32:30] <_joe_> elukey: did you puppet-merge? [08:32:34] yeah [08:32:35] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11803193 (10ayounsi) > @jcrespo > We don't reimage backups hosts. Let us know the alternative method (we can put them out... [08:32:41] FIRING: [13x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-k8s-ingress-aux-rw.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:32:48] (03PS2) 10Mszwarc: Fix BackfillInterwikiRightsLog wrt. 
cyclic renames [core] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269334 (https://phabricator.wikimedia.org/T6055) [08:32:54] <_joe_> running puppet on dns5003 [08:33:03] <_joe_> to confirm if gdnsd comes back [08:33:08] <_joe_> we might need more reverts [08:33:15] okok I was about to do the same [08:33:17] already running puppet on 5003 [08:33:52] <_joe_> otherwise I'll design a way to fix all this [08:34:14] (03CR) 10CI reject: [V:04-1] browser-tests: hide Cypress tests from CI [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269339 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [08:34:15] FIRING: [2x] JobUnavailable: Reduced availability for job pdnsrec in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:34:16] and gdnsd now properly starts on 5003 [08:34:18] PROBLEM - gdnsd checkconf #page on dns2005 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [08:34:19] RECOVERY - gdnsd checkconf #page on dns2006 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [08:34:20] RECOVERY - gdnsd checkconf #page on dns5003 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [08:34:34] !incidents [08:34:34] 7820 (ACKED) dns7002/gdnsd checkconf (paged) [08:34:34] 7821 (ACKED) dns1005/gdnsd checkconf (paged) [08:34:34] !incidents [08:34:34] 7822 (UNACKED) dns2005/gdnsd checkconf (paged) [08:34:35] 7818 (RESOLVED) dns2006/gdnsd checkconf (paged) [08:34:35] 7819 (RESOLVED) dns5003/gdnsd checkconf (paged) [08:34:35] 7820 (ACKED) dns7002/gdnsd checkconf (paged) [08:34:35] 7821 (ACKED) dns1005/gdnsd checkconf (paged) [08:34:35] 7822 (UNACKED) dns2005/gdnsd checkconf (paged) [08:34:36] 7818 (RESOLVED) dns2006/gdnsd 
checkconf (paged) [08:34:36] 7819 (RESOLVED) dns5003/gdnsd checkconf (paged) [08:34:40] dns5003 is happy now [08:34:46] !ack 7822 [08:34:46] 7822 (ACKED) dns2005/gdnsd checkconf (paged) [08:34:47] 2005 might need another puppet run? [08:34:48] RECOVERY - Recursive DNS on 2001:df2:e500:1:103:102:166:10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [08:34:48] RECOVERY - Recursive DNS on 103.102.166.10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [08:34:50] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:34:50] <_joe_> ok let's run puppet on all servers [08:34:54] RECOVERY - gdnsd daemon runs exactly once on dns5003 is OK: PROCS OK: 1 process with UID = 496 (gdnsd), args /usr/sbin/gdnsd https://wikitech.wikimedia.org/wiki/DNS [08:34:56] RECOVERY - Auth DNS on dns5003 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [08:34:57] going to do it [08:34:57] <_joe_> dns servers I mean [08:35:02] RECOVERY - AuthDNS-over-TLS Works on dns5003 is OK: OK: ns[012] kdig DoTLS check success https://wikitech.wikimedia.org/wiki/DNS [08:35:09] <_joe_> now I need to understand what you did wrong [08:35:12] ok thanks elukey [08:35:57] <_joe_> I think we don't support at the moment the transition from active/active to active/passive, but I need to understand what you did and what the consequences are again, I've last looked at this stuff eons ago [08:36:27] I can take care of it later on if you want, I made the mess so I can try to dig into it [08:36:45] (03PS1) 10Gkyziridis: ml-services: Remove models from experimental staging that were not deployed corrextly. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1269343 [08:36:50] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:36:56] My cookbook is still failing to get happyness from dns[1005,2005,7002] at least [08:37:10] (03CR) 10CI reject: [V:04-1] browser-tests: hide Cypress tests from CI [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269340 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [08:37:11] 2005 was still missing a puppet run [08:37:19] RECOVERY - gdnsd checkconf #page on dns1005 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [08:37:25] !incidents [08:37:25] 7820 (ACKED) dns7002/gdnsd checkconf (paged) [08:37:25] 7822 (ACKED) dns2005/gdnsd checkconf (paged) [08:37:25] 7821 (RESOLVED) dns1005/gdnsd checkconf (paged) [08:37:26] 7818 (RESOLVED) dns2006/gdnsd checkconf (paged) [08:37:26] 7819 (RESOLVED) dns5003/gdnsd checkconf (paged) [08:37:32] ok perfect [08:37:41] FIRING: [13x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-k8s-ingress-aux-rw.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:38:06] (03PS1) 10Dpogorzelski: amg-gpu: Set up explicit GPU partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) [08:38:19] RECOVERY - gdnsd checkconf #page on dns2005 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [08:38:34] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:39:15] FIRING: [2x] JobUnavailable: Reduced availability for job pdnsrec in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:39:50] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:39:52] mvernon@cumin2002 reimage (PID 990280) is awaiting input [08:40:06] halfway through the puppet runs [08:40:18] RECOVERY - gdnsd checkconf #page on dns7002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/DNS%23gdnsd_checkconf [08:40:20] yeah only 7002 is missing I think [08:40:24] ah there it is [08:40:25] I just ran puppet there [08:40:28] !incitents [08:40:32] !incidents [08:40:33] 7820 (RESOLVED) dns7002/gdnsd checkconf (paged) [08:40:33] 7822 (RESOLVED) dns2005/gdnsd checkconf (paged) [08:40:33] 7821 (RESOLVED) dns1005/gdnsd checkconf (paged) [08:40:33] 7818 (RESOLVED) dns2006/gdnsd checkconf (paged) [08:40:33] 7819 (RESOLVED) dns5003/gdnsd checkconf (paged) [08:40:50] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:41:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:41:07] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-fe1013 - mvernon@cumin2002" [08:41:12] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host 
ms-fe1013 - mvernon@cumin2002" [08:41:13] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:41:13] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-fe1013.eqiad.wmnet 149.48.64.10.in-addr.arpa 9.4.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:41:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-fe1013.eqiad.wmnet 149.48.64.10.in-addr.arpa 9.4.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [08:41:18] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-fe1013 [08:41:26] OK, DNS run in my cookbook went through OK now, thanks :) [08:41:39] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe1013 [08:41:39] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-fe1013 [08:42:03] Emperor: next time please wait the green light from others, I am still completing the puppet runs [08:42:56] elukey: sorry, I figured asking it to retry again was likely harmless [08:45:28] Emperor: no problem, but if you see the logs above the cookbook worked with DNS, while we are still cleaning up. It can backfire in a lot of ways :D [08:45:33] (03PS1) 10Mszwarc: Fix BackfillInterwikiRightsLog wrt. cyclic renames [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269345 (https://phabricator.wikimedia.org/T6055) [08:45:51] ok runs completed [08:46:44] great thank you elukey. Then I'll let you figure out the correct approach for active-active pooling and resume my work ? or do you need anything from oncallers? :) [08:47:25] (03CR) 10Michael Große: [C:03+1] "This needs to first backport If4ff840dc65aef61029a12fe554364d38fa277b1 and then have this changed behind it." 
[extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269339 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [08:47:51] <_joe_> elukey: do you have a pcc for your original change? [08:48:18] _joe_ I don't no, I really thought it was harmless to be honest, my bad [08:48:38] jelto: the last bit is https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed, but I can find what's wrong don't worry [08:48:41] sorry for the noise [08:48:59] <_joe_> elukey: I'm aslo not sure why you wanted to make a -rw endpoint active-active [08:49:24] <_joe_> elukey: anyways, what are you trying to fix rn? [08:49:52] <_joe_> I can help I think [08:50:11] _joe_ the only thing that pointed to it was os-reports, I thought it was a leftover from when we had only aux-k8s-eqiad [08:50:27] if it is supposed to be always active/passive ok, I'll leave it there [08:50:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:50:48] <_joe_> usually we put something under -rw that needs to be written to a specific location, yes [08:51:21] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad - 3.2 upgrade (T421402) [08:51:24] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [08:52:18] the only thing to clean up is /var/lib/gdnsd/discovery-k8s-ingress-aux-rw.state on the dns servers, since afaics it points to two IPs with UP [08:52:28] (03CR) 10Mszwarc: "recheck" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269340 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [08:52:33] my understanding is that the check fails because of that [08:52:36] 10SRE-swift-storage, 10Ceph, 
06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11803296 (10jcrespo) >>! In T421719#11803193, @ayounsi wrote: > We can work together on that, the process is a bit more man... [08:53:21] (03CR) 10Hashar: "recheck due to T422469 - *ReviseTone.cy.ts flakily fails to find ve-ui-editCheck-gutter-action-warning*" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269340 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [08:53:32] <_joe_> elukey: very probable yes but uhm wait [08:54:09] (03CR) 10Volans: "Thanks for porting the work started with pywmflib to spicerack!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [08:54:25] elukey, jelto: os-reports is entirely static content, generated on the puppetdb hosts, it can simply be removed from the rw endpoint? [08:54:28] <_joe_> so yes, I don't think currently there's an entry in confd for discovery-k8s-ingress-aux-rw [08:54:34] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2015.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:54:50] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs2013.codfw.wmnet, wdqs2021.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2014.codfw.wmnet, wdqs2008.codfw.wmnet, wdqs2010.codfw.wmnet, wdqs2012.codfw.wmnet, wdqs2011.codfw.wmnet, wdqs2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:55:06] or is that used for rsync for the data sync? 
[08:55:36] yes that's also my understanding, os-reports is static and read only, the syncing happens on rsync level (the k8s pod fetches it from the puppetserver) and not over https [08:55:36] (03CR) 10Hashar: "The second `recheck` I have done is because I had it prepared in a window and did not sent it immediately because I was replying on anothe" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1269340 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [08:55:48] <_joe_> elukey: which was pooled before? eqiad or codfw? [08:56:09] eqiad [08:56:16] <_joe_> ok [08:56:19] (03CR) 10Hashar: "Oh nice, thank you. I will propose the backport and rebase this change on top of it." [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269339 (https://phabricator.wikimedia.org/T419574) (owner: 10Hashar) [08:56:21] <_joe_> running confctl --object-type discovery select 'dnsdisc=k8s-ingress-aux-rw,name=codfw' set/pooled=false [08:56:23] !log oblivian@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=k8s-ingress-aux-rw,name=codfw [08:56:43] (03PS1) 10Hashar: fix: adjust to return type changed by upstream [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269351 [08:56:50] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:56:59] (03PS2) 10Hashar: browser-tests: hide Cypress tests from CI [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269339 (https://phabricator.wikimedia.org/T419574) [08:57:11] (03CR) 10CI reject: [V:04-1] Fix BackfillInterwikiRightsLog wrt. 
cyclic renames [core] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269345 (https://phabricator.wikimedia.org/T6055) (owner: 10Mszwarc) [08:57:34] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:57:42] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db2157: Pooling in [08:58:10] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1013.eqiad.wmnet with reason: host reimage [08:59:07] !log fabfur@cumin1003 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad - 3.2 upgrade (T421402) [08:59:10] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [08:59:39] <_joe_> elukey: I have to hop into a meeting, but AFAICT most things are ok now, there's just one problem left but it should not have to do with confd [08:59:54] <_joe_> there's probably some error files leftover, but I can't check rn [08:59:54] okok thanks [09:04:10] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1013.eqiad.wmnet with reason: host reimage [09:04:19] 06SRE, 10Wikimedia-Mailing-lists, 07Upstream: improve new mailing list admin notifications - https://phabricator.wikimedia.org/T281987#11803370 (10Effeietsanders) @Legoktm Thanks again for this! Is there progress on the more high level task? I have another few lists where I was trying to add this, and just f... [09:05:05] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11803372 (10jcrespo) >>! In T420623#11801010, @Papaul wrote: > @jcrespo we can do this next week Wednesday April 15th at 10am CT .... [09:06:11] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. 
[09:07:17] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [09:07:35] (03CR) 10Gkyziridis: [C:03+2] ml-services: Remove models from experimental staging that were not deployed corrextly. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269343 (owner: 10Gkyziridis) [09:09:30] (03Merged) 10jenkins-bot: ml-services: Remove models from experimental staging that were not deployed corrextly. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269343 (owner: 10Gkyziridis) [09:09:34] 10SRE-swift-storage, 06Commons: Commons file not found - https://phabricator.wikimedia.org/T413507#11803391 (10TheDJ) > error on line 18 at column 39: xmlns:ns0: '&#38;ns_ai;' is not a valid URI" The file uses a xml namespace that we do not recognize and do not allow. It's apparently a bug in a ve... [09:09:37] (03CR) 10CI reject: [V:04-1] fix: adjust to return type changed by upstream [extensions/GrowthExperiments] (wmf/1.46.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1269351 (owner: 10Hashar) [09:09:40] (03PS1) 10Gmodena: RSU: increase parallelism for staging deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269354 (https://phabricator.wikimedia.org/T422791) [09:09:53] (03PS4) 10Fabfur: hiera: upgrade haproxy to version 3.2 on esams [puppet] - 10https://gerrit.wikimedia.org/r/1262065 (https://phabricator.wikimedia.org/T421402) [09:09:57] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1262065 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [09:10:38] !log gkyziridis@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[09:11:06] ah right I found it [09:11:24] there are /var/run/confd-template/_var_lib_gdnsd_discovery-k8s-ingress-aux-rw.state.err in the various dns hosts, I was looking on one that didn't have them [09:11:41] !log upgrading esams to haproxy 3.2 (T421402) [09:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:44] T421402: Upgrade HAProxy to version 3.2 - https://phabricator.wikimedia.org/T421402 [09:12:09] !log remove /var/run/confd-template/_var_lib_gdnsd_discovery-k8s-ingress-aux-rw.state.err on dns5004 and restart confd [09:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:53] 10SRE-swift-storage, 06Commons: Commons file not found - https://phabricator.wikimedia.org/T413507#11803397 (10MatthewVernon) @Jeff_G please open new tickets when reporting new issues, unless you're really 100% sure you've got a recurrence of exactly the same issue again - it's really easy to merge tickets... [09:13:09] logs are good now on 5004 but it takes time for the alert to clear [09:13:12] (probably) [09:13:58] (03CR) 10Fabfur: [C:03+2] hiera: upgrade haproxy to version 3.2 on esams [puppet] - 10https://gerrit.wikimedia.org/r/1262065 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [09:14:05] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11803425 (10ABran-WMF) >>! In T420623#11803372, @jcrespo wrote: > CC @ABran-WMF @Jelto that backups from gerrit & gitlab (and attem... 
[09:15:51] !log remove /var/run/confd-template/_var_lib_gdnsd_discovery-k8s-ingress-aux-rw.state.err on affected dns servers and restart confd [09:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:18] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams - 3.2 upgrade (T421402) [09:16:25] !log fabfur@cumin1003 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams - 3.2 upgrade (T421402) [09:17:41] FIRING: [11x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-k8s-ingress-aux-rw.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [09:19:53] this is not true, only two remaining --^ [09:19:57] should be cleared soon [09:20:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:22:06] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1013.eqiad.wmnet with OS bullseye [09:22:16] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11803454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe1013.eqiad.... 
[09:22:41] RESOLVED: [11x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-k8s-ingress-aux-rw.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [09:23:19] 06SRE, 06Infrastructure-Foundations, 10netops: cr1-esams failed upgrade - https://phabricator.wikimedia.org/T422525#11803455 (10cmooney) [09:23:48] (03PS1) 10Fabfur: hiera: cleanup custom haproxy version [puppet] - 10https://gerrit.wikimedia.org/r/1269368 (https://phabricator.wikimedia.org/T421402) [09:25:03] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on P{ms-fe[1009-1012,1014-1024].eqiad.wmnet} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [09:25:58] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06DBA: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421719#11803462 (10MatthewVernon) [09:29:12] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1269368 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [09:29:43] (03CR) 10Elukey: Move linting to Ruff and apply code fixes (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1267058 (https://phabricator.wikimedia.org/T420475) (owner: 10Elukey) [09:32:33] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) (owner: 10Dpogorzelski) [09:32:34] (03CR) 10Clément Goubert: [C:03+1] Add Wikimedia REST API ?spec route for *.wikinews.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269099 (https://phabricator.wikimedia.org/T418318) (owner: 10Aaron Schulz) [09:33:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on 
P{ms-fe[1009-1012,1014-1024].eqiad.wmnet} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [09:33:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:35:43] (03PS2) 10Fabfur: hiera: cleanup custom haproxy version [puppet] - 10https://gerrit.wikimedia.org/r/1269368 (https://phabricator.wikimedia.org/T421402) [09:35:55] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1269368 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [09:36:53] (03PS1) 10Gkyziridis: ml-services: Deploy again edit-check model on experimental staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269369 [09:39:13] (03CR) 10Elukey: "Left a couple of comments but overall it looks very good. Have you already tested it on a node? amd-smi commands, the unit, etc.." [puppet] - 10https://gerrit.wikimedia.org/r/1269344 (https://phabricator.wikimedia.org/T420507) (owner: 10Dpogorzelski) [09:40:34] (03Abandoned) 10Daniel Kinzler: rest gateway: use IP as rate limit key for compliant bots [deployment-charts] - 10https://gerrit.wikimedia.org/r/1268520 (https://phabricator.wikimedia.org/T422471) (owner: 10Daniel Kinzler) [09:43:12] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db2157: Pooling in [09:44:43] (03CR) 10Kevin Bazira: ml-services: Deploy again edit-check model on experimental staging. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1269369 (owner: 10Gkyziridis) [09:46:20] (03CR) 10Clément Goubert: [C:03+1] rest gateway: introduce policy for Abstract Wikipedia [deployment-charts] - 10https://gerrit.wikimedia.org/r/1267122 (https://phabricator.wikimedia.org/T421581) (owner: 10Daniel Kinzler) [09:46:57]