[00:02:49] PROBLEM - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [00:04:53] RECOVERY - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [00:04:57] FIRING: [3x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:15:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:18:14] FIRING: JobUnavailable: Reduced availability for job gerrit-metrics in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:19:29] RESOLVED: JobUnavailable: Reduced availability for job gerrit-metrics in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:20:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:42:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:47:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:52:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:56:28] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [00:57:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:32:45] PROBLEM - Host ml-serve2001 is DOWN: PING CRITICAL - Packet loss = 100% [01:33:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:34:37] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:34:39] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:37:50] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:38:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:07:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:09:29] FIRING: [2x] JobUnavailable: Reduced availability for job gerrit in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:10:44] RESOLVED: [2x] JobUnavailable: Reduced availability for job gerrit in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:12:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:22:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:27:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:32:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:38:14] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:50:01] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:55:01] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:58:14] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:04:57] FIRING: [3x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:06:33] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:09:57] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:53:23] SRE-swift-storage, Wikimedia Enterprise: Commonswiki recently updated files not found - https://phabricator.wikimedia.org/T375797#10186429 (Pppery) The first 
image seems to have started existing somehow. Some job on the MediaWiki side got delayed? The second one is still missing. [04:06:33] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:09:57] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:56:29] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [05:37:50] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:40:22] (PS1) Stevemunene: Setup DPE Ceph alerts [alerts] - https://gerrit.wikimedia.org/r/1076460 (https://phabricator.wikimedia.org/T369583) [05:42:03] (CR) CI reject: [V:-1] Setup DPE Ceph alerts [alerts] - https://gerrit.wikimedia.org/r/1076460 (https://phabricator.wikimedia.org/T369583) (owner: Stevemunene) [05:47:28] (PS2) Stevemunene: Setup DPE Ceph alerts [alerts] - https://gerrit.wikimedia.org/r/1076460 (https://phabricator.wikimedia.org/T369583) [05:48:39] (CR) CI reject: [V:-1] Setup DPE Ceph alerts [alerts] - https://gerrit.wikimedia.org/r/1076460 (https://phabricator.wikimedia.org/T369583) (owner: Stevemunene) [05:54:58] (CR) Stevemunene: [C:+1] 
cloudnative-pg-cluster: allow the operator to reload a cluster secrets [deployment-charts] - https://gerrit.wikimedia.org/r/1076218 (https://phabricator.wikimedia.org/T375853) (owner: Brouberol) [05:56:57] (CR) Stevemunene: [C:+1] "lgtm!" [dns] - https://gerrit.wikimedia.org/r/1076206 (https://phabricator.wikimedia.org/T371208) (owner: Brouberol) [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter2006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter2006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:57:34] (PS1) KartikMistry: Section Translation: Add mos, kde and rsk Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1076559 (https://phabricator.wikimedia.org/T375017) [07:00:05] Amir1 and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240930T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:16:33] (PS1) Aqu: [analytics][webrequest] Extend refined webrequest retention to 180 days [puppet] - https://gerrit.wikimedia.org/r/1076563 (https://phabricator.wikimedia.org/T373630) [07:18:52] (CR) Aqu: [V:+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1076563 (https://phabricator.wikimedia.org/T373630) (owner: Aqu) [07:25:21] !log elukey@cumin1002 START - Cookbook sre.ganeti.makevm for new host aux-k8s-ctrl1003.eqiad.wmnet [07:25:22] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [07:27:22] (PS4) Volans: confctl: add native support for RO in conftool [software/spicerack] - https://gerrit.wikimedia.org/r/1055882 (https://phabricator.wikimedia.org/T362893) (owner: Giuseppe Lavagetto) [07:32:13] SRE, Infrastructure-Foundations, serviceops: Clean up the Docker Registry catalog and Swift storage from old images - https://phabricator.wikimedia.org/T375645#10186599 (elukey) The dry run took ~33 hrs and its log can be found on `registry1004:/home/elukey/docker_registry_dryrun_gc.log`. The blobs/l... 
[07:33:19] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-ctrl1003.eqiad.wmnet - elukey@cumin1002" [07:33:23] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-ctrl1003.eqiad.wmnet - elukey@cumin1002" [07:33:23] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:33:24] !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache aux-k8s-ctrl1003.eqiad.wmnet on all recursors [07:33:27] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-ctrl1003.eqiad.wmnet on all recursors [07:33:53] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-ctrl1003.eqiad.wmnet - elukey@cumin1002" [07:33:57] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-ctrl1003.eqiad.wmnet - elukey@cumin1002" [07:33:58] (CR) Brouberol: [C:+2] wikimedia.org: provision subdomains for our airflow instances [dns] - https://gerrit.wikimedia.org/r/1076206 (https://phabricator.wikimedia.org/T371208) (owner: Brouberol) [07:36:21] (CR) Brouberol: [C:+2] cloudnative-pg-cluster: allow the operator to reload a cluster secrets [deployment-charts] - https://gerrit.wikimedia.org/r/1076218 (https://phabricator.wikimedia.org/T375853) (owner: Brouberol) [07:38:08] (CR) Volans: [C:+2] confctl: add native support for RO in conftool [software/spicerack] - https://gerrit.wikimedia.org/r/1055882 (https://phabricator.wikimedia.org/T362893) (owner: Giuseppe Lavagetto) [07:45:12] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host 
aux-k8s-ctrl1003.eqiad.wmnet with OS bullseye [07:46:53] (PS2) Brouberol: cloudnative-pg-cluster: upsize the WAL storage volume to 15GB by default [deployment-charts] - https://gerrit.wikimedia.org/r/1076220 (https://phabricator.wikimedia.org/T375846) [07:46:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [07:47:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [07:49:08] (Merged) jenkins-bot: confctl: add native support for RO in conftool [software/spicerack] - https://gerrit.wikimedia.org/r/1055882 (https://phabricator.wikimedia.org/T362893) (owner: Giuseppe Lavagetto) [07:49:37] (PS2) Volans: dbctl: add new module to interact with dbctl [software/spicerack] - https://gerrit.wikimedia.org/r/1058586 (https://phabricator.wikimedia.org/T362893) [07:53:07] SRE, Infrastructure-Foundations: puppetserver* thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527#10186649 (elukey) Another very interesting panel is the thread list and their states: {F57571583} I am not 100% sure but all the threads managed by the JVM (JMX r... 
[07:57:34] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-ctrl1003.eqiad.wmnet with reason: host reimage [07:58:10] (CR) Brouberol: airflow: only change the configuration/secret checksum when actual config changes (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1076217 (https://phabricator.wikimedia.org/T375886) (owner: Brouberol) [07:58:23] (CR) Brouberol: airflow: only change the configuration/secret checksum when actual config changes (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1076217 (https://phabricator.wikimedia.org/T375886) (owner: Brouberol) [08:01:08] (PS1) Melos: Add namespace aliases for scn.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1076661 (https://phabricator.wikimedia.org/T375979) [08:01:22] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-ctrl1003.eqiad.wmnet with reason: host reimage [08:03:02] (CR) Volans: [C:+2] dbctl: add new module to interact with dbctl [software/spicerack] - https://gerrit.wikimedia.org/r/1058586 (https://phabricator.wikimedia.org/T362893) (owner: Volans) [08:10:12] FIRING: [3x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:12:31] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-ctrl1003.eqiad.wmnet with OS bullseye [08:12:31] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-ctrl1003.eqiad.wmnet [08:12:54] SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, LDAP-Access-Requests: Requesting access to `contint-admins`, `contint-docker`, LDAP `ciadmin` for 'Arthur taylor' - 
https://phabricator.wikimedia.org/T373969#10186664 (ArthurTaylor) Open→Resolved Seems to work - thanks f... [08:14:39] (CR) CI reject: [V:-1] dbctl: add new module to interact with dbctl [software/spicerack] - https://gerrit.wikimedia.org/r/1058586 (https://phabricator.wikimedia.org/T362893) (owner: Volans) [08:19:48] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1076661 (https://phabricator.wikimedia.org/T375979) (owner: Melos) [08:21:02] (CR) Gmodena: [C:+2] mw-page-content-change-enrich: enable calico network policies. [deployment-charts] - https://gerrit.wikimedia.org/r/1075520 (https://phabricator.wikimedia.org/T373195) (owner: Gmodena) [08:22:13] (Merged) jenkins-bot: mw-page-content-change-enrich: enable calico network policies. [deployment-charts] - https://gerrit.wikimedia.org/r/1075520 (https://phabricator.wikimedia.org/T373195) (owner: Gmodena) [08:23:36] (CR) Joal: [C:+1] "Looks good functionally. 
We need to keep in mind that when we'll change back to 90 days we'll need a manual run of the script with a chang" [puppet] - https://gerrit.wikimedia.org/r/1076563 (https://phabricator.wikimedia.org/T373630) (owner: Aqu) [08:23:53] !log gmodena@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [08:23:59] !log gmodena@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [08:24:57] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:26:16] !log gmodena@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [08:26:21] !log gmodena@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [08:28:07] (PS1) Hashar: gerrit: prevent ByteDance from crawling [puppet] - https://gerrit.wikimedia.org/r/1076663 (https://phabricator.wikimedia.org/T375996) [08:29:52] (PS2) Aqu: [analytics][webrequest] Extend retention for unique devices analysis [puppet] - https://gerrit.wikimedia.org/r/1076563 (https://phabricator.wikimedia.org/T373630) [08:30:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [08:30:26] (CR) Volans: [C:+2] dbctl: add new module to interact with dbctl [software/spicerack] - https://gerrit.wikimedia.org/r/1058586 (https://phabricator.wikimedia.org/T362893) (owner: Volans) [08:31:04] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: 
Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:34:43] (CR) Aqu: "@joal@wikimedia.org I've added extra datasets. They may be useful to their analysis." [puppet] - https://gerrit.wikimedia.org/r/1076563 (https://phabricator.wikimedia.org/T373630) (owner: Aqu) [08:35:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [08:36:47] (PS1) Jelto: profile::firewall: separate ipv4 and ipv6 in nftables BLOCKED_NETS [puppet] - https://gerrit.wikimedia.org/r/1076665 (https://phabricator.wikimedia.org/T348734) [08:36:48] (PS1) Jelto: sretest: test defs_from_etcd with new separate sets [puppet] - https://gerrit.wikimedia.org/r/1076666 (https://phabricator.wikimedia.org/T348734) [08:37:47] !log gmodena@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [08:37:50] !log gmodena@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [08:40:02] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:40:38] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:42:31] jouncebot: nowandnext [08:42:31] No deployments scheduled for the next 1 hour(s) and 17 minute(s) [08:42:32] In 1 hour(s) and 17 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240930T1000) [08:42:36] (Merged) jenkins-bot: dbctl: add new module to interact with dbctl [software/spicerack] - 
https://gerrit.wikimedia.org/r/1058586 (https://phabricator.wikimedia.org/T362893) (owner: Volans) [08:42:38] (CR) Ladsgroup: [C:+2] Remove metawiki dark mode exceptions [mediawiki-config] - https://gerrit.wikimedia.org/r/1072623 (owner: Ebrahim) [08:42:44] (CR) Jelto: [C:+1] "lgtm as an intermediate solution. I'd like to move that to the more general `abuse/blocked_nets` set in the future when T348734 is unblock" [puppet] - https://gerrit.wikimedia.org/r/1076663 (https://phabricator.wikimedia.org/T375996) (owner: Hashar) [08:43:28] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52629 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:43:33] (Merged) jenkins-bot: Remove metawiki dark mode exceptions [mediawiki-config] - https://gerrit.wikimedia.org/r/1072623 (owner: Ebrahim) [08:43:52] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:44:10] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1072623|Remove metawiki dark mode exceptions]] [08:46:05] (CR) Gmodena: [C:+2] mw-page-content-change-enrich: disable legacy network policies. 
[deployment-charts] - https://gerrit.wikimedia.org/r/1075529 (https://phabricator.wikimedia.org/T373195) (owner: Gmodena) [08:46:52] /12 [08:46:55] err :) [08:47:15] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1076671 [08:47:15] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1076671 (owner: TrainBranchBot) [08:47:20] (PS2) Hashar: gerrit: block some Gitiles crawlers [puppet] - https://gerrit.wikimedia.org/r/1076663 (https://phabricator.wikimedia.org/T375996) [08:49:28] (PS3) Gmodena: mw-page-content-change-enrich: disable legacy network policies. [deployment-charts] - https://gerrit.wikimedia.org/r/1075529 (https://phabricator.wikimedia.org/T373195) [08:50:02] (CR) Jelto: [C:+1] "lgtm for now but see comment above" [puppet] - https://gerrit.wikimedia.org/r/1076663 (https://phabricator.wikimedia.org/T375996) (owner: Hashar) [08:50:33] (CR) Gmodena: [V:+2 C:+2] mw-page-content-change-enrich: disable legacy network policies. [deployment-charts] - https://gerrit.wikimedia.org/r/1075529 (https://phabricator.wikimedia.org/T373195) (owner: Gmodena) [08:50:36] (CR) Jelto: [C:+2] "I'll test this on gerrit2002 first" [puppet] - https://gerrit.wikimedia.org/r/1076663 (https://phabricator.wikimedia.org/T375996) (owner: Hashar) [08:51:25] (CR) Btullis: [C:+1] "Nice." [deployment-charts] - https://gerrit.wikimedia.org/r/1076217 (https://phabricator.wikimedia.org/T375886) (owner: Brouberol) [08:51:28] (Merged) jenkins-bot: mw-page-content-change-enrich: disable legacy network policies. [deployment-charts] - https://gerrit.wikimedia.org/r/1075529 (https://phabricator.wikimedia.org/T373195) (owner: Gmodena) [08:51:55] (CR) Btullis: [C:+1] "Thanks." 
[deployment-charts] - https://gerrit.wikimedia.org/r/1076220 (https://phabricator.wikimedia.org/T375846) (owner: Brouberol) [08:55:59] (PS1) Volans: CHANGELOG: add changelogs for release v8.14.0 [software/spicerack] - https://gerrit.wikimedia.org/r/1076672 [08:56:29] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:56:34] !log ladsgroup@deploy2002 ebrahim, ladsgroup: Backport for [[gerrit:1072623|Remove metawiki dark mode exceptions]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:56:48] (CR) Brouberol: [C:+2] cloudnative-pg-cluster: upsize the WAL storage volume to 15GB by default [deployment-charts] - https://gerrit.wikimedia.org/r/1076220 (https://phabricator.wikimedia.org/T375846) (owner: Brouberol) [08:56:58] !log ladsgroup@deploy2002 ebrahim, ladsgroup: Continuing with sync [08:57:48] !log gmodena@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [08:57:54] !log gmodena@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [08:59:29] !log gmodena@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [08:59:34] !log gmodena@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:00:50] !log gmodena@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [09:00:53] !log gmodena@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:04:57] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status 
- https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:06:38] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1072623|Remove metawiki dark mode exceptions]] (duration: 22m 27s) [09:07:11] (CR) Volans: [C:+2] CHANGELOG: add changelogs for release v8.14.0 [software/spicerack] - https://gerrit.wikimedia.org/r/1076672 (owner: Volans) [09:07:45] (PS2) Brouberol: spark-operator: update base.certificate module to v2.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1076668 (https://phabricator.wikimedia.org/T365024) [09:09:37] (PS1) Elukey: Add aux-k8s-{ctrl,worker}1003 to AUX K8s [puppet] - https://gerrit.wikimedia.org/r/1076679 (https://phabricator.wikimedia.org/T344230) [09:10:22] (PS1) Gmodena: services: page-content-change-enrich: set deployment value. [deployment-charts] - https://gerrit.wikimedia.org/r/1076680 (https://phabricator.wikimedia.org/T368787) [09:11:04] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:12:26] (PS1) Elukey: Add aux-k8s-ctrl1003 to admin_ng's config for AUX [deployment-charts] - https://gerrit.wikimedia.org/r/1076681 (https://phabricator.wikimedia.org/T344230) [09:15:07] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4152/co" [puppet] - https://gerrit.wikimedia.org/r/1076666 (https://phabricator.wikimedia.org/T348734) (owner: Jelto) [09:16:25] (CR) Btullis: [C:+1] cloudnative-pg: add monitors for PG clusters [alerts] - https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284) (owner: Brouberol) [09:18:09] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v8.14.0 [software/spicerack] - https://gerrit.wikimedia.org/r/1076672 
(owner: 10Volans) [09:19:32] (03CR) 10Btullis: [C:03+1] "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1075605 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [09:19:41] (03CR) 10JMeybohm: [C:03+1] Add aux-k8s-{ctrl,worker}1003 to AUX K8s [puppet] - 10https://gerrit.wikimedia.org/r/1076679 (https://phabricator.wikimedia.org/T344230) (owner: 10Elukey) [09:20:46] (03CR) 10Ladsgroup: [C:03+1] mariadb: productionize db2223 [puppet] - 10https://gerrit.wikimedia.org/r/1075108 (https://phabricator.wikimedia.org/T373579) (owner: 10Arnaudb) [09:21:29] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1076671 (owner: 10TrainBranchBot) [09:21:29] (03CR) 10JMeybohm: "Are these still used somewhere? I'd say there are no more users of these values and they can instead be removed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076681 (https://phabricator.wikimedia.org/T344230) (owner: 10Elukey) [09:22:48] (03CR) 10JMeybohm: [C:03+1] Deprecate system::role for initial batch of serviceops services [puppet] - 10https://gerrit.wikimedia.org/r/1076160 (owner: 10Muehlenhoff) [09:24:29] (03CR) 10Elukey: "I wondered the same, not sure if we moved away from explicitly stating the k8s ips for AUX." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076681 (https://phabricator.wikimedia.org/T344230) (owner: 10Elukey) [09:26:43] (03CR) 10JMeybohm: "IIRC the tracing components initially used these hardcoded lists but since have been migrated to use calico policies as well. 
grep seems t" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076681 (https://phabricator.wikimedia.org/T344230) (owner: 10Elukey) [09:27:09] (03PS1) 10Volans: Upstream release v8.14.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1076687 [09:27:59] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014 (10MoritzMuehlenhoff) 03NEW [09:33:54] (03CR) 10Ladsgroup: "This adds an overhead of needing to maintain such list. We soon will add x3, then we need to add that to this list and if we forget, thing" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1074111 (owner: 10Volans) [09:34:15] 06SRE, 06Infrastructure-Foundations: eqiad: one VM for irc.wikimedia.org - https://phabricator.wikimedia.org/T376015 (10elukey) 03NEW [09:36:36] 06SRE, 06Infrastructure-Foundations: eqiad: one VM for irc.wikimedia.org - https://phabricator.wikimedia.org/T376015#10186995 (10elukey) ` +-------+-------+-----------+----------+-----------+---------+-----------+ | Group | Nodes | Instances | MFree | MFree avg | DFree | DFree avg | +-------+-------+-----... 
[09:38:05] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:40:58] (03CR) 10CI reject: [V:04-1] Upstream release v8.14.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1076687 (owner: 10Volans) [09:42:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:42:56] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:43:08] (03CR) 10Brouberol: [C:03+2] airflow: only change the configuration/secret checksum when actual config changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076217 (https://phabricator.wikimedia.org/T375886) (owner: 10Brouberol) [09:44:33] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:44:51] (03CR) 10Volans: "This list lives here because there wasn't any official list in puppet that could be dumped into a config file to be read. At the time ther" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1074111 (owner: 10Volans) [09:45:15] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:46:33] 10SRE-swift-storage, 06Commons: File not found: file not available on Commons - https://phabricator.wikimedia.org/T376013#10187048 (10Aklapper) @Yann: Please tag tasks about missing files with #sre-swift-storage as the #Commons community itself cannot fix this, I'm afraid. Thanks. 
[09:48:20] (03PS2) 10Btullis: Disable exposure warning for airflow webservers by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076213 (https://phabricator.wikimedia.org/T375739) [09:49:09] !log elukey@cumin1002 START - Cookbook sre.ganeti.makevm for new host irc1003.wikimedia.org [09:49:10] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [09:49:31] 06SRE, 06Infrastructure-Foundations: eqiad: one VM for irc.wikimedia.org - https://phabricator.wikimedia.org/T376015#10187072 (10MoritzMuehlenhoff) +1 [09:51:52] (03CR) 10Hashar: "The node job failed with:" [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1076671 (owner: 10TrainBranchBot) [09:55:44] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10187090 (10cmooney) >>! In T375847#10182667, @aborrero wrote: > I see the dhcp6 packets from my test VM arriving into neutron: > > `... [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240930T1000) [10:03:21] (03CR) 10Btullis: [C:03+2] Disable exposure warning for airflow webservers by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076213 (https://phabricator.wikimedia.org/T375739) (owner: 10Btullis) [10:04:28] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM irc1003.wikimedia.org - elukey@cumin1002" [10:04:44] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM irc1003.wikimedia.org - elukey@cumin1002" [10:04:44] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:04:44] !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache irc1003.wikimedia.org on all recursors 
[10:04:47] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) irc1003.wikimedia.org on all recursors [10:05:10] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM irc1003.wikimedia.org - elukey@cumin1002" [10:05:14] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM irc1003.wikimedia.org - elukey@cumin1002" [10:05:41] (03Merged) 10jenkins-bot: Disable exposure warning for airflow webservers by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076213 (https://phabricator.wikimedia.org/T375739) (owner: 10Btullis) [10:06:06] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10187153 (10cmooney) @aborrero the network assignment is incorrect also. 2a02:ec80:a100::/56 is the entire public IPv6 allocation for... [10:06:57] (03CR) 10Raimond Spekking: "Probably an import error from last week. 
Re-created manually: https://translatewiki.net/wiki/MediaWiki:Copyright/qqq" [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1076671 (owner: 10TrainBranchBot) [10:10:59] (03PS1) 10Elukey: Add irc1003 basic config [puppet] - 10https://gerrit.wikimedia.org/r/1076703 (https://phabricator.wikimedia.org/T376015) [10:11:00] !log Started time limited MediaModeration scan on enwiki - https://wikitech.wikimedia.org/wiki/MediaModeration [10:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:52] (03PS1) 10JMeybohm: k8s.upgrade-cluster: Black format and sort imports [cookbooks] - 10https://gerrit.wikimedia.org/r/1076705 (https://phabricator.wikimedia.org/T341984) [10:12:53] (03PS1) 10JMeybohm: k8s.upgrade-cluster: Support stacked hardware control planes [cookbooks] - 10https://gerrit.wikimedia.org/r/1076706 (https://phabricator.wikimedia.org/T341984) [10:13:21] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1076703 (https://phabricator.wikimedia.org/T376015) (owner: 10Elukey) [10:13:50] !log Restarted MediaModeration scanning script, starting it up again using mwscript-k8s - https://wikitech.wikimedia.org/wiki/MediaModeration [10:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:24] (03CR) 10Elukey: [C:03+2] Add irc1003 basic config [puppet] - 10https://gerrit.wikimedia.org/r/1076703 (https://phabricator.wikimedia.org/T376015) (owner: 10Elukey) [10:15:40] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host irc1003.wikimedia.org with OS bookworm [10:18:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:19:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:22:52] (03PS1) 10Slyngshede: P:ircstream: Replacement service for irc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1076708 
(https://phabricator.wikimedia.org/T376014) [10:23:12] (03CR) 10CI reject: [V:04-1] P:ircstream: Replacement service for irc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1076708 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [10:24:02] (03PS1) 10STran: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076709 (https://phabricator.wikimedia.org/T375881) [10:24:45] (03PS1) 10Brouberol: Copy meta_2.0.1 into meta_2.0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076710 [10:24:45] (03PS1) 10Brouberol: Release meta_2.0.2 that injects a configuration checksum annotation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076711 [10:25:01] (03CR) 10CI reject: [V:04-1] k8s.upgrade-cluster: Support stacked hardware control planes [cookbooks] - 10https://gerrit.wikimedia.org/r/1076706 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:27:35] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on irc1003.wikimedia.org with reason: host reimage [10:28:24] (03PS2) 10Slyngshede: P:ircstream: Replacement service for irc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1076708 (https://phabricator.wikimedia.org/T376014) [10:32:29] (03CR) 10Elukey: "Left two nits!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1076708 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [10:32:34] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on irc1003.wikimedia.org with reason: host reimage [10:34:15] (03CR) 10Kosta Harlan: [C:03+1] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076709 (https://phabricator.wikimedia.org/T375881) (owner: 10STran) [10:34:57] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: openstack: work out IPv6 and designate integration - https://phabricator.wikimedia.org/T374715#10187215 (10cmooney) Guys I would propose the following: * We delegate the allocated 'public' and 'private' ranges to the codf... [10:35:05] (03PS3) 10Slyngshede: P:ircstream: Replacement service for irc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1076708 (https://phabricator.wikimedia.org/T376014) [10:35:09] (03CR) 10Slyngshede: P:ircstream: Replacement service for irc.wikimedia.org (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1076708 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [10:36:08] !log upload ircstream 0.12.0~dev+wmf1 to apt.wikimedia.org T376014 [10:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:14] T376014: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014 [10:37:14] (03PS2) 10Volans: Upstream release v8.14.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1076687 [10:39:20] (03PS1) 10Cathal Mooney: Delegate IPv6 ranges allocated for WMCS Openstack networks in codfw [dns] - 10https://gerrit.wikimedia.org/r/1076713 (https://phabricator.wikimedia.org/T374715) [10:40:25] jouncebot: nowandnext [10:40:25] For the next 0 hour(s) and 19 minute(s): MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240930T1000) [10:40:26] In 2 hour(s) and 19 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240930T1300) [10:40:56] (03PS2) 10Cathal Mooney: Delegate IPv6 ranges allocated for WMCS Openstack networks in codfw [dns] - 10https://gerrit.wikimedia.org/r/1076713 (https://phabricator.wikimedia.org/T374715) [10:43:16] !log ladsgroup@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from eqiad to codfw [10:43:29] (03PS3) 10Cathal Mooney: Delegate IPv6 ranges allocated for WMCS Openstack networks in codfw [dns] - 10https://gerrit.wikimedia.org/r/1076713 (https://phabricator.wikimedia.org/T374715) [10:43:36] !log ladsgroup@cumin1002 END (FAIL) - Cookbook sre.switchdc.databases.finalize (exit_code=99) for the switch from eqiad to codfw [10:45:36] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host irc1003.wikimedia.org with OS bookworm [10:45:36] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host irc1003.wikimedia.org [10:45:38] !log ladsgroup@cumin1002 START - Cookbook sre.switchdc.databases.finalize for the switch from eqiad to codfw [10:47:36] (03CR) 10Cathal Mooney: [C:04-1] "Not to be merged until the WMCS servers are set up to respond to queries for these ranges." 
[dns] - 10https://gerrit.wikimedia.org/r/1076713 (https://phabricator.wikimedia.org/T374715) (owner: 10Cathal Mooney) [10:48:06] (03CR) 10STran: [C:03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076709 (https://phabricator.wikimedia.org/T375881) (owner: 10STran) [10:49:30] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076709 (https://phabricator.wikimedia.org/T375881) (owner: 10STran) [10:50:09] (03CR) 10Volans: [C:03+2] Upstream release v8.14.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1076687 (owner: 10Volans) [10:53:33] (03PS1) 10Zabe: Pin wgRevisionSlotsCacheExpiry to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076715 (https://phabricator.wikimedia.org/T183490) [10:53:58] (03CR) 10Muehlenhoff: "One nit inline, LGTM otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/1076708 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [10:54:10] (03PS1) 10Dreamy Jazz: Don't implement CheckUserQueryInterface in purgeOldData.php [extensions/CheckUser] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1076716 (https://phabricator.wikimedia.org/T376022) [10:54:29] jouncebot: nowandnext [10:54:29] For the next 0 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240930T1000) [10:54:29] In 2 hour(s) and 5 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240930T1300) [10:54:41] Going to backport shortly [10:55:37] (03CR) 10Dreamy Jazz: [C:03+2] Don't implement CheckUserQueryInterface in purgeOldData.php [extensions/CheckUser] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1076716 (https://phabricator.wikimedia.org/T376022) (owner: 10Dreamy Jazz) [10:55:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/CheckUser] 
(wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1076716 (https://phabricator.wikimedia.org/T376022) (owner: 10Dreamy Jazz) [10:57:06] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.switchdc.databases.finalize (exit_code=0) for the switch from eqiad to codfw [10:57:45] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4153/console" [puppet] - 10https://gerrit.wikimedia.org/r/1076708 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [10:58:26] My change will take a while to merge as a wmf branch backport. [10:58:30] (03PS2) 10Ladsgroup: pc2017: Set it to master [puppet] - 10https://gerrit.wikimedia.org/r/1075052 (https://phabricator.wikimedia.org/T374355) [10:58:38] (03CR) 10Ladsgroup: [V:03+2 C:03+2] pc2017: Set it to master [puppet] - 10https://gerrit.wikimedia.org/r/1075052 (https://phabricator.wikimedia.org/T374355) (owner: 10Ladsgroup) [10:58:46] !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [10:59:20] (03PS4) 10Slyngshede: P:ircstream: Replacement service for irc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1076708 (https://phabricator.wikimedia.org/T376014) [10:59:47] !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [11:00:31] !log stran@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [11:00:37] !log stran@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [11:00:47] !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [11:00:55] Dreamy_Jazz: is it okay if I squeeze in a config change while your CI is runnin? 
[11:01:33] zabe: Sure [11:01:48] (03CR) 10Zabe: [C:03+2] Pin wgRevisionSlotsCacheExpiry to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076715 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [11:01:53] (03CR) 10Zabe: [C:03+2] Move closed wikis to group0 except aawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076451 (owner: 10Zabe) [11:01:59] !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [11:02:34] (03Merged) 10jenkins-bot: Pin wgRevisionSlotsCacheExpiry to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076715 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [11:02:37] (03Merged) 10jenkins-bot: Move closed wikis to group0 except aawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076451 (owner: 10Zabe) [11:02:58] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1076715|Pin wgRevisionSlotsCacheExpiry to default (T183490)]], [[gerrit:1076451|Move closed wikis to group0 except aawiki]] [11:02:59] !log zabe@deploy2002 scap failed: Command '['docker', 'run', '--rm', '--attach', 'stdin', '--attach', 'stdout', '--attach', 'stderr', '--user', 'www-data', '--mount', 'type=bind,source=/srv/mediawiki-staging,target=/srv/mediawiki-staging', '--mount', 'type=bind,source=/tmp,target=/tmp', '--workdir', '/srv/mediawiki-staging', '--entrypoint', '/bin/bash', '--network', 'none', 'docker-registry.wikimedia. [11:02:59] org/php7.4-fpm-multiversion-base', '-c', 'rm -f /srv/mediawiki-staging/php-1.43.0-wmf.24/cache/l10n/*.tmp.*']' returned non-zero exit status 126. 
(scap version: 4.107.0-1) (duration: 00m 00s) [11:03:01] !log stran@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [11:03:04] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [11:03:10] eh [11:03:15] (03Merged) 10jenkins-bot: Upstream release v8.14.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1076687 (owner: 10Volans) [11:03:23] (03CR) 10Slyngshede: P:ircstream: Replacement service for irc.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1076708 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [11:03:34] second try [11:03:39] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1076715|Pin wgRevisionSlotsCacheExpiry to default (T183490)]], [[gerrit:1076451|Move closed wikis to group0 except aawiki]] [11:03:40] !log zabe@deploy2002 scap failed: Command '['docker', 'run', '--rm', '--attach', 'stdin', '--attach', 'stdout', '--attach', 'stderr', '--user', 'www-data', '--mount', 'type=bind,source=/srv/mediawiki-staging,target=/srv/mediawiki-staging', '--mount', 'type=bind,source=/tmp,target=/tmp', '--workdir', '/srv/mediawiki-staging', '--entrypoint', '/bin/bash', '--network', 'none', 'docker-registry.wikimedia. [11:03:40] org/php7.4-fpm-multiversion-base', '-c', 'rm -f /srv/mediawiki-staging/php-1.43.0-wmf.24/cache/l10n/*.tmp.*']' returned non-zero exit status 126. 
(scap version: 4.107.0-1) (duration: 00m 00s) [11:03:42] !log stran@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [11:03:44] okay [11:03:54] (03CR) 10Slyngshede: [C:03+2] P:ircstream: Replacement service for irc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1076708 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [11:04:37] !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [11:05:35] !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [11:08:10] Dreamy_Jazz: ^ fyi, scap currently crashes when I try to do a deploy [11:08:37] (03PS1) 10Volans: Add python3-conftool-dbctl depedency [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1076718 [11:14:01] Annoying [11:16:44] !log dreamyjazz@deploy2002 Started scap sync-world: Testing deployment using no-op deployment change [11:16:44] !log dreamyjazz@deploy2002 scap failed: Command '['docker', 'run', '--rm', '--attach', 'stdin', '--attach', 'stdout', '--attach', 'stderr', '--user', 'www-data', '--mount', 'type=bind,source=/srv/mediawiki-staging,target=/srv/mediawiki-staging', '--mount', 'type=bind,source=/tmp,target=/tmp', '--workdir', '/srv/mediawiki-staging', '--entrypoint', '/bin/bash', '--network', 'none', 'docker-registry.wiki [11:16:44] media.org/php7.4-fpm-multiversion-base', '-c', 'rm -f /srv/mediawiki-staging/php-1.43.0-wmf.24/cache/l10n/*.tmp.*']' returned non-zero exit status 126. (scap version: 4.107.0-1) (duration: 00m 23s) [11:16:48] (03CR) 10FNegri: "I think this patch corresponds to this item from the checklist in the Phab task:" [puppet] - 10https://gerrit.wikimedia.org/r/1076297 (https://phabricator.wikimedia.org/T343885) (owner: 10Majavah) [11:17:33] (03PS1) 10Slyngshede: P:ircstream: Allow incoming IRC connections. 
[puppet] - 10https://gerrit.wikimedia.org/r/1076720 (https://phabricator.wikimedia.org/T376014) [11:18:00] !log jnuche@deploy2002 Installing scap version "4.104.0" for 212 hosts [11:18:50] created https://phabricator.wikimedia.org/T376023 [11:19:14] Scap is currently being updated as we speak, so that might fix it> [11:19:17] *? [11:19:34] (03CR) 10Volans: [C:03+2] Add python3-conftool-dbctl depedency [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1076718 (owner: 10Volans) [11:23:44] zabe, Dreamy_Jazz: I've rolled scap back to a previous version which shouldn't have the issue [11:23:48] can you try the backport again? [11:23:50] yes [11:24:12] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1076715|Pin wgRevisionSlotsCacheExpiry to default (T183490)]], [[gerrit:1076451|Move closed wikis to group0 except aawiki]] [11:24:18] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [11:25:47] (03PS2) 10Slyngshede: P:ircstream: Firewall openings. 
[puppet] - 10https://gerrit.wikimedia.org/r/1076720 (https://phabricator.wikimedia.org/T376014) [11:26:43] !log zabe@deploy2002 zabe: Backport for [[gerrit:1076715|Pin wgRevisionSlotsCacheExpiry to default (T183490)]], [[gerrit:1076451|Move closed wikis to group0 except aawiki]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:26:54] seems to be working [11:27:04] !log zabe@deploy2002 zabe: Continuing with sync [11:27:11] :D [11:27:28] \o/ [11:27:54] (03CR) 10Muehlenhoff: "Looks good, minor bikeshedding inside" [puppet] - 10https://gerrit.wikimedia.org/r/1076720 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [11:28:03] (03Merged) 10jenkins-bot: Don't implement CheckUserQueryInterface in purgeOldData.php [extensions/CheckUser] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1076716 (https://phabricator.wikimedia.org/T376022) (owner: 10Dreamy Jazz) [11:29:05] (03CR) 10David Caro: "It's a mixture of the first and that one, but restricted to us only." [puppet] - 10https://gerrit.wikimedia.org/r/1076297 (https://phabricator.wikimedia.org/T343885) (owner: 10Majavah) [11:29:23] (03PS3) 10Slyngshede: P:ircstream: Firewall openings. [puppet] - 10https://gerrit.wikimedia.org/r/1076720 (https://phabricator.wikimedia.org/T376014) [11:30:57] (03CR) 10Slyngshede: P:ircstream: Firewall openings. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1076720 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [11:31:43] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1076715|Pin wgRevisionSlotsCacheExpiry to default (T183490)]], [[gerrit:1076451|Move closed wikis to group0 except aawiki]] (duration: 07m 30s) [11:31:49] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [11:31:53] Dreamy_Jazz: finally over to you :) [11:32:02] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1076716|Don't implement CheckUserQueryInterface in purgeOldData.php (T376022)]] [11:32:08] T376022: Unable to run purgeOldData.php on production - https://phabricator.wikimedia.org/T376022 [11:32:10] (03CR) 10Muehlenhoff: "Two additional syntax fixes are needed for firewall::service" [puppet] - 10https://gerrit.wikimedia.org/r/1076720 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [11:32:16] Thanks! [11:33:18] (03Merged) 10jenkins-bot: Add python3-conftool-dbctl depedency [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1076718 (owner: 10Volans) [11:33:31] (03PS4) 10Slyngshede: P:ircstream: Firewall openings. [puppet] - 10https://gerrit.wikimedia.org/r/1076720 (https://phabricator.wikimedia.org/T376014) [11:34:02] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1076716|Don't implement CheckUserQueryInterface in purgeOldData.php (T376022)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:34:04] (03PS5) 10Slyngshede: P:ircstream: Firewall openings. [puppet] - 10https://gerrit.wikimedia.org/r/1076720 (https://phabricator.wikimedia.org/T376014) [11:34:07] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [11:34:12] (03CR) 10Slyngshede: P:ircstream: Firewall openings. 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1076720 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [11:37:28] (03CR) 10Hashar: "Thank you Raimond!" [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1076671 (owner: 10TrainBranchBot) [11:37:31] !log uploaded spicerack_8.14.0 to apt.wikimedia.org bullseye-wikimedia [11:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:37] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1076716|Don't implement CheckUserQueryInterface in purgeOldData.php (T376022)]] (duration: 06m 35s) [11:38:40] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1076671 (owner: 10TrainBranchBot) [11:38:43] T376022: Unable to run purgeOldData.php on production - https://phabricator.wikimedia.org/T376022 [11:44:53] (03Abandoned) 10Clément Goubert: mediawiki: Move job spec for reuse [deployment-charts] - 10https://gerrit.wikimedia.org/r/1071864 (owner: 10Clément Goubert) [11:46:43] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1076721 [11:46:43] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1076721 (owner: 10TrainBranchBot) [11:47:07] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1076720 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [11:56:32] !log Running `foreachwiki extensions/CheckUser/maintenance/purgeOldData.php` on a tmux session [11:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:09] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1076721 (owner: 10TrainBranchBot) [11:58:09] (03PS2) 
10JMeybohm: k8s.upgrade-cluster: Support stacked hardware control planes [cookbooks] - 10https://gerrit.wikimedia.org/r/1076706 (https://phabricator.wikimedia.org/T341984) [12:01:57] (03CR) 10Jaime Nuche: "recheck" [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1076721 (owner: 10TrainBranchBot) [12:10:52] (03PS2) 10Muehlenhoff: Remove irc1001/irc2001 from mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050261 (https://phabricator.wikimedia.org/T331702) [12:13:42] (03PS3) 10Muehlenhoff: Remove irc1001/irc2001 from mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050261 (https://phabricator.wikimedia.org/T331702) [12:13:48] (03CR) 10Raimond Spekking: "Thank you for your patch, Antoine!" [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1076671 (owner: 10TrainBranchBot) [12:14:16] (03CR) 10Slyngshede: [C:03+2] P:ircstream: Firewall openings. [puppet] - 10https://gerrit.wikimedia.org/r/1076720 (https://phabricator.wikimedia.org/T376014) (owner: 10Slyngshede) [12:17:32] (03PS1) 10Muehlenhoff: Add irc1003 to servers which receive UDP broadcast events [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076730 (https://phabricator.wikimedia.org/T376014) [12:18:42] (03CR) 10Slyngshede: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050261 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [12:27:54] (03CR) 10Joal: [C:03+1] "Thanks @aquhen@wikimedia.org!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1076563 (https://phabricator.wikimedia.org/T373630) (owner: 10Aqu) [12:29:25] (03PS1) 10Muehlenhoff: Default ircstream to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1076736 (https://phabricator.wikimedia.org/T376014) [12:30:10] (03PS1) 10Esanders: Enable DiscussionTools auto subscriptions for all interfaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076737 (https://phabricator.wikimedia.org/T290778) [12:30:39] (03CR) 10Joal: [C:04-1] "Actually we need to do an analysis about available HDFS space before merging this. pushing a -1 to make sure we don't forget." [puppet] - 10https://gerrit.wikimedia.org/r/1076563 (https://phabricator.wikimedia.org/T373630) (owner: 10Aqu) [12:31:36] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1076736 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [12:32:08] (03CR) 10Muehlenhoff: [C:03+2] Default ircstream to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1076736 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [12:40:34] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1076721 (owner: 10TrainBranchBot) [12:43:16] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1176.eqiad.wmnet [12:46:29] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1176.eqiad.wmnet [12:47:58] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1176.eqiad.wmnet [12:48:19] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1177.eqiad.wmnet [12:48:43] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1076741 [12:51:18] (03PS4) 10Gmodena: 
dse-k8s-services: dump-reconcile: add kafka-test brokers to flink app. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075931 (https://phabricator.wikimedia.org/T368787) [12:53:03] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1177.eqiad.wmnet [12:54:19] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1177.eqiad.wmnet [12:55:38] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1177.eqiad.wmnet [12:56:29] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240930T1300). [13:00:05] sfaci and Melos: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:05:12] FIRING: [3x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:06:01] (03CR) 10JMeybohm: [C:03+1] Copy meta_2.0.1 into meta_2.0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076710 (owner: 10Brouberol) [13:06:10] (03CR) 10JMeybohm: [C:03+1] Release meta_2.0.2 that injects a configuration checksum annotation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076711 (owner: 10Brouberol) [13:15:46] 06SRE, 06Infrastructure-Foundations: eqiad: one VM for irc.wikimedia.org - https://phabricator.wikimedia.org/T376015#10187547 (10elukey) 05Open→03Resolved a:03elukey [13:16:03] (03CR) 10Brouberol: [C:03+2] Copy meta_2.0.1 into meta_2.0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076710 (owner: 10Brouberol) [13:16:07] (03CR) 10Brouberol: [C:03+2] Release meta_2.0.2 that injects a configuration checksum annotation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076711 (owner: 10Brouberol) [13:17:04] (03Merged) 10jenkins-bot: Copy meta_2.0.1 into meta_2.0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076710 (owner: 10Brouberol) [13:17:05] (03Merged) 10jenkins-bot: Release meta_2.0.2 that injects a configuration checksum annotation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076711 (owner: 10Brouberol) [13:20:25] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1076741 (owner: 10TrainBranchBot) [13:23:14] (03PS8) 10Brouberol: cloudnative-pg: add monitors for PG clusters [alerts] - 10https://gerrit.wikimedia.org/r/1067338 (https://phabricator.wikimedia.org/T372284) [13:26:30] 06SRE, 06Infrastructure-Foundations, 06serviceops: Clean up the Docker Registry catalog and Swift storage from old images - 
https://phabricator.wikimedia.org/T375645#10187584 (10elukey) Some stats: ` swift_account_stats_objects_total{account="AUTH_docker", cluster="swift", instance="ms-fe1009:9112", job="st... [13:27:54] (03CR) 10Elukey: [C:03+1] Add irc1003 to servers which receive UDP broadcast events [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076730 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [13:28:07] jouncebot: next [13:28:08] In 2 hour(s) and 1 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240930T1530) [13:28:53] (03PS2) 10Brouberol: airflow: automatically inject the configuration checksum annotation on deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076755 (https://phabricator.wikimedia.org/T375886) [13:32:44] (03PS5) 10Gmodena: dse-k8s-services: dump-reconcile: add kafka-test brokers to flink app. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075931 (https://phabricator.wikimedia.org/T368787) [13:32:51] (03PS6) 10Gmodena: dse-k8s-services: dump-reconcile: add kafka-test brokers to flink app. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075931 (https://phabricator.wikimedia.org/T368787) [13:33:20] (03CR) 10Xcollazo: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075931 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [13:33:20] (03CR) 10JMeybohm: [C:03+1] Add irc1003 to servers which receive UDP broadcast events [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076730 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [13:34:09] (03CR) 10Gmodena: [C:03+2] dse-k8s-services: dump-reconcile: add kafka-test brokers to flink app. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1075931 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [13:35:25] (03Merged) 10jenkins-bot: dse-k8s-services: dump-reconcile: add kafka-test brokers to flink app. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1075931 (https://phabricator.wikimedia.org/T368787) (owner: 10Gmodena) [13:37:44] 06SRE, 06Infrastructure-Foundations, 06serviceops: Clean up the Docker Registry catalog and Swift storage from old images - https://phabricator.wikimedia.org/T375645#10187610 (10akosiaris) Nice job! So, 75k is a not insignificant but also a not particularly large percentage. ~6%. I have my doubts it will su... [13:38:20] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:40:59] 7~/52 [13:41:02] :) [13:44:21] (03PS1) 10Zabe: group1: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076760 (https://phabricator.wikimedia.org/T183490) [13:49:31] (03CR) 10Elukey: [C:03+2] Add irc1003 to servers which receive UDP broadcast events [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076730 (https://phabricator.wikimedia.org/T376014) (owner: 10Muehlenhoff) [13:51:09] !log elukey@deploy2002 Started scap sync-world: Allow udp traffic to irc1003 [13:52:09] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [13:52:15] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich-next: apply [13:52:39] !log elukey@deploy2002 Finished scap sync-world: Allow udp traffic to irc1003 (duration: 02m 04s) [13:53:20] (03CR) 10Stevemunene: [C:03+2] hdfs: Add new worker hosts to net_topology [puppet] - 10https://gerrit.wikimedia.org/r/1072660 (https://phabricator.wikimedia.org/T353788) (owner: 10Stevemunene) [13:53:39] (03CR) 10Vgutierrez: varnish: Give 1% of views RSA cert warnings 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1072590 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [13:53:41] (03CR) 10Stevemunene: [C:03+2] hdfs: Assign the worker role to new hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/1072661 (https://phabricator.wikimedia.org/T353788) (owner: 10Stevemunene) [13:59:47] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [13:59:51] (03PS1) 10Btullis: Allow overriding the airflow executor pod image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076764 (https://phabricator.wikimedia.org/T375895) [13:59:52] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [14:00:17] (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs::metricsinfra::haproxy: migrate to HAProxy internal exporter [puppet] - 10https://gerrit.wikimedia.org/r/1076297 (https://phabricator.wikimedia.org/T343885) (owner: 10Majavah) [14:00:27] (03PS2) 10Btullis: Allow overriding the airflow executor pod image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076764 (https://phabricator.wikimedia.org/T375895) [14:00:53] 10ops-codfw, 06SRE, 06DC-Ops: Port with no description on access switch - https://phabricator.wikimedia.org/T375908#10187678 (10Jhancock.wm) this is sretest2002 [14:01:21] (03CR) 10CI reject: [V:04-1] Allow overriding the airflow executor pod image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076764 (https://phabricator.wikimedia.org/T375895) (owner: 10Btullis) [14:01:52] !log upgraded spicerack to 8.14.0 on cumin2002 [14:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:25] (03PS3) 10Btullis: Allow overriding the airflow executor pod image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076764 (https://phabricator.wikimedia.org/T375895) [14:04:02] 10ops-codfw, 06SRE, 06DC-Ops, 
06Infrastructure-Foundations, 10netops: Move sretest2002 primary uplink to asw-d4-codfw - https://phabricator.wikimedia.org/T370475#10187693 (10Jhancock.wm) 05Open→03Resolved taken care of! [14:05:31] (03PS1) 10Majavah: P:wmcs::metricsinfra::haproxy: Fix duplicate frontend [puppet] - 10https://gerrit.wikimedia.org/r/1076766 [14:05:32] 10ops-codfw, 06SRE, 06DC-Ops: Port with no description on access switch - https://phabricator.wikimedia.org/T375908#10187699 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:05:58] 06SRE, 06Infrastructure-Foundations, 06serviceops: Clean up the Docker Registry catalog and Swift storage from old images - https://phabricator.wikimedia.org/T375645#10187719 (10elukey) Hi @akosiaris! I had a chat with Janis and there are some details that need to be highlighted: * I didn't run it with `--... [14:06:13] 06SRE, 06Infrastructure-Foundations, 06serviceops: Clean up the Docker Registry catalog and Swift storage from old images - https://phabricator.wikimedia.org/T375645#10187722 (10elukey) Started: ` elukey@registry1004:~$ time sudo -u docker-registry /usr/bin/docker-registry garbage-collect --dry-run /etc/dock... [14:07:13] (03CR) 10Vgutierrez: "any performance/memory usage considerations here?" [puppet] - 10https://gerrit.wikimedia.org/r/1075633 (https://phabricator.wikimedia.org/T371144) (owner: 10CDanis) [14:07:27] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1076768 (https://phabricator.wikimedia.org/T344171) [14:07:30] (03CR) 10Majavah: [C:03+2] P:wmcs::metricsinfra::haproxy: Fix duplicate frontend [puppet] - 10https://gerrit.wikimedia.org/r/1076766 (owner: 10Majavah) [14:09:22] (03CR) 10CDanis: "great question!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1075633 (https://phabricator.wikimedia.org/T371144) (owner: 10CDanis) [14:09:57] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:10:13] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1076768 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [14:10:16] (03CR) 10Vgutierrez: "let's limit the experiment to eqsin first and see what happens if that's ok with you" [puppet] - 10https://gerrit.wikimedia.org/r/1075633 (https://phabricator.wikimedia.org/T371144) (owner: 10CDanis) [14:11:28] (03PS1) 10Elukey: Add ircstream.wikimedia.org as CNAME to irc1003 [dns] - 10https://gerrit.wikimedia.org/r/1076770 (https://phabricator.wikimedia.org/T376014) [14:11:46] (03CR) 10CI reject: [V:04-1] Add ircstream.wikimedia.org as CNAME to irc1003 [dns] - 10https://gerrit.wikimedia.org/r/1076770 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [14:12:16] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:12:23] (03CR) 10Elukey: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1076770 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [14:12:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-hd100[4-5] - https://phabricator.wikimedia.org/T372511#10187795 (10Jclark-ctr) a:03Jclark-ctr [14:13:00] (03PS2) 10Zabe: s3: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076760 (https://phabricator.wikimedia.org/T183490) [14:13:40] (03CR) 10CI reject: [V:04-1] s3: Reduce 
revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076760 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [14:13:58] (03PS3) 10Brouberol: spark-operator: update base.certificate module to v2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076668 (https://phabricator.wikimedia.org/T365024) [14:14:28] (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1076768 (https://phabricator.wikimedia.org/T344171) [14:14:28] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1076768 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [14:15:15] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1076770 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [14:15:32] (03CR) 10Elukey: [C:03+2] Add ircstream.wikimedia.org as CNAME to irc1003 [dns] - 10https://gerrit.wikimedia.org/r/1076770 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [14:16:22] (03CR) 10Brouberol: [C:03+1] "Nice" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076764 (https://phabricator.wikimedia.org/T375895) (owner: 10Btullis) [14:17:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-hd100[4-5] - https://phabricator.wikimedia.org/T372511#10187816 (10Jclark-ctr) logging-hd1004 e6 u32 port 32 cableid 2300304500196 logging-hd1005 f6 u33 port 33 cableid 230304500195 [14:17:58] (03PS3) 10Zabe: s3: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076760 (https://phabricator.wikimedia.org/T183490) [14:18:05] (03PS4) 10Brouberol: spark-operator: update base.certificate module to v2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076668 (https://phabricator.wikimedia.org/T365024) [14:22:28] (03PS1) 10Ladsgroup: tables-catalog: Catalog tables of FlaggedRevs and ProofreadPage [puppet] - 10https://gerrit.wikimedia.org/r/1076774 
(https://phabricator.wikimedia.org/T363581) [14:23:01] (03PS5) 10Brouberol: spark-operator: update base.certificate module to v2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076668 (https://phabricator.wikimedia.org/T365024) [14:27:18] (03PS3) 10CDanis: experiment w/ externalIPs on staging-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1076768 (https://phabricator.wikimedia.org/T344171) [14:27:32] (03CR) 10Ladsgroup: [C:03+1] s3: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076760 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [14:28:07] (03CR) 10CDanis: "https://puppet-compiler.wmflabs.org/output/1076768/1964/" [puppet] - 10https://gerrit.wikimedia.org/r/1076768 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [14:30:21] 10ops-codfw, 06DC-Ops: Port with no description on access switch - https://phabricator.wikimedia.org/T376035 (10phaultfinder) 03NEW [14:31:06] <_joe_> !log uploaded conftool 3.3.0 to apt [14:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:26] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [14:34:22] (03PS1) 10Bking: dse-k8s-services: add airflow helmfile directory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076775 (https://phabricator.wikimedia.org/T374948) [14:34:47] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt logging-hd1005 - jclark@cumin1002" [14:35:03] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [14:35:04] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [14:35:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: added network and mgmt logging-hd1005 - jclark@cumin1002" [14:35:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:35:42] <_joe_> !log running requestctl --debug upgrade-schema [14:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:17] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host logging-hd1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:36:19] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host logging-hd1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:36:45] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host logging-hd1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:36:51] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host logging-hd1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:37:58] 10ops-codfw, 06DC-Ops: Port with no description on access switch - https://phabricator.wikimedia.org/T376035#10187873 (10Jhancock.wm) device was offlined. 
waiting for alert to clear before closing ticket [14:38:15] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:50] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host logging-hd2004 [14:38:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host logging-hd2004 [14:39:44] (03PS1) 10Elukey: role::ircstream: add template for config file plus basic settings [puppet] - 10https://gerrit.wikimedia.org/r/1076776 (https://phabricator.wikimedia.org/T376014) [14:39:49] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host logging-hd2005 [14:39:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host logging-hd2005 [14:39:57] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:40:54] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4154/console" [puppet] - 10https://gerrit.wikimedia.org/r/1076776 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [14:41:16] <_joe_> !log updating conftool to 3.3.0 [14:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:25] (03PS2) 10Elukey: role::ircstream: add template for config file plus basic settings [puppet] - 10https://gerrit.wikimedia.org/r/1076776 (https://phabricator.wikimedia.org/T376014) [14:41:45] (03PS1) 10Herron: add otel-cli component and 
ensure in opentelemetry::collector [puppet] - 10https://gerrit.wikimedia.org/r/1076777 [14:42:05] (03PS3) 10Elukey: role::ircstream: add template for config file plus basic settings [puppet] - 10https://gerrit.wikimedia.org/r/1076776 (https://phabricator.wikimedia.org/T376014) [14:42:15] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:42:26] (03CR) 10CDanis: [C:03+1] add otel-cli component and ensure in opentelemetry::collector [puppet] - 10https://gerrit.wikimedia.org/r/1076777 (owner: 10Herron) [14:43:45] (03CR) 10Brouberol: "I don't think this is going to work. The way I see it, we _need_ to have each airflow instance with its own directory, helmfile and values" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076775 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [14:45:02] !log force a run of upload_puppet_facts.service on puppetserver1001 to pick up new facts/hosts [14:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:20] (03CR) 10Herron: [C:03+2] "that was fast! thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1076777 (owner: 10Herron) [14:47:29] (03CR) 10Brouberol: "Sorry, let me try to make my comment clearer." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1076775 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [14:49:20] (03PS1) 10Andrea Denisse: mirrors: Update Tails mirror addresses due to domain migration [puppet] - 10https://gerrit.wikimedia.org/r/1076778 [14:51:23] !log swfrench@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=shellbox-video,name=codfw [reason: Pooling shellbox-video in codfw before (re)pooling eqiad on Wednesday - T370962] [14:51:33] T370962: Southward Datacenter Switchover (September 2024) - https://phabricator.wikimedia.org/T370962 [14:54:18] (03CR) 10Muehlenhoff: "Looks good, bikeshedding inside." [puppet] - 10https://gerrit.wikimedia.org/r/1076776 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [14:54:29] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Seanleong-WMDE - https://phabricator.wikimedia.org/T376034#10187916 (10Pppery) [14:55:12] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host logging-hd1004.eqiad.wmnet with OS bookworm [14:55:16] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host logging-hd1005.eqiad.wmnet with OS bookworm [14:55:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-hd100[4-5] - https://phabricator.wikimedia.org/T372511#10187922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host logging-hd1004.eqiad.wmnet with OS bookworm [14:55:20] (03PS1) 10Mforns: Bump up commons-impact-analytics service version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076779 (https://phabricator.wikimedia.org/T368035) [14:55:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-hd100[4-5] - https://phabricator.wikimedia.org/T372511#10187923 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host logging-hd1005.eqiad.wmnet with OS bookworm [14:58:15] RESOLVED: JobUnavailable: 
Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:35] (03CR) 10Santiago Faci: [C:03+2] "Looks good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076779 (https://phabricator.wikimedia.org/T368035) (owner: 10Mforns) [15:00:46] (03Merged) 10jenkins-bot: Bump up commons-impact-analytics service version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076779 (https://phabricator.wikimedia.org/T368035) (owner: 10Mforns) [15:02:47] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Management routers to 22.4R3-S2 - https://phabricator.wikimedia.org/T369504#10187945 (10Papaul) [15:03:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-hd100[4-5] - https://phabricator.wikimedia.org/T372511#10187950 (10Jclark-ctr) [15:06:30] (03PS1) 10Muehlenhoff: Use nftables for ircstream [puppet] - 10https://gerrit.wikimedia.org/r/1076781 [15:06:51] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4155/console" [puppet] - 10https://gerrit.wikimedia.org/r/1076776 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [15:08:06] !log mforns@deploy2002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [15:08:17] !log mforns@deploy2002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [15:10:41] (03CR) 10Ssingh: [C:03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1076778 (owner: 10Andrea Denisse) [15:11:37] !log marostegui@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dbproxy2005.codfw.wmnet with OS bookworm [15:11:47] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#10187994 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host dbproxy2005.codfw.wmnet with OS bookworm executed wi... [15:12:09] (03CR) 10Ssingh: "This is ready for review." [cookbooks] - 10https://gerrit.wikimedia.org/r/1075920 (owner: 10Ssingh) [15:12:33] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#10188014 (10Volans) The above was me aborting the leftover execution of the cookbook that had been left waiting for user input. [15:12:53] !log upgraded spicerack to 8.14.0 on cumin1002 [15:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:16] (03CR) 10Elukey: [V:03+1] role::ircstream: add template for config file plus basic settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1076776 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [15:16:38] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Management routers to 22.4R3-S2 - https://phabricator.wikimedia.org/T369504#10188045 (10ssingh) I am assuming this means a site depool during this period? 
[15:18:27] (03PS2) 10BPirkle: REST: Make experimental endpoints available on beta and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076058 (https://phabricator.wikimedia.org/T375512) [15:19:06] (03CR) 10CI reject: [V:04-1] REST: Make experimental endpoints available on beta and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076058 (https://phabricator.wikimedia.org/T375512) (owner: 10BPirkle) [15:19:16] (03CR) 10JMeybohm: [C:03+1] experiment w/ externalIPs on staging-codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1076768 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [15:19:33] (03CR) 10Muehlenhoff: role::ircstream: add template for config file plus basic settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1076776 (https://phabricator.wikimedia.org/T376014) (owner: 10Elukey) [15:20:06] (03PS3) 10BPirkle: REST: Make experimental endpoints available on beta and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076058 (https://phabricator.wikimedia.org/T375512) [15:20:53] (03CR) 10CDanis: [C:03+2] experiment w/ externalIPs on staging-codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1076768 (https://phabricator.wikimedia.org/T344171) (owner: 10CDanis) [15:21:32] (03PS1) 10Mforns: hieradata::services_proxy::envoy.yaml: enable data-gateway listener [puppet] - 10https://gerrit.wikimedia.org/r/1076784 (https://phabricator.wikimedia.org/T368035) [15:21:34] (03PS1) 10Majavah: hieradata: Remove data for clouddb-services [puppet] - 10https://gerrit.wikimedia.org/r/1076785 [15:25:24] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [15:26:38] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-hd1004.eqiad.wmnet with reason: host reimage [15:26:57] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host mc-misc1001 [15:27:03] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
logging-hd1005.eqiad.wmnet with reason: host reimage [15:27:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mc-misc1001 [15:27:08] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host mc-misc1002 [15:27:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mc-misc1002 [15:27:37] 06SRE, 07SRE-Unowned, 06Infrastructure-Foundations, 13Patch-For-Review: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10188117 (10elukey) First day of hackathon summary: * After a chat with Faidon, we disc... [15:28:45] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt mc-misc - jclark@cumin1002" [15:28:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt mc-misc - jclark@cumin1002" [15:28:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:30:04] jan_drewniak: Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240930T1530). Please do the needful. [15:31:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:32:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-hd1004.eqiad.wmnet with reason: host reimage [15:32:20] (03PS2) 10Ladsgroup: tables-catalog: Catalog tables of FlaggedRevs and ProofreadPage [puppet] - 10https://gerrit.wikimedia.org/r/1076774 (https://phabricator.wikimedia.org/T363581) [15:32:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[15:32:25] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Catalog tables of FlaggedRevs and ProofreadPage [puppet] - 10https://gerrit.wikimedia.org/r/1076774 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [15:32:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 01 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075189 (https://phabricator.wikimedia.org/T372460) (owner: 10Abijeet Patro) [15:33:35] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Management routers to 22.4R3-S2 - https://phabricator.wikimedia.org/T369504#10188140 (10Papaul) @ssingh no site depool only management will not be available during the maintenance [15:34:01] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Management routers to 22.4R3-S2 - https://phabricator.wikimedia.org/T369504#10188142 (10ssingh) >>! In T369504#10188140, @Papaul wrote: > @ssingh no site depool only management will not be available during the maintenance Thanks Papaul! [15:34:11] (03CR) 10Ssingh: [C:03+2] P:dns::recursor: set allow_extended_errors to true [puppet] - 10https://gerrit.wikimedia.org/r/1075062 (https://phabricator.wikimedia.org/T375414) (owner: 10Ssingh) [15:35:05] (03PS2) 10Andrea Denisse: mirrors: Update Tails mirror addresses due to domain migration [puppet] - 10https://gerrit.wikimedia.org/r/1076778 [15:35:41] (03Abandoned) 10Bking: dse-k8s-services: add airflow helmfile directory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076775 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [15:35:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-hd1005.eqiad.wmnet with reason: host reimage [15:38:33] (03CR) 10Andrea Denisse: "I've verified the rsync domain and updated the address in my patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/1076778 (owner: 10Andrea Denisse) [15:38:59] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [15:39:34] (03CR) 10Jelto: [C:03+1] "Antoine can you also take a brief look here? Do you think it's fine to remove the rsa key?" [puppet] - 10https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [15:39:51] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host mc-misc1001 [15:39:53] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mc-misc1001 [15:39:59] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host mc-misc1002 [15:40:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mc-misc1002 [15:40:38] (03PS3) 10Andrea Denisse: mirrors: Update Tails mirror addresses due to domain migration [puppet] - 10https://gerrit.wikimedia.org/r/1076778 [15:41:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:41:32] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host mc-misc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:41:45] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc-misc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:42:04] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host mc-misc1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:42:07] (03CR) 10Elukey: "Did a first pass, everything looks good but I added some comments to discuss if it is good or not (long term) to fetch profile data from c" [puppet] - 10https://gerrit.wikimedia.org/r/1075026 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [15:43:27] FIRING: [2x] SystemdUnitCrashLoop: logstash.service crashloop on elastic2064:9100 - TODO - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:43:36] (03CR) 10Dzahn: [C:03+1] "lgtm, the old URL redirects to the new one in a browser. Curious, how did you find out about this? Did they email us?" [puppet] - 10https://gerrit.wikimedia.org/r/1076778 (owner: 10Andrea Denisse) [15:44:39] (03CR) 10Andrea Denisse: "Yes, they sent an email to noc. I'll let the Tails team know about the update, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1076778 (owner: 10Andrea Denisse) [15:44:49] (03CR) 10Andrea Denisse: [C:03+2] mirrors: Update Tails mirror addresses due to domain migration [puppet] - 10https://gerrit.wikimedia.org/r/1076778 (owner: 10Andrea Denisse) [15:45:25] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mc-misc1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:48:18] (03CR) 10Btullis: [C:03+2] Allow overriding the airflow executor pod image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076764 (https://phabricator.wikimedia.org/T375895) (owner: 10Btullis) [15:48:27] FIRING: [15x] SystemdUnitCrashLoop: logstash.service crashloop on elastic2055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:49:57] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [15:50:02] (03Merged) 10jenkins-bot: Allow overriding the airflow executor pod image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076764 (https://phabricator.wikimedia.org/T375895) (owner: 10Btullis) [15:51:44] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:52:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox 
hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:52:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-hd1004.eqiad.wmnet with OS bookworm [15:52:22] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-hd100[4-5] - https://phabricator.wikimedia.org/T372511#10188223 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host logging-hd1004.eqiad.wmnet with OS bookworm completed: - logging-hd10... [15:52:44] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4156/console" [puppet] - 10https://gerrit.wikimedia.org/r/1075153 (owner: 10Giuseppe Lavagetto) [15:52:59] (03PS1) 10Scott French: php8.1: rebuild to pick up 8.1.30 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1076783 (https://phabricator.wikimedia.org/T376036) [15:53:07] (03PS1) 10Dzahn: gerrit: remove bad_browser IPs added in 2014 [puppet] - 10https://gerrit.wikimedia.org/r/1076788 [15:53:25] (03CR) 10Giuseppe Lavagetto: [V:03+1 C:03+2] service: make legacy function work with puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1075153 (owner: 10Giuseppe Lavagetto) [15:53:27] RESOLVED: [15x] SystemdUnitCrashLoop: logstash.service crashloop on elastic2055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:53:56] (03PS2) 10Dzahn: gerrit: remove bad_browser IPs added in 2014 [puppet] - 10https://gerrit.wikimedia.org/r/1076788 [15:55:25] (03PS1) 10Btullis: airflow: Use the latest airflow images by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076789 (https://phabricator.wikimedia.org/T375895) [15:55:53] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:56:14] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:56:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-hd1005.eqiad.wmnet with OS bookworm [15:56:22] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-hd100[4-5] - https://phabricator.wikimedia.org/T372511#10188265 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host logging-hd1005.eqiad.wmnet with OS bookworm completed: - logging-hd10... [15:57:47] (03CR) 10Dzahn: "hmm.. syntax error.. I guess maybe it's "%{facts.networking.ip}" and "%{facts.networking.ip6}"" [puppet] - 10https://gerrit.wikimedia.org/r/1074493 (owner: 10Dzahn) [15:58:11] (03CR) 10Ahmon Dancy: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1076019 (https://phabricator.wikimedia.org/T297326) (owner: 10Ahmon Dancy) [15:58:56] (03PS2) 10Dzahn: gitlab: replace legacy Hiera facts with newer syntax [puppet] - 10https://gerrit.wikimedia.org/r/1074493 [15:59:53] (03PS3) 10Dzahn: devtools/hiera: replace legacy facts for puppet 8 compatibility [puppet] - 10https://gerrit.wikimedia.org/r/1074491 [16:02:18] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1076781 (owner: 10Muehlenhoff) [16:04:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc100[12] - https://phabricator.wikimedia.org/T371987#10188324 (10Jclark-ctr) [16:05:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc100[12] - https://phabricator.wikimedia.org/T371987#10188325 (10Jclark-ctr) These are both setup and waiting for puppet / preseed to be updated to complete [16:05:51] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS 
bookworm [16:06:03] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-hd100[4-5] - https://phabricator.wikimedia.org/T372511#10188332 (10Jclark-ctr) [16:06:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-hd100[4-5] - https://phabricator.wikimedia.org/T372511#10188333 (10Jclark-ctr) 05Open→03Resolved [16:07:20] 10ops-codfw, 06SRE, 06DC-Ops: Port with no description on access switch - https://phabricator.wikimedia.org/T376035#10188330 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:09:57] FIRING: [4x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:15:58] (03CR) 10Giuseppe Lavagetto: [C:03+1] php8.1: rebuild to pick up 8.1.30 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1076783 (https://phabricator.wikimedia.org/T376036) (owner: 10Scott French) [16:16:29] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T375776#10188351 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Rebalanced pdu [16:16:42] (03CR) 10Giuseppe Lavagetto: git: add replicated_local_repo define (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto) [16:17:14] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, and 2 others: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10188354 (10Jclark-ctr) 05In progress→03Resolved a:05ABran-WMF→03Jclark-ctr [16:17:35] (03PS6) 10Giuseppe Lavagetto: git: add replicated_local_repo define [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) [16:17:36] (03PS6) 10Giuseppe Lavagetto: conftool::client: allow setting the conftool2git address [puppet] - 10https://gerrit.wikimedia.org/r/1075039 
(https://phabricator.wikimedia.org/T374723) [16:17:36] (03PS6) 10Giuseppe Lavagetto: profile: add conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075040 [16:17:36] (03PS2) 10Giuseppe Lavagetto: puppetserver: run conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075220 (https://phabricator.wikimedia.org/T374723) [16:17:56] 10ops-eqiad, 06SRE, 06DC-Ops: Repurposing 2x Decommissioned Servers for Phasing Out Puppet 5 - https://phabricator.wikimedia.org/T375000#10188361 (10VRiley-WMF) Hi @MoritzMuehlenhoff It looks like we could use snapshot1008 and snapshot1009 as stand ins for the servers. Let us know if there is any prefernce o... [16:19:07] (03PS1) 10Giuseppe Lavagetto: Add ssh key for conftool2git [labs/private] - 10https://gerrit.wikimedia.org/r/1076792 [16:20:03] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10188377 (10Jhancock.wm) @aborrero I accidentally ran a few imaging attempts while just going through lists. Could you update the site.pp file for us? Thanks! [16:20:37] (03CR) 10CI reject: [V:04-1] git: add replicated_local_repo define [puppet] - 10https://gerrit.wikimedia.org/r/1075038 (https://phabricator.wikimedia.org/T374723) (owner: 10Giuseppe Lavagetto) [16:21:32] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Add ssh key for conftool2git [labs/private] - 10https://gerrit.wikimedia.org/r/1076792 (owner: 10Giuseppe Lavagetto) [16:21:57] (03PS1) 10Btullis: Airflow: update the kerberosexecutor settings with specified image tags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076793 (https://phabricator.wikimedia.org/T375895) [16:23:39] 10ops-eqiad, 06SRE, 06DC-Ops: InterfaceSpeedError - https://phabricator.wikimedia.org/T375639#10188385 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr server is decom status in netbox. 
[16:23:53] 10ops-eqiad, 06SRE, 06DC-Ops: Repurposing 2x Decommissioned Servers for Phasing Out Puppet 5 - https://phabricator.wikimedia.org/T375000#10188389 (10MoritzMuehlenhoff) That sounds excellent, cage and/or location don't matter at all. [16:25:31] (03CR) 10Btullis: [C:03+2] Airflow: update the kerberosexecutor settings with specified image tags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076793 (https://phabricator.wikimedia.org/T375895) (owner: 10Btullis) [16:26:03] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T372001#10188399 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr No recent allerts in librenms [16:26:55] (03Merged) 10jenkins-bot: Airflow: update the kerberosexecutor settings with specified image tags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076793 (https://phabricator.wikimedia.org/T375895) (owner: 10Btullis) [16:27:17] !log imported php8.1_8.1.30-1+wmf11u1 into component/php81 - T376036 [16:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:23] T376036: Update PHP 8.1 packages - https://phabricator.wikimedia.org/T376036 [16:27:46] (03PS2) 10Btullis: airflow: Use the latest airflow images by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076789 (https://phabricator.wikimedia.org/T375895) [16:28:18] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#10188403 (10Jhancock.wm) a:05Jhancock.wm→03elukey [16:28:27] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1075920 (owner: 10Ssingh) [16:28:47] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375328#10188432 (10Jhancock.wm) @colewhite are logging-hd2004 and 2005 live? we wanna try rerunning the provisioning script and checking the bios settings. [16:28:52] (03CR) 10Ssingh: "Thanks for the review volans!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1075920 (owner: 10Ssingh) [16:29:08] (03CR) 10Ssingh: [C:03+2] sre.dns.admin: sort DC list by DATACENTER_NUMBERING_PREFIX [cookbooks] - 10https://gerrit.wikimedia.org/r/1075920 (owner: 10Ssingh) [16:35:12] (03PS1) 10Daimona Eaytoy: [zhwiki] Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076797 (https://phabricator.wikimedia.org/T373821) [16:35:55] PROBLEM - mailman3_queue_size on lists1004 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 125 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [16:36:10] (03CR) 10Btullis: [C:03+2] airflow: Use the latest airflow images by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076789 (https://phabricator.wikimedia.org/T375895) (owner: 10Btullis) [16:36:34] 10SRE-swift-storage, 06Commons: File not found: file not available on Commons - https://phabricator.wikimedia.org/T376013#10188474 (10Pppery) 05Open→03Invalid The file is now properly deleted on the MediaWiki side too. [16:37:11] (03Merged) 10jenkins-bot: airflow: Use the latest airflow images by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076789 (https://phabricator.wikimedia.org/T375895) (owner: 10Btullis) [16:37:55] 10SRE-swift-storage, 06Wikimedia Enterprise: Commonswiki recently updated files not found - https://phabricator.wikimedia.org/T375797#10188485 (10Pppery) 05Open→03Resolved And the second image exists now too. 
[16:37:55] (03CR) 10CDanis: [C:03+1] Add aux-k8s-{ctrl,worker}1003 to AUX K8s [puppet] - 10https://gerrit.wikimedia.org/r/1076679 (https://phabricator.wikimedia.org/T344230) (owner: 10Elukey) [16:38:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076797 (https://phabricator.wikimedia.org/T373821) (owner: 10Daimona Eaytoy) [16:38:10] (03CR) 10CDanis: [C:03+1] Add aux-k8s-ctrl1003 to admin_ng's config for AUX [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076681 (https://phabricator.wikimedia.org/T344230) (owner: 10Elukey) [16:38:30] (03CR) 10Scott French: "Thanks for the review!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1076783 (https://phabricator.wikimedia.org/T376036) (owner: 10Scott French) [16:38:37] (03CR) 10CDanis: [C:03+1] "Confirmed!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076681 (https://phabricator.wikimedia.org/T344230) (owner: 10Elukey) [16:39:54] (03CR) 10Scott French: [V:03+2 C:03+2] "Verified locally (build / run entrypoint as a basic smoke test)." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1076783 (https://phabricator.wikimedia.org/T376036) (owner: 10Scott French) [16:40:55] RECOVERY - mailman3_queue_size on lists1004 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [16:40:57] (03PS2) 10Majavah: hieradata: Remove data for clouddb-services [puppet] - 10https://gerrit.wikimedia.org/r/1076785 [16:40:58] (03PS1) 10Majavah: hieradata: Bump striker-toolsbeta to 2024-09-30-162529-production [puppet] - 10https://gerrit.wikimedia.org/r/1076798 (https://phabricator.wikimedia.org/T359428) [16:41:12] (03PS2) 10Majavah: hieradata: Bump striker-toolsbeta to 2024-09-30-162529-production [puppet] - 10https://gerrit.wikimedia.org/r/1076798 (https://phabricator.wikimedia.org/T359428) [16:41:28] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:42:19] (03CR) 10Majavah: [C:03+2] hieradata: Bump striker-toolsbeta to 2024-09-30-162529-production [puppet] - 10https://gerrit.wikimedia.org/r/1076798 (https://phabricator.wikimedia.org/T359428) (owner: 10Majavah) [16:51:48] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:51:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070282 (https://phabricator.wikimedia.org/T369808) (owner: 10Ebernhardson) [16:52:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074257 (https://phabricator.wikimedia.org/T374987) (owner: 10Ebernhardson) [16:52:34] (03PS1) 10Majavah: hieradata: Bump 
striker-tools to 2024-09-30-162529-production [puppet] - 10https://gerrit.wikimedia.org/r/1076799 (https://phabricator.wikimedia.org/T359428) [16:53:14] !log built and published new php8.1 images - T376036 [16:53:15] (03CR) 10Majavah: [C:03+2] hieradata: Bump striker-tools to 2024-09-30-162529-production [puppet] - 10https://gerrit.wikimedia.org/r/1076799 (https://phabricator.wikimedia.org/T359428) (owner: 10Majavah) [16:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:22] T376036: Update PHP 8.1 packages - https://phabricator.wikimedia.org/T376036 [16:55:30] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10188560 (10RobH) Entered ticket 00981959 for this swap to take place today: > Support, > > We would like to request remote hands to assist in retriving a s... [16:56:29] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240930T1700) [17:00:05] ryankemper: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240930T1700). 
[17:02:17] (03PS1) 10Majavah: hieradata: Bump striker-toolsbeta to 2024-09-30-170045-production [puppet] - 10https://gerrit.wikimedia.org/r/1076800 (https://phabricator.wikimedia.org/T359428) [17:03:11] (03CR) 10Majavah: [C:03+2] hieradata: Bump striker-toolsbeta to 2024-09-30-170045-production [puppet] - 10https://gerrit.wikimedia.org/r/1076800 (https://phabricator.wikimedia.org/T359428) (owner: 10Majavah) [17:07:54] (03PS3) 10Giuseppe Lavagetto: puppetserver: run conftool2git [puppet] - 10https://gerrit.wikimedia.org/r/1075220 (https://phabricator.wikimedia.org/T374723) [17:11:40] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Vacation coverage for Katie Francis (route NDA requests to Rachel until September 30) - https://phabricator.wikimedia.org/T374673#10188638 (10KFrancis) Hi all, I'm back from vacation. Please resume all NDA requests to me. Thanks! [17:11:56] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:12:06] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:14:25] PROBLEM - Hadoop NodeManager on an-worker1176 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:14:51] PROBLEM - Hadoop NodeManager on an-worker1177 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:15:08] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10188649 (10RobH) Ticket accepted and they've retrieved the replacement shipment, should get pinged from them shortly to start the swap. 
[17:18:25] (03PS1) 10Ebernhardson: cirrus: Change staging update topics to use codfw prefix to match dc switchover [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076807 [17:21:35] (03CR) 10Ebernhardson: [C:03+2] cirrus: Change staging update topics to use codfw prefix to match dc switchover [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076807 (owner: 10Ebernhardson) [17:22:25] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [17:22:45] (03Merged) 10jenkins-bot: cirrus: Change staging update topics to use codfw prefix to match dc switchover [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076807 (owner: 10Ebernhardson) [17:22:52] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [17:23:33] (03PS32) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [17:23:51] RECOVERY - Hadoop NodeManager on an-worker1177 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:23:51] (03CR) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [17:24:02] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:25:05] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:25:11] (03PS33) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [17:27:47] (03PS1) 10Scott French: shellbox: add support for 
service.deployment: none [deployment-charts] - 10https://gerrit.wikimedia.org/r/1074494 (https://phabricator.wikimedia.org/T375243) [17:29:55] (03PS34) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [17:30:17] 10ops-codfw, 06SRE, 06DC-Ops, 10procurement: puppetserver ram upgrades - decom memory option - https://phabricator.wikimedia.org/T376057 (10RobH) 03NEW p:05Triage→03Medium [17:30:22] 10ops-eqiad, 06SRE, 06DC-Ops, 10procurement: puppetserver ram upgrades - decom memory option - https://phabricator.wikimedia.org/T376058 (10RobH) 03NEW p:05Triage→03Medium [17:32:04] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Vacation coverage for Katie Francis (route NDA requests to Rachel until September 30) - https://phabricator.wikimedia.org/T374673#10188803 (10Dzahn) 05Open→03Resolved a:03Dzahn Welcome back, Katie! I'm closing this ticket as resolved since it w... [17:33:40] 10ops-eqiad, 06SRE, 06DC-Ops: puppetserver ram upgrades - decom memory option - https://phabricator.wikimedia.org/T376058#10188812 (10RobH) [17:33:48] 10ops-codfw, 06SRE, 06DC-Ops: puppetserver ram upgrades - decom memory option - https://phabricator.wikimedia.org/T376057#10188817 (10RobH) [17:34:22] (03CR) 10Cwhite: [C:03+2] logstash: cast airflow caller field to string [puppet] - 10https://gerrit.wikimedia.org/r/1076299 (owner: 10Cwhite) [17:35:25] RECOVERY - Hadoop NodeManager on an-worker1176 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:38:20] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:38:31] (03PS1) 10Bking: dse-k8s: Add service configuration for airflow instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) [17:38:51] (03PS3) 10Ahonc: Change votewiki language back to English. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076810 (https://phabricator.wikimedia.org/T302443) [17:39:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076810 (https://phabricator.wikimedia.org/T302443) (owner: 10Ahonc) [17:43:51] (03CR) 10CI reject: [V:04-1] sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [17:44:32] !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [17:44:39] !log ebernhardson@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:47:46] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:47:54] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:49:06] (03PS35) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) [17:53:46] (03CR) 10Dzahn: "thanks for checking in compiler! 
with this syntax it doesn't fail: https://puppet-compiler.wmflabs.org/output/1074493/4158/" [puppet] - 10https://gerrit.wikimedia.org/r/1074493 (owner: 10Dzahn) [17:57:46] (03CR) 10Dzahn: [C:03+2] scap-master-sync: Fix cdb exclude [puppet] - 10https://gerrit.wikimedia.org/r/1076019 (https://phabricator.wikimedia.org/T297326) (owner: 10Ahmon Dancy) [17:58:01] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:58:49] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 62, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:58:49] PROBLEM - OSPF status on mr1-ulsfo is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:02:14] (03Abandoned) 10Scott French: service_node_spec.pb: avoid use of merge_config [puppet] - 10https://gerrit.wikimedia.org/r/1076040 (owner: 10Scott French) [18:04:01] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on lvs[4008-4010].ulsfo.wmnet with reason: site is depooled, cr3-ulsfo is being replaced [18:04:17] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs[4008-4010].ulsfo.wmnet with reason: site is depooled, cr3-ulsfo is being replaced [18:06:12] (03CR) 10Dzahn: [C:03+1] "Guesstimates I got are that <1% of users should be affected." [puppet] - 10https://gerrit.wikimedia.org/r/1075614 (https://phabricator.wikimedia.org/T375569) (owner: 10BCornwall) [18:07:08] (03PS3) 10Dzahn: gerrit: remove bad_browser IPs added >=10 years ago [puppet] - 10https://gerrit.wikimedia.org/r/1076788 [18:08:58] (03CR) 10Dzahn: "latest PS is the syntax also used in the production patch. the compiler runs without change for that one now." 
[puppet] - 10https://gerrit.wikimedia.org/r/1074491 (owner: 10Dzahn) [18:09:20] (03CR) 10Dzahn: "Jelto, we could confirm here first." [puppet] - 10https://gerrit.wikimedia.org/r/1074491 (owner: 10Dzahn) [18:14:00] (03CR) 10CDobbins: sre.cdn.pdns-recursor: add rolling restart script (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1073290 (https://phabricator.wikimedia.org/T374891) (owner: 10CDobbins) [18:27:35] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10189124 (10RobH) Swap completed and @papaul confirms they can attach via serial console. The onsite portion of this troubleshooting and repair should now be... [18:29:03] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:29:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 22.93% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:29:30] (03PS1) 10CDanis: puppet hiera_lookup: add format option [software/spicerack] - 10https://gerrit.wikimedia.org/r/1076821 [18:30:35] (03PS3) 10Scott French: scap: remove stale production dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/1075981 (https://phabricator.wikimedia.org/T370962) [18:30:36] (03CR) 10Scott French: "Thanks in advance for the review, Ahmon!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1075981 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [18:31:05] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:34:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 24.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:34:26] (03CR) 10Ahmon Dancy: [C:03+1] scap: remove stale production dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/1075981 (https://phabricator.wikimedia.org/T370962) (owner: 10Scott French) [18:36:29] (03PS2) 10CDanis: puppet hiera_lookup: add format option [software/spicerack] - 10https://gerrit.wikimedia.org/r/1076821 [18:41:03] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [18:41:25] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [18:48:40] (03CR) 10Dzahn: [C:03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1072690 (owner: 10Muehlenhoff) [18:51:24] (03PS2) 10Dzahn: aphlict: limit envoy srange to CACHES [puppet] - 10https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) [19:02:42] (03CR) 10Dzahn: "Ah, cool! Uploaded new PS." 
[puppet] - 10https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [19:06:49] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:07:05] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:07:31] (03PS3) 10Dzahn: peopleweb: limit envoy srange to CACHES and DEPLOYMENT servers [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) [19:07:49] RECOVERY - OSPF status on mr1-ulsfo is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:07:53] welcome back cr3-ulsfo :) [19:07:59] (03CR) 10Dzahn: "Thank you! amended." [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [19:08:51] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs[4008-4010].ulsfo.wmnet [19:08:52] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs[4008-4010].ulsfo.wmnet [19:10:16] (03PS1) 10Dzahn: requesttracker: fix envoy firewall source ranges [puppet] - 10https://gerrit.wikimedia.org/r/1076824 [19:10:51] 10ops-eqiad, 06SRE, 06DC-Ops: puppetserver ram upgrades - decom memory option - https://phabricator.wikimedia.org/T376058#10189281 (10VRiley-WMF) Surprisingly, I have been able to locate six (6) 32 gig sticks of RAM 3200 MHz. Please let us know when we can initiate this process. [19:13:23] (03CR) 10Dzahn: [C:03+1] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [19:17:29] (03CR) 10Dzahn: [C:03+1] "@Brett It seems we may have to decide/document whether we allow redirects to targets not controlled by us. 
There are like 3 categories of " [puppet] - 10https://gerrit.wikimedia.org/r/1069643 (owner: 10Ncmonitor) [19:28:39] (03CR) 10Dzahn: "this was added originally with comment "# Allow labs projects to temporarily opt out of nist kex disabling" ( I0fd6579d9c4e0563)" [puppet] - 10https://gerrit.wikimedia.org/r/1074381 (owner: 10Muehlenhoff) [19:30:15] (03CR) 10Dzahn: [C:03+1] "also https://gerrit.wikimedia.org/r/c/operations/puppet/+/1064413 before or after" [puppet] - 10https://gerrit.wikimedia.org/r/1074381 (owner: 10Muehlenhoff) [19:30:27] (03CR) 10Dzahn: "related: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1074381" [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) (owner: 10Dzahn) [19:31:31] (03CR) 10Dzahn: [C:03+1] "this was added originally with comment "# Allow labs projects to temporarily opt out of nist kex disabling" ( before I0fd6579d9c4e0563)" [puppet] - 10https://gerrit.wikimedia.org/r/1064413 (https://phabricator.wikimedia.org/T315942) (owner: 10Dzahn) [19:32:34] 10ops-codfw, 06SRE, 06DC-Ops: puppetserver ram upgrades - decom memory option - https://phabricator.wikimedia.org/T376057#10189405 (10Jhancock.wm) [19:32:56] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T375328#10189408 (10colewhite) >>! In T375328#10188432, @Jhancock.wm wrote: > @colewhite are logging-hd2004 and 2005 live? we wanna try rerunning the provisioning script and checking the bios settings. They are not yet live... [19:35:56] 10ops-codfw, 06SRE, 06DC-Ops: puppetserver ram upgrades - decom memory option - https://phabricator.wikimedia.org/T376057#10189417 (10Jhancock.wm) I couldn't find any 3200 MHz at codfw. 
[19:36:25] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for zoe - https://phabricator.wikimedia.org/T373666#10189423 (Dzahn) gently pinged on Slack [19:46:17] jouncebot next [19:46:17] In 0 hour(s) and 13 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240930T2000) [19:48:15] (CR) Dzahn: [C:+1] "already has +2 but not submitted, is it meant to be merged or waiting?" [puppet] - https://gerrit.wikimedia.org/r/1074433 (https://phabricator.wikimedia.org/T374090) (owner: JHathaway) [19:55:17] (CR) Volans: [C:+1] "LGTM. Ideally it would be nice to expand `test_hiera_lookup` with parametrize to test this new behaviour." [software/spicerack] - https://gerrit.wikimedia.org/r/1076821 (owner: CDanis) [19:56:02] (CR) CDanis: [C:+2] puppet hiera_lookup: add format option [software/spicerack] - https://gerrit.wikimedia.org/r/1076821 (owner: CDanis) [19:56:11] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [19:56:28] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [19:57:19] cdanis: did you even read my comments? :) [19:57:27] volans: yes :) [19:57:37] I'll add a parameterize later [19:58:17] great :) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240930T2000) [20:00:05] ebernhardson and Ahonc: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:02:38] \o [20:04:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070282 (https://phabricator.wikimedia.org/T369808) (owner: 10Ebernhardson) [20:05:00] (03Merged) 10jenkins-bot: cirrus: Remove unused Regex pool counter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070282 (https://phabricator.wikimedia.org/T369808) (owner: 10Ebernhardson) [20:05:34] !log ebernhardson@deploy2002 Started scap sync-world: Backport for [[gerrit:1070282|cirrus: Remove unused Regex pool counter (T369808)]] [20:05:40] T369808: The Commons search "deepcategory" operator often does not work (Deep category query returned too many categories) - https://phabricator.wikimedia.org/T369808 [20:06:23] * Ahonc around [20:07:53] (03Merged) 10jenkins-bot: puppet hiera_lookup: add format option [software/spicerack] - 10https://gerrit.wikimedia.org/r/1076821 (owner: 10CDanis) [20:07:59] !log ebernhardson@deploy2002 ebernhardson: Backport for [[gerrit:1070282|cirrus: Remove unused Regex pool counter (T369808)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:08:15] Ahonc: alrighty, i'll get to yours in a few min [20:08:35] !log ebernhardson@deploy2002 ebernhardson: Continuing with sync [20:08:51] PROBLEM - Hadoop NodeManager on an-worker1177 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:10:12] FIRING: [4x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:10:37] RECOVERY - Host ml-serve2001 is UP: PING WARNING - Packet loss = 71%, RTA = 0.55 ms [20:11:03] PROBLEM 
- SSH on ml-serve2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:13:08] !log ebernhardson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1070282|cirrus: Remove unused Regex pool counter (T369808)]] (duration: 07m 34s) [20:13:14] T369808: The Commons search "deepcategory" operator often does not work (Deep category query returned too many categories) - https://phabricator.wikimedia.org/T369808 [20:14:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076810 (https://phabricator.wikimedia.org/T302443) (owner: 10Ahonc) [20:15:20] Ahonc: alrighty it's going out now, should be on test servers in a few. It looks like the extent of testing is verify the language changes? [20:15:47] (03Merged) 10jenkins-bot: Change votewiki language back to English. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076810 (https://phabricator.wikimedia.org/T302443) (owner: 10Ahonc) [20:16:02] !log ebernhardson@deploy2002 Started scap sync-world: Backport for [[gerrit:1076810|Change votewiki language back to English. (T302443)]] [20:16:08] T302443: Undertake Wikimedia Ukraine 2024 AGM elections on SecurePoll - https://phabricator.wikimedia.org/T302443 [20:16:44] looks ok [20:17:01] PROBLEM - Host ml-serve2001 is DOWN: PING CRITICAL - Packet loss = 100% [20:18:06] !log ebernhardson@deploy2002 ebernhardson, ahonc: Backport for [[gerrit:1076810|Change votewiki language back to English. 
(T302443)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:18:12] yup looks to have switched the ui as expected [20:18:14] !log ebernhardson@deploy2002 ebernhardson, ahonc: Continuing with sync [20:19:58] (03PS5) 10Bking: dse-k8s: Add service configuration for airflow-analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076813 (https://phabricator.wikimedia.org/T371208) [20:20:25] PROBLEM - Hadoop NodeManager on an-worker1176 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:22:43] !log ebernhardson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1076810|Change votewiki language back to English. (T302443)]] (duration: 06m 41s) [20:22:49] T302443: Undertake Wikimedia Ukraine 2024 AGM elections on SecurePoll - https://phabricator.wikimedia.org/T302443 [20:23:09] Ahonc: all shipped out, should be everywhere [20:23:51] RECOVERY - Hadoop NodeManager on an-worker1177 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:23:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074257 (https://phabricator.wikimedia.org/T374987) (owner: 10Ebernhardson) [20:24:40] (03Merged) 10jenkins-bot: ClosedWikiProvider: Support canAlwaysAutocreate option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1074257 (https://phabricator.wikimedia.org/T374987) (owner: 10Ebernhardson) [20:24:58] !log ebernhardson@deploy2002 Started scap sync-world: Backport for [[gerrit:1074257|ClosedWikiProvider: Support canAlwaysAutocreate option (T374987)]] [20:25:03] T374987: "Account autocreation denied for 
CirrusSearch Streaming Updater by ClosedWikiProvider" - https://phabricator.wikimedia.org/T374987 [20:27:03] !log ebernhardson@deploy2002 ebernhardson: Backport for [[gerrit:1074257|ClosedWikiProvider: Support canAlwaysAutocreate option (T374987)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:29:06] 10ops-eqiad, 06SRE, 06DC-Ops: puppetserver ram upgrades - decom memory option - https://phabricator.wikimedia.org/T376058#10189608 (10MoritzMuehlenhoff) Fantastic! We can start with puppetserver1002 tomorrow. I'll add a note when the server can be taken down for adding the RAM. Or is it an option to add the... [20:29:33] (03PS1) 10Btullis: airflow: Use latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076832 (https://phabricator.wikimedia.org/T375895) [20:30:54] !log ebernhardson@deploy2002 ebernhardson: Continuing with sync [20:31:14] (03CR) 10Btullis: [C:03+2] airflow: Use latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076832 (https://phabricator.wikimedia.org/T375895) (owner: 10Btullis) [20:32:12] (03Merged) 10jenkins-bot: airflow: Use latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076832 (https://phabricator.wikimedia.org/T375895) (owner: 10Btullis) [20:34:57] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [20:35:25] RECOVERY - Hadoop NodeManager on an-worker1176 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:35:38] !log ebernhardson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1074257|ClosedWikiProvider: Support canAlwaysAutocreate option (T374987)]] (duration: 10m 40s) [20:35:43] T374987: "Account autocreation denied for CirrusSearch Streaming Updater by ClosedWikiProvider" - https://phabricator.wikimedia.org/T374987 [20:36:01] that 
wraps up the deployment window [20:37:49] (03PS1) 10Zabe: Make revision-slots expiry configurable [core] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1076834 (https://phabricator.wikimedia.org/T183490) [20:40:19] (03PS1) 10DErenrich: disable the Add A Fact QuickSurvey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076835 [20:43:18] (03CR) 10Muehlenhoff: peopleweb: limit envoy srange to CACHES and DEPLOYMENT servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:43:42] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1076824 (owner: 10Dzahn) [20:44:24] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:45:18] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [20:47:45] (03PS1) 10CDanis: Revert "experiment w/ externalIPs on staging-codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1076837 (https://phabricator.wikimedia.org/T344171) [20:49:09] (03CR) 10Dzahn: peopleweb: limit envoy srange to CACHES and DEPLOYMENT servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:50:49] (03CR) 10Zabe: [C:03+2] Make revision-slots expiry configurable [core] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1076834 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [20:51:40] (03CR) 10Dzahn: "also could be possible that tools like https://people.wikimedia.org/~cdanis/sremap/ get deployed from deploy.. 
would have to check" [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:51:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 01 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076661 (https://phabricator.wikimedia.org/T375979) (owner: 10Melos) [20:52:23] (03CR) 10CDanis: "This one doesn't, but I don't know about others." [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:56:14] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site ulsfo [reason: repool ulsfo as cr3-ulsfo was replaced, T375345] [20:56:20] T375345: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345 [20:56:23] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site ulsfo [reason: repool ulsfo as cr3-ulsfo was replaced, T375345] [20:56:29] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [20:56:51] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [20:57:50] 06SRE, 06Infrastructure-Foundations, 10vm-requests: eqiad: request VM 1 for wdqs-categories - https://phabricator.wikimedia.org/T376079 (10bking) 03NEW [20:57:59] (03CR) 10Dzahn: [C:03+2] requesttracker: fix envoy firewall source ranges [puppet] - 10https://gerrit.wikimedia.org/r/1076824 (owner: 10Dzahn) [20:59:05] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 10Data-Platform-SRE (2024.09.28 - 2024.10.18): eqiad: request 1 VM for wdqs-categories - https://phabricator.wikimedia.org/T376079#10189714 
(10bking) [20:59:35] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 10Data-Platform-SRE (2024.09.28 - 2024.10.18): eqiad: request 1 VM for wdqs-categories - https://phabricator.wikimedia.org/T376079#10189711 (10bking) [20:59:36] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 10Data-Platform-SRE (2024.09.28 - 2024.10.18): eqiad: request 1 VM for wdqs-categories - https://phabricator.wikimedia.org/T376079#10189715 (10bking) [21:00:01] 06SRE, 06Infrastructure-Foundations, 10vm-requests, 10Data-Platform-SRE (2024.09.28 - 2024.10.18): eqiad: request 1 VM for wdqs-categories - https://phabricator.wikimedia.org/T376079#10189719 (10bking) [21:00:05] Reedy, sbassett, Maryum, and manfredi: Time to snap out of that daydream and deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240930T2100). [21:01:32] (03CR) 10Dzahn: "ACK, thanks! yea, the real reason is just running the httpbb tests. That always made me add deployment_hosts as srange on various services" [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [21:02:39] (03CR) 10Dzahn: [C:03+2] "on moscovium: Error 500 on SERVER: Server Error: Could not find resource 'File[/etc/ferm/conf.d]' in parameter 'require'. checking..." [puppet] - 10https://gerrit.wikimedia.org/r/1076824 (owner: 10Dzahn) [21:06:51] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [21:20:27] (03CR) 10Dzahn: [C:03+1] envoy: Add support for passing an array of sets to the firewall service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1072690 (owner: 10Muehlenhoff) [21:20:45] (03CR) 10Dzahn: [C:03+2] "I left an inline comment on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072690 about this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1076824 (owner: 10Dzahn) [21:22:39] (03Merged) 10jenkins-bot: Make revision-slots expiry configurable [core] (wmf/1.43.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1076834 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [21:23:16] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1076834|Make revision-slots expiry configurable (T183490)]] [21:23:26] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [21:25:24] !log zabe@deploy2002 zabe: Backport for [[gerrit:1076834|Make revision-slots expiry configurable (T183490)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:26:24] !log zabe@deploy2002 zabe: Continuing with sync [21:26:36] (03PS1) 10Bking: wdqs-categories: introduce VM for testing [puppet] - 10https://gerrit.wikimedia.org/r/1076841 (https://phabricator.wikimedia.org/T375687) [21:26:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076835 (owner: 10DErenrich) [21:30:59] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1076834|Make revision-slots expiry configurable (T183490)]] (duration: 07m 42s) [21:31:05] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [21:36:28] (03PS1) 10Btullis: airflow: update PYTHONPATH and set executor_pod_image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076844 (https://phabricator.wikimedia.org/T375895) [21:38:13] (03CR) 10Btullis: [C:03+2] airflow: update PYTHONPATH and set executor_pod_image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076844 (https://phabricator.wikimedia.org/T375895) (owner: 10Btullis) [21:38:20] FIRING: 
KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:39:48] (03Merged) 10jenkins-bot: airflow: update PYTHONPATH and set executor_pod_image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076844 (https://phabricator.wikimedia.org/T375895) (owner: 10Btullis) [21:47:00] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [21:47:44] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [21:52:27] (03CR) 10Zabe: [C:03+2] s3: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076760 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [21:53:15] (03Merged) 10jenkins-bot: s3: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076760 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [21:53:36] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1076760|s3: Reduce revision-slots cache expiry to 60 seconds (T183490)]] [21:53:41] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [21:54:16] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [21:54:47] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [21:55:45] !log zabe@deploy2002 zabe: Backport for [[gerrit:1076760|s3: Reduce revision-slots cache expiry to 60 seconds (T183490)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:55:53] !log zabe@deploy2002 
zabe: Continuing with sync [21:58:04] (PS1) Dzahn: requesttracker: set firewall_srange to ~ in Hiera [puppet] - https://gerrit.wikimedia.org/r/1076846 [21:58:38] (CR) Dzahn: [C:+2] requesttracker: set firewall_srange to ~ in Hiera [puppet] - https://gerrit.wikimedia.org/r/1076846 (owner: Dzahn) [22:00:30] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1076760|s3: Reduce revision-slots cache expiry to 60 seconds (T183490)]] (duration: 06m 54s) [22:00:45] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [22:12:22] ops-eqiad, SRE, DC-Ops: puppetserver ram upgrades - decom memory option - https://phabricator.wikimedia.org/T376058#10189860 (RobH) RAM requires downtime for swap, as the host has to slide partially out of the rack and expose the mainboard. [22:19:43] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for zoe - https://phabricator.wikimedia.org/T373666#10189863 (VPuffetMichel) I approve this. Thanks for the ping as I have been in and out a lot.
[22:44:15] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [23:08:51] PROBLEM - Hadoop NodeManager on an-worker1177 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:20:25] PROBLEM - Hadoop NodeManager on an-worker1176 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:22:51] RECOVERY - Hadoop NodeManager on an-worker1177 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:29:57] (PS2) Bking: wdqs-categories: introduce VM for testing [puppet] - https://gerrit.wikimedia.org/r/1076841 (https://phabricator.wikimedia.org/T375687) [23:31:49] (PS3) Bking: wdqs-categories: introduce VM for testing [puppet] - https://gerrit.wikimedia.org/r/1076841 (https://phabricator.wikimedia.org/T375687) [23:35:25] RECOVERY - Hadoop NodeManager on an-worker1176 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:38:16] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1076858 [23:38:16] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1076858 (owner: TrainBranchBot) [23:46:49] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL:
host 198.35.26.193, interfaces up: 62, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:47:03] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:47:51] PROBLEM - OSPF status on mr1-ulsfo is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:53:49] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:54:04] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:54:49] RECOVERY - OSPF status on mr1-ulsfo is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status