[00:06:02] (PS1) Arlolra: Deploy Parsoid Read Views to 7 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1199880 (https://phabricator.wikimedia.org/T408765) [00:08:43] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:11:03] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:13:18] (CR) Scott French: "Thanks, Fabrizio!" [puppet] - https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: Fabfur) [00:17:36] SRE, SRE-Access-Requests, LDAP-Access-Requests: Grant Access to wmf LDAP and analytics-privatedata-users shell group for SherryYang-WMF - https://phabricator.wikimedia.org/T408639#11325872 (SherryYang-WMF) requested wmf on IDM I think I can start with level one of analytics-privatedata-users and see... 
[00:25:38] SRE: Migrate from Squid to Varnish - https://phabricator.wikimedia.org/T78911#11325883 (Krinkle) [00:38:29] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1199885 [00:38:29] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1199885 (owner: TrainBranchBot) [00:38:43] (PS1) Aaron Schulz: Route "/api/rest_v1/" requests with "?spec" query to the rest gateway [puppet] - https://gerrit.wikimedia.org/r/1199886 (https://phabricator.wikimedia.org/T397203) [00:38:54] (CR) Superpes15: azwiktionary: use new wordmark and tagline (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: Əkrəm) [00:39:10] (CR) Superpes15: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: Əkrəm) [00:40:46] (CR) CI reject: [V:-1] Route "/api/rest_v1/" requests with "?spec" query to the rest gateway [puppet] - https://gerrit.wikimedia.org/r/1199886 (https://phabricator.wikimedia.org/T397203) (owner: Aaron Schulz) [00:42:57] (PS2) Aaron Schulz: Route "/api/rest_v1/" requests with "?spec" query to the rest gateway [puppet] - https://gerrit.wikimedia.org/r/1199886 (https://phabricator.wikimedia.org/T397203) [00:43:48] (CR) Superpes15: azwiktionary: use new wordmark and tagline (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: Əkrəm) [00:54:56] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1199885 (owner: TrainBranchBot) [01:00:49] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:02:24] (PS1) Tim Starling: Enable 
ChangesListQuery partitioning on mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1199890 (https://phabricator.wikimedia.org/T403798) [01:02:26] (PS1) Tim Starling: Enable ChangesListQuery partitioning on enwiki and commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1199891 (https://phabricator.wikimedia.org/T403798) [01:02:28] (PS1) Tim Starling: Enable ChangesListQuery partitioning on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1199892 (https://phabricator.wikimedia.org/T403798) [01:04:21] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:08:18] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1199895 [01:08:18] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1199895 (owner: TrainBranchBot) [01:08:44] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:11:03] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:14:02] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 12s) [01:31:08] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1199895 (owner: TrainBranchBot) [01:33:43] FIRING: [5x] SystemdUnitFailed: 
docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:43:43] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:14:18] (CR) RLazarus: [C:+1] Enroll 50% of client sessions in PHP 8.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1199836 (https://phabricator.wikimedia.org/T405955) (owner: Scott French) [02:14:33] (CR) RLazarus: [C:+1] mw-(api-int|jobrunner): serve 25% of traffic on PHP 8.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1199837 (https://phabricator.wikimedia.org/T405955) (owner: Scott French) [02:29:21] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:34:15] SRE, SRE-Access-Requests, LDAP-Access-Requests: Grant Access to wmf LDAP and analytics-privatedata-users shell group for SherryYang-WMF - https://phabricator.wikimedia.org/T408639#11325974 (Dzahn) a:SherryYang-WMF→None Thank you, sounds good. Will continue with this information. 
[03:09:21] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:18:43] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:34:21] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:06:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [04:23:43] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:28:43] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:46:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - 
https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [05:04:21] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:08:43] FIRING: [4x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:33:43] FIRING: [4x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:44:21] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:44:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance [05:44:42] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [05:44:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1165 (T407997)', diff saved to https://phabricator.wikimedia.org/P84410 and previous config saved to /var/cache/conftool/dbconfig/20251030-054449-marostegui.json [05:44:55] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - 
https://phabricator.wikimedia.org/T407997 [05:47:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T407997)', diff saved to https://phabricator.wikimedia.org/P84411 and previous config saved to /var/cache/conftool/dbconfig/20251030-054659-marostegui.json [05:47:19] (PS1) Marostegui: db2153: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1199943 (https://phabricator.wikimedia.org/T407463) [05:48:13] (CR) Marostegui: [C:+2] db2153: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1199943 (https://phabricator.wikimedia.org/T407463) (owner: Marostegui) [05:49:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2153.codfw.wmnet with reason: Maintenance [05:49:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2153 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84412 and previous config saved to /var/cache/conftool/dbconfig/20251030-054923-marostegui.json [05:51:28] (PS1) Marostegui: installserver: Remove es2048 [puppet] - https://gerrit.wikimedia.org/r/1199945 [05:53:55] (CR) Marostegui: [C:+2] installserver: Remove es2048 [puppet] - https://gerrit.wikimedia.org/r/1199945 (owner: Marostegui) [05:57:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2153 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84413 and previous config saved to /var/cache/conftool/dbconfig/20251030-055732-root.json [05:58:15] (PS1) Marostegui: instances.yaml: Remove es1033 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1199946 (https://phabricator.wikimedia.org/T408772) [05:58:53] (CR) Marostegui: [C:+2] instances.yaml: Remove es1033 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1199946 (https://phabricator.wikimedia.org/T408772) (owner: Marostegui) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T0600) [06:00:05] marostegui, Amir1, and federico3: Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T0600). Please do the needful. [06:00:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove es1033 from dbctl T408772', diff saved to https://phabricator.wikimedia.org/P84414 and previous config saved to /var/cache/conftool/dbconfig/20251030-060018-marostegui.json [06:00:24] T408772: decommission es1033.eqiad.wmnet - https://phabricator.wikimedia.org/T408772 [06:00:41] (PS1) Marostegui: es1033: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1199948 (https://phabricator.wikimedia.org/T408772) [06:01:16] (CR) Marostegui: [C:+2] es1033: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1199948 (https://phabricator.wikimedia.org/T408772) (owner: Marostegui) [06:02:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P84415 and previous config saved to /var/cache/conftool/dbconfig/20251030-060208-marostegui.json [06:12:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2153 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84416 and previous config saved to /var/cache/conftool/dbconfig/20251030-061238-root.json [06:15:11] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1033.eqiad.wmnet with OS trixie [06:17:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P84417 and previous config saved to /var/cache/conftool/dbconfig/20251030-061715-marostegui.json [06:22:37] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.078e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [06:27:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2153 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84418 and previous config saved to /var/cache/conftool/dbconfig/20251030-062744-root.json [06:29:21] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:32:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T407997)', diff saved to https://phabricator.wikimedia.org/P84419 and previous config saved to /var/cache/conftool/dbconfig/20251030-063223-marostegui.json [06:32:29] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [06:32:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance [06:32:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1168 (T407997)', diff saved to https://phabricator.wikimedia.org/P84420 and previous config saved to /var/cache/conftool/dbconfig/20251030-063247-marostegui.json [06:34:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T407997)', diff saved to https://phabricator.wikimedia.org/P84421 and previous config saved to /var/cache/conftool/dbconfig/20251030-063457-marostegui.json [06:42:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2153 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84422 and previous config saved to /var/cache/conftool/dbconfig/20251030-064250-root.json [06:50:05] !log marostegui@cumin1003 dbctl commit (dc=all): 
'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P84423 and previous config saved to /var/cache/conftool/dbconfig/20251030-065004-marostegui.json [06:50:18] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1033.eqiad.wmnet with reason: host reimage [06:54:00] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1033.eqiad.wmnet with reason: host reimage [07:00:05] Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:05:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P84424 and previous config saved to /var/cache/conftool/dbconfig/20251030-070512-marostegui.json [07:10:08] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.10.17 - 2025.11.07): Degraded RAID on an-worker1203 - https://phabricator.wikimedia.org/T408359#11326172 (Jclark-ctr) Replacement drive has arrived @btullis [07:15:36] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 9420 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [07:18:43] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:20:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T407997)', diff saved 
to https://phabricator.wikimedia.org/P84425 and previous config saved to /var/cache/conftool/dbconfig/20251030-072020-marostegui.json [07:20:26] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [07:20:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance [07:20:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1173 (T407997)', diff saved to https://phabricator.wikimedia.org/P84426 and previous config saved to /var/cache/conftool/dbconfig/20251030-072043-marostegui.json [07:22:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T407997)', diff saved to https://phabricator.wikimedia.org/P84427 and previous config saved to /var/cache/conftool/dbconfig/20251030-072253-marostegui.json [07:33:17] FIRING: ProbeDown: Service wdqs2021:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:34:21] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [07:38:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P84428 and previous config saved to /var/cache/conftool/dbconfig/20251030-073801-marostegui.json [07:38:17] FIRING: [10x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:38:32] FIRING: [10x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:38:36] (PS8) Stevemunene: Deploy airflow images from airflow-dags repository build [deployment-charts] - https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) [07:41:56] SRE, SRE-Access-Requests: Add dpogorzelski to ML and Data Platform posix groups - https://phabricator.wikimedia.org/T408579#11326218 (elukey) Hi Daniel! I think full access since the kerberos identity was requested :) [07:43:17] FIRING: [18x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:43:24] SRE, SRE-Access-Requests, Machine-Learning-Team: Promote dpogorzelski from ops-limited to ops - https://phabricator.wikimedia.org/T408702#11326220 (elukey) Correct this needs an approval from Mark afaik :) @mark Hi! Looping you in to approve the ops membership for Dawid (new Staff SRE in ML). 
[07:46:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [07:48:17] FIRING: [18x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:48:44] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:50:15] brouberol, stevemunene : could you have a look at the WDQS elevated max lag ? Ping David or Gabriele if needed. [07:51:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [07:52:02] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:53:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P84429 and previous config saved to /var/cache/conftool/dbconfig/20251030-075308-marostegui.json [07:53:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes 
(http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:53:34] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es1033.eqiad.wmnet with OS trixie [07:53:43] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:54:09] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1033.eqiad.wmnet with OS trixie [07:54:51] SRE, SRE-swift-storage, Infrastructure-Foundations: Key packages missing from trixie-wikimedia - https://phabricator.wikimedia.org/T407513#11326229 (Marostegui) [07:57:02] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:58:43] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:03:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:04:45] Ack gehel , though it seems to have resolved. 
Following up on any extra steps that might be needed [08:05:34] stevemunene: there is a discussion on slack. More context there. [08:07:31] (CR) Slyngshede: [C:-1] "We should add tests, like so:" [alerts] - https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) (owner: Muehlenhoff) [08:08:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T407997)', diff saved to https://phabricator.wikimedia.org/P84430 and previous config saved to /var/cache/conftool/dbconfig/20251030-080816-marostegui.json [08:08:17] FIRING: [18x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:08:22] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [08:08:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance [08:08:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1180 (T407997)', diff saved to https://phabricator.wikimedia.org/P84431 and previous config saved to /var/cache/conftool/dbconfig/20251030-080840-marostegui.json [08:08:43] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [08:10:29] !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [08:10:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T407997)', diff saved to https://phabricator.wikimedia.org/P84432 and previous config saved to /var/cache/conftool/dbconfig/20251030-081050-marostegui.json [08:12:02] FIRING: [6x] 
RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:12:53] SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11326255 (elukey) Repooled codfw after the eqiad-only test, I think we are good! We'll wait a couple more days to be sure, but from next week we should start decomming the old har... [08:13:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:15:25] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2013 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [08:15:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:16:23] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2013 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [08:17:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:18:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:18:31] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1033.eqiad.wmnet with reason: host reimage [08:18:43] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [08:20:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:22:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:22:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1155.eqiad.wmnet with reason: Upgrade [08:23:17] FIRING: [18x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:23:38] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet with reason: Fixing triggers [08:23:41] !log 
marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1033.eqiad.wmnet with reason: host reimage [08:25:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P84433 and previous config saved to /var/cache/conftool/dbconfig/20251030-082558-marostegui.json [08:27:03] (CR) Stevemunene: Deploy airflow images from airflow-dags repository build (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: Stevemunene) [08:27:09] (CR) Jcrespo: [C:+1] "Yes, we don't actively backup this host (only every 5 years). Although we should migrate the backup user grants." [puppet] - https://gerrit.wikimedia.org/r/1199541 (https://phabricator.wikimedia.org/T408662) (owner: Marostegui) [08:28:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:28:30] (CR) Marostegui: "The host was cloned from the existing one, so if they were there, they should be on the new host too" [puppet] - https://gerrit.wikimedia.org/r/1199541 (https://phabricator.wikimedia.org/T408662) (owner: Marostegui) [08:28:31] (CR) Marostegui: [C:+2] backup1013.cnf.erb: Change es1032 with es1055 [puppet] - https://gerrit.wikimedia.org/r/1199541 (https://phabricator.wikimedia.org/T408662) (owner: Marostegui) [08:30:59] (PS8) Fabfur: P:cache:haproxy: introduce ua classes [puppet] - https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [08:32:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:32:03] (03CR) 10Fabfur: P:cache:haproxy: introduce ua classes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [08:32:54] (03CR) 10Elukey: [C:03+2] conftool: upgrade to 6.x and above [software/spicerack] - 10https://gerrit.wikimedia.org/r/1199723 (owner: 10Giuseppe Lavagetto) [08:33:44] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [08:39:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:41:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P84434 and previous config saved to /var/cache/conftool/dbconfig/20251030-084105-marostegui.json [08:43:44] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [08:47:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:48:17] RESOLVED: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:52:02] RESOLVED: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:54:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:56:02] FIRING: [6x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:56:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199854 (https://phabricator.wikimedia.org/T406170) (owner: 10D3r1ck01) [08:56:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T407997)', diff saved to https://phabricator.wikimedia.org/P84435 and previous config saved to /var/cache/conftool/dbconfig/20251030-085613-marostegui.json [08:56:17] FIRING: [12x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:56:19] T407997: Drop the afl_ip 
column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [08:56:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199856 (https://phabricator.wikimedia.org/T406170) (owner: 10D3r1ck01) [08:56:29] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance [08:56:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1187 (T407997)', diff saved to https://phabricator.wikimedia.org/P84436 and previous config saved to /var/cache/conftool/dbconfig/20251030-085636-marostegui.json [08:58:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T407997)', diff saved to https://phabricator.wikimedia.org/P84437 and previous config saved to /var/cache/conftool/dbconfig/20251030-085846-marostegui.json [08:59:47] FIRING: [18x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:03:38] (03PS1) 10Slyngshede: data.yaml add tracking for sherryyang [puppet] - 10https://gerrit.wikimedia.org/r/1199993 [09:04:21] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:06:17] FIRING: [10x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:08:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:08:47] 06SRE: offline rackspace wikitech-static, online aws wikitech-static - https://phabricator.wikimedia.org/T408704#11326433 (10LSobanski) cc @akosiaris [09:10:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:13:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:13:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P84438 and previous config saved to /var/cache/conftool/dbconfig/20251030-091354-marostegui.json [09:14:24] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1199993 (owner: 10Slyngshede) [09:15:02] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:15:43] 
FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:18:54] (03PS1) 10Daniel Kinzler: Fix handling of per-route ratelimit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199999 [09:19:08] (03CR) 10Slyngshede: [C:03+2] data.yaml add tracking for sherryyang [puppet] - 10https://gerrit.wikimedia.org/r/1199993 (owner: 10Slyngshede) [09:19:26] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11326454 (10elukey) [09:20:02] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:20:26] (03CR) 10Brouberol: Deploy airflow images from airflow-dags repository build (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: 10Stevemunene) [09:20:50] (03CR) 10Brouberol: "Also, as a general point, please render locally to ferret these issues earlier." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: 10Stevemunene) [09:25:02] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:25:11] (03CR) 10Tiziano Fogli: [C:03+2] nrpe2nodexp: use service description as alertname [puppet] - 10https://gerrit.wikimedia.org/r/1199242 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [09:25:45] (03CR) 10Brouberol: "Something like" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: 10Stevemunene) [09:28:36] (03CR) 10Brouberol: "You can also run `rake run_locally` in your `deployment-charts` directory to run the CI job locally." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: 10Stevemunene) [09:29:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P84439 and previous config saved to /var/cache/conftool/dbconfig/20251030-092901-marostegui.json [09:30:21] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [09:34:21] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:35:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:38:44] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:40:21] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [09:41:43] (03CR) 10Majavah: [C:03+2] aptrepo: Retire kubeadm/1.29 components [puppet] - 10https://gerrit.wikimedia.org/r/1199240 (owner: 10Majavah) [09:41:50] (03CR) 10Majavah: [C:03+2] aptrepo: Import Kubeadm/1.31 packages [puppet] - 10https://gerrit.wikimedia.org/r/1199241 (https://phabricator.wikimedia.org/T372697) (owner: 10Majavah) [09:42:13] (03PS1) 10JMeybohm: admin: Replace my ssh key with a FIDO token [puppet] - 
10https://gerrit.wikimedia.org/r/1200008 [09:42:13] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:42:58] (03PS2) 10Daniel Kinzler: Fix handling of per-route ratelimit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199999 [09:43:15] (03PS1) 10D3r1ck01: Stats: add getLabels() function [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1200009 (https://phabricator.wikimedia.org/T406170) [09:44:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T407997)', diff saved to https://phabricator.wikimedia.org/P84440 and previous config saved to /var/cache/conftool/dbconfig/20251030-094409-marostegui.json [09:44:15] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [09:44:21] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:44:26] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [09:44:54] (03Abandoned) 10D3r1ck01: Stats: add getLabels() function [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1200009 (https://phabricator.wikimedia.org/T406170) (owner: 10D3r1ck01) [09:47:13] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:47:36] (03PS1) 10Stevemunene: 
Add an opensearch-test-codfw namespace [puppet] - 10https://gerrit.wikimedia.org/r/1200010 (https://phabricator.wikimedia.org/T408779) [09:48:31] (03PS1) 10Majavah: aptrepo: Remove previously-missed reference to kubeadm 1.29 [puppet] - 10https://gerrit.wikimedia.org/r/1200011 [09:48:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:48:47] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1231.eqiad.wmnet with reason: Maintenance [09:48:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1231 (T407997)', diff saved to https://phabricator.wikimedia.org/P84441 and previous config saved to /var/cache/conftool/dbconfig/20251030-094854-marostegui.json [09:49:13] (03CR) 10Majavah: [C:03+2] aptrepo: Remove previously-missed reference to kubeadm 1.29 [puppet] - 10https://gerrit.wikimedia.org/r/1200011 (owner: 10Majavah) [09:50:30] !log import prometheus-statsd-exporter to trixie-wikimedia T407513 [09:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:36] T407513: Key packages missing from trixie-wikimedia - https://phabricator.wikimedia.org/T407513 [09:51:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T407997)', diff saved to https://phabricator.wikimedia.org/P84442 and previous config saved to /var/cache/conftool/dbconfig/20251030-095103-marostegui.json [09:51:09] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [09:51:32] (03PS3) 10Daniel Kinzler: Fix handling of per-route ratelimit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199999 [09:53:04] (03PS4) 10Daniel Kinzler: Fix handling of per-route ratelimit config 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1199999 [09:54:28] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, key has been verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1200008 (owner: 10JMeybohm) [09:55:06] (03CR) 10JMeybohm: [C:03+2] admin: Replace my ssh key with a FIDO token [puppet] - 10https://gerrit.wikimedia.org/r/1200008 (owner: 10JMeybohm) [09:56:25] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1000) [10:01:25] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:03:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [10:05:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [10:05:45] (03CR) 10Elukey: "Left some comments to better understand the code!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1198108 (https://phabricator.wikimedia.org/T265342) (owner: 10Cathal Mooney) [10:06:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P84443 and previous config saved to /var/cache/conftool/dbconfig/20251030-100611-marostegui.json [10:08:44] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:10:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [10:12:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [10:12:52] (03CR) 10Cathal Mooney: sre.hosts.provision: move the switch config to parent class and run (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1198108 (https://phabricator.wikimedia.org/T265342) (owner: 10Cathal Mooney) [10:13:43] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:14:06] (03PS1) 10Tiziano Fogli: haproxy: enable nrpe2nodexp wrapper on check-cinder-snapshot-leaks [puppet] - 10https://gerrit.wikimedia.org/r/1200012 (https://phabricator.wikimedia.org/T328502) 
[10:14:06] (03CR) 10Tiziano Fogli: "This change enables the nrpe2nodexp wrapper to export NRPE plugin results to Prometheus via the node exporter." [puppet] - 10https://gerrit.wikimedia.org/r/1200012 (https://phabricator.wikimedia.org/T328502) (owner: 10Tiziano Fogli) [10:14:25] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:14:45] (03PS2) 10Tiziano Fogli: cinder: enable nrpe2nodexp wrapper on check-cinder-snapshot-leaks [puppet] - 10https://gerrit.wikimedia.org/r/1200012 (https://phabricator.wikimedia.org/T328502) [10:14:47] FIRING: [10x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:17:11] (03PS5) 10Daniel Kinzler: Fix handling of per-route ratelimit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199999 [10:18:43] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:19:08] (03PS1) 10Tiziano Fogli: neutron: enable nrpe2nodexp wrapper on check-neutron-conntrack [puppet] - 10https://gerrit.wikimedia.org/r/1200016 (https://phabricator.wikimedia.org/T328502) [10:19:25] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:21:12] (03PS6) 10Daniel Kinzler: Fix handling of per-route ratelimit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199999 [10:21:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to 
https://phabricator.wikimedia.org/P84444 and previous config saved to /var/cache/conftool/dbconfig/20251030-102118-marostegui.json [10:21:59] 10SRE-Access-Requests: Posix group membership: dpogorzelski ->ml-lab-users - https://phabricator.wikimedia.org/T408788 (10DPogorzelski-WMF) 03NEW [10:22:29] (03PS1) 10Dpogorzelski: topic: add dpogorzelski to ml-lab-users [puppet] - 10https://gerrit.wikimedia.org/r/1200017 (https://phabricator.wikimedia.org/T408788) [10:22:35] (03CR) 10CI reject: [V:04-1] Fix handling of per-route ratelimit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199999 (owner: 10Daniel Kinzler) [10:23:09] (03PS1) 10Tiziano Fogli: nova: enable nrpe2nodexp wrapper on check-flavor_aggregates [puppet] - 10https://gerrit.wikimedia.org/r/1200018 (https://phabricator.wikimedia.org/T328502) [10:24:01] (03PS7) 10Daniel Kinzler: Fix handling of per-route ratelimit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199999 [10:24:47] FIRING: [10x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:25:00] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1200012 (https://phabricator.wikimedia.org/T328502) (owner: 10Tiziano Fogli) [10:26:12] (03PS9) 10Stevemunene: Deploy airflow images from airflow-dags repository build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) [10:27:14] (03CR) 10Stevemunene: "Thanks, using this for now and bookmarked the other helpful tips" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: 10Stevemunene) [10:27:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook 
- https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [10:28:13] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [10:28:50] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1033.eqiad.wmnet with OS trixie [10:29:21] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:33:53] (03Abandoned) 10Dpogorzelski: topic: add dpogorzelski to ml-lab-users [puppet] - 10https://gerrit.wikimedia.org/r/1200017 (https://phabricator.wikimedia.org/T408788) (owner: 10Dpogorzelski) [10:34:46] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Posix group membership: dpogorzelski ->ml-lab-users - https://phabricator.wikimedia.org/T408788#11326666 (10DPogorzelski-WMF) 05Open→03Invalid [10:36:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T407997)', diff saved to https://phabricator.wikimedia.org/P84445 and previous config saved to /var/cache/conftool/dbconfig/20251030-103626-marostegui.json [10:36:33] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [10:36:44] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [10:39:40] (03PS8) 10Daniel Kinzler: Fix handling of per-route ratelimit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199999 [10:40:02] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind 
applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:44:11] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1002.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:44:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:45:16] here [10:45:38] !incidents [10:45:38] 6910 (UNACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [10:45:41] We were just seeing 503s on Grafana [10:45:42] !ack 6910 [10:45:43] 6910 (ACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [10:46:31] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:47:11] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:47:31] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:49:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:50:13] Looking at the thanos-swift dashboard, there was a big spike of requests (resulting in 206) around the time the page fired [10:52:28] (with consequent rise in network traffic etc) [10:54:56] looks like it hit 1002 a lot harder than 1001 [10:54:58] 06SRE, 06Infrastructure-Foundations: megacli issues on Debian Trixie - https://phabricator.wikimedia.org/T408776#11326712 (10MoritzMuehlenhoff) p:05Triage→03Medium [10:55:02] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:55:31] memory usage spike at the same time lines up. $someone did $something expensive [10:55:42] the `thanos-query-log-explore` script in the docs doesn't output anything it seems unless I'm using it wrong [10:56:06] I can't get it to either. [10:56:22] yeah, during the same window, some queries took several minutes to complete.. [10:57:09] hnowlan: shall I open a ticket about that for observability to look at? [10:57:39] (but I think this incident probably doesn't need more work from oncall now otherwise) [10:58:38] ah, no, I get it now, if I specify --min-range 1m then I get answers [11:01:40] <_joe_> https://www.youtube.com/watch?v=M_5u3ESfFv0 [11:03:32] :D [11:03:42] tappof: could you have a look to see if anything sticks out please? [11:04:15] yes hnowlan [11:05:00] * hnowlan afk for an hour [11:05:06] thanks m.oritzm! <3 [11:05:11] yw! 
[11:05:15] (03CR) 10Clément Goubert: [C:03+1] Fix handling of per-route ratelimit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199999 (owner: 10Daniel Kinzler) [11:07:06] (03PS9) 10Daniel Kinzler: Fix handling of per-route ratelimit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199999 [11:07:31] (03CR) 10Clément Goubert: [C:03+1] Fix handling of per-route ratelimit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199999 (owner: 10Daniel Kinzler) [11:07:49] 06SRE, 06Data-Engineering (Q2 FY25/26 October 1st - December 31th): Move Druid realtime configuration out of Refinery into standalone repo on GitLab - https://phabricator.wikimedia.org/T407994#11326744 (10JAllemandou) The reason for which I suggested doing this task is that Druid-realtime are a specific type o... [11:08:13] (03PS5) 10Muehlenhoff: Add an alert for Ganeti CA expiry [alerts] - 10https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) [11:09:54] (03CR) 10CI reject: [V:04-1] Add an alert for Ganeti CA expiry [alerts] - 10https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) (owner: 10Muehlenhoff) [11:10:44] (03CR) 10Clément Goubert: [C:03+2] Fix handling of per-route ratelimit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199999 (owner: 10Daniel Kinzler) [11:12:55] (03Merged) 10jenkins-bot: Fix handling of per-route ratelimit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199999 (owner: 10Daniel Kinzler) [11:13:13] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:14:47] FIRING: [10x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:15:02] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:16:07] (03CR) 10Brouberol: [C:04-1] Add an opensearch-test-codfw namespace (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1200010 (https://phabricator.wikimedia.org/T408779) (owner: 10Stevemunene) [11:17:56] (03CR) 10Brouberol: [C:04-1] "I'm still seeing" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: 10Stevemunene) [11:18:22] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:19:36] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:20:02] RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:24:42] (03CR) 10Stevemunene: Add an opensearch-test-codfw namespace (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1200010 (https://phabricator.wikimedia.org/T408779) (owner: 10Stevemunene) [11:25:17] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:25:31] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:25:49] (03PS6) 10Muehlenhoff: Add an alert for Ganeti CA expiry [alerts] 
- 10https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) [11:27:31] (03CR) 10CI reject: [V:04-1] Add an alert for Ganeti CA expiry [alerts] - 10https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) (owner: 10Muehlenhoff) [11:28:30] (03PS1) 10Muehlenhoff: Re-enable monitoring for maps/bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1200030 (https://phabricator.wikimedia.org/T381565) [11:29:08] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team: Promote dpogorzelski from ops-limited to ops - https://phabricator.wikimedia.org/T408702#11326775 (10DPogorzelski-WMF) a:03mark [11:33:44] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:34:05] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:34:14] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:34:21] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:38:49] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission es2026 - https://phabricator.wikimedia.org/T408385#11326804 (10Marostegui) [11:40:19] (03PS1) 10Marostegui: es2028: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1200031 (https://phabricator.wikimedia.org/T408407) [11:41:03] (03CR) 10Marostegui: [C:03+2] es2028: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1200031 (https://phabricator.wikimedia.org/T408407) (owner: 10Marostegui) [11:41:24] (03CR) 10Marostegui: [C:04-2] "Not for now, as I 
am using it for some Debian trixie testing." [puppet] - 10https://gerrit.wikimedia.org/r/1199825 (https://phabricator.wikimedia.org/T408407) (owner: 10Federico Ceratto) [11:41:38] (03PS10) 10Stevemunene: Deploy airflow images from airflow-dags repository build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) [11:42:47] (03PS7) 10Muehlenhoff: Add an alert for Ganeti CA expiry [alerts] - 10https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) [11:43:55] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [11:44:02] (03CR) 10CI reject: [V:04-1] Add an alert for Ganeti CA expiry [alerts] - 10https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) (owner: 10Muehlenhoff) [11:45:08] > tappof: could you have a look to see if anything sticks out please? [11:45:23] !log installing pdns-recursor security updates [11:45:25] Looks like the short outage was caused by a request on the "all clusters utilization" dashboard with a time range of a year [11:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:41] !log fnegri@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database minwikisource (T408346) [11:45:47] T408346: [wikireplicas] Create views for new wiki minwikisource - https://phabricator.wikimedia.org/T408346 [11:48:50] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for mvernon - https://phabricator.wikimedia.org/T408793 (10MatthewVernon) 03NEW [11:54:15] (03PS1) 10Clément Goubert: api-gateway: Improve policy override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200033 [11:54:48] (03CR) 10Effie Mouzeli: [C:03+1] Remove obsolete appserver cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/1178528 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [11:54:49] 06SRE, 10SRE-Access-Requests, 
06Data-Engineering: Grant Access to analytics-privatedata-users for mvernon - https://phabricator.wikimedia.org/T408793#11326898 (10taavi) [11:55:07] (03CR) 10Effie Mouzeli: [C:03+1] "woohoo" [puppet] - 10https://gerrit.wikimedia.org/r/1198952 (owner: 10Muehlenhoff) [11:55:58] (03CR) 10Muehlenhoff: [C:03+2] Remove Cumin aliases for legacy mediawiki servers [puppet] - 10https://gerrit.wikimedia.org/r/1198952 (owner: 10Muehlenhoff) [11:59:09] (03PS1) 10Stevemunene: druid: switch to using the druid-public-coordinator url [puppet] - 10https://gerrit.wikimedia.org/r/1200034 (https://phabricator.wikimedia.org/T403955) [11:59:48] (03CR) 10Clément Goubert: [C:03+2] api-gateway: Improve policy override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200033 (owner: 10Clément Goubert) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1200) [12:01:03] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2028.codfw.wmnet with reason: host reimage [12:01:39] (03Merged) 10jenkins-bot: api-gateway: Improve policy override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200033 (owner: 10Clément Goubert) [12:02:32] (03PS8) 10Muehlenhoff: Add an alert for Ganeti CA expiry [alerts] - 10https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) [12:03:42] (03CR) 10CI reject: [V:04-1] Add an alert for Ganeti CA expiry [alerts] - 10https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) (owner: 10Muehlenhoff) [12:03:46] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2028.codfw.wmnet with reason: host reimage [12:03:53] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:04:22] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:04:44] !log cgoubert@deploy2002 helmfile [staging] START 
helmfile.d/services/rest-gateway: apply [12:05:21] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:06:05] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:06:16] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:06:46] (03PS9) 10Muehlenhoff: Add an alert for Ganeti CA expiry [alerts] - 10https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) [12:10:26] (03CR) 10Slyngshede: [C:03+1] "Looks good." [alerts] - 10https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) (owner: 10Muehlenhoff) [12:13:44] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:21:18] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1032.eqiad.wmnet - https://phabricator.wikimedia.org/T408662#11326937 (10Jclark-ctr) [12:22:22] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1032.eqiad.wmnet - https://phabricator.wikimedia.org/T408662#11326938 (10Jclark-ctr) 05Open→03Resolved [12:23:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission kafka-jumbo100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T404413#11326940 (10Jclark-ctr) [12:29:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198626 (https://phabricator.wikimedia.org/T408284) (owner: 10Bunnypranav) [12:31:12] !log installing nginx security updates [12:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:36:52] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission kafka-jumbo100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T404413#11326973 (10Jclark-ctr) 05Open→03Resolved [12:41:24] (03PS1) 10Marostegui: installserver: Format /srv/ in es2028 [puppet] - 10https://gerrit.wikimedia.org/r/1200049 (https://phabricator.wikimedia.org/T407472) [12:45:21] (03CR) 10Marostegui: [C:03+2] installserver: Format /srv/ in es2028 [puppet] - 10https://gerrit.wikimedia.org/r/1200049 (https://phabricator.wikimedia.org/T407472) (owner: 10Marostegui) [12:48:37] (03PS1) 10Huei Tan: alartmanager: change the lpl-team-slack-api-alerts config [puppet] - 10https://gerrit.wikimedia.org/r/1200053 (https://phabricator.wikimedia.org/T376535) [12:49:05] (03CR) 10CI reject: [V:04-1] alartmanager: change the lpl-team-slack-api-alerts config [puppet] - 10https://gerrit.wikimedia.org/r/1200053 (https://phabricator.wikimedia.org/T376535) (owner: 10Huei Tan) [12:52:10] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es2028.codfw.wmnet with OS trixie [12:53:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11327005 (10Jclark-ctr) Replacement drive is being ordered from dell on ticket T408572 after reviewing Available options other supplie... [12:58:11] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [13:00:05] Urbanecm and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1300). [13:00:05] seanleong-wmde, JavierMonton, mfossati, Superpes, and Bunnypranav: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] o/ [13:00:25] o/ [13:00:37] o/ [13:01:48] o/ [13:02:19] bunnypranav Any reason why you moved on workboard the 2 tasks I was handling and change the status? [13:04:21] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:04:29] Urbanecm, TheresNoTime: I can self-deploy [13:06:48] Superpes: Nothing special, I thought the general process for work board/task management (someone did that for mine as well earlier). Apologies if you do not want it; feel free to revert and I will take a note for future. [13:07:51] bunnypranav Nope, don't get me wrong, the process is absolutely correct! But, since they should be closed in less than an hour... well, I'd say it's a pointless change, just more unnecessary work for us, that's all :D No need to revert :) [13:08:28] mfossati Start with your patches, since I think we'll be here all night otherwise; then, if you can manage, we'd be here too :D [13:08:43] Oh okay, will remember for any future changes. [13:08:59] Superpes ok, going!
[13:09:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy2002 using scap backport" [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199814 (owner: 10Marco Fossati) [13:09:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy2002 using scap backport" [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199844 (https://phabricator.wikimedia.org/T408618) (owner: 10Marco Fossati) [13:09:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy2002 using scap backport" [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199847 (owner: 10Marco Fossati) [13:12:58] (03Merged) 10jenkins-bot: Localisation updates from https://translatewiki.net. [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199814 (owner: 10Marco Fossati) [13:12:58] (03Merged) 10jenkins-bot: Style adjustments [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199844 (https://phabricator.wikimedia.org/T408618) (owner: 10Marco Fossati) [13:13:01] (03Merged) 10jenkins-bot: Capture more captions [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199847 (owner: 10Marco Fossati) [13:13:50] !log mfossati@deploy2002 Started scap sync-world: Backport for [[gerrit:1199814|Localisation updates from https://translatewiki.net.]], [[gerrit:1199844|Style adjustments (T408618)]], [[gerrit:1199847|Capture more captions]] [13:13:55] T408618: UI Bug Bash for Image browsing (production) - https://phabricator.wikimedia.org/T408618 [13:14:45] Hi, sorry for the question, it's my first time trying to deploy a change, I added it to the calendar but I'm not sure if I have to do anything else. Can I help with it somehow? 
[13:15:16] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve2001.codfw.wmnet [13:15:19] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve2001.codfw.wmnet [13:15:30] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: DIMM_A2 errors for ml-serve2001 - https://phabricator.wikimedia.org/T408516#11327064 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by elukey@cumin1003 pool for host ml-serve2001.codfw.wmnet completed: - ml-serve2001.codfw.w... [13:15:33] Hi, anyone able to help me deploy my changes? Thanks! [13:15:38] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: DIMM_A2 errors for ml-serve2001 - https://phabricator.wikimedia.org/T408516#11327067 (10elukey) 05Open→03Resolved a:03elukey Host repooled! [13:16:26] urbanecm, TheresNoTime: are you around to deploy the config changes by JavierMonton and seanleong-wmde? [13:16:45] same here btw, I also need a deployer [13:16:51] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2028.codfw.wmnet with reason: host reimage [13:16:53] yup, I am around, but I need a deployer [13:18:10] !log mfossati@deploy2002 mfossati: Backport for [[gerrit:1199814|Localisation updates from https://translatewiki.net.]], [[gerrit:1199844|Style adjustments (T408618)]], [[gerrit:1199847|Capture more captions]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:18:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host pki-root1001.eqiad.wmnet [13:18:43] checking, please hold on :-) [13:19:22] 06SRE, 10SRE-Access-Requests: Add dpogorzelski to ML and Data Platform posix groups - https://phabricator.wikimedia.org/T408579#11327081 (10elukey) rationale for `ml-team-admins`: while Dawid will soon be in `ops`, some tools available only to `ml-team-admins` will need to be tested in the future and not needi... 
[13:20:30] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2028.codfw.wmnet with reason: host reimage [13:22:15] hmm for some reason I'm not seeing the changes with WikimediaDebug ... let me dig further [13:22:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance [13:23:16] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2214.codfw.wmnet with reason: Maintenance [13:24:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki-root1001.eqiad.wmnet [13:26:59] no idea why the WikimediaDebug extension in Firefox isn't showing me the changes [13:28:03] I'm trying to switch a few backends [13:28:58] (03PS2) 10Gehel: WDQS: remove ferm rule for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/1199849 (https://phabricator.wikimedia.org/T408736) [13:31:30] Oh I think I got it: I can't test on the wikis where the extension is deployed since they aren't yet at 1.45.0-wmf.25. I'll go forward. 
Thanks for bearing with me [13:32:07] !log mfossati@deploy2002 mfossati: Continuing with sync [13:32:17] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1167.eqiad.wmnet with reason: Maintenance [13:32:36] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:32:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1167 (T407997)', diff saved to https://phabricator.wikimedia.org/P84446 and previous config saved to /var/cache/conftool/dbconfig/20251030-133243-marostegui.json [13:32:49] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [13:34:05] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on 60 hosts with reason: downtime new nokia devices in case they alert during tests [13:34:12] 06SRE, 06Infrastructure-Foundations, 10netops: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558#11327142 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=aacadee6-1bf1-45b7-bbed-963884cb38ed) set by cmooney@cumin1003 for 5 d... [13:34:21] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:35:59] !log fnegri@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database minwikisource (T408346) [13:36:00] mfossati I just noticed! 
That's right, it can't be tested on wikis, because it's for the next update :D [13:36:04] T408346: [wikireplicas] Create views for new wiki minwikisource - https://phabricator.wikimedia.org/T408346 [13:36:10] !log fnegri@cumin1003 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database minwikisource (T408346) [13:36:18] (03CR) 10Brouberol: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: 10Stevemunene) [13:36:35] Is anyone available to deploy the other patches? [13:36:55] !log mfossati@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199814|Localisation updates from https://translatewiki.net.]], [[gerrit:1199844|Style adjustments (T408618)]], [[gerrit:1199847|Capture more captions]] (duration: 23m 05s) [13:37:00] T408618: UI Bug Bash for Image browsing (production) - https://phabricator.wikimedia.org/T408618 [13:37:01] Superpes: LOL, I definitely overlooked that [13:37:21] I'm all done here! [13:39:36] Superpes: I have deploy rights, so I guess I could do that, but don't wanna step on any official deployer toes :-) [13:39:58] (03CR) 10Muehlenhoff: C:openldap extend wikimediaPerson schema for Phabricator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1197617 (https://phabricator.wikimedia.org/T406495) (owner: 10Slyngshede) [13:40:11] Let me check if they're available on Slack [13:40:18] I think no one is available atm (except you) :D [13:40:22] Yep for sure! [13:40:26] thanks! [13:40:37] I have just 1 config change [13:42:02] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [13:42:19] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [13:42:27] They both seem away on Slack, too. 
Well, I'll wear the deployer hat then [13:42:50] !log fnegri@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database pcmwikiquote (T408354) [13:42:51] o7 [13:42:55] T408354: [wikireplicas] Create views for new wiki pcmwikiquote - https://phabricator.wikimedia.org/T408354 [13:43:21] (03CR) 10Bking: [C:03+1] Add an opensearch-test-codfw namespace (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1200010 (https://phabricator.wikimedia.org/T408779) (owner: 10Stevemunene) [13:44:00] thanks mfossati! [13:44:21] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:44:38] seanleong-wmde, JavierMonton, Superpes, Bunnypranav: I'll backport all config patches at once. If anybody needs to verify their patch, please let me know [13:45:34] (03PS2) 10Andrea Denisse: alartmanager: change the lpl-team-slack-api-alerts config [puppet] - 10https://gerrit.wikimedia.org/r/1200053 (https://phabricator.wikimedia.org/T376535) (owner: 10Huei Tan) [13:45:54] (03CR) 10Andrea Denisse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1200053 (https://phabricator.wikimedia.org/T376535) (owner: 10Huei Tan) [13:46:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193703 (https://phabricator.wikimedia.org/T397258) (owner: 10Seanleong-wmde) [13:46:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199246 (https://phabricator.wikimedia.org/T384964) (owner: 10JavierMonton) [13:46:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199725 
(https://phabricator.wikimedia.org/T408298) (owner: 10Superpes15) [13:46:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199727 (https://phabricator.wikimedia.org/T408514) (owner: 10Superpes15) [13:46:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198626 (https://phabricator.wikimedia.org/T408284) (owner: 10Bunnypranav) [13:46:27] mfossati I could do a short test if it's possible in mwdebug [13:46:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T407997)', diff saved to https://phabricator.wikimedia.org/P84447 and previous config saved to /var/cache/conftool/dbconfig/20251030-134639-marostegui.json [13:46:43] (03CR) 10Andrea Denisse: [C:03+2] alartmanager: change the lpl-team-slack-api-alerts config [puppet] - 10https://gerrit.wikimedia.org/r/1200053 (https://phabricator.wikimedia.org/T376535) (owner: 10Huei Tan) [13:46:46] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [13:46:50] seanleong-wmde: sure thing [13:46:56] thanks! [13:46:59] (03Merged) 10jenkins-bot: Add feature flag for pilot wikis about visual changes coming from Wikibase having an icon. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193703 (https://phabricator.wikimedia.org/T397258) (owner: 10Seanleong-wmde) [13:47:02] (03Merged) 10jenkins-bot: Disable default user-agent collection. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199246 (https://phabricator.wikimedia.org/T384964) (owner: 10JavierMonton) [13:47:04] (03Merged) 10jenkins-bot: [huwiki] Set $wgUploadNavigationUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199725 (https://phabricator.wikimedia.org/T408298) (owner: 10Superpes15) [13:47:07] (03Merged) 10jenkins-bot: [ruwiki] Enable WikiLove extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199727 (https://phabricator.wikimedia.org/T408514) (owner: 10Superpes15) [13:47:09] (03Merged) 10jenkins-bot: core-Namespaces: Add R: and R_talk: NS for crhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198626 (https://phabricator.wikimedia.org/T408284) (owner: 10Bunnypranav) [13:47:43] !log mfossati@deploy2002 Started scap sync-world: Backport for [[gerrit:1193703|Add feature flag for pilot wikis about visual changes coming from Wikibase having an icon. (T397258)]], [[gerrit:1199246|Disable default user-agent collection. (T384964)]], [[gerrit:1199725|[huwiki] Set $wgUploadNavigationUrl (T408298)]], [[gerrit:1199727|[ruwiki] Enable WikiLove extension (T408514)]], [[gerrit:1198626|core-Namespaces: Add R: [13:47:43] and R_talk: NS for crhwiki (T408284)]] [13:47:53] T397258: Implement Visual Changes to Edit Summary Based on UX Proposal - https://phabricator.wikimedia.org/T397258 [13:47:55] T384964: [Event Platform] Disable default collection of user agent for analytics streams - https://phabricator.wikimedia.org/T384964 [13:47:57] T408298: Set $wgUploadNavigationUrl for hu.wikipedia.org - https://phabricator.wikimedia.org/T408298 [13:47:57] T408514: Install Extension:WikiLove in Russian Wikipedia - https://phabricator.wikimedia.org/T408514 [13:47:58] T408284: Request to create a namespace for Crimean Tatar Wikipedia - https://phabricator.wikimedia.org/T408284 [13:48:09] (03CR) 10Slyngshede: C:openldap extend wikimediaPerson schema for Phabricator (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1197617 (https://phabricator.wikimedia.org/T406495) (owner: 10Slyngshede) [13:50:20] !log mfossati@deploy2002 superpes, bunnypranav, javiermonton, mfossati, seanleong-wmde: Backport for [[gerrit:1193703|Add feature flag for pilot wikis about visual changes coming from Wikibase having an icon. (T397258)]], [[gerrit:1199246|Disable default user-agent collection. (T384964)]], [[gerrit:1199725|[huwiki] Set $wgUploadNavigationUrl (T408298)]], [[gerrit:1199727|[ruwiki] Enable WikiLove extension (T408514)]], [[g [13:50:20] errit:1198626|core-Namespaces: Add R: and R_talk: NS for crhwiki (T408284)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:50:41] Testing :) [13:50:42] testing it now [13:51:22] testing [13:51:45] checking [13:52:27] mfossati, all good for mine, but you'll need to run namespace dupes script as well for that wiki. [13:53:26] there are existing pages with the new namespace prefix (R:, https://crh.wikipedia.org/wiki/Mahsus:%C3%96nekDizini?prefix=R%3A&namespace=0), so that will need fixing [13:53:35] bunnypranav: sorry, but I've never done that and not sure how to [13:53:56] Oh, anyone here that can help with the script? [13:55:28] (03CR) 10Majavah: [C:04-1] "It seems like this is caused by a mismatch of the wmf server packages and the debian client package:" [puppet] - 10https://gerrit.wikimedia.org/r/1199850 (owner: 10Andrew Bogott) [13:56:50] mfossati: https://www.mediawiki.org/wiki/Manual:NamespaceDupes.php is the docs for it, if it helps. I assume, since this should not be clashing by my understanding, that "./maintenance/run namespaceDupes --fix" the automatic repair one should work [13:56:56] on the debug extension, do you know which of the options e.g. k8s-mwdebug we should be looking on? [13:57:01] in the dropdown [13:57:07] would you be able to do it? 
[13:57:58] (03PS1) 10Elukey: team-sre: set only critical alerts for mirrors [alerts] - 10https://gerrit.wikimedia.org/r/1200068 [13:58:17] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11327268 (10MatthewVernon) @VRiley-WMF the host is up, but it can't reach any of its spinning disks (the OS sees none, and the BMC says 0 physical disks).... [13:59:49] Sorry for the late everything fine for my patches :) [14:00:15] bunnypranav: I'm afraid I can't help further, never done that, so not confident at all [14:00:59] everything fine on my side too [14:01:03] hmm, I was actually advised that this is the script to be run. [14:01:05] (03CR) 10Bking: [C:03+1] global_config: stop relying on DNS to translate FQDNs into IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/1199813 (https://phabricator.wikimedia.org/T408706) (owner: 10Brouberol) [14:01:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P84448 and previous config saved to /var/cache/conftool/dbconfig/20251030-140147-marostegui.json [14:01:52] (advised by other deployers few days ago) [14:02:11] (03CR) 10Tiziano Fogli: [C:03+2] cinder: enable nrpe2nodexp wrapper on check-cinder-snapshot-leaks [puppet] - 10https://gerrit.wikimedia.org/r/1200012 (https://phabricator.wikimedia.org/T328502) (owner: 10Tiziano Fogli) [14:02:37] bunnypranav: it would be great if you could directly ask them [14:03:42] I can send xSavitar a DM to do it in a few hours when they said they will be available, and we continue the patch for now. Would that be okay with you? 
[14:03:48] (03CR) 10Kgraessle: [C:03+1] Enable ChangesListQuery partitioning on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199890 (https://phabricator.wikimedia.org/T403798) (owner: 10Tim Starling) [14:03:57] (03PS1) 10Marostegui: rebuild_abuse_filter_log_trigger.sh: Quick oneliner to drop a trigger [software] - 10https://gerrit.wikimedia.org/r/1200070 (https://phabricator.wikimedia.org/T408780) [14:04:07] (03CR) 10Kgraessle: [C:03+1] Enable ChangesListQuery partitioning on enwiki and commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199891 (https://phabricator.wikimedia.org/T403798) (owner: 10Tim Starling) [14:04:50] (03CR) 10Marostegui: "Just a quick script for this task, as it may be needed in the future for other schema changes, just leaving this one here as example as so" [software] - 10https://gerrit.wikimedia.org/r/1200070 (https://phabricator.wikimedia.org/T408780) (owner: 10Marostegui) [14:04:55] bunnypranav: yep, that sounds good. Thanks for your understanding, as I came here only to backport my patches :-) [14:05:02] (03CR) 10Marostegui: [C:03+2] rebuild_abuse_filter_log_trigger.sh: Quick oneliner to drop a trigger [software] - 10https://gerrit.wikimedia.org/r/1200070 (https://phabricator.wikimedia.org/T408780) (owner: 10Marostegui) [14:05:12] (03CR) 10Kgraessle: [C:03+1] Enable ChangesListQuery partitioning on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199892 (https://phabricator.wikimedia.org/T403798) (owner: 10Tim Starling) [14:05:18] Thank you for the backport, really appreciate it! :) [14:05:32] all right, let's go! 
[14:05:33] (03Merged) 10jenkins-bot: rebuild_abuse_filter_log_trigger.sh: Quick oneliner to drop a trigger [software] - 10https://gerrit.wikimedia.org/r/1200070 (https://phabricator.wikimedia.org/T408780) (owner: 10Marostegui) [14:05:45] !log mfossati@deploy2002 superpes, bunnypranav, javiermonton, mfossati, seanleong-wmde: Continuing with sync [14:06:46] !log aqu@deploy2002 Started deploy [analytics/refinery@39e92e9] (hadoop-test): Update pageview allowlist TEST [analytics/refinery@39e92e9f] [14:07:04] (03PS3) 10Bking: ingress: remove reference to defunct template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196174 (https://phabricator.wikimedia.org/T406876) [14:07:51] !log aqu@deploy2002 Finished deploy [analytics/refinery@39e92e9] (hadoop-test): Update pageview allowlist TEST [analytics/refinery@39e92e9f] (duration: 01m 04s) [14:08:24] !log aqu@deploy2002 Started deploy [analytics/refinery@39e92e9]: Update pageview allowlist [analytics/refinery@39e92e9f] [14:09:17] (03PS4) 10Bking: ingress: remove reference to defunct template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196174 (https://phabricator.wikimedia.org/T406876) [14:09:49] (03CR) 10Bking: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196174 (https://phabricator.wikimedia.org/T406876) (owner: 10Bking) [14:11:22] !log mfossati@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193703|Add feature flag for pilot wikis about visual changes coming from Wikibase having an icon. (T397258)]], [[gerrit:1199246|Disable default user-agent collection. 
(T384964)]], [[gerrit:1199725|[huwiki] Set $wgUploadNavigationUrl (T408298)]], [[gerrit:1199727|[ruwiki] Enable WikiLove extension (T408514)]], [[gerrit:1198626|core-Namespaces: Add R: [14:11:22] and R_talk: NS for crhwiki (T408284)]] (duration: 23m 39s) [14:11:31] T397258: Implement Visual Changes to Edit Summary Based on UX Proposal - https://phabricator.wikimedia.org/T397258 [14:11:32] T384964: [Event Platform] Disable default collection of user agent for analytics streams - https://phabricator.wikimedia.org/T384964 [14:11:33] T408298: Set $wgUploadNavigationUrl for hu.wikipedia.org - https://phabricator.wikimedia.org/T408298 [14:11:33] T408514: Install Extension:WikiLove in Russian Wikipedia - https://phabricator.wikimedia.org/T408514 [14:11:34] T408284: Request to create a namespace for Crimean Tatar Wikipedia - https://phabricator.wikimedia.org/T408284 [14:12:10] we're all done here! This was a quite impromptu session :-D [14:12:17] !log aqu@deploy2002 Finished deploy [analytics/refinery@39e92e9]: Update pageview allowlist [analytics/refinery@39e92e9f] (duration: 03m 52s) [14:12:27] Thank you so much! [14:12:59] thanks! mfossati [14:13:46] !log aqu@deploy2002 Started deploy [analytics/refinery@39e92e9] (thin): Update pageview allowlist THIN [analytics/refinery@39e92e9f] [14:14:00] (PS2) Tiziano Fogli: base: remove check_microcode [puppet] - https://gerrit.wikimedia.org/r/1184447 (https://phabricator.wikimedia.org/T350694) [14:14:24] (PS1) Marostegui: db2195: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1200075 [14:14:44] a pleasure, have a nice one folks!
[14:15:03] !log aqu@deploy2002 Finished deploy [analytics/refinery@39e92e9] (thin): Update pageview allowlist THIN [analytics/refinery@39e92e9f] (duration: 01m 16s) [14:15:12] (03CR) 10Marostegui: [C:03+2] db2195: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1200075 (owner: 10Marostegui) [14:16:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2195.codfw.wmnet with reason: Maintenance [14:16:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2195 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84449 and previous config saved to /var/cache/conftool/dbconfig/20251030-141638-marostegui.json [14:16:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P84450 and previous config saved to /var/cache/conftool/dbconfig/20251030-141657-marostegui.json [14:17:53] Grazie mfossati :3 [14:18:00] (03CR) 10Vgutierrez: Route "/api/rest_v1/" requests with "?spec" query to the rest gateway (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199886 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [14:18:08] (03CR) 10Stevemunene: Deploy airflow images from airflow-dags repository build (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: 10Stevemunene) [14:18:13] (03CR) 10Stevemunene: [C:03+2] Deploy airflow images from airflow-dags repository build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: 10Stevemunene) [14:19:58] (03Merged) 10jenkins-bot: Deploy airflow images from airflow-dags repository build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: 10Stevemunene) [14:20:10] (03CR) 10Muehlenhoff: base: remove check_microcode (031 comment) 
[puppet] - https://gerrit.wikimedia.org/r/1184447 (https://phabricator.wikimedia.org/T350694) (owner: Tiziano Fogli) [14:21:23] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1197617 (https://phabricator.wikimedia.org/T406495) (owner: Slyngshede) [14:22:31] (PS3) Tiziano Fogli: base: remove check_microcode [puppet] - https://gerrit.wikimedia.org/r/1184447 (https://phabricator.wikimedia.org/T350694) [14:23:00] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:24:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2195 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84451 and previous config saved to /var/cache/conftool/dbconfig/20251030-142428-root.json [14:26:10] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [14:26:52] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [14:27:20] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [14:27:40] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[14:28:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dborch1001.wikimedia.org [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1430) [14:30:48] (CR) Muehlenhoff: [C:+1] "LGTM" [alerts] - https://gerrit.wikimedia.org/r/1200068 (owner: Elukey) [14:31:38] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1184447 (https://phabricator.wikimedia.org/T350694) (owner: Tiziano Fogli) [14:32:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T407997)', diff saved to https://phabricator.wikimedia.org/P84452 and previous config saved to /var/cache/conftool/dbconfig/20251030-143204-marostegui.json [14:32:10] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [14:32:21] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [14:32:35] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Apply JVM upgrade to 11.0.29 - eevans@cumin1003 [14:33:22] (PS5) Bking: Add OpenSearch cluster configs for net-new clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753) [14:33:31] (PS3) Cathal Mooney: sre.hosts.provision: move the switch config to parent class and run [cookbooks] - https://gerrit.wikimedia.org/r/1198108 (https://phabricator.wikimedia.org/T265342) [14:33:48] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:33:51] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO -
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:34:29] (03PS2) 10Tiziano Fogli: dbctl: enable nrpe2nodexp wrapper on dbctl_uncommitted_diffs [puppet] - 10https://gerrit.wikimedia.org/r/1200074 (https://phabricator.wikimedia.org/T350694) [14:34:29] (03CR) 10Tiziano Fogli: "This change enables the nrpe2nodexp wrapper to export NRPE plugin results to Prometheus via the node exporter." [puppet] - 10https://gerrit.wikimedia.org/r/1200074 (https://phabricator.wikimedia.org/T350694) (owner: 10Tiziano Fogli) [14:35:56] (03PS1) 10Andrew Bogott: cloud-vps pdns: Don't install (or use) default-mysql-client [puppet] - 10https://gerrit.wikimedia.org/r/1200080 [14:35:57] (03CR) 10Elukey: [C:03+2] team-sre: set only critical alerts for mirrors [alerts] - 10https://gerrit.wikimedia.org/r/1200068 (owner: 10Elukey) [14:36:06] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission es2026 - https://phabricator.wikimedia.org/T408385#11327475 (10Jhancock.wm) 05Open→03Resolved [14:36:18] (03PS2) 10Andrew Bogott: cloud-vps pdns: Don't install (or use) default-mysql-client [puppet] - 10https://gerrit.wikimedia.org/r/1200080 [14:36:22] (03CR) 10Brouberol: [V:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199813 (https://phabricator.wikimedia.org/T408706) (owner: 10Brouberol) [14:36:54] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1200080 (owner: 10Andrew Bogott) [14:37:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dborch1001.wikimedia.org [14:39:07] (03CR) 10CI reject: [V:04-1] cloud-vps pdns: Don't install (or use) default-mysql-client [puppet] - 10https://gerrit.wikimedia.org/r/1200080 (owner: 10Andrew Bogott) [14:39:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2195 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84453 and previous config saved to /var/cache/conftool/dbconfig/20251030-143934-root.json 
[14:39:35] !log fnegri@cumin1003 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database pcmwikiquote (T408354) [14:39:38] (03CR) 10Brouberol: [V:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199813 (https://phabricator.wikimedia.org/T408706) (owner: 10Brouberol) [14:39:43] T408354: [wikireplicas] Create views for new wiki pcmwikiquote - https://phabricator.wikimedia.org/T408354 [14:41:11] (03PS3) 10Andrew Bogott: cloud-vps pdns: Don't install (or use) default-mysql-client [puppet] - 10https://gerrit.wikimedia.org/r/1200080 [14:42:06] (03PS3) 10Vgutierrez: haproxy,varnish: Report X-Is-Browser back from varnish [puppet] - 10https://gerrit.wikimedia.org/r/1199792 (https://phabricator.wikimedia.org/T398161) [14:44:01] (03CR) 10Muehlenhoff: "This runs on the Cumin hosts, which are shared infrastructure, but the dbctl infrastructure is used by the DBAs, so I'll add them as revie" [puppet] - 10https://gerrit.wikimedia.org/r/1200074 (https://phabricator.wikimedia.org/T350694) (owner: 10Tiziano Fogli) [14:44:42] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1200080 (owner: 10Andrew Bogott) [14:44:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance [14:44:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1172 (T407997)', diff saved to https://phabricator.wikimedia.org/P84454 and previous config saved to /var/cache/conftool/dbconfig/20251030-144452-marostegui.json [14:44:59] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [14:47:55] (03PS4) 10Andrew Bogott: cloud-vps pdns: Don't install (or use) default-mysql-client [puppet] - 10https://gerrit.wikimedia.org/r/1200080 [14:47:59] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1200080 (owner: 10Andrew 
Bogott) [14:48:26] (03CR) 10Marostegui: [C:03+1] "This is fine, and pretty easy to generate an alert (it only goes to irc) so we can see how it works. Let me know if you want me to do so" [puppet] - 10https://gerrit.wikimedia.org/r/1200074 (https://phabricator.wikimedia.org/T350694) (owner: 10Tiziano Fogli) [14:50:03] (03CR) 10Ladsgroup: [C:03+1] dbctl: enable nrpe2nodexp wrapper on dbctl_uncommitted_diffs [puppet] - 10https://gerrit.wikimedia.org/r/1200074 (https://phabricator.wikimedia.org/T350694) (owner: 10Tiziano Fogli) [14:50:56] (03CR) 10Vgutierrez: [V:03+2] "varnishtests are happy for both text and upload" [puppet] - 10https://gerrit.wikimedia.org/r/1199792 (https://phabricator.wikimedia.org/T398161) (owner: 10Vgutierrez) [14:51:18] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps pdns: Don't install (or use) default-mysql-client [puppet] - 10https://gerrit.wikimedia.org/r/1200080 (owner: 10Andrew Bogott) [14:51:31] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:51:40] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:51:49] (03CR) 10CDanis: [C:03+1] haproxy,varnish: Report X-Is-Browser back from varnish [puppet] - 10https://gerrit.wikimedia.org/r/1199792 (https://phabricator.wikimedia.org/T398161) (owner: 10Vgutierrez) [14:53:11] 10SRE-swift-storage, 06Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11327643 (10elukey) After a chat with Jesse it may be possible that this bug is a variant of what we have been chasing in T381919. Th... 
[14:54:24] (03CR) 10Bking: [C:03+2] Add an opensearch-test-codfw namespace [puppet] - 10https://gerrit.wikimedia.org/r/1200010 (https://phabricator.wikimedia.org/T408779) (owner: 10Stevemunene) [14:54:24] (03CR) 10Elukey: [C:03+1] sre.hosts.provision: move the switch config to parent class and run (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1198108 (https://phabricator.wikimedia.org/T265342) (owner: 10Cathal Mooney) [14:54:40] (03PS1) 10STran: Deploy temporary accounts to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200083 [14:54:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2195 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84455 and previous config saved to /var/cache/conftool/dbconfig/20251030-145440-root.json [14:55:11] (03PS1) 10CDanis: benthos webrequest: x-is-browser [puppet] - 10https://gerrit.wikimedia.org/r/1200084 [14:55:34] (03PS3) 10Gehel: WDQS: remove ferm rule for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/1199849 (https://phabricator.wikimedia.org/T408736) [14:55:36] (03PS1) 10Arlolra: Turn off GeoCrumbsUseParserOutputFallback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200085 (https://phabricator.wikimedia.org/T390236) [14:55:52] (03CR) 10Elukey: "I am totally fine with this, do we know what monitors will be re-enabled? 
Just to be sure and avoid noise :)" [puppet] - 10https://gerrit.wikimedia.org/r/1200030 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:57:00] (03PS1) 10Marostegui: db2170: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1200086 (https://phabricator.wikimedia.org/T407463) [14:57:30] (03CR) 10Marostegui: [C:03+2] db2170: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1200086 (https://phabricator.wikimedia.org/T407463) (owner: 10Marostegui) [14:57:59] (03CR) 10Brouberol: [V:03+1 C:03+2] global_config: stop relying on DNS to translate FQDNs into IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/1199813 (https://phabricator.wikimedia.org/T408706) (owner: 10Brouberol) [14:58:15] (03CR) 10Gehel: [C:03+2] WDQS: remove ferm rule for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/1199849 (https://phabricator.wikimedia.org/T408736) (owner: 10Gehel) [14:58:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2170.codfw.wmnet with reason: Maintenance [14:58:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2170 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84456 and previous config saved to /var/cache/conftool/dbconfig/20251030-145831-marostegui.json [14:58:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T407997)', diff saved to https://phabricator.wikimedia.org/P84457 and previous config saved to /var/cache/conftool/dbconfig/20251030-145857-marostegui.json [14:59:03] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [15:00:05] dduvall and dancy: OwO what's this, a deployment window?? Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1500). 
nyaa~ [15:00:05] marostegui: I think you have a pending puppet merge on puppetmaster, Feel free to merge mine at the same time (a removal of a ferm rule) [15:00:16] gehel: No, I don't [15:00:22] gehel: I think it is brouberol [15:00:23] gehel: ok to merge the ferm rule change for wdqs? [15:00:27] There we go! [15:00:27] sure [15:00:31] (it's both of us) [15:01:00] the mariadb change has now disappeared! [15:01:08] gehel: I think I was already pushing when you pinged me [15:01:13] But your change wasn't there when I pushed :) [15:02:06] everything is now merged! [15:02:44] yay! [15:03:12] (CR) CDanis: [C:+1] benthos::webrequest: Provide X-Is-Browser data [puppet] - https://gerrit.wikimedia.org/r/1199781 (owner: Vgutierrez) [15:04:02] (CR) Vgutierrez: [V:+2 C:+2] haproxy,varnish: Report X-Is-Browser back from varnish [puppet] - https://gerrit.wikimedia.org/r/1199792 (https://phabricator.wikimedia.org/T398161) (owner: Vgutierrez) [15:04:05] (Abandoned) CDanis: benthos webrequest: x-is-browser [puppet] - https://gerrit.wikimedia.org/r/1200084 (owner: CDanis) [15:05:09] (PS1) Tiziano Fogli: dotls: enable nrpe2nodexp wrapper on check_dotls [puppet] - https://gerrit.wikimedia.org/r/1200088 (https://phabricator.wikimedia.org/T384425) [15:05:09] (CR) Tiziano Fogli: "This change enables the nrpe2nodexp wrapper to export NRPE plugin results to Prometheus via the node exporter."
[puppet] - 10https://gerrit.wikimedia.org/r/1200088 (https://phabricator.wikimedia.org/T384425) (owner: 10Tiziano Fogli) [15:05:54] (03CR) 10Tiziano Fogli: [C:03+2] base: remove check_microcode (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1184447 (https://phabricator.wikimedia.org/T350694) (owner: 10Tiziano Fogli) [15:06:00] (03CR) 10Tiziano Fogli: [C:03+2] dbctl: enable nrpe2nodexp wrapper on dbctl_uncommitted_diffs [puppet] - 10https://gerrit.wikimedia.org/r/1200074 (https://phabricator.wikimedia.org/T350694) (owner: 10Tiziano Fogli) [15:06:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2170 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84458 and previous config saved to /var/cache/conftool/dbconfig/20251030-150636-root.json [15:07:04] (03PS1) 10Daniel Kinzler: Note that per-route rate limits require Envoy 1.33 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200090 [15:08:39] (03CR) 10C. Scott Ananian: [C:03+1] Turn off GeoCrumbsUseParserOutputFallback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200085 (https://phabricator.wikimedia.org/T390236) (owner: 10Arlolra) [15:08:51] FIRING: [4x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:09:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2195 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84459 and previous config saved to /var/cache/conftool/dbconfig/20251030-150946-root.json [15:14:06] !log marostegui@cumin1003 dbctl commit 
(dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P84460 and previous config saved to /var/cache/conftool/dbconfig/20251030-151405-marostegui.json [15:14:34] (03CR) 10Muehlenhoff: "AFAICT currently no alerts would be issued for all the common base alerts (disk space, host down etc) and also not for the OSM replication" [puppet] - 10https://gerrit.wikimedia.org/r/1200030 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:16:22] (03CR) 10Vgutierrez: [C:03+2] benthos::webrequest: Provide X-Is-Browser data [puppet] - 10https://gerrit.wikimedia.org/r/1199781 (owner: 10Vgutierrez) [15:16:59] (03Abandoned) 10Andrew Bogott: dbutils::statement: add option to --skip-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1199850 (owner: 10Andrew Bogott) [15:17:04] (03Abandoned) 10Andrew Bogott: pdns_server::db_backups: --skip-ssl for db setup commands [puppet] - 10https://gerrit.wikimedia.org/r/1199851 (owner: 10Andrew Bogott) [15:18:51] FIRING: [8x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:21:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2170 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84461 and previous config saved to /var/cache/conftool/dbconfig/20251030-152141-root.json [15:24:07] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:24:19] !log cmooney@cumin1003 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:24:48] (03CR) 10Cathal Mooney: "Re-tested with test-cookbook and working as 
expected. Thanks for the review!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1198108 (https://phabricator.wikimedia.org/T265342) (owner: 10Cathal Mooney) [15:24:50] (03CR) 10Cathal Mooney: [C:03+2] sre.hosts.provision: move the switch config to parent class and run [cookbooks] - 10https://gerrit.wikimedia.org/r/1198108 (https://phabricator.wikimedia.org/T265342) (owner: 10Cathal Mooney) [15:27:09] (03PS1) 10Bking: opensearch-cluster: temporarily remove prometheus-related annotations from chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200093 (https://phabricator.wikimedia.org/T362114) [15:28:08] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11327839 (10Jhancock.wm) 05In progress→03Resolved a:03Jhancock.wm [15:29:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P84462 and previous config saved to /var/cache/conftool/dbconfig/20251030-152913-marostegui.json [15:30:35] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11327853 (10VRiley-WMF) Of course, I'm looking into this now. 
[15:31:15] (Merged) jenkins-bot: sre.hosts.provision: move the switch config to parent class and run [cookbooks] - https://gerrit.wikimedia.org/r/1198108 (https://phabricator.wikimedia.org/T265342) (owner: Cathal Mooney) [15:31:20] !log dancy@deploy2002 Installing scap version "4.221.0" for 165 host(s) [15:32:29] !log installing openjdk-21 security updates [15:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:20] (CR) Elukey: [C:+1] "Let's try :)" [puppet] - https://gerrit.wikimedia.org/r/1200030 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [15:33:51] FIRING: [4x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:23] (CR) Bking: [C:+2] opensearch-cluster: temporarily remove prometheus-related annotations from chart [deployment-charts] - https://gerrit.wikimedia.org/r/1200093 (https://phabricator.wikimedia.org/T362114) (owner: Bking) [15:34:31] !log installing imagemagick security updates [15:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:09] !log dancy@deploy2002 Installation of scap version "4.221.0" completed for 165 hosts [15:36:24] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:36:34] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:36:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2170 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84463 and previous config saved to /var/cache/conftool/dbconfig/20251030-153647-root.json [15:37:10] (PS6) Bking: Add OpenSearch cluster configs for net-new clusters [deployment-charts] -
10https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753) [15:37:28] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:37:36] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:38:51] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:39:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:44:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T407997)', diff saved to https://phabricator.wikimedia.org/P84464 and previous config saved to /var/cache/conftool/dbconfig/20251030-154420-marostegui.json [15:44:27] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [15:44:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1177.eqiad.wmnet with reason: Maintenance [15:44:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1177 (T407997)', diff saved to https://phabricator.wikimedia.org/P84465 and previous config saved to /var/cache/conftool/dbconfig/20251030-154434-marostegui.json [15:50:40] (03CR) 10Brouberol: [C:03+1] Add OpenSearch cluster configs for net-new clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753) (owner: 10Bking) [15:51:20] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply 
[15:51:37] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:51:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2170 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84466 and previous config saved to /var/cache/conftool/dbconfig/20251030-155153-root.json [15:52:18] (03CR) 10Brouberol: [C:04-1] "Don't merge yet:" [puppet] - 10https://gerrit.wikimedia.org/r/1200034 (https://phabricator.wikimedia.org/T403955) (owner: 10Stevemunene) [15:57:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T407997)', diff saved to https://phabricator.wikimedia.org/P84467 and previous config saved to /var/cache/conftool/dbconfig/20251030-155758-marostegui.json [15:58:04] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [16:00:05] jhathaway and moritzm: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:02:53] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-codfw: Apply JVM upgrade to 11.0.29 - eevans@cumin1003 [16:04:10] (03CR) 10Aaron Schulz: Route "/api/rest_v1/" requests with "?spec" query to the rest gateway (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199886 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [16:06:03] (03CR) 10Aaron Schulz: Route "/api/rest_v1/" requests with "?spec" query to the rest gateway (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199886 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [16:06:31] (03PS1) 10Andrew Bogott: labsaliaser: include python3-keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/1200095 [16:09:12] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:10:05] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:12:03] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:12:20] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [16:12:35] (03PS1) 10Andrew Bogott: cloudservices: include openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/1200096 [16:12:52] (03Abandoned) 10Andrew Bogott: labsaliaser: include python3-keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/1200095 (owner: 10Andrew Bogott) [16:12:56] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1200096 (owner: 10Andrew Bogott) [16:13:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P84468 and previous config saved to /var/cache/conftool/dbconfig/20251030-161306-marostegui.json [16:13:32] 06SRE, 10SRE-Access-Requests: Add dpogorzelski 
to ML and Data Platform posix groups - https://phabricator.wikimedia.org/T408579#11328180 (Dzahn) >>! In T408579#11326218, @elukey wrote: > Hi Daniel! I think full access since the kerberos identity was requested :) I think there is still a misunderstanding her... [16:16:34] (CR) Andrew Bogott: [C:+2] cloudservices: include openstack client packages [puppet] - https://gerrit.wikimedia.org/r/1200096 (owner: Andrew Bogott) [16:16:48] !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2248-2267].codfw.wmnet [16:16:56] !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2248-2267].codfw.wmnet [16:19:20] !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2003-2004,2007-2010,2019-2032].codfw.wmnet [16:21:36] SRE, SRE-Access-Requests: Add dpogorzelski to ML and Data Platform posix groups - https://phabricator.wikimedia.org/T408579#11328259 (elukey) My understanding is that asking a kerberos identity implies https://wikitech.wikimedia.org/wiki/Data_Platform/Data_access#All_of_the_above, what is the misundersta... [16:22:43] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: Apply JVM upgrade to 11.0.29 - eevans@cumin1003 [16:24:18] (PS1) Andrew Bogott: pdns_server: rename 'master' to 'primary' [puppet] - https://gerrit.wikimedia.org/r/1200097 [16:25:38] (PS2) Andrew Bogott: pdns_server: rename 'master' to 'primary' [puppet] - https://gerrit.wikimedia.org/r/1200097 [16:25:45] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1200097 (owner: Andrew Bogott) [16:26:29] SRE, SRE-Access-Requests, Data-Engineering: Grant Access to analytics-privatedata-users for mvernon - https://phabricator.wikimedia.org/T408793#11328329 (KOfori) This has my approval.
[16:28:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P84469 and previous config saved to /var/cache/conftool/dbconfig/20251030-162814-marostegui.json [16:31:39] (03PS1) 10Ottomata: page-analytics - bump image to get pageviews/v3/per_editor [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200099 (https://phabricator.wikimedia.org/T405041) [16:32:38] !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2003-2004,2007-2010,2019-2032].codfw.wmnet [16:33:03] (03CR) 10BryanDavis: "Andrew added the prometheus logging in I0443357a7e2abb5b48ea6d2f78053078dc3f68c8" [puppet] - 10https://gerrit.wikimedia.org/r/1199305 (https://phabricator.wikimedia.org/T408457) (owner: 10Majavah) [16:33:18] (03CR) 10Ottomata: [C:03+2] page-analytics - bump image to get pageviews/v3/per_editor [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200099 (https://phabricator.wikimedia.org/T405041) (owner: 10Ottomata) [16:33:51] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:14] (03PS3) 10Aaron Schulz: Route "/api/rest_v1/" requests with "?spec" query to the rest gateway [puppet] - 10https://gerrit.wikimedia.org/r/1199886 (https://phabricator.wikimedia.org/T397203) [16:34:48] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11328383 (10Jhancock.wm) i ran `reset /system1/pwrmgtsvc1` with a physical console up to observe. it didn't reboot for me. i powered it down manually and checked the insides aga... 
[16:35:17] (03Merged) 10jenkins-bot: page-analytics - bump image to get pageviews/v3/per_editor [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200099 (https://phabricator.wikimedia.org/T405041) (owner: 10Ottomata) [16:35:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:39:09] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply [16:39:31] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [16:40:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:42:39] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply [16:42:45] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [16:42:50] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/page-analytics: apply [16:43:06] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply [16:43:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T407997)', diff saved to https://phabricator.wikimedia.org/P84470 and previous config saved to /var/cache/conftool/dbconfig/20251030-164322-marostegui.json [16:43:27] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [16:43:39] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 12:00:00 on db1178.eqiad.wmnet with reason: Maintenance [16:43:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1178 (T407997)', diff saved to https://phabricator.wikimedia.org/P84471 and previous config saved to /var/cache/conftool/dbconfig/20251030-164346-marostegui.json [16:43:56] !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2268-2287].codfw.wmnet [16:44:05] !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2268-2287].codfw.wmnet [16:44:31] (03CR) 10Brouberol: [C:04-1] "`" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753) (owner: 10Bking) [16:44:57] !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2040,2043,2045,2048,2052-2054,2063,2079-2084,2096-2101].codfw.wmnet [16:45:07] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [16:45:20] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [16:45:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [16:45:40] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [16:48:51] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:49:53] (03PS1) 10Cparle: Enable pagination on Special:Watchlist everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200105 (https://phabricator.wikimedia.org/T41510) [16:51:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 03 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200105 (https://phabricator.wikimedia.org/T41510) (owner: 10Cparle) [16:54:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker11XX - https://phabricator.wikimedia.org/T408749#11328564 (10Clement_Goubert) [16:57:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T407997)', diff saved to https://phabricator.wikimedia.org/P84472 and previous config saved to /var/cache/conftool/dbconfig/20251030-165710-marostegui.json [16:57:17] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [16:58:25] 10ops-eqiad, 06SRE, 06DC-Ops, 10procurement, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11328592 (10Clement_Goubert) [16:58:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11328593 (10Clement_Goubert) [16:59:28] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [16:59:33] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [16:59:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200085 (https://phabricator.wikimedia.org/T390236) (owner: 10Arlolra) [17:00:05] bd808: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1700). 
[17:00:05] swfrench-wmf: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1700). [17:00:11] o/ [17:00:38] !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2040,2043,2045,2048,2052-2054,2063,2079-2084,2096-2101].codfw.wmnet [17:00:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:01:33] !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2288-2299].codfw.wmnet [17:01:39] !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2288-2299].codfw.wmnet [17:02:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [17:02:22] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [17:02:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199836 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:02:27] !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2230-2241].codfw.wmnet [17:02:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2028:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2028 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:03:15] (03Merged) 
10jenkins-bot: Enroll 50% of client sessions in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199836 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:03:48] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1199836|Enroll 50% of client sessions in PHP 8.3 (T405955)]] [17:03:52] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [17:05:57] nothing for my deploy window this week. The only changes in developer-portal were translation file noise from a MediaWiki major version bump at TWN. [17:07:25] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1031.eqiad.wmnet - https://phabricator.wikimedia.org/T408600#11328646 (10VRiley-WMF) [17:07:37] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1031.eqiad.wmnet - https://phabricator.wikimedia.org/T408600#11328647 (10VRiley-WMF) 05Open→03Resolved This is completed. [17:08:38] !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1199836|Enroll 50% of client sessions in PHP 8.3 (T405955)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[17:08:51] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:09:22] !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2230-2241].codfw.wmnet [17:10:22] (03PS1) 10Kosta Harlan: EventBus: Enable TYPE_EVENT for loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200111 (https://phabricator.wikimedia.org/T408701) [17:11:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200111 (https://phabricator.wikimedia.org/T408701) (owner: 10Kosta Harlan) [17:11:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [17:11:55] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [17:12:01] !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2300-2319].codfw.wmnet [17:12:01] (03PS3) 10Dzahn: aptrepo::staging: add job to clear incoming folder [puppet] - 10https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527) (owner: 10Jelto) [17:12:08] (03CR) 10Dzahn: aptrepo::staging: add job to clear incoming folder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527) (owner: 10Jelto) [17:12:09] !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2300-2319].codfw.wmnet [17:12:13] !log swfrench@deploy2002 swfrench: Continuing with sync [17:12:13] jouncebot: now [17:12:13] For the next 0 hour(s) and 47 minute(s): Cloud 
Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1700) [17:12:13] For the next 0 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1700) [17:12:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P84473 and previous config saved to /var/cache/conftool/dbconfig/20251030-171218-marostegui.json [17:12:46] !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2320-2330].codfw.wmnet [17:12:51] !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2320-2330].codfw.wmnet [17:13:22] dduvall: I have a deployment in flight, then I'll need to do some manual capacity tuning on two mediawiki services, but there should be some time afterward left in the window if you need it [17:13:51] RESOLVED: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:14:03] swfrench-wmf: perfect, thanks. 
group1 was rolled back yesterday so i'm hoping to get it back out a little early before rolling to all wikis [17:14:03] (03PS1) 10Jcrespo: Fix unit tests that had been broken (but only were detected on trixie) [software/transferpy] - 10https://gerrit.wikimedia.org/r/1200112 [17:14:21] !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2116-2123,2216-2230].codfw.wmnet [17:15:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:16:53] 10ops-eqiad, 06SRE, 06DC-Ops: Audit Eqiad Patch panels for variance from Netbox - https://phabricator.wikimedia.org/T408197#11328747 (10VRiley-WMF) PP:0000:103234 - Has no additional interconnects PP:000:1268259 - Has 23324916, 23324917, 23324918 and 23324919 [17:17:00] dduvall: sounds good. 
I'll let you know when I'm done [17:20:32] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199836|Enroll 50% of client sessions in PHP 8.3 (T405955)]] (duration: 16m 44s) [17:20:38] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [17:23:08] (03CR) 10CDanis: add discovery records for gerrit as CNAMEs to public names (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/1199486 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [17:24:46] (03CR) 10Scott French: [C:03+2] mw-(api-int|jobrunner): serve 25% of traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199837 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:25:04] (03PS3) 10Dzahn: add discovery records for gerrit as CNAMEs to public names [dns] - 10https://gerrit.wikimedia.org/r/1199486 (https://phabricator.wikimedia.org/T365259) [17:25:10] (03CR) 10Dzahn: add discovery records for gerrit as CNAMEs to public names (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/1199486 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [17:26:28] (03Merged) 10jenkins-bot: mw-(api-int|jobrunner): serve 25% of traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199837 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:27:20] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527) (owner: 10Jelto) [17:27:24] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [17:27:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P84474 and previous config saved to /var/cache/conftool/dbconfig/20251030-172726-marostegui.json [17:27:30] !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool 
for host wikikube-worker[2116-2123,2216-2230].codfw.wmnet [17:27:31] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [17:27:44] (03PS1) 10Clément Goubert: site.pp: Add new wikikube insetup hosts [puppet] - 10https://gerrit.wikimedia.org/r/1200116 (https://phabricator.wikimedia.org/T408749) [17:29:10] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [17:29:25] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [17:29:32] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:29:48] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:30:08] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [17:30:18] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [17:30:26] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:30:37] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:30:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:32:23] (03CR) 10Vgutierrez: [C:03+1] Route "/api/rest_v1/" requests with "?spec" query to the rest gateway [puppet] - 10https://gerrit.wikimedia.org/r/1199886 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [17:35:01] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [17:35:12] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [17:35:17] !log swfrench@deploy2002 helmfile [codfw] START 
helmfile.d/services/mw-api-int: apply [17:35:29] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:35:52] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [17:35:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:36:00] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [17:36:07] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:36:08] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:36:14] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:36:30] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:37:40] 06SRE, 06Infrastructure-Foundations: megacli issues on Debian Trixie - https://phabricator.wikimedia.org/T408776#11328850 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff I grabbed the megacli "source package" from the http://hwraid.le-vert.net/debian (written in brackets since it doesn'... 
[17:39:40] (03CR) 10Aaron Schulz: Route "/api/rest_v1/" requests with "?spec" query to the rest gateway (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199886 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [17:40:45] (03CR) 10Dzahn: [C:03+2] aptrepo::staging: add job to clear incoming folder [puppet] - 10https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527) (owner: 10Jelto) [17:41:04] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: Key packages missing from trixie-wikimedia - https://phabricator.wikimedia.org/T407513#11328866 (10MoritzMuehlenhoff) megacli is not available, all the details in T407513. I've also imported the prometheus-statds-exporter to trixie-wikimedia, so once... [17:42:02] (03PS1) 10Kosta Harlan: hCaptcha: Enable 100% passive mode for edits on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200120 (https://phabricator.wikimedia.org/T405586) [17:42:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T407997)', diff saved to https://phabricator.wikimedia.org/P84475 and previous config saved to /var/cache/conftool/dbconfig/20251030-174233-marostegui.json [17:42:41] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [17:42:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1192.eqiad.wmnet with reason: Maintenance [17:42:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1192 (T407997)', diff saved to https://phabricator.wikimedia.org/P84476 and previous config saved to /var/cache/conftool/dbconfig/20251030-174257-marostegui.json [17:43:10] dduvall: I believe I'm done. all yours! [17:43:23] swfrench-wmf: ty! 
[17:45:56] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200121 (https://phabricator.wikimedia.org/T405681) [17:45:58] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200121 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) [17:46:50] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Apply JVM upgrade to 11.0.29 - eevans@cumin1003 [17:47:11] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200121 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) [17:48:51] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:52:19] (03CR) 10Samuel (WMF): [C:03+1] hCaptcha: Enable 100% passive mode for edits on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200120 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [17:53:48] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.25 refs T405681 [17:53:53] T405681: 1.45.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T405681 [17:56:02] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [17:56:08] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [17:56:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T407997)', diff saved to https://phabricator.wikimedia.org/P84477 and previous config saved to /var/cache/conftool/dbconfig/20251030-175611-marostegui.json 
[17:56:19] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [17:57:58] 10ops-eqiad, 06SRE, 06DC-Ops: Audit Eqiad Patch panels for variance from Netbox - https://phabricator.wikimedia.org/T408197#11328938 (10Jclark-ctr) https://netbox.wikimedia.org/circuits/circuit-terminations/?site_id=6&sort=circuit We should go through each of these and verify the connections. @VRiley-WMF Th... [17:58:40] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200123 (https://phabricator.wikimedia.org/T405681) [17:58:42] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200123 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) [17:59:36] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200123 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) [18:00:05] dduvall and dancy: OwO what's this, a deployment window?? MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1800). 
nyaa~ [18:05:05] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2028.codfw.wmnet with OS trixie [18:06:26] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.25 refs T405681 [18:06:31] T405681: 1.45.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T405681 [18:08:51] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:09:35] !log rolling back group2 from 1.45.0-wmf.25 to wmf.24 due to high rate of `PHP Deprecated: Asking for a replica from groups except dump/vslow is deprecated` errors [18:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:52] !log rolling back group2 from 1.45.0-wmf.25 to wmf.24 due to high rate of `PHP Deprecated: Asking for a replica from groups except dump/vslow is deprecated` errors (T405681) [18:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:06] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200126 (https://phabricator.wikimedia.org/T405681) [18:10:09] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200126 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) [18:11:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P84478 and previous config saved to /var/cache/conftool/dbconfig/20251030-181121-marostegui.json [18:11:26] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200126 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) 
[18:14:52] (03PS2) 10BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1196777 (owner: 10Ncmonitor) [18:15:11] (03PS1) 10Dzahn: admin: create user for Sherry Yang, no ssh key but analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/1200128 (https://phabricator.wikimedia.org/T408639) [18:15:27] (03CR) 10CI reject: [V:04-1] admin: create user for Sherry Yang, no ssh key but analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/1200128 (https://phabricator.wikimedia.org/T408639) (owner: 10Dzahn) [18:15:53] (03CR) 10BCornwall: [C:03+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1196776 (owner: 10Ncmonitor) [18:16:45] (03PS3) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1196776 (owner: 10Ncmonitor) [18:17:09] (03PS2) 10Dzahn: admin: create user for Sherry Yang, no ssh key but analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/1200128 (https://phabricator.wikimedia.org/T408639) [18:17:55] (03CR) 10CI reject: [V:04-1] admin: create user for Sherry Yang, no ssh key but analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/1200128 (https://phabricator.wikimedia.org/T408639) (owner: 10Dzahn) [18:18:07] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.25 refs T405681 [18:18:12] T405681: 1.45.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T405681 [18:18:20] (03PS4) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1196776 (owner: 10Ncmonitor) [18:18:45] (03CR) 10BCornwall: [C:03+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1196776 (owner: 10Ncmonitor) [18:18:47] (03CR) 10BCornwall: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 
10https://gerrit.wikimedia.org/r/1196777 (owner: 10Ncmonitor) [18:18:51] (03CR) 10BCornwall: [V:03+2 C:03+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1196776 (owner: 10Ncmonitor) [18:19:42] (03PS3) 10Dzahn: admin: upgrade user for Sherry Yang, no ssh key but analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/1200128 (https://phabricator.wikimedia.org/T408639) [18:22:39] (03PS2) 10Andrea Denisse: alertmanager: Add dashboard and runbook for Slack alerts [puppet] - 10https://gerrit.wikimedia.org/r/1200124 (https://phabricator.wikimedia.org/T408145) [18:22:39] (03CR) 10Andrea Denisse: "Hi folks, I tested this in Pontoon and I sent an alert to the #engineering-all channel." [puppet] - 10https://gerrit.wikimedia.org/r/1200124 (https://phabricator.wikimedia.org/T408145) (owner: 10Andrea Denisse) [18:23:54] (03PS1) 10Zabe: Do not use special db group [extensions/Flow] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1200132 (https://phabricator.wikimedia.org/T408540) [18:26:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P84479 and previous config saved to /var/cache/conftool/dbconfig/20251030-182629-marostegui.json [18:28:16] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf LDAP and analytics-privatedata-users shell group for SherryYang-WMF - https://phabricator.wikimedia.org/T408639#11329123 (10Dzahn) [18:28:26] FIRING: ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:28:35] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Add dpogorzelski to ML and Data Platform posix groups - 
https://phabricator.wikimedia.org/T408579#11329137 (10Dzahn) [18:32:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2028:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2028 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:33:26] RESOLVED: ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:34:00] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:34:07] (03PS1) 10Bking: opensearch-cluster: stop hard-coding admin username [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200135 (https://phabricator.wikimedia.org/T408012) [18:35:21] (03PS6) 10Xcollazo: dumps: Release the new MW Content File Export. Deprecate legacy XML dumps. [puppet] - 10https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) [18:35:21] (03CR) 10Xcollazo: "@joal@wikimedia.org, @brouberol@wikimedia.org, for your review." [puppet] - 10https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) (owner: 10Xcollazo) [18:35:38] zabe: thanks for triaging/fixing. would it make sense to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Flow/+/1200132 now or should i wait for a backport of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/1200133 ? [18:36:32] It is a deprecation warning. 
The fix can be backported to reduce log spam, but imo this doesn't have to block the train [18:36:54] (03PS1) 10Zabe: Do not use special db group [extensions/FlaggedRevs] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1200139 (https://phabricator.wikimedia.org/T408540) [18:37:20] we typically block on egregious amounts of logspam, but yeah, not always [18:37:33] Amir +2'ed the other patch [18:37:39] so we can backport both [18:37:49] alright. i'll do both at the same time then. ty! [18:38:35] (03PS2) 10Bking: opensearch-cluster: stop hard-coding admin username [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200135 (https://phabricator.wikimedia.org/T408012) [18:39:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dduvall@deploy2002 using scap backport" [extensions/FlaggedRevs] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1200139 (https://phabricator.wikimedia.org/T408540) (owner: 10Zabe) [18:40:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dduvall@deploy2002 using scap backport" [extensions/Flow] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1200132 (https://phabricator.wikimedia.org/T408540) (owner: 10Zabe) [18:41:07] (03Merged) 10jenkins-bot: Do not use special db group [extensions/FlaggedRevs] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1200139 (https://phabricator.wikimedia.org/T408540) (owner: 10Zabe) [18:41:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T407997)', diff saved to https://phabricator.wikimedia.org/P84480 and previous config saved to /var/cache/conftool/dbconfig/20251030-184136-marostegui.json [18:41:42] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [18:41:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1203.eqiad.wmnet with reason: Maintenance [18:42:01] !log marostegui@cumin1003 dbctl commit (dc=all): 
'Depooling db1203 (T407997)', diff saved to https://phabricator.wikimedia.org/P84481 and previous config saved to /var/cache/conftool/dbconfig/20251030-184200-marostegui.json [18:43:51] RESOLVED: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:49:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:49:50] (03CR) 10Bking: [C:03+2] opensearch-cluster: stop hard-coding admin username [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200135 (https://phabricator.wikimedia.org/T408012) (owner: 10Bking) [18:50:01] (03Merged) 10jenkins-bot: Do not use special db group [extensions/Flow] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1200132 (https://phabricator.wikimedia.org/T408540) (owner: 10Zabe) [18:50:40] !log dduvall@deploy2002 Started scap sync-world: Backport for [[gerrit:1200139|Do not use special db group (T408540)]], [[gerrit:1200132|Do not use special db group (T408540)]] [18:50:45] T408540: PHP Deprecated: Asking for a replica from groups except dump/vslow is deprecated: watchlist [Called from Wikimedia\Rdbms\LoadBalancer::getConnectionInternal] - https://phabricator.wikimedia.org/T408540 [18:52:16] (03PS7) 10Bking: Add OpenSearch cluster configs for net-new clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753) [18:52:58] !log dduvall@deploy2002 zabe, dduvall: Backport for [[gerrit:1200139|Do not use special db group (T408540)]], [[gerrit:1200132|Do not use special db group (T408540)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[18:53:50] !log dduvall@deploy2002 zabe, dduvall: Continuing with sync [18:54:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:54:47] (03PS1) 10Bking: dse-k8s: Create CNAME record for opensearch-ipoid-test [dns] - 10https://gerrit.wikimedia.org/r/1200145 (https://phabricator.wikimedia.org/T408012) [18:55:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T407997)', diff saved to https://phabricator.wikimedia.org/P84482 and previous config saved to /var/cache/conftool/dbconfig/20251030-185518-marostegui.json [18:55:24] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [18:57:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200120 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [18:58:04] (03CR) 10CDanis: [C:03+1] dse-k8s: Create CNAME record for opensearch-ipoid-test [dns] - 10https://gerrit.wikimedia.org/r/1200145 (https://phabricator.wikimedia.org/T408012) (owner: 10Bking) [18:58:04] !log dduvall@deploy2002 Finished scap sync-world: Backport for [[gerrit:1200139|Do not use special db group (T408540)]], [[gerrit:1200132|Do not use special db group (T408540)]] (duration: 07m 24s) [18:58:10] T408540: PHP Deprecated: Asking for a replica from groups except dump/vslow is deprecated: watchlist [Called from Wikimedia\Rdbms\LoadBalancer::getConnectionInternal] - https://phabricator.wikimedia.org/T408540 [18:58:19] (03PS1) 10Scott French: deployment_server: default to PHP 8.3 in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1200142 
(https://phabricator.wikimedia.org/T405955) [18:59:42] (03CR) 10Dzahn: [C:03+1] dse-k8s: Create CNAME record for opensearch-ipoid-test [dns] - 10https://gerrit.wikimedia.org/r/1200145 (https://phabricator.wikimedia.org/T408012) (owner: 10Bking) [18:59:48] (03CR) 10Bking: [C:03+2] dse-k8s: Create CNAME record for opensearch-ipoid-test [dns] - 10https://gerrit.wikimedia.org/r/1200145 (https://phabricator.wikimedia.org/T408012) (owner: 10Bking) [19:01:08] !log bking@dns1004 START - running authdns-update [19:02:01] !log bking@dns1004 END - running authdns-update [19:02:32] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200153 (https://phabricator.wikimedia.org/T405681) [19:02:34] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200153 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) [19:03:22] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200153 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) [19:10:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P84483 and previous config saved to /var/cache/conftool/dbconfig/20251030-191026-marostegui.json [19:12:25] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.25 refs T405681 [19:12:29] T405681: 1.45.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T405681 [19:13:20] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1001-dev.eqiad.wmnet with OS trixie [19:15:16] (03PS5) 10Func: Revert "Adding Movepage-summary to wgForceUIMsgAsContentMsg to allow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941424 (https://phabricator.wikimedia.org/T183848) [19:15:48] (03CR) 10Bartosz Dziewoński: "I'll schedule this for deployment the 
next time I have something to deploy, if I don't forget." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941424 (https://phabricator.wikimedia.org/T183848) (owner: 10Func) [19:19:00] FIRING: [8x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:24:48] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1200128 (https://phabricator.wikimedia.org/T408639) (owner: 10Dzahn) [19:25:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P84484 and previous config saved to /var/cache/conftool/dbconfig/20251030-192534-marostegui.json [19:27:41] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1001-dev.eqiad.wmnet with reason: host reimage [19:32:31] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1001-dev.eqiad.wmnet with reason: host reimage [19:33:01] (03CR) 10Dzahn: [C:03+2] admin: upgrade user for Sherry Yang, no ssh key but analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/1200128 (https://phabricator.wikimedia.org/T408639) (owner: 10Dzahn) [19:35:55] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf LDAP and analytics-privatedata-users shell group for SherryYang-WMF - https://phabricator.wikimedia.org/T408639#11329331 (10Dzahn) Hello @SherryYang-WMF give it max. 30 minutes from now and y... 
[19:36:51] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf LDAP and analytics-privatedata-users shell group for SherryYang-WMF - https://phabricator.wikimedia.org/T408639#11329332 (10Dzahn) 05Open→03Resolved a:03Dzahn [19:38:51] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:39:12] (03PS1) 10Dzahn: admin: add mvernon to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1200159 (https://phabricator.wikimedia.org/T408793) [19:40:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:40:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T407997)', diff saved to https://phabricator.wikimedia.org/P84485 and previous config saved to /var/cache/conftool/dbconfig/20251030-194041-marostegui.json [19:40:47] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [19:40:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1209.eqiad.wmnet with reason: Maintenance [19:41:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1209 (T407997)', diff saved to https://phabricator.wikimedia.org/P84486 and previous config saved to /var/cache/conftool/dbconfig/20251030-194105-marostegui.json [19:45:16] RESOLVED: 
MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:47:29] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1001-dev.eqiad.wmnet with OS trixie [19:50:20] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1200159 (https://phabricator.wikimedia.org/T408793) (owner: 10Dzahn) [19:51:17] (03CR) 10D3r1ck01: [C:03+2] "starting gate-and-submit ahead of backport window" [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199854 (https://phabricator.wikimedia.org/T406170) (owner: 10D3r1ck01) [19:51:25] (03CR) 10D3r1ck01: [C:03+2] "starting gate-and-submit ahead of backport window" [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199856 (https://phabricator.wikimedia.org/T406170) (owner: 10D3r1ck01) [19:53:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T407997)', diff saved to https://phabricator.wikimedia.org/P84487 and previous config saved to /var/cache/conftool/dbconfig/20251030-195347-marostegui.json [19:53:54] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [19:54:00] (03CR) 10Dzahn: [C:03+2] admin: add mvernon to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1200159 (https://phabricator.wikimedia.org/T408793) (owner: 10Dzahn) [19:56:20] (03CR) 10RLazarus: [C:03+1] deployment_server: default to PHP 8.3 in mwscript-k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1200142 
(https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [19:56:45] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for mvernon - https://phabricator.wikimedia.org/T408793#11329422 (10Dzahn) @MatthewVernon You have been added to the group. Give it the usual couple minutes for puppet to deploy it across the f... [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T2000). [20:00:05] xSavitar, arlolra, and kostajh: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:01:20] o/ [20:01:28] hello [20:02:11] (03PS1) 10Dzahn: admin: add kerberos principal indication to mvernon [puppet] - 10https://gerrit.wikimedia.org/r/1200162 (https://phabricator.wikimedia.org/T408793) [20:02:41] arlolra, do you want to do your config change first while my patches land? I +2'd them 10 mins ahead of the window to save us some time. [20:02:57] sure [20:03:02] I can do mine after yours then kostajh takes it from there. [20:03:06] (03Merged) 10jenkins-bot: Stats: add getLabels() function [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199854 (https://phabricator.wikimedia.org/T406170) (owner: 10D3r1ck01) [20:03:06] arlolra, go for it. 
[20:03:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200085 (https://phabricator.wikimedia.org/T390236) (owner: 10Arlolra) [20:03:46] Hi [20:03:51] kostajh, heh [20:04:05] Sounds good [20:04:10] (03Merged) 10jenkins-bot: Turn off GeoCrumbsUseParserOutputFallback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200085 (https://phabricator.wikimedia.org/T390236) (owner: 10Arlolra) [20:04:28] (03CR) 10Dzahn: [C:03+2] admin: add kerberos principal indication to mvernon [puppet] - 10https://gerrit.wikimedia.org/r/1200162 (https://phabricator.wikimedia.org/T408793) (owner: 10Dzahn) [20:04:54] !log arlolra@deploy2002 Started scap sync-world: Backport for [[gerrit:1200085|Turn off GeoCrumbsUseParserOutputFallback (T390236)]] [20:04:59] T390236: Turn off GeoCrumbsUseParserOutputFallback in production - https://phabricator.wikimedia.org/T390236 [20:06:25] (03Merged) 10jenkins-bot: Stats: have RunningTimer manage the initial label set [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199856 (https://phabricator.wikimedia.org/T406170) (owner: 10D3r1ck01) [20:07:08] !log arlolra@deploy2002 arlolra: Backport for [[gerrit:1200085|Turn off GeoCrumbsUseParserOutputFallback (T390236)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:07:41] xSavitar: hmm, so it seems I'm deploying your changes as well [20:08:06] "The following are unexpected commits pulled from origin for /srv/mediawiki-staging/php-1.45.0-wmf.25" [20:08:07] I'm not sure [20:08:30] Oh! and you didn't supply the gerrit patches? [20:08:39] So scap will auto-detect even when not specified? 
[20:08:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P84488 and previous config saved to /var/cache/conftool/dbconfig/20251030-200854-marostegui.json [20:09:50] arlolra, if it insists, go ahead and deploy both. [20:09:51] I supposed. I clicked to see the diff [20:09:53] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for mvernon - https://phabricator.wikimedia.org/T408793#11329460 (10Dzahn) @MatthewVernon I created the Kerberos principal for you per https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/... [20:10:06] arlolra, what does the diff say? [20:10:14] It showed your patches [20:10:15] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Grant Access to analytics-privatedata-users for mvernon - https://phabricator.wikimedia.org/T408793#11329461 (10Dzahn) 05Open→03Resolved a:03Dzahn [20:10:22] And then there was another prompt [20:10:23] "Continue with deployment (all patches will be deployed)? [y/N]:" [20:10:39] Okay, accept it, and roll it out [20:10:54] You can see the interaction in the log [20:11:07] My patches are about prometheus metrics, shouldn't be too much to worry about [20:11:18] * xSavitar checks... [20:11:22] So nothing for you to check on the testservers? [20:11:56] nothing [20:12:05] I checked the logs and saw `20:07:08 arlolra: Backport for [[gerrit:1200085|Turn off GeoCrumbsUseParserOutputFallback (T390236)]] synced to the testservers` [20:12:05] T390236: Turn off GeoCrumbsUseParserOutputFallback in production - https://phabricator.wikimedia.org/T390236 [20:12:25] Not sure if hidden in that will deploy the others but let's try. [20:12:49] I mean the spiderpig log [20:13:06] I see `Continue with deployment (all patches will be deployed)? [y/N]:` [20:13:08] Yes, that's fine [20:13:19] We can deploy all of them, yes! 
[20:13:38] arlolra, I mean once you're done testing on mwdebug [20:14:02] There is really nothing to test on my side. I can only see once it rolls out if metrics are being logged again [20:14:06] !log arlolra@deploy2002 arlolra: Continuing with sync [20:14:15] s/logged/sent [20:18:20] !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1200085|Turn off GeoCrumbsUseParserOutputFallback (T390236)]] (duration: 13m 26s) [20:18:26] T390236: Turn off GeoCrumbsUseParserOutputFallback in production - https://phabricator.wikimedia.org/T390236 [20:19:35] arlolra, it seems to me like my changes were not actually deployed [20:19:39] FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [20:19:52] So I'd like to actually try to deploy them and see if anything happens. [20:20:09] Maybe scap didn't do what it said it would do? [20:20:19] That's somewhat surprising [20:20:31] Has the train rolled out to group2 yet? [20:20:44] yes per https://versions.toolforge.org/ [20:21:09] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:23] arlolra, should I try? :) [20:21:42] Sure, but it seems like a no-op [20:21:48] Are you sure your patches work? [20:22:12] Yes, I tested them locally [20:22:17] xSavitar: Scap deploys all mediawiki config and code that is merged into a suitable train branch. 
[20:22:28] ^ this [20:22:36] I'm looking at https://grafana-rw.wikimedia.org/d/000000067/resourceloader-module-builds?forceLogin=true&from=now-6M&orgId=1&timezone=utc&to=now&var-module=startup&viewPanel=panel-17 and it's not going up yet [20:22:37] You had merged the patches, that's why they got picked up by arlolra's deploy [20:23:09] Can I proceed with my patches? [20:23:33] kostajh, you can go ahead and I'll wait for a while. [20:23:42] dancy, okay, thanks! [20:24:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P84489 and previous config saved to /var/cache/conftool/dbconfig/20251030-202402-marostegui.json [20:24:39] FIRING: [4x] TransitBGPDown: Transit BGP session down between cr2-eqsin and Hurricane Electric (103.231.152.47) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [20:24:53] (03PS6) 10Dzahn: site/role: create placeholder role/profile for tcpproxy [puppet] - 10https://gerrit.wikimedia.org/r/1198397 (https://phabricator.wikimedia.org/T408532) [20:24:59] (03PS7) 10Dzahn: site/role: create placeholder role/profile for tcpproxy [puppet] - 10https://gerrit.wikimedia.org/r/1198397 (https://phabricator.wikimedia.org/T408532) [20:25:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200111 (https://phabricator.wikimedia.org/T408701) (owner: 10Kosta Harlan) [20:25:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200120 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [20:25:27] dancy, would it be terrible if I try again after Kosta is done just in case? :) [20:25:44] That's fine. [20:25:58] dancy, okay, if I find anything unusual, I'll let you know, okay? 
[20:26:04] (03Merged) 10jenkins-bot: EventBus: Enable TYPE_EVENT for loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200111 (https://phabricator.wikimedia.org/T408701) (owner: 10Kosta Harlan) [20:26:07] (03Merged) 10jenkins-bot: hCaptcha: Enable 100% passive mode for edits on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200120 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [20:26:07] Sounds good. I'll be around. [20:26:11] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 232.15 ms [20:26:13] dancy, thank you [20:26:24] I see one of your patches wasn't synced, xSavitar [20:26:25] (03PS8) 10Dzahn: site/role: create placeholder role/profile for tcpproxy [puppet] - 10https://gerrit.wikimedia.org/r/1198397 (https://phabricator.wikimedia.org/T408532) [20:26:29] xSavitar: https://spiderpig.wikimedia.org/jobs/840 [20:26:46] so https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1199856 will sync out now [20:26:59] interesting, I had a weird feeling, okay [20:27:01] please sync it [20:27:14] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Apply JVM upgrade to 11.0.29 - eevans@cumin1003 [20:27:27] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1200111|EventBus: Enable TYPE_EVENT for loginwiki (T408701)]], [[gerrit:1200120|hCaptcha: Enable 100% passive mode for edits on test2wiki (T405586)]] [20:27:34] T408701: Enable event logging for the mediawiki.product_metrics.suggested_investigations_interaction stream on loginwiki - https://phabricator.wikimedia.org/T408701 [20:27:34] T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586 [20:27:56] xSavitar: IMO in general, it's better not to +2 things ahead of the deployment window, because it makes it more difficult to know what is getting synced out and when [20:28:19] kostajh, yes you're right. 
I was just about to write that to myself [20:28:30] (03PS9) 10Dzahn: site/role: create placeholder role/profile for tcpproxy [puppet] - 10https://gerrit.wikimedia.org/r/1198397 (https://phabricator.wikimedia.org/T408532) [20:28:53] Because now the task doesn't have any trace of it being backported since scap will use the gerrit ID to log actions/activity to the task [20:28:58] * xSavitar notes... [20:29:35] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1200111|EventBus: Enable TYPE_EVENT for loginwiki (T408701)]], [[gerrit:1200120|hCaptcha: Enable 100% passive mode for edits on test2wiki (T405586)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:29:37] you can still use !log manually and if you mention a task number it will add it there [20:30:44] mutante, thanks! Was just wondering: if scap autodetects changes while deploying something else and the deployer agrees to proceed, can those be logged to the autodetected tasks as well, if the gerrit patch references them? [20:32:02] xSavitar: we're on mwdebug now, if you want to verify your change [20:32:20] kostajh, nothing to verify for now. I'm fine [20:32:50] xSavitar: if logmsgbot can be made to say it, it will be logged by stashbot. other than that it sounds like a scap feature request I guess [20:33:35] mutante, ack! I'll file something tomorrow then let the RelEng experts decide if it's a good idea or not. 
[20:33:53] sounds good [20:35:19] !log kharlan@deploy2002 kharlan: Continuing with sync [20:39:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T407997)', diff saved to https://phabricator.wikimedia.org/P84491 and previous config saved to /var/cache/conftool/dbconfig/20251030-203910-marostegui.json [20:39:16] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [20:39:26] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1214.eqiad.wmnet with reason: Maintenance [20:39:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1214 (T407997)', diff saved to https://phabricator.wikimedia.org/P84492 and previous config saved to /var/cache/conftool/dbconfig/20251030-203933-marostegui.json [20:39:35] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1200111|EventBus: Enable TYPE_EVENT for loginwiki (T408701)]], [[gerrit:1200120|hCaptcha: Enable 100% passive mode for edits on test2wiki (T405586)]] (duration: 12m 08s) [20:39:39] FIRING: [4x] TransitBGPDown: Transit BGP session down between cr2-eqsin and Hurricane Electric (103.231.152.47) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [20:39:46] T408701: Enable event logging for the mediawiki.product_metrics.suggested_investigations_interaction stream on loginwiki - https://phabricator.wikimedia.org/T408701 [20:39:47] T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586 [20:39:52] xSavitar: it's live [20:40:14] * xSavitar checks... 
[20:41:24] kostajh, yep, metrics are coming in again, thanks for sync 🙏🏽 [20:41:29] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:42:19] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:43:23] (03CR) 10Andrew Bogott: [C:03+2] pdns_server: rename 'master' to 'primary' [puppet] - 10https://gerrit.wikimedia.org/r/1200097 (owner: 10Andrew Bogott) [20:43:37] (03CR) 10Dzahn: [V:03+1 C:03+2] "applies all of the base stuff but only on node 1001 - https://puppet-compiler.wmflabs.org/output/1198397/7518/tcp-proxy1001.eqiad.wmnet/in" [puppet] - 10https://gerrit.wikimedia.org/r/1198397 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [20:44:34] xSavitar: you're welcome! [20:44:39] RESOLVED: [4x] TransitBGPDown: Transit BGP session down between cr2-eqsin and Hurricane Electric (103.231.152.47) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [20:45:46] 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 06Release-Engineering-Team (Radar): Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11329606 (10Dzahn) config example kindly provided by Chris Danis: {P84490} [20:46:07] 06SRE, 10envoy, 06serviceops, 13Patch-For-Review: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663#11329609 (10RLazarus) 05In progress→03Resolved [20:47:12] dancy, I filed https://phabricator.wikimedia.org/T408868 so that I don't forget. I can always improve the task if needed, but I just did a brain-dump right now. Thanks! 
[20:47:23] (03CR) 10Dzahn: [C:03+2] site/role: create placeholder role/profile for tcpproxy [puppet] - 10https://gerrit.wikimedia.org/r/1198397 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [20:51:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T407997)', diff saved to https://phabricator.wikimedia.org/P84493 and previous config saved to /var/cache/conftool/dbconfig/20251030-205102-marostegui.json [20:51:08] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [20:51:15] (03CR) 10Scott French: P:cache:haproxy: introduce ua classes (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [20:55:13] (03PS8) 10Bking: Add OpenSearch cluster configs for net-new clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753) [20:55:54] (03CR) 10Bking: "Record added in I09456d395dd57caa9a61ab2a86a9c9df163f995c" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753) (owner: 10Bking) [20:58:53] Will the Web Team be using their deployment window in a few minutes for anything? If not, there’s a sec patch update I’d like to get out. 
[20:59:00] RESOLVED: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T2100) [21:05:37] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [21:05:51] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [21:06:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P84495 and previous config saved to /var/cache/conftool/dbconfig/20251030-210610-marostegui.json [21:07:15] (03PS1) 10Andrew Bogott: cloud-vps pdns recursor: include nagios_common::check_dns_query [puppet] - 10https://gerrit.wikimedia.org/r/1200170 [21:09:00] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:09:41] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps pdns recursor: include nagios_common::check_dns_query [puppet] - 10https://gerrit.wikimedia.org/r/1200170 (owner: 10Andrew Bogott) [21:11:25] (03PS1) 10Bking: opensearch-cluster: fix chart typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1200171 (https://phabricator.wikimedia.org/T408012) [21:12:50] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:13:51] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group 
cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:17:10] !log Deployed updated security mitigation for T407131 [21:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P84496 and previous config saved to /var/cache/conftool/dbconfig/20251030-212117-marostegui.json [21:25:32] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudservices2005-dev.codfw.wmnet with OS trixie [21:28:44] (03PS1) 10Jdlrobson: Drop references to removed configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200173 (https://phabricator.wikimedia.org/T402470) [21:28:50] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:29:00] (03PS2) 10Jdlrobson: Drop references to removed Advanced mobile contribution configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200173 (https://phabricator.wikimedia.org/T402470) [21:29:05] (03CR) 10Jdlrobson: [C:04-2] Drop references to removed Advanced mobile contribution configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1200173 (https://phabricator.wikimedia.org/T402470) (owner: 10Jdlrobson) [21:33:51] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:33:51] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2005-dev (172.20.5.9) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown 
[21:36:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T407997)', diff saved to https://phabricator.wikimedia.org/P84497 and previous config saved to /var/cache/conftool/dbconfig/20251030-213625-marostegui.json
[21:36:32] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
[21:36:42] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1226.eqiad.wmnet with reason: Maintenance
[21:36:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1226 (T407997)', diff saved to https://phabricator.wikimedia.org/P84498 and previous config saved to /var/cache/conftool/dbconfig/20251030-213649-marostegui.json
[21:42:04] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices2005-dev.codfw.wmnet with reason: host reimage
[21:44:31] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1002-dev.eqiad.wmnet with OS trixie
[21:48:07] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices2005-dev.codfw.wmnet with reason: host reimage
[21:48:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T407997)', diff saved to https://phabricator.wikimedia.org/P84499 and previous config saved to /var/cache/conftool/dbconfig/20251030-214808-marostegui.json
[21:48:15] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
[21:49:00] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:57:49] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1002-dev.eqiad.wmnet with reason: host reimage
[21:58:04] PROBLEM - Host ms-be1090 is DOWN: PING CRITICAL - Packet loss = 100%
[22:03:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P84500 and previous config saved to /var/cache/conftool/dbconfig/20251030-220316-marostegui.json
[22:04:15] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1002-dev.eqiad.wmnet with reason: host reimage
[22:04:34] RECOVERY - Host ms-be1090 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[22:09:57] (PS2) Tim Starling: Enable ChangesListQuery partitioning on mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1199890 (https://phabricator.wikimedia.org/T403798)
[22:09:57] (PS2) Tim Starling: Enable ChangesListQuery partitioning on enwiki and commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1199891 (https://phabricator.wikimedia.org/T403798)
[22:09:57] (PS2) Tim Starling: Enable ChangesListQuery partitioning on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1199892 (https://phabricator.wikimedia.org/T403798)
[22:10:52] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:11:10] any deployments going on?
[22:11:57] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11329915 (VRiley-WMF) Hey @MatthewVernon I apologize about that. It seems the cable slipped out of the card while I was trying to diagnose the issue. It...
[22:12:38] ops-eqiad, SRE, DC-Ops: Unresponsive management for ms-be1090.mgmt:22 - https://phabricator.wikimedia.org/T408585#11329916 (VRiley-WMF) Open→Resolved closing duplicate.
[22:13:51] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2005-dev (172.20.5.9) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:14:15] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1199890 (https://phabricator.wikimedia.org/T403798) (owner: Tim Starling)
[22:15:04] (Merged) jenkins-bot: Enable ChangesListQuery partitioning on mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1199890 (https://phabricator.wikimedia.org/T403798) (owner: Tim Starling)
[22:15:37] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1199890|Enable ChangesListQuery partitioning on mediawikiwiki (T403798)]]
[22:15:43] T403798: Slow watchlist queries due to large and expensive temporary table construction - https://phabricator.wikimedia.org/T403798
[22:18:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P84501 and previous config saved to /var/cache/conftool/dbconfig/20251030-221824-marostegui.json
[22:32:35] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Apply JVM upgrade to 11.0.29 - eevans@cumin1003
[22:33:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T407997)', diff saved to https://phabricator.wikimedia.org/P84502 and previous config saved to /var/cache/conftool/dbconfig/20251030-223331-marostegui.json
[22:33:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[22:33:37] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
[22:34:00] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:39:48] FIRING: PuppetFailure: Puppet has failed on tcp-proxy1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:42:04] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1199890|Enable ChangesListQuery partitioning on mediawikiwiki (T403798)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:42:11] T403798: Slow watchlist queries due to large and expensive temporary table construction - https://phabricator.wikimedia.org/T403798
[22:42:40] !log tstarling@deploy2002 tstarling: Continuing with sync
[22:48:51] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:55:58] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199890|Enable ChangesListQuery partitioning on mediawikiwiki (T403798)]] (duration: 40m 21s)
[22:56:03] T403798: Slow watchlist queries due to large and expensive temporary table construction - https://phabricator.wikimedia.org/T403798
[22:56:56] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:57:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2005-dev (172.20.5.9) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:59:56] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:00:46] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudservices2005-dev.codfw.wmnet with OS trixie
[23:02:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2005-dev (172.20.5.9) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:04:47] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1002-dev.eqiad.wmnet with OS trixie
[23:19:58] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1199891 (https://phabricator.wikimedia.org/T403798) (owner: Tim Starling)
[23:20:46] (Merged) jenkins-bot: Enable ChangesListQuery partitioning on enwiki and commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1199891 (https://phabricator.wikimedia.org/T403798) (owner: Tim Starling)
[23:21:07] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1199891|Enable ChangesListQuery partitioning on enwiki and commonswiki (T403798)]]
[23:21:12] T403798: Slow watchlist queries due to large and expensive temporary table construction - https://phabricator.wikimedia.org/T403798
[23:25:30] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1199891|Enable ChangesListQuery partitioning on enwiki and commonswiki (T403798)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:27:43] !log tstarling@deploy2002 tstarling: Continuing with sync
[23:30:32] (PS1) Dzahn: site: fix regex for tcp-proxy to cover 1002 [puppet] - https://gerrit.wikimedia.org/r/1200188 (https://phabricator.wikimedia.org/T408532)
[23:31:06] (CR) Dzahn: [C:+2] site: fix regex for tcp-proxy to cover 1002 [puppet] - https://gerrit.wikimedia.org/r/1200188 (https://phabricator.wikimedia.org/T408532) (owner: Dzahn)
[23:35:41] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199891|Enable ChangesListQuery partitioning on enwiki and commonswiki (T403798)]] (duration: 14m 33s)
[23:35:46] T403798: Slow watchlist queries due to large and expensive temporary table construction - https://phabricator.wikimedia.org/T403798
[23:36:01] (PS1) Dzahn: tcpproxy: set puppet7 and firewall provider to ferm for new role [puppet] - https://gerrit.wikimedia.org/r/1200189 (https://phabricator.wikimedia.org/T408532)
[23:38:55] (CR) Dzahn: [C:+2] tcpproxy: set puppet7 and firewall provider to ferm for new role [puppet] - https://gerrit.wikimedia.org/r/1200189 (https://phabricator.wikimedia.org/T408532) (owner: Dzahn)
[23:48:48] !log forward-fixing to puppet7 on tcp-proxy1001/1002 per T349619 T408532
[23:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:48:55] T349619: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619
[23:48:55] T408532: Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532
[23:50:29] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:51:19] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:57:56] (PS1) Dzahn: tcpproxy: add config template [puppet] - https://gerrit.wikimedia.org/r/1200190 (https://phabricator.wikimedia.org/T408532)