[00:02:03] (03CR) 10Scott French: "While clearly very large, the PCC diff generally looks like what I'd expect, which is nice. Thanks in advance for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1136413 (https://phabricator.wikimedia.org/T380485) (owner: 10Scott French) [00:09:48] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1136474 [00:09:48] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1136474 (owner: 10TrainBranchBot) [00:16:16] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3378 MB (3% inode=98%): /tmp 3378 MB (3% inode=98%): /var/tmp 3378 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [00:29:55] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1136474 (owner: 10TrainBranchBot) [00:46:36] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/69d4adff9ec963248074b4ed851e430576834914028afdd60017788f3eea3f8c/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [00:48:39] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:56:16] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3590 MB (3% inode=98%): /tmp 3590 MB (3% inode=98%): /var/tmp 3590 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [01:03:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [01:09:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.25 [core] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136477 (https://phabricator.wikimedia.org/T386220) [01:09:47] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.25 [core] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136477 (https://phabricator.wikimedia.org/T386220) (owner: 10TrainBranchBot) [01:13:39] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [01:22:14] (03Merged) 10jenkins-bot: Branch commit for wmf/1.44.0-wmf.25 [core] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136477 (https://phabricator.wikimedia.org/T386220) (owner: 10TrainBranchBot) [01:24:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [01:26:36] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:30:48] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1181'] [01:31:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1181'] 
[01:32:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host an-worker1181.eqiad.wmnet with OS bullseye [01:32:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02), 13Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10741929 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker11... [01:58:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky) [01:59:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky) [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T0200) [02:00:54] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [02:06:24] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1181.eqiad.wmnet with OS bullseye [02:06:32] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02), 13Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10741941 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1181.e... 
[02:07:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [02:27:03] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10742000 (10MikhailRyazanov) By the way, are there any reasons, besides historical, to specify image sizes in “pixels” (which nowadays often don't corresp... [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T0300) [03:01:44] (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136485 (https://phabricator.wikimedia.org/T386220) [03:01:45] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136485 (https://phabricator.wikimedia.org/T386220) (owner: 10TrainBranchBot) [03:02:34] (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136485 (https://phabricator.wikimedia.org/T386220) (owner: 10TrainBranchBot) [03:02:57] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.44.0-wmf.25 refs T386220 [03:03:00] T386220: 1.44.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T386220 [03:11:25] FIRING: SystemdUnitFailed: spiderpig-jobrunner.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:23:39] FIRING: [3x] ProbeDown: Service restbase1044-b:9042 has failed probes 
(tcp_cassandra_b_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:26:25] RESOLVED: SystemdUnitFailed: spiderpig-jobrunner.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:28:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [03:31:22] PROBLEM - Disk space on deploy1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/51e26c1e0f39e1935a3cafc60f73aa272a120b6c331359bfc3f18088bc2045c0/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops [03:43:21] !log mwpresync@deploy1003 sync-world failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.44.0-wmf.24,1.44.0-wmf.25 --multiversion-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion --multiversion-debug-image-name docker-registry.discov [03:43:21] ery.wmnet/restricted/mediawiki-multiversion-debug --multiversion-cli-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-cli --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.153.0 --label 
vnd.wikimedia.mediawiki.versions=1.44.0-wmf.24,1.44.0-wmf.25 --label vnd.wi [03:43:21] kimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/mediawiki-staging/scap/image-build --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080' returned non-zero exit status 1. (scap version: 4.153.0) (duration: 40m 23s) [03:48:39] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T0400) [04:10:13] !log mwpresync@deploy1003 Pruned MediaWiki: 1.44.0-wmf.22 (duration: 10m 03s) [04:13:39] FIRING: [3x] ProbeDown: Service restbase1044-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:30:50] PROBLEM - Host ml-serve2007 is DOWN: PING CRITICAL - Packet loss = 100% [04:32:30] PROBLEM - BGP status on lsw1-c3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:36:50] FIRING: KubernetesCalicoDown: ml-serve2007.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2007.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:48:39] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on 
apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:56:49] (03PS1) 10Marostegui: mariadb: Migrate pc6 to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1136487 (https://phabricator.wikimedia.org/T391454) [04:57:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc6 T391454', diff saved to https://phabricator.wikimedia.org/P75002 and previous config saved to /var/cache/conftool/dbconfig/20250415-045700-marostegui.json [04:57:05] T391454: Migrate pcX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391454 [04:57:24] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance [04:59:15] (03CR) 10Marostegui: [C:03+2] mariadb: Migrate pc6 to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1136487 (https://phabricator.wikimedia.org/T391454) (owner: 10Marostegui) [05:03:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc6 T391454', diff saved to https://phabricator.wikimedia.org/P75003 and previous config saved to /var/cache/conftool/dbconfig/20250415-050307-marostegui.json [05:03:11] T391454: Migrate pcX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391454 [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:13:39] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:19:56] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 113, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:20:24] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 46, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:20:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:23:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:45:26] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:46:16] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T0600) [06:00:05] marostegui, Amir1, and federico3: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T0600). 
[06:00:54] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:07:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [06:16:56] PROBLEM - Exim SMTP on lists1004 is CRITICAL: connect to address 208.80.154.81 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Exim [06:19:00] RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Exim [06:27:58] (03PS1) 10Marostegui: events_coredb_master.sql: Add s8 [software] - 10https://gerrit.wikimedia.org/r/1136594 [06:29:19] (03CR) 10Marostegui: "This is a noop" [software] - 10https://gerrit.wikimedia.org/r/1136594 (owner: 10Marostegui) [06:29:21] (03CR) 10Marostegui: [C:03+2] events_coredb_master.sql: Add s8 [software] - 10https://gerrit.wikimedia.org/r/1136594 (owner: 10Marostegui) [06:29:49] (03Merged) 10jenkins-bot: events_coredb_master.sql: Add s8 [software] - 10https://gerrit.wikimedia.org/r/1136594 (owner: 10Marostegui) [06:31:34] (03CR) 10Marostegui: "Just some comments, the mysql side of things looks good, but I'd like to see the code reviewed by someone with more expertise." [cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [06:33:25] (03PS1) 10Marostegui: events_sanitarium.sql: Update sanitarium hosts. 
[software] - 10https://gerrit.wikimedia.org/r/1136598 [06:33:40] (03CR) 10Marostegui: "This is a noop" [software] - 10https://gerrit.wikimedia.org/r/1136598 (owner: 10Marostegui) [06:34:07] (03CR) 10Jelto: [V:03+2 C:03+2] gitlab: use a wmflib::expand_path compatible path for apus keys [labs/private] - 10https://gerrit.wikimedia.org/r/1136391 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [06:34:26] (03CR) 10Marostegui: [C:03+2] events_sanitarium.sql: Update sanitarium hosts. [software] - 10https://gerrit.wikimedia.org/r/1136598 (owner: 10Marostegui) [06:34:55] (03Merged) 10jenkins-bot: events_sanitarium.sql: Update sanitarium hosts. [software] - 10https://gerrit.wikimedia.org/r/1136598 (owner: 10Marostegui) [06:40:11] Deploying cxserver. Minor changes. [06:41:34] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2025-04-07-053106-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134397 (https://phabricator.wikimedia.org/T390732) (owner: 10KartikMistry) [06:43:23] (03Merged) 10jenkins-bot: Update cxserver to 2025-04-07-053106-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134397 (https://phabricator.wikimedia.org/T390732) (owner: 10KartikMistry) [06:44:44] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply [06:45:06] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply [06:45:57] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply [06:46:31] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [06:47:48] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply [06:48:20] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [06:48:58] !log Updated cxserver to 2025-04-07-053106-production (T390732, T390711) [06:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:03] T390732: Close pihwiki - 
https://phabricator.wikimedia.org/T390732 [06:49:03] T390711: Post-creation work for nupwiki - https://phabricator.wikimedia.org/T390711 [06:49:10] (03PS1) 10Filippo Giunchedi: librenms: stop sending data to graphite [puppet] - 10https://gerrit.wikimedia.org/r/1136603 (https://phabricator.wikimedia.org/T372457) [06:49:22] Also, deploying MinT (in staging first!) It will be a bit slower. [06:50:15] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply [06:51:15] (03CR) 10CI reject: [V:04-1] librenms: stop sending data to graphite [puppet] - 10https://gerrit.wikimedia.org/r/1136603 (https://phabricator.wikimedia.org/T372457) (owner: 10Filippo Giunchedi) [06:51:30] (03CR) 10Ayounsi: [C:03+1] librenms: stop sending data to graphite [puppet] - 10https://gerrit.wikimedia.org/r/1136603 (https://phabricator.wikimedia.org/T372457) (owner: 10Filippo Giunchedi) [06:51:31] (03PS1) 10Filippo Giunchedi: kubernetes: remove master usage of prometheus_all_nodes, access is implicit [puppet] - 10https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170) [06:51:33] (03PS1) 10Filippo Giunchedi: deployment_server: stop shipping prometheus_nodes for k8s [puppet] - 10https://gerrit.wikimedia.org/r/1136605 (https://phabricator.wikimedia.org/T389170) [06:51:45] I bet I didn't align some arrows, HOW COULD I FORGET [06:52:58] actually no, unrelated [06:53:41] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] "CI failures are unrelated" [puppet] - 10https://gerrit.wikimedia.org/r/1136603 (https://phabricator.wikimedia.org/T372457) (owner: 10Filippo Giunchedi) [06:53:50] (03CR) 10Ayounsi: [C:03+1] "Could be worth running PCC for netmon1003.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1136603 (https://phabricator.wikimedia.org/T372457) (owner: 10Filippo Giunchedi) [06:54:03] (03CR) 10CI reject: [V:04-1] kubernetes: remove master usage of prometheus_all_nodes, access is implicit [puppet] - 
10https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [06:54:21] (03CR) 10CI reject: [V:04-1] deployment_server: stop shipping prometheus_nodes for k8s [puppet] - 10https://gerrit.wikimedia.org/r/1136605 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [06:55:24] XioNoX: heh, next time [06:56:28] Any recent change with people.wikimedia.org DNS? [06:56:49] godog: no pb ;) [06:57:29] (03CR) 10Brouberol: "Bear in mind that removing the puppet code will not stop/delete the systemd timers. It will just stop managing them via puppet." [puppet] - 10https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [07:00:04] Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:04:28] !log rolling upgrade to varnish 7.1.1-1.1~bpo11+wmf3 in eqsin and codfw - T391334 [07:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:32] T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334 [07:05:34] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_eqsin [07:05:37] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [07:06:02] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_eqsin [07:06:10] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_codfw [07:06:21] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_codfw [07:08:26] jelto: I think puppet CI is busted btw [07:08:45] jelto: compilation errors like these https://integration.wikimedia.org/ci/job/operations-puppet-tests-bullseye/8717/console for modules/profile/spec/classes/profile_gitlab_spec.rb [07:09:41] yes I'm currently troubleshooting the issue, give me a sec [07:10:16] !log elukey@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ms-be1091.eqiad.wmnet with reason: dcops maintenance [07:10:39] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10742385 (10elukey) @Jclark-ctr I downtimed the host for two days, please feel free to shut it down when it is convenient for you :) [07:11:52] jelto: ok no worries, I'm not impacted atm [07:13:41] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: fix type of s3 credentials [puppet] - 10https://gerrit.wikimedia.org/r/1136359 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [07:15:23] (03Abandoned) 10Filippo Giunchedi: kubernetes: replace 
prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1129178 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [07:15:50] (03CR) 10Filippo Giunchedi: [C:03+1] statsd: remove ferm rule for statsd port 8125 [puppet] - 10https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380) (owner: 10Cwhite) [07:16:57] godog: I think the issue is fixed, let me know when you see the error again [07:19:15] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [07:19:16] (03PS3) 10Volans: log: notify user on IRC when awaiting input [software/spicerack] - 10https://gerrit.wikimedia.org/r/1125955 [07:19:16] (03PS1) 10Volans: tests: refactor logging related tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136608 [07:19:23] jelto: ack, thank you [07:19:30] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1136605 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [07:21:13] (03CR) 10Volans: log: notify user on IRC when awaiting input (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1125955 (owner: 10Volans) [07:21:25] (03CR) 10Filippo Giunchedi: [C:03+1] statsd: remove ferm rule for statsd port 8125 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380) (owner: 10Cwhite) [07:21:28] You probably have to rebase to get the fix from https://gerrit.wikimedia.org/r/1136359 [07:21:44] ah yeah of course [07:21:52] CI autorebases behind the good :) [07:22:10] s/good/hood/ [07:22:25] what I mean is the patch is first merged against the tip of the target branch (production) [07:22:35] and the result is what is fetched by the jobs [07:22:48] ah thank you hashar I didn't realize that was the case [07:23:02] Oh it's a different error now, it's complaining about the string length [07:23:07] so 
you can `recheck` to verify the new state [07:23:42] but of course pressing `Rebase` is conveniently one click away and will ultimately end up with the same state [07:24:13] (03CR) 10Vgutierrez: P:durum: add conditional to enable ECH (durum2002) (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [07:25:30] (03PS1) 10Brouberol: wikistatsv1: remove /srv/stats.wikimedia.org/htdocs/v2 directory [puppet] - 10https://gerrit.wikimedia.org/r/1136639 (https://phabricator.wikimedia.org/T389107) [07:25:31] (03PS1) 10Brouberol: wikistatsv1: remove htdocs/v2 link from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136640 (https://phabricator.wikimedia.org/T389107) [07:25:32] (03PS1) 10Brouberol: wikistatsv2: move all content under /srv/stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1136641 (https://phabricator.wikimedia.org/T389107) [07:25:32] (03PS1) 10Brouberol: wikistatsv2: remove assets from htdocs [puppet] - 10https://gerrit.wikimedia.org/r/1136642 (https://phabricator.wikimedia.org/T389107) [07:25:33] (03PS1) 10Brouberol: wikistatsv2: remove htdocs [puppet] - 10https://gerrit.wikimedia.org/r/1136643 (https://phabricator.wikimedia.org/T389107) [07:25:34] (03PS1) 10Brouberol: wikistatsv1: remove old resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136644 (https://phabricator.wikimedia.org/T389107) [07:25:55] (03CR) 10CI reject: [V:04-1] wikistatsv1: remove /srv/stats.wikimedia.org/htdocs/v2 directory [puppet] - 10https://gerrit.wikimedia.org/r/1136639 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [07:26:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:27:15] (03PS2) 10Brouberol: wikistatsv1: remove /srv/stats.wikimedia.org/htdocs/v2 directory [puppet] - 10https://gerrit.wikimedia.org/r/1136639 (https://phabricator.wikimedia.org/T389107) [07:27:15] (03PS2) 10Brouberol: wikistatsv1: remove htdocs/v2 link from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136640 (https://phabricator.wikimedia.org/T389107) [07:27:15] (03PS2) 10Brouberol: wikistatsv2: move all content under /srv/stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1136641 (https://phabricator.wikimedia.org/T389107) [07:27:16] (03PS2) 10Brouberol: wikistatsv2: remove assets from htdocs [puppet] - 10https://gerrit.wikimedia.org/r/1136642 (https://phabricator.wikimedia.org/T389107) [07:27:17] (03PS2) 10Brouberol: wikistatsv2: remove htdocs [puppet] - 10https://gerrit.wikimedia.org/r/1136643 (https://phabricator.wikimedia.org/T389107) [07:27:18] (03PS2) 10Brouberol: wikistatsv1: remove old resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136644 (https://phabricator.wikimedia.org/T389107) [07:28:06] !log make sure all disks are mounted correctly prior to disk-swap testing T391854 [07:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:10] T391854: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854 [07:28:14] !log make sure all disks are mounted correctly prior to disk-swap testing T391854 ms-be1091 [07:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [07:28:45] (03PS1) 10Jelto: ceph: remove Ceph::S3::Credential 
String length constraints [puppet] - 10https://gerrit.wikimedia.org/r/1136657 (https://phabricator.wikimedia.org/T378922) [07:29:13] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136639 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [07:29:16] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136640 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [07:29:18] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136641 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [07:29:21] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136642 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [07:29:24] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136643 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [07:29:27] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136644 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [07:29:35] Brouberol: I broke Puppet CI, give me a minute [07:29:56] haha, no problem, take your time [07:30:18] (03CR) 10Jelto: [C:03+2] ceph: remove Ceph::S3::Credential String length constraints [puppet] - 10https://gerrit.wikimedia.org/r/1136657 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [07:30:27] 06SRE, 06Data-Platform-SRE: archiva1002 - disk 98% full - https://phabricator.wikimedia.org/T391904#10742449 (10LSobanski) [07:31:39] 06SRE, 06Data-Platform-SRE: archiva1002 - disk 98% full - https://phabricator.wikimedia.org/T391904#10742454 (10LSobanski) Archiva is on a path to deprecation so this is likely an ask to disable the alerting altogether. 
[07:31:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:32:07] (03CR) 10Jelto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [07:32:22] (03CR) 10Jelto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [07:34:51] It looks like Puppet CI is happy again [07:35:44] neat, thank you [07:38:14] (03CR) 10Stevemunene: [C:03+1] airflow-test-k8s: adjust dag/file processing timeout to account for large v1 dumps dags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136351 (https://phabricator.wikimedia.org/T391744) (owner: 10Brouberol) [07:38:47] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: adjust dag/file processing timeout to account for large v1 dumps dags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136351 (https://phabricator.wikimedia.org/T391744) (owner: 10Brouberol) [07:39:26] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 47, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:39:58] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 114, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:40:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:41:43] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1136605 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [07:43:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [07:43:39] jelto: speaking of puppet - would you merge my personal config files? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134638 [07:43:59] jelto: is there a process for getting that kind of thing deployed? [07:45:09] it's not urgent, it has just been sitting there for a while, and I'm looking for a way to move it forward. [07:47:38] (03CR) 10Jelto: [C:03+2] ~daniel: Always run screen [puppet] - 10https://gerrit.wikimedia.org/r/1134638 (owner: 10Daniel Kinzler) [07:47:50] duesen: I can merge this change in a sec [07:47:52] !log upgrade thanos to 0.38.0 on prometheus100[57] - T383966 [07:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:55] T383966: Upgrade Thanos to 0.38.0 - https://phabricator.wikimedia.org/T383966 [07:48:39] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:48:41] duesen: process is grabbing a slot during https://wikitech.wikimedia.org/wiki/Puppet_request_window [07:49:30] godog: ah, thanks! I guess I once knew that ;) [07:49:33] godog: Should I wait for the next window? 
[07:49:41] jelto: no please go ahead [07:50:04] duesen: heheh used sparingly also poking oncall/clinic duty has been known to work :D [07:50:30] * duesen pokes sparingly [07:55:06] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.130 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:55:09] duesen: your new screen config should be available in the next 30 minutes. I merged the change [07:57:08] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 82, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:58:42] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:58:44] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:59:03] topranks, XioNoX: cr2-eqin took a nap? [07:59:07] *eqsin [07:59:43] we are getting purged alerts in eqsin as well.. 
looks like we have some connectivity issues [07:59:48] yes [08:00:39] vgutierrez: looking [08:03:39] FIRING: [3x] ProbeDown: Service restbase1044-c:7000 has failed probes (tcp_cassandra_c_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:03:42] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:03:55] vgutierrez: there is an increase in bw on the eqsin-codfw link : https://grafana.wikimedia.org/goto/DS2l63AHR?orgId=1 that could cause saturation and packet loss [08:03:58] we had some timeouts trying to reach codfw (Apr 15 08:02:16 cp5017 purged[2028236]: %4|1744704136.818|REQTMOUT|purged#consumer-1| [thrd:ssl://kafka-main2009.codfw.wmnet:9093/bootstrap]: ssl://kafka-main2009.codfw.wmnet:9093/2004: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests) [08:04:53] but it shouldn't be enough to have an actual impact [08:04:56] losing ~10% of pings [08:05:18] now seems better [08:05:21] (03CR) 10Brouberol: [C:03+2] airflow: convert the scheduler liveness/readiness checks to a tcpCheck [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136336 (https://phabricator.wikimedia.org/T391497) (owner: 10Brouberol) [08:05:26] also for latency [08:05:36] much more stable [08:05:36] pings between where and where? 
[08:05:44] eqsin -> codfw [08:06:47] https://grafana.wikimedia.org/goto/uAikge0HR?orgId=1 [08:06:54] that doesn't look great [08:07:26] vgutierrez: yeah was going to share https://grafana.wikimedia.org/goto/2HXzR60NR?orgId=1 [08:07:43] weird thing is that ulsfo is having the same issue while it's a different link/router [08:07:51] XioNoX: hmm latency is significantly worse over ip6 :] [08:08:30] https://grafana.wikimedia.org/goto/_b3WgeAHg?orgId=1 VS https://grafana.wikimedia.org/goto/sG5GgeAHg?orgId=1 [08:08:33] but looks like whatever happened it improved (still looking) [08:08:42] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [08:09:05] !log dcausse@deploy1003 Started deploy [wdqs/wdqs@4186ae7]: test deploy new scap config to wdqs2025.codfw.wmnet (T221709) [08:09:08] T221709: scap service restarts for WDQS are inconsistent - https://phabricator.wikimedia.org/T221709 [08:09:13] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [08:09:23] !log dcausse@deploy1003 Finished deploy [wdqs/wdqs@4186ae7]: test deploy new scap config to wdqs2025.codfw.wmnet (T221709) (duration: 00m 18s) [08:09:45] yeah I don't get why ulsfo is going through codfw to reach eqsin [08:12:11] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1134173 (owner: 10Hashar) [08:12:24] (03CR) 10Effie Mouzeli: [C:03+1] "Thank you, this is great!" [puppet] - 10https://gerrit.wikimedia.org/r/1136413 (https://phabricator.wikimedia.org/T380485) (owner: 10Scott French) [08:13:09] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1133995 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15) [08:17:02] vgutierrez: ok I get it more now. Looks like Arelion was having issues, I'm going to put it in a normal state and not a "preferred" state. 
Then if the issue happens again we can drain it [08:19:07] ack [08:32:48] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:33:46] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.131 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:35:00] (03CR) 10Ayounsi: [C:03+2] Add BFD down alerting [alerts] - 10https://gerrit.wikimedia.org/r/1134664 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [08:35:46] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:36:37] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Comm Error: Backplane 0 on cirrussearch2091 (Row/Rack A7) - https://phabricator.wikimedia.org/T391639#10742637 (10Gehel) [08:36:39] (03Merged) 10jenkins-bot: Add BFD down alerting [alerts] - 10https://gerrit.wikimedia.org/r/1134664 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [08:36:50] FIRING: KubernetesCalicoDown: ml-serve2007.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2007.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:37:42] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:38:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned 
healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:40:07] (03CR) 10Jaime Nuche: "The repo is under the "releng" directory: `/srv/deployment/releng/jenkins-deploy`" [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [08:40:24] (03CR) 10Alexandros Kosiaris: "Adding Moritz per my understanding that "if you request any new level of sudo privileges for a group (or for yourself individually, outsid" [puppet] - 10https://gerrit.wikimedia.org/r/1130947 (https://phabricator.wikimedia.org/T387823) (owner: 10Hashar) [08:40:38] looks like it's back.... vgutierrez [08:41:38] yep [08:42:01] !log drain arelion eqsin-codfw link [08:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:42] FIRING: [2x] JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:43:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:44:29] vgutierrez: done, let's see if it improves [08:44:40] XioNoX: see _security [08:47:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:48:39] 
FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:51:11] !log dcausse@deploy1003 Started deploy [wdqs/wdqs@4186ae7] (wcqs): test deploy new scap config to wcqs2001.codfw.wmnet (T221709) [08:51:16] T221709: scap service restarts for WDQS are inconsistent - https://phabricator.wikimedia.org/T221709 [08:51:31] !log dcausse@deploy1003 Finished deploy [wdqs/wdqs@4186ae7] (wcqs): test deploy new scap config to wcqs2001.codfw.wmnet (T221709) (duration: 00m 20s) [08:57:02] jouncebot: nowandnext [08:57:02] No deployments scheduled for the next 1 hour(s) and 2 minute(s) [08:57:02] In 1 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1000) [08:57:30] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr4-ulsfo.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [08:58:30] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [09:00:04] I started a sync from gitlab1003 to ceph/apus which seems to be doing 400MB/s. 
But that should not affect ulsfo or codfw [09:00:59] !log jnuche@deploy1003 Started scap sync-world: testwikis to 1.44.0-wmf.25 refs T386220 [09:01:02] T386220: 1.44.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T386220 [09:01:36] ^ train presync failed last night, I'm rerunning it [09:03:30] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [09:07:30] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr4-ulsfo.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [09:11:22] PROBLEM - Disk space on deploy1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/51e26c1e0f39e1935a3cafc60f73aa272a120b6c331359bfc3f18088bc2045c0/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops [09:11:52] (03CR) 10Brouberol: [C:03+1] zookeeper: onboard an-conf1004 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135025 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [09:12:02] (03CR) 10Brouberol: [C:03+1] zookeeper: onboard an-conf1005 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135026 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [09:12:13] (03CR) 10Brouberol: [C:03+1] zookeeper: onboard an-conf1006 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135027 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene) [09:13:39] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:15:35] !log jnuche@deploy1003 sync-world aborted: testwikis to 1.44.0-wmf.25 refs T386220 (duration: 14m 36s) [09:15:39] T386220: 1.44.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T386220 [09:19:47] (03CR) 10Marostegui: sanitarium_restart.py: restart Sanitarium hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [09:23:12] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - No response from remote host 198.35.26.193 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:23:23] uh? [09:27:34] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Idle https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:27:42] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:28:58] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:29:50] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:32:42] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:39:45] (03CR) 10Federico Ceratto: "Replied to a comment - no new code changes introduced." [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [09:41:26] jouncebot: nowandnext [09:41:26] No deployments scheduled for the next 0 hour(s) and 18 minute(s) [09:41:26] In 0 hour(s) and 18 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1000) [09:43:22] !log dcausse@deploy1003 Started deploy [wdqs/wdqs@fe88851]: version 0.3.156 (T326311) [09:43:26] T326311: Deletion of Lexemes appears to leak triples related to its forms and senses - https://phabricator.wikimedia.org/T326311 [09:49:19] (03PS1) 10AikoChou: ml-services: update edit-check image with shap values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136669 (https://phabricator.wikimedia.org/T387984) [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:53:42] jouncebot: nowandnext [09:53:42] No deployments scheduled for the next 0 hour(s) and 6 minute(s) [09:53:42] In 0 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1000) [09:54:26] Amir1: not sure you're gonna be able to deploy [09:54:28] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1158.eqiad.wmnet with reason: Maintenance [09:54:28] https://phabricator.wikimedia.org/T390251 is acting up [09:54:35] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:54:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T391056)', diff saved to https://phabricator.wikimedia.org/P75005 and previous config saved to /var/cache/conftool/dbconfig/20250415-095442-fceratto.json [09:54:46] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [09:54:59] although maybe just a backport will go through where the train didn't [09:55:01] (03PS1) 10Ladsgroup: Bump thumbnail steps to 95% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136670 (https://phabricator.wikimedia.org/T360589) [09:55:03] :/ [09:55:10] Do you want me to try? [09:55:16] jnuche: ^ ? [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:56:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T391056)', diff saved to https://phabricator.wikimedia.org/P75006 and previous config saved to /var/cache/conftool/dbconfig/20250415-095650-fceratto.json [09:57:54] !log dcausse@deploy1003 Finished deploy [wdqs/wdqs@fe88851]: version 0.3.156 (T326311) (duration: 14m 31s) [09:57:57] T326311: Deletion of Lexemes appears to leak triples related to its forms and senses - https://phabricator.wikimedia.org/T326311 [09:58:27] !log dcausse@deploy1003 Started deploy [wdqs/wdqs@fe88851] (wcqs): version 0.3.156 [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1000) [10:00:52] !log dcausse@deploy1003 Finished deploy [wdqs/wdqs@fe88851] (wcqs): 
version 0.3.156 (duration: 02m 25s) [10:00:54] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [10:01:07] Amir1, claime: yeah, chances are you will run into the same issue [10:01:15] :( [10:01:22] I can wait then [10:01:22] jnuche: even for just a backport? [10:01:31] it shouldn't be rebuilding the whole image [10:01:43] (03PS1) 10Volans: docstrings: remove types from docstrings [software/homer] - 10https://gerrit.wikimedia.org/r/1136673 [10:01:50] which is kind of the not-really-deterministic trigger for this [10:05:24] Amir1, claime: from my side it's okay to try. But if it fails I'd ask that you create a revert in gerrit for the backport change [10:05:45] ack [10:07:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:11:38] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update edit-check image with shap values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136669 (https://phabricator.wikimedia.org/T387984) (owner: 10AikoChou) [10:11:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P75007 and previous config saved to /var/cache/conftool/dbconfig/20250415-101158-fceratto.json [10:14:54] (03CR) 10Kamila Součková: [C:03+1] mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [10:15:33] (03CR) 10Ladsgroup: "gentle ping" [puppet] - 10https://gerrit.wikimedia.org/r/1135107 (https://phabricator.wikimedia.org/T390954) (owner: 10Ladsgroup) [10:15:50] Amir1: Go ahead and try your backport [10:16:07] we'll revert if the registry is still fucking up [10:16:22] sure [10:17:05] 
(03PS1) 10Hnowlan: trafficserver: route various miscellaneous pcs services to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1136676 (https://phabricator.wikimedia.org/T385033) [10:17:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136670 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:18:17] (03Merged) 10jenkins-bot: Bump thumbnail steps to 95% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136670 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:19:07] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1136670|Bump thumbnail steps to 95% (T360589)]] [10:19:10] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:21:11] (03CR) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [10:24:23] !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=1) rolling upgrade of Varnish on A:cp-text_eqsin [10:26:29] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp[5023-5024].eqsin.wmnet} and A:cp [10:27:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P75008 and previous config saved to /var/cache/conftool/dbconfig/20250415-102705-fceratto.json [10:28:39] FIRING: [4x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:31:22] PROBLEM - Disk space on deploy1003 is CRITICAL: DISK CRITICAL - 
/srv/docker/overlay2/51e26c1e0f39e1935a3cafc60f73aa272a120b6c331359bfc3f18088bc2045c0/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops [10:32:09] claime: ^ [10:32:37] and 13 minutes stuck on this [10:32:38] > 10:19:34 K8s images build/push output redirected to /home/ladsgroup/scap-image-build-and-push-log [10:32:44] Amir1: yeah that's... happened a few times and I haven't figured out why [10:33:17] I ctrl+c'd now [10:33:18] !log ladsgroup@deploy1003 sync-world aborted: Backport for [[gerrit:1136670|Bump thumbnail steps to 95% (T360589)]] (duration: 14m 11s) [10:33:21] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:33:29] (03PS22) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) [10:33:37] try it again [10:33:39] FIRING: [4x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:33:41] (03PS1) 10Ladsgroup: Revert "Bump thumbnail steps to 95%" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136678 [10:33:45] (03CR) 10Ladsgroup: [C:03+2] Revert "Bump thumbnail steps to 95%" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136678 (owner: 10Ladsgroup) [10:34:00] or revert :D [10:34:01] (03CR) 10Ladsgroup: Revert "Bump thumbnail steps to 95%" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136678 (owner: 10Ladsgroup) [10:34:10] I stopped the revert :D [10:34:27] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5300/co" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [10:34:32] I'm kind of at a loss as to what we can do to fix this [10:34:35] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1136670|Bump thumbnail steps to 95% (T360589)]] [10:36:38] (03CR) 10Vgutierrez: [C:03+1] P:durum: add conditional to enable ECH (durum2002) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [10:37:39] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_codfw [10:37:45] (03PS9) 10Fabfur: benthos: install benthos on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) [10:38:24] (03CR) 10AikoChou: [C:03+2] ml-services: update edit-check image with shap values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136669 (https://phabricator.wikimedia.org/T387984) (owner: 10AikoChou) [10:38:49] !log sudo cumin 'A:durum' 'disable-puppet "rolling out CR 1132669"' [10:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:44] !log ladsgroup@deploy1003 sync-world aborted: Backport for [[gerrit:1136670|Bump thumbnail steps to 95% (T360589)]] (duration: 05m 08s) [10:39:47] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:39:49] (03CR) 10Ladsgroup: [C:03+2] Revert "Bump thumbnail steps to 95%" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136678 (owner: 10Ladsgroup) [10:40:05] (03CR) 10Ssingh: [V:03+1 C:03+2] P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [10:40:08] (03Merged) 
10jenkins-bot: ml-services: update edit-check image with shap values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136669 (https://phabricator.wikimedia.org/T387984) (owner: 10AikoChou) [10:40:17] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_codfw [10:40:19] I'm gonna try something [10:40:39] (03Merged) 10jenkins-bot: Revert "Bump thumbnail steps to 95%" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136678 (owner: 10Ladsgroup) [10:40:53] (03CR) 10Vgutierrez: [C:03+1] "looking good, few nitpicks" [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:41:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:41:38] !log enable puppet on durum2002 [10:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T391056)', diff saved to https://phabricator.wikimedia.org/P75009 and previous config saved to /var/cache/conftool/dbconfig/20250415-104212-fceratto.json [10:42:16] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [10:42:29] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1170.eqiad.wmnet with reason: Maintenance [10:42:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T391056)', diff saved to https://phabricator.wikimedia.org/P75010 and previous config saved to /var/cache/conftool/dbconfig/20250415-104235-fceratto.json [10:43:52] 
(03PS10) 10Fabfur: cache: install benthos on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) [10:44:29] (03CR) 10Fabfur: cache: install benthos on all cp hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:44:36] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:46:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:51:05] (03PS13) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) [10:51:22] RECOVERY - Disk space on deploy1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops [10:52:26] !log rolling upgrade to varnish 7.1.1-1.1~bpo11+wmf3 in drmrs - T391334 [10:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:30] T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334 [10:52:37] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_drmrs [10:52:44] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_drmrs [10:56:44] (03CR) 10Jelto: [C:03+2] wikidata-query-gui: add query-legacy-full to existing gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135383 
(https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [10:58:08] (03CR) 10CI reject: [V:04-1] upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [10:58:37] (03Merged) 10jenkins-bot: wikidata-query-gui: add query-legacy-full to existing gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135383 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [10:58:50] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_eqsin [10:59:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T391056)', diff saved to https://phabricator.wikimedia.org/P75011 and previous config saved to /var/cache/conftool/dbconfig/20250415-105941-fceratto.json [10:59:44] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [11:00:14] (03CR) 10Cathal Mooney: [C:03+1] magru: remove novaacore/momentum [homer/public] - 10https://gerrit.wikimedia.org/r/1136152 (https://phabricator.wikimedia.org/T381913) (owner: 10Ayounsi) [11:01:42] (03PS1) 10Fabfur: cache: use fqdn in syslog hostname [puppet] - 10https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571) [11:01:58] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571) (owner: 10Fabfur) [11:03:02] 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 4 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10742996 (10Ifrahkhanyaree_WMDE) [11:04:39] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10743021 (10Ladsgroup) >>! 
In T355914#10738719, @hgzh wrote: > I tried an onwiki answer, so thank you for the reply here. But IMO this could have been ann... [11:05:27] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp[5023-5024].eqsin.wmnet} and A:cp [11:06:30] PROBLEM - Bird Internet Routing Daemon on durum2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [11:06:40] ^^ that's probably sukhe [11:06:43] yes [11:06:51] (03PS1) 10Arturo Borrero Gonzalez: openstack: networktests: extend to check DNS, LDAP, internet, etc [puppet] - 10https://gerrit.wikimedia.org/r/1136681 (https://phabricator.wikimedia.org/T391325) [11:06:58] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:06:58] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:07:15] !log sukhe@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on durum2002.codfw.wmnet with reason: testing [11:07:29] !log cgoubert@deploy1003 Started scap sync-world: test rebuild to look at logs [11:07:39] !log rolling upgrade to varnish 7.1.1-1.1~bpo11+wmf3 in esams - T391334 [11:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:42] T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334 [11:08:19] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_esams and not P{cp3073.esams.wmnet} and A:cp [11:08:40] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_esams and not P{cp3081.esams.wmnet} and A:cp [11:11:37] (03PS2) 10Arturo Borrero Gonzalez: openstack: networktests: extend to check DNS, LDAP, internet, etc [puppet] - 
10https://gerrit.wikimedia.org/r/1136681 (https://phabricator.wikimedia.org/T391325) [11:12:29] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: networktests: extend to check DNS, LDAP, internet, etc [puppet] - 10https://gerrit.wikimedia.org/r/1136681 (https://phabricator.wikimedia.org/T391325) (owner: 10Arturo Borrero Gonzalez) [11:14:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P75012 and previous config saved to /var/cache/conftool/dbconfig/20250415-111447-fceratto.json [11:16:25] (03PS14) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) [11:16:54] (03PS1) 10Ssingh: Revert "P:durum: add conditional to enable ECH (durum2002)" [puppet] - 10https://gerrit.wikimedia.org/r/1136684 [11:17:49] (03PS15) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) [11:18:06] (03CR) 10Ayounsi: [C:03+1] docstrings: remove types from docstrings [software/homer] - 10https://gerrit.wikimedia.org/r/1136673 (owner: 10Volans) [11:18:39] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:18:48] (03CR) 10Volans: [C:03+2] docstrings: remove types from docstrings [software/homer] - 10https://gerrit.wikimedia.org/r/1136673 (owner: 10Volans) [11:20:23] (03PS1) 10Arturo Borrero Gonzalez: openstack: networktests: fix yaml typos [puppet] - 10https://gerrit.wikimedia.org/r/1136685 (https://phabricator.wikimedia.org/T391325) [11:21:22] PROBLEM - Disk space on deploy1003 is CRITICAL: DISK 
CRITICAL - /srv/docker/overlay2/51e26c1e0f39e1935a3cafc60f73aa272a120b6c331359bfc3f18088bc2045c0/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops [11:21:25] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: networktests: fix yaml typos [puppet] - 10https://gerrit.wikimedia.org/r/1136685 (https://phabricator.wikimedia.org/T391325) (owner: 10Arturo Borrero Gonzalez) [11:23:48] (03PS16) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) [11:24:44] !log jelto@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [11:24:52] !log jelto@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [11:25:03] !log jelto@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [11:25:19] !log jelto@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [11:25:27] !log jelto@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [11:25:56] !log jelto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [11:26:39] (03CR) 10Federico Ceratto: "Ok, I updated the code as required and tested it with real runs before and with dry-run in the last changes." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [11:28:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [11:29:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P75013 and previous config saved to /var/cache/conftool/dbconfig/20250415-112955-fceratto.json [11:30:02] (03PS1) 10Arturo Borrero Gonzalez: openstack: networktests: more YAML formatting [puppet] - 10https://gerrit.wikimedia.org/r/1136687 (https://phabricator.wikimedia.org/T391325) [11:30:07] (03Merged) 10jenkins-bot: docstrings: remove types from docstrings [software/homer] - 10https://gerrit.wikimedia.org/r/1136673 (owner: 10Volans) [11:33:39] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:37:19] (03CR) 10Ssingh: [C:03+2] Revert "P:durum: add conditional to enable ECH (durum2002)" [puppet] - 10https://gerrit.wikimedia.org/r/1136684 (owner: 10Ssingh) [11:37:26] claime, Amir1 o/ were you able to finish the deploy? [11:37:30] elukey: no [11:37:42] I just did a sync-world with no file to get a push [11:37:50] it did manage to push the images in about 15 minutes [11:38:06] but then it failed deploying to testservers because of the bad blob in dragonfly [11:39:12] is it still ongoing?
Because I may have a workaround in mind [11:39:20] no, it's failed now [11:39:30] you can go ahead [11:39:49] nono it was more a manual fix for the workers failing to get the right blob [11:40:21] when the failures in pulling happen, we can try to wait 5 minutes and then explicitly kill the failed pods [11:40:36] ah [11:40:48] if our theory of the dragonfly involvement is true, they should trigger another pull [11:40:52] I'll run a scap sync-world again [11:40:52] a "fresh" one [11:40:54] we'll see [11:41:07] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: networktests: more YAML formatting [puppet] - 10https://gerrit.wikimedia.org/r/1136687 (https://phabricator.wikimedia.org/T391325) (owner: 10Arturo Borrero Gonzalez) [11:41:22] RECOVERY - Disk space on deploy1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops [11:41:37] do we have a way to force a redeploy of the latest image, even though scap didn't update the release file? [11:42:10] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum2002.codfw.wmnet with OS bookworm [11:42:26] no idea [11:43:03] ok I'm gonna update the release files manually [11:43:20] then run a scap without build [11:43:48] I'm not even sure what I'm trying to achieve anymore...
that will just work now that dragonfly has evicted the blob [11:44:15] also we can't ask deployers to wait 5 minutes looking at kubectl get pods for all debug envs, then delete the ones misbehaving [11:44:21] this is very problematic [11:45:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T391056)', diff saved to https://phabricator.wikimedia.org/P75014 and previous config saved to /var/cache/conftool/dbconfig/20250415-114501-fceratto.json [11:45:06] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [11:45:09] !log sudo cumin 'A:durum and not P{durum2002*}' 'run-puppet-agent --enable "rolling out CR 1132669"' [11:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:18] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:47:19] claime: yes I agree, but in theory danc*y is working on a solution to automatically force scap to pull the new images, and once that works proceed [11:47:24] it may alleviate the problem [11:48:39] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:48:48] yeah, it may [11:58:11] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum2002.codfw.wmnet with reason: host reimage [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1200) [12:00:07] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1174.eqiad.wmnet with reason: Maintenance [12:00:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T391056)', diff saved to https://phabricator.wikimedia.org/P75015 
and previous config saved to /var/cache/conftool/dbconfig/20250415-120013-fceratto.json [12:00:18] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [12:01:55] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum2002.codfw.wmnet with reason: host reimage [12:02:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T391056)', diff saved to https://phabricator.wikimedia.org/P75016 and previous config saved to /var/cache/conftool/dbconfig/20250415-120222-fceratto.json [12:07:15] (03PS1) 10Elukey: sre.hosts.provision: add a warning for Supermicro Config C [cookbooks] - 10https://gerrit.wikimedia.org/r/1136695 (https://phabricator.wikimedia.org/T387577) [12:08:26] (03PS1) 10Michael Große: tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136696 (https://phabricator.wikimedia.org/T391695) [12:09:06] (03PS1) 10Michael Große: perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136698 (https://phabricator.wikimedia.org/T391695) [12:09:29] (03PS1) 10Michael Große: perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136700 (https://phabricator.wikimedia.org/T391695) [12:09:58] (03PS1) 10Michael Große: tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136701 (https://phabricator.wikimedia.org/T391695) [12:10:41] (03PS1) 10Michael Große: perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136702 
(https://phabricator.wikimedia.org/T391695) [12:10:59] (03PS1) 10Michael Große: perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136703 (https://phabricator.wikimedia.org/T391695) [12:12:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136701 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große) [12:12:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136702 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große) [12:12:55] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10743284 (10Jelto) There were some problem adding the Ceph apus credentials to puppet. It was mostly an issue of wrong file names a... 
[12:13:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136703 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große) [12:13:03] (03CR) 10Volans: [C:04-1] "LGTM but missing one needed comma" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136695 (https://phabricator.wikimedia.org/T387577) (owner: 10Elukey) [12:13:58] (03PS1) 10Ayounsi: gnmic: bump num-workers to 24 [puppet] - 10https://gerrit.wikimedia.org/r/1136704 (https://phabricator.wikimedia.org/T388641) [12:14:17] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10743317 (10Jelto) [12:14:58] RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:14:58] RECOVERY - BFD status on cr1-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:15:12] (03PS2) 10Elukey: sre.hosts.provision: add a warning for Supermicro Config C [cookbooks] - 10https://gerrit.wikimedia.org/r/1136695 (https://phabricator.wikimedia.org/T387577) [12:15:24] (03CR) 10Elukey: sre.hosts.provision: add a warning for Supermicro Config C (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1136695 (https://phabricator.wikimedia.org/T387577) (owner: 10Elukey) [12:15:42] (03CR) 10Cathal Mooney: [C:03+2] gnmic: bump num-workers to 24 [puppet] - 10https://gerrit.wikimedia.org/r/1136704 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [12:15:54] (03CR) 10Cathal Mooney: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1136704 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [12:16:29] (03PS1) 10FNegri: openstack: 
Tidy up wmcs-wikireplica-dns script [puppet] - 10https://gerrit.wikimedia.org/r/1136705 (https://phabricator.wikimedia.org/T374953) [12:17:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P75017 and previous config saved to /var/cache/conftool/dbconfig/20250415-121728-fceratto.json [12:17:30] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:17:58] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:17:58] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:18:06] hmmm.... [12:18:26] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:18:27] (03PS1) 10Filippo Giunchedi: etcd: replace prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1129177 (https://phabricator.wikimedia.org/T389170) [12:18:46] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:18:58] RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:18:58] RECOVERY - BFD status on cr1-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:19:27] (03CR) 10Ayounsi: [C:03+2] gnmic: bump num-workers to 24 [puppet] - 10https://gerrit.wikimedia.org/r/1136704 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [12:19:29] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:20:01] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host durum2002.codfw.wmnet with OS bookworm [12:20:42] (03CR) 10Milimetric: [C:03+1] wikistatsv1: remove /srv/stats.wikimedia.org/htdocs/v2 directory [puppet] - 10https://gerrit.wikimedia.org/r/1136639 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [12:20:58] !log upgrade thanos to 0.38.0 on O:prometheus::pop [12:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:05] !log upgrade thanos to 0.38.0 on O:prometheus::pop - T383966 [12:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:09] T383966: Upgrade Thanos to 0.38.0 - https://phabricator.wikimedia.org/T383966 [12:21:12] (03CR) 10CI reject: [V:04-1] tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136696 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große) [12:21:13] (03CR) 10Milimetric: [C:03+1] wikistatsv1: remove htdocs/v2 link from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136640 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [12:22:08] (03CR) 10Milimetric: [C:03+1] wikistatsv2: move all content under /srv/stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1136641 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [12:22:27] (03CR) 10Milimetric: [C:03+1] wikistatsv2: remove assets from htdocs [puppet] - 10https://gerrit.wikimedia.org/r/1136642 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [12:22:42] (03CR) 10Milimetric: [C:03+1] wikistatsv2: remove htdocs [puppet] - 10https://gerrit.wikimedia.org/r/1136643 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [12:22:46] (03CR) 10Michael Große: "recheck" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136696 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große) [12:22:58] (03CR) 10ScheduleDeploymentBot: 
"Scheduled for deployment in the [Tuesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136696 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große) [12:23:04] (03CR) 10Milimetric: [C:03+1] wikistatsv1: remove old resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136644 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [12:23:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136698 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große) [12:23:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136700 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große) [12:23:45] (03CR) 10Milimetric: [C:03+1] "Ok, chain makes sense and looks good to me. Just noting here that I'm going to ask in Slack about archiving the content under htdocs." 
[puppet] - 10https://gerrit.wikimedia.org/r/1136644 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [12:23:54] (03PS1) 10Brouberol: airflow: ensure the pod running in the KubernetesPodOperator itself gets low resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136706 (https://phabricator.wikimedia.org/T391669) [12:24:48] (03PS2) 10FNegri: openstack: Tidy up wmcs-wikireplica-dns script [puppet] - 10https://gerrit.wikimedia.org/r/1136705 (https://phabricator.wikimedia.org/T374953) [12:25:17] (03CR) 10Elukey: [C:03+1] log: notify user on IRC when awaiting input (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1125955 (owner: 10Volans) [12:25:18] !log cgoubert@deploy1003 Started scap build-images: (no justification provided) [12:25:59] (03CR) 10Elukey: [C:03+1] tests: refactor logging related tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136608 (owner: 10Volans) [12:26:31] !log cgoubert@deploy1003 build-images aborted: (no justification provided) (duration: 01m 12s) [12:26:33] !log cgoubert@deploy1003 Started scap build-images: (no justification provided) [12:26:34] !log cgoubert@deploy1003 build-images aborted: (no justification provided) (duration: 00m 01s) [12:26:37] !log cgoubert@deploy1003 Started scap build-images: (no justification provided) [12:26:51] Don't mind this, I can't use my fingers apparently [12:29:22] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) 
- https://phabricator.wikimedia.org/T378922#10743380 (10Jelto) [12:31:49] (03CR) 10Brouberol: [C:03+2] wikistatsv1: remove /srv/stats.wikimedia.org/htdocs/v2 directory [puppet] - 10https://gerrit.wikimedia.org/r/1136639 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [12:31:54] (03CR) 10Brouberol: [C:03+2] wikistatsv1: remove htdocs/v2 link from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136640 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [12:32:02] (03CR) 10Brouberol: wikistatsv1: remove htdocs/v2 link from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136640 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [12:32:05] !log cgoubert@deploy1003 Finished scap build-images: (no justification provided) (duration: 05m 27s) [12:32:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P75018 and previous config saved to /var/cache/conftool/dbconfig/20250415-123236-fceratto.json [12:33:01] (03CR) 10Elukey: [C:03+1] hosts: add a new hosts module with a Host class [software/spicerack] - 10https://gerrit.wikimedia.org/r/1135763 (owner: 10Volans) [12:33:04] !log cgoubert@deploy1003 Started scap sync-world: test rebuild to test swift eventual consistency [12:33:45] !log slyngshede@cumin1002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Andy Cooper out of all services on: 2393 hosts [12:34:43] crap that test won't work, it's the same image [12:34:49] (03CR) 10Volans: [C:03+1] "In the interest of unblocking the situation between the this CR and I4ce9217392a7795940c981e1ee7da52df026cb5c let's merge this as-is even " [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [12:34:56] well it'll work, but it won't tell us anything [12:35:36] (03PS1) 10Slyngshede: data.yaml: Offboarding Andy Cooper [puppet] - 10https://gerrit.wikimedia.org/r/1136710 
[12:35:39] I have a full-image-build requiring change to push anyways, so I'm gonna do that afterwards [12:36:27] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136706 (https://phabricator.wikimedia.org/T391669) (owner: 10Brouberol) [12:36:50] FIRING: KubernetesCalicoDown: ml-serve2007.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2007.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:37:48] (03CR) 10Mark Bergsma: [C:03+2] data.yaml: Offboarding Andy Cooper [puppet] - 10https://gerrit.wikimedia.org/r/1136710 (owner: 10Slyngshede) [12:39:49] (03PS2) 10Arnaudb: gerrit: failover cookbook bugfix [cookbooks] - 10https://gerrit.wikimedia.org/r/1136709 (https://phabricator.wikimedia.org/T260666) [12:39:49] (03CR) 10Arnaudb: "This patch adds a missing element to our logic, to properly handle gerrit's service state." [cookbooks] - 10https://gerrit.wikimedia.org/r/1136709 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [12:39:55] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136695 (https://phabricator.wikimedia.org/T387577) (owner: 10Elukey) [12:41:59] !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for durum2002.codfw.wmnet [12:42:00] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for durum2002.codfw.wmnet [12:43:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136133 (https://phabricator.wikimedia.org/T391621) (owner: 10Acamicamacaraca) [12:44:30] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:45:26] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:47:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T391056)', diff saved to https://phabricator.wikimedia.org/P75020 and previous config saved to /var/cache/conftool/dbconfig/20250415-124743-fceratto.json [12:47:47] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [12:47:55] jouncebot: nowandnext [12:47:55] For the next 0 hour(s) and 12 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1200) [12:47:55] In 0 hour(s) and 12 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1300) [12:47:58] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1181.eqiad.wmnet with reason: Maintenance [12:48:02] god dammit 12 minutes [12:48:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1181 (T391056)', diff saved to https://phabricator.wikimedia.org/P75021 and previous config saved to /var/cache/conftool/dbconfig/20250415-124805-fceratto.json [12:48:10] well we'll see if backports work ig [12:48:22] (03CR) 10Volans: [C:03+2] log: notify user on IRC when awaiting input [software/spicerack] - 10https://gerrit.wikimedia.org/r/1125955 (owner: 10Volans) [12:48:33] (03CR) 10Volans: [C:03+2] tests: refactor logging related tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136608 (owner: 10Volans) [12:48:39] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:49:47] !log cgoubert@deploy1003 cgoubert: test rebuild to test swift eventual consistency synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:49:54] !log cgoubert@deploy1003 cgoubert: Continuing with sync [12:50:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T391056)', diff saved to https://phabricator.wikimedia.org/P75022 and previous config saved to /var/cache/conftool/dbconfig/20250415-125014-fceratto.json [12:50:16] (03PS1) 10Filippo Giunchedi: logstash: restore forcemerge in curator [puppet] - 10https://gerrit.wikimedia.org/r/1136713 (https://phabricator.wikimedia.org/T391661) [12:50:32] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:51:01] (03PS1) 10Jelto: wikidata-query-gui: add query-legacy-full.w.o to querybuilder hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136714 (https://phabricator.wikimedia.org/T350793) [12:52:04] (03CR) 10Brouberol: [C:03+2] airflow: ensure the pod running in the KubernetesPodOperator itself gets low resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136706 (https://phabricator.wikimedia.org/T391669) (owner: 10Brouberol) [12:52:25] jnuche: we're gonna try to run the deployment window with the sleep in place... maybe we can at least deploy with that [12:52:26] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:52:30] sgty? [12:53:17] claime: sounds good, ty! [12:53:45] I hate that workaround but I don't have anything better rn [12:53:54] (03CR) 10Jelto: [C:03+1] "looks good to me, thanks for the addition!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1136709 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [12:53:56] We'll try to batch the backports as much as possible [12:54:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [12:54:36] (03CR) 10Jelto: [C:03+2] wikidata-query-gui: add query-legacy-full.w.o to querybuilder hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136714 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [12:54:36] With a little luck my current deploy will be done just in time for the window [12:55:07] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [12:55:18] robertsky, MichaelG_WMF, please look at your patches and tell me if I can backport any of them in the same scap or if they need staggering [12:55:24] this is gonna be a long window [12:55:37] (03PS1) 10DCausse: cirrus-streaming-updater: set upgradeMode to savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136716 (https://phabricator.wikimedia.org/T390853) [12:55:49] claime: you can backport them all together [12:55:54] MichaelG_WMF: awesome [12:56:04] claime, you can do it altogether for mine as well. [12:56:10] (03CR) 10Elukey: [C:03+2] "Tested with test-cookbook :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136695 (https://phabricator.wikimedia.org/T387577) (owner: 10Elukey) [12:56:13] robertsky: fantastic, thanks [12:56:23] claime: (per release that is, so two sets, one for .24 and one for .25) [12:56:24] (03Merged) 10jenkins-bot: wikidata-query-gui: add query-legacy-full.w.o to querybuilder hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136714 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [12:56:31] claime: the sleep is better than having train and backports blocked, it's just a stopgap measure until someone else can take a look. 
Thanks for doing that [12:56:33] I'll be back in a minute or two, and will run the window as soon as my current scap is done [12:56:43] I need a small break x) [12:57:06] also, there is nothing to test for mine. They fix a disabled maintenance script which will be re-enabled in a follow-up window [12:57:07] have the break. :) [12:57:12] take your time :) [12:58:08] (03CR) 10Arnaudb: [C:03+2] gerrit: failover cookbook bugfix [cookbooks] - 10https://gerrit.wikimedia.org/r/1136709 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [12:58:21] (03Merged) 10jenkins-bot: log: notify user on IRC when awaiting input [software/spicerack] - 10https://gerrit.wikimedia.org/r/1125955 (owner: 10Volans) [12:58:55] (03Merged) 10jenkins-bot: tests: refactor logging related tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136608 (owner: 10Volans) [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1300). [13:00:04] robertsky, MichaelG_WMF, and Aca: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:41] Lucas_WMDE, Urbanecm, and TheresNoTime: I'll run that window as we're having registry issues. 
I have a scap completing, then we'll start [13:00:47] (03CR) 10Jelto: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136385 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [13:01:04] claime: ack, you'll run this window [13:01:47] !log jelto@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [13:02:01] !log jelto@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [13:02:07] !log jelto@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [13:02:15] !log jelto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [13:02:21] !log jelto@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [13:02:28] !log jelto@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [13:02:47] !log cgoubert@deploy1003 Finished scap sync-world: test rebuild to test swift eventual consistency (duration: 30m 09s) [13:02:53] (03CR) 10Arnaudb: [C:03+2] gerrit: failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1136385 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [13:03:08] ok robertsky starting with your patches [13:03:19] ok [13:04:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky) [13:04:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky) [13:04:10] (03PS1) 10Ayounsi: Add CPU/RAM/DISK [puppet] - 10https://gerrit.wikimedia.org/r/1136717 (https://phabricator.wikimedia.org/T388641) [13:04:49] o/ [13:04:57] (03Merged) 10jenkins-bot: updating wikimaniawiki namespace configurations: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131038 
(https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky) [13:04:59] (ack) [13:05:01] (03Merged) 10jenkins-bot: update wikimaniawiki perms configurations: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky) [13:05:21] * claime crosses fingers we can actualy deploy [13:05:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P75023 and previous config saved to /var/cache/conftool/dbconfig/20250415-130522-fceratto.json [13:05:29] !log cgoubert@deploy1003 Started scap sync-world: Backport for [[gerrit:1131038|updating wikimaniawiki namespace configurations: (T389729)]], [[gerrit:1131119|update wikimaniawiki perms configurations: (T389729)]] [13:05:32] T389729: wikimaniawiki: namespaces for 2027-2028 and other adjustments - https://phabricator.wikimedia.org/T389729 [13:06:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:06:27] (03PS1) 10Ssingh: wikimedia-dns.org: add HTTPS record (test) [dns] - 10https://gerrit.wikimedia.org/r/1136718 [13:07:01] ok pushes went through [13:07:10] sweet [13:07:18] it's now sleeping for 5 minutes for swift to catch up [13:07:26] (03PS2) 10Ayounsi: Add CPU/RAM/DISK [puppet] - 10https://gerrit.wikimedia.org/r/1136717 (https://phabricator.wikimedia.org/T388641) [13:07:26] (03CR) 10Ssingh: [C:03+2] wikimedia-dns.org: add HTTPS record (test) [dns] - 10https://gerrit.wikimedia.org/r/1136718 (owner: 10Ssingh) [13:07:29] (hence why the window's gonna be a little long) [13:07:33] (03PS2) 10Ssingh: wikimedia-dns.org: add HTTPS record (test) [dns] - 10https://gerrit.wikimedia.org/r/1136718 [13:07:34] ah. ok. 
[13:08:08] robertsky: yeah, we're having major issues with the registry, and that's the stopgap measure for being able to possibly deploy stuff [13:08:15] cf T390251 [13:08:15] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251 [13:08:34] (03CR) 10Ssingh: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1136718 (owner: 10Ssingh) [13:08:48] (03PS1) 10Arturo Borrero Gonzalez: openstack: networktests: refresh FQDN of the neutron virtual router [puppet] - 10https://gerrit.wikimedia.org/r/1136719 (https://phabricator.wikimedia.org/T380174) [13:09:24] claime: sorry to hear about the issues with the registry. Could you link me to the task? [13:09:26] (03Merged) 10jenkins-bot: gerrit: failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1136385 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [13:09:28] !log sukhe@dns1004 START - running authdns-update [13:09:34] MichaelG_WMF: T390251 [13:09:39] Thanks! [13:09:44] (03Merged) 10jenkins-bot: gerrit: failover cookbook bugfix [cookbooks] - 10https://gerrit.wikimedia.org/r/1136709 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [13:10:51] (03CR) 10Giuseppe Lavagetto: [C:03+1] deployment_server: stop shipping prometheus_nodes for k8s [puppet] - 10https://gerrit.wikimedia.org/r/1136605 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [13:11:54] !log sukhe@dns1004 END - running authdns-update [13:11:56] (03CR) 10Neriah: [C:03+1] testwiki: enable wgUseCodexSpecialBlock and wgEnableMultiBlocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136466 (https://phabricator.wikimedia.org/T377121) (owner: 10MusikAnimal) [13:11:57] claime: looks tough. hope it resolves soon. 
[13:12:21] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136717 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:12:28] robertsky: thanks :) [13:12:29] (03PS1) 10Ssingh: Revert "wikimedia-dns.org: add HTTPS record (test)" [dns] - 10https://gerrit.wikimedia.org/r/1136722 [13:12:59] (03PS3) 10Ayounsi: Add CPU/RAM/DISK [puppet] - 10https://gerrit.wikimedia.org/r/1136717 (https://phabricator.wikimedia.org/T388641) [13:13:32] (03CR) 10Ssingh: [C:03+2] Revert "wikimedia-dns.org: add HTTPS record (test)" [dns] - 10https://gerrit.wikimedia.org/r/1136722 (owner: 10Ssingh) [13:13:39] !log sukhe@dns1004 START - running authdns-update [13:14:18] !log slyngshede@cumin1002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Andy Cooper out of all services on: 2393 hosts [13:16:05] !log sukhe@dns1004 END - running authdns-update [13:17:23] !log cgoubert@deploy1003 cgoubert, robertsky: Backport for [[gerrit:1131038|updating wikimaniawiki namespace configurations: (T389729)]], [[gerrit:1131119|update wikimaniawiki perms configurations: (T389729)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:17:26] T389729: wikimaniawiki: namespaces for 2027-2028 and other adjustments - https://phabricator.wikimedia.org/T389729 [13:17:28] robertsky: please go ahead and test your patches with XWD [13:17:34] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610 [13:17:37] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [13:17:40] Not sure if this was the right place, but just hit ' [13:17:40] [cee30b80-6232-414c-b271-aaa8b4dfa616] 2025-04-15 13:15:55: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"' when going to Special:BlockList on ENWP [13:17:52] PROBLEM - PyBal backends health check on lvs2014 is 
CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers wikikube-ctrl2001.codfw.wmnet, wikikube-ctrl2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:18:39] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:18:41] (03PS2) 10Volans: hosts: add a new hosts module with a Host class [software/spicerack] - 10https://gerrit.wikimedia.org/r/1135763 [13:18:41] (03PS2) 10Volans: hosts: add a is_dns_propagated() method to Host [software/spicerack] - 10https://gerrit.wikimedia.org/r/1135764 [13:18:51] hnowlan: do you have a minute to check what's going on there ^ (wikikube-ctrl) [13:18:52] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:18:56] Ah, transient [13:18:58] we're good [13:19:01] sorry for the ping [13:19:05] :) [13:19:25] (03CR) 10FNegri: openstack: networktests: refresh FQDN of the neutron virtual router (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136719 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez) [13:19:33] claime: I think it was the reload for TLS certs [13:19:34] claime: hold on.. i got to apologise for this, how to i get onto debug server to test? (it's my first time attending the backport).. [13:19:38] I don't see horrors in the logs [13:19:44] claime: looking just to be sure [13:19:51] robertsky: do you have the X-Wikimedia-Debug extension installed? 
[13:19:53] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136717 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:19:57] yes [13:20:05] go to the wiki you want to test [13:20:07] (03CR) 10Volans: "Moved the non immutable accessors from @property to methods" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1135763 (owner: 10Volans) [13:20:16] turn it on [13:20:18] test [13:20:20] :D [13:20:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P75024 and previous config saved to /var/cache/conftool/dbconfig/20250415-132029-fceratto.json [13:21:01] (03PS12) 10Tiziano Fogli: prometheus/alerts: define alert rules directly in puppet [puppet] - 10https://gerrit.wikimedia.org/r/1101066 (https://phabricator.wikimedia.org/T381665) [13:21:09] (03CR) 10FNegri: openstack: networktests: refresh FQDN of the neutron virtual router (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136719 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez) [13:21:42] (03PS1) 10Jelto: miscweb: remove query-service from legacy vms [puppet] - 10https://gerrit.wikimedia.org/r/1136724 (https://phabricator.wikimedia.org/T350793) [13:22:15] checking [13:23:14] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2065 to cirrussearch2065 [13:23:31] !log bking@cumin2002 START - Cookbook sre.dns.netbox [13:24:11] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5301/co" [puppet] - 10https://gerrit.wikimedia.org/r/1136724 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [13:25:25] claime: lgtm. 
[13:25:30] cool proceeding [13:25:32] !log cgoubert@deploy1003 cgoubert, robertsky: Continuing with sync [13:26:10] MichaelG_WMF: I'll do your backports as they're for a disabled periodic job, but fwiw, it'd be better if they were +1'd before being scheduled for deployment [13:26:31] (03PS1) 10Brouberol: airflow: hotfix: only assign low resources to kubernetes pod operator pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136726 (https://phabricator.wikimedia.org/T391669) [13:26:56] Thank you. If you want, I can ask Amir1 about the backports? [13:27:21] They have my +1 [13:27:23] cool [13:27:27] (03PS2) 10Jelto: miscweb: remove query-service from legacy vms [puppet] - 10https://gerrit.wikimedia.org/r/1136724 (https://phabricator.wikimedia.org/T350793) [13:27:34] once you're done, I have a patch too [13:27:36] thanks both [13:27:45] Amir1: i know you do :P [13:27:47] and a backport [13:28:08] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2065 to cirrussearch2065 - bking@cumin2002" [13:28:22] Aca: you around? 
[13:28:30] ye ye [13:28:32] your patch isn't +1'd either [13:28:37] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2065 to cirrussearch2065 - bking@cumin2002" [13:28:37] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:28:38] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2065 [13:28:49] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2065 [13:28:53] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5302/co" [puppet] - 10https://gerrit.wikimedia.org/r/1136724 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [13:29:24] I could call a colleague to review it, but I think he's not around [13:29:31] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2065 to cirrussearch2065 [13:29:32] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2065.codfw.wmnet on all recursors [13:29:35] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2065.codfw.wmnet on all recursors [13:29:51] (03CR) 10CI reject: [V:04-1] hosts: add a is_dns_propagated() method to Host [software/spicerack] - 10https://gerrit.wikimedia.org/r/1135764 (owner: 10Volans) [13:29:57] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2065.codfw.wmnet with OS bullseye [13:30:09] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2065 [13:31:51] Aca: that would be best as it's adding things I'm not sure are standard for wiktionary [13:32:13] umm, elaborate? 
[13:32:51] So I don't have domain specific knowledge for this [13:33:37] (03CR) 10Tiziano Fogli: [C:03+2] netbox-hiera: adding pdu type [puppet] - 10https://gerrit.wikimedia.org/r/1128479 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [13:33:39] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:33:51] my bad got tripped buy order [13:33:52] Basic import source setup for wiktionary is: [13:33:53]  'wiktionary' => [ 'w', 'w:en', 'en', 'ar', 'es', 'fr', 'ru', 'zh', 'de', 'id', 'commons', 'meta', 'incubator' ], [13:33:53] This change just add "bs", per community consensus. The rest is just duplicated in order to prevent overwriting. [13:33:54] s/buy/by/ [13:33:58] yeah yeah [13:34:16] !log cgoubert@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131038|updating wikimaniawiki namespace configurations: (T389729)]], [[gerrit:1131119|update wikimaniawiki perms configurations: (T389729)]] (duration: 28m 46s) [13:34:19] T389729: wikimaniawiki: namespaces for 2027-2028 and other adjustments - https://phabricator.wikimedia.org/T389729 [13:34:20] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1002 is CRITICAL: 1.013e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [13:34:25] The order wasn't the same as the generic wiktionary entry, and that tripped my quick reading [13:34:36] yeah, I get itt [13:34:53] ok MichaelG_WMF I'll do your patches now [13:34:58] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - 
https://phabricator.wikimedia.org/T391903#10743696 (10Jclark-ctr) a:03Jclark-ctr @Eevans This server is out of Warranty We have used drives from recently Decom servers please advise when and if you would like to replace. [13:35:08] claime: Thank you! [13:35:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136701 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große) [13:35:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136702 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große) [13:35:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136703 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große) [13:35:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T391056)', diff saved to https://phabricator.wikimedia.org/P75025 and previous config saved to /var/cache/conftool/dbconfig/20250415-133536-fceratto.json [13:35:40] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [13:35:44] (03CR) 10Anzx: "looks ok" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136133 (https://phabricator.wikimedia.org/T391621) (owner: 10Acamicamacaraca) [13:35:52] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1191.eqiad.wmnet with reason: Maintenance [13:35:57] !log bking@cumin2002 START - Cookbook sre.dns.netbox [13:35:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T391056)', diff saved to https://phabricator.wikimedia.org/P75027 and previous config saved to 
/var/cache/conftool/dbconfig/20250415-133558-fceratto.json [13:36:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:37:01] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10743705 (10MatthewVernon) Looking at the Ceph metrics, it seems the packages were fewer larger objects, and the artifacts are more... [13:37:20] (03Merged) 10jenkins-bot: tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136701 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große) [13:37:27] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10743708 (10Jclark-ctr) @elukey thanks for downtime raid card has been installed. 
@MatthewVernon All yours to verify [13:37:30] (03Merged) 10jenkins-bot: perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136702 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große) [13:37:33] (03Merged) 10jenkins-bot: perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136703 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große) [13:38:02] !log cgoubert@deploy1003 Started scap sync-world: Backport for [[gerrit:1136701|tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136702|perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136703|perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider (T391695)]] [13:38:06] T391695: UncachedMenteeOverviewDataProvider query is extremely aggressive causing partial outages - https://phabricator.wikimedia.org/T391695 [13:38:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T391056)', diff saved to https://phabricator.wikimedia.org/P75028 and previous config saved to /var/cache/conftool/dbconfig/20250415-133807-fceratto.json [13:38:15] (03PS3) 10Volans: hosts: add a is_dns_propagated() method to Host [software/spicerack] - 10https://gerrit.wikimedia.org/r/1135764 [13:38:25] (03CR) 10Cathal Mooney: [C:03+1] "nice! will be great to have those stats." [puppet] - 10https://gerrit.wikimedia.org/r/1136717 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:38:54] (03CR) 10Elukey: "Hey Jesse! I tried with and without the patch, output in https://phabricator.wikimedia.org/P75026. 
For some reason it is very different, I" [puppet] - 10https://gerrit.wikimedia.org/r/1135115 (owner: 10JHathaway) [13:39:50] (03CR) 10Ayounsi: [C:03+2] Add CPU/RAM/DISK [puppet] - 10https://gerrit.wikimedia.org/r/1136717 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:40:18] jouncebot: now and next [13:40:18] For the next 0 hour(s) and 19 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1300) [13:40:24] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2065 - bking@cumin2002" [13:40:29] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2065 - bking@cumin2002" [13:40:29] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:40:30] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2065.codfw.wmnet 68.32.192.10.in-addr.arpa 8.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:40:33] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2065.codfw.wmnet 68.32.192.10.in-addr.arpa 8.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:40:34] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2065 [13:40:44] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2065 [13:40:44] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2065 [13:43:03] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10743739 
(10elukey) Thanks a lot! I see the new controller but also some errors while mounting swift partitions: ` [Tue Apr 15 13:41:... [13:44:32] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10743756 (10MatthewVernon) Currently puppet is failing on this host: ` mvernon@ms-be1091:~$ sudo run-puppet-agent Info: Using environme... [13:45:32] !log tappof@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add data::pdus to exports - tappof@cumin1002 - T387231" [13:45:34] !log tappof@cumin1002 END (ERROR) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=97) generate netbox hiera data: "add data::pdus to exports - tappof@cumin1002 - T387231" [13:45:35] T387231: missing pdu infos for magru - https://phabricator.wikimedia.org/T387231 [13:45:44] (03CR) 10Tiziano Fogli: [C:03+2] sre.puppet.sync-netbox-hiera: add data::pdus to exports [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [13:46:07] (03PS1) 10Elukey: role::ml_k8s::master: move to Bookworm and containerd [puppet] - 10https://gerrit.wikimedia.org/r/1136728 (https://phabricator.wikimedia.org/T387854) [13:46:12] (03CR) 10Edgar Allan Poe: [C:03+1] shwiktionary: Add bs as import source [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136133 (https://phabricator.wikimedia.org/T391621) (owner: 10Acamicamacaraca) [13:46:52] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10743765 (10MatthewVernon) @elukey that might help, yes, it looks like puppet finds the disks, but they've changed their path: ` swift_... 
[13:47:34] (03CR) 10Edgar Allan Poe: [C:03+1] "Looks good to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136133 (https://phabricator.wikimedia.org/T391621) (owner: 10Acamicamacaraca) [13:48:04] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10743769 (10MatthewVernon) (I don't know whether everything will Just Work with a reimage, or if some awful regexes will need adjusting) [13:48:08] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5303/" [puppet] - 10https://gerrit.wikimedia.org/r/1136728 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [13:48:38] (03PS2) 10Elukey: role::ml_k8s::master: move 1001 to Bookworm and containerd [puppet] - 10https://gerrit.wikimedia.org/r/1136728 (https://phabricator.wikimedia.org/T387854) [13:49:42] !log cgoubert@deploy1003 migr, cgoubert: Backport for [[gerrit:1136701|tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136702|perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136703|perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider (T391695)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:49:46] T391695: UncachedMenteeOverviewDataProvider query is extremely aggressive causing partial outages - https://phabricator.wikimedia.org/T391695 [13:49:57] !log cgoubert@deploy1003 migr, cgoubert: Continuing with sync [13:52:13] ty for getting +1 Aca [13:52:27] (03Merged) 10jenkins-bot: sre.puppet.sync-netbox-hiera: add data::pdus to exports [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [13:52:42] no problemm [13:53:14] !log fceratto@cumin1002 dbctl 
commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P75029 and previous config saved to /var/cache/conftool/dbconfig/20250415-135313-fceratto.json [13:54:13] (03PS1) 10Ssingh: [test commit] wikimedia-dns.org: add HTTPS records [dns] - 10https://gerrit.wikimedia.org/r/1136730 [13:55:12] !log tappof@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add data::pdus to exports - tappof@cumin1002 - T387231" [13:55:16] T387231: missing pdu infos for magru - https://phabricator.wikimedia.org/T387231 [13:55:36] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2065.codfw.wmnet with reason: host reimage [13:55:44] !log tappof@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add data::pdus to exports - tappof@cumin1002 - T387231" [13:56:29] !log cgoubert@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136701|tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136702|perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136703|perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider (T391695)]] (duration: 18m 27s) [13:56:32] T391695: UncachedMenteeOverviewDataProvider query is extremely aggressive causing partial outages - https://phabricator.wikimedia.org/T391695 [13:56:39] (03CR) 10Ssingh: [C:03+2] [test commit] wikimedia-dns.org: add HTTPS records [dns] - 10https://gerrit.wikimedia.org/r/1136730 (owner: 10Ssingh) [13:56:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136696 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große) [13:56:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" 
[extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136698 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große) [13:56:50] !log sukhe@dns1004 START - running authdns-update [13:56:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136700 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große) [13:57:37] (03PS1) 10Robertsky: fix wgAddGroup for wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136731 (https://phabricator.wikimedia.org/T389729) [13:58:49] (03Merged) 10jenkins-bot: tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136696 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große) [13:59:17] !log sukhe@dns1004 END - running authdns-update [13:59:26] (03Merged) 10jenkins-bot: perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136698 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große) [13:59:29] (03Merged) 10jenkins-bot: perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136700 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große) [13:59:49] (03PS1) 10Ssingh: Revert "[test commit] wikimedia-dns.org: add HTTPS records" [dns] - 10https://gerrit.wikimedia.org/r/1136732 [13:59:57] !log cgoubert@deploy1003 Started scap sync-world: Backport for [[gerrit:1136696|tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136698|perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider (T391695)]], 
[[gerrit:1136700|perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider (T391695)]] [14:00:54] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [14:01:21] (03CR) 10Ssingh: [C:03+2] Revert "[test commit] wikimedia-dns.org: add HTTPS records" [dns] - 10https://gerrit.wikimedia.org/r/1136732 (owner: 10Ssingh) [14:01:39] !log sukhe@dns1004 START - running authdns-update [14:02:26] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2065.codfw.wmnet with reason: host reimage [14:03:08] (03PS2) 10Robertsky: fix wgAddGroup for wikimaniawiki. No need for translateadmin to add xcon and xcon to add more xcon. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136731 (https://phabricator.wikimedia.org/T389729) [14:04:06] !log sukhe@dns1004 END - running authdns-update [14:04:47] FIRING: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:06:53] jouncebot: nowandnext [14:06:53] No deployments scheduled for the next 0 hour(s) and 53 minute(s) [14:06:53] In 0 hour(s) and 53 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1500) [14:07:10] we're running over a bit but I'll still finish up [14:07:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:07:57] !log bootstrapping Cassandra/restbase1044-c — T389423 [14:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:01] T389423: Refresh restbase10[28-30] w/ restbase104[3-5] - https://phabricator.wikimedia.org/T389423 [14:08:20] !log fceratto@cumin1002 
dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P75030 and previous config saved to /var/cache/conftool/dbconfig/20250415-140820-fceratto.json [14:08:42] (03PS1) 10Volans: doc: fine-tune settings for magic methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136733 [14:09:47] RESOLVED: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:11:50] !log cgoubert@deploy1003 migr, cgoubert: Backport for [[gerrit:1136696|tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136698|perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136700|perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider (T391695)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:11:53] T391695: UncachedMenteeOverviewDataProvider query is extremely aggressive causing partial outages - https://phabricator.wikimedia.org/T391695 [14:11:57] !log cgoubert@deploy1003 migr, cgoubert: Continuing with sync [14:12:08] (03PS46) 10Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) [14:13:39] FIRING: [2x] ProbeDown: Service restbase1044-c:7000 has failed probes (tcp_cassandra_c_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:14:32] (03CR) 10Brouberol: [C:03+2] airflow: hotfix: only assign low resources to kubernetes pod operator pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136726 (https://phabricator.wikimedia.org/T391669) (owner: 10Brouberol) [14:14:55] (03PS47) 10Tiziano Fogli: pdu_config_netbox: add new module to
grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) [14:15:33] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_esams and not P{cp3081.esams.wmnet} and A:cp [14:17:22] (03CR) 10Tiziano Fogli: [C:03+2] pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [14:18:03] (03PS3) 10Bking: cirrussearch: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610) [14:18:28] !log cgoubert@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136696|tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136698|perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136700|perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider (T391695)]] (duration: 18m 30s) [14:18:31] T391695: UncachedMenteeOverviewDataProvider query is extremely aggressive causing partial outages - https://phabricator.wikimedia.org/T391695 [14:18:38] ok Aca moving on to your patch [14:18:39] FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:18:43] ack [14:18:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136133 (https://phabricator.wikimedia.org/T391621) (owner: 10Acamicamacaraca) [14:19:17] MichaelG_WMF: your patch is fully backported, so the updated script should be up on mwmaint [14:19:18] (03CR) 10Elukey: [C:03+1] 
doc: fine-tune settings for magic methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136733 (owner: 10Volans) [14:19:28] Maybe don't run it rn tho :P [14:19:40] claime: thank you! [14:19:46] (03Merged) 10jenkins-bot: shwiktionary: Add bs as import source [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136133 (https://phabricator.wikimedia.org/T391621) (owner: 10Acamicamacaraca) [14:19:59] (03CR) 10Kamila Součková: [C:04-1] "typo in team label" [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: 10Clément Goubert) [14:20:08] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:20:11] (03CR) 10Elukey: [C:03+1] hosts: add a new hosts module with a Host class [software/spicerack] - 10https://gerrit.wikimedia.org/r/1135763 (owner: 10Volans) [14:20:14] !log cgoubert@deploy1003 Started scap sync-world: Backport for [[gerrit:1136133|shwiktionary: Add bs as import source (T391621)]] [14:20:19] T391621: shwiktionary: Add bs as import source - https://phabricator.wikimedia.org/T391621 [14:20:28] yes, the plan is to re-enable it maybe tomorrow when we can prepare it and watch for fallout [14:20:29] (03CR) 10Herron: [C:03+1] logstash: restore forcemerge in curator [puppet] - 10https://gerrit.wikimedia.org/r/1136713 (https://phabricator.wikimedia.org/T391661) (owner: 10Filippo Giunchedi) [14:20:48] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:20:59] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_esams and not P{cp3073.esams.wmnet} and A:cp [14:21:03] (03CR) 10Volans: [C:03+2] doc: fine-tune settings for magic methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136733 (owner: 10Volans) [14:21:07] claime: `Maybe don't run it rn tho :P` Are you mainly concerned about the registry issue or 
something else? [14:21:46] FIRING: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:21:56] (03CR) 10Clément Goubert: CampaignEvents: Migrate updateutcts-test2wiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: 10Clément Goubert) [14:22:39] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2065.codfw.wmnet with OS bullseye [14:22:41] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610 [14:22:45] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [14:22:46] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10743933 (10MatthewVernon) There's an LVM layer here too, isn't there? It's a software-RAID-1 of sda2 and sdb2... 
[14:23:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T391056)', diff saved to https://phabricator.wikimedia.org/P75031 and previous config saved to /var/cache/conftool/dbconfig/20250415-142327-fceratto.json [14:23:30] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [14:23:43] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1194.eqiad.wmnet with reason: Maintenance [14:23:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T391056)', diff saved to https://phabricator.wikimedia.org/P75032 and previous config saved to /var/cache/conftool/dbconfig/20250415-142349-fceratto.json [14:24:19] (03PS1) 10Kamila Součková: CampaignEvents: Migrate aggregateparticipantanswers-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1136734 (https://phabricator.wikimedia.org/T385867) [14:24:39] claime: I'd like to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136605 please LMK when good to do so [14:24:58] godog: Amir1 asked first, so see with him :P [14:25:37] !log rolling upgrade to varnish 7.1.1-1.1~bpo11+wmf3 in eqiad - T391334 [14:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:42] T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334 [14:25:42] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_eqiad and A:cp [14:25:57] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_eqiad and A:cp [14:25:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T391056)', diff saved to https://phabricator.wikimedia.org/P75033 and previous config saved to /var/cache/conftool/dbconfig/20250415-142558-fceratto.json [14:26:20] claime: lolz [14:26:29] claime: I'll stand in line [14:26:35] (03PS1) 
10Brouberol: airflow/hotfix: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136735 [14:26:42] (03CR) 10Tiziano Fogli: [C:03+1] statsd: remove ferm rule for statsd port 8125 [puppet] - 10https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380) (owner: 10Cwhite) [14:28:55] (03CR) 10Hashar: "That is for the odd use case when I am running `bundle exec rspec` from my local machine (Debian Bookworm) which comes with Ruby 3.1." [puppet] - 10https://gerrit.wikimedia.org/r/1136403 (owner: 10Hashar) [14:29:47] awesome [14:29:49] (03PS4) 10Clément Goubert: CampaignEvents: Migrate updateutcts-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) [14:29:59] (03CR) 10Brouberol: [C:03+2] airflow/hotfix: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136735 (owner: 10Brouberol) [14:30:07] (03PS1) 10Ladsgroup: Revert^2 "Bump thumbnail steps to 95%" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136737 [14:30:15] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: 10Clément Goubert) [14:30:41] Amir1: I still have a deploy in flight, btw, so wait a bit [14:30:54] ah okay [14:31:25] (03CR) 10Arturo Borrero Gonzalez: openstack: networktests: refresh FQDN of the neutron virtual router (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136719 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez) [14:31:32] !log cgoubert@deploy1003 aleksandar, cgoubert: Backport for [[gerrit:1136133|shwiktionary: Add bs as import source (T391621)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:31:35] Aca: /39 [14:31:36] T391621: shwiktionary: Add bs as import source - https://phabricator.wikimedia.org/T391621 [14:31:37] sorry [14:31:40] !log vgutierrez@cumin1002 END (PASS) - Cookbook 
sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_drmrs [14:31:45] MichaelG_WMF: testing [14:31:46] RESOLVED: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:31:48] (03Merged) 10jenkins-bot: doc: fine-tune settings for magic methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136733 (owner: 10Volans) [14:31:50] oops [14:31:52] wrong ping [14:31:52] haha [14:32:01] (03PS3) 10Robertsky: fix wgAddGroup for wikimaniawiki. No need for translateadmin to add xcon and xcon to add more xcon. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136731 (https://phabricator.wikimedia.org/T389729) [14:32:03] o.O [14:32:06] fails all around :D [14:32:47] (03CR) 10Brouberol: [C:03+2] wikistatsv1: remove htdocs/v2 link from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136640 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [14:32:53] works as expected [14:32:56] lgtm [14:32:59] cool, proceeding [14:33:04] !log cgoubert@deploy1003 aleksandar, cgoubert: Continuing with sync [14:33:10] (03CR) 10Brouberol: [C:03+2] wikistatsv2: move all content under /srv/stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1136641 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [14:33:38] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2066-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:33:43] FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:34:00] (03CR) 10Volans: [C:03+2] hosts: add a new hosts module with a Host class [software/spicerack] - 10https://gerrit.wikimedia.org/r/1135763 (owner: 10Volans) [14:34:08] (03PS5) 10Bking: sre.elasticsearch.rolling-operation: use tftp, run puppet with new hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/1135826 (https://phabricator.wikimedia.org/T383811) (owner: 10Ryan Kemper) [14:34:11] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:34:15] Hey - there was a bad security patch from yesterday that went out to wmf.24 for a few minutes and was then reverted/redeployed. But it looks like it made it back onto wmf.24 (train?) and is causing prod errors now: https://phabricator.wikimedia.org/T391969 [14:34:29] (03PS3) 10Brouberol: wikistatsv1: remove htdocs/v2 link from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136640 (https://phabricator.wikimedia.org/T389107) [14:34:29] (03PS3) 10Brouberol: wikistatsv2: move all content under /srv/stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1136641 (https://phabricator.wikimedia.org/T389107) [14:34:29] (03PS3) 10Brouberol: wikistatsv2: remove assets from htdocs [puppet] - 10https://gerrit.wikimedia.org/r/1136642 (https://phabricator.wikimedia.org/T389107) [14:34:30] (03PS3) 10Brouberol: wikistatsv2: remove htdocs [puppet] - 10https://gerrit.wikimedia.org/r/1136643 (https://phabricator.wikimedia.org/T389107) [14:34:31] (03PS3) 10Brouberol: wikistatsv1: remove old resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136644 (https://phabricator.wikimedia.org/T389107) [14:35:01] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53799 bytes in 0.090 second response 
time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:35:05] I’ve removed the patch from /srv/patches/1.44.0-wmf.24 now and it never made it to 1.44.0-wmf.25. We’ll need to redeploy core:includes/specials/pagers/BlockListPager.php as soon as we can. [14:35:51] (03CR) 10Brouberol: [C:03+2] wikistatsv1: remove htdocs/v2 link from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136640 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [14:36:03] (03CR) 10Brouberol: [C:03+2] wikistatsv2: move all content under /srv/stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1136641 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [14:36:27] sbassett: shoot [14:36:47] ok that takes precedence on Amir1 backport [14:37:09] sbassett: I have no experience on deploying security patches, can you handle it once the current scap is done? [14:37:57] I’m ready to deploy the fix if I can [14:38:00] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_drmrs [14:38:05] If there’s no scap lock rn [14:38:19] I think this got accidentally reapplied via scap backport :/ [14:38:38] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2066-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:38:54] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610 [14:38:57] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [14:39:09] !log bking@cumin2002 END (ERROR) - 
Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610 [14:39:10] possibly during my tests this morning, I'm sorry [14:39:28] i'll ping you as soon as the current deploy is done [14:39:33] couple minutes max [14:39:42] !log cgoubert@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136133|shwiktionary: Add bs as import source (T391621)]] (duration: 19m 28s) [14:39:44] sbassett: go [14:39:45] T391621: shwiktionary: Add bs as import source - https://phabricator.wikimedia.org/T391621 [14:40:01] claime: running [14:40:12] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610 [14:40:32] (03CR) 10Brouberol: [C:03+2] wikistatsv2: remove assets from htdocs [puppet] - 10https://gerrit.wikimedia.org/r/1136642 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [14:41:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P75034 and previous config saved to /var/cache/conftool/dbconfig/20250415-144106-fceratto.json [14:42:46] (03PS2) 10Kamila Součková: CampaignEvents: Migrate aggregateparticipantanswers-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1136734 (https://phabricator.wikimedia.org/T385867) [14:42:47] FIRING: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:42:47] thankies for the deploy :) [14:43:07] (03Merged) 10jenkins-bot: hosts: add a new hosts module with a Host class [software/spicerack] - 10https://gerrit.wikimedia.org/r/1135763 (owner: 10Volans) [14:43:11] Aca: np [14:43:15] ty for the patch [14:44:11] (03CR) 10Kamila Součková: [C:03+1] CampaignEvents: Migrate 
updateutcts-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: 10Clément Goubert) [14:44:51] (03CR) 10Clément Goubert: CampaignEvents: Migrate aggregateparticipantanswers-test2wiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136734 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková) [14:45:55] (03PS4) 10Robertsky: wikimaniawiki: fix add/remove groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136731 (https://phabricator.wikimedia.org/T389729) [14:48:04] (03CR) 10Kamila Součková: CampaignEvents: Migrate aggregateparticipantanswers-test2wiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136734 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková) [14:48:07] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136734 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková) [14:49:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136731 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky) [14:50:01] (03CR) 10Chlod Alejandro: [C:03+1] "LGTM!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136731 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky) [14:51:01] PROBLEM - Hadoop NodeManager on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:52:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:52:38] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:52:47] RESOLVED: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:54:18] prod k8s 40% done, error rates seem to be declining in logstash [14:55:31] sbassett: cool thanks [14:56:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P75035 and previous config saved to /var/cache/conftool/dbconfig/20250415-145613-fceratto.json [14:57:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [14:57:33] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [14:57:45] !log Undeployed security patch for T391343 (reapplied during recent scap backport, patch now removed from deployment hosts) [14:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:17] PROBLEM - Hadoop NodeManager on an-worker1206 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:00:05] jelto, arnoldokoth, and mutante: Time to do the SRE Collaboration 
Services office hours deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1500). [15:00:47] FIRING: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:03:49] PROBLEM - Hadoop NodeManager on an-worker1067 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:04:40] claime: should be all good now [15:06:39] PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:09] (03PS1) 10Ryan Kemper: wdqs-internal: remove disc records [dns] - 10https://gerrit.wikimedia.org/r/1136740 (https://phabricator.wikimedia.org/T376151) [15:07:45] (03CR) 10CI reject: [V:04-1] wdqs-internal: remove disc records [dns] - 10https://gerrit.wikimedia.org/r/1136740 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper) [15:07:49] RECOVERY - Hadoop NodeManager on an-worker1067 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:08:39] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:08:53] PROBLEM - Hadoop NodeManager on an-worker1137 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:09:21] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users for kcoleman - https://phabricator.wikimedia.org/T391861#10744161 (10MatthewVernon) [15:09:27] (03CR) 10Ahmon Dancy: [C:03+1] Allow releng to resume train related systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/1130947 (https://phabricator.wikimedia.org/T387823) (owner: 10Hashar) [15:10:01] RECOVERY - Hadoop NodeManager on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:10:27] (03PS1) 10Lucas Werkmeister (WMDE): statistics::wmde: Configure Prometheus Pushgateway [puppet] - 10https://gerrit.wikimedia.org/r/1136741 (https://phabricator.wikimedia.org/T389344) [15:10:47] RESOLVED: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:11:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T391056)', diff saved to https://phabricator.wikimedia.org/P75036 and previous config saved to /var/cache/conftool/dbconfig/20250415-151121-fceratto.json [15:11:25] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [15:11:37] !log fceratto@cumin1002 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on db1202.eqiad.wmnet with reason: Maintenance [15:11:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T391056)', diff saved to https://phabricator.wikimedia.org/P75037 and previous config saved to /var/cache/conftool/dbconfig/20250415-151144-fceratto.json [15:13:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T391056)', diff saved to https://phabricator.wikimedia.org/P75038 and previous config saved to /var/cache/conftool/dbconfig/20250415-151354-fceratto.json [15:16:07] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2059 to cirrussearch2059 [15:16:19] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:16:24] !log dzahn@deploy1003 Started deploy [releng/jenkins-deploy@c274545] (releasing): T391590 [15:16:29] T391590: PuppetFailure - releases2003 - https://phabricator.wikimedia.org/T391590 [15:17:08] !log dzahn@deploy1003 Finished deploy [releng/jenkins-deploy@c274545] (releasing): T391590 (duration: 01m 14s) [15:18:39] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:19:43] PROBLEM - Hadoop NodeManager on an-worker1133 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:20:29] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users for kcoleman - https://phabricator.wikimedia.org/T391861#10744224 (10MatthewVernon) [15:22:01] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Requesting 
access to analytics-privatedata-users for kcoleman - https://phabricator.wikimedia.org/T391861#10744247 (10MatthewVernon) @RHo can you approve this request, please? Once that's done, this request can proceed. [15:22:12] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2059 to cirrussearch2059 - bking@cumin2002" [15:22:29] claime: since scott is done, shall I deploy? [15:22:49] Amir1: +1, I don't think there is anything else pending [15:23:01] PROBLEM - Hadoop NodeManager on an-worker1160 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:23:02] the sleep fix is live so usable [15:23:11] (03CR) 10Ladsgroup: [C:03+2] Revert^2 "Bump thumbnail steps to 95%" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136737 (owner: 10Ladsgroup) [15:23:39] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:23:59] (03Merged) 10jenkins-bot: Revert^2 "Bump thumbnail steps to 95%" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136737 (owner: 10Ladsgroup) [15:24:01] (03CR) 10Scott French: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1136413 (https://phabricator.wikimedia.org/T380485) (owner: 10Scott French) [15:24:03] (03CR) 10Scott French: [C:03+2] hieradata: switch parsoidtest1001 to PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1136413 (https://phabricator.wikimedia.org/T380485) (owner: 10Scott French) [15:24:17] RECOVERY - Hadoop NodeManager on an-worker1206 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:24:45] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1136737|Revert^2 "Bump thumbnail steps to 95%"]] [15:25:11] PROBLEM - Hadoop NodeManager on an-worker1085 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:26:22] (03PS15) 10Elukey: services: enable ingress for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 [15:26:28] (03PS1) 10Ryan Kemper: wdqs-internal: move back to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1136744 (https://phabricator.wikimedia.org/T376151) [15:26:44] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users for Lena Meintrup - https://phabricator.wikimedia.org/T391820#10744259 (10MatthewVernon) [15:29:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P75041 and previous config saved to /var/cache/conftool/dbconfig/20250415-152901-fceratto.json [15:29:02] (03PS1) 10Herron: alertmanager: update irc template for pyrra slo alerts [puppet] - 10https://gerrit.wikimedia.org/r/1136745 (https://phabricator.wikimedia.org/T391925) [15:29:03] FIRING: [3x] 
CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [15:29:15] (03PS2) 10Herron: alertmanager: update irc template for pyrra slo alerts [puppet] - 10https://gerrit.wikimedia.org/r/1136745 (https://phabricator.wikimedia.org/T391925) [15:31:14] 06SRE, 10DNS, 10Wikimedia-Apache-configuration: Unconfigured subdomains of wikimedia.org should display an error page rather than the wikimedia.org homepage - https://phabricator.wikimedia.org/T391016#10744291 (10Joe) 05Open→03Declined This was never the behaviour of our servers, as far back as I can... [15:31:39] RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:32:14] (03PS16) 10Elukey: services: enable ingress for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 [15:32:19] (03PS1) 10MVernon: admin: add lmeintrup to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1136746 (https://phabricator.wikimedia.org/T391820) [15:32:57] (03CR) 10CI reject: [V:04-1] admin: add lmeintrup to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1136746 (https://phabricator.wikimedia.org/T391820) (owner: 10MVernon) [15:33:01] RECOVERY - Hadoop NodeManager on an-worker1160 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:34:14] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - 
https://phabricator.wikimedia.org/T355914#10744312 (10TheDJ) >>! In T355914#10742000, @MikhailRyazanov wrote: > By the way, are there any reasons, besides historical, to specify image sizes in “pi... [15:34:21] (03PS2) 10MVernon: admin: add lmeintrup to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1136746 (https://phabricator.wikimedia.org/T391820) [15:34:43] RECOVERY - Hadoop NodeManager on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:34:49] (03PS1) 10Ryan Kemper: wdqs-internal: remove from LBs and backend servers [puppet] - 10https://gerrit.wikimedia.org/r/1136747 (https://phabricator.wikimedia.org/T376151) [15:35:00] (03CR) 10CI reject: [V:04-1] admin: add lmeintrup to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1136746 (https://phabricator.wikimedia.org/T391820) (owner: 10MVernon) [15:35:17] (03PS1) 10Alexandros Kosiaris: mw-wikifunctions: Remove the main release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136748 [15:35:19] (03PS1) 10Alexandros Kosiaris: scap: Stop updating main mw-wikifunctions release [puppet] - 10https://gerrit.wikimedia.org/r/1136749 [15:35:29] (03CR) 10Brouberol: [C:03+2] wikistatsv2: remove htdocs [puppet] - 10https://gerrit.wikimedia.org/r/1136643 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [15:35:47] (03PS3) 10MVernon: admin: add lmeintrup to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1136746 (https://phabricator.wikimedia.org/T391820) [15:35:53] RECOVERY - Hadoop NodeManager on an-worker1137 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:36:11] RECOVERY - Hadoop 
NodeManager on an-worker1085 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:36:39] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1136737|Revert^2 "Bump thumbnail steps to 95%"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:17] PROBLEM - Hadoop NodeManager on an-worker1204 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:39:03] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [15:39:36] (03CR) 10Brouberol: [C:03+2] wikistatsv1: remove old resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136644 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [15:39:47] (03PS4) 10Brouberol: wikistatsv1: remove old resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136644 (https://phabricator.wikimedia.org/T389107) [15:40:19] (03CR) 10Vgutierrez: [C:04-1] "I think this won't work as expected since haproxykafka gets the hostname from its configuration: https://gitlab.wikimedia.org/repos/sre/ha" [puppet] - 10https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571) (owner: 10Fabfur) [15:40:48] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2059 to cirrussearch2059 - bking@cumin2002" [15:40:48] !log 
bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:40:49] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2059 [15:41:16] (03CR) 10Brouberol: [C:03+2] wikistatsv1: remove old resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136644 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol) [15:41:19] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2059 [15:41:59] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2059 to cirrussearch2059 [15:42:00] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2059.codfw.wmnet on all recursors [15:42:03] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2059.codfw.wmnet on all recursors [15:42:25] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2059.codfw.wmnet with OS bullseye [15:42:37] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2059 [15:42:57] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:44:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P75042 and previous config saved to /var/cache/conftool/dbconfig/20250415-154407-fceratto.json [15:44:22] (03CR) 10Ssingh: [C:03+1] admin: add lmeintrup to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1136746 (https://phabricator.wikimedia.org/T391820) (owner: 10MVernon) [15:44:40] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10744354 (10phaultfinder) [15:45:48] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136737|Revert^2 "Bump thumbnail steps to 95%"]] (duration: 21m 02s) [15:47:07] !log bking@cumin2002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2059 - bking@cumin2002" [15:47:13] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2059 - bking@cumin2002" [15:47:13] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:47:13] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2059.codfw.wmnet 5.32.192.10.in-addr.arpa 5.0.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:47:17] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2059.codfw.wmnet 5.32.192.10.in-addr.arpa 5.0.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:47:18] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2059 [15:47:42] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2059 [15:47:42] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2059 [15:48:39] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:48:47] FIRING: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:49:08] (03CR) 10MVernon: [C:03+2] admin: add lmeintrup to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1136746 (https://phabricator.wikimedia.org/T391820) (owner: 10MVernon) [15:50:39] (03PS2) 10Ryan Kemper: wdqs-internal: move back to 
lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1136744 (https://phabricator.wikimedia.org/T376151) [15:50:39] (03PS2) 10Ryan Kemper: wdqs-internal: remove from LBs and backend servers [puppet] - 10https://gerrit.wikimedia.org/r/1136747 (https://phabricator.wikimedia.org/T376151) [15:51:02] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10744402 (10Jclark-ctr) @fnegri i had looked at this briefly kinda looks like might be a bad intake sensor and might not be over heating comparin... [15:51:35] (03PS3) 10Ryan Kemper: wdqs-internal: move back to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1136744 (https://phabricator.wikimedia.org/T376151) [15:51:35] (03PS3) 10Ryan Kemper: wdqs-internal: remove from LBs and backend servers [puppet] - 10https://gerrit.wikimedia.org/r/1136747 (https://phabricator.wikimedia.org/T376151) [15:52:59] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Lena Meintrup - https://phabricator.wikimedia.org/T391820#10744409 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon @Lena_WMDE this is done now... [15:54:07] is there a train deploy next tuesday, despite it being a global WMF holiday? 
[15:54:49] (03PS2) 10Ryan Kemper: wdqs-internal: remove disc records [dns] - 10https://gerrit.wikimedia.org/r/1136740 (https://phabricator.wikimedia.org/T376151) [15:55:17] RECOVERY - Hadoop NodeManager on an-worker1204 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:57:36] !log tappof@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add data::pdus to exports - tappof@cumin1002 - T387231" [15:57:41] T387231: missing pdu infos for magru - https://phabricator.wikimedia.org/T387231 [15:58:01] !log tappof@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add data::pdus to exports - tappof@cumin1002 - T387231" [15:58:47] RESOLVED: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:59:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T391056)', diff saved to https://phabricator.wikimedia.org/P75043 and previous config saved to /var/cache/conftool/dbconfig/20250415-155914-fceratto.json [15:59:21] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [15:59:28] cscott: normally this would be at https://wikitech.wikimedia.org/wiki/Deployments/Yearly_calendar but that is not very helpful for that atm [15:59:32] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1227.eqiad.wmnet with reason: Maintenance [15:59:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T391056)', diff saved to https://phabricator.wikimedia.org/P75044 and previous config saved to /var/cache/conftool/dbconfig/20250415-155939-fceratto.json [16:00:03] taavi: the google 
calendar also lists a deploy on the 22nd [16:00:05] jhathaway and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1600). [16:00:05] Lucas_WMDE: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:09] o/ [16:00:48] taavi: i'm assuming the group0 deploy will get shifted to wednesday, since there won't be anyone around to fix any problems with group0 if they arise? [16:00:50] (03CR) 10FNegri: [C:03+1] openstack: networktests: refresh FQDN of the neutron virtual router (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136719 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez) [16:00:54] o/ [16:01:01] (sorry, my bouncer hung for a few seconds) [16:01:36] jeez how dare you show up late to the puppet window, a thing I would never do in my entire life :D [16:01:40] (03CR) 10RLazarus: [C:03+2] statistics::wmde: Configure Prometheus Pushgateway [puppet] - 10https://gerrit.wikimedia.org/r/1136741 (https://phabricator.wikimedia.org/T389344) (owner: 10Lucas Werkmeister (WMDE)) [16:01:42] :P [16:02:01] will you want a manual run on stat1011 to test? 
[16:03:12] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2059.codfw.wmnet with reason: host reimage [16:03:30] rzl: the only testing I could do would be to check that the line shows up in the config file [16:03:40] oh okay [16:03:41] beyond that, the required code isn’t quite ready yet [16:03:55] (I’ll test later if pushing to that Prometheus Pushgateway thingy works, probably tomorrow or so) [16:04:05] nod, makes sense [16:04:21] I'll run puppet anyway even though I'm pretty sure it mathematically can't fail on that patch, and then we can call it a day [16:04:26] sure [16:05:31] (03CR) 10Federico Ceratto: [C:03+1] upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [16:05:37] (03CR) 10Federico Ceratto: [C:03+2] upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [16:06:08] (03PS1) 10Mhorsey: Release campaignEvents extension to azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136754 (https://phabricator.wikimedia.org/T390805) [16:06:30] rzl: saying that something absolutely cannot fail is in my experience a very good way to make something fail [16:06:33] Lucas_WMDE: done, and I do see the diff in the puppet output, so you should be all good [16:06:44] rzl: and I see the lines in sudo -u analytics-wmde cat /srv/analytics-wmde/graphite/src/config \o/ [16:06:47] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2059.codfw.wmnet with reason: host reimage [16:06:56] thanks! 
[16:06:56] (03CR) 10CI reject: [V:04-1] Release campaignEvents extension to azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136754 (https://phabricator.wikimedia.org/T390805) (owner: 10Mhorsey) [16:07:05] taavi: haha it's a template patch with no control characters, I defy the universe to break it just to teach me a lesson [16:07:34] (it didn't tho) [16:10:34] (03PS1) 10Andrew Bogott: wmcs-dnsleaks: slow down --doublecheck [puppet] - 10https://gerrit.wikimedia.org/r/1136755 [16:11:34] (03CR) 10Andrew Bogott: [C:03+2] wmcs-dnsleaks: slow down --doublecheck [puppet] - 10https://gerrit.wikimedia.org/r/1136755 (owner: 10Andrew Bogott) [16:12:47] (03PS1) 10Ryan Kemper: wdqs-internal: remove service catalog entry [puppet] - 10https://gerrit.wikimedia.org/r/1136756 (https://phabricator.wikimedia.org/T376151) [16:12:49] (03Merged) 10jenkins-bot: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [16:12:49] (03PS1) 10Ryan Kemper: wdqs-internal: rip out remaining logic/config [puppet] - 10https://gerrit.wikimedia.org/r/1136757 (https://phabricator.wikimedia.org/T376151) [16:13:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T391056)', diff saved to https://phabricator.wikimedia.org/P75046 and previous config saved to /var/cache/conftool/dbconfig/20250415-161335-fceratto.json [16:13:39] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [16:16:27] PROBLEM - Hadoop NodeManager on an-worker1134 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:17:27] RECOVERY - Hadoop NodeManager on an-worker1134 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:18:21] (03PS2) 10Mhorsey: Release campaignEvents extension to azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136754 (https://phabricator.wikimedia.org/T390805) [16:18:39] FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:20:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136754 (https://phabricator.wikimedia.org/T390805) (owner: 10Mhorsey) [16:20:40] 06SRE, 06Infrastructure-Foundations, 10netops: Create alerting for saturation on sub-rated interfaces - https://phabricator.wikimedia.org/T374614#10744550 (10cmooney) >>! In T374614#10707267, @cmooney wrote: >>>! In T374614#10147994, @ayounsi wrote: >> Short term I think if you add `[4Gbps]` to the interface... 
[16:23:41] (03PS3) 10Kamila Součková: CampaignEvents: Migrate aggregateparticipantanswers-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1136734 (https://phabricator.wikimedia.org/T385867) [16:27:27] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2059.codfw.wmnet with OS bullseye [16:28:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P75047 and previous config saved to /var/cache/conftool/dbconfig/20250415-162842-fceratto.json [16:30:21] (03PS1) 10Fabfur: cache: allow logging of x-cache-status also for silent-dropped reqs [puppet] - 10https://gerrit.wikimedia.org/r/1136761 (https://phabricator.wikimedia.org/T391967) [16:36:50] FIRING: KubernetesCalicoDown: ml-serve2007.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2007.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:38:04] (03CR) 10Vgutierrez: [C:04-1] "this currently breaks logging of x-cache-status for 301s responses generated by the `http` frontend" [puppet] - 10https://gerrit.wikimedia.org/r/1136761 (https://phabricator.wikimedia.org/T391967) (owner: 10Fabfur) [16:41:52] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users for kcoleman - https://phabricator.wikimedia.org/T391861#10744645 (10RHo) >>! In T391861#10744224, @MatthewVernon wrote: > @RHo can you approve this request, please? Once that's done,... 
[16:42:10] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2098 to cirrussearch2098 [16:42:32] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:42:43] (03PS1) 10Ssingh: wikimedia-dns.org: check: add HTTPS record (TTL to increase later) [dns] - 10https://gerrit.wikimedia.org/r/1136764 [16:43:31] (03CR) 10Ssingh: [C:03+2] wikimedia-dns.org: check: add HTTPS record (TTL to increase later) [dns] - 10https://gerrit.wikimedia.org/r/1136764 (owner: 10Ssingh) [16:43:50] !log sukhe@dns1004 START - running authdns-update [16:43:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P75048 and previous config saved to /var/cache/conftool/dbconfig/20250415-164350-fceratto.json [16:45:15] !log bking@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:45:22] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=99) from elastic2098 to cirrussearch2098 [16:45:23] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2098.codfw.wmnet on all recursors [16:45:26] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2098.codfw.wmnet on all recursors [16:45:52] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2102.codfw.wmnet on all recursors [16:45:56] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2102.codfw.wmnet on all recursors [16:46:02] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610 [16:46:05] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [16:46:22] !log sukhe@dns1004 END - running authdns-update [16:46:43] (03CR) 10Bking: [C:03+2] sre.elasticsearch.rolling-operation: use tftp, run puppet with new hostname 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1135826 (https://phabricator.wikimedia.org/T383811) (owner: 10Ryan Kemper) [16:48:13] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610 [16:50:53] (03PS1) 10Dzahn: jenkins: ensure systemd service dir exists before override [puppet] - 10https://gerrit.wikimedia.org/r/1136765 (https://phabricator.wikimedia.org/T384595) [16:52:34] (03CR) 10Dzahn: [C:04-1] "Thanks! I tried this but ran into errors. Pasted it here: https://phabricator.wikimedia.org/T391590#10744225" [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [16:54:36] (03CR) 10Hnowlan: [C:03+1] CampaignEvents: Migrate updateutcts-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: 10Clément Goubert) [16:55:19] (03CR) 10Dzahn: [C:03+1] "Reverting seems still easy enough if needed, afaict! 
lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1136724 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [16:55:21] (03CR) 10Hnowlan: [C:03+1] CampaignEvents: Migrate aggregateparticipantanswers-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1136734 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková) [16:58:57] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610 [16:58:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T391056)', diff saved to https://phabricator.wikimedia.org/P75049 and previous config saved to /var/cache/conftool/dbconfig/20250415-165859-fceratto.json [16:59:00] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [16:59:04] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [16:59:06] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610 [16:59:15] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1253.eqiad.wmnet with reason: Maintenance [16:59:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1253 (T391056)', diff saved to https://phabricator.wikimedia.org/P75050 and previous config saved to /var/cache/conftool/dbconfig/20250415-165922-fceratto.json [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1700) [17:01:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T391056)', diff saved to https://phabricator.wikimedia.org/P75051 and previous config saved to 
/var/cache/conftool/dbconfig/20250415-170132-fceratto.json [17:03:04] (03CR) 10Dzahn: [C:04-2] "replacing this approach with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136765" [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [17:08:52] (03PS1) 10Ssingh: Revert "wikimedia-dns.org: check: add HTTPS record (TTL to increase later)" [dns] - 10https://gerrit.wikimedia.org/r/1136768 [17:09:07] (03CR) 10Ssingh: "This works but reverting till we actually finish other deployment." [dns] - 10https://gerrit.wikimedia.org/r/1136768 (owner: 10Ssingh) [17:10:21] (03PS4) 10Hnowlan: mw::maintenance::growthexperiments: migrate updateMetrics job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135916 (https://phabricator.wikimedia.org/T385782) [17:10:21] (03PS1) 10Hnowlan: mw::maintenance: migrate deleteExpiredUserImpactData to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1136770 (https://phabricator.wikimedia.org/T385782) [17:11:32] (03CR) 10Ssingh: [C:03+2] Revert "wikimedia-dns.org: check: add HTTPS record (TTL to increase later)" [dns] - 10https://gerrit.wikimedia.org/r/1136768 (owner: 10Ssingh) [17:11:41] !log sukhe@dns1004 START - running authdns-update [17:13:39] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:14:08] !log sukhe@dns1004 END - running authdns-update [17:16:06] (03CR) 10JHathaway: [C:03+1] Gemfile: update rspec-puppet to 2.10.x [puppet] - 10https://gerrit.wikimedia.org/r/1136403 (owner: 10Hashar) [17:16:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P75052 and previous config saved to /var/cache/conftool/dbconfig/20250415-171639-fceratto.json 
[17:23:22] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users for kcoleman - https://phabricator.wikimedia.org/T391861#10744752 (10Ahoelzl) Approved from DPE DE. [17:23:33] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users for kcoleman - https://phabricator.wikimedia.org/T391861#10744754 (10Ahoelzl) [17:23:40] !log xcollazo@deploy1003 Started deploy [airflow-dags/analytics@f650091]: Pickup latest artifacts. T391280. [17:23:44] T391280: Modify table maintenance mechanism to support Iceberg's rewrite_position_delete_files() - https://phabricator.wikimedia.org/T391280 [17:24:23] !log xcollazo@deploy1003 Finished deploy [airflow-dags/analytics@f650091]: Pickup latest artifacts. T391280. (duration: 01m 08s) [17:31:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P75053 and previous config saved to /var/cache/conftool/dbconfig/20250415-173146-fceratto.json [17:46:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T391056)', diff saved to https://phabricator.wikimedia.org/P75054 and previous config saved to /var/cache/conftool/dbconfig/20250415-174653-fceratto.json [17:46:58] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [17:47:09] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [17:47:27] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2150.codfw.wmnet with reason: Maintenance [17:47:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T391056)', diff saved to https://phabricator.wikimedia.org/P75055 and previous config saved to /var/cache/conftool/dbconfig/20250415-174734-fceratto.json [17:54:43] 
(03PS1) 10Ssingh: Revert^2 "P:durum: add conditional to enable ECH (durum2002)" [puppet] - 10https://gerrit.wikimedia.org/r/1136772 [17:57:02] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_eqiad and A:cp [18:00:05] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_eqiad and A:cp [18:00:05] dduvall and brennen: Your horoscope predicts another MediaWiki train - Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1800). [18:00:54] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [18:01:31] !log removing from reprepro -C component/nginx-ech libssl and openssl packages [18:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:40] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10netops: eqiad: second frack parent tracking task - https://phabricator.wikimedia.org/T392006 (10RobH) 03NEW p:05Triage→03High [18:04:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T391056)', diff saved to https://phabricator.wikimedia.org/P75056 and previous config saved to /var/cache/conftool/dbconfig/20250415-180400-fceratto.json [18:04:05] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [18:04:32] o/ [18:04:51] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10netops: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007 (10RobH) 03NEW [18:05:10] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610 [18:05:13] T388610: Migrate production Elastic 
clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [18:05:15] (ah, currently blocked it seems.) [18:07:13] brennen: o/ i don't think it's a complete blocker per se as it's intermittent [18:07:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:10:15] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136776 (https://phabricator.wikimedia.org/T386220) [18:10:17] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136776 (https://phabricator.wikimedia.org/T386220) (owner: 10TrainBranchBot) [18:11:04] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136776 (https://phabricator.wikimedia.org/T386220) (owner: 10TrainBranchBot) [18:11:51] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10netops: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10744966 (10RobH) [18:13:34] (03PS1) 10Jforrester: VE: Start setting wgVisualEditorMobileInsertMenu, default to off [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136777 (https://phabricator.wikimedia.org/T388604) [18:13:35] (03PS1) 10Jforrester: VE: Set wgVisualEditorMobileInsertMenu true on Wikifunctions client wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136778 (https://phabricator.wikimedia.org/T383145) [18:13:39] FIRING: ProbeDown: Service restbase1044-c:9042 has failed probes (tcp_cassandra_c_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#restbase1044-c:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[18:14:10] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10netops: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10744975 (10RobH) @ayounsi & @cmooney: Per our conversation today in our codfw/eqiad buildout meetings, this was brought up and I've created th... [18:14:33] (03CR) 10DLynch: [C:03+1] VE: Set wgVisualEditorMobileInsertMenu true on Wikifunctions client wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136778 (https://phabricator.wikimedia.org/T383145) (owner: 10Jforrester) [18:14:35] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10netops: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10744979 (10RobH) @Jclark-ctr & @VRiley-WMF Per today's meeting, one of the action items was to have an eqiad onsite detrmine how many free cro... [18:19:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P75057 and previous config saved to /var/cache/conftool/dbconfig/20250415-181906-fceratto.json [18:23:39] RESOLVED: ProbeDown: Service restbase1044-c:9042 has failed probes (tcp_cassandra_c_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#restbase1044-c:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:23:46] (03PS1) 10Andrew Bogott: mwopenstackclients: fix DnsManager [puppet] - 10https://gerrit.wikimedia.org/r/1136780 [18:27:02] (03CR) 10Bartosz Dziewoński: [C:03+1] logging: Add context processor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136132 (https://phabricator.wikimedia.org/T142313) (owner: 10Gergő Tisza) [18:28:07] (03Abandoned) 10Dzahn: jenkins: fix puppet error, systemd override requires systemd service [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) 
[18:28:47] 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations, 10netops: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10745024 (10cmooney) >>! In T392007#10744966, @RobH wrote: > Please detail via comment specifically how using D6 would cause a network imbalance... [18:29:32] !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.25 refs T386220 [18:29:36] T386220: 1.44.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T386220 [18:30:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10745029 (10RobH) [18:31:26] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10745033 (10RobH) [18:34:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P75058 and previous config saved to /var/cache/conftool/dbconfig/20250415-183413-fceratto.json [18:34:36] (03CR) 10Bartosz Dziewoński: [C:03+1] "(Needs to wait under the dependency is deployed, otherwise it throws exceptions)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136132 (https://phabricator.wikimedia.org/T142313) (owner: 10Gergő Tisza) [18:49:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T391056)', diff saved to https://phabricator.wikimedia.org/P75059 and previous config saved to /var/cache/conftool/dbconfig/20250415-184921-fceratto.json [18:49:25] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [18:49:38] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2159.codfw.wmnet with reason: Maintenance [18:49:53] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
4:00:00 on db2187.codfw.wmnet with reason: Maintenance [18:49:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: eqiad: second frack parent tracking task - https://phabricator.wikimedia.org/T392006#10745097 (10RobH) [18:49:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10745098 (10RobH) [18:50:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T391056)', diff saved to https://phabricator.wikimedia.org/P75060 and previous config saved to /var/cache/conftool/dbconfig/20250415-185000-fceratto.json [18:50:28] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: eqiad: second frack parent tracking task - https://phabricator.wikimedia.org/T392006#10745104 (10RobH) Please note I've tied original task T390240 to this for ease of tracking. If rack D6 is not selected (likely wont be) then I'll invalid... [18:55:47] (03PS2) 10Fabfur: cache: allow logging of x-cache-status also for silent-dropped reqs [puppet] - 10https://gerrit.wikimedia.org/r/1136761 (https://phabricator.wikimedia.org/T391967) [18:57:33] (03PS3) 10Fabfur: cache: allow logging of x-cache-status also for silent-dropped reqs [puppet] - 10https://gerrit.wikimedia.org/r/1136761 (https://phabricator.wikimedia.org/T391967) [18:58:09] (03PS1) 10Volans: CHANGELOG: add changelogs for release v10.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136785 [19:01:45] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v10.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136785 (owner: 10Volans) [19:03:49] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610 [19:03:53] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [19:04:24] 
dduvall: Are you using the train window, or can I do a deploy? No available windows on Tuesday afternoons, sadly. [19:05:15] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2082 to cirrussearch2082 [19:05:27] !log bking@cumin2002 START - Cookbook sre.dns.netbox [19:06:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T391056)', diff saved to https://phabricator.wikimedia.org/P75061 and previous config saved to /var/cache/conftool/dbconfig/20250415-190613-fceratto.json [19:06:17] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [19:10:02] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2082 to cirrussearch2082 - bking@cumin2002" [19:10:19] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2082 to cirrussearch2082 - bking@cumin2002" [19:10:20] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:10:21] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2082 [19:10:39] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:10:56] (03PS4) 10Jforrester: [wikifunctionswiki] Enable Wikifunctions client mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126661 (https://phabricator.wikimedia.org/T383106) [19:11:02] (03PS4) 10Jforrester: [dagwiki] Enable Wikifunctions client mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126662 (https://phabricator.wikimedia.org/T383106) [19:11:32] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v10.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136785 (owner: 10Volans) [19:11:35] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:11:47] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136789 [19:12:26] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10745165 (10Jclark-ctr) @RobH we have 1 free cross connect circuit id 21996480 [19:12:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136777 (https://phabricator.wikimedia.org/T388604) (owner: 10Jforrester) [19:12:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136778 (https://phabricator.wikimedia.org/T383145) (owner: 10Jforrester) [19:12:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126661 (https://phabricator.wikimedia.org/T383106) (owner: 10Jforrester) [19:12:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126662 
(https://phabricator.wikimedia.org/T383106) (owner: 10Jforrester) [19:13:06] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136791 [19:13:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10745171 (10Jclark-ctr) [19:13:37] (03Merged) 10jenkins-bot: VE: Start setting wgVisualEditorMobileInsertMenu, default to off [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136777 (https://phabricator.wikimedia.org/T388604) (owner: 10Jforrester) [19:13:41] (03Merged) 10jenkins-bot: VE: Set wgVisualEditorMobileInsertMenu true on Wikifunctions client wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136778 (https://phabricator.wikimedia.org/T383145) (owner: 10Jforrester) [19:13:45] (03Merged) 10jenkins-bot: [wikifunctionswiki] Enable Wikifunctions client mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126661 (https://phabricator.wikimedia.org/T383106) (owner: 10Jforrester) [19:13:48] (03Merged) 10jenkins-bot: [dagwiki] Enable Wikifunctions client mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126662 (https://phabricator.wikimedia.org/T383106) (owner: 10Jforrester) [19:14:14] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1136777|VE: Start setting wgVisualEditorMobileInsertMenu, default to off (T388604)]], [[gerrit:1136778|VE: Set wgVisualEditorMobileInsertMenu true on Wikifunctions client wikis (T383145 T388604)]], [[gerrit:1126661|[wikifunctionswiki] Enable Wikifunctions client mode (T383106)]], [[gerrit:1126662|[dagwiki] Enable Wikifunctions client mode (T383106)] [19:14:14] ] [19:14:21] T388604: [Config] Deploy "+" menu (and new tools) to Phase 1 wikis - https://phabricator.wikimedia.org/T388604 [19:14:21] T383145: [Abstract Wikipedia] Adding Wikifunctions in VE (desktop + mobile) - https://phabricator.wikimedia.org/T383145 [19:14:21] T383106: [25Q3] 
Provide Wikifunctions integration in articles on Dagbani Wikipedia - https://phabricator.wikimedia.org/T383106 [19:17:38] James_F: yeah, go ahead. train is done. looks ok [19:17:55] dduvall: Awesome. (And of course I now have a train-blocker, unrelated to this. Oy.) [19:18:59] James_F: ok. thanks for dealing with that blocker [19:19:20] Sorry to be the creator of the code that's making the blockage. :_) [19:19:49] "19:15:47 [root] Sleeping for 5 minutes to allow swift eventual consistency, sorry. T390251", sigh. [19:19:50] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251 [19:19:59] Oh, oops, sorry stashbot. [19:21:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P75062 and previous config saved to /var/cache/conftool/dbconfig/20250415-192120-fceratto.json [19:23:39] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:25:47] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1136777|VE: Start setting wgVisualEditorMobileInsertMenu, default to off (T388604)]], [[gerrit:1136778|VE: Set wgVisualEditorMobileInsertMenu true on Wikifunctions client wikis (T383145 T388604)]], [[gerrit:1126661|[wikifunctionswiki] Enable Wikifunctions client mode (T383106)]], [[gerrit:1126662|[dagwiki] Enable Wikifunctions client mode (T383106)]] synced to t [19:25:48] he testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:25:53] T388604: [Config] Deploy "+" menu (and new tools) to Phase 1 wikis - https://phabricator.wikimedia.org/T388604 [19:25:54] T383145: [Abstract Wikipedia] Adding Wikifunctions in VE (desktop + mobile) - 
https://phabricator.wikimedia.org/T383145 [19:25:54] T383106: [25Q3] Provide Wikifunctions integration in articles on Dagbani Wikipedia - https://phabricator.wikimedia.org/T383106 [19:26:03] (03PS2) 10Fabfur: cache: use fqdn in syslog hostname [puppet] - 10https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571) [19:27:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10745240 (10RobH) [19:28:43] !log jforrester@deploy1003 jforrester: Continuing with sync [19:28:46] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571) (owner: 10Fabfur) [19:33:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [19:35:09] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136777|VE: Start setting wgVisualEditorMobileInsertMenu, default to off (T388604)]], [[gerrit:1136778|VE: Set wgVisualEditorMobileInsertMenu true on Wikifunctions client wikis (T383145 T388604)]], [[gerrit:1126661|[wikifunctionswiki] Enable Wikifunctions client mode (T383106)]], [[gerrit:1126662|[dagwiki] Enable Wikifunctions client mode (T383106) [19:35:09] ]] (duration: 20m 54s) [19:35:14] T388604: [Config] Deploy "+" menu (and new tools) to Phase 1 wikis - https://phabricator.wikimedia.org/T388604 [19:35:14] T383145: [Abstract Wikipedia] Adding Wikifunctions in VE (desktop + mobile) - https://phabricator.wikimedia.org/T383145 [19:35:15] T383106: [25Q3] Provide Wikifunctions integration in articles on Dagbani Wikipedia - https://phabricator.wikimedia.org/T383106 [19:36:27] !log fceratto@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P75063 and previous config saved to /var/cache/conftool/dbconfig/20250415-193627-fceratto.json [19:38:05] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10745274 (10RobH) Updates: Work is scheduled for this afternoon, but the host is depooled so no maint window needed. I've sent the engineer a detailed info breakdown on what to swap (pcie riser and slo... [19:38:47] FIRING: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:45:32] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2082 [19:45:40] jouncebot: next [19:45:40] In 0 hour(s) and 14 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T2000) [19:46:12] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2082 to cirrussearch2082 [19:46:13] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2082.codfw.wmnet on all recursors [19:46:16] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2082.codfw.wmnet on all recursors [19:46:38] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2082.codfw.wmnet with OS bullseye [19:46:40] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5023.eqsin.wmnet [19:46:50] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2082 [19:47:25] !log bking@cumin2002 START - Cookbook sre.dns.netbox [19:48:06] (03PS1) 10Ryan Kemper: sre.elasticsearch.rolling-operation: refactor external cookbook invocations [cookbooks] - 10https://gerrit.wikimedia.org/r/1136796 (https://phabricator.wikimedia.org/T383811) [19:48:47] RESOLVED: 
ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:49:03] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:51:29] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2082 - bking@cumin2002" [19:51:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T391056)', diff saved to https://phabricator.wikimedia.org/P75064 and previous config saved to /var/cache/conftool/dbconfig/20250415-195134-fceratto.json [19:51:35] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2082 - bking@cumin2002" [19:51:35] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:51:35] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2082.codfw.wmnet 87.32.192.10.in-addr.arpa 7.8.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:51:38] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [19:51:39] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2082.codfw.wmnet 87.32.192.10.in-addr.arpa 7.8.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:51:40] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2082 [19:51:50] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
db2168.codfw.wmnet with reason: Maintenance [19:51:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T391056)', diff saved to https://phabricator.wikimedia.org/P75065 and previous config saved to /var/cache/conftool/dbconfig/20250415-195157-fceratto.json [19:52:44] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2082 [19:52:44] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2082 [19:53:25] (03CR) 10Ryan Kemper: [C:03+2] "forgot to send old comment" [dns] - 10https://gerrit.wikimedia.org/r/1124197 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [19:54:36] (03CR) 10CI reject: [V:04-1] sre.elasticsearch.rolling-operation: refactor external cookbook invocations [cookbooks] - 10https://gerrit.wikimedia.org/r/1136796 (https://phabricator.wikimedia.org/T383811) (owner: 10Ryan Kemper) [19:55:40] (03CR) 10Ryan Kemper: cirrussearch: Add new master-eligibles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [19:58:04] (03CR) 10Ryan Kemper: cirrussearch: Add new master-eligibles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T2000). [20:00:05] robertsky: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:36] (03PS2) 10Ryan Kemper: sre.elasticsearch.rolling-operation: refactor external cookbook invocations [cookbooks] - 10https://gerrit.wikimedia.org/r/1136796 (https://phabricator.wikimedia.org/T383811) [20:01:27] I can take the window. [20:01:39] i am around [20:01:46] robertsky: Excellent, let's do this. [20:01:59] just woke up for this. :) [20:02:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136731 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky) [20:02:49] (03PS4) 10Ryan Kemper: cirrussearch: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:02:59] (03Merged) 10jenkins-bot: wikimaniawiki: fix add/remove groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136731 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky) [20:03:07] (03PS5) 10Ryan Kemper: cirrussearch: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:03:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10745411 (10Jclark-ctr) [20:03:25] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1136731|wikimaniawiki: fix add/remove groups (T389729)]] [20:03:29] T389729: wikimaniawiki: namespaces for 2027-2028 and other adjustments - https://phabricator.wikimedia.org/T389729 [20:03:49] (03PS1) 10Volans: Upstream release v10.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1136800 [20:04:56] (03CR) 10Bking: [C:03+1] "Conditional +1, if this works with test-cookbook we can go ahead and merge it." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1136796 (https://phabricator.wikimedia.org/T383811) (owner: 10Ryan Kemper) [20:05:08] (03CR) 10Ryan Kemper: [C:03+1] "Fixed a small error; this patch should be ready to ship now" [puppet] - 10https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:05:24] (03CR) 10Volans: [C:03+2] Upstream release v10.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1136800 (owner: 10Volans) [20:07:03] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2082.codfw.wmnet with reason: host reimage [20:07:14] (03CR) 10CI reject: [V:04-1] sre.elasticsearch.rolling-operation: refactor external cookbook invocations [cookbooks] - 10https://gerrit.wikimedia.org/r/1136796 (https://phabricator.wikimedia.org/T383811) (owner: 10Ryan Kemper) [20:08:39] FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:08:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T391056)', diff saved to https://phabricator.wikimedia.org/P75066 and previous config saved to /var/cache/conftool/dbconfig/20250415-200855-fceratto.json [20:09:02] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [20:09:21] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 4 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [20:09:26] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: 
eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10745517 (10Jclark-ctr) [20:10:45] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2082.codfw.wmnet with reason: host reimage [20:10:45] (03PS5) 10BCornwall: cdn: Unify ats/haproxy/varnish upgrade cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1129882 [20:10:52] (03CR) 10BCornwall: cdn: Unify ats/haproxy/varnish upgrade cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1129882 (owner: 10BCornwall) [20:12:37] (03PS3) 10Ryan Kemper: sre.elasticsearch.rolling-operation: refactor external cookbook invocations [cookbooks] - 10https://gerrit.wikimedia.org/r/1136796 (https://phabricator.wikimedia.org/T383811) [20:13:34] (03CR) 10Bking: [C:03+1] cirrussearch: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:13:42] (03CR) 10Bking: [C:03+2] cirrussearch: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:14:32] (03PS1) 10Jforrester: FetchHandler: Disable on non-repo wikis [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136801 (https://phabricator.wikimedia.org/T392014) [20:14:42] (03PS1) 10Jforrester: FetchHandler: Disable on non-repo wikis [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136802 (https://phabricator.wikimedia.org/T392014) [20:14:54] (03Merged) 10jenkins-bot: Upstream release v10.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1136800 (owner: 10Volans) [20:15:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136801 
(https://phabricator.wikimedia.org/T392014) (owner: 10Jforrester) [20:15:09] !log jforrester@deploy1003 robertsky, jforrester: Backport for [[gerrit:1136731|wikimaniawiki: fix add/remove groups (T389729)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:15:12] T389729: wikimaniawiki: namespaces for 2027-2028 and other adjustments - https://phabricator.wikimedia.org/T389729 [20:15:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136802 (https://phabricator.wikimedia.org/T392014) (owner: 10Jforrester) [20:15:23] robertsky: Can you test to confirm it's working as planned? [20:16:25] yes. changes are in. lgtm. [20:17:57] !log jforrester@deploy1003 robertsky, jforrester: Continuing with sync [20:18:01] Excellent, thank you. [20:24:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P75067 and previous config saved to /var/cache/conftool/dbconfig/20250415-202401-fceratto.json [20:24:12] 06SRE-OnFire: Discover Phabricator changes needed for using Phabricator as incident response document - https://phabricator.wikimedia.org/T349120#10745567 (10BCornwall) 05Open→03Resolved a:03BCornwall Setting this as closed as it's basically done already [20:24:29] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136731|wikimaniawiki: fix add/remove groups (T389729)]] (duration: 21m 04s) [20:24:33] T389729: wikimaniawiki: namespaces for 2027-2028 and other adjustments - https://phabricator.wikimedia.org/T389729 [20:27:47] !log uploaded spicerack_10.1.0 to apt.wikimedia.org bullseye-wikimedia [20:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:47] FIRING: ErrorBudgetBurn: citoid - citoid-requests - 
https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:33:01] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Wikimedia-Incident: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10745595 (10Dzahn) 05Open→03Resolved a:03Dzahn Looking at the graph for the last 7 days there is nothing out of the ordinary anymore... [20:33:17] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Wikimedia-Incident: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10745598 (10Dzahn) a:05Dzahn→03None [20:35:06] (03PS1) 10Jforrester: FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136807 (https://phabricator.wikimedia.org/T392014) [20:35:15] (03PS1) 10Jforrester: FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136808 (https://phabricator.wikimedia.org/T392014) [20:35:47] RESOLVED: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:36:50] FIRING: KubernetesCalicoDown: ml-serve2007.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2007.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:37:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136802 (https://phabricator.wikimedia.org/T392014) (owner: 10Jforrester) [20:37:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136807 (https://phabricator.wikimedia.org/T392014) (owner: 10Jforrester) [20:37:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136801 (https://phabricator.wikimedia.org/T392014) (owner: 10Jforrester) [20:37:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136808 (https://phabricator.wikimedia.org/T392014) (owner: 10Jforrester) [20:37:24] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2082.codfw.wmnet with OS bullseye [20:38:49] (03Merged) 10jenkins-bot: FetchHandler: Disable on non-repo wikis [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136802 (https://phabricator.wikimedia.org/T392014) (owner: 10Jforrester) [20:39:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P75068 and previous config saved to /var/cache/conftool/dbconfig/20250415-203909-fceratto.json [20:39:25] (03CR) 10CI reject: [V:04-1] FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136807 (https://phabricator.wikimedia.org/T392014) (owner: 10Jforrester) [20:39:50] (03CR) 10Scott French: [C:03+1] CampaignEvents: Migrate aggregateparticipantanswers-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1136734 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková) [20:39:55] (03CR) 10CI reject: [V:04-1] FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 
10https://gerrit.wikimedia.org/r/1136808 (https://phabricator.wikimedia.org/T392014) (owner: 10Jforrester) [20:41:02] (03CR) 10Scott French: [C:03+1] CampaignEvents: Migrate updateutcts-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: 10Clément Goubert) [20:41:08] (03PS2) 10Jforrester: FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136807 (https://phabricator.wikimedia.org/T392014) [20:41:20] (03PS2) 10Jforrester: FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136808 (https://phabricator.wikimedia.org/T392014) [20:42:44] (03Merged) 10jenkins-bot: FetchHandler: Disable on non-repo wikis [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136801 (https://phabricator.wikimedia.org/T392014) (owner: 10Jforrester) [20:43:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136807 (https://phabricator.wikimedia.org/T392014) (owner: 10Jforrester) [20:43:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136808 (https://phabricator.wikimedia.org/T392014) (owner: 10Jforrester) [20:48:12] (03Merged) 10jenkins-bot: FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136807 (https://phabricator.wikimedia.org/T392014) (owner: 10Jforrester) [20:48:14] (03Merged) 10jenkins-bot: FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 
10https://gerrit.wikimedia.org/r/1136808 (https://phabricator.wikimedia.org/T392014) (owner: 10Jforrester) [20:48:44] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1136802|FetchHandler: Disable on non-repo wikis (T392014)]], [[gerrit:1136807|FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either (T392014)]], [[gerrit:1136801|FetchHandler: Disable on non-repo wikis (T392014)]], [[gerrit:1136808|FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either (T392014) [20:48:44] ]] [20:48:47] T392014: Error related to initiatlizing RESTAPI/FetchHandler.php - https://phabricator.wikimedia.org/T392014 [20:51:14] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1180.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:52:36] (03CR) 10Dzahn: [V:04-1] "https://puppet-compiler.wmflabs.org/output/1136765/5306/releases1003.eqiad.wmnet/change.releases1003.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/1136765 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [20:53:19] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2103 to cirrussearch2103 [20:53:40] (03PS2) 10Dzahn: jenkins: ensure systemd service dir exists before override [puppet] - 10https://gerrit.wikimedia.org/r/1136765 (https://phabricator.wikimedia.org/T384595) [20:53:42] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:54:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T391056)', diff saved to https://phabricator.wikimedia.org/P75069 and previous config saved to /var/cache/conftool/dbconfig/20250415-205416-fceratto.json [20:54:19] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [20:54:21] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2182.codfw.wmnet with reason: 
Maintenance [20:54:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T391056)', diff saved to https://phabricator.wikimedia.org/P75070 and previous config saved to /var/cache/conftool/dbconfig/20250415-205427-fceratto.json [20:56:16] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3619 MB (3% inode=98%): /tmp 3619 MB (3% inode=98%): /var/tmp 3619 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [20:56:43] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1136765/5307/" [puppet] - 10https://gerrit.wikimedia.org/r/1136765 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T2100) [21:05:02] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1180.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [21:07:00] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1180.eqiad.wmnet with OS bullseye [21:07:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02), 13Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10745687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1180... 
[21:09:48] FIRING: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:10:31] (03PS3) 10Eevans: restbase: bootstrap restbase1045 (refresh for restbase1030) [puppet] - 10https://gerrit.wikimedia.org/r/1130175 (https://phabricator.wikimedia.org/T389423) [21:11:14] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1130175 (https://phabricator.wikimedia.org/T389423) (owner: 10Eevans) [21:11:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T391056)', diff saved to https://phabricator.wikimedia.org/P75071 and previous config saved to /var/cache/conftool/dbconfig/20250415-211152-fceratto.json [21:11:56] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [21:13:13] (03PS1) 10JHathaway: postfix: remove exim aliases [puppet] - 10https://gerrit.wikimedia.org/r/1136811 [21:13:27] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136811 (owner: 10JHathaway) [21:15:43] (03CR) 10Eevans: [C:03+2] restbase: bootstrap restbase1045 (refresh for restbase1030) [puppet] - 10https://gerrit.wikimedia.org/r/1130175 (https://phabricator.wikimedia.org/T389423) (owner: 10Eevans) [21:16:14] (03CR) 10JHathaway: [C:03+2] postfix: remove exim aliases [puppet] - 10https://gerrit.wikimedia.org/r/1136811 (owner: 10JHathaway) [21:20:52] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2103 to cirrussearch2103 - bking@cumin2002" [21:21:12] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2103 to cirrussearch2103 - bking@cumin2002" [21:21:12] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [21:21:13] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2103 [21:21:27] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2103 [21:22:08] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2103 to cirrussearch2103 [21:22:08] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2103.codfw.wmnet on all recursors [21:22:12] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2103.codfw.wmnet on all recursors [21:22:44] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2103.codfw.wmnet with OS bullseye [21:22:55] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2103 [21:23:09] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:26:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P75072 and previous config saved to /var/cache/conftool/dbconfig/20250415-212659-fceratto.json [21:27:09] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2103 - bking@cumin2002" [21:27:14] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2103 - bking@cumin2002" [21:27:15] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:27:15] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2103.codfw.wmnet 222.32.192.10.in-addr.arpa 2.2.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:27:19] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2103.codfw.wmnet 
222.32.192.10.in-addr.arpa 2.2.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:27:19] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1136802|FetchHandler: Disable on non-repo wikis (T392014)]], [[gerrit:1136807|FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either (T392014)]], [[gerrit:1136801|FetchHandler: Disable on non-repo wikis (T392014)]], [[gerrit:1136808|FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either (T392014)]] synced to [21:27:19] the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:27:19] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2103 [21:27:25] T392014: Error related to initiatlizing RESTAPI/FetchHandler.php - https://phabricator.wikimedia.org/T392014 [21:27:37] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2103 [21:27:37] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2103 [21:27:41] !log jforrester@deploy1003 jforrester: Continuing with sync [21:28:30] 06SRE, 10Domains, 06Traffic, 13Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10745745 (10BCornwall) 05Open→03Stalled Indeed.... too bad. Hopefully we'll hear back sooner rather than later! [21:30:59] (03CR) 10Scott French: [C:03+1] "Thanks! 
Nice to see the host list is now sorted, too :)" [puppet] - 10https://gerrit.wikimedia.org/r/1129177 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [21:41:36] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136802|FetchHandler: Disable on non-repo wikis (T392014)]], [[gerrit:1136807|FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either (T392014)]], [[gerrit:1136801|FetchHandler: Disable on non-repo wikis (T392014)]], [[gerrit:1136808|FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either (T392014 [21:41:36] )]] (duration: 52m 52s) [21:41:39] T392014: Error related to initiatlizing RESTAPI/FetchHandler.php - https://phabricator.wikimedia.org/T392014 [21:42:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P75073 and previous config saved to /var/cache/conftool/dbconfig/20250415-214206-fceratto.json [21:42:41] !log eevans@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1045.eqiad.wmnet with reason: Bootstrapping — T389423 [21:42:44] T389423: Refresh restbase10[28-30] w/ restbase104[3-5] - https://phabricator.wikimedia.org/T389423 [21:44:48] FIRING: [2x] ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:44:49] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2103.codfw.wmnet with reason: host reimage [21:46:39] !log bootstrapping Cassandra/restbase1045-{a,b,c} — T389423 [21:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:25] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2103.codfw.wmnet with reason: host reimage [21:48:30] !log vriley@cumin1002 END (FAIL) - Cookbook 
sre.hosts.reimage (exit_code=99) for host an-worker1180.eqiad.wmnet with OS bullseye [21:48:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02), 13Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10745778 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1180.eqi... [21:50:54] (03CR) 10Andrew Bogott: [C:03+2] mwopenstackclients: fix DnsManager [puppet] - 10https://gerrit.wikimedia.org/r/1136780 (owner: 10Andrew Bogott) [21:53:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [21:53:39] FIRING: [5x] ProbeDown: Service restbase1045-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:54:48] FIRING: [2x] ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:57:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T391056)', diff saved to https://phabricator.wikimedia.org/P75074 and previous config saved to /var/cache/conftool/dbconfig/20250415-215714-fceratto.json [21:57:18] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [21:57:30] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2198.codfw.wmnet with reason: Maintenance [21:58:39] FIRING: [4x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not 
been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [22:00:54] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [22:07:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [22:10:38] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2200.codfw.wmnet with reason: Maintenance [22:13:39] FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:16:17] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3485 MB (3% inode=98%): /tmp 3485 MB (3% inode=98%): /var/tmp 3485 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [22:17:02] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2103.codfw.wmnet with OS bullseye [22:22:50] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10745829 (10Eevans) >>! In T391544#10743933, @MatthewVernon wrote: > There's an LVM layer here too, isn't ther... 
[22:23:09] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2208.codfw.wmnet with reason: Maintenance [22:23:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T391056)', diff saved to https://phabricator.wikimedia.org/P75075 and previous config saved to /var/cache/conftool/dbconfig/20250415-222316-fceratto.json [22:23:20] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [22:23:39] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:35:01] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2112 to cirrussearch2112 [22:35:22] !log bking@cumin2002 START - Cookbook sre.dns.netbox [22:38:39] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:39:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T391056)', diff saved to https://phabricator.wikimedia.org/P75076 and previous config saved to /var/cache/conftool/dbconfig/20250415-223949-fceratto.json [22:39:53] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [22:41:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:43:39] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4047 is OK: HTTP OK: HTTP/1.1 200 OK - 48114 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [22:46:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:50:56] (03CR) 10Cwhite: "On hold until 2025-04-30." [puppet] - 10https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380) (owner: 10Cwhite) [22:52:02] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2112 to cirrussearch2112 - bking@cumin2002" [22:54:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P75077 and previous config saved to /var/cache/conftool/dbconfig/20250415-225456-fceratto.json [22:57:07] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10745924 (10RobH) a:05RobH→03ssingh Ok, parts swapped and new PCIe riser and SSD detected (only change really is serial of the ssd in lshw output). This is now ready to have puppet run and tenativel... 
[23:09:08] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2112 to cirrussearch2112 - bking@cumin2002" [23:09:09] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:09:10] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2112 [23:09:19] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2112 [23:09:48] RESOLVED: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:10:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2112 to cirrussearch2112 [23:10:00] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2112.codfw.wmnet on all recursors [23:10:04] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2112.codfw.wmnet on all recursors [23:10:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P75078 and previous config saved to /var/cache/conftool/dbconfig/20250415-231003-fceratto.json [23:10:53] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2112.codfw.wmnet with OS bullseye [23:10:58] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2112 [23:10:58] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2112 [23:12:22] (03CR) 10Dzahn: [C:03+2] jenkins: ensure systemd service dir exists before override [puppet] - 10https://gerrit.wikimedia.org/r/1136765 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [23:16:17] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3391 MB (3% 
inode=98%): /tmp 3391 MB (3% inode=98%): /var/tmp 3391 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [23:20:26] (03CR) 10Dzahn: [C:03+2] "noop confirmed on releases* and contint*" [puppet] - 10https://gerrit.wikimedia.org/r/1136765 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [23:24:31] why do we alert on archiva disk space if it doesn' [23:25:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T391056)', diff saved to https://phabricator.wikimedia.org/P75079 and previous config saved to /var/cache/conftool/dbconfig/20250415-232511-fceratto.json [23:25:16] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [23:25:28] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2220.codfw.wmnet with reason: Maintenance [23:25:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2220 (T391056)', diff saved to https://phabricator.wikimedia.org/P75080 and previous config saved to /var/cache/conftool/dbconfig/20250415-232535-fceratto.json [23:25:35] (03CR) 10Dzahn: [C:03+2] "manually deleted the directory and saw puppet re-create it on releases2003" [puppet] - 10https://gerrit.wikimedia.org/r/1136765 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [23:27:37] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2112.codfw.wmnet with reason: host reimage [23:32:15] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2112.codfw.wmnet with reason: host reimage [23:40:44] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1136821 [23:40:44] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for 
wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1136821 (owner: 10TrainBranchBot) [23:41:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T391056)', diff saved to https://phabricator.wikimedia.org/P75081 and previous config saved to /var/cache/conftool/dbconfig/20250415-234142-fceratto.json [23:41:47] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [23:48:24] PROBLEM - OpenSearch health check for shards on 9600 on cirrussearch2103 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:48:24] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch2103 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [23:48:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2103:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [23:50:38] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2103-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [23:52:10] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2112.codfw.wmnet with OS bullseye [23:52:20] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1136821 (owner: 10TrainBranchBot) [23:53:39] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:56:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P75082 and previous config saved to /var/cache/conftool/dbconfig/20250415-235649-fceratto.json