[00:02:03] <wikibugs>	 (CR) Scott French: "While clearly very large, the PCC diff generally looks like what I'd expect, which is nice. Thanks in advance for the review!" [puppet] - https://gerrit.wikimedia.org/r/1136413 (https://phabricator.wikimedia.org/T380485) (owner: Scott French)
[00:09:48] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1136474
[00:09:48] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1136474 (owner: TrainBranchBot)
[00:16:16] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3378 MB (3% inode=98%): /tmp 3378 MB (3% inode=98%): /var/tmp 3378 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[00:29:55] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1136474 (owner: TrainBranchBot)
[00:46:36] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/69d4adff9ec963248074b4ed851e430576834914028afdd60017788f3eea3f8c/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[00:48:39] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:56:16] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3590 MB (3% inode=98%): /tmp 3590 MB (3% inode=98%): /var/tmp 3590 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[01:03:31] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[01:09:45] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.44.0-wmf.25 [core] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1136477 (https://phabricator.wikimedia.org/T386220)
[01:09:47] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.44.0-wmf.25 [core] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1136477 (https://phabricator.wikimedia.org/T386220) (owner: TrainBranchBot)
[01:13:39] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[01:22:14] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.44.0-wmf.25 [core] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1136477 (https://phabricator.wikimedia.org/T386220) (owner: TrainBranchBot)
[01:24:47] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[01:26:36] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:30:48] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1181']
[01:31:06] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-worker1181']
[01:32:36] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host an-worker1181.eqiad.wmnet with OS bullseye
[01:32:50] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02), Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10741929 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker11...
[01:58:46] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[01:59:32] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[02:00:04] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T0200)
[02:00:54] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[02:06:24] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1181.eqiad.wmnet with OS bullseye
[02:06:32] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02), Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10741941 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1181.e...
[02:07:41] <jinxer-wm>	 FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[02:27:03] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10742000 (MikhailRyazanov) By the way, are there any reasons, besides historical, to specify image sizes in “pixels” (which nowadays often don't corresp...
[03:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T0300)
[03:01:44] <wikibugs>	 (PS1) TrainBranchBot: testwikis to 1.44.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1136485 (https://phabricator.wikimedia.org/T386220)
[03:01:45] <wikibugs>	 (CR) TrainBranchBot: [C:+2] testwikis to 1.44.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1136485 (https://phabricator.wikimedia.org/T386220) (owner: TrainBranchBot)
[03:02:34] <wikibugs>	 (Merged) jenkins-bot: testwikis to 1.44.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1136485 (https://phabricator.wikimedia.org/T386220) (owner: TrainBranchBot)
[03:02:57] <logmsgbot>	 !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.44.0-wmf.25  refs T386220
[03:03:00] <stashbot>	 T386220: 1.44.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T386220
[03:11:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: spiderpig-jobrunner.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:23:39] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service restbase1044-b:9042 has failed probes (tcp_cassandra_b_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:26:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: spiderpig-jobrunner.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:28:39] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[03:31:22] <icinga-wm>	 PROBLEM - Disk space on deploy1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/51e26c1e0f39e1935a3cafc60f73aa272a120b6c331359bfc3f18088bc2045c0/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops
[03:43:21] <logmsgbot>	 !log mwpresync@deploy1003 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.44.0-wmf.24,1.44.0-wmf.25 --multiversion-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion --multiversion-debug-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug --multiversion-cli-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-cli --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.153.0 --label vnd.wikimedia.mediawiki.versions=1.44.0-wmf.24,1.44.0-wmf.25 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/mediawiki-staging/scap/image-build --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080' returned non-zero exit status 1. (scap version: 4.153.0) (duration: 40m 23s)
[03:48:39] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:00:05] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T0400)
[04:10:13] <logmsgbot>	 !log mwpresync@deploy1003 Pruned MediaWiki: 1.44.0-wmf.22 (duration: 10m 03s)
[04:13:39] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service restbase1044-b:9042 has failed probes (tcp_cassandra_b_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:30:50] <icinga-wm>	 PROBLEM - Host ml-serve2007 is DOWN: PING CRITICAL - Packet loss = 100%
[04:32:30] <icinga-wm>	 PROBLEM - BGP status on lsw1-c3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:36:50] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2007.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2007.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:48:39] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:56:49] <wikibugs>	 (PS1) Marostegui: mariadb: Migrate pc6 to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1136487 (https://phabricator.wikimedia.org/T391454)
[04:57:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc6 T391454', diff saved to https://phabricator.wikimedia.org/P75002 and previous config saved to /var/cache/conftool/dbconfig/20250415-045700-marostegui.json
[04:57:05] <stashbot>	 T391454: Migrate pcX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391454
[04:57:24] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance
[04:59:15] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Migrate pc6 to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1136487 (https://phabricator.wikimedia.org/T391454) (owner: Marostegui)
[05:03:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc6 T391454', diff saved to https://phabricator.wikimedia.org/P75003 and previous config saved to /var/cache/conftool/dbconfig/20250415-050307-marostegui.json
[05:03:11] <stashbot>	 T391454: Migrate pcX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391454
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:13:39] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:19:56] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 113, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:20:24] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 46, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:20:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:23:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:45:26] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:46:16] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T0600).
[06:00:54] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[06:01:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:07:41] <jinxer-wm>	 FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[06:16:56] <icinga-wm>	 PROBLEM - Exim SMTP on lists1004 is CRITICAL: connect to address 208.80.154.81 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Exim
[06:19:00] <icinga-wm>	 RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Exim
[06:27:58] <wikibugs>	 (PS1) Marostegui: events_coredb_master.sql: Add s8 [software] - https://gerrit.wikimedia.org/r/1136594
[06:29:19] <wikibugs>	 (CR) Marostegui: "This is a noop" [software] - https://gerrit.wikimedia.org/r/1136594 (owner: Marostegui)
[06:29:21] <wikibugs>	 (CR) Marostegui: [C:+2] events_coredb_master.sql: Add s8 [software] - https://gerrit.wikimedia.org/r/1136594 (owner: Marostegui)
[06:29:49] <wikibugs>	 (Merged) jenkins-bot: events_coredb_master.sql: Add s8 [software] - https://gerrit.wikimedia.org/r/1136594 (owner: Marostegui)
[06:31:34] <wikibugs>	 (CR) Marostegui: "Just some comments, the mysql side of things looks good, but I'd like to see the code reviewed by someone with more expertise." [cookbooks] - https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: Federico Ceratto)
[06:33:25] <wikibugs>	 (PS1) Marostegui: events_sanitarium.sql: Update sanitarium hosts. [software] - https://gerrit.wikimedia.org/r/1136598
[06:33:40] <wikibugs>	 (CR) Marostegui: "This is a noop" [software] - https://gerrit.wikimedia.org/r/1136598 (owner: Marostegui)
[06:34:07] <wikibugs>	 (CR) Jelto: [V:+2 C:+2] gitlab: use a wmflib::expand_path compatible path for apus keys [labs/private] - https://gerrit.wikimedia.org/r/1136391 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[06:34:26] <wikibugs>	 (CR) Marostegui: [C:+2] events_sanitarium.sql: Update sanitarium hosts. [software] - https://gerrit.wikimedia.org/r/1136598 (owner: Marostegui)
[06:34:55] <wikibugs>	 (Merged) jenkins-bot: events_sanitarium.sql: Update sanitarium hosts. [software] - https://gerrit.wikimedia.org/r/1136598 (owner: Marostegui)
[06:40:11] <kart_>	 Deploying cxserver. Minor changes.
[06:41:34] <wikibugs>	 (CR) KartikMistry: [C:+2] Update cxserver to 2025-04-07-053106-production [deployment-charts] - https://gerrit.wikimedia.org/r/1134397 (https://phabricator.wikimedia.org/T390732) (owner: KartikMistry)
[06:43:23] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2025-04-07-053106-production [deployment-charts] - https://gerrit.wikimedia.org/r/1134397 (https://phabricator.wikimedia.org/T390732) (owner: KartikMistry)
[06:44:44] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply
[06:45:06] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:45:57] <logmsgbot>	 !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:46:31] <logmsgbot>	 !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:47:48] <logmsgbot>	 !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:48:20] <logmsgbot>	 !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:48:58] <kart_>	 !log Updated cxserver to 2025-04-07-053106-production (T390732, T390711)
[06:49:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:03] <stashbot>	 T390732: Close pihwiki - https://phabricator.wikimedia.org/T390732
[06:49:03] <stashbot>	 T390711: Post-creation work for nupwiki - https://phabricator.wikimedia.org/T390711
[06:49:10] <wikibugs>	 (PS1) Filippo Giunchedi: librenms: stop sending data to graphite [puppet] - https://gerrit.wikimedia.org/r/1136603 (https://phabricator.wikimedia.org/T372457)
[06:49:22] <kart_>	 Also, deploying MinT (in staging first!). It will be a bit slower.
[06:50:15] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[06:51:15] <wikibugs>	 (CR) CI reject: [V:-1] librenms: stop sending data to graphite [puppet] - https://gerrit.wikimedia.org/r/1136603 (https://phabricator.wikimedia.org/T372457) (owner: Filippo Giunchedi)
[06:51:30] <wikibugs>	 (CR) Ayounsi: [C:+1] librenms: stop sending data to graphite [puppet] - https://gerrit.wikimedia.org/r/1136603 (https://phabricator.wikimedia.org/T372457) (owner: Filippo Giunchedi)
[06:51:31] <wikibugs>	 (PS1) Filippo Giunchedi: kubernetes: remove master usage of prometheus_all_nodes, access is implicit [puppet] - https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170)
[06:51:33] <wikibugs>	 (PS1) Filippo Giunchedi: deployment_server: stop shipping prometheus_nodes for k8s [puppet] - https://gerrit.wikimedia.org/r/1136605 (https://phabricator.wikimedia.org/T389170)
[06:51:45] <godog>	 I bet I didn't align some arrows, HOW COULD I FORGET
[06:52:58] <godog>	 actually no, unrelated
[06:53:41] <wikibugs>	 (CR) Filippo Giunchedi: [V:+2 C:+2] "CI failures are unrelated" [puppet] - https://gerrit.wikimedia.org/r/1136603 (https://phabricator.wikimedia.org/T372457) (owner: Filippo Giunchedi)
[06:53:50] <wikibugs>	 (CR) Ayounsi: [C:+1] "Could be worth running PCC for netmon1003.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/1136603 (https://phabricator.wikimedia.org/T372457) (owner: Filippo Giunchedi)
[06:54:03] <wikibugs>	 (CR) CI reject: [V:-1] kubernetes: remove master usage of prometheus_all_nodes, access is implicit [puppet] - https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170) (owner: Filippo Giunchedi)
[06:54:21] <wikibugs>	 (CR) CI reject: [V:-1] deployment_server: stop shipping prometheus_nodes for k8s [puppet] - https://gerrit.wikimedia.org/r/1136605 (https://phabricator.wikimedia.org/T389170) (owner: Filippo Giunchedi)
[06:55:24] <godog>	 XioNoX: heh, next time
[06:56:28] <kart_>	 Any recent change with people.wikimedia.org DNS?
[06:56:49] <XioNoX>	 godog: no pb ;)
[06:57:29] <wikibugs>	 (CR) Brouberol: "Bear in mind that removing the puppet code will not stop/delete the systemd timers. It will just stop managing them via puppet." [puppet] - https://gerrit.wikimedia.org/r/1135996 (https://phabricator.wikimedia.org/T390249) (owner: Aleksandar Mastilovic)
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:04:28] <vgutierrez>	 !log rolling upgrade to varnish 7.1.1-1.1~bpo11+wmf3 in eqsin and  codfw - T391334
[07:04:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:32] <stashbot>	 T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334
[07:05:34] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_eqsin
[07:05:37] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[07:06:02] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_eqsin
[07:06:10] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_codfw
[07:06:21] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_codfw
[07:08:26] <godog>	 jelto: I think puppet CI is busted btw
[07:08:45] <godog>	 jelto: compilation errors like these https://integration.wikimedia.org/ci/job/operations-puppet-tests-bullseye/8717/console for modules/profile/spec/classes/profile_gitlab_spec.rb
[07:09:41] <jelto>	 yes I'm currently troubleshooting the issue, give me a sec
[07:10:16] <logmsgbot>	 !log elukey@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ms-be1091.eqiad.wmnet with reason: dcops maintenance
[07:10:39] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops, Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10742385 (elukey) @Jclark-ctr I downtimed the host for two days, please feel free to shut it down when it is convenient for you :)
[07:11:52] <godog>	 jelto: ok no worries, I'm not impacted atm
[07:13:41] <wikibugs>	 (CR) Jelto: [V:+1 C:+2] gitlab: fix type of s3 credentials [puppet] - https://gerrit.wikimedia.org/r/1136359 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[07:15:23] <wikibugs>	 (Abandoned) Filippo Giunchedi: kubernetes: replace prometheus_all_nodes [puppet] - https://gerrit.wikimedia.org/r/1129178 (https://phabricator.wikimedia.org/T389170) (owner: Filippo Giunchedi)
[07:15:50] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] statsd: remove ferm rule for statsd port 8125 [puppet] - https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380) (owner: Cwhite)
[07:16:57] <jelto>	 godog: I think the issue is fixed, let me know when you see the error again
[07:19:15] <wikibugs>	 (CR) Filippo Giunchedi: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170) (owner: Filippo Giunchedi)
[07:19:16] <wikibugs>	 (PS3) Volans: log: notify user on IRC when awaiting input [software/spicerack] - https://gerrit.wikimedia.org/r/1125955
[07:19:16] <wikibugs>	 (PS1) Volans: tests: refactor logging related tests [software/spicerack] - https://gerrit.wikimedia.org/r/1136608
[07:19:23] <godog>	 jelto: ack, thank you
[07:19:30] <wikibugs>	 (CR) Filippo Giunchedi: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1136605 (https://phabricator.wikimedia.org/T389170) (owner: Filippo Giunchedi)
[07:21:13] <wikibugs>	 (CR) Volans: log: notify user on IRC when awaiting input (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1125955 (owner: Volans)
[07:21:25] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] statsd: remove ferm rule for statsd port 8125 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380) (owner: Cwhite)
[07:21:28] <jelto>	 You probably have to rebase to get the fix from https://gerrit.wikimedia.org/r/1136359
[07:21:44] <godog>	 ah yeah of course
[07:21:52] <hashar>	 CI autorebases behind the good :)
[07:22:10] <hashar>	 s/good/hood/
[07:22:25] <hashar>	 what I mean is the patch is first merged against the tip of the target branch (production)
[07:22:35] <hashar>	 and the result is what is fetched by the jobs
[07:22:48] <godog>	 ah thank you hashar I didn't realize that was the case
[07:23:02] <jelto>	 Oh it's a different error now, it's complaining about the string length
[07:23:07] <hashar>	 so you can `recheck` to verify the new state
[07:23:42] <hashar>	 but of course pressing `Rebase` is conveniently one click away and will ultimately end up with the same state
[07:24:13] <wikibugs>	 (CR) Vgutierrez: P:durum: add conditional to enable ECH (durum2002) (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[07:25:30] <wikibugs>	 (PS1) Brouberol: wikistatsv1: remove /srv/stats.wikimedia.org/htdocs/v2 directory [puppet] - https://gerrit.wikimedia.org/r/1136639 (https://phabricator.wikimedia.org/T389107)
[07:25:31] <wikibugs>	 (PS1) Brouberol: wikistatsv1: remove htdocs/v2 link from puppet [puppet] - https://gerrit.wikimedia.org/r/1136640 (https://phabricator.wikimedia.org/T389107)
[07:25:32] <wikibugs>	 (PS1) Brouberol: wikistatsv2: move all content under /srv/stats.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1136641 (https://phabricator.wikimedia.org/T389107)
[07:25:32] <wikibugs>	 (PS1) Brouberol: wikistatsv2: remove assets from htdocs [puppet] - https://gerrit.wikimedia.org/r/1136642 (https://phabricator.wikimedia.org/T389107)
[07:25:33] <wikibugs>	 (PS1) Brouberol: wikistatsv2: remove htdocs [puppet] - https://gerrit.wikimedia.org/r/1136643 (https://phabricator.wikimedia.org/T389107)
[07:25:34] <wikibugs>	 (PS1) Brouberol: wikistatsv1: remove old resources from puppet [puppet] - https://gerrit.wikimedia.org/r/1136644 (https://phabricator.wikimedia.org/T389107)
[07:25:55] <wikibugs>	 (CR) CI reject: [V:-1] wikistatsv1: remove /srv/stats.wikimedia.org/htdocs/v2 directory [puppet] - https://gerrit.wikimedia.org/r/1136639 (https://phabricator.wikimedia.org/T389107) (owner: Brouberol)
[07:26:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:27:15] <wikibugs>	 (PS2) Brouberol: wikistatsv1: remove /srv/stats.wikimedia.org/htdocs/v2 directory [puppet] - https://gerrit.wikimedia.org/r/1136639 (https://phabricator.wikimedia.org/T389107)
[07:27:15] <wikibugs>	 (PS2) Brouberol: wikistatsv1: remove htdocs/v2 link from puppet [puppet] - https://gerrit.wikimedia.org/r/1136640 (https://phabricator.wikimedia.org/T389107)
[07:27:15] <wikibugs>	 (PS2) Brouberol: wikistatsv2: move all content under /srv/stats.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1136641 (https://phabricator.wikimedia.org/T389107)
[07:27:16] <wikibugs>	 (PS2) Brouberol: wikistatsv2: remove assets from htdocs [puppet] - https://gerrit.wikimedia.org/r/1136642 (https://phabricator.wikimedia.org/T389107)
[07:27:17] <wikibugs>	 (PS2) Brouberol: wikistatsv2: remove htdocs [puppet] - https://gerrit.wikimedia.org/r/1136643 (https://phabricator.wikimedia.org/T389107)
[07:27:18] <wikibugs>	 (PS2) Brouberol: wikistatsv1: remove old resources from puppet [puppet] - https://gerrit.wikimedia.org/r/1136644 (https://phabricator.wikimedia.org/T389107)
[07:28:06] <Emperor>	 !log make sure all disks are mounted correctly prior to disk-swap testing T391854
[07:28:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:10] <stashbot>	 T391854: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854
[07:28:14] <Emperor>	 !log make sure all disks are mounted correctly prior to disk-swap testing T391854 ms-be1091
[07:28:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:39] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[07:28:45] <wikibugs>	 (03PS1) 10Jelto: ceph: remove Ceph::S3::Credential String length constraints [puppet] - 10https://gerrit.wikimedia.org/r/1136657 (https://phabricator.wikimedia.org/T378922)
[07:29:13] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136639 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[07:29:16] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136640 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[07:29:18] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136641 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[07:29:21] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136642 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[07:29:24] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136643 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[07:29:27] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136644 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[07:29:35] <jelto>	 Brouberol: I broke Puppet CI, give me a minute
[07:29:56] <brouberol>	 haha, no problem, take your time
[07:30:18] <wikibugs>	 (03CR) 10Jelto: [C:03+2] ceph: remove Ceph::S3::Credential String length constraints [puppet] - 10https://gerrit.wikimedia.org/r/1136657 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[07:30:27] <wikibugs>	 06SRE, 06Data-Platform-SRE: archiva1002 - disk 98% full - https://phabricator.wikimedia.org/T391904#10742449 (10LSobanski)
[07:31:39] <wikibugs>	 06SRE, 06Data-Platform-SRE: archiva1002 - disk 98% full - https://phabricator.wikimedia.org/T391904#10742454 (10LSobanski) Archiva is on a path to deprecation so this is likely an ask to disable the alerting altogether.
[07:31:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:32:07] <wikibugs>	 (03CR) 10Jelto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi)
[07:32:22] <wikibugs>	 (03CR) 10Jelto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1136604 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi)
[07:34:51] <jelto>	 It looks like Puppet CI is happy again
[07:35:44] <godog>	 neat, thank you
[07:38:14] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] airflow-test-k8s: adjust dag/file processing timeout to account for large v1 dumps dags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136351 (https://phabricator.wikimedia.org/T391744) (owner: 10Brouberol)
[07:38:47] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: adjust dag/file processing timeout to account for large v1 dumps dags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136351 (https://phabricator.wikimedia.org/T391744) (owner: 10Brouberol)
[07:39:26] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 47, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:39:58] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 114, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:40:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:41:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1136605 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi)
[07:43:39] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[07:43:39] <duesen>	 jelto: speaking of puppet - would you merge my personal config files? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134638
[07:43:59] <duesen>	 jelto: is there a process for getting that kind of thing deployed?
[07:45:09] <duesen>	 it's not urgent, it has just been sitting there for a while, and I'm looking for a way to move it forward.
[07:47:38] <wikibugs>	 (03CR) 10Jelto: [C:03+2] ~daniel: Always run screen [puppet] - 10https://gerrit.wikimedia.org/r/1134638 (owner: 10Daniel Kinzler)
[07:47:50] <jelto>	 duesen: I can merge this change in a sec
[07:47:52] <godog>	 !log upgrade thanos to 0.38.0 on prometheus100[57] - T383966
[07:47:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:55] <stashbot>	 T383966: Upgrade Thanos to 0.38.0 - https://phabricator.wikimedia.org/T383966
[07:48:39] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:48:41] <godog>	 duesen: process is grabbing a slot during https://wikitech.wikimedia.org/wiki/Puppet_request_window
[07:49:30] <duesen>	 godog: ah, thanks! I guess I once knew that ;)
[07:49:33] <jelto>	 godog: Should I wait for the next window?
[07:49:41] <godog>	 jelto: no please go ahead
[07:50:04] <godog>	 duesen: heheh, used sparingly, poking oncall/clinic duty has also been known to work :D
[07:50:30] * duesen pokes sparingly
[07:55:06] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.130 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:55:09] <jelto>	 duesen: your new screen config should be available in the next 30 minutes. I merged the change
[07:57:08] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 82, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:58:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:58:44] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:59:03] <vgutierrez>	 topranks, XioNoX: cr2-eqin took a nap?
[07:59:07] <vgutierrez>	 *eqsin
[07:59:43] <vgutierrez>	 we are getting purged alerts in eqsin as well.. looks like we have some connectivity issues
[07:59:48] <fabfur>	 yes
[08:00:39] <XioNoX>	 vgutierrez: looking
[08:03:39] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service restbase1044-c:7000 has failed probes (tcp_cassandra_c_ssl_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:03:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:03:55] <XioNoX>	 vgutierrez: there is increased bw on the eqsin-codfw link: https://grafana.wikimedia.org/goto/DS2l63AHR?orgId=1 that could cause saturation and packet loss
[08:03:58] <vgutierrez>	 we had some timeouts trying to reach codfw (Apr 15 08:02:16 cp5017 purged[2028236]: %4|1744704136.818|REQTMOUT|purged#consumer-1| [thrd:ssl://kafka-main2009.codfw.wmnet:9093/bootstrap]: ssl://kafka-main2009.codfw.wmnet:9093/2004: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests)
[08:04:53] <XioNoX>	 but it shouldn't be enough to have an actual impact
[08:04:56] <fabfur>	 losing ~10% of pings
[08:05:18] <fabfur>	 now seems better
[08:05:21] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: convert the scheduler liveness/readiness checks to a tcpCheck [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136336 (https://phabricator.wikimedia.org/T391497) (owner: 10Brouberol)
[08:05:26] <fabfur>	 also for latency
[08:05:36] <fabfur>	 much more stable
[08:05:36] <vgutierrez>	 pings between where and where?
[08:05:44] <fabfur>	 eqsin -> codfw
[08:06:47] <vgutierrez>	 https://grafana.wikimedia.org/goto/uAikge0HR?orgId=1
[08:06:54] <vgutierrez>	 that doesn't look great
[08:07:26] <XioNoX>	 vgutierrez: yeah was going to share https://grafana.wikimedia.org/goto/2HXzR60NR?orgId=1
[08:07:43] <XioNoX>	 weird thing is that ulsfo is having the same issue while it's a different link/router
[08:07:51] <vgutierrez>	 XioNoX: hmm latency is significantly worse over ip6 :]
[08:08:30] <vgutierrez>	 https://grafana.wikimedia.org/goto/_b3WgeAHg?orgId=1 VS https://grafana.wikimedia.org/goto/sG5GgeAHg?orgId=1
[08:08:33] <XioNoX>	 but looks like whatever happened it improved (still looking)
[08:08:42] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[08:09:05] <logmsgbot>	 !log dcausse@deploy1003 Started deploy [wdqs/wdqs@4186ae7]: test deploy new scap config to wdqs2025.codfw.wmnet (T221709)
[08:09:08] <stashbot>	 T221709: scap service restarts for WDQS are inconsistent - https://phabricator.wikimedia.org/T221709
[08:09:13] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[08:09:23] <logmsgbot>	 !log dcausse@deploy1003 Finished deploy [wdqs/wdqs@4186ae7]: test deploy new scap config to wdqs2025.codfw.wmnet (T221709) (duration: 00m 18s)
[08:09:45] <XioNoX>	 yeah I don't get why ulsfo is going through codfw to reach eqsin
[08:12:11] <wikibugs>	 (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1134173 (owner: 10Hashar)
[08:12:24] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] "Thank you, this is great!" [puppet] - 10https://gerrit.wikimedia.org/r/1136413 (https://phabricator.wikimedia.org/T380485) (owner: 10Scott French)
[08:13:09] <wikibugs>	 (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1133995 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15)
[08:17:02] <XioNoX>	 vgutierrez: ok I get it more now. Looks like Arelion was having issues, I'm going to put it in a normal state and not a "preferred" state. Then if the issue happens again we can drain it
[08:19:07] <vgutierrez>	 ack
[08:32:48] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:33:46] <icinga-wm>	 PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.131 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:35:00] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Add BFD down alerting [alerts] - 10https://gerrit.wikimedia.org/r/1134664 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi)
[08:35:46] <icinga-wm>	 RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:36:37] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Comm Error: Backplane 0 on cirrussearch2091 (Row/Rack A7) - https://phabricator.wikimedia.org/T391639#10742637 (10Gehel)
[08:36:39] <wikibugs>	 (03Merged) 10jenkins-bot: Add BFD down alerting [alerts] - 10https://gerrit.wikimedia.org/r/1134664 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi)
[08:36:50] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2007.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2007.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:37:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:38:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[08:40:07] <wikibugs>	 (03CR) 10Jaime Nuche: "The repo is under the "releng" directory: `/srv/deployment/releng/jenkins-deploy`" [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn)
[08:40:24] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Adding Moritz per my understanding that "if you request any new level of sudo privileges for a group (or for yourself individually, outsid" [puppet] - 10https://gerrit.wikimedia.org/r/1130947 (https://phabricator.wikimedia.org/T387823) (owner: 10Hashar)
[08:40:38] <XioNoX>	 looks like it's back.... vgutierrez 
[08:41:38] <vgutierrez>	 yep
[08:42:01] <XioNoX>	 !log drain arelion eqsin-codfw link
[08:42:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:43:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[08:44:29] <XioNoX>	 vgutierrez: done, let's see if it improves
[08:44:40] <vgutierrez>	 XioNoX: see _security
[08:47:42] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:48:39] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:51:11] <logmsgbot>	 !log dcausse@deploy1003 Started deploy [wdqs/wdqs@4186ae7] (wcqs): test deploy new scap config to wcqs2001.codfw.wmnet (T221709)
[08:51:16] <stashbot>	 T221709: scap service restarts for WDQS are inconsistent - https://phabricator.wikimedia.org/T221709
[08:51:31] <logmsgbot>	 !log dcausse@deploy1003 Finished deploy [wdqs/wdqs@4186ae7] (wcqs): test deploy new scap config to wcqs2001.codfw.wmnet (T221709) (duration: 00m 20s)
[08:57:02] <jnuche>	 jouncebot: nowandnext
[08:57:02] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 2 minute(s)
[08:57:02] <jouncebot>	 In 1 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1000)
[08:57:30] <jinxer-wm>	 FIRING: Primary inbound port utilisation over 80%  #page: Alert for device cr4-ulsfo.wikimedia.org - Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[08:58:30] <jinxer-wm>	 FIRING: Primary outbound port utilisation over 80%  #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[09:00:04] <jelto>	 I started a sync from gitlab1003 to ceph/apus which seems to be doing 400MB/s. But that should not affect ulsfo or codfw
[09:00:59] <logmsgbot>	 !log jnuche@deploy1003 Started scap sync-world: testwikis to 1.44.0-wmf.25  refs T386220
[09:01:02] <stashbot>	 T386220: 1.44.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T386220
[09:01:36] <jnuche>	 ^ train presync failed last night, I'm rerunning it
[09:03:30] <jinxer-wm>	 RESOLVED: Primary outbound port utilisation over 80%  #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[09:07:30] <jinxer-wm>	 RESOLVED: Primary inbound port utilisation over 80%  #page: Device cr4-ulsfo.wikimedia.org recovered from Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[09:11:22] <icinga-wm>	 PROBLEM - Disk space on deploy1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/51e26c1e0f39e1935a3cafc60f73aa272a120b6c331359bfc3f18088bc2045c0/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops
[09:11:52] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] zookeeper: onboard an-conf1004 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135025 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene)
[09:12:02] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] zookeeper: onboard an-conf1005 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135026 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene)
[09:12:13] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] zookeeper: onboard an-conf1006 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135027 (https://phabricator.wikimedia.org/T374922) (owner: 10Stevemunene)
[09:13:39] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[09:15:35] <logmsgbot>	 !log jnuche@deploy1003 sync-world aborted: testwikis to 1.44.0-wmf.25  refs T386220 (duration: 14m 36s)
[09:15:39] <stashbot>	 T386220: 1.44.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T386220
[09:19:47] <wikibugs>	 (03CR) 10Marostegui: sanitarium_restart.py: restart Sanitarium hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto)
[09:23:12] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - No response from remote host 198.35.26.193 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:23:23] <vgutierrez>	 uh?
[09:27:34] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Idle https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:27:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:28:58] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:29:50] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:32:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:39:45] <wikibugs>	 (03CR) 10Federico Ceratto: "Replied to a comment - no new code changes introduced." [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto)
[09:41:26] <dcausse>	 jouncebot: nowandnext
[09:41:26] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 18 minute(s)
[09:41:26] <jouncebot>	 In 0 hour(s) and 18 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1000)
[09:43:22] <logmsgbot>	 !log dcausse@deploy1003 Started deploy [wdqs/wdqs@fe88851]: version 0.3.156 (T326311)
[09:43:26] <stashbot>	 T326311: Deletion of Lexemes appears to leak triples related to its forms and senses - https://phabricator.wikimedia.org/T326311
[09:49:19] <wikibugs>	 (03PS1) 10AikoChou: ml-services: update edit-check image with shap values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136669 (https://phabricator.wikimedia.org/T387984)
[09:51:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:53:42] <Amir1>	 jouncebot: nowandnext
[09:53:42] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 6 minute(s)
[09:53:42] <jouncebot>	 In 0 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1000)
[09:54:26] <claime>	 Amir1: not sure you're gonna be able to deploy
[09:54:28] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[09:54:28] <claime>	 https://phabricator.wikimedia.org/T390251 is acting up
[09:54:35] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[09:54:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T391056)', diff saved to https://phabricator.wikimedia.org/P75005 and previous config saved to /var/cache/conftool/dbconfig/20250415-095442-fceratto.json
[09:54:46] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[09:54:59] <claime>	 although maybe just a backport will go through where the train didn't
[09:55:01] <wikibugs>	 (03PS1) 10Ladsgroup: Bump thumbnail steps to 95% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136670 (https://phabricator.wikimedia.org/T360589)
[09:55:03] <Amir1>	 :/
[09:55:10] <Amir1>	 Do you want me to try?
[09:55:16] <claime>	 jnuche: ^ ?
[09:56:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:56:51] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T391056)', diff saved to https://phabricator.wikimedia.org/P75006 and previous config saved to /var/cache/conftool/dbconfig/20250415-095650-fceratto.json
[09:57:54] <logmsgbot>	 !log dcausse@deploy1003 Finished deploy [wdqs/wdqs@fe88851]: version 0.3.156 (T326311) (duration: 14m 31s)
[09:57:57] <stashbot>	 T326311: Deletion of Lexemes appears to leak triples related to its forms and senses - https://phabricator.wikimedia.org/T326311
[09:58:27] <logmsgbot>	 !log dcausse@deploy1003 Started deploy [wdqs/wdqs@fe88851] (wcqs): version 0.3.156
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1000)
[10:00:52] <logmsgbot>	 !log dcausse@deploy1003 Finished deploy [wdqs/wdqs@fe88851] (wcqs): version 0.3.156 (duration: 02m 25s)
[10:00:54] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[10:01:07] <jnuche>	 Amir1, claime: yeah, chances are you will run into the same issue
[10:01:15] <Amir1>	 :(
[10:01:22] <Amir1>	 I can wait then
[10:01:22] <claime>	 jnuche: even for just a backport?
[10:01:31] <claime>	 it shouldn't be rebuilding the whole image
[10:01:43] <wikibugs>	 (03PS1) 10Volans: docstrings: remove types from docstrings [software/homer] - 10https://gerrit.wikimedia.org/r/1136673
[10:01:50] <claime>	 which is kind of the not-really-deterministic trigger for this
[10:05:24] <jnuche>	 Amir1, claime: from my side it's okay to try. But if it fails I'd ask that you create a revert in gerrit for the backport change
[10:05:45] <claime>	 ack
[10:07:41] <jinxer-wm>	 FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[10:11:38] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update edit-check image with shap values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136669 (https://phabricator.wikimedia.org/T387984) (owner: 10AikoChou)
[10:11:58] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P75007 and previous config saved to /var/cache/conftool/dbconfig/20250415-101158-fceratto.json
[10:14:54] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[10:15:33] <wikibugs>	 (03CR) 10Ladsgroup: "gentle ping" [puppet] - 10https://gerrit.wikimedia.org/r/1135107 (https://phabricator.wikimedia.org/T390954) (owner: 10Ladsgroup)
[10:15:50] <claime>	 Amir1: Go ahead and try your backport
[10:16:07] <claime>	 we'll revert if the registry is still fucking up
[10:16:22] <Amir1>	 sure
[10:17:05] <wikibugs>	 (03PS1) 10Hnowlan: trafficserver: route various miscellaneous pcs services to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1136676 (https://phabricator.wikimedia.org/T385033)
[10:17:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136670 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[10:18:17] <wikibugs>	 (03Merged) 10jenkins-bot: Bump thumbnail steps to 95% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136670 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[10:19:07] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1136670|Bump thumbnail steps to 95% (T360589)]]
[10:19:10] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[10:21:11] <wikibugs>	 (03CR) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[10:24:23] <logmsgbot>	 !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=1) rolling upgrade of Varnish on A:cp-text_eqsin
[10:26:29] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp[5023-5024].eqsin.wmnet} and A:cp
[10:27:05] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P75008 and previous config saved to /var/cache/conftool/dbconfig/20250415-102705-fceratto.json
[10:28:39] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:31:22] <icinga-wm>	 PROBLEM - Disk space on deploy1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/51e26c1e0f39e1935a3cafc60f73aa272a120b6c331359bfc3f18088bc2045c0/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops
[10:32:09] <Amir1>	 claime: ^
[10:32:37] <Amir1>	 and 13 minutes stuck on this
[10:32:38] <Amir1>	 > 10:19:34 K8s images build/push output redirected to /home/ladsgroup/scap-image-build-and-push-log
[10:32:44] <claime>	 Amir1: yeah that's... happened a few times and I haven't figured out why
[10:33:17] <Amir1>	 I ctrl+c'd now
[10:33:18] <logmsgbot>	 !log ladsgroup@deploy1003 sync-world aborted: Backport for [[gerrit:1136670|Bump thumbnail steps to 95% (T360589)]] (duration: 14m 11s)
[10:33:21] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[10:33:29] <wikibugs>	 (03PS22) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378)
[10:33:37] <claime>	 try it again
[10:33:39] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:33:41] <wikibugs>	 (03PS1) 10Ladsgroup: Revert "Bump thumbnail steps to 95%" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136678
[10:33:45] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Revert "Bump thumbnail steps to 95%" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136678 (owner: 10Ladsgroup)
[10:34:00] <claime>	 or revert :D
[10:34:01] <wikibugs>	 (03CR) 10Ladsgroup: Revert "Bump thumbnail steps to 95%" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136678 (owner: 10Ladsgroup)
[10:34:10] <Amir1>	 I stopped the revert :D
[10:34:27] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5300/co" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[10:34:32] <claime>	 I'm kind of at a loss as to what we can do to fix this
[10:34:35] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1136670|Bump thumbnail steps to 95% (T360589)]]
[10:36:38] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] P:durum: add conditional to enable ECH (durum2002) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[10:37:39] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_codfw
[10:37:45] <wikibugs>	 (03PS9) 10Fabfur: benthos: install benthos on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332)
[10:38:24] <wikibugs>	 (03CR) 10AikoChou: [C:03+2] ml-services: update edit-check image with shap values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136669 (https://phabricator.wikimedia.org/T387984) (owner: 10AikoChou)
[10:38:49] <sukhe>	 !log sudo cumin 'A:durum' 'disable-puppet "rolling out CR 1132669"'
[10:38:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:44] <logmsgbot>	 !log ladsgroup@deploy1003 sync-world aborted: Backport for [[gerrit:1136670|Bump thumbnail steps to 95% (T360589)]] (duration: 05m 08s)
[10:39:47] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[10:39:49] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Revert "Bump thumbnail steps to 95%" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136678 (owner: 10Ladsgroup)
[10:40:05] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[10:40:08] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update edit-check image with shap values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136669 (https://phabricator.wikimedia.org/T387984) (owner: 10AikoChou)
[10:40:17] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_codfw
[10:40:19] <claime>	 I'm gonna try something
[10:40:39] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Bump thumbnail steps to 95%" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136678 (owner: 10Ladsgroup)
[10:40:53] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] "looking good, few nitpicks" [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur)
[10:41:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:41:38] <sukhe>	 !log enable puppet on durum2002
[10:41:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:13] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T391056)', diff saved to https://phabricator.wikimedia.org/P75009 and previous config saved to /var/cache/conftool/dbconfig/20250415-104212-fceratto.json
[10:42:16] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[10:42:29] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[10:42:36] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T391056)', diff saved to https://phabricator.wikimedia.org/P75010 and previous config saved to /var/cache/conftool/dbconfig/20250415-104235-fceratto.json
[10:43:52] <wikibugs>	 (03PS10) 10Fabfur: cache: install benthos on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332)
[10:44:29] <wikibugs>	 (03CR) 10Fabfur: cache: install benthos on all cp hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur)
[10:44:36] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135643 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur)
[10:46:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:51:05] <wikibugs>	 (03PS13) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805)
[10:51:22] <icinga-wm>	 RECOVERY - Disk space on deploy1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops
[10:52:26] <vgutierrez>	 !log rolling upgrade to varnish 7.1.1-1.1~bpo11+wmf3 in drmrs - T391334
[10:52:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:30] <stashbot>	 T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334
[10:52:37] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_drmrs
[10:52:44] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_drmrs
[10:56:44] <wikibugs>	 (03CR) 10Jelto: [C:03+2] wikidata-query-gui: add query-legacy-full to existing gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135383 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto)
[10:58:08] <wikibugs>	 (03CR) 10CI reject: [V:04-1] upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto)
[10:58:37] <wikibugs>	 (03Merged) 10jenkins-bot: wikidata-query-gui: add query-legacy-full to existing gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135383 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto)
[10:58:50] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_eqsin
[10:59:41] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T391056)', diff saved to https://phabricator.wikimedia.org/P75011 and previous config saved to /var/cache/conftool/dbconfig/20250415-105941-fceratto.json
[10:59:44] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[11:00:14] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] magru: remove novaacore/momentum [homer/public] - 10https://gerrit.wikimedia.org/r/1136152 (https://phabricator.wikimedia.org/T381913) (owner: 10Ayounsi)
[11:01:42] <wikibugs>	 (03PS1) 10Fabfur: cache: use fqdn in syslog hostname [puppet] - 10https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571)
[11:01:58] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571) (owner: 10Fabfur)
[11:03:02] <wikibugs>	 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 4 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10742996 (10Ifrahkhanyaree_WMDE)
[11:04:39] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10743021 (10Ladsgroup) >>! In T355914#10738719, @hgzh wrote: > I tried an onwiki answer, so thank you for the reply here. But IMO this could have been ann...
[11:05:27] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp[5023-5024].eqsin.wmnet} and A:cp
[11:06:30] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[11:06:40] <vgutierrez>	 ^^ that's probably sukhe 
[11:06:43] <sukhe>	 yes
[11:06:51] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: networktests: extend to check DNS, LDAP, internet, etc [puppet] - 10https://gerrit.wikimedia.org/r/1136681 (https://phabricator.wikimedia.org/T391325)
[11:06:58] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:06:58] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:07:15] <logmsgbot>	 !log sukhe@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on durum2002.codfw.wmnet with reason: testing
[11:07:29] <logmsgbot>	 !log cgoubert@deploy1003 Started scap sync-world: test rebuild to look at logs
[11:07:39] <vgutierrez>	 !log rolling upgrade to varnish 7.1.1-1.1~bpo11+wmf3 in esams - T391334
[11:07:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:42] <stashbot>	 T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334
[11:08:19] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_esams and not P{cp3073.esams.wmnet} and A:cp
[11:08:40] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_esams and not P{cp3081.esams.wmnet} and A:cp
[11:11:37] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: networktests: extend to check DNS, LDAP, internet, etc [puppet] - 10https://gerrit.wikimedia.org/r/1136681 (https://phabricator.wikimedia.org/T391325)
[11:12:29] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: networktests: extend to check DNS, LDAP, internet, etc [puppet] - 10https://gerrit.wikimedia.org/r/1136681 (https://phabricator.wikimedia.org/T391325) (owner: 10Arturo Borrero Gonzalez)
[11:14:48] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P75012 and previous config saved to /var/cache/conftool/dbconfig/20250415-111447-fceratto.json
[11:16:25] <wikibugs>	 (03PS14) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805)
[11:16:54] <wikibugs>	 (03PS1) 10Ssingh: Revert "P:durum: add conditional to enable ECH (durum2002)" [puppet] - 10https://gerrit.wikimedia.org/r/1136684
[11:17:49] <wikibugs>	 (03PS15) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805)
[11:18:06] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] docstrings: remove types from docstrings [software/homer] - 10https://gerrit.wikimedia.org/r/1136673 (owner: 10Volans)
[11:18:39] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[11:18:48] <wikibugs>	 (03CR) 10Volans: [C:03+2] docstrings: remove types from docstrings [software/homer] - 10https://gerrit.wikimedia.org/r/1136673 (owner: 10Volans)
[11:20:23] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: networktests: fix yaml typos [puppet] - 10https://gerrit.wikimedia.org/r/1136685 (https://phabricator.wikimedia.org/T391325)
[11:21:22] <icinga-wm>	 PROBLEM - Disk space on deploy1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/51e26c1e0f39e1935a3cafc60f73aa272a120b6c331359bfc3f18088bc2045c0/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops
[11:21:25] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: networktests: fix yaml typos [puppet] - 10https://gerrit.wikimedia.org/r/1136685 (https://phabricator.wikimedia.org/T391325) (owner: 10Arturo Borrero Gonzalez)
[11:23:48] <wikibugs>	 (03PS16) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805)
[11:24:44] <logmsgbot>	 !log jelto@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
[11:24:52] <logmsgbot>	 !log jelto@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
[11:25:03] <logmsgbot>	 !log jelto@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
[11:25:19] <logmsgbot>	 !log jelto@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
[11:25:27] <logmsgbot>	 !log jelto@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
[11:25:56] <logmsgbot>	 !log jelto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
[11:26:39] <wikibugs>	 (03CR) 10Federico Ceratto: "Ok, I updated the code as required and tested it with real runs before and with dry-run in the last changes." [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto)
[11:28:39] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[11:29:55] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P75013 and previous config saved to /var/cache/conftool/dbconfig/20250415-112955-fceratto.json
[11:30:02] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: networktests: more YAML formatting [puppet] - 10https://gerrit.wikimedia.org/r/1136687 (https://phabricator.wikimedia.org/T391325)
[11:30:07] <wikibugs>	 (03Merged) 10jenkins-bot: docstrings: remove types from docstrings [software/homer] - 10https://gerrit.wikimedia.org/r/1136673 (owner: 10Volans)
[11:33:39] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[11:37:19] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] Revert "P:durum: add conditional to enable ECH (durum2002)" [puppet] - 10https://gerrit.wikimedia.org/r/1136684 (owner: 10Ssingh)
[11:37:26] <elukey>	 claime, Amir1 o/ were you able to finish the deploy?
[11:37:30] <claime>	 elukey: no
[11:37:42] <claime>	 I just did a sync-world with no file to get a push
[11:37:50] <claime>	 it did manage to push the images in about 15 minutes
[11:38:06] <claime>	 but then it failed deploying to testservers because of the bad blob in dragonfly
[11:39:12] <elukey>	 is it still ongoing? Because I may have a workaround in mind
[11:39:20] <claime>	 no, it's failed now
[11:39:30] <claime>	 you can go ahead
[11:39:49] <elukey>	 nono it was more a manual fix for the workers failing to get the right blob
[11:40:21] <elukey>	 when the pull failures happen, we can try waiting 5 minutes and then explicitly kill the failed pods
[11:40:36] <claime>	 ah
[11:40:48] <elukey>	 if our theory of the dragonfly involvement is true, they should trigger another pull
[11:40:52] <claime>	 I'll run a scap sync-world again
[11:40:52] <elukey>	 a "fresh" one
[11:40:54] <claime>	 we'll see
[11:41:07] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: networktests: more YAML formatting [puppet] - 10https://gerrit.wikimedia.org/r/1136687 (https://phabricator.wikimedia.org/T391325) (owner: 10Arturo Borrero Gonzalez)
[11:41:22] <icinga-wm>	 RECOVERY - Disk space on deploy1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops
[11:41:37] <claime>	 do we have a way to force a redeploy of the latest image, even though scap didn't update the release file?
[11:42:10] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host durum2002.codfw.wmnet with OS bookworm
[11:42:26] <elukey>	 no idea
[11:43:03] <claime>	 ok I'm gonna update the release files manually
[11:43:20] <claime>	 then run a scap without build
[11:43:48] <claime>	 I'm not even sure what I'm trying to achieve anymore... that will just work now that dragonfly has evicted the blob
[11:44:15] <claime>	 also we can't ask deployers to wait 5 minutes looking at kubectl get pods for all debug envs, then delete the ones misbehaving
[11:44:21] <claime>	 this is very problematic
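The manual workaround discussed above (wait about five minutes, then delete the pods that failed to pull their image) could be sketched roughly as follows. This is only an illustration under stated assumptions, not what scap or any existing tooling does: the function name, the pod names, and the use of the pod creation time as a stand-in for "has been failing for five minutes" are all hypothetical. The input is assumed to be the parsed output of `kubectl get pods -o json`.

```python
from datetime import datetime, timedelta, timezone

# Waiting reasons the kubelet reports when an image pull fails.
PULL_FAILURE_REASONS = {"ErrImagePull", "ImagePullBackOff"}


def parse_k8s_time(ts):
    """Parse a Kubernetes timestamp such as 2025-04-15T11:38:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stuck_pull_pods(pod_list, now, grace=timedelta(minutes=5)):
    """Return names of pods older than `grace` with a container stuck on an image pull.

    Assumption: pod age (creationTimestamp) approximates how long the pull
    has been failing. `pod_list` is the dict from `kubectl get pods -o json`.
    """
    names = []
    for pod in pod_list.get("items", []):
        age = now - parse_k8s_time(pod["metadata"]["creationTimestamp"])
        if age < grace:
            continue  # too young, give the pull more time
        statuses = pod.get("status", {}).get("containerStatuses", [])
        reasons = {s.get("state", {}).get("waiting", {}).get("reason") for s in statuses}
        if reasons & PULL_FAILURE_REASONS:
            names.append(pod["metadata"]["name"])
    return names


# Hypothetical pods, for illustration only.
example = {
    "items": [
        {
            "metadata": {"name": "mw-debug-a", "creationTimestamp": "2025-04-15T11:38:00Z"},
            "status": {"containerStatuses": [{"state": {"waiting": {"reason": "ImagePullBackOff"}}}]},
        },
        {
            "metadata": {"name": "mw-debug-b", "creationTimestamp": "2025-04-15T11:43:00Z"},
            "status": {"containerStatuses": [{"state": {"waiting": {"reason": "ErrImagePull"}}}]},
        },
        {
            "metadata": {"name": "mw-debug-c", "creationTimestamp": "2025-04-15T11:30:00Z"},
            "status": {"containerStatuses": [{"state": {"running": {}}}]},
        },
    ]
}
now = datetime(2025, 4, 15, 11, 45, tzinfo=timezone.utc)
print(stuck_pull_pods(example, now))
```

Each returned name would then be passed to `kubectl delete pod` so that the replacement pod triggers a fresh pull. The sketch only selects candidates and deletes nothing itself, which keeps the destructive step explicit.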
[11:45:02] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T391056)', diff saved to https://phabricator.wikimedia.org/P75014 and previous config saved to /var/cache/conftool/dbconfig/20250415-114501-fceratto.json
[11:45:06] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[11:45:09] <sukhe>	 !log sudo cumin 'A:durum and not P{durum2002*}' 'run-puppet-agent --enable "rolling out CR 1132669"'
[11:45:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:18] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[11:47:19] <elukey>	 claime: yes I agree, but in theory danc*y is working on a solution to automatically force scap to pull the new images, and once that works proceed
[11:47:24] <elukey>	 it may alleviate the problem
[11:48:39] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:48:48] <claime>	 yeah, it may
[11:58:11] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum2002.codfw.wmnet with reason: host reimage
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1200)
[12:00:07] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[12:00:13] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T391056)', diff saved to https://phabricator.wikimedia.org/P75015 and previous config saved to /var/cache/conftool/dbconfig/20250415-120013-fceratto.json
[12:00:18] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[12:01:55] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum2002.codfw.wmnet with reason: host reimage
[12:02:22] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T391056)', diff saved to https://phabricator.wikimedia.org/P75016 and previous config saved to /var/cache/conftool/dbconfig/20250415-120222-fceratto.json
[12:07:15] <wikibugs>	 (03PS1) 10Elukey: sre.hosts.provision: add a warning for Supermicro Config C [cookbooks] - 10https://gerrit.wikimedia.org/r/1136695 (https://phabricator.wikimedia.org/T387577)
[12:08:26] <wikibugs>	 (03PS1) 10Michael Große: tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136696 (https://phabricator.wikimedia.org/T391695)
[12:09:06] <wikibugs>	 (03PS1) 10Michael Große: perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136698 (https://phabricator.wikimedia.org/T391695)
[12:09:29] <wikibugs>	 (03PS1) 10Michael Große: perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136700 (https://phabricator.wikimedia.org/T391695)
[12:09:58] <wikibugs>	 (03PS1) 10Michael Große: tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136701 (https://phabricator.wikimedia.org/T391695)
[12:10:41] <wikibugs>	 (03PS1) 10Michael Große: perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136702 (https://phabricator.wikimedia.org/T391695)
[12:10:59] <wikibugs>	 (03PS1) 10Michael Große: perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136703 (https://phabricator.wikimedia.org/T391695)
[12:12:33] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136701 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große)
[12:12:47] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136702 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große)
[12:12:55] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10743284 (10Jelto) There were some problems adding the Ceph apus credentials to puppet. It was mostly an issue of wrong file names a...
[12:13:03] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136703 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große)
[12:13:03] <wikibugs>	 (03CR) 10Volans: [C:04-1] "LGTM but missing one needed comma" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136695 (https://phabricator.wikimedia.org/T387577) (owner: 10Elukey)
[12:13:58] <wikibugs>	 (03PS1) 10Ayounsi: gnmic: bump num-workers to 24 [puppet] - 10https://gerrit.wikimedia.org/r/1136704 (https://phabricator.wikimedia.org/T388641)
[12:14:17] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10743317 (10Jelto)
[12:14:58] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:14:58] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:15:12] <wikibugs>	 (03PS2) 10Elukey: sre.hosts.provision: add a warning for Supermicro Config C [cookbooks] - 10https://gerrit.wikimedia.org/r/1136695 (https://phabricator.wikimedia.org/T387577)
[12:15:24] <wikibugs>	 (03CR) 10Elukey: sre.hosts.provision: add a warning for Supermicro Config C (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1136695 (https://phabricator.wikimedia.org/T387577) (owner: 10Elukey)
[12:15:42] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] gnmic: bump num-workers to 24 [puppet] - 10https://gerrit.wikimedia.org/r/1136704 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi)
[12:15:54] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1136704 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi)
[12:16:29] <wikibugs>	 (03PS1) 10FNegri: openstack: Tidy up wmcs-wikireplica-dns script [puppet] - 10https://gerrit.wikimedia.org/r/1136705 (https://phabricator.wikimedia.org/T374953)
[12:17:29] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P75017 and previous config saved to /var/cache/conftool/dbconfig/20250415-121728-fceratto.json
[12:17:30] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:17:58] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:17:58] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:18:06] <topranks>	 hmmm....
[12:18:26] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:18:27] <wikibugs>	 (03PS1) 10Filippo Giunchedi: etcd: replace prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1129177 (https://phabricator.wikimedia.org/T389170)
[12:18:46] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[12:18:58] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:18:58] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:19:27] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] gnmic: bump num-workers to 24 [puppet] - 10https://gerrit.wikimedia.org/r/1136704 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi)
[12:19:29] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[12:20:01] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum2002.codfw.wmnet with OS bookworm
[12:20:42] <wikibugs>	 (03CR) 10Milimetric: [C:03+1] wikistatsv1: remove /srv/stats.wikimedia.org/htdocs/v2 directory [puppet] - 10https://gerrit.wikimedia.org/r/1136639 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[12:20:58] <godog>	 !log upgrade thanos to 0.38.0 on O:prometheus::pop
[12:20:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:05] <godog>	 !log upgrade thanos to 0.38.0 on O:prometheus::pop - T383966
[12:21:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:09] <stashbot>	 T383966: Upgrade Thanos to 0.38.0 - https://phabricator.wikimedia.org/T383966
[12:21:12] <wikibugs>	 (03CR) 10CI reject: [V:04-1] tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136696 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große)
[12:21:13] <wikibugs>	 (03CR) 10Milimetric: [C:03+1] wikistatsv1: remove htdocs/v2 link from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136640 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[12:22:08] <wikibugs>	 (03CR) 10Milimetric: [C:03+1] wikistatsv2: move all content under /srv/stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1136641 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[12:22:27] <wikibugs>	 (03CR) 10Milimetric: [C:03+1] wikistatsv2: remove assets from htdocs [puppet] - 10https://gerrit.wikimedia.org/r/1136642 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[12:22:42] <wikibugs>	 (03CR) 10Milimetric: [C:03+1] wikistatsv2: remove htdocs [puppet] - 10https://gerrit.wikimedia.org/r/1136643 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[12:22:46] <wikibugs>	 (03CR) 10Michael Große: "recheck" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136696 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große)
[12:22:58] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136696 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große)
[12:23:04] <wikibugs>	 (03CR) 10Milimetric: [C:03+1] wikistatsv1: remove old resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136644 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[12:23:17] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136698 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große)
[12:23:29] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136700 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große)
[12:23:45] <wikibugs>	 (03CR) 10Milimetric: [C:03+1] "Ok, chain makes sense and looks good to me.  Just noting here that I'm going to ask in Slack about archiving the content under htdocs." [puppet] - 10https://gerrit.wikimedia.org/r/1136644 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[12:23:54] <wikibugs>	 (03PS1) 10Brouberol: airflow: ensure the pod running in the KubernetesPodOperator itself gets low resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136706 (https://phabricator.wikimedia.org/T391669)
[12:24:48] <wikibugs>	 (03PS2) 10FNegri: openstack: Tidy up wmcs-wikireplica-dns script [puppet] - 10https://gerrit.wikimedia.org/r/1136705 (https://phabricator.wikimedia.org/T374953)
[12:25:17] <wikibugs>	 (03CR) 10Elukey: [C:03+1] log: notify user on IRC when awaiting input (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1125955 (owner: 10Volans)
[12:25:18] <logmsgbot>	 !log cgoubert@deploy1003 Started scap build-images: (no justification provided)
[12:25:59] <wikibugs>	 (03CR) 10Elukey: [C:03+1] tests: refactor logging related tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136608 (owner: 10Volans)
[12:26:31] <logmsgbot>	 !log cgoubert@deploy1003 build-images aborted: (no justification provided) (duration: 01m 12s)
[12:26:33] <logmsgbot>	 !log cgoubert@deploy1003 Started scap build-images: (no justification provided)
[12:26:34] <logmsgbot>	 !log cgoubert@deploy1003 build-images aborted: (no justification provided) (duration: 00m 01s)
[12:26:37] <logmsgbot>	 !log cgoubert@deploy1003 Started scap build-images: (no justification provided)
[12:26:51] <claime>	 Don't mind this, I can't use my fingers apparently
[12:29:22] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10743380 (10Jelto)
[12:31:49] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] wikistatsv1: remove /srv/stats.wikimedia.org/htdocs/v2 directory [puppet] - 10https://gerrit.wikimedia.org/r/1136639 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[12:31:54] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] wikistatsv1: remove htdocs/v2 link from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136640 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[12:32:02] <wikibugs>	 (03CR) 10Brouberol: wikistatsv1: remove htdocs/v2 link from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136640 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[12:32:05] <logmsgbot>	 !log cgoubert@deploy1003 Finished scap build-images: (no justification provided) (duration: 05m 27s)
[12:32:36] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P75018 and previous config saved to /var/cache/conftool/dbconfig/20250415-123236-fceratto.json
[12:33:01] <wikibugs>	 (03CR) 10Elukey: [C:03+1] hosts: add a new hosts module with a Host class [software/spicerack] - 10https://gerrit.wikimedia.org/r/1135763 (owner: 10Volans)
[12:33:04] <logmsgbot>	 !log cgoubert@deploy1003 Started scap sync-world: test rebuild to test swift eventual consistency
[12:33:45] <logmsgbot>	 !log slyngshede@cumin1002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Andy Cooper out of all services on: 2393 hosts
[12:34:43] <claime>	 crap that test won't work, it's the same image
[12:34:49] <wikibugs>	 (03CR) 10Volans: [C:03+1] "In the interest of unblocking the situation between the this CR and I4ce9217392a7795940c981e1ee7da52df026cb5c let's merge this as-is even " [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto)
[12:34:56] <claime>	 well it'll work, but it won't tell us anything
[12:35:36] <wikibugs>	 (03PS1) 10Slyngshede: data.yaml: Offboarding Andy Cooper [puppet] - 10https://gerrit.wikimedia.org/r/1136710
[12:35:39] <claime>	 I have a full-image-build requiring change to push anyways, so I'm gonna do that afterwards
[12:36:27] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] "lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136706 (https://phabricator.wikimedia.org/T391669) (owner: 10Brouberol)
[12:36:50] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2007.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2007.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:37:48] <wikibugs>	 (03CR) 10Mark Bergsma: [C:03+2] data.yaml: Offboarding Andy Cooper [puppet] - 10https://gerrit.wikimedia.org/r/1136710 (owner: 10Slyngshede)
[12:39:49] <wikibugs>	 (03PS2) 10Arnaudb: gerrit: failover cookbook bugfix [cookbooks] - 10https://gerrit.wikimedia.org/r/1136709 (https://phabricator.wikimedia.org/T260666)
[12:39:49] <wikibugs>	 (03CR) 10Arnaudb: "This patch adds a missing element to our logic, to properly handle gerrit's service state." [cookbooks] - 10https://gerrit.wikimedia.org/r/1136709 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb)
[12:39:55] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136695 (https://phabricator.wikimedia.org/T387577) (owner: 10Elukey)
[12:41:59] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.remove-downtime for durum2002.codfw.wmnet
[12:42:00] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for durum2002.codfw.wmnet
[12:43:24] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136133 (https://phabricator.wikimedia.org/T391621) (owner: 10Acamicamacaraca)
[12:44:30] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:45:26] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:47:43] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T391056)', diff saved to https://phabricator.wikimedia.org/P75020 and previous config saved to /var/cache/conftool/dbconfig/20250415-124743-fceratto.json
[12:47:47] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[12:47:55] <claime>	 jouncebot: nowandnext
[12:47:55] <jouncebot>	 For the next 0 hour(s) and 12 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1200)
[12:47:55] <jouncebot>	 In 0 hour(s) and 12 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1300)
[12:47:58] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[12:48:02] <claime>	 god dammit 12 minutes
[12:48:05] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1181 (T391056)', diff saved to https://phabricator.wikimedia.org/P75021 and previous config saved to /var/cache/conftool/dbconfig/20250415-124805-fceratto.json
[12:48:10] <claime>	 well we'll see if backports work ig
[12:48:22] <wikibugs>	 (03CR) 10Volans: [C:03+2] log: notify user on IRC when awaiting input [software/spicerack] - 10https://gerrit.wikimedia.org/r/1125955 (owner: 10Volans)
[12:48:33] <wikibugs>	 (03CR) 10Volans: [C:03+2] tests: refactor logging related tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136608 (owner: 10Volans)
[12:48:39] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:49:47] <logmsgbot>	 !log cgoubert@deploy1003 cgoubert: test rebuild to test swift eventual consistency synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:49:54] <logmsgbot>	 !log cgoubert@deploy1003 cgoubert: Continuing with sync
[12:50:15] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T391056)', diff saved to https://phabricator.wikimedia.org/P75022 and previous config saved to /var/cache/conftool/dbconfig/20250415-125014-fceratto.json
[12:50:16] <wikibugs>	 (03PS1) 10Filippo Giunchedi: logstash: restore forcemerge in curator [puppet] - 10https://gerrit.wikimedia.org/r/1136713 (https://phabricator.wikimedia.org/T391661)
[12:50:32] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:51:01] <wikibugs>	 (03PS1) 10Jelto: wikidata-query-gui: add query-legacy-full.w.o to querybuilder hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136714 (https://phabricator.wikimedia.org/T350793)
[12:52:04] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: ensure the pod running in the KubernetesPodOperator itself gets low resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136706 (https://phabricator.wikimedia.org/T391669) (owner: 10Brouberol)
[12:52:25] <claime>	 jnuche: we're gonna try to run the deployment window with the sleep in place... maybe we can at least deploy with that
[12:52:26] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:52:30] <claime>	 sgty?
[12:53:17] <jnuche>	 claime: sounds good, ty!
[12:53:45] <claime>	 I hate that workaround but I don't have anything better rn
[12:53:54] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "looks good to me, thanks for the addition!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136709 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb)
[12:53:56] <claime>	 We'll try to batch the backports as much as possible
[12:54:35] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[12:54:36] <wikibugs>	 (03CR) 10Jelto: [C:03+2] wikidata-query-gui: add query-legacy-full.w.o to querybuilder hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136714 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto)
[12:54:36] <claime>	 With a little luck my current deploy will be done just in time for the window
[12:55:07] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[12:55:18] <claime>	 robertsky, MichaelG_WMF, please look at your patches and tell me if I can backport any of them in the same scap or if they need staggering
[12:55:24] <claime>	 this is gonna be a long window
[12:55:37] <wikibugs>	 (03PS1) 10DCausse: cirrus-streaming-updater: set upgradeMode to savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136716 (https://phabricator.wikimedia.org/T390853)
[12:55:49] <MichaelG_WMF>	 claime: you can backport them all together
[12:55:54] <claime>	 MichaelG_WMF: awesome
[12:56:04] <robertsky>	 claime, you can do it all together for mine as well.
[12:56:10] <wikibugs>	 (03CR) 10Elukey: [C:03+2] "Tested with test-cookbook :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136695 (https://phabricator.wikimedia.org/T387577) (owner: 10Elukey)
[12:56:13] <claime>	 robertsky: fantastic, thanks
[12:56:23] <MichaelG_WMF>	 claime: (per release that is, so two sets, one for .24 and one for .25)
[12:56:24] <wikibugs>	 (03Merged) 10jenkins-bot: wikidata-query-gui: add query-legacy-full.w.o to querybuilder hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136714 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto)
[12:56:31] <jnuche>	 claime: the sleep is better than having train and backports blocked, it's just a stopgap measure until someone else can take a look. Thanks for doing that
[12:56:33] <claime>	 I'll be back in a minute or two, and will run the window as soon as my current scap is done
[12:56:43] <claime>	 I need a small break x)
[12:57:06] <MichaelG_WMF>	 also, there is nothing to test for mine. They fix a disabled maintenance script which will be re-enabled in a follow-up window
[12:57:07] <robertsky>	 have the break. :)
[12:57:12] <MichaelG_WMF>	 take your time :)
[12:58:08] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: failover cookbook bugfix [cookbooks] - 10https://gerrit.wikimedia.org/r/1136709 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb)
[12:58:21] <wikibugs>	 (03Merged) 10jenkins-bot: log: notify user on IRC when awaiting input [software/spicerack] - 10https://gerrit.wikimedia.org/r/1125955 (owner: 10Volans)
[12:58:55] <wikibugs>	 (03Merged) 10jenkins-bot: tests: refactor logging related tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136608 (owner: 10Volans)
[13:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1300).
[13:00:04] <jouncebot>	 robertsky, MichaelG_WMF, and Aca: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:41] <claime>	 Lucas_WMDE, Urbanecm, and TheresNoTime: I'll run that window as we're having registry issues. I have a scap completing, then we'll start
[13:00:47] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136385 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb)
[13:01:04] <TheresNoTime>	 claime: ack, you'll run this window
[13:01:47] <logmsgbot>	 !log jelto@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
[13:02:01] <logmsgbot>	 !log jelto@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
[13:02:07] <logmsgbot>	 !log jelto@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
[13:02:15] <logmsgbot>	 !log jelto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
[13:02:21] <logmsgbot>	 !log jelto@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
[13:02:28] <logmsgbot>	 !log jelto@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
[13:02:47] <logmsgbot>	 !log cgoubert@deploy1003 Finished scap sync-world: test rebuild to test swift eventual consistency (duration: 30m 09s)
[13:02:53] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1136385 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb)
[13:03:08] <claime>	 ok robertsky starting with your patches
[13:03:19] <robertsky>	 ok
[13:04:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky)
[13:04:06] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky)
[13:04:10] <wikibugs>	 (03PS1) 10Ayounsi: Add CPU/RAM/DISK [puppet] - 10https://gerrit.wikimedia.org/r/1136717 (https://phabricator.wikimedia.org/T388641)
[13:04:49] <Lucas_WMDE>	 o/
[13:04:57] <wikibugs>	 (03Merged) 10jenkins-bot: updating wikimaniawiki namespace configurations: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131038 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky)
[13:04:59] <Lucas_WMDE>	 (ack)
[13:05:01] <wikibugs>	 (03Merged) 10jenkins-bot: update wikimaniawiki perms configurations: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky)
[13:05:21] * claime crosses fingers we can actually deploy
[13:05:22] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P75023 and previous config saved to /var/cache/conftool/dbconfig/20250415-130522-fceratto.json
[13:05:29] <logmsgbot>	 !log cgoubert@deploy1003 Started scap sync-world: Backport for [[gerrit:1131038|updating wikimaniawiki namespace configurations: (T389729)]], [[gerrit:1131119|update wikimaniawiki perms configurations: (T389729)]]
[13:05:32] <stashbot>	 T389729: wikimaniawiki: namespaces for 2027-2028 and other adjustments - https://phabricator.wikimedia.org/T389729
[13:06:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:06:27] <wikibugs>	 (03PS1) 10Ssingh: wikimedia-dns.org: add HTTPS record (test) [dns] - 10https://gerrit.wikimedia.org/r/1136718
[13:07:01] <claime>	 ok pushes went through
[13:07:10] <robertsky>	 sweet
[13:07:18] <claime>	 it's now sleeping for 5 minutes for swift to catch up
[13:07:26] <wikibugs>	 (03PS2) 10Ayounsi: Add CPU/RAM/DISK [puppet] - 10https://gerrit.wikimedia.org/r/1136717 (https://phabricator.wikimedia.org/T388641)
[13:07:26] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] wikimedia-dns.org: add HTTPS record (test) [dns] - 10https://gerrit.wikimedia.org/r/1136718 (owner: 10Ssingh)
[13:07:29] <claime>	 (hence why the window's gonna be a little long)
[13:07:33] <wikibugs>	 (03PS2) 10Ssingh: wikimedia-dns.org: add HTTPS record (test) [dns] - 10https://gerrit.wikimedia.org/r/1136718
[13:07:34] <robertsky>	 ah. ok.
[13:08:08] <claime>	 robertsky: yeah, we're having major issues with the registry, and that's the stopgap measure for being able to possibly deploy stuff
[13:08:15] <claime>	 cf T390251
[13:08:15] <stashbot>	 T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251
[13:08:34] <wikibugs>	 (03CR) 10Ssingh: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1136718 (owner: 10Ssingh)
[13:08:48] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: networktests: refresh FQDN of the neutron virtual router [puppet] - 10https://gerrit.wikimedia.org/r/1136719 (https://phabricator.wikimedia.org/T380174)
[13:09:24] <MichaelG_WMF>	 claime: sorry to hear about the issues with the registry. Could you link me to the task? 
[13:09:26] <wikibugs>	 (03Merged) 10jenkins-bot: gerrit: failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1136385 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb)
[13:09:28] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[13:09:34] <claime>	 MichaelG_WMF: T390251
[13:09:39] <MichaelG_WMF>	 Thanks!
[13:09:44] <wikibugs>	 (03Merged) 10jenkins-bot: gerrit: failover cookbook bugfix [cookbooks] - 10https://gerrit.wikimedia.org/r/1136709 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb)
[13:10:51] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+1] deployment_server: stop shipping prometheus_nodes for k8s [puppet] - 10https://gerrit.wikimedia.org/r/1136605 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi)
[13:11:54] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[13:11:56] <wikibugs>	 (03CR) 10Neriah: [C:03+1] testwiki: enable wgUseCodexSpecialBlock and wgEnableMultiBlocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136466 (https://phabricator.wikimedia.org/T377121) (owner: 10MusikAnimal)
[13:11:57] <robertsky>	 claime: looks tough. hope it resolves soon.
[13:12:21] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136717 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi)
[13:12:28] <claime>	 robertsky: thanks :)
[13:12:29] <wikibugs>	 (03PS1) 10Ssingh: Revert "wikimedia-dns.org: add HTTPS record (test)" [dns] - 10https://gerrit.wikimedia.org/r/1136722
[13:12:59] <wikibugs>	 (03PS3) 10Ayounsi: Add CPU/RAM/DISK [puppet] - 10https://gerrit.wikimedia.org/r/1136717 (https://phabricator.wikimedia.org/T388641)
[13:13:32] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] Revert "wikimedia-dns.org: add HTTPS record (test)" [dns] - 10https://gerrit.wikimedia.org/r/1136722 (owner: 10Ssingh)
[13:13:39] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[13:14:18] <logmsgbot>	 !log slyngshede@cumin1002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Andy Cooper out of all services on: 2393 hosts
[13:16:05] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[13:17:23] <logmsgbot>	 !log cgoubert@deploy1003 cgoubert, robertsky: Backport for [[gerrit:1131038|updating wikimaniawiki namespace configurations: (T389729)]], [[gerrit:1131119|update wikimaniawiki perms configurations: (T389729)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:17:26] <stashbot>	 T389729: wikimaniawiki: namespaces for 2027-2028 and other adjustments - https://phabricator.wikimedia.org/T389729
[13:17:28] <claime>	 robertsky: please go ahead and test your patches with XWD
[13:17:34] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610
[13:17:37] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[13:17:40] <RichSmithLaptop>	 Not sure if this was the right place, but just hit '
[13:17:40] <RichSmithLaptop>	 [cee30b80-6232-414c-b271-aaa8b4dfa616] 2025-04-15 13:15:55: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"' when going to Special:BlockList on ENWP
[13:17:52] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers wikikube-ctrl2001.codfw.wmnet, wikikube-ctrl2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:18:39] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[13:18:41] <wikibugs>	 (03PS2) 10Volans: hosts: add a new hosts module with a Host class [software/spicerack] - 10https://gerrit.wikimedia.org/r/1135763
[13:18:41] <wikibugs>	 (03PS2) 10Volans: hosts: add a is_dns_propagated() method to Host [software/spicerack] - 10https://gerrit.wikimedia.org/r/1135764
[13:18:51] <claime>	 hnowlan: do you have a minute to check what's going on there ^ (wikikube-ctrl)
[13:18:52] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:18:56] <claime>	 Ah, transient
[13:18:58] <claime>	 we're good
[13:19:01] <claime>	 sorry for the ping
[13:19:05] <sukhe>	 :)
[13:19:25] <wikibugs>	 (03CR) 10FNegri: openstack: networktests: refresh FQDN of the neutron virtual router (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136719 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez)
[13:19:33] <elukey>	 claime: I think it was the reload for TLS certs
[13:19:34] <robertsky>	 claime: hold on.. I've got to apologise for this, how do I get onto the debug server to test? (it's my first time attending the backport)..
[13:19:38] <elukey>	 I don't see horrors in the logs
[13:19:44] <hnowlan>	 claime: looking just to be sure 
[13:19:51] <claime>	 robertsky: do you have the X-Wikimedia-Debug extension installed?
[13:19:53] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136717 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi)
[13:19:57] <robertsky>	 yes
[13:20:05] <claime>	 go to the wiki you want to test
[13:20:07] <wikibugs>	 (03CR) 10Volans: "Moved the non immutable accessors from @property to methods" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1135763 (owner: 10Volans)
[13:20:16] <claime>	 turn it on
[13:20:18] <claime>	 test
[13:20:20] <claime>	 :D
[13:20:29] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P75024 and previous config saved to /var/cache/conftool/dbconfig/20250415-132029-fceratto.json
[13:21:01] <wikibugs>	 (03PS12) 10Tiziano Fogli: prometheus/alerts: define alert rules directly in puppet [puppet] - 10https://gerrit.wikimedia.org/r/1101066 (https://phabricator.wikimedia.org/T381665)
[13:21:09] <wikibugs>	 (03CR) 10FNegri: openstack: networktests: refresh FQDN of the neutron virtual router (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136719 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez)
[13:21:42] <wikibugs>	 (03PS1) 10Jelto: miscweb: remove query-service from legacy vms [puppet] - 10https://gerrit.wikimedia.org/r/1136724 (https://phabricator.wikimedia.org/T350793)
[13:22:15] <robertsky>	 checking
[13:23:14] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2065 to cirrussearch2065
[13:23:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[13:24:11] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5301/co" [puppet] - 10https://gerrit.wikimedia.org/r/1136724 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto)
[13:25:25] <robertsky>	 claime: lgtm.
[13:25:30] <claime>	 cool proceeding
[13:25:32] <logmsgbot>	 !log cgoubert@deploy1003 cgoubert, robertsky: Continuing with sync
[13:26:10] <claime>	 MichaelG_WMF: I'll do your backports as they're for a disabled periodic job, but fwiw, it'd be better if they were +1'd before being scheduled for deployment
[13:26:31] <wikibugs>	 (03PS1) 10Brouberol: airflow: hotfix: only assign low resources to kubernetes pod operator pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136726 (https://phabricator.wikimedia.org/T391669)
[13:26:56] <MichaelG_WMF>	 Thank you. If you want, I can ask Amir1 about the backports?
[13:27:21] <Amir1>	 They have my +1
[13:27:23] <claime>	 cool
[13:27:27] <wikibugs>	 (03PS2) 10Jelto: miscweb: remove query-service from legacy vms [puppet] - 10https://gerrit.wikimedia.org/r/1136724 (https://phabricator.wikimedia.org/T350793)
[13:27:34] <Amir1>	 once you're done, I have a patch too
[13:27:36] <claime>	 thanks both
[13:27:45] <claime>	 Amir1: i know you do :P
[13:27:47] <Amir1>	 and a backport
[13:28:08] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2065 to cirrussearch2065 - bking@cumin2002"
[13:28:22] <claime>	 Aca: you around?
[13:28:30] <Aca>	 ye ye
[13:28:32] <claime>	 your patch isn't +1'd either
[13:28:37] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2065 to cirrussearch2065 - bking@cumin2002"
[13:28:37] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:28:38] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2065
[13:28:49] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2065
[13:28:53] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5302/co" [puppet] - 10https://gerrit.wikimedia.org/r/1136724 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto)
[13:29:24] <Aca>	 I could call a colleague to review it, but I think he's not around
[13:29:31] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2065 to cirrussearch2065
[13:29:32] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2065.codfw.wmnet on all recursors
[13:29:35] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2065.codfw.wmnet on all recursors
[13:29:51] <wikibugs>	 (03CR) 10CI reject: [V:04-1] hosts: add a is_dns_propagated() method to Host [software/spicerack] - 10https://gerrit.wikimedia.org/r/1135764 (owner: 10Volans)
[13:29:57] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2065.codfw.wmnet with OS bullseye
[13:30:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2065
[13:31:51] <claime>	 Aca: that would be best as it's adding things I'm not sure are standard for wiktionary
[13:32:13] <Aca>	 umm, elaborate?
[13:32:51] <claime>	 So I don't have domain specific knowledge for this
[13:33:37] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] netbox-hiera: adding pdu type [puppet] - 10https://gerrit.wikimedia.org/r/1128479 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli)
[13:33:39] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[13:33:51] <claime>	 my bad got tripped buy order
[13:33:52] <Aca>	 Basic import source setup for wiktionary is:
[13:33:53] <Aca>	  'wiktionary' => [ 'w', 'w:en', 'en', 'ar', 'es', 'fr', 'ru', 'zh', 'de', 'id', 'commons', 'meta', 'incubator' ],
[13:33:53] <Aca>	 This change just adds "bs", per community consensus. The rest is just duplicated in order to prevent overwriting.
[13:33:54] <claime>	 s/buy/by/
[13:33:58] <claime>	 yeah yeah
[13:34:16] <logmsgbot>	 !log cgoubert@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131038|updating wikimaniawiki namespace configurations: (T389729)]], [[gerrit:1131119|update wikimaniawiki perms configurations: (T389729)]] (duration: 28m 46s)
[13:34:19] <stashbot>	 T389729: wikimaniawiki: namespaces for 2027-2028 and other adjustments - https://phabricator.wikimedia.org/T389729
[13:34:20] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1002 is CRITICAL: 1.013e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[13:34:25] <claime>	 The order wasn't the same as the generic wiktionary entry, and that tripped my quick reading
[13:34:36] <Aca>	 yeah, I get it
[13:34:53] <claime>	 ok MichaelG_WMF I'll do your patches now
[13:34:58] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T391903#10743696 (10Jclark-ctr) a:03Jclark-ctr @Eevans  This server is out of Warranty  We have  used drives from recently Decom servers please advise when and if you would like to replace.
[13:35:08] <MichaelG_WMF>	 claime: Thank you!
[13:35:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136701 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große)
[13:35:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136702 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große)
[13:35:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136703 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große)
[13:35:36] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T391056)', diff saved to https://phabricator.wikimedia.org/P75025 and previous config saved to /var/cache/conftool/dbconfig/20250415-133536-fceratto.json
[13:35:40] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[13:35:44] <wikibugs>	 (03CR) 10Anzx: "looks ok" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136133 (https://phabricator.wikimedia.org/T391621) (owner: 10Acamicamacaraca)
[13:35:52] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[13:35:57] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[13:35:59] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T391056)', diff saved to https://phabricator.wikimedia.org/P75027 and previous config saved to /var/cache/conftool/dbconfig/20250415-133558-fceratto.json
[13:36:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:37:01] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10743705 (10MatthewVernon) Looking at the Ceph metrics, it seems the packages were fewer larger objects, and the artifacts are more...
[13:37:20] <wikibugs>	 (03Merged) 10jenkins-bot: tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136701 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große)
[13:37:27] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10743708 (10Jclark-ctr) @elukey   thanks for downtime  raid card has been installed. @MatthewVernon All yours to verify
[13:37:30] <wikibugs>	 (03Merged) 10jenkins-bot: perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136702 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große)
[13:37:33] <wikibugs>	 (03Merged) 10jenkins-bot: perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1136703 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große)
[13:38:02] <logmsgbot>	 !log cgoubert@deploy1003 Started scap sync-world: Backport for [[gerrit:1136701|tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136702|perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136703|perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider (T391695)]]
[13:38:06] <stashbot>	 T391695: UncachedMenteeOverviewDataProvider query is extremely aggressive causing partial outages - https://phabricator.wikimedia.org/T391695
[13:38:07] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T391056)', diff saved to https://phabricator.wikimedia.org/P75028 and previous config saved to /var/cache/conftool/dbconfig/20250415-133807-fceratto.json
[13:38:15] <wikibugs>	 (03PS3) 10Volans: hosts: add a is_dns_propagated() method to Host [software/spicerack] - 10https://gerrit.wikimedia.org/r/1135764
[13:38:25] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "nice! will be great to have those stats." [puppet] - 10https://gerrit.wikimedia.org/r/1136717 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi)
[13:38:54] <wikibugs>	 (03CR) 10Elukey: "Hey Jesse! I tried with and without the patch, output in https://phabricator.wikimedia.org/P75026. For some reason it is very different, I" [puppet] - 10https://gerrit.wikimedia.org/r/1135115 (owner: 10JHathaway)
[13:39:50] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Add CPU/RAM/DISK [puppet] - 10https://gerrit.wikimedia.org/r/1136717 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi)
[13:40:18] <godog>	 jouncebot: now and next
[13:40:18] <jouncebot>	 For the next 0 hour(s) and 19 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1300)
[13:40:24] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2065 - bking@cumin2002"
[13:40:29] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2065 - bking@cumin2002"
[13:40:29] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:40:30] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2065.codfw.wmnet 68.32.192.10.in-addr.arpa 8.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:40:33] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2065.codfw.wmnet 68.32.192.10.in-addr.arpa 8.6.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:40:34] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2065
[13:40:44] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2065
[13:40:44] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2065
[13:43:03] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10743739 (10elukey) Thanks a lot!  I see the new controller but also some errors while mounting swift partitions:  ` [Tue Apr 15 13:41:...
[13:44:32] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10743756 (10MatthewVernon) Currently puppet is failing on this host: ` mvernon@ms-be1091:~$ sudo run-puppet-agent Info: Using environme...
[13:45:32] <logmsgbot>	 !log tappof@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add data::pdus to exports - tappof@cumin1002 - T387231"
[13:45:34] <logmsgbot>	 !log tappof@cumin1002 END (ERROR) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=97) generate netbox hiera data: "add data::pdus to exports - tappof@cumin1002 - T387231"
[13:45:35] <stashbot>	 T387231: missing pdu infos for magru - https://phabricator.wikimedia.org/T387231
[13:45:44] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] sre.puppet.sync-netbox-hiera: add data::pdus to exports [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli)
[13:46:07] <wikibugs>	 (03PS1) 10Elukey: role::ml_k8s::master: move to Bookworm and containerd [puppet] - 10https://gerrit.wikimedia.org/r/1136728 (https://phabricator.wikimedia.org/T387854)
[13:46:12] <wikibugs>	 (03CR) 10Edgar Allan Poe: [C:03+1] shwiktionary: Add bs as import source [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136133 (https://phabricator.wikimedia.org/T391621) (owner: 10Acamicamacaraca)
[13:46:52] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10743765 (10MatthewVernon) @elukey that might help, yes, it looks like puppet finds the disks, but they've changed their path: ` swift_...
[13:47:34] <wikibugs>	 (03CR) 10Edgar Allan Poe: [C:03+1] "Looks good to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136133 (https://phabricator.wikimedia.org/T391621) (owner: 10Acamicamacaraca)
[13:48:04] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10743769 (10MatthewVernon) (I don't know whether everything will Just Work with a reimage, or if some awful regexes will need adjusting)
[13:48:08] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5303/" [puppet] - 10https://gerrit.wikimedia.org/r/1136728 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey)
[13:48:38] <wikibugs>	 (03PS2) 10Elukey: role::ml_k8s::master: move 1001 to Bookworm and containerd [puppet] - 10https://gerrit.wikimedia.org/r/1136728 (https://phabricator.wikimedia.org/T387854)
[13:49:42] <logmsgbot>	 !log cgoubert@deploy1003 migr, cgoubert: Backport for [[gerrit:1136701|tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136702|perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136703|perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider (T391695)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:49:46] <stashbot>	 T391695: UncachedMenteeOverviewDataProvider query is extremely aggressive causing partial outages - https://phabricator.wikimedia.org/T391695
[13:49:57] <logmsgbot>	 !log cgoubert@deploy1003 migr, cgoubert: Continuing with sync
[13:52:13] <claime>	 ty for getting +1 Aca 
[13:52:27] <wikibugs>	 (03Merged) 10jenkins-bot: sre.puppet.sync-netbox-hiera: add data::pdus to exports [cookbooks] - 10https://gerrit.wikimedia.org/r/1125206 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli)
[13:52:42] <Aca>	 no problem
[13:53:14] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P75029 and previous config saved to /var/cache/conftool/dbconfig/20250415-135313-fceratto.json
[13:54:13] <wikibugs>	 (03PS1) 10Ssingh: [test commit] wikimedia-dns.org: add HTTPS records [dns] - 10https://gerrit.wikimedia.org/r/1136730
[13:55:12] <logmsgbot>	 !log tappof@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add data::pdus to exports - tappof@cumin1002 - T387231"
[13:55:16] <stashbot>	 T387231: missing pdu infos for magru - https://phabricator.wikimedia.org/T387231
[13:55:36] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2065.codfw.wmnet with reason: host reimage
[13:55:44] <logmsgbot>	 !log tappof@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add data::pdus to exports - tappof@cumin1002 - T387231"
[13:56:29] <logmsgbot>	 !log cgoubert@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136701|tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136702|perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136703|perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider (T391695)]] (duration: 18m 27s)
[13:56:32] <stashbot>	 T391695: UncachedMenteeOverviewDataProvider query is extremely aggressive causing partial outages - https://phabricator.wikimedia.org/T391695
[13:56:39] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] [test commit] wikimedia-dns.org: add HTTPS records [dns] - 10https://gerrit.wikimedia.org/r/1136730 (owner: 10Ssingh)
[13:56:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136696 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große)
[13:56:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136698 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große)
[13:56:50] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[13:56:51] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136700 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große)
[13:57:37] <wikibugs>	 (03PS1) 10Robertsky: fix wgAddGroup for wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136731 (https://phabricator.wikimedia.org/T389729)
[13:58:49] <wikibugs>	 (03Merged) 10jenkins-bot: tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136696 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große)
[13:59:17] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[13:59:26] <wikibugs>	 (03Merged) 10jenkins-bot: perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136698 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große)
[13:59:29] <wikibugs>	 (03Merged) 10jenkins-bot: perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider [extensions/GrowthExperiments] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136700 (https://phabricator.wikimedia.org/T391695) (owner: 10Michael Große)
[13:59:49] <wikibugs>	 (03PS1) 10Ssingh: Revert "[test commit] wikimedia-dns.org: add HTTPS records" [dns] - 10https://gerrit.wikimedia.org/r/1136732
[13:59:57] <logmsgbot>	 !log cgoubert@deploy1003 Started scap sync-world: Backport for [[gerrit:1136696|tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136698|perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136700|perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider (T391695)]]
[14:00:54] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[14:01:21] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] Revert "[test commit] wikimedia-dns.org: add HTTPS records" [dns] - 10https://gerrit.wikimedia.org/r/1136732 (owner: 10Ssingh)
[14:01:39] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[14:02:26] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2065.codfw.wmnet with reason: host reimage
[14:03:08] <wikibugs>	 (03PS2) 10Robertsky: fix wgAddGroup for wikimaniawiki. No need for translateadmin to add xcon and xcon to add more xcon. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136731 (https://phabricator.wikimedia.org/T389729)
[14:04:06] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[14:04:47] <jinxer-wm>	 FIRING: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[14:06:53] <claime>	 jouncebot: nowandnext
[14:06:53] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 53 minute(s)
[14:06:53] <jouncebot>	 In 0 hour(s) and 53 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1500)
[14:07:10] <claime>	 we're running over a bit but I'll still finish up
[14:07:41] <jinxer-wm>	 FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[14:07:57] <urandom>	 !log bootstrapping Cassandra/restbase1044-c — T389423
[14:08:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:01] <stashbot>	 T389423: Refresh restbase10[28-30] w/ restbase104[3-5] - https://phabricator.wikimedia.org/T389423
[14:08:20] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P75030 and previous config saved to /var/cache/conftool/dbconfig/20250415-140820-fceratto.json
[14:08:42] <wikibugs>	 (03PS1) 10Volans: doc: fine-tune settings for magic methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136733
[14:09:47] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[14:11:50] <logmsgbot>	 !log cgoubert@deploy1003 migr, cgoubert: Backport for [[gerrit:1136696|tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136698|perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136700|perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider (T391695)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:11:53] <stashbot>	 T391695: UncachedMenteeOverviewDataProvider query is extremely aggressive causing partial outages - https://phabricator.wikimedia.org/T391695
[14:11:57] <logmsgbot>	 !log cgoubert@deploy1003 migr, cgoubert: Continuing with sync
[14:12:08] <wikibugs>	 (03PS46) 10Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231)
[14:13:39] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service restbase1044-c:7000 has failed probes (tcp_cassandra_c_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:14:32] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: hotfix: only assign low resources to kubernetes pod operator pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136726 (https://phabricator.wikimedia.org/T391669) (owner: 10Brouberol)
[14:14:55] <wikibugs>	 (03PS47) 10Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231)
[14:15:33] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_esams and not P{cp3081.esams.wmnet} and A:cp
[14:17:22] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli)
[14:18:03] <wikibugs>	 (03PS3) 10Bking: cirrussearch: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610)
[14:18:28] <logmsgbot>	 !log cgoubert@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136696|tests(Mentorship): add coverage for UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136698|perf(Mentorship): extract sub-queries from UncachedMenteeOverviewDataProvider (T391695)]], [[gerrit:1136700|perf(Mentorship): batch filtering mentees in UncachedMenteeOverviewDataProvider (T391695)]] (duration: 18m 30s)
[14:18:31] <stashbot>	 T391695: UncachedMenteeOverviewDataProvider query is extremely aggressive causing partial outages - https://phabricator.wikimedia.org/T391695
[14:18:38] <claime>	 ok Aca moving on to your patch
[14:18:39] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:18:43] <Aca>	 ack
[14:18:54] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136133 (https://phabricator.wikimedia.org/T391621) (owner: 10Acamicamacaraca)
[14:19:17] <claime>	 MichaelG_WMF: your patch is fully backported, so the updated script should be up on mwmaint
[14:19:18] <wikibugs>	 (03CR) 10Elukey: [C:03+1] doc: fine-tune settings for magic methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136733 (owner: 10Volans)
[14:19:28] <claime>	 Maybe don't run it rn tho :P
[14:19:40] <MichaelG_WMF>	 claime: thank you!
[14:19:46] <wikibugs>	 (03Merged) 10jenkins-bot: shwiktionary: Add bs as import source [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136133 (https://phabricator.wikimedia.org/T391621) (owner: 10Acamicamacaraca)
[14:19:59] <wikibugs>	 (03CR) 10Kamila Součková: [C:04-1] "typo in team label" [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: 10Clément Goubert)
[14:20:08] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:20:11] <wikibugs>	 (03CR) 10Elukey: [C:03+1] hosts: add a new hosts module with a Host class [software/spicerack] - 10https://gerrit.wikimedia.org/r/1135763 (owner: 10Volans)
[14:20:14] <logmsgbot>	 !log cgoubert@deploy1003 Started scap sync-world: Backport for [[gerrit:1136133|shwiktionary: Add bs as import source (T391621)]]
[14:20:19] <stashbot>	 T391621: shwiktionary: Add bs as import source - https://phabricator.wikimedia.org/T391621
[14:20:28] <MichaelG_WMF>	 yes, the plan is to re-enable it maybe tomorrow when we can prepare it and watch for fallout
[14:20:29] <wikibugs>	 (03CR) 10Herron: [C:03+1] logstash: restore forcemerge in curator [puppet] - 10https://gerrit.wikimedia.org/r/1136713 (https://phabricator.wikimedia.org/T391661) (owner: 10Filippo Giunchedi)
[14:20:48] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:20:59] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_esams and not P{cp3073.esams.wmnet} and A:cp
[14:21:03] <wikibugs>	 (03CR) 10Volans: [C:03+2] doc: fine-tune settings for magic methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136733 (owner: 10Volans)
[14:21:07] <MichaelG_WMF>	 claime: `Maybe don't run it rn tho :P` Are you mainly concerned about the registry issue or something else?
[14:21:46] <jinxer-wm>	 FIRING: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[14:21:56] <wikibugs>	 (03CR) 10Clément Goubert: CampaignEvents: Migrate updateutcts-test2wiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: 10Clément Goubert)
[14:22:39] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2065.codfw.wmnet with OS bullseye
[14:22:41] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610
[14:22:45] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[14:22:46] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10743933 (10MatthewVernon) There's an LVM layer here too, isn't there? It's a software-RAID-1 of sda2 and sdb2...
[14:23:27] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T391056)', diff saved to https://phabricator.wikimedia.org/P75031 and previous config saved to /var/cache/conftool/dbconfig/20250415-142327-fceratto.json
[14:23:30] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[14:23:43] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[14:23:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T391056)', diff saved to https://phabricator.wikimedia.org/P75032 and previous config saved to /var/cache/conftool/dbconfig/20250415-142349-fceratto.json
[14:24:19] <wikibugs>	 (03PS1) 10Kamila Součková: CampaignEvents: Migrate aggregateparticipantanswers-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1136734 (https://phabricator.wikimedia.org/T385867)
[14:24:39] <godog>	 claime: I'd like to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136605 please LMK when good to do so
[14:24:58] <claime>	 godog: Amir1 asked first, so see with him :P
[14:25:37] <vgutierrez>	 !log rolling upgrade to varnish 7.1.1-1.1~bpo11+wmf3 in eqiad - T391334
[14:25:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:42] <stashbot>	 T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334
[14:25:42] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_eqiad and A:cp
[14:25:57] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_eqiad and A:cp
[14:25:59] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T391056)', diff saved to https://phabricator.wikimedia.org/P75033 and previous config saved to /var/cache/conftool/dbconfig/20250415-142558-fceratto.json
[14:26:20] <godog>	 claime: lolz
[14:26:29] <godog>	 claime: I'll stand in line
[14:26:35] <wikibugs>	 (03PS1) 10Brouberol: airflow/hotfix: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136735
[14:26:42] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] statsd: remove ferm rule for statsd port 8125 [puppet] - 10https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380) (owner: 10Cwhite)
[14:28:55] <wikibugs>	 (03CR) 10Hashar: "That is for the odd use case when I am running `bundle exec rspec` from my local machine (Debian Bookworm) which comes with Ruby 3.1." [puppet] - 10https://gerrit.wikimedia.org/r/1136403 (owner: 10Hashar)
[14:29:47] <Amir1>	 awesome
[14:29:49] <wikibugs>	 (03PS4) 10Clément Goubert: CampaignEvents: Migrate updateutcts-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867)
[14:29:59] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow/hotfix: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136735 (owner: 10Brouberol)
[14:30:07] <wikibugs>	 (03PS1) 10Ladsgroup: Revert^2 "Bump thumbnail steps to 95%" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136737
[14:30:15] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: 10Clément Goubert)
[14:30:41] <claime>	 Amir1: I still have a deploy in flight, btw, so wait a bit
[14:30:54] <Amir1>	 ah okay
[14:31:25] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: openstack: networktests: refresh FQDN of the neutron virtual router (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136719 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez)
[14:31:32] <logmsgbot>	 !log cgoubert@deploy1003 aleksandar, cgoubert: Backport for [[gerrit:1136133|shwiktionary: Add bs as import source (T391621)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:31:35] <claime>	 Aca: /39
[14:31:36] <stashbot>	 T391621: shwiktionary: Add bs as import source - https://phabricator.wikimedia.org/T391621
[14:31:37] <claime>	 sorry
[14:31:40] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_drmrs
[14:31:45] <Aca>	 MichaelG_WMF: testing
[14:31:46] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[14:31:48] <wikibugs>	 (03Merged) 10jenkins-bot: doc: fine-tune settings for magic methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136733 (owner: 10Volans)
[14:31:50] <Aca>	 oops
[14:31:52] <Aca>	 wrong ping
[14:31:52] <claime>	 haha
[14:32:01] <wikibugs>	 (03PS3) 10Robertsky: fix wgAddGroup for wikimaniawiki. No need for translateadmin to add xcon and xcon to add more xcon. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136731 (https://phabricator.wikimedia.org/T389729)
[14:32:03] <MichaelG_WMF>	 o.O
[14:32:06] <claime>	 fails all around :D
[14:32:47] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] wikistatsv1: remove htdocs/v2 link from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136640 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[14:32:53] <Aca>	 works as expected
[14:32:56] <Aca>	 lgtm
[14:32:59] <claime>	 cool, proceeding
[14:33:04] <logmsgbot>	 !log cgoubert@deploy1003 aleksandar, cgoubert: Continuing with sync
[14:33:10] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] wikistatsv2: move all content under /srv/stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1136641 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[14:33:38] <jinxer-wm>	 FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2066-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[14:33:43] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:34:00] <wikibugs>	 (03CR) 10Volans: [C:03+2] hosts: add a new hosts module with a Host class [software/spicerack] - 10https://gerrit.wikimedia.org/r/1135763 (owner: 10Volans)
[14:34:08] <wikibugs>	 (03PS5) 10Bking: sre.elasticsearch.rolling-operation: use tftp, run puppet with new hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/1135826 (https://phabricator.wikimedia.org/T383811) (owner: 10Ryan Kemper)
[14:34:11] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:34:15] <sbassett>	 Hey - there was a bad security patch from yesterday that went out to wmf.24 for a few minutes and was then reverted/redeployed.  But it looks like it made it back onto wmf.24 (train?) and is causing prod errors now: https://phabricator.wikimedia.org/T391969
[14:34:29] <wikibugs>	 (03PS3) 10Brouberol: wikistatsv1: remove htdocs/v2 link from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136640 (https://phabricator.wikimedia.org/T389107)
[14:34:29] <wikibugs>	 (03PS3) 10Brouberol: wikistatsv2: move all content under /srv/stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1136641 (https://phabricator.wikimedia.org/T389107)
[14:34:29] <wikibugs>	 (03PS3) 10Brouberol: wikistatsv2: remove assets from htdocs [puppet] - 10https://gerrit.wikimedia.org/r/1136642 (https://phabricator.wikimedia.org/T389107)
[14:34:30] <wikibugs>	 (03PS3) 10Brouberol: wikistatsv2: remove htdocs [puppet] - 10https://gerrit.wikimedia.org/r/1136643 (https://phabricator.wikimedia.org/T389107)
[14:34:31] <wikibugs>	 (03PS3) 10Brouberol: wikistatsv1: remove old resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136644 (https://phabricator.wikimedia.org/T389107)
[14:35:01] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53799 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:35:05] <sbassett>	 I’ve removed the patch from /srv/patches/1.44.0-wmf.24 now and it never made it to 1.44.0-wmf.25.  We’ll need to redeploy core:includes/specials/pagers/BlockListPager.php as soon as we can.
[14:35:51] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] wikistatsv1: remove htdocs/v2 link from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136640 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[14:36:03] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] wikistatsv2: move all content under /srv/stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1136641 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[14:36:27] <claime>	 sbassett: shoot
[14:36:47] <claime>	 ok that takes precedence on Amir1 backport
[14:37:09] <claime>	 sbassett: I have no experience on deploying security patches, can you handle it once the current scap is done?
[14:37:57] <sbassett>	 I’m ready to deploy the fix if I can
[14:38:00] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_drmrs
[14:38:05] <sbassett>	 If there’s no scap lock rn
[14:38:19] <sbassett>	 I think this got accidentally reapplied via scap backport :/
[14:38:38] <jinxer-wm>	 RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2066-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[14:38:54] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610
[14:38:57] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[14:39:09] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610
[14:39:10] <claime>	 possibly during my tests this morning, I'm sorry
[14:39:28] <claime>	 i'll ping you as soon as the current deploy is done
[14:39:33] <claime>	 couple minutes max
[14:39:42] <logmsgbot>	 !log cgoubert@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136133|shwiktionary: Add bs as import source (T391621)]] (duration: 19m 28s)
[14:39:44] <claime>	 sbassett: go
[14:39:45] <stashbot>	 T391621: shwiktionary: Add bs as import source - https://phabricator.wikimedia.org/T391621
[14:40:01] <sbassett>	 claime: running
[14:40:12] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610
[14:40:32] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] wikistatsv2: remove assets from htdocs [puppet] - 10https://gerrit.wikimedia.org/r/1136642 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[14:41:06] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P75034 and previous config saved to /var/cache/conftool/dbconfig/20250415-144106-fceratto.json
[14:42:46] <wikibugs>	 (03PS2) 10Kamila Součková: CampaignEvents: Migrate aggregateparticipantanswers-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1136734 (https://phabricator.wikimedia.org/T385867)
[14:42:47] <jinxer-wm>	 FIRING: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[14:42:47] <Aca>	 thankies for the deploy :)
[14:43:07] <wikibugs>	 (03Merged) 10jenkins-bot: hosts: add a new hosts module with a Host class [software/spicerack] - 10https://gerrit.wikimedia.org/r/1135763 (owner: 10Volans)
[14:43:11] <claime>	 Aca: np
[14:43:15] <claime>	 ty for the patch
[14:44:11] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] CampaignEvents: Migrate updateutcts-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: 10Clément Goubert)
[14:44:51] <wikibugs>	 (03CR) 10Clément Goubert: CampaignEvents: Migrate aggregateparticipantanswers-test2wiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136734 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková)
[14:45:55] <wikibugs>	 (03PS4) 10Robertsky: wikimaniawiki: fix add/remove groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136731 (https://phabricator.wikimedia.org/T389729)
[14:48:04] <wikibugs>	 (03CR) 10Kamila Součková: CampaignEvents: Migrate aggregateparticipantanswers-test2wiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136734 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková)
[14:48:07] <wikibugs>	 (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136734 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková)
[14:49:44] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136731 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky)
[14:50:01] <wikibugs>	 (03CR) 10Chlod Alejandro: [C:03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136731 (https://phabricator.wikimedia.org/T389729) (owner: 10Robertsky)
[14:51:01] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:52:00] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:52:38] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:52:47] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[14:54:18] <sbassett>	 prod k8s 40% done, error rates seem to be declining in logstash
[14:55:31] <claime>	 sbassett: cool thanks
[14:56:14] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P75035 and previous config saved to /var/cache/conftool/dbconfig/20250415-145613-fceratto.json
[14:57:03] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[14:57:33] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[14:57:45] <sbassett>	 !log Undeployed security patch for T391343 (reapplied during recent scap backport, patch now removed from deployment hosts)
[14:57:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:17] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1206 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:00:05] <jouncebot>	 jelto, arnoldokoth, and mutante: Time to do the SRE Collaboration Services office hours deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1500).
[15:00:47] <jinxer-wm>	 FIRING: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[15:03:49] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1067 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:04:40] <sbassett>	 claime: should be all good now
[15:06:39] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:07:09] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs-internal: remove disc records [dns] - 10https://gerrit.wikimedia.org/r/1136740 (https://phabricator.wikimedia.org/T376151)
[15:07:45] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wdqs-internal: remove disc records [dns] - 10https://gerrit.wikimedia.org/r/1136740 (https://phabricator.wikimedia.org/T376151) (owner: 10Ryan Kemper)
[15:07:49] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1067 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:08:39] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:08:53] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1137 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:09:21] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users for kcoleman - https://phabricator.wikimedia.org/T391861#10744161 (10MatthewVernon)
[15:09:27] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+1] Allow releng to resume train related systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/1130947 (https://phabricator.wikimedia.org/T387823) (owner: 10Hashar)
[15:10:01] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:10:27] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): statistics::wmde: Configure Prometheus Pushgateway [puppet] - 10https://gerrit.wikimedia.org/r/1136741 (https://phabricator.wikimedia.org/T389344)
[15:10:47] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[15:11:22] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T391056)', diff saved to https://phabricator.wikimedia.org/P75036 and previous config saved to /var/cache/conftool/dbconfig/20250415-151121-fceratto.json
[15:11:25] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[15:11:37] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[15:11:45] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T391056)', diff saved to https://phabricator.wikimedia.org/P75037 and previous config saved to /var/cache/conftool/dbconfig/20250415-151144-fceratto.json
[15:13:54] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T391056)', diff saved to https://phabricator.wikimedia.org/P75038 and previous config saved to /var/cache/conftool/dbconfig/20250415-151354-fceratto.json
[15:16:07] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2059 to cirrussearch2059
[15:16:19] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[15:16:24] <logmsgbot>	 !log dzahn@deploy1003 Started deploy [releng/jenkins-deploy@c274545] (releasing): T391590
[15:16:29] <stashbot>	 T391590: PuppetFailure - releases2003 - https://phabricator.wikimedia.org/T391590
[15:17:08] <logmsgbot>	 !log dzahn@deploy1003 Finished deploy [releng/jenkins-deploy@c274545] (releasing): T391590 (duration: 01m 14s)
[15:18:39] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[15:19:43] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1133 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:20:29] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users for kcoleman - https://phabricator.wikimedia.org/T391861#10744224 (10MatthewVernon)
[15:22:01] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users for kcoleman - https://phabricator.wikimedia.org/T391861#10744247 (10MatthewVernon) @RHo can you approve this request, please? Once that's done, this request can proceed.
[15:22:12] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2059 to cirrussearch2059 - bking@cumin2002"
[15:22:29] <Amir1>	 claime: since scott is done, shall I deploy?
[15:22:49] <elukey>	 Amir1: +1, I don't think there is anything else pending
[15:23:01] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1160 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:23:02] <elukey>	 the sleep fix is live so usable
[15:23:11] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Revert^2 "Bump thumbnail steps to 95%" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136737 (owner: 10Ladsgroup)
[15:23:39] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[15:23:59] <wikibugs>	 (03Merged) 10jenkins-bot: Revert^2 "Bump thumbnail steps to 95%" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136737 (owner: 10Ladsgroup)
[15:24:01] <wikibugs>	 (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1136413 (https://phabricator.wikimedia.org/T380485) (owner: 10Scott French)
[15:24:03] <wikibugs>	 (03CR) 10Scott French: [C:03+2] hieradata: switch parsoidtest1001 to PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1136413 (https://phabricator.wikimedia.org/T380485) (owner: 10Scott French)
[15:24:17] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1206 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:24:45] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1136737|Revert^2 "Bump thumbnail steps to 95%"]]
[15:25:11] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1085 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:26:22] <wikibugs>	 (03PS15) 10Elukey: services: enable ingress for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389
[15:26:28] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs-internal: move back to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1136744 (https://phabricator.wikimedia.org/T376151)
[15:26:44] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users for Lena Meintrup - https://phabricator.wikimedia.org/T391820#10744259 (10MatthewVernon)
[15:29:01] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P75041 and previous config saved to /var/cache/conftool/dbconfig/20250415-152901-fceratto.json
[15:29:02] <wikibugs>	 (03PS1) 10Herron: alertmanager: update irc template for pyrra slo alerts [puppet] - 10https://gerrit.wikimedia.org/r/1136745 (https://phabricator.wikimedia.org/T391925)
[15:29:03] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[15:29:15] <wikibugs>	 (03PS2) 10Herron: alertmanager: update irc template for pyrra slo alerts [puppet] - 10https://gerrit.wikimedia.org/r/1136745 (https://phabricator.wikimedia.org/T391925)
[15:31:14] <wikibugs>	 SRE, DNS, Wikimedia-Apache-configuration: Unconfigured subdomains of wikimedia.org should display an error page rather than the wikimedia.org homepage - https://phabricator.wikimedia.org/T391016#10744291 (Joe) Open→Declined This was never the behaviour of our servers, as far back as I can...
[15:31:39] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:32:14] <wikibugs>	 (03PS16) 10Elukey: services: enable ingress for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389
[15:32:19] <wikibugs>	 (03PS1) 10MVernon: admin: add lmeintrup to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1136746 (https://phabricator.wikimedia.org/T391820)
[15:32:57] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin: add lmeintrup to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1136746 (https://phabricator.wikimedia.org/T391820) (owner: 10MVernon)
[15:33:01] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1160 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:34:14] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10744312 (10TheDJ) >>! In T355914#10742000, @MikhailRyazanov wrote: > By the way, are there any reasons, besides historical, to specify image sizes in “pi...
[15:34:21] <wikibugs>	 (03PS2) 10MVernon: admin: add lmeintrup to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1136746 (https://phabricator.wikimedia.org/T391820)
[15:34:43] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1133 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:34:49] <wikibugs>	 (03PS1) 10Ryan Kemper: wdqs-internal: remove from LBs and backend servers [puppet] - 10https://gerrit.wikimedia.org/r/1136747 (https://phabricator.wikimedia.org/T376151)
[15:35:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin: add lmeintrup to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1136746 (https://phabricator.wikimedia.org/T391820) (owner: 10MVernon)
[15:35:17] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: mw-wikifunctions: Remove the main release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136748
[15:35:19] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: scap: Stop updating main mw-wikifunctions release [puppet] - 10https://gerrit.wikimedia.org/r/1136749
[15:35:29] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] wikistatsv2: remove htdocs [puppet] - 10https://gerrit.wikimedia.org/r/1136643 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[15:35:47] <wikibugs>	 (03PS3) 10MVernon: admin: add lmeintrup to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1136746 (https://phabricator.wikimedia.org/T391820)
[15:35:53] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1137 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:36:11] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1085 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:36:39] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1136737|Revert^2 "Bump thumbnail steps to 95%"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:36:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:38:17] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1204 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:39:03] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[15:39:36] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] wikistatsv1: remove old resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136644 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[15:39:47] <wikibugs>	 (03PS4) 10Brouberol: wikistatsv1: remove old resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136644 (https://phabricator.wikimedia.org/T389107)
[15:40:19] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] "I think this won't work as expected since haproxykafka gets the hostname from its configuration: https://gitlab.wikimedia.org/repos/sre/ha" [puppet] - 10https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571) (owner: 10Fabfur)
[15:40:48] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2059 to cirrussearch2059 - bking@cumin2002"
[15:40:48] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:40:49] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2059
[15:41:16] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] wikistatsv1: remove old resources from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1136644 (https://phabricator.wikimedia.org/T389107) (owner: 10Brouberol)
[15:41:19] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2059
[15:41:59] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2059 to cirrussearch2059
[15:42:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2059.codfw.wmnet on all recursors
[15:42:03] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2059.codfw.wmnet on all recursors
[15:42:25] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2059.codfw.wmnet with OS bullseye
[15:42:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2059
[15:42:57] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[15:44:08] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P75042 and previous config saved to /var/cache/conftool/dbconfig/20250415-154407-fceratto.json
[15:44:22] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] admin: add lmeintrup to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1136746 (https://phabricator.wikimedia.org/T391820) (owner: 10MVernon)
[15:44:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10744354 (10phaultfinder)
[15:45:48] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136737|Revert^2 "Bump thumbnail steps to 95%"]] (duration: 21m 02s)
[15:47:07] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2059 - bking@cumin2002"
[15:47:13] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2059 - bking@cumin2002"
[15:47:13] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:47:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2059.codfw.wmnet 5.32.192.10.in-addr.arpa 5.0.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:47:17] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2059.codfw.wmnet 5.32.192.10.in-addr.arpa 5.0.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:47:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2059
[15:47:42] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2059
[15:47:42] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2059
[15:48:39] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:48:47] <jinxer-wm>	 FIRING: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[15:49:08] <wikibugs>	 (CR) MVernon: [C:+2] admin: add lmeintrup to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1136746 (https://phabricator.wikimedia.org/T391820) (owner: MVernon)
[15:50:39] <wikibugs>	 (PS2) Ryan Kemper: wdqs-internal: move back to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/1136744 (https://phabricator.wikimedia.org/T376151)
[15:50:39] <wikibugs>	 (PS2) Ryan Kemper: wdqs-internal: remove from LBs and backend servers [puppet] - https://gerrit.wikimedia.org/r/1136747 (https://phabricator.wikimedia.org/T376151)
[15:51:02] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10744402 (Jclark-ctr) @fnegri  i had looked at this briefly  kinda looks like might be a bad intake sensor and might not be over heating comparin...
[15:51:35] <wikibugs>	 (PS3) Ryan Kemper: wdqs-internal: move back to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/1136744 (https://phabricator.wikimedia.org/T376151)
[15:51:35] <wikibugs>	 (PS3) Ryan Kemper: wdqs-internal: remove from LBs and backend servers [puppet] - https://gerrit.wikimedia.org/r/1136747 (https://phabricator.wikimedia.org/T376151)
[15:52:59] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, LDAP-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Lena Meintrup - https://phabricator.wikimedia.org/T391820#10744409 (MatthewVernon) Open→Resolved a:MatthewVernon @Lena_WMDE this is done now...
[15:54:07] <cscott>	 is there a train deploy next tuesday, despite it being a global WMF holiday?
[15:54:49] <wikibugs>	 (PS2) Ryan Kemper: wdqs-internal: remove disc records [dns] - https://gerrit.wikimedia.org/r/1136740 (https://phabricator.wikimedia.org/T376151)
[15:55:17] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1204 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:57:36] <logmsgbot>	 !log tappof@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add data::pdus to exports - tappof@cumin1002 - T387231"
[15:57:41] <stashbot>	 T387231: missing pdu infos for magru - https://phabricator.wikimedia.org/T387231
[15:58:01] <logmsgbot>	 !log tappof@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add data::pdus to exports - tappof@cumin1002 - T387231"
[15:58:47] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[15:59:17] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T391056)', diff saved to https://phabricator.wikimedia.org/P75043 and previous config saved to /var/cache/conftool/dbconfig/20250415-155914-fceratto.json
[15:59:21] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[15:59:28] <taavi>	 cscott: normally this would be at https://wikitech.wikimedia.org/wiki/Deployments/Yearly_calendar but that is not very helpful for that atm
[15:59:32] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[15:59:40] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T391056)', diff saved to https://phabricator.wikimedia.org/P75044 and previous config saved to /var/cache/conftool/dbconfig/20250415-155939-fceratto.json
[16:00:03] <cscott>	 taavi: the google calendar also lists a deploy on the 22nd
[16:00:05] <jouncebot>	 jhathaway and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1600).
[16:00:05] <jouncebot>	 Lucas_WMDE: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:09] <rzl>	 o/
[16:00:48] <cscott>	 taavi: i'm assuming the group0 deploy will get shifted to wednesday, since there won't be anyone around to fix any problems with group0 if they arise?
[16:00:50] <wikibugs>	 (CR) FNegri: [C:+1] openstack: networktests: refresh FQDN of the neutron virtual router (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1136719 (https://phabricator.wikimedia.org/T380174) (owner: Arturo Borrero Gonzalez)
[16:00:54] <Lucas_WMDE>	 o/
[16:01:01] <Lucas_WMDE>	 (sorry, my bouncer hung for a few seconds)
[16:01:36] <rzl>	 jeez how dare you show up late to the puppet window, a thing I would never do in my entire life :D
[16:01:40] <wikibugs>	 (CR) RLazarus: [C:+2] statistics::wmde: Configure Prometheus Pushgateway [puppet] - https://gerrit.wikimedia.org/r/1136741 (https://phabricator.wikimedia.org/T389344) (owner: Lucas Werkmeister (WMDE))
[16:01:42] <Lucas_WMDE>	 :P
[16:02:01] <rzl>	 will you want a manual run on stat1011 to test?
[16:03:12] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2059.codfw.wmnet with reason: host reimage
[16:03:30] <Lucas_WMDE>	 rzl: the only testing I could do would be to check that the line shows up in the config file
[16:03:40] <rzl>	 oh okay
[16:03:41] <Lucas_WMDE>	 beyond that, the required code isn’t quite ready yet
[16:03:55] <Lucas_WMDE>	 (I’ll test later if pushing to that Prometheus Pushgateway thingy works, probably tomorrow or so)
[16:04:05] <rzl>	 nod, makes sense
[16:04:21] <rzl>	 I'll run puppet anyway even though I'm pretty sure it mathematically can't fail on that patch, and then we can call it a day
[16:04:26] <Lucas_WMDE>	 sure
[16:05:31] <wikibugs>	 (CR) Federico Ceratto: [C:+1] upgrade.py: Depool, repool, update Phabricator [cookbooks] - https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: Federico Ceratto)
[16:05:37] <wikibugs>	 (CR) Federico Ceratto: [C:+2] upgrade.py: Depool, repool, update Phabricator [cookbooks] - https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: Federico Ceratto)
[16:06:08] <wikibugs>	 (PS1) Mhorsey: Release campaignEvents extension to azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1136754 (https://phabricator.wikimedia.org/T390805)
[16:06:30] <taavi>	 rzl: saying that something absolutely cannot fail is in my experience a very good way to make something fail
[16:06:33] <rzl>	 Lucas_WMDE: done, and I do see the diff in the puppet output, so you should be all good
[16:06:44] <Lucas_WMDE>	 rzl: and I see the lines in sudo -u analytics-wmde cat /srv/analytics-wmde/graphite/src/config \o/
[16:06:47] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2059.codfw.wmnet with reason: host reimage
[16:06:56] <Lucas_WMDE>	 thanks!
[16:06:56] <wikibugs>	 (CR) CI reject: [V:-1] Release campaignEvents extension to azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1136754 (https://phabricator.wikimedia.org/T390805) (owner: Mhorsey)
[16:07:05] <rzl>	 taavi: haha it's a template patch with no control characters, I defy the universe to break it just to teach me a lesson
[16:07:34] <rzl>	 (it didn't tho)
[16:10:34] <wikibugs>	 (PS1) Andrew Bogott: wmcs-dnsleaks: slow down --doublecheck [puppet] - https://gerrit.wikimedia.org/r/1136755
[16:11:34] <wikibugs>	 (CR) Andrew Bogott: [C:+2] wmcs-dnsleaks: slow down --doublecheck [puppet] - https://gerrit.wikimedia.org/r/1136755 (owner: Andrew Bogott)
[16:12:47] <wikibugs>	 (PS1) Ryan Kemper: wdqs-internal: remove service catalog entry [puppet] - https://gerrit.wikimedia.org/r/1136756 (https://phabricator.wikimedia.org/T376151)
[16:12:49] <wikibugs>	 (Merged) jenkins-bot: upgrade.py: Depool, repool, update Phabricator [cookbooks] - https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: Federico Ceratto)
[16:12:49] <wikibugs>	 (PS1) Ryan Kemper: wdqs-internal: rip out remaining logic/config [puppet] - https://gerrit.wikimedia.org/r/1136757 (https://phabricator.wikimedia.org/T376151)
[16:13:36] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T391056)', diff saved to https://phabricator.wikimedia.org/P75046 and previous config saved to /var/cache/conftool/dbconfig/20250415-161335-fceratto.json
[16:13:39] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[16:16:27] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1134 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:17:27] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1134 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:18:21] <wikibugs>	 (PS2) Mhorsey: Release campaignEvents extension to azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1136754 (https://phabricator.wikimedia.org/T390805)
[16:18:39] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:20:33] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1136754 (https://phabricator.wikimedia.org/T390805) (owner: Mhorsey)
[16:20:40] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Create alerting for saturation on sub-rated interfaces - https://phabricator.wikimedia.org/T374614#10744550 (cmooney) >>! In T374614#10707267, @cmooney wrote: >>>! In T374614#10147994, @ayounsi wrote: >> Short term I think if you add `[4Gbps]` to the interface...
[16:23:41] <wikibugs>	 (PS3) Kamila Součková: CampaignEvents: Migrate aggregateparticipantanswers-test2wiki [puppet] - https://gerrit.wikimedia.org/r/1136734 (https://phabricator.wikimedia.org/T385867)
[16:27:27] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2059.codfw.wmnet with OS bullseye
[16:28:43] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P75047 and previous config saved to /var/cache/conftool/dbconfig/20250415-162842-fceratto.json
[16:30:21] <wikibugs>	 (PS1) Fabfur: cache: allow logging of x-cache-status also for silent-dropped reqs [puppet] - https://gerrit.wikimedia.org/r/1136761 (https://phabricator.wikimedia.org/T391967)
[16:36:50] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2007.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2007.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:38:04] <wikibugs>	 (CR) Vgutierrez: [C:-1] "this currently breaks logging of x-cache-status for 301s responses generated by the `http` frontend" [puppet] - https://gerrit.wikimedia.org/r/1136761 (https://phabricator.wikimedia.org/T391967) (owner: Fabfur)
[16:41:52] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, LDAP-Access-Requests: Requesting access to analytics-privatedata-users for kcoleman - https://phabricator.wikimedia.org/T391861#10744645 (RHo) >>! In T391861#10744224, @MatthewVernon wrote: > @RHo can you approve this request, please? Once that's done,...
[16:42:10] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2098 to cirrussearch2098
[16:42:32] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[16:42:43] <wikibugs>	 (PS1) Ssingh: wikimedia-dns.org: check: add HTTPS record (TTL to increase later) [dns] - https://gerrit.wikimedia.org/r/1136764
[16:43:31] <wikibugs>	 (CR) Ssingh: [C:+2] wikimedia-dns.org: check: add HTTPS record (TTL to increase later) [dns] - https://gerrit.wikimedia.org/r/1136764 (owner: Ssingh)
[16:43:50] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[16:43:52] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P75048 and previous config saved to /var/cache/conftool/dbconfig/20250415-164350-fceratto.json
[16:45:15] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[16:45:22] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=99) from elastic2098 to cirrussearch2098
[16:45:23] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2098.codfw.wmnet on all recursors
[16:45:26] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2098.codfw.wmnet on all recursors
[16:45:52] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2102.codfw.wmnet on all recursors
[16:45:56] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2102.codfw.wmnet on all recursors
[16:46:02] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610
[16:46:05] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[16:46:22] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[16:46:43] <wikibugs>	 (CR) Bking: [C:+2] sre.elasticsearch.rolling-operation: use tftp, run puppet with new hostname [cookbooks] - https://gerrit.wikimedia.org/r/1135826 (https://phabricator.wikimedia.org/T383811) (owner: Ryan Kemper)
[16:48:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610
[16:50:53] <wikibugs>	 (PS1) Dzahn: jenkins: ensure systemd service dir exists before override [puppet] - https://gerrit.wikimedia.org/r/1136765 (https://phabricator.wikimedia.org/T384595)
[16:52:34] <wikibugs>	 (CR) Dzahn: [C:-1] "Thanks! I tried this but ran into errors. Pasted it here: https://phabricator.wikimedia.org/T391590#10744225" [puppet] - https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: Dzahn)
[16:54:36] <wikibugs>	 (CR) Hnowlan: [C:+1] CampaignEvents: Migrate updateutcts-test2wiki [puppet] - https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: Clément Goubert)
[16:55:19] <wikibugs>	 (CR) Dzahn: [C:+1] "Reverting seems still easy enough if needed, afaict! lgtm" [puppet] - https://gerrit.wikimedia.org/r/1136724 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[16:55:21] <wikibugs>	 (CR) Hnowlan: [C:+1] CampaignEvents: Migrate aggregateparticipantanswers-test2wiki [puppet] - https://gerrit.wikimedia.org/r/1136734 (https://phabricator.wikimedia.org/T385867) (owner: Kamila Součková)
[16:58:57] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610
[16:58:59] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T391056)', diff saved to https://phabricator.wikimedia.org/P75049 and previous config saved to /var/cache/conftool/dbconfig/20250415-165859-fceratto.json
[16:59:00] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[16:59:04] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[16:59:06] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610
[16:59:15] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1253.eqiad.wmnet with reason: Maintenance
[16:59:22] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1253 (T391056)', diff saved to https://phabricator.wikimedia.org/P75050 and previous config saved to /var/cache/conftool/dbconfig/20250415-165922-fceratto.json
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1700)
[17:01:32] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T391056)', diff saved to https://phabricator.wikimedia.org/P75051 and previous config saved to /var/cache/conftool/dbconfig/20250415-170132-fceratto.json
[17:03:04] <wikibugs>	 (CR) Dzahn: [C:-2] "replacing this approach with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136765" [puppet] - https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: Dzahn)
[17:08:52] <wikibugs>	 (PS1) Ssingh: Revert "wikimedia-dns.org: check: add HTTPS record (TTL to increase later)" [dns] - https://gerrit.wikimedia.org/r/1136768
[17:09:07] <wikibugs>	 (CR) Ssingh: "This works but reverting till we actually finish other deployment." [dns] - https://gerrit.wikimedia.org/r/1136768 (owner: Ssingh)
[17:10:21] <wikibugs>	 (PS4) Hnowlan: mw::maintenance::growthexperiments: migrate updateMetrics job to k8s [puppet] - https://gerrit.wikimedia.org/r/1135916 (https://phabricator.wikimedia.org/T385782)
[17:10:21] <wikibugs>	 (PS1) Hnowlan: mw::maintenance: migrate deleteExpiredUserImpactData to k8s [puppet] - https://gerrit.wikimedia.org/r/1136770 (https://phabricator.wikimedia.org/T385782)
[17:11:32] <wikibugs>	 (CR) Ssingh: [C:+2] Revert "wikimedia-dns.org: check: add HTTPS record (TTL to increase later)" [dns] - https://gerrit.wikimedia.org/r/1136768 (owner: Ssingh)
[17:11:41] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[17:13:39] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:14:08] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[17:16:06] <wikibugs>	 (CR) JHathaway: [C:+1] Gemfile: update rspec-puppet to 2.10.x [puppet] - https://gerrit.wikimedia.org/r/1136403 (owner: Hashar)
[17:16:40] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P75052 and previous config saved to /var/cache/conftool/dbconfig/20250415-171639-fceratto.json
[17:23:22] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, LDAP-Access-Requests: Requesting access to analytics-privatedata-users for kcoleman - https://phabricator.wikimedia.org/T391861#10744752 (Ahoelzl) Approved from DPE DE.
[17:23:33] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting access to analytics-privatedata-users for kcoleman - https://phabricator.wikimedia.org/T391861#10744754 (Ahoelzl)
[17:23:40] <logmsgbot>	 !log xcollazo@deploy1003 Started deploy [airflow-dags/analytics@f650091]: Pickup latest artifacts. T391280.
[17:23:44] <stashbot>	 T391280: Modify table maintenance mechanism to support Iceberg's rewrite_position_delete_files() - https://phabricator.wikimedia.org/T391280
[17:24:23] <logmsgbot>	 !log xcollazo@deploy1003 Finished deploy [airflow-dags/analytics@f650091]: Pickup latest artifacts. T391280. (duration: 01m 08s)
[17:31:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P75053 and previous config saved to /var/cache/conftool/dbconfig/20250415-173146-fceratto.json
[17:46:54] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T391056)', diff saved to https://phabricator.wikimedia.org/P75054 and previous config saved to /var/cache/conftool/dbconfig/20250415-174653-fceratto.json
[17:46:58] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[17:47:09] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[17:47:27] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2150.codfw.wmnet with reason: Maintenance
[17:47:35] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T391056)', diff saved to https://phabricator.wikimedia.org/P75055 and previous config saved to /var/cache/conftool/dbconfig/20250415-174734-fceratto.json
[17:54:43] <wikibugs>	 (PS1) Ssingh: Revert^2 "P:durum: add conditional to enable ECH (durum2002)" [puppet] - https://gerrit.wikimedia.org/r/1136772
[17:57:02] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_eqiad and A:cp
[18:00:05] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_eqiad and A:cp
[18:00:05] <jouncebot>	 dduvall and brennen: Your horoscope predicts another MediaWiki train - Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T1800).
[18:00:54] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[18:01:31] <sukhe>	 !log removing from reprepro -C component/nginx-ech libssl and openssl packages
[18:01:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:40] <wikibugs>	 ops-eqiad, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, netops: eqiad: second frack parent tracking task - https://phabricator.wikimedia.org/T392006 (RobH) NEW p:Triage→High
[18:04:01] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T391056)', diff saved to https://phabricator.wikimedia.org/P75056 and previous config saved to /var/cache/conftool/dbconfig/20250415-180400-fceratto.json
[18:04:05] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[18:04:32] <brennen>	 o/
[18:04:51] <wikibugs>	 ops-eqiad, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, netops: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007 (RobH) NEW
[18:05:10] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610
[18:05:13] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[18:05:15] <brennen>	 (ah, currently blocked it seems.)
[18:07:13] <dduvall>	 brennen: o/ i don't think it's a complete blocker per se as it's intermittent
[18:07:41] <jinxer-wm>	 FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[18:10:15] <wikibugs>	 (PS1) TrainBranchBot: group0 to 1.44.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1136776 (https://phabricator.wikimedia.org/T386220)
[18:10:17] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group0 to 1.44.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1136776 (https://phabricator.wikimedia.org/T386220) (owner: TrainBranchBot)
[18:11:04] <wikibugs>	 (Merged) jenkins-bot: group0 to 1.44.0-wmf.25 [mediawiki-config] - https://gerrit.wikimedia.org/r/1136776 (https://phabricator.wikimedia.org/T386220) (owner: TrainBranchBot)
[18:11:51] <wikibugs>	 ops-eqiad, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, netops: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10744966 (RobH)
[18:13:34] <wikibugs>	 (PS1) Jforrester: VE: Start setting wgVisualEditorMobileInsertMenu, default to off [mediawiki-config] - https://gerrit.wikimedia.org/r/1136777 (https://phabricator.wikimedia.org/T388604)
[18:13:35] <wikibugs>	 (PS1) Jforrester: VE: Set wgVisualEditorMobileInsertMenu true on Wikifunctions client wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1136778 (https://phabricator.wikimedia.org/T383145)
[18:13:39] <jinxer-wm>	 FIRING: ProbeDown: Service restbase1044-c:9042 has failed probes (tcp_cassandra_c_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#restbase1044-c:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:14:10] <wikibugs>	 ops-eqiad, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, netops: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10744975 (RobH) @ayounsi & @cmooney:  Per our conversation today in our codfw/eqiad buildout meetings, this was brought up and I've created th...
[18:14:33] <wikibugs>	 (CR) DLynch: [C:+1] VE: Set wgVisualEditorMobileInsertMenu true on Wikifunctions client wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1136778 (https://phabricator.wikimedia.org/T383145) (owner: Jforrester)
[18:14:35] <wikibugs>	 ops-eqiad, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, netops: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10744979 (RobH) @Jclark-ctr & @VRiley-WMF  Per today's meeting, one of the action items was to have an eqiad onsite determine how many free cro...
[18:19:07] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P75057 and previous config saved to /var/cache/conftool/dbconfig/20250415-181906-fceratto.json
[18:23:39] <jinxer-wm>	 RESOLVED: ProbeDown: Service restbase1044-c:9042 has failed probes (tcp_cassandra_c_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#restbase1044-c:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:23:46] <wikibugs>	 (PS1) Andrew Bogott: mwopenstackclients: fix DnsManager [puppet] - https://gerrit.wikimedia.org/r/1136780
[18:27:02] <wikibugs>	 (CR) Bartosz Dziewoński: [C:+1] logging: Add context processor [mediawiki-config] - https://gerrit.wikimedia.org/r/1136132 (https://phabricator.wikimedia.org/T142313) (owner: Gergő Tisza)
[18:28:07] <wikibugs>	 (Abandoned) Dzahn: jenkins: fix puppet error, systemd override requires systemd service [puppet] - https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: Dzahn)
[18:28:47] <wikibugs>	 ops-eqiad, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, netops: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10745024 (cmooney) >>! In T392007#10744966, @RobH wrote: > Please detail via comment specifically how using D6 would cause a network imbalance...
[18:29:32] <logmsgbot>	 !log dduvall@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.25  refs T386220
[18:29:36] <stashbot>	 T386220: 1.44.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T386220
[18:30:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10745029 (RobH)
[18:31:26] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10745033 (RobH)
[18:34:14] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P75058 and previous config saved to /var/cache/conftool/dbconfig/20250415-183413-fceratto.json
[18:34:36] <wikibugs>	 (CR) Bartosz Dziewoński: [C:+1] "(Needs to wait until the dependency is deployed, otherwise it throws exceptions)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1136132 (https://phabricator.wikimedia.org/T142313) (owner: Gergő Tisza)
[18:49:21] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T391056)', diff saved to https://phabricator.wikimedia.org/P75059 and previous config saved to /var/cache/conftool/dbconfig/20250415-184921-fceratto.json
[18:49:25] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[18:49:38] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2159.codfw.wmnet with reason: Maintenance
[18:49:53] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2187.codfw.wmnet with reason: Maintenance
[18:49:57] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: eqiad: second frack parent tracking task - https://phabricator.wikimedia.org/T392006#10745097 (RobH)
[18:49:59] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: Migrate non-fundraising hosts out of eqiad D6 - https://phabricator.wikimedia.org/T390240#10745098 (RobH)
[18:50:00] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T391056)', diff saved to https://phabricator.wikimedia.org/P75060 and previous config saved to /var/cache/conftool/dbconfig/20250415-185000-fceratto.json
[18:50:28] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: eqiad: second frack parent tracking task - https://phabricator.wikimedia.org/T392006#10745104 (RobH) Please note I've tied original task T390240 to this for ease of tracking.  If rack D6 is not selected (likely won't be) then I'll invalid...
[18:55:47] <wikibugs>	 (PS2) Fabfur: cache: allow logging of x-cache-status also for silent-dropped reqs [puppet] - https://gerrit.wikimedia.org/r/1136761 (https://phabricator.wikimedia.org/T391967)
[18:57:33] <wikibugs>	 (PS3) Fabfur: cache: allow logging of x-cache-status also for silent-dropped reqs [puppet] - https://gerrit.wikimedia.org/r/1136761 (https://phabricator.wikimedia.org/T391967)
[18:58:09] <wikibugs>	 (PS1) Volans: CHANGELOG: add changelogs for release v10.1.0 [software/spicerack] - https://gerrit.wikimedia.org/r/1136785
[19:01:45] <wikibugs>	 (CR) Volans: [C:+2] CHANGELOG: add changelogs for release v10.1.0 [software/spicerack] - https://gerrit.wikimedia.org/r/1136785 (owner: Volans)
[19:03:49] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row C - bking@cumin2002 - T388610
[19:03:53] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[19:04:24] <James_F>	 dduvall: Are you using the train window, or can I do a deploy? No available windows on Tuesday afternoons, sadly.
[19:05:15] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2082 to cirrussearch2082
[19:05:27] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[19:06:14] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T391056)', diff saved to https://phabricator.wikimedia.org/P75061 and previous config saved to /var/cache/conftool/dbconfig/20250415-190613-fceratto.json
[19:06:17] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[19:10:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2082 to cirrussearch2082 - bking@cumin2002"
[19:10:19] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2082 to cirrussearch2082 - bking@cumin2002"
[19:10:20] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:10:21] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2082
[19:10:39] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:10:56] <wikibugs>	 (PS4) Jforrester: [wikifunctionswiki] Enable Wikifunctions client mode [mediawiki-config] - https://gerrit.wikimedia.org/r/1126661 (https://phabricator.wikimedia.org/T383106)
[19:11:02] <wikibugs>	 (PS4) Jforrester: [dagwiki] Enable Wikifunctions client mode [mediawiki-config] - https://gerrit.wikimedia.org/r/1126662 (https://phabricator.wikimedia.org/T383106)
[19:11:32] <wikibugs>	 (Merged) jenkins-bot: CHANGELOG: add changelogs for release v10.1.0 [software/spicerack] - https://gerrit.wikimedia.org/r/1136785 (owner: Volans)
[19:11:35] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:11:47] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1136789
[19:12:26] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10745165 (Jclark-ctr) @RobH  we have 1 free cross connect circuit id 21996480
[19:12:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1136777 (https://phabricator.wikimedia.org/T388604) (owner: Jforrester)
[19:12:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1136778 (https://phabricator.wikimedia.org/T383145) (owner: Jforrester)
[19:12:32] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1126661 (https://phabricator.wikimedia.org/T383106) (owner: Jforrester)
[19:12:32] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1126662 (https://phabricator.wikimedia.org/T383106) (owner: Jforrester)
[19:13:06] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1136791
[19:13:12] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10745171 (Jclark-ctr)
[19:13:37] <wikibugs>	 (Merged) jenkins-bot: VE: Start setting wgVisualEditorMobileInsertMenu, default to off [mediawiki-config] - https://gerrit.wikimedia.org/r/1136777 (https://phabricator.wikimedia.org/T388604) (owner: Jforrester)
[19:13:41] <wikibugs>	 (Merged) jenkins-bot: VE: Set wgVisualEditorMobileInsertMenu true on Wikifunctions client wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1136778 (https://phabricator.wikimedia.org/T383145) (owner: Jforrester)
[19:13:45] <wikibugs>	 (Merged) jenkins-bot: [wikifunctionswiki] Enable Wikifunctions client mode [mediawiki-config] - https://gerrit.wikimedia.org/r/1126661 (https://phabricator.wikimedia.org/T383106) (owner: Jforrester)
[19:13:48] <wikibugs>	 (Merged) jenkins-bot: [dagwiki] Enable Wikifunctions client mode [mediawiki-config] - https://gerrit.wikimedia.org/r/1126662 (https://phabricator.wikimedia.org/T383106) (owner: Jforrester)
[19:14:14] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1136777|VE: Start setting wgVisualEditorMobileInsertMenu, default to off (T388604)]], [[gerrit:1136778|VE: Set wgVisualEditorMobileInsertMenu true on Wikifunctions client wikis (T383145 T388604)]], [[gerrit:1126661|[wikifunctionswiki] Enable Wikifunctions client mode (T383106)]], [[gerrit:1126662|[dagwiki] Enable Wikifunctions client mode (T383106)]]
[19:14:21] <stashbot>	 T388604: [Config] Deploy "+" menu (and new tools) to Phase 1 wikis - https://phabricator.wikimedia.org/T388604
[19:14:21] <stashbot>	 T383145: [Abstract Wikipedia] Adding Wikifunctions in VE (desktop + mobile) - https://phabricator.wikimedia.org/T383145
[19:14:21] <stashbot>	 T383106: [25Q3] Provide Wikifunctions integration in articles on Dagbani Wikipedia - https://phabricator.wikimedia.org/T383106
[19:17:38] <dduvall>	 James_F: yeah, go ahead. train is done. looks ok
[19:17:55] <James_F>	 dduvall: Awesome. (And of course I now have a train-blocker, unrelated to this. Oy.)
[19:18:59] <dduvall>	 James_F: ok. thanks for dealing with that blocker
[19:19:20] <James_F>	 Sorry to be the creator of the code that's making the blockage. :_)
[19:19:49] <James_F>	 "19:15:47 [root] Sleeping for 5 minutes to allow swift eventual consistency, sorry. T390251", sigh.
[19:19:50] <stashbot>	 T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251
[19:19:59] <James_F>	 Oh, oops, sorry stashbot.
[19:21:21] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P75062 and previous config saved to /var/cache/conftool/dbconfig/20250415-192120-fceratto.json
[19:23:39] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[19:25:47] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1136777|VE: Start setting wgVisualEditorMobileInsertMenu, default to off (T388604)]], [[gerrit:1136778|VE: Set wgVisualEditorMobileInsertMenu true on Wikifunctions client wikis (T383145 T388604)]], [[gerrit:1126661|[wikifunctionswiki] Enable Wikifunctions client mode (T383106)]], [[gerrit:1126662|[dagwiki] Enable Wikifunctions client mode (T383106)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[19:25:53] <stashbot>	 T388604: [Config] Deploy "+" menu (and new tools) to Phase 1 wikis - https://phabricator.wikimedia.org/T388604
[19:25:54] <stashbot>	 T383145: [Abstract Wikipedia] Adding Wikifunctions in VE (desktop + mobile) - https://phabricator.wikimedia.org/T383145
[19:25:54] <stashbot>	 T383106: [25Q3] Provide Wikifunctions integration in articles on Dagbani Wikipedia - https://phabricator.wikimedia.org/T383106
[19:26:03] <wikibugs>	 (PS2) Fabfur: cache: use fqdn in syslog hostname [puppet] - https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571)
[19:27:13] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10745240 (RobH)
[19:28:43] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Continuing with sync
[19:28:46] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1136679 (https://phabricator.wikimedia.org/T382571) (owner: Fabfur)
[19:33:39] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[19:35:09] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136777|VE: Start setting wgVisualEditorMobileInsertMenu, default to off (T388604)]], [[gerrit:1136778|VE: Set wgVisualEditorMobileInsertMenu true on Wikifunctions client wikis (T383145 T388604)]], [[gerrit:1126661|[wikifunctionswiki] Enable Wikifunctions client mode (T383106)]], [[gerrit:1126662|[dagwiki] Enable Wikifunctions client mode (T383106)]] (duration: 20m 54s)
[19:35:14] <stashbot>	 T388604: [Config] Deploy "+" menu (and new tools) to Phase 1 wikis - https://phabricator.wikimedia.org/T388604
[19:35:14] <stashbot>	 T383145: [Abstract Wikipedia] Adding Wikifunctions in VE (desktop + mobile) - https://phabricator.wikimedia.org/T383145
[19:35:15] <stashbot>	 T383106: [25Q3] Provide Wikifunctions integration in articles on Dagbani Wikipedia - https://phabricator.wikimedia.org/T383106
[19:36:27] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P75063 and previous config saved to /var/cache/conftool/dbconfig/20250415-193627-fceratto.json
[19:38:05] <wikibugs>	 ops-ulsfo, SRE, DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10745274 (RobH) Updates:  Work is scheduled for this afternoon, but the host is depooled so no maint window needed.  I've sent the engineer a detailed info breakdown on what to swap (pcie riser and slo...
[19:38:47] <jinxer-wm>	 FIRING: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[19:45:32] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2082
[19:45:40] <James_F>	 jouncebot: next
[19:45:40] <jouncebot>	 In 0 hour(s) and 14 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T2000)
[19:46:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2082 to cirrussearch2082
[19:46:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2082.codfw.wmnet on all recursors
[19:46:16] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2082.codfw.wmnet on all recursors
[19:46:38] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2082.codfw.wmnet with OS bullseye
[19:46:40] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5023.eqsin.wmnet
[19:46:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2082
[19:47:25] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[19:48:06] <wikibugs>	 (PS1) Ryan Kemper: sre.elasticsearch.rolling-operation: refactor external cookbook invocations [cookbooks] - https://gerrit.wikimedia.org/r/1136796 (https://phabricator.wikimedia.org/T383811)
[19:48:47] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[19:49:03] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:51:29] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2082 - bking@cumin2002"
[19:51:34] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T391056)', diff saved to https://phabricator.wikimedia.org/P75064 and previous config saved to /var/cache/conftool/dbconfig/20250415-195134-fceratto.json
[19:51:35] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2082 - bking@cumin2002"
[19:51:35] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:51:35] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2082.codfw.wmnet 87.32.192.10.in-addr.arpa 7.8.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[19:51:38] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[19:51:39] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2082.codfw.wmnet 87.32.192.10.in-addr.arpa 7.8.0.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[19:51:40] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2082
[19:51:50] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2168.codfw.wmnet with reason: Maintenance
[19:51:57] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T391056)', diff saved to https://phabricator.wikimedia.org/P75065 and previous config saved to /var/cache/conftool/dbconfig/20250415-195157-fceratto.json
[19:52:44] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2082
[19:52:44] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2082
[19:53:25] <wikibugs>	 (CR) Ryan Kemper: [C:+2] "forgot to send old comment" [dns] - https://gerrit.wikimedia.org/r/1124197 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[19:54:36] <wikibugs>	 (CR) CI reject: [V:-1] sre.elasticsearch.rolling-operation: refactor external cookbook invocations [cookbooks] - https://gerrit.wikimedia.org/r/1136796 (https://phabricator.wikimedia.org/T383811) (owner: Ryan Kemper)
[19:55:40] <wikibugs>	 (CR) Ryan Kemper: cirrussearch: Add new master-eligibles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[19:58:04] <wikibugs>	 (CR) Ryan Kemper: cirrussearch: Add new master-eligibles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T2000).
[20:00:05] <jouncebot>	 robertsky: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:36] <wikibugs>	 (PS2) Ryan Kemper: sre.elasticsearch.rolling-operation: refactor external cookbook invocations [cookbooks] - https://gerrit.wikimedia.org/r/1136796 (https://phabricator.wikimedia.org/T383811)
[20:01:27] <James_F>	 I can take the window.
[20:01:39] <robertsky>	 i am around
[20:01:46] <James_F>	 robertsky: Excellent, let's do this.
[20:01:59] <robertsky>	 just woke up for this. :)
[20:02:13] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1136731 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[20:02:49] <wikibugs>	 (PS4) Ryan Kemper: cirrussearch: Add new master-eligibles [puppet] - https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[20:02:59] <wikibugs>	 (Merged) jenkins-bot: wikimaniawiki: fix add/remove groups [mediawiki-config] - https://gerrit.wikimedia.org/r/1136731 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[20:03:07] <wikibugs>	 (PS5) Ryan Kemper: cirrussearch: Add new master-eligibles [puppet] - https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[20:03:08] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10745411 (Jclark-ctr)
[20:03:25] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1136731|wikimaniawiki: fix add/remove groups (T389729)]]
[20:03:29] <stashbot>	 T389729: wikimaniawiki: namespaces for 2027-2028 and other adjustments - https://phabricator.wikimedia.org/T389729
[20:03:49] <wikibugs>	 (PS1) Volans: Upstream release v10.1.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/1136800
[20:04:56] <wikibugs>	 (CR) Bking: [C:+1] "Conditional +1, if this works with test-cookbook we can go ahead and merge it." [cookbooks] - https://gerrit.wikimedia.org/r/1136796 (https://phabricator.wikimedia.org/T383811) (owner: Ryan Kemper)
[20:05:08] <wikibugs>	 (CR) Ryan Kemper: [C:+1] "Fixed a small error; this patch should be ready to ship now" [puppet] - https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[20:05:24] <wikibugs>	 (CR) Volans: [C:+2] Upstream release v10.1.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/1136800 (owner: Volans)
[20:07:03] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2082.codfw.wmnet with reason: host reimage
[20:07:14] <wikibugs>	 (CR) CI reject: [V:-1] sre.elasticsearch.rolling-operation: refactor external cookbook invocations [cookbooks] - https://gerrit.wikimedia.org/r/1136796 (https://phabricator.wikimedia.org/T383811) (owner: Ryan Kemper)
[20:08:39] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:08:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T391056)', diff saved to https://phabricator.wikimedia.org/P75066 and previous config saved to /var/cache/conftool/dbconfig/20250415-200855-fceratto.json
[20:09:02] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[20:09:21] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 4 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[20:09:26] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: eqiad: determine second frack - https://phabricator.wikimedia.org/T392007#10745517 (Jclark-ctr)
[20:10:45] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2082.codfw.wmnet with reason: host reimage
[20:10:45] <wikibugs>	 (PS5) BCornwall: cdn: Unify ats/haproxy/varnish upgrade cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1129882
[20:10:52] <wikibugs>	 (CR) BCornwall: cdn: Unify ats/haproxy/varnish upgrade cookbooks (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1129882 (owner: BCornwall)
[20:12:37] <wikibugs>	 (PS3) Ryan Kemper: sre.elasticsearch.rolling-operation: refactor external cookbook invocations [cookbooks] - https://gerrit.wikimedia.org/r/1136796 (https://phabricator.wikimedia.org/T383811)
[20:13:34] <wikibugs>	 (CR) Bking: [C:+1] cirrussearch: Add new master-eligibles [puppet] - https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[20:13:42] <wikibugs>	 (CR) Bking: [C:+2] cirrussearch: Add new master-eligibles [puppet] - https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[20:14:32] <wikibugs>	 (PS1) Jforrester: FetchHandler: Disable on non-repo wikis [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1136801 (https://phabricator.wikimedia.org/T392014)
[20:14:42] <wikibugs>	 (PS1) Jforrester: FetchHandler: Disable on non-repo wikis [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1136802 (https://phabricator.wikimedia.org/T392014)
[20:14:54] <wikibugs>	 (Merged) jenkins-bot: Upstream release v10.1.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/1136800 (owner: Volans)
[20:15:04] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1136801 (https://phabricator.wikimedia.org/T392014) (owner: Jforrester)
[20:15:09] <logmsgbot>	 !log jforrester@deploy1003 robertsky, jforrester: Backport for [[gerrit:1136731|wikimaniawiki: fix add/remove groups (T389729)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:15:12] <stashbot>	 T389729: wikimaniawiki: namespaces for 2027-2028 and other adjustments - https://phabricator.wikimedia.org/T389729
[20:15:12] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1136802 (https://phabricator.wikimedia.org/T392014) (owner: Jforrester)
[20:15:23] <James_F>	 robertsky: Can you test to confirm it's working as planned?
[20:16:25] <robertsky>	 yes. changes are in. lgtm.
[20:17:57] <logmsgbot>	 !log jforrester@deploy1003 robertsky, jforrester: Continuing with sync
[20:18:01] <James_F>	 Excellent, thank you.
[20:24:02] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P75067 and previous config saved to /var/cache/conftool/dbconfig/20250415-202401-fceratto.json
[20:24:12] <wikibugs>	 SRE-OnFire: Discover Phabricator changes needed for using Phabricator as incident response document - https://phabricator.wikimedia.org/T349120#10745567 (BCornwall) Open→Resolved a:BCornwall Setting this as closed as it's basically done already
[20:24:29] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136731|wikimaniawiki: fix add/remove groups (T389729)]] (duration: 21m 04s)
[20:24:33] <stashbot>	 T389729: wikimaniawiki: namespaces for 2027-2028 and other adjustments - https://phabricator.wikimedia.org/T389729
[20:27:47] <volans>	 !log uploaded spicerack_10.1.0 to apt.wikimedia.org bullseye-wikimedia
[20:27:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:47] <jinxer-wm>	 FIRING: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[20:33:01] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Wikimedia-Incident: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10745595 (Dzahn) Open→Resolved a:Dzahn Looking at the graph for the last 7 days there is nothing out of the ordinary anymore...
[20:33:17] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Wikimedia-Incident: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10745598 (Dzahn) a:Dzahn→None
[20:35:06] <wikibugs>	 (PS1) Jforrester: FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1136807 (https://phabricator.wikimedia.org/T392014)
[20:35:15] <wikibugs>	 (PS1) Jforrester: FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1136808 (https://phabricator.wikimedia.org/T392014)
[20:35:47] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[20:36:50] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2007.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2007.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[20:37:11] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1136802 (https://phabricator.wikimedia.org/T392014) (owner: Jforrester)
[20:37:12] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1136807 (https://phabricator.wikimedia.org/T392014) (owner: Jforrester)
[20:37:12] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1136801 (https://phabricator.wikimedia.org/T392014) (owner: Jforrester)
[20:37:14] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1136808 (https://phabricator.wikimedia.org/T392014) (owner: Jforrester)
[20:37:24] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2082.codfw.wmnet with OS bullseye
[20:38:49] <wikibugs>	 (Merged) jenkins-bot: FetchHandler: Disable on non-repo wikis [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1136802 (https://phabricator.wikimedia.org/T392014) (owner: Jforrester)
[20:39:09] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P75068 and previous config saved to /var/cache/conftool/dbconfig/20250415-203909-fceratto.json
[20:39:25] <wikibugs>	 (CR) CI reject: [V:-1] FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1136807 (https://phabricator.wikimedia.org/T392014) (owner: Jforrester)
[20:39:50] <wikibugs>	 (CR) Scott French: [C:+1] CampaignEvents: Migrate aggregateparticipantanswers-test2wiki [puppet] - https://gerrit.wikimedia.org/r/1136734 (https://phabricator.wikimedia.org/T385867) (owner: Kamila Součková)
[20:39:55] <wikibugs>	 (CR) CI reject: [V:-1] FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1136808 (https://phabricator.wikimedia.org/T392014) (owner: Jforrester)
[20:41:02] <wikibugs>	 (CR) Scott French: [C:+1] CampaignEvents: Migrate updateutcts-test2wiki [puppet] - https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: Clément Goubert)
[20:41:08] <wikibugs>	 (PS2) Jforrester: FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1136807 (https://phabricator.wikimedia.org/T392014)
[20:41:20] <wikibugs>	 (PS2) Jforrester: FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1136808 (https://phabricator.wikimedia.org/T392014)
[20:42:44] <wikibugs>	 (Merged) jenkins-bot: FetchHandler: Disable on non-repo wikis [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1136801 (https://phabricator.wikimedia.org/T392014) (owner: Jforrester)
[20:43:22] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1136807 (https://phabricator.wikimedia.org/T392014) (owner: Jforrester)
[20:43:23] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1136808 (https://phabricator.wikimedia.org/T392014) (owner: Jforrester)
[20:48:12] <wikibugs>	 (Merged) jenkins-bot: FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1136807 (https://phabricator.wikimedia.org/T392014) (owner: Jforrester)
[20:48:14] <wikibugs>	 (Merged) jenkins-bot: FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1136808 (https://phabricator.wikimedia.org/T392014) (owner: Jforrester)
[20:48:44] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1136802|FetchHandler: Disable on non-repo wikis (T392014)]], [[gerrit:1136807|FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either (T392014)]], [[gerrit:1136801|FetchHandler: Disable on non-repo wikis (T392014)]], [[gerrit:1136808|FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either (T392014)]]
[20:48:47] <stashbot>	 T392014: Error related to initiatlizing RESTAPI/FetchHandler.php - https://phabricator.wikimedia.org/T392014
[20:51:14] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1180.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[20:52:36] <wikibugs>	 (CR) Dzahn: [V:-1] "https://puppet-compiler.wmflabs.org/output/1136765/5306/releases1003.eqiad.wmnet/change.releases1003.eqiad.wmnet.err" [puppet] - https://gerrit.wikimedia.org/r/1136765 (https://phabricator.wikimedia.org/T384595) (owner: Dzahn)
[20:53:19] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2103 to cirrussearch2103
[20:53:40] <wikibugs>	 (PS2) Dzahn: jenkins: ensure systemd service dir exists before override [puppet] - https://gerrit.wikimedia.org/r/1136765 (https://phabricator.wikimedia.org/T384595)
[20:53:42] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[20:54:16] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T391056)', diff saved to https://phabricator.wikimedia.org/P75069 and previous config saved to /var/cache/conftool/dbconfig/20250415-205416-fceratto.json
[20:54:19] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[20:54:21] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2182.codfw.wmnet with reason: Maintenance
[20:54:27] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T391056)', diff saved to https://phabricator.wikimedia.org/P75070 and previous config saved to /var/cache/conftool/dbconfig/20250415-205427-fceratto.json
[20:56:16] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3619 MB (3% inode=98%): /tmp 3619 MB (3% inode=98%): /var/tmp 3619 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[20:56:43] <wikibugs>	 (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1136765/5307/" [puppet] - https://gerrit.wikimedia.org/r/1136765 (https://phabricator.wikimedia.org/T384595) (owner: Dzahn)
[21:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250415T2100)
[21:05:02] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1180.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[21:07:00] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1180.eqiad.wmnet with OS bullseye
[21:07:09] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02), Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10745687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1180...
[21:09:48] <jinxer-wm>	 FIRING: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[21:10:31] <wikibugs>	 (PS3) Eevans: restbase: bootstrap restbase1045 (refresh for restbase1030) [puppet] - https://gerrit.wikimedia.org/r/1130175 (https://phabricator.wikimedia.org/T389423)
[21:11:14] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1130175 (https://phabricator.wikimedia.org/T389423) (owner: Eevans)
[21:11:53] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T391056)', diff saved to https://phabricator.wikimedia.org/P75071 and previous config saved to /var/cache/conftool/dbconfig/20250415-211152-fceratto.json
[21:11:56] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[21:13:13] <wikibugs>	 (PS1) JHathaway: postfix: remove exim aliases [puppet] - https://gerrit.wikimedia.org/r/1136811
[21:13:27] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1136811 (owner: JHathaway)
[21:15:43] <wikibugs>	 (CR) Eevans: [C:+2] restbase: bootstrap restbase1045 (refresh for restbase1030) [puppet] - https://gerrit.wikimedia.org/r/1130175 (https://phabricator.wikimedia.org/T389423) (owner: Eevans)
[21:16:14] <wikibugs>	 (CR) JHathaway: [C:+2] postfix: remove exim aliases [puppet] - https://gerrit.wikimedia.org/r/1136811 (owner: JHathaway)
[21:20:52] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2103 to cirrussearch2103 - bking@cumin2002"
[21:21:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2103 to cirrussearch2103 - bking@cumin2002"
[21:21:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:21:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2103
[21:21:27] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2103
[21:22:08] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2103 to cirrussearch2103
[21:22:08] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2103.codfw.wmnet on all recursors
[21:22:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2103.codfw.wmnet on all recursors
[21:22:44] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2103.codfw.wmnet with OS bullseye
[21:22:55] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2103
[21:23:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[21:26:59] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P75072 and previous config saved to /var/cache/conftool/dbconfig/20250415-212659-fceratto.json
[21:27:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2103 - bking@cumin2002"
[21:27:14] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2103 - bking@cumin2002"
[21:27:15] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:27:15] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2103.codfw.wmnet 222.32.192.10.in-addr.arpa 2.2.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[21:27:19] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2103.codfw.wmnet 222.32.192.10.in-addr.arpa 2.2.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[21:27:19] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1136802|FetchHandler: Disable on non-repo wikis (T392014)]], [[gerrit:1136807|FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either (T392014)]], [[gerrit:1136801|FetchHandler: Disable on non-repo wikis (T392014)]], [[gerrit:1136808|FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either (T392014)]] synced to
[21:27:19] <logmsgbot>	 the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:27:19] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2103
[21:27:25] <stashbot>	 T392014: Error related to initiatlizing RESTAPI/FetchHandler.php - https://phabricator.wikimedia.org/T392014
[21:27:37] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2103
[21:27:37] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2103
[21:27:41] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Continuing with sync
[21:28:30] <wikibugs>	 SRE, Domains, Traffic, Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10745745 (BCornwall) Open→Stalled Indeed.... too bad. Hopefully we'll hear back sooner rather than later!
[21:30:59] <wikibugs>	 (CR) Scott French: [C:+1] "Thanks! Nice to see the host list is now sorted, too :)" [puppet] - https://gerrit.wikimedia.org/r/1129177 (https://phabricator.wikimedia.org/T389170) (owner: Filippo Giunchedi)
[21:41:36] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136802|FetchHandler: Disable on non-repo wikis (T392014)]], [[gerrit:1136807|FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either (T392014)]], [[gerrit:1136801|FetchHandler: Disable on non-repo wikis (T392014)]], [[gerrit:1136808|FetchHandler: Don't read from the DB in getParamSettings on non-repo wikis either (T392014
[21:41:36] <logmsgbot>	 )]] (duration: 52m 52s)
[21:41:39] <stashbot>	 T392014: Error related to initiatlizing RESTAPI/FetchHandler.php - https://phabricator.wikimedia.org/T392014
[21:42:07] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P75073 and previous config saved to /var/cache/conftool/dbconfig/20250415-214206-fceratto.json
[21:42:41] <logmsgbot>	 !log eevans@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1045.eqiad.wmnet with reason: Bootstrapping — T389423
[21:42:44] <stashbot>	 T389423: Refresh restbase10[28-30] w/ restbase104[3-5] - https://phabricator.wikimedia.org/T389423
[21:44:48] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[21:44:49] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2103.codfw.wmnet with reason: host reimage
[21:46:39] <urandom>	 !log bootstrapping Cassandra/restbase1045-{a,b,c} — T389423
[21:46:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:25] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2103.codfw.wmnet with reason: host reimage
[21:48:30] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1180.eqiad.wmnet with OS bullseye
[21:48:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02), Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10745778 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1180.eqi...
[21:50:54] <wikibugs>	 (CR) Andrew Bogott: [C:+2] mwopenstackclients: fix DnsManager [puppet] - https://gerrit.wikimedia.org/r/1136780 (owner: Andrew Bogott)
[21:53:39] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[21:53:39] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service restbase1045-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:54:48] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[21:57:14] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T391056)', diff saved to https://phabricator.wikimedia.org/P75074 and previous config saved to /var/cache/conftool/dbconfig/20250415-215714-fceratto.json
[21:57:18] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[21:57:30] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2198.codfw.wmnet with reason: Maintenance
[21:58:39] <jinxer-wm>	 FIRING: [4x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[22:00:54] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[22:07:41] <jinxer-wm>	 FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[22:10:38] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2200.codfw.wmnet with reason: Maintenance
[22:13:39] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:16:17] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3485 MB (3% inode=98%): /tmp 3485 MB (3% inode=98%): /var/tmp 3485 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[22:17:02] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2103.codfw.wmnet with OS bullseye
[22:22:50] <wikibugs>	 SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Security, Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10745829 (Eevans) >>! In T391544#10743933, @MatthewVernon wrote: > There's an LVM layer here too, isn't ther...
[22:23:09] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2208.codfw.wmnet with reason: Maintenance
[22:23:16] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T391056)', diff saved to https://phabricator.wikimedia.org/P75075 and previous config saved to /var/cache/conftool/dbconfig/20250415-222316-fceratto.json
[22:23:20] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[22:23:39] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[22:35:01] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2112 to cirrussearch2112
[22:35:22] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[22:38:39] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[22:39:50] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T391056)', diff saved to https://phabricator.wikimedia.org/P75076 and previous config saved to /var/cache/conftool/dbconfig/20250415-223949-fceratto.json
[22:39:53] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[22:41:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:43:39] <icinga-wm>	 RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4047 is OK: HTTP OK: HTTP/1.1 200 OK - 48114 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[22:46:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:50:56] <wikibugs>	 (CR) Cwhite: "On hold until 2025-04-30." [puppet] - https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380) (owner: Cwhite)
[22:52:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2112 to cirrussearch2112 - bking@cumin2002"
[22:54:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P75077 and previous config saved to /var/cache/conftool/dbconfig/20250415-225456-fceratto.json
[22:57:07] <wikibugs>	 ops-ulsfo, SRE, DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10745924 (RobH) a:RobH→ssingh Ok, parts swapped and new PCIe riser and SSD detected (only change really is serial of the ssd in lshw output).  This is now ready to have puppet run and tenativel...
[23:09:08] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2112 to cirrussearch2112 - bking@cumin2002"
[23:09:09] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:09:10] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2112
[23:09:19] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2112
[23:09:48] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: citoid - citoid-requests - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[23:10:00] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2112 to cirrussearch2112
[23:10:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2112.codfw.wmnet on all recursors
[23:10:04] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2112.codfw.wmnet on all recursors
[23:10:04] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P75078 and previous config saved to /var/cache/conftool/dbconfig/20250415-231003-fceratto.json
[23:10:53] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2112.codfw.wmnet with OS bullseye
[23:10:58] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2112
[23:10:58] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2112
[23:12:22] <wikibugs>	 (CR) Dzahn: [C:+2] jenkins: ensure systemd service dir exists before override [puppet] - https://gerrit.wikimedia.org/r/1136765 (https://phabricator.wikimedia.org/T384595) (owner: Dzahn)
[23:16:17] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3391 MB (3% inode=98%): /tmp 3391 MB (3% inode=98%): /var/tmp 3391 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[23:20:26] <wikibugs>	 (CR) Dzahn: [C:+2] "noop confirmed on releases* and contint*" [puppet] - https://gerrit.wikimedia.org/r/1136765 (https://phabricator.wikimedia.org/T384595) (owner: Dzahn)
[23:24:31] <mutante>	 why do we alert on archiva disk space if it doesn'
[23:25:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T391056)', diff saved to https://phabricator.wikimedia.org/P75079 and previous config saved to /var/cache/conftool/dbconfig/20250415-232511-fceratto.json
[23:25:16] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[23:25:28] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2220.codfw.wmnet with reason: Maintenance
[23:25:35] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2220 (T391056)', diff saved to https://phabricator.wikimedia.org/P75080 and previous config saved to /var/cache/conftool/dbconfig/20250415-232535-fceratto.json
[23:25:35] <wikibugs>	 (CR) Dzahn: [C:+2] "manually deleted the directory and saw puppet re-create it on releases2003" [puppet] - https://gerrit.wikimedia.org/r/1136765 (https://phabricator.wikimedia.org/T384595) (owner: Dzahn)
[23:27:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2112.codfw.wmnet with reason: host reimage
[23:32:15] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2112.codfw.wmnet with reason: host reimage
[23:40:44] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1136821
[23:40:44] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1136821 (owner: TrainBranchBot)
[23:41:43] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T391056)', diff saved to https://phabricator.wikimedia.org/P75081 and previous config saved to /var/cache/conftool/dbconfig/20250415-234142-fceratto.json
[23:41:47] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[23:48:24] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9600 on cirrussearch2103 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:48:24] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch2103 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:48:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: prometheus-wmf-elasticsearch-exporter-9200.service crashloop on cirrussearch2103:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[23:50:38] <jinxer-wm>	 FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2103-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[23:52:10] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2112.codfw.wmnet with OS bullseye
[23:52:20] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1136821 (owner: TrainBranchBot)
[23:53:39] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:56:50] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P75082 and previous config saved to /var/cache/conftool/dbconfig/20250415-235649-fceratto.json