[00:06:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:08:11] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1169793
[00:08:11] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1169793 (owner: TrainBranchBot)
[00:17:31] (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [dns] - https://gerrit.wikimedia.org/r/1169668 (https://phabricator.wikimedia.org/T397003) (owner: Tiziano Fogli)
[00:20:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T399249)', diff saved to https://phabricator.wikimedia.org/P79143 and previous config saved to /var/cache/conftool/dbconfig/20250716-002031-marostegui.json
[00:20:36] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[00:30:35] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1169793 (owner: TrainBranchBot)
[00:32:25] (CR) BCornwall: [C:+1] wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/1169725 (https://phabricator.wikimedia.org/T399619) (owner: Gerrit maintenance bot)
[00:35:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P79144 and previous config saved to /var/cache/conftool/dbconfig/20250716-003539-marostegui.json
[00:41:04] (CR) BCornwall: [C:+1] "Some apprehension around `&all_sites` but this is very much clearer. Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1168192 (owner: Ssingh)
[00:50:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P79145 and previous config saved to /var/cache/conftool/dbconfig/20250716-005047-marostegui.json
[00:51:08] (CR) BCornwall: [C:+1] hiera: service.yaml: use better aliasing for text/upload (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1168192 (owner: Ssingh)
[00:51:40] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/440584b787f64ff4fb7c598bc088f2bdc3808425146c9e238e8eff4152a10d12/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:05:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T399249)', diff saved to https://phabricator.wikimedia.org/P79146 and previous config saved to /var/cache/conftool/dbconfig/20250716-010554-marostegui.json
[01:05:59] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[01:06:10] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2223.codfw.wmnet with reason: Maintenance
[01:06:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2223 (T399249)', diff saved to https://phabricator.wikimedia.org/P79147 and previous config saved to /var/cache/conftool/dbconfig/20250716-010617-marostegui.json
[01:11:40] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:22:14] PROBLEM - MegaRAID on backup1007 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:22:16] ACKNOWLEDGEMENT - MegaRAID on backup1007 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T399671 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:22:27] ops-eqiad, SRE, DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399671 (ops-monitoring-bot) NEW
[01:34:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T399249)', diff saved to https://phabricator.wikimedia.org/P79148 and previous config saved to /var/cache/conftool/dbconfig/20250716-013410-marostegui.json
[01:34:15] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[01:39:25] FIRING: [3x] ProbeDown: Service aqs1012-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:49:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P79149 and previous config saved to /var/cache/conftool/dbconfig/20250716-014918-marostegui.json
[01:59:25] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:04:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P79150 and previous config saved to /var/cache/conftool/dbconfig/20250716-020426-marostegui.json
[02:11:44] (PS5) Krinkle: scap: Limit foreachwikiindblist and expanddblist to beta wikis in beta [puppet] - https://gerrit.wikimedia.org/r/941479 (https://phabricator.wikimedia.org/T357877)
[02:11:44] (CR) Krinkle: "Found a better way." [puppet] - https://gerrit.wikimedia.org/r/941479 (https://phabricator.wikimedia.org/T357877) (owner: Krinkle)
[02:13:53] (PS6) Krinkle: scap: Limit foreachwikiindblist and expanddblist to beta wikis in beta [puppet] - https://gerrit.wikimedia.org/r/941479 (https://phabricator.wikimedia.org/T125976)
[02:15:56] (PS7) Krinkle: scap: Limit foreachwikiindblist and expanddblist to beta wikis in beta [puppet] - https://gerrit.wikimedia.org/r/941479 (https://phabricator.wikimedia.org/T357877)
[02:17:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:19:08] (PS8) Krinkle: scap: Limit foreachwikiindblist and expanddblist to beta wikis in beta [puppet] - https://gerrit.wikimedia.org/r/941479 (https://phabricator.wikimedia.org/T357877)
[02:19:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T399249)', diff saved to https://phabricator.wikimedia.org/P79151 and previous config saved to /var/cache/conftool/dbconfig/20250716-021933-marostegui.json
[02:19:41] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[02:19:49] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2228.codfw.wmnet with reason: Maintenance
[02:19:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2228 (T399249)', diff saved to https://phabricator.wikimedia.org/P79152 and previous config saved to /var/cache/conftool/dbconfig/20250716-021956-marostegui.json
[02:21:06] PROBLEM - Disk space on dbprov2003 is CRITICAL: DISK CRITICAL - free space: /srv 379158MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dbprov2003&var-datasource=codfw+prometheus/ops
[02:22:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:28:06] (CR) Krinkle: "`" [puppet] - https://gerrit.wikimedia.org/r/941479 (https://phabricator.wikimedia.org/T357877) (owner: Krinkle)
[02:29:25] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:39:03] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[02:46:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T399249)', diff saved to https://phabricator.wikimedia.org/P79153 and previous config saved to /var/cache/conftool/dbconfig/20250716-024611-marostegui.json
[02:46:16] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[03:01:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P79154 and previous config saved to /var/cache/conftool/dbconfig/20250716-030119-marostegui.json
[03:09:25] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:12:32] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:16:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P79155 and previous config saved to /var/cache/conftool/dbconfig/20250716-031626-marostegui.json
[03:17:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:21:06] RECOVERY - Disk space on dbprov2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dbprov2003&var-datasource=codfw+prometheus/ops
[03:31:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T399249)', diff saved to https://phabricator.wikimedia.org/P79156 and previous config saved to /var/cache/conftool/dbconfig/20250716-033133-marostegui.json
[03:31:38] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[04:06:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[04:56:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[05:02:15] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[05:04:41] SRE, Release-Engineering-Team: Archiva Mirror Maven Central cache no space left on device - https://phabricator.wikimedia.org/T399679 (amastilovic) NEW
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:39:25] FIRING: [3x] ProbeDown: Service aqs1012-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:46:10] (PS1) Gerrit maintenance bot: mariadb: Promote db2192 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1169812 (https://phabricator.wikimedia.org/T399680)
[05:46:38] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1159.eqiad.wmnet with reason: Maintenance
[05:46:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1159 (T399249)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250716-054645-marostegui.json
[05:47:00] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[05:52:35] (PS1) Marostegui: db1256: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1169814 (https://phabricator.wikimedia.org/T399298)
[05:53:08] (CR) Marostegui: [C:+2] db1256: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1169814 (https://phabricator.wikimedia.org/T399298) (owner: Marostegui)
[05:54:05] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1256.eqiad.wmnet with reason: Maintenance
[05:54:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1256 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79157 and previous config saved to /var/cache/conftool/dbconfig/20250716-055408-marostegui.json
[05:59:25] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:59:38] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache
[06:00:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250716T0600)
[06:01:08] (CR) Arnaudb: [C:+2] gerrit: remove changeMerge settings [puppet] - https://gerrit.wikimedia.org/r/1168619 (owner: Hashar)
[06:01:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1256 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79159 and previous config saved to /var/cache/conftool/dbconfig/20250716-060148-root.json
[06:03:53] !log Restart mariadb on pc5 T399540
[06:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:58] T399540: Upgrade masters to 10.6.22 and 10.11.13 .2 update - https://phabricator.wikimedia.org/T399540
[06:07:39] FIRING: TransitBGPDown: Transit BGP session down between cr1-eqiad and NTT (2001:418:16::110) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr1-eqiad:9804&var-bgp_group=Transit6&var-bgp_neighbor=NTT - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[06:09:55] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: maintenance
[06:11:26] (CR) Arnaudb: [C:+2] Revert "Gerrit: Set cache for groups" [puppet] - https://gerrit.wikimedia.org/r/1169658 (owner: Hashar)
[06:12:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-eqiad and NTT (192.80.17.185) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[06:15:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T399249)', diff saved to https://phabricator.wikimedia.org/P79160 and previous config saved to /var/cache/conftool/dbconfig/20250716-061539-marostegui.json
[06:15:43] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[06:16:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1256 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79161 and previous config saved to /var/cache/conftool/dbconfig/20250716-061653-root.json
[06:19:27] !log Poweroff pc2015 for 10G migration T378715
[06:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:32] T378715: Possibility to transition some codfw data persistence hosts to 10G - https://phabricator.wikimedia.org/T378715
[06:20:56] (CR) Arnaudb: [C:+2] gerrit: remove GWT-only theme configuration [puppet] - https://gerrit.wikimedia.org/r/1169660 (owner: Hashar)
[06:24:20] (PS1) Marostegui: db2149: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1169816 (https://phabricator.wikimedia.org/T399548)
[06:24:56] (CR) Marostegui: [C:+2] db2149: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1169816 (https://phabricator.wikimedia.org/T399548) (owner: Marostegui)
[06:25:33] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2149.codfw.wmnet with reason: Maintenance
[06:25:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2149 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79162 and previous config saved to /var/cache/conftool/dbconfig/20250716-062537-marostegui.json
[06:29:25] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:30:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P79163 and previous config saved to /var/cache/conftool/dbconfig/20250716-063046-marostegui.json
[06:32:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1256 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79164 and previous config saved to /var/cache/conftool/dbconfig/20250716-063159-root.json
[06:36:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79165 and previous config saved to /var/cache/conftool/dbconfig/20250716-063646-root.json
[06:39:03] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[06:39:46] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Upgrade s6 codfw master
[06:44:47] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Upgrade x3 codfw master
[06:45:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P79166 and previous config saved to /var/cache/conftool/dbconfig/20250716-064553-marostegui.json
[06:47:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1256 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79167 and previous config saved to /var/cache/conftool/dbconfig/20250716-064705-root.json
[06:49:38] (PS2) Ayounsi: Routed Ganeti: disable IPv4 ICMP redirects [puppet] - https://gerrit.wikimedia.org/r/1169663 (https://phabricator.wikimedia.org/T362392)
[06:51:06] (CR) Giuseppe Lavagetto: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1169621 (https://phabricator.wikimedia.org/T399058) (owner: Vgutierrez)
[06:51:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79168 and previous config saved to /var/cache/conftool/dbconfig/20250716-065152-root.json
[06:54:43] (PS1) Marostegui: db1257: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1170022 (https://phabricator.wikimedia.org/T399298)
[06:55:23] (CR) Marostegui: [C:+2] db1257: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1170022 (https://phabricator.wikimedia.org/T399298) (owner: Marostegui)
[06:55:23] (CR) Muehlenhoff: [C:+1] "Looks good, thanks" [deployment-charts] - https://gerrit.wikimedia.org/r/1169752 (owner: Scott French)
[06:55:54] (CR) Ayounsi: [C:+2] Routed Ganeti: disable IPv4 ICMP redirects [puppet] - https://gerrit.wikimedia.org/r/1169663 (https://phabricator.wikimedia.org/T362392) (owner: Ayounsi)
[06:56:22] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1257.eqiad.wmnet with reason: Maintenance
[06:56:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1257 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79169 and previous config saved to /var/cache/conftool/dbconfig/20250716-065626-marostegui.json
[07:00:04] Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250716T0700).
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:01:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T399249)', diff saved to https://phabricator.wikimedia.org/P79170 and previous config saved to /var/cache/conftool/dbconfig/20250716-070101-marostegui.json
[07:01:11] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[07:01:17] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[07:01:21] (CR) Giuseppe Lavagetto: [C:-1] "While the patch in itself seems ok, looking at the ACL wikimedia_nets, it seems different from the one we have in haproxy, which doesn't i" [puppet] - https://gerrit.wikimedia.org/r/1169664 (https://phabricator.wikimedia.org/T399058) (owner: Vgutierrez)
[07:01:23] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[07:01:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T399249)', diff saved to https://phabricator.wikimedia.org/P79171 and previous config saved to /var/cache/conftool/dbconfig/20250716-070130-marostegui.json
[07:02:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-eqiad and NTT (192.80.17.185) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[07:03:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79172 and previous config saved to /var/cache/conftool/dbconfig/20250716-070338-root.json
[07:06:04] (CR) Muehlenhoff: [C:+1] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/1169162 (https://phabricator.wikimedia.org/T398640) (owner: Elukey)
[07:07:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79173 and previous config saved to /var/cache/conftool/dbconfig/20250716-070659-root.json
[07:09:25] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:18:04] SRE-swift-storage, MinT, LPL Essential (2025 Jul-Sep), LPL Projects (MinT for Wikireaders – FY26 WE 3.1.5): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#11007918 (Nikerabbit) In progress→Resolved
[07:18:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79174 and previous config saved to /var/cache/conftool/dbconfig/20250716-071844-root.json
[07:22:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2149 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79175 and previous config saved to /var/cache/conftool/dbconfig/20250716-072205-root.json
[07:23:15] (CR) Elukey: [C:+1] "I added my concerns to the patch, but it is also fine to proceed if PCC looks good." [puppet] - https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565) (owner: Alexandros Kosiaris)
[07:24:11] (CR) Hashar: [C:+2] Move repository to gitlab [software/opensearch/plugins] - https://gerrit.wikimedia.org/r/1169730 (https://phabricator.wikimedia.org/T399617) (owner: Ebernhardson)
[07:29:43] !log jelto@cumin1003 START - Cookbook sre.hosts.reimage for host gitlab1004.wikimedia.org with OS bookworm
[07:30:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T399249)', diff saved to https://phabricator.wikimedia.org/P79176 and previous config saved to /var/cache/conftool/dbconfig/20250716-073031-marostegui.json
[07:30:37] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[07:33:04] PROBLEM - Host gitlab-replica-a.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[07:33:38] ?
[07:33:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79177 and previous config saved to /var/cache/conftool/dbconfig/20250716-073349-root.json
[07:36:42] FIRING: [3x] JobUnavailable: Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:37:58] i cannot connect to gitlab1004
[07:41:42] FIRING: [6x] JobUnavailable: Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:42:29] no login on serial console
[07:42:39] ah, it is rebooting
[07:43:30] jynus: it's being reimaged
[07:43:32] see SAL
[07:43:37] or above
[07:43:42] oh, I didn't see it
[07:44:39] AFAICT those alerts are not attached to the host so they were not silenced and would have required a manual silence
[07:44:46] *not automatically silenced
[07:45:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P79178 and previous config saved to /var/cache/conftool/dbconfig/20250716-074538-marostegui.json
[07:45:47] it's ok, the log was invisible to me
[07:46:35] !log jelto@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab1004.wikimedia.org with reason: host reimage
[07:48:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1257 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79179 and previous config saved to /var/cache/conftool/dbconfig/20250716-074855-root.json
[07:49:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2162 with weight 0 T399456', diff saved to https://phabricator.wikimedia.org/P79180 and previous config saved to /var/cache/conftool/dbconfig/20250716-074931-root.json
[07:49:35] T399456: Switchover x3 master (db2241 -> db2162) - https://phabricator.wikimedia.org/T399456
[07:49:48] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Primary switchover x3 T399456
[07:50:10] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab1004.wikimedia.org with reason: host reimage
[07:50:34] (CR) Marostegui: [C:+2] mariadb: Promote db2162 to x3 master [puppet] - https://gerrit.wikimedia.org/r/1169093 (https://phabricator.wikimedia.org/T399456) (owner: Gerrit maintenance bot)
[07:53:10] (PS1) Marostegui: db2241: Fix comment [puppet] - https://gerrit.wikimedia.org/r/1170082 (https://phabricator.wikimedia.org/T399456)
[07:53:22] (CR) Marostegui: "This is a noop" [puppet] - https://gerrit.wikimedia.org/r/1170082 (https://phabricator.wikimedia.org/T399456) (owner: Marostegui)
[07:53:39] (CR) Marostegui: [C:+2] db2241: Fix comment [puppet] - https://gerrit.wikimedia.org/r/1170082 (https://phabricator.wikimedia.org/T399456) (owner: Marostegui)
[07:54:29] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:54:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2162 to x3 primary T399456', diff saved to https://phabricator.wikimedia.org/P79181 and previous config saved to /var/cache/conftool/dbconfig/20250716-075448-marostegui.json
[07:54:52] T399456: Switchover x3 master (db2241 -> db2162) - https://phabricator.wikimedia.org/T399456
[07:54:54] !log Starting x3 codfw failover from db2241 to db2162 - T399456
[07:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:23] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[07:55:29] (PS3) Vgutierrez: cache::haproxy: Provide X-Trusted-Request score [puppet] - https://gerrit.wikimedia.org/r/1169621 (https://phabricator.wikimedia.org/T399058)
[07:55:29] (PS3) Vgutierrez: varnish: Apply requestctl rules based on X-Trusted-Request [puppet] - https://gerrit.wikimedia.org/r/1169664 (https://phabricator.wikimedia.org/T399058)
[07:55:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2241 T399456', diff saved to https://phabricator.wikimedia.org/P79182 and previous config saved to /var/cache/conftool/dbconfig/20250716-075534-marostegui.json
[07:56:23] (CR) Vgutierrez: "that's https://gerrit.wikimedia.org/r/c/operations/puppet/+/1169621/2..3/modules/profile/templates/cache/haproxy/tls_terminator.cfg.erb" [puppet] - https://gerrit.wikimedia.org/r/1169664 (https://phabricator.wikimedia.org/T399058) (owner: Vgutierrez)
[07:56:51] (PS1) Marostegui: db2241: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1170083 (https://phabricator.wikimedia.org/T399456)
[07:57:57] (CR) Marostegui: [C:+2] db2241: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1170083 (https://phabricator.wikimedia.org/T399456) (owner: Marostegui)
[07:58:14] (CR) Federico Ceratto: "LGTM, maybe it could be useful to add a link to docs on how to configure external-services" [deployment-charts] - https://gerrit.wikimedia.org/r/1169162 (https://phabricator.wikimedia.org/T398640) (owner: Elukey)
[07:58:33] RECOVERY - Host gitlab-replica-a.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 110.24 ms
[07:59:08] FIRING: [5x] ProbeDown: Service aqs1012-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:00:05] dancy and andre: MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250716T0800). Please do the needful.
[08:00:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P79183 and previous config saved to /var/cache/conftool/dbconfig/20250716-080046-marostegui.json [08:04:21] (03CR) 10Giuseppe Lavagetto: [C:03+1] "I didn't check the WME ip ranges but otherwise lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1169621 (https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez) [08:04:53] (03CR) 10Muehlenhoff: [C:03+2] Stop using debug repository on Buster [puppet] - 10https://gerrit.wikimedia.org/r/1169610 (https://phabricator.wikimedia.org/T397209) (owner: 10Muehlenhoff) [08:04:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2241 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P79184 and previous config saved to /var/cache/conftool/dbconfig/20250716-080458-root.json [08:06:18] (03PS1) 10Aqu: Analytics: Refine post migration update [puppet] - 10https://gerrit.wikimedia.org/r/1170084 (https://phabricator.wikimedia.org/T369845) [08:06:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2229 with weight 0 T399533', diff saved to https://phabricator.wikimedia.org/P79185 and previous config saved to /var/cache/conftool/dbconfig/20250716-080639-root.json [08:06:42] FIRING: [6x] JobUnavailable: Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:06:44] T399533: Switchover s6 master (db2214 -> db2229) - https://phabricator.wikimedia.org/T399533 [08:06:50] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s6 T399533 [08:06:53] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - 
https://phabricator.wikimedia.org/T394357#11008081 (10elukey) For some reason, on these nodes we have: ` 'COM1ConsoleRedirection', 'ConsoleRedirectionEMS', 'SOL_COM2ConsoleRedirection' ` [08:07:19] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2229 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1169306 (https://phabricator.wikimedia.org/T399533) (owner: 10Gerrit maintenance bot) [08:09:39] (03CR) 10Aqu: [V:03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1170084 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [08:09:54] (03PS1) 10Elukey: WIP: sre.hosts.provision: add custom console redir settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) [08:09:59] PROBLEM - Host gitlab-replica-a.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [08:11:58] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:12:40] !log Starting s6 codfw failover from db2214 to db2229 - T399533 [08:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:43] T399533: Switchover s6 master (db2214 -> db2229) - https://phabricator.wikimedia.org/T399533 [08:13:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2229 to s6 primary T399533', diff saved to https://phabricator.wikimedia.org/P79186 and previous config saved to /var/cache/conftool/dbconfig/20250716-081302-marostegui.json [08:13:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2214 T399533', diff saved to https://phabricator.wikimedia.org/P79187 and previous config saved to /var/cache/conftool/dbconfig/20250716-081350-marostegui.json [08:15:01] RECOVERY - Host gitlab-replica-a.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [08:15:35] !log jelto@cumin1003 END 
(PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab1004.wikimedia.org with OS bookworm [08:15:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T399249)', diff saved to https://phabricator.wikimedia.org/P79189 and previous config saved to /var/cache/conftool/dbconfig/20250716-081553-marostegui.json [08:15:58] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [08:16:09] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1185.eqiad.wmnet with reason: Maintenance [08:16:15] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2214.codfw.wmnet with reason: maintenance [08:16:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T399249)', diff saved to https://phabricator.wikimedia.org/P79190 and previous config saved to /var/cache/conftool/dbconfig/20250716-081615-marostegui.json [08:17:15] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:17:49] (03CR) 10Elukey: "Thanks for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1169162 (https://phabricator.wikimedia.org/T398640) (owner: 10Elukey) [08:18:45] (03PS2) 10Elukey: WIP: sre.hosts.provision: add custom console redir settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) [08:19:45] 10ops-codfw, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11008124 (10elukey) I tested the above patch with test-cookbook and it seems to work, but of course I found another issue when configuring the BMC's network... 
[08:19:47] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399671#11008126 (10jcrespo) The disk being rebuilt has 1 error and a S.M.A.R.T alert. But aside from that, there are also 2 disks with high error counts :-(: ` Enclosure Device ID: 32 Slot Number: 24 Enclosure po... [08:20:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2241 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P79192 and previous config saved to /var/cache/conftool/dbconfig/20250716-082004-root.json [08:21:43] !log jelto@cumin1003 START - Cookbook sre.hosts.reimage for host gitlab1004.wikimedia.org with OS bookworm [08:23:00] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backup1007.eqiad.wmnet with reason: Stop minio [08:23:09] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399671#11008145 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=97c976e9-1bb4-4b64-b9b4-083b6aafa2fa) set by jynus@cumin1003 for 4:00:00 on 1 host(s) and their services with reason: Stop minio `... 
[08:24:30] (03PS1) 10Marostegui: db2156: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170086 (https://phabricator.wikimedia.org/T399548) [08:25:03] (03CR) 10Marostegui: [C:03+2] db2156: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170086 (https://phabricator.wikimedia.org/T399548) (owner: 10Marostegui) [08:25:27] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2156.codfw.wmnet with reason: Maintenance [08:25:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2156 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79193 and previous config saved to /var/cache/conftool/dbconfig/20250716-082530-marostegui.json [08:26:29] PROBLEM - Host gitlab-replica-a.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [08:26:45] (03PS2) 10Aqu: Analytics: Refine post migration update [puppet] - 10https://gerrit.wikimedia.org/r/1170084 (https://phabricator.wikimedia.org/T369845) [08:27:21] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169621 (https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez) [08:29:10] (03PS3) 10Elukey: WIP: sre.hosts.provision: add custom settings for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1170085 (https://phabricator.wikimedia.org/T394357) [08:29:19] (03CR) 10Aqu: [V:03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1170084 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [08:30:01] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:31:42] FIRING: [6x] JobUnavailable: Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:32:36] Hi everyone, we are investigating why the `kafka_burrow_partition_lag{group="cpjobqueue-ORESFetchScoreJob"}` metric stopped being exported from the codfw site: https://prometheus-codfw.wikimedia.org/ops/graph?g0.expr=kafka_burrow_partition_lag%7Bgroup%3D%22cpjobqueue-ORESFetchScoreJob%22%7D&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=2d [08:32:36] this is causing a recurring linting alert for the ML team as reported in: https://phabricator.wikimedia.org/T399683 [08:32:36] does anyone recall a change around that time that might have affected this job? any pointers would be greatly appreciated. [08:32:36] cc: elukey, isaranto ---^ [08:35:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2241 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P79196 and previous config saved to /var/cache/conftool/dbconfig/20250716-083509-root.json [08:36:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79198 and previous config saved to /var/cache/conftool/dbconfig/20250716-083614-root.json [08:39:13] RECOVERY - Disk space on archiva1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [08:40:14] (03CR) 10Vgutierrez: [C:03+1] hiera: service.yaml: use better aliasing for text/upload (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1168192 (owner: 10Ssingh) [08:40:25] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:40:29] (03CR) 10Fabfur: [C:03+1] cache::haproxy: Provide X-Trusted-Request score [puppet] - 10https://gerrit.wikimedia.org/r/1169621 
(https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez) [08:42:02] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399671#11008241 (10jcrespo) ` root@backup1007:~$ megacli -pdrbld -showprog -physdrv\[32:13\] -aALL Rebuild Progress on Device at Enclosure 32, Slot 13 Completed 22% in 397 Minu... [08:43:30] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-worker1189 - https://phabricator.wikimedia.org/T398773#11008248 (10BTullis) Thanks @Jclark-ctr and apologies for the delay. We're OK with deleting the preserved cache on these an-worker data drives, because they are all individual raid0... [08:43:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T399249)', diff saved to https://phabricator.wikimedia.org/P79200 and previous config saved to /var/cache/conftool/dbconfig/20250716-084354-marostegui.json [08:43:58] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [08:44:00] (03PS3) 10FNegri: openstack: nova: Load nf_conntrack module at boot [puppet] - 10https://gerrit.wikimedia.org/r/1167899 (https://phabricator.wikimedia.org/T399212) [08:44:13] (03CR) 10Vgutierrez: [C:03+1] "thx!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1152114 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [08:44:26] (03PS4) 10FNegri: openstack: nova: Load nf_conntrack module at boot [puppet] - 10https://gerrit.wikimedia.org/r/1167899 (https://phabricator.wikimedia.org/T399212) [08:44:27] (03CR) 10Vgutierrez: [C:03+2] cache::haproxy: Provide X-Trusted-Request score [puppet] - 10https://gerrit.wikimedia.org/r/1169621 (https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez) [08:44:57] RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1189 is OK: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [08:49:42] (03PS5) 10FNegri: openstack: nova: Load nf_conntrack module at boot [puppet] - 10https://gerrit.wikimedia.org/r/1167899 (https://phabricator.wikimedia.org/T399212) [08:50:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2241 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P79201 and previous config saved to /var/cache/conftool/dbconfig/20250716-085015-root.json [08:51:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79202 and previous config saved to /var/cache/conftool/dbconfig/20250716-085119-root.json [08:51:42] RESOLVED: [6x] JobUnavailable: Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:53:29] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-worker1189 - https://phabricator.wikimedia.org/T398773#11008254 (10BTullis) I have prepared the two disks as per the instructions at: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Hadoop/Administration#Swapping_broken_disk `... 
[08:53:30] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1189.eqiad.wmnet [08:53:50] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-worker1189 - https://phabricator.wikimedia.org/T398773#11008256 (10ops-monitoring-bot) Host an-worker1189.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting after replacing disk [08:55:41] FIRING: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:55:41] (03PS1) 10Gkyziridis: ores-extension: enable revertrisk filter for simplewiki and trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170092 (https://phabricator.wikimedia.org/T395668) [08:56:13] (03CR) 10FNegri: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6291/co" [puppet] - 10https://gerrit.wikimedia.org/r/1167899 (https://phabricator.wikimedia.org/T399212) (owner: 10FNegri) [08:57:35] (03CR) 10FNegri: [V:03+1] openstack: nova: Load nf_conntrack module at boot (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1167899 (https://phabricator.wikimedia.org/T399212) (owner: 10FNegri) [08:58:20] (03PS6) 10FNegri: openstack: nova: Load nf_conntrack module at boot [puppet] - 10https://gerrit.wikimedia.org/r/1167899 (https://phabricator.wikimedia.org/T399212) [08:59:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250716-085901-marostegui.json [08:59:53] (03PS2) 10Gkyziridis: ores-extension: enable revertrisk filter for simplewiki and trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170092 (https://phabricator.wikimedia.org/T395668) [09:00:22] (03CR) 
10FNegri: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6292/co" [puppet] - 10https://gerrit.wikimedia.org/r/1167899 (https://phabricator.wikimedia.org/T399212) (owner: 10FNegri) [09:01:33] !log jelto@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host gitlab1004.wikimedia.org with OS bookworm [09:01:41] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1189.eqiad.wmnet [09:02:15] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:04:05] !log jelto@cumin1003 START - Cookbook sre.hosts.reimage for host gitlab1004.wikimedia.org with OS bookworm [09:06:03] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:06:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79204 and previous config saved to /var/cache/conftool/dbconfig/20250716-090625-root.json [09:07:31] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:07:50] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:09:04] (03PS1) 10Vgutierrez: cache::haproxy: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1170093 (https://phabricator.wikimedia.org/T399058) [09:09:07] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:09:46] (03CR) 10Vgutierrez: "check experimental" [puppet] 
- 10https://gerrit.wikimedia.org/r/1170093 (https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez) [09:10:06] 10ops-codfw, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11008286 (10elukey) @Jhancock.wm I've provisioned the host, could you please check if everything looks good? [09:12:56] (03CR) 10Vgutierrez: [C:03+2] cache::haproxy: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1170093 (https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez) [09:13:18] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#11008304 (10elukey) @Jhancock.wm Hi! I tried to run a customized version of the provision script (the same that worked for sretest2010) but for some reason the host doesn't seem to be network reachable... [09:14:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P79205 and previous config saved to /var/cache/conftool/dbconfig/20250716-091413-marostegui.json [09:16:30] !log jelto@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host gitlab1004.wikimedia.org with OS bookworm [09:16:55] (03PS2) 10Elukey: python-webapp: add external-services support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1169162 (https://phabricator.wikimedia.org/T398640) [09:18:32] !log jelto@cumin1003 START - Cookbook sre.hosts.reimage for host gitlab1004.wikimedia.org with OS bookworm [09:20:10] (03PS1) 10Muehlenhoff: Remove check_user script [puppet] - 10https://gerrit.wikimedia.org/r/1170096 (https://phabricator.wikimedia.org/T394072) [09:20:41] RESOLVED: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed 
[09:21:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79206 and previous config saved to /var/cache/conftool/dbconfig/20250716-092131-root.json [09:21:32] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1167899 (https://phabricator.wikimedia.org/T399212) (owner: 10FNegri) [09:21:41] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1170096 (https://phabricator.wikimedia.org/T394072) (owner: 10Muehlenhoff) [09:22:56] (03PS1) 10Marostegui: db2177: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170097 (https://phabricator.wikimedia.org/T399548) [09:23:27] (03CR) 10Marostegui: [C:03+2] db2177: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170097 (https://phabricator.wikimedia.org/T399548) (owner: 10Marostegui) [09:24:17] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2177.codfw.wmnet with reason: Maintenance [09:24:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2177 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79207 and previous config saved to /var/cache/conftool/dbconfig/20250716-092420-marostegui.json [09:26:26] (03CR) 10Brouberol: [C:03+1] Analytics: Refine post migration update [puppet] - 10https://gerrit.wikimedia.org/r/1170084 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [09:26:43] (03CR) 10Muehlenhoff: "(The PCC failure for P5 is expected and irrelevant)" [puppet] - 10https://gerrit.wikimedia.org/r/1170096 (https://phabricator.wikimedia.org/T394072) (owner: 10Muehlenhoff) [09:27:05] (03CR) 10Brouberol: [C:03+2] Analytics: Refine post migration update [puppet] - 10https://gerrit.wikimedia.org/r/1170084 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [09:27:55] (03CR) 10FNegri: [V:03+1 C:03+2] openstack: nova: Load nf_conntrack 
module at boot [puppet] - 10https://gerrit.wikimedia.org/r/1167899 (https://phabricator.wikimedia.org/T399212) (owner: 10FNegri) [09:28:32] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1258 to x3 master [puppet] - 10https://gerrit.wikimedia.org/r/1170098 (https://phabricator.wikimedia.org/T399699) [09:28:37] (03PS1) 10Gerrit maintenance bot: wmnet: Update x3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1170099 (https://phabricator.wikimedia.org/T399699) [09:29:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T399249)', diff saved to https://phabricator.wikimedia.org/P79208 and previous config saved to /var/cache/conftool/dbconfig/20250716-092919-marostegui.json [09:29:24] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [09:29:35] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1200.eqiad.wmnet with reason: Maintenance [09:29:40] (03CR) 10Btullis: [C:03+1] "Nice, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1169627 (owner: 10Muehlenhoff) [09:29:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T399249)', diff saved to https://phabricator.wikimedia.org/P79209 and previous config saved to /var/cache/conftool/dbconfig/20250716-092942-marostegui.json [09:30:55] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: maintenance [09:32:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): Degraded RAID on an-worker1189 - https://phabricator.wikimedia.org/T398773#11008379 (10BTullis) 05Open→03Resolved [09:33:10] (03PS1) 10Marostegui: db1151: Remove comment [puppet] - 10https://gerrit.wikimedia.org/r/1170102 [09:34:22] !log jelto@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab1004.wikimedia.org with reason: host reimage [09:34:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 25%: 10', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250716-093448-root.json [09:35:16] (03CR) 10Marostegui: "This is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/1170102 (owner: 10Marostegui) [09:35:18] (03CR) 10Marostegui: [C:03+2] db1151: Remove comment [puppet] - 10https://gerrit.wikimedia.org/r/1170102 (owner: 10Marostegui) [09:37:47] (03PS1) 10Marostegui: db1152: Remove comment [puppet] - 10https://gerrit.wikimedia.org/r/1170103 [09:38:01] (03CR) 10Marostegui: "This is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/1170103 (owner: 10Marostegui) [09:38:38] (03CR) 10Marostegui: [C:03+2] db1152: Remove comment [puppet] - 10https://gerrit.wikimedia.org/r/1170103 (owner: 10Marostegui) [09:38:59] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab1004.wikimedia.org with reason: host reimage [09:46:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via 
main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:47:56] RECOVERY - Host gitlab-replica-a.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [09:49:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79210 and previous config saved to /var/cache/conftool/dbconfig/20250716-094958-root.json [09:50:14] (03PS2) 10Tiziano Fogli: prom/metamonitor: add CNAMEs for metamonitoring endpoints [dns] - 10https://gerrit.wikimedia.org/r/1169668 (https://phabricator.wikimedia.org/T397003) [09:50:33] (03PS1) 10Tiziano Fogli: prom/metamonitor: make physical vhosts agnostic to the machine hostname [puppet] - 10https://gerrit.wikimedia.org/r/1170104 (https://phabricator.wikimedia.org/T397003) [09:51:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:51:39] (03CR) 10Giuseppe Lavagetto: [C:03+1] varnish: Apply requestctl rules based on X-Trusted-Request [puppet] - 10https://gerrit.wikimedia.org/r/1169664 (https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez) [09:52:10] (03PS4) 10Vgutierrez: varnish: Apply requestctl rules based on X-Trusted-Request [puppet] - 10https://gerrit.wikimedia.org/r/1169664 (https://phabricator.wikimedia.org/T399058) [09:54:42] FIRING: JobUnavailable: Reduced availability for 
job wmf_gitlab_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:55:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T399249)', diff saved to https://phabricator.wikimedia.org/P79211 and previous config saved to /var/cache/conftool/dbconfig/20250716-095554-marostegui.json [09:55:59] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [09:56:12] jouncebot: nowandnext [09:56:13] For the next 0 hour(s) and 3 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250716T0800) [09:56:13] In 0 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250716T1000) [09:57:37] (03PS1) 10Kevin Bazira: team-ml: use global deploy tag for ORESFetchScoreJobKafkaLag alert [alerts] - 10https://gerrit.wikimedia.org/r/1170107 (https://phabricator.wikimedia.org/T399683) [09:59:25] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250716T1000) [10:00:33] (03CR) 10Filippo Giunchedi: [C:03+1] prom/metamonitor: add CNAMEs for metamonitoring endpoints [dns] - 10https://gerrit.wikimedia.org/r/1169668 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [10:00:33] (03PS6) 10Bartosz Wójtowicz: statistics: Add Python script for model uploading to statistics machines. 
[puppet] - 10https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) [10:02:29] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab1004.wikimedia.org with OS bookworm [10:03:11] (03CR) 10CI reject: [V:04-1] statistics: Add Python script for model uploading to statistics machines. [puppet] - 10https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) (owner: 10Bartosz Wójtowicz) [10:04:14] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab1004.wikimedia.org [10:04:29] (03PS7) 10Bartosz Wójtowicz: statistics: Add Python script for model uploading to statistics machines. [puppet] - 10https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) [10:04:48] (03CR) 10Bartosz Wójtowicz: statistics: Add Python script for model uploading to statistics machines. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) (owner: 10Bartosz Wójtowicz) [10:05:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79212 and previous config saved to /var/cache/conftool/dbconfig/20250716-100504-root.json [10:05:36] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: maintenance [10:05:57] (03PS4) 10Alexandros Kosiaris: maps: Add tegola user in DB, mark tilerator for removal [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565) [10:05:57] (03PS3) 10Alexandros Kosiaris: DNM: Prep patch for removal of old maps roles [puppet] - 10https://gerrit.wikimedia.org/r/1169636 (https://phabricator.wikimedia.org/T381565) [10:06:54] (03PS1) 10Kevin Bazira: team-ml: use global deploy tag for ORESFetchScoreJobKafkaLag alert [alerts] - 10https://gerrit.wikimedia.org/r/1170109 (https://phabricator.wikimedia.org/T399683) 
[10:07:08] (03CR) 10Alexandros Kosiaris: maps: Add tegola user in DB, mark tilerator for removal (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris) [10:07:28] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris) [10:07:37] (03PS5) 10D3r1ck01: ParserCache: Enable purgePeriod for SqlBagOStuff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167217 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [10:08:23] (03PS1) 10Btullis: Update one of the spark images to golang 1.19 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1170110 (https://phabricator.wikimedia.org/T390139) [10:08:52] (03CR) 10D3r1ck01: [C:03+1] "Looks good, thanks! Deploy at will 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167217 (https://phabricator.wikimedia.org/T398806) (owner: 10Ladsgroup) [10:08:52] (03PS2) 10Kevin Bazira: team-ml: use global deploy tag for ORESFetchScoreJobKafkaLag alert [alerts] - 10https://gerrit.wikimedia.org/r/1170107 (https://phabricator.wikimedia.org/T399683) [10:09:38] (03PS1) 10Muehlenhoff: No longer apply the eventlogging-admins access group to perf and deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/1170111 (https://phabricator.wikimedia.org/T238230) [10:09:49] (03PS2) 10Muehlenhoff: No longer apply the eventlogging-admins access group to perf and deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/1170111 (https://phabricator.wikimedia.org/T238230) [10:10:18] (03CR) 10Muehlenhoff: [C:03+2] Deprecate dumpsdata-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1169627 (owner: 10Muehlenhoff) [10:10:23] (03PS2) 10Btullis: Update one of the spark images to golang 1.19 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1170110 (https://phabricator.wikimedia.org/T390139) 
[10:10:27] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1004.wikimedia.org [10:11:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P79213 and previous config saved to /var/cache/conftool/dbconfig/20250716-101102-marostegui.json [10:11:28] (03PS3) 10Btullis: Update one of the spark images to golang 1.19 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1170110 (https://phabricator.wikimedia.org/T390139) [10:11:44] (03CR) 10Btullis: [V:03+2 C:03+2] Update one of the spark images to golang 1.19 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1170110 (https://phabricator.wikimedia.org/T390139) (owner: 10Btullis) [10:13:26] (03CR) 10Ilias Sarantopoulos: [C:04-1] "We shouldn't disable the models that are already enabled. Other than that the rest of the patch looks fine!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170092 (https://phabricator.wikimedia.org/T395668) (owner: 10Gkyziridis) [10:16:12] (03PS3) 10Gkyziridis: ores-extension: enable revertrisk filter for simplewiki and trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170092 (https://phabricator.wikimedia.org/T395668) [10:19:00] (03CR) 10Gkyziridis: ores-extension: enable revertrisk filter for simplewiki and trwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170092 (https://phabricator.wikimedia.org/T395668) (owner: 10Gkyziridis) [10:20:05] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: maintenance [10:20:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79214 and previous config saved to /var/cache/conftool/dbconfig/20250716-102009-root.json [10:22:06] (03CR) 10Gkyziridis: "I am not sure about the tests in jenkins. 
Gerrit does not report any errors but this seems strange: https://integration.wikimedia.org/ci/j" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170092 (https://phabricator.wikimedia.org/T395668) (owner: 10Gkyziridis) [10:24:14] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab1004.wikimedia.org [10:26:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P79215 and previous config saved to /var/cache/conftool/dbconfig/20250716-102609-marostegui.json [10:29:25] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:30:22] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1004.wikimedia.org [10:35:13] (03CR) 10Vgutierrez: [C:03+2] varnish: Apply requestctl rules based on X-Trusted-Request [puppet] - 10https://gerrit.wikimedia.org/r/1169664 (https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez) [10:36:38] RECOVERY - snapshot of s3 in codfw on backupmon1001 is OK: Last snapshot for s3 at codfw (db2239) taken on 2025-07-16 07:40:34 (1198 GiB, +0.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [10:39:03] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [10:41:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T399249)', diff saved to https://phabricator.wikimedia.org/P79216 and previous config saved to /var/cache/conftool/dbconfig/20250716-104117-marostegui.json 
[10:41:21] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [10:41:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1207.eqiad.wmnet with reason: Maintenance [10:41:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T399249)', diff saved to https://phabricator.wikimedia.org/P79217 and previous config saved to /var/cache/conftool/dbconfig/20250716-104139-marostegui.json [10:42:15] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:44:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170092 (https://phabricator.wikimedia.org/T395668) (owner: 10Gkyziridis) [10:46:29] (03PS1) 10Btullis: Revert^2 "Fail over hive services to an-coord1004" [dns] - 10https://gerrit.wikimedia.org/r/1170116 [10:47:17] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973#11008647 (10Volans) 05Open→03Resolved [10:48:30] (03CR) 10Btullis: [C:03+2] Revert^2 "Fail over hive services to an-coord1004" [dns] - 10https://gerrit.wikimedia.org/r/1170116 (owner: 10Btullis) [10:48:43] !log btullis@dns1004 START - running authdns-update [10:49:12] (03CR) 10Volans: "Thanks for the review" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167898 (owner: 10Volans) [10:49:21] (03PS2) 10Volans: Data Persistence: simplify Phabricator usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1167898 [10:49:36] !log btullis@dns1004 END - running authdns-update [10:51:04] (03PS1) 10Effie Mouzeli: 
profile::hcaptcha::proxy: include nginx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1170117 (https://phabricator.wikimedia.org/T399211) [10:51:43] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1170117 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [11:00:05] mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250716T1100). [11:06:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T399249)', diff saved to https://phabricator.wikimedia.org/P79218 and previous config saved to /var/cache/conftool/dbconfig/20250716-110610-marostegui.json [11:06:15] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [11:09:25] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:14:25] 10ops-codfw, 06SRE, 06DC-Ops: Arelion IC-374549 100G Transport outage (cr1-codfw -> cr1-eqiad) July 2025 - https://phabricator.wikimedia.org/T399097#11008740 (10cmooney) Still not fixed :worried: ` 7/16/2025 9:03:59 AM We have identified higher level issue that is impacting additional services. At this tim... 
[11:15:57] (03PS1) 10Muehlenhoff: Stop applying maps-admins to maps Bookworm roles [puppet] - 10https://gerrit.wikimedia.org/r/1170124 (https://phabricator.wikimedia.org/T381565) [11:21:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P79219 and previous config saved to /var/cache/conftool/dbconfig/20250716-112117-marostegui.json [11:21:53] (03PS1) 10Effie Mouzeli: profile::hcaptcha::proxy: stream error logs [puppet] - 10https://gerrit.wikimedia.org/r/1170125 (https://phabricator.wikimedia.org/T399211) [11:23:41] (03PS2) 10Muehlenhoff: Stop applying maps-admins to maps Bookworm roles [puppet] - 10https://gerrit.wikimedia.org/r/1170124 (https://phabricator.wikimedia.org/T381565) [11:25:39] FIRING: CirrusSearchUpdaterKafkaMessagesInTooLow: ... [11:25:39] The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=codfw%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.v1&var-topic=eqiad.cirrussearch.update_pipeline.update.v1&viewPanel=6 - ... 
[11:25:39] https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [11:26:20] (03Abandoned) 10Effie Mouzeli: dsh: remove testservers from scap destinations [puppet] - 10https://gerrit.wikimedia.org/r/1165492 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [11:26:55] (03PS2) 10Effie Mouzeli: dsh: remove testservers from scap destinations [puppet] - 10https://gerrit.wikimedia.org/r/1169673 (https://phabricator.wikimedia.org/T397498) [11:27:17] (03PS3) 10Effie Mouzeli: dsh: remove testservers from scap destinations 1 [puppet] - 10https://gerrit.wikimedia.org/r/1169673 (https://phabricator.wikimedia.org/T397498) [11:29:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [11:30:39] FIRING: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [11:30:44] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [11:34:45] FIRING: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [11:36:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P79220 and previous config saved to /var/cache/conftool/dbconfig/20250716-113624-marostegui.json [11:40:45] FIRING: [2x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-search@codfw is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [11:40:49] FIRING: [3x] CirrusStreamingUpdaterSetWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [11:44:49] (03PS1) 10Marostegui: db2190: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170129 (https://phabricator.wikimedia.org/T399548) [11:45:29] (03CR) 10Marostegui: [C:03+2] db2190: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170129 (https://phabricator.wikimedia.org/T399548) (owner: 10Marostegui) [11:45:45] RESOLVED: [3x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [11:45:55] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2190.codfw.wmnet with reason: Maintenance [11:46:34] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2190.codfw.wmnet with reason: Maintenance [11:46:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2190 for migration to mariadb 10.11', diff saved to 
https://phabricator.wikimedia.org/P79221 and previous config saved to /var/cache/conftool/dbconfig/20250716-114637-marostegui.json [11:47:15] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2190.codfw.wmnet with reason: Maintenance [11:49:16] !log pfischer@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:49:50] !log pfischer@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:51:06] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM!thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170092 (https://phabricator.wikimedia.org/T395668) (owner: 10Gkyziridis) [11:51:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T399249)', diff saved to https://phabricator.wikimedia.org/P79222 and previous config saved to /var/cache/conftool/dbconfig/20250716-115131-marostegui.json [11:51:36] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [11:51:47] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1216.eqiad.wmnet with reason: Maintenance [11:58:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79223 and previous config saved to /var/cache/conftool/dbconfig/20250716-115816-root.json [11:59:25] FIRING: [3x] ProbeDown: Service aqs1012-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:02:45] FIRING: [2x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [12:07:45] FIRING: [3x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [12:08:28] (03PS1) 10Marostegui: db1157: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170137 (https://phabricator.wikimedia.org/T399548) [12:09:04] (03CR) 10Marostegui: [C:03+2] db1157: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1170137 (https://phabricator.wikimedia.org/T399548) (owner: 10Marostegui) [12:09:57] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1157.eqiad.wmnet with reason: Maintenance [12:10:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1157 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79224 and previous config saved to /var/cache/conftool/dbconfig/20250716-121000-marostegui.json [12:12:45] RESOLVED: [3x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [12:13:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79225 and previous config saved to /var/cache/conftool/dbconfig/20250716-121322-root.json [12:16:34] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [12:16:42] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:18:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:18:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1230.eqiad.wmnet with reason: Maintenance [12:19:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T399249)', diff saved to https://phabricator.wikimedia.org/P79226 and previous config saved to /var/cache/conftool/dbconfig/20250716-121900-marostegui.json [12:19:04] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [12:20:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79227 and previous config saved to /var/cache/conftool/dbconfig/20250716-122034-root.json [12:20:58] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-07-09-124522 to 2025-07-15-225151 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170138 (https://phabricator.wikimedia.org/T397674) [12:22:01] (03CR) 10Jforrester: [C:03+1] multiversion: Fix "Class Wikimedia\MWConfig\Exception not found" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169737 (owner: 10Krinkle) [12:22:24] (03PS2) 10Jforrester: multiversion: Fix "Class Wikimedia\MWConfig\Exception not found" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169737 (owner: 10Krinkle) [12:23:16] PROBLEM - MinIO server processes on backup1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name minio, args server https://wikitech.wikimedia.org/wiki/Media_storage/Backups [12:23:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:24:10] the minio is 
me, I will extend the downtime [12:24:15] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:24:16] (03CR) 10Jforrester: "Thanks! For future concerns, you don't need to do these as three patches any more, as scap-to-k8s is now atomic." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169775 (https://phabricator.wikimedia.org/T399636) (owner: 10Zabe) [12:24:25] disk rebuild is at 40% [12:25:48] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on backup1007.eqiad.wmnet with reason: Stop minio [12:25:55] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399671#11009010 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=47d2c76f-a36a-4674-8e51-a03fec07aaf0) set by jynus@cumin1003 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Stop... 
[12:26:08] !log pfischer@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:26:13] !log pfischer@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:28:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79228 and previous config saved to /var/cache/conftool/dbconfig/20250716-122827-root.json [12:28:45] FIRING: [3x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [12:31:04] (03CR) 10Elukey: [C:03+2] team-ml: use global deploy tag for ORESFetchScoreJobKafkaLag alert [alerts] - 10https://gerrit.wikimedia.org/r/1170107 (https://phabricator.wikimedia.org/T399683) (owner: 10Kevin Bazira) [12:32:33] (03Merged) 10jenkins-bot: team-ml: use global deploy tag for ORESFetchScoreJobKafkaLag alert [alerts] - 10https://gerrit.wikimedia.org/r/1170107 (https://phabricator.wikimedia.org/T399683) (owner: 10Kevin Bazira) [12:32:44] (03CR) 10Effie Mouzeli: [C:03+1] shellbox: bump image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1169752 (owner: 10Scott French) [12:33:02] (03PS1) 10Btullis: Bump hive metastore parallelism in the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1170139 (https://phabricator.wikimedia.org/T399711) [12:34:11] (03CR) 10Effie Mouzeli: [C:03+1] httpd: Rebase on bookworm and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1162030 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [12:34:40] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1170139 
(https://phabricator.wikimedia.org/T399711) (owner: 10Btullis) [12:34:45] FIRING: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [12:35:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79229 and previous config saved to /var/cache/conftool/dbconfig/20250716-123540-root.json [12:36:17] !log pfischer@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:36:28] !log pfischer@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:36:35] !log pfischer@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:36:46] !log pfischer@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:38:44] (03PS1) 10Vgutierrez: cache::haproxy: Use trusted_request on silent-drops and bwlimits [puppet] - 10https://gerrit.wikimedia.org/r/1170141 (https://phabricator.wikimedia.org/T399058) [12:40:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [12:43:04] !log pfischer@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:43:12] !log pfischer@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply 
[12:43:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79230 and previous config saved to /var/cache/conftool/dbconfig/20250716-124332-root.json [12:43:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T399249)', diff saved to https://phabricator.wikimedia.org/P79231 and previous config saved to /var/cache/conftool/dbconfig/20250716-124333-marostegui.json [12:43:38] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [12:45:51] (03PS2) 10Vgutierrez: cache::haproxy: Use trusted_request on silent-drops and bwlimits [puppet] - 10https://gerrit.wikimedia.org/r/1170141 (https://phabricator.wikimedia.org/T399058) [12:46:55] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1170141 (https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez) [12:48:36] (03CR) 10Aqu: [C:03+1] Bump hive metastore parallelism in the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1170139 (https://phabricator.wikimedia.org/T399711) (owner: 10Btullis) [12:50:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79232 and previous config saved to /var/cache/conftool/dbconfig/20250716-125045-root.json [12:55:51] (03CR) 10Fabfur: [C:03+1] cache::haproxy: Use trusted_request on silent-drops and bwlimits [puppet] - 10https://gerrit.wikimedia.org/r/1170141 (https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez) [12:56:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [12:58:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P79234 and previous config saved to /var/cache/conftool/dbconfig/20250716-125840-marostegui.json [12:58:45] RESOLVED: [3x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [13:00:05] Urbanecm and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250716T1300). [13:00:05] Hide_on_rosie: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:05:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79235 and previous config saved to /var/cache/conftool/dbconfig/20250716-130551-root.json [13:06:12] (03CR) 10Dreamrimmer: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509) [13:06:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [13:09:45] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... 
[13:09:50] CirrusSearch consumer-search@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [13:12:32] (03PS5) 10Alexandros Kosiaris: maps: Add tegola user in DB, mark tilerator for removal [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565) [13:13:22] (03CR) 10Alexandros Kosiaris: "Rebase fail on my side on PS4, fixed in PS5." [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris) [13:13:25] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris) [13:13:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P79236 and previous config saved to /var/cache/conftool/dbconfig/20250716-131347-marostegui.json [13:14:45] FIRING: [3x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [13:15:01] (03PS1) 10Bking: cirrus-streaming-updater-producer: bump up taskManager memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170144 [13:15:26] (03PS1) 10Btullis: Bump the maximum pod size in the cirrus-streaming-updater namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170145 (https://phabricator.wikimedia.org/T399721) [13:16:35] (03CR) 10Ssingh: [V:03+1] hiera: 
service.yaml: use better aliasing for text/upload (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1168192 (owner: 10Ssingh) [13:17:29] (03CR) 10Elukey: [C:03+1] "I liked it thanks a lot!" [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris) [13:17:51] (03CR) 10Bking: [C:03+2] "self-merging, as the service is hard down at the moment" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170144 (owner: 10Bking) [13:18:30] 06SRE, 10vm-requests: eqiad: VMs requested for Data Persistence automation and testbeds - https://phabricator.wikimedia.org/T390087#11009220 (10FCeratto-WMF) a:05akosiaris→03None [13:19:10] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris) [13:19:26] !log bking@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:19:31] !log bking@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:23:49] (03CR) 10Bking: [C:03+2] Bump the maximum pod size in the cirrus-streaming-updater namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170145 (https://phabricator.wikimedia.org/T399721) (owner: 10Btullis) [13:24:08] FIRING: [5x] ProbeDown: Service aqs1012-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:24:33] TheresNoTime: Sorry for ping but can you please have a look on https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250716T1300 [13:25:45] (My WikimediaDebug is ready) [13:26:05] !log bking@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:26:11] !log bking@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[13:26:56] !log bking@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [13:27:41] (03CR) 10Zoe: [C:03+2] "Ah heck, sorry, my bad. Absolutely disengaged my brain for that one." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164179 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz) [13:27:43] (03PS12) 10Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) [13:27:54] (03CR) 10Alexandros Kosiaris: [C:03+2] maps: Add tegola user in DB, mark tilerator for removal [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris) [13:28:08] (03CR) 10Alexandros Kosiaris: [C:03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris) [13:28:12] (03PS2) 10Tiziano Fogli: prom/metamonitor: make physical vhosts agnostic to the machine hostname [puppet] - 10https://gerrit.wikimedia.org/r/1170104 (https://phabricator.wikimedia.org/T397003) [13:28:39] (03CR) 10CI reject: [V:04-1] prom/metamonitor: make physical vhosts agnostic to the machine hostname [puppet] - 10https://gerrit.wikimedia.org/r/1170104 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [13:28:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T399249)', diff saved to https://phabricator.wikimedia.org/P79237 and previous config saved to /var/cache/conftool/dbconfig/20250716-132854-marostegui.json [13:28:58] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [13:29:07] !log bking@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[13:29:09] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[13:29:43] !log bking@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[13:29:58] (PS13) Federico Ceratto: Add parsercache pooling/depooling cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389)
[13:30:13] (PS1) Peter Fischer: Increase producer task manager memory to work around prometheus OOO once more [deployment-charts] - https://gerrit.wikimedia.org/r/1170146 (https://phabricator.wikimedia.org/T399721)
[13:31:07] (CR) Federico Ceratto: "Updated as discussed" [cookbooks] - https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: Federico Ceratto)
[13:31:17] (CR) Btullis: [C:+2] Increase producer task manager memory to work around prometheus OOO once more [deployment-charts] - https://gerrit.wikimedia.org/r/1170146 (https://phabricator.wikimedia.org/T399721) (owner: Peter Fischer)
[13:31:18] (CR) Peter Fischer: [C:+2] Increase producer task manager memory to work around prometheus OOO once more [deployment-charts] - https://gerrit.wikimedia.org/r/1170146 (https://phabricator.wikimedia.org/T399721) (owner: Peter Fischer)
[13:32:03] !log bking@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[13:32:14] !log bking@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[13:32:25] (PS3) Tiziano Fogli: prom/metamonitor: make physical vhosts agnostic to the machine hostname [puppet] - https://gerrit.wikimedia.org/r/1170104 (https://phabricator.wikimedia.org/T397003)
[13:32:40] !log bking@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[13:33:13] (Merged) jenkins-bot: Increase producer task manager memory to work around prometheus OOO once more [deployment-charts] - https://gerrit.wikimedia.org/r/1170146 (https://phabricator.wikimedia.org/T399721) (owner: Peter Fischer)
[13:33:24] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[13:33:31] !log bking@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[13:35:37] !log bking@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[13:35:55] (CR) Marostegui: Add parsercache pooling/depooling cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: Federico Ceratto)
[13:36:04] (PS2) Hashar: Revert "gerrit: lower connections to Gitiles from 25 to 4" [puppet] - https://gerrit.wikimedia.org/r/1143081 (https://phabricator.wikimedia.org/T392467) (owner: Jelto)
[13:36:24] (CR) CI reject: [V:-1] Add parsercache pooling/depooling cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: Federico Ceratto)
[13:36:24] (PS4) Filippo Giunchedi: prom/metamonitor: make physical vhosts agnostic to the machine hostname [puppet] - https://gerrit.wikimedia.org/r/1170104 (https://phabricator.wikimedia.org/T397003) (owner: Tiziano Fogli)
[13:36:38] (CR) Filippo Giunchedi: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1170104 (https://phabricator.wikimedia.org/T397003) (owner: Tiziano Fogli)
[13:36:58] !log bking@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[13:36:59] (CR) Vgutierrez: [C:+2] cache::haproxy: Use trusted_request on silent-drops and bwlimits [puppet] - https://gerrit.wikimedia.org/r/1170141 (https://phabricator.wikimedia.org/T399058) (owner: Vgutierrez)
[13:37:03] !log bking@deploy1003 helmfile [eqiad] DONE
helmfile.d/services/cirrus-streaming-updater: apply
[13:39:05] (CR) Btullis: [V:+1 C:+2] Bump hive metastore parallelism in the test cluster [puppet] - https://gerrit.wikimedia.org/r/1170139 (https://phabricator.wikimedia.org/T399711) (owner: Btullis)
[13:39:21] (CR) Jelto: [C:+2] Revert "gerrit: lower connections to Gitiles from 25 to 4" [puppet] - https://gerrit.wikimedia.org/r/1143081 (https://phabricator.wikimedia.org/T392467) (owner: Jelto)
[13:39:44] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[13:39:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[13:39:55] RESOLVED: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[13:40:00] RESOLVED: [3x] CirrusStreamingUpdaterClearWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow
[13:40:09] (CR) Tiziano Fogli: [C:+2] prom/metamonitor: make physical vhosts agnostic to the machine hostname [puppet] - https://gerrit.wikimedia.org/r/1170104 (https://phabricator.wikimedia.org/T397003) (owner: Tiziano Fogli)
[13:40:10] RESOLVED: CirrusProducerFlinkJobNotRunning:
cirrus_streaming_updater_producer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning
[13:40:15] (CR) Marostegui: "Can we make the cookbook to log something more meaningful than: Cookbook sre.mysql.parsercache" [cookbooks] - https://gerrit.wikimedia.org/r/1165546 (https://phabricator.wikimedia.org/T388389) (owner: Federico Ceratto)
[13:40:36] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[13:40:39] RESOLVED: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow
[13:40:45] RESOLVED: [3x] CirrusStreamingUpdaterSetWeightedTagsTooLow: CirrusSearch consumer-cloudelastic@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow
[13:41:29] (PS1) Elukey: Revert "services: configure tegola in codfw to use maps-test" [deployment-charts] - https://gerrit.wikimedia.org/r/1170148
[13:44:10] (CR) Elukey: [C:+2] Revert "services: configure tegola in codfw to use maps-test" [deployment-charts] - https://gerrit.wikimedia.org/r/1170148 (owner: Elukey)
[13:44:44] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer -
https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[13:48:32] (PS1) Elukey: services: move tegola in staging to maps-test2* [deployment-charts] - https://gerrit.wikimedia.org/r/1170149 (https://phabricator.wikimedia.org/T381565)
[13:49:44] (CR) Tiziano Fogli: [C:+2] prom/metamonitor: add CNAMEs for metamonitoring endpoints [dns] - https://gerrit.wikimedia.org/r/1169668 (https://phabricator.wikimedia.org/T397003) (owner: Tiziano Fogli)
[13:50:14] !log tappof@dns1004 START - running authdns-update
[13:51:11] !log tappof@dns1004 END - running authdns-update
[13:51:53] (PS2) Elukey: services: move tegola in staging to maps-test2* [deployment-charts] - https://gerrit.wikimedia.org/r/1170149 (https://phabricator.wikimedia.org/T381565)
[13:53:19] SRE, vm-requests: eqiad: VMs requested for Data Persistence automation and testbeds - https://phabricator.wikimedia.org/T390087#11009324 (akosiaris) Hi, Thanks for tagging me in this one. This is more #infrastructure-foundations territory these days, so I am adding the relevant people as well for their...
[13:53:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P79238 and previous config saved to /var/cache/conftool/dbconfig/20250716-135352-root.json
[13:54:09] (CR) Alexandros Kosiaris: [C:+1] services: move tegola in staging to maps-test2* [deployment-charts] - https://gerrit.wikimedia.org/r/1170149 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[13:54:57] FIRING: JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:56:34] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance
[13:56:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T399249)', diff saved to https://phabricator.wikimedia.org/P79239 and previous config saved to /var/cache/conftool/dbconfig/20250716-135641-marostegui.json
[13:56:45] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[13:59:16] (CR) Elukey: [C:+2] services: move tegola in staging to maps-test2* [deployment-charts] - https://gerrit.wikimedia.org/r/1170149 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[13:59:25] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[13:59:59] (PS1) Btullis: Bump hive metastore parallelism in the test cluster [puppet] - https://gerrit.wikimedia.org/r/1170152 (https://phabricator.wikimedia.org/T399711)
[14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250716T1400)
[14:00:31]
(PS2) Btullis: Bump hive metastore parallelism in the production cluster [puppet] - https://gerrit.wikimedia.org/r/1170152 (https://phabricator.wikimedia.org/T399711)
[14:00:59] (CR) Jforrester: [C:+2] wikifunctions: Upgrade orchestrator from 2025-07-09-124522 to 2025-07-15-225151 [deployment-charts] - https://gerrit.wikimedia.org/r/1170138 (https://phabricator.wikimedia.org/T397674) (owner: Jforrester)
[14:01:50] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1170152 (https://phabricator.wikimedia.org/T399711) (owner: Btullis)
[14:02:39] (Merged) jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-07-09-124522 to 2025-07-15-225151 [deployment-charts] - https://gerrit.wikimedia.org/r/1170138 (https://phabricator.wikimedia.org/T397674) (owner: Jforrester)
[14:02:45] (CR) Btullis: [C:+2] [analytics][refine]: Stop refining TwoColConflict* legacy EventLogging streams [puppet] - https://gerrit.wikimedia.org/r/1164356 (https://phabricator.wikimedia.org/T397611) (owner: Phuedx)
[14:04:03] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:04:46] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:04:52] (PS1) Elukey: services: fix Tegola's staging postgres config [deployment-charts] - https://gerrit.wikimedia.org/r/1170153 (https://phabricator.wikimedia.org/T381565)
[14:05:15] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:05:43] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:05:51] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:06:07] (CR) Muehlenhoff: [C:+1] "Looks good!"
[deployment-charts] - https://gerrit.wikimedia.org/r/1170153 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[14:06:44] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:06:57] (Abandoned) CDobbins: fix if statement [puppet] - https://gerrit.wikimedia.org/r/1155293 (owner: CDobbins)
[14:07:10] (Abandoned) CDobbins: testing change's effects [puppet] - https://gerrit.wikimedia.org/r/1155296 (owner: CDobbins)
[14:07:12] (CR) Elukey: [C:+2] services: fix Tegola's staging postgres config [deployment-charts] - https://gerrit.wikimedia.org/r/1170153 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[14:08:41] (CR) Alexandros Kosiaris: [C:+1] profile::hcaptcha::proxy: include nginx_exporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1170117 (https://phabricator.wikimedia.org/T399211) (owner: Effie Mouzeli)
[14:08:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P79240 and previous config saved to /var/cache/conftool/dbconfig/20250716-140858-root.json
[14:09:54] (CR) Alexandros Kosiaris: [C:+1] profile::hcaptcha::proxy: stream error logs [puppet] - https://gerrit.wikimedia.org/r/1170125 (https://phabricator.wikimedia.org/T399211) (owner: Effie Mouzeli)
[14:10:42] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync
[14:12:47] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device asw1-b3-magru
[14:13:02] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
[14:14:28] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device asw1-b4-magru
[14:14:43] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b4-magru
[14:15:17] !log gmodena@deploy1003 helmfile [staging] START
helmfile.d/services/mw-page-content-change-enrich: apply
[14:15:27] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:15:40] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-c1-codfw
[14:15:49] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-c1-codfw
[14:15:52] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-c2-codfw
[14:16:01] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-c2-codfw
[14:16:04] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-c3-codfw
[14:16:13] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-c3-codfw
[14:16:16] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-c4-codfw
[14:16:25] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-c4-codfw
[14:16:28] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-c5-codfw
[14:16:37] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-c5-codfw
[14:16:40] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-c6-codfw
[14:16:49] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-c6-codfw
[14:16:51] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-c7-codfw
[14:17:00] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-c7-codfw
[14:17:25] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-d1-codfw
[14:17:34] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-d1-codfw
[14:17:37] !log cmooney@cumin1003 START - Cookbook sre.network.tls for
network device lsw1-d2-codfw
[14:17:46] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-d2-codfw
[14:17:48] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-d3-codfw
[14:17:58] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-d3-codfw
[14:18:01] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-d4-codfw
[14:18:10] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-d4-codfw
[14:18:13] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-d5-codfw
[14:18:14] (PS1) Marostegui: db2194: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1170159 (https://phabricator.wikimedia.org/T399548)
[14:18:22] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-d5-codfw
[14:18:24] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-d6-codfw
[14:18:34] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-d6-codfw
[14:18:36] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-d7-codfw
[14:18:45] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-d7-codfw
[14:18:48] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device lsw1-d8-codfw
[14:18:52] (CR) Marostegui: [C:+2] db2194: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1170159 (https://phabricator.wikimedia.org/T399548) (owner: Marostegui)
[14:18:57] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-d8-codfw
[14:19:00] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:19:04] !log gmodena@deploy1003 helmfile [staging] DONE
helmfile.d/services/mw-page-content-change-enrich: apply
[14:19:08] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:19:35] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr1-magru
[14:19:46] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2194.codfw.wmnet with reason: Maintenance
[14:19:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2194 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79241 and previous config saved to /var/cache/conftool/dbconfig/20250716-141950-marostegui.json
[14:20:03] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-magru
[14:20:05] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device cr2-magru
[14:20:33] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-magru
[14:20:45] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync
[14:21:12] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device ssw1-d1-codfw
[14:21:21] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-d1-codfw
[14:21:28] !log cmooney@cumin1003 START - Cookbook sre.network.tls for network device ssw1-d8-codfw
[14:21:37] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-d8-codfw
[14:24:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P79242 and previous config saved to /var/cache/conftool/dbconfig/20250716-142404-root.json
[14:24:08] RESOLVED: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO -
https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:24:20] (CR) Arnaudb: [C:+2] gerrit: enable monitoring for other instances [puppet] - https://gerrit.wikimedia.org/r/1167857 (https://phabricator.wikimedia.org/T398854) (owner: Arnaudb)
[14:26:38] (CR) Btullis: [V:+1 C:+2] Bump hive metastore parallelism in the production cluster [puppet] - https://gerrit.wikimedia.org/r/1170152 (https://phabricator.wikimedia.org/T399711) (owner: Btullis)
[14:29:08] FIRING: [3x] ProbeDown: Service aqs1012-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:29:25] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[14:29:49] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:29:57] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250716T1400)
[14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250716T1430)
[14:30:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79243 and previous config saved to /var/cache/conftool/dbconfig/20250716-143048-root.json
[14:31:05] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync
[14:31:20] (CR) Cathal Mooney: [C:+1] "lgtm!"
[cookbooks] - https://gerrit.wikimedia.org/r/1161337 (owner: Ayounsi)
[14:32:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 1.199s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:34:08] FIRING: [3x] ProbeDown: Service aqs1012-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:35:26] (CR) Btullis: [C:+1] No longer apply the eventlogging-admins access group to perf and deploy hosts [puppet] - https://gerrit.wikimedia.org/r/1170111 (https://phabricator.wikimedia.org/T238230) (owner: Muehlenhoff)
[14:35:30] (PS2) Effie Mouzeli: profile::hcaptcha::proxy: include nginx_exporter [puppet] - https://gerrit.wikimedia.org/r/1170117 (https://phabricator.wikimedia.org/T399211)
[14:36:17] (CR) Effie Mouzeli: profile::hcaptcha::proxy: include nginx_exporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1170117 (https://phabricator.wikimedia.org/T399211) (owner: Effie Mouzeli)
[14:36:35] (PS1) Btullis: Revert^3 "Fail over hive services to an-coord1004" [dns] - https://gerrit.wikimedia.org/r/1170161
[14:37:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 1.046s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=main -
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:38:08] (CR) Btullis: [C:+2] Revert^3 "Fail over hive services to an-coord1004" [dns] - https://gerrit.wikimedia.org/r/1170161 (owner: Btullis)
[14:38:31] !log btullis@dns1004 START - running authdns-update
[14:39:04] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:39:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P79244 and previous config saved to /var/cache/conftool/dbconfig/20250716-143909-root.json
[14:39:13] (CR) Effie Mouzeli: [C:+2] profile::hcaptcha::proxy: include nginx_exporter [puppet] - https://gerrit.wikimedia.org/r/1170117 (https://phabricator.wikimedia.org/T399211) (owner: Effie Mouzeli)
[14:39:29] (PS2) Effie Mouzeli: profile::hcaptcha::proxy: stream error logs [puppet] - https://gerrit.wikimedia.org/r/1170125 (https://phabricator.wikimedia.org/T399211)
[14:39:36] !log btullis@dns1004 END - running authdns-update
[14:41:11] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync
[14:42:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 20.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:44:49] (PS1) Elukey: services: add missing port to Tegola's staging tcp proxy config [deployment-charts] - https://gerrit.wikimedia.org/r/1170167 (https://phabricator.wikimedia.org/T381565)
[14:45:54] !log
marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79245 and previous config saved to /var/cache/conftool/dbconfig/20250716-144553-root.json
[14:46:13] (CR) Effie Mouzeli: [C:+2] profile::hcaptcha::proxy: stream error logs [puppet] - https://gerrit.wikimedia.org/r/1170125 (https://phabricator.wikimedia.org/T399211) (owner: Effie Mouzeli)
[14:47:26] (CR) Elukey: [C:+2] services: add missing port to Tegola's staging tcp proxy config [deployment-charts] - https://gerrit.wikimedia.org/r/1170167 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[14:48:58] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync
[14:53:10] SRE, Ganeti, Infrastructure-Foundations: Very slow data transfers during migrations affecting ganeti1047/ganeti1048 - https://phabricator.wikimedia.org/T397025#11009659 (MoritzMuehlenhoff) Open→Declined The issue vanished by itself.
[14:54:15] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:54:42] FIRING: [3x] JobUnavailable: Reduced availability for job gerrit-replica in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:55:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:57:48] (CR) Alexandros Kosiaris: [C:+1] "yaml ftw?"
[deployment-charts] - https://gerrit.wikimedia.org/r/1170167 (https://phabricator.wikimedia.org/T381565) (owner: Elukey)
[14:57:55] SRE, vm-requests: eqiad: VMs requested for Data Persistence automation and testbeds - https://phabricator.wikimedia.org/T390087#11009697 (FCeratto-WMF) Hello, we are in the process of discussing the requirements more in details within the team but I think I can anticipate: - We can get away without publi...
[14:58:19] SRE, vm-requests: eqiad: VMs requested for Data Persistence automation and testbeds - https://phabricator.wikimedia.org/T390087#11009699 (FCeratto-WMF)
[14:59:02] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync
[14:59:57] SRE, vm-requests: eqiad: VMs requested for Data Persistence automation and testbeds - https://phabricator.wikimedia.org/T390087#11009700 (akosiaris)
[15:01:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79246 and previous config saved to /var/cache/conftool/dbconfig/20250716-150059-root.json
[15:02:42] !log bking@apt1002 publish wmf-opensearch-search-plugins_1.3.20+8_amd64 to component/opensearch13 bullseye-wikimedia
[15:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:03] SRE, vm-requests: eqiad: VMs requested for Data Persistence automation and testbeds - https://phabricator.wikimedia.org/T390087#11009726 (MoritzMuehlenhoff) Based on free capacity in the rows, best to use these rows (with 1. being the row with most free capacity): eqiad: 1. row B 2. row D and 3. row C...
[15:07:04] SRE, vm-requests: eqiad: VMs requested for Data Persistence automation and testbeds - https://phabricator.wikimedia.org/T390087#11009727 (akosiaris) Cool, thanks. > We don't have strict requirements around the intra DC availability zones. Fair enough. I looked a bit into free capacity in all the row gr...
[15:07:54] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host pc2015
[15:08:04] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc2015
[15:09:25] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[15:09:29] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox
[15:09:42] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit-replica in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:43] (CR) Scott French: "Thanks for the review!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1162030 (https://phabricator.wikimedia.org/T378128) (owner: Scott French)
[15:09:59] (CR) Scott French: [V:+2] "Built and tested locally."
[docker-images/production-images] - https://gerrit.wikimedia.org/r/1162030 (https://phabricator.wikimedia.org/T378128) (owner: Scott French)
[15:10:20] (CR) Scott French: [V:+2 C:+2] httpd: Rebase on bookworm and cascade [docker-images/production-images] - https://gerrit.wikimedia.org/r/1162030 (https://phabricator.wikimedia.org/T378128) (owner: Scott French)
[15:10:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T399249)', diff saved to https://phabricator.wikimedia.org/P79247 and previous config saved to /var/cache/conftool/dbconfig/20250716-151044-marostegui.json
[15:10:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:10:50] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[15:11:15] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:12:02] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:15:28] (PS1) Scott French: shellbox: revert to httpd-fcgi image [deployment-charts] - https://gerrit.wikimedia.org/r/1170173 (https://phabricator.wikimedia.org/T378128)
[15:16:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79248 and previous config saved to /var/cache/conftool/dbconfig/20250716-151604-root.json
[15:17:14] !log gmodena@deploy1003 Started deploy [analytics/refinery@dc1ba0e] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@dc1ba0e3]
[15:17:52] (CR) Effie Mouzeli: [C:+1] shellbox: revert
to httpd-fcgi image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170173 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [15:18:01] !log gmodena@deploy1003 Finished deploy [analytics/refinery@dc1ba0e] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@dc1ba0e3] (duration: 00m 47s) [15:18:36] !log gmodena@deploy1003 Started deploy [analytics/refinery@dc1ba0e]: Regular analytics weekly train [analytics/refinery@dc1ba0e3] [15:18:39] (03PS1) 10Jgiannelos: Configure stream for parser cache change events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170174 (https://phabricator.wikimedia.org/T397072) [15:22:16] !log gmodena@deploy1003 Finished deploy [analytics/refinery@dc1ba0e]: Regular analytics weekly train [analytics/refinery@dc1ba0e3] (duration: 03m 39s) [15:22:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [15:22:42] !log gmodena@deploy1003 Started deploy [analytics/refinery@dc1ba0e] (thin): Regular analytics weekly train THIN [analytics/refinery@dc1ba0e3] [15:23:49] !log gmodena@deploy1003 Finished deploy [analytics/refinery@dc1ba0e] (thin): Regular analytics weekly train THIN [analytics/refinery@dc1ba0e3] (duration: 01m 06s) [15:24:42] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit-replica in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:25:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P79249 and previous config saved to /var/cache/conftool/dbconfig/20250716-152551-marostegui.json [15:26:37] (03PS3) 
10Cwhite: logstash: rename filter-on-templates.rb [puppet] - 10https://gerrit.wikimedia.org/r/1167943 (https://phabricator.wikimedia.org/T234565) [15:26:44] 10SRE-swift-storage, 10MinT, 10LPL Essential (2025 Jul-Sep), 10LPL Projects (MinT for Wikireaders – FY26 WE 3.1.5): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#11009783 (10Dzahn) Hello, I see this ticket is resolved now. I have been watching it... [15:27:25] RESOLVED: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [15:28:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [15:28:44] Deployment mw-api-ext.eqiad.main in mw-api-ext at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-api-ext&var-deployment=mw-api-ext.eqiad.main - ... [15:28:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [15:29:36] (03CR) 10Ottomata: [C:03+1] No longer apply the eventlogging-admins access group to perf and deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/1170111 (https://phabricator.wikimedia.org/T238230) (owner: 10Muehlenhoff) [15:29:54] (03CR) 10Cwhite: [C:03+2] logstash: rename filter-on-templates.rb [puppet] - 10https://gerrit.wikimedia.org/r/1167943 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:34:20] (03CR) 10Ottomata: "Thank you I would appreciate that! I am just back to work after 2 months parental leave and am just catching up on things." 
[puppet] - 10https://gerrit.wikimedia.org/r/1111239 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [15:35:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 998.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:35:53] (03CR) 10Dzahn: [C:03+2] "just needed to add a variant and now the build job builds https://gerrit.wikimedia.org/r/c/operations/container/codesearch/+/1169785" [container/codesearch] - 10https://gerrit.wikimedia.org/r/1167290 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [15:40:36] jouncebot: nowandnext [15:40:36] No deployments scheduled for the next 1 hour(s) and 19 minute(s) [15:40:36] In 1 hour(s) and 19 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250716T1700) [15:40:48] (03CR) 10Scott French: "Thanks for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1169752 (owner: 10Scott French) [15:40:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P79251 and previous config saved to /var/cache/conftool/dbconfig/20250716-154058-marostegui.json [15:44:54] 10ops-eqiad, 06DC-Ops: Alert for device ps1-a6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T399734 (10phaultfinder) 03NEW [15:45:33] (03CR) 10Scott French: [C:03+2] shellbox: bump image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1169752 (owner: 10Scott French) [15:48:37] (03Merged) 10jenkins-bot: shellbox: bump image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1169752 (owner: 10Scott French) [15:54:32] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [15:54:58] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [15:55:04] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [15:55:16] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [15:55:22] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-media: apply [15:55:34] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [15:55:41] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [15:55:56] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [15:56:03] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [15:56:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T399249)', diff saved to https://phabricator.wikimedia.org/P79252 and previous config saved to 
/var/cache/conftool/dbconfig/20250716-155605-marostegui.json [15:56:15] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [15:56:15] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:56:18] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [15:56:21] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2175.codfw.wmnet with reason: Maintenance [15:56:25] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [15:56:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2175 (T399249)', diff saved to https://phabricator.wikimedia.org/P79253 and previous config saved to /var/cache/conftool/dbconfig/20250716-155628-marostegui.json [15:56:49] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [15:58:35] (03CR) 10BCornwall: [C:03+1] hiera: service.yaml: use better aliasing for text/upload (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1168192 (owner: 10Ssingh) [15:59:16] (03CR) 10BCornwall: [C:03+1] wmnet: Update x3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1170099 (https://phabricator.wikimedia.org/T399699) (owner: 10Gerrit maintenance bot) [16:02:26] (03CR) 10Urbanecm: [C:04-1] "for visibility, until this can be clarified" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509) [16:14:08] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:14:42] FIRING: [3x] JobUnavailable: Reduced availability for job gerrit-replica in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:17:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [16:22:30] (03PS1) 10Zabe: Set categorylinks to read new on shwiki and srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170185 (https://phabricator.wikimedia.org/T397912) [16:22:45] FIRING: [2x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [16:32:45] FIRING: [3x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [16:33:06] jouncebot: nowandnext [16:33:06] No deployments scheduled for the next 0 hour(s) and 26 minute(s) [16:33:06] In 0 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250716T1700) [16:33:14] (03CR) 10Zabe: [C:03+2] Set categorylinks to read new on shwiki and srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170185 
(https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [16:34:08] (03Merged) 10jenkins-bot: Set categorylinks to read new on shwiki and srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170185 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [16:36:06] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1170185|Set categorylinks to read new on shwiki and srwiki (T397912)]] [16:36:13] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [16:38:20] !log zabe@deploy1003 zabe: Backport for [[gerrit:1170185|Set categorylinks to read new on shwiki and srwiki (T397912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:38:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [16:38:44] Deployment mw-api-ext.eqiad.main in mw-api-ext at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=mw-api-ext&var-deployment=mw-api-ext.eqiad.main - ... 
[16:38:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [16:39:07] !log zabe@deploy1003 zabe: Continuing with sync [16:40:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext releases routed via main (k8s) 1.415s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:43:11] !log restart corto on alert1002 [16:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:39] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170185|Set categorylinks to read new on shwiki and srwiki (T397912)]] (duration: 08m 33s) [16:44:44] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [16:47:57] (03CR) 10BCornwall: varnish: Implement translation analytics vars (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152114 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [16:49:04] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1168176 (https://phabricator.wikimedia.org/T395446) (owner: 10Jcrespo) [16:49:15] (03CR) 10Tryvix1509: Create "abusefilter" editor user group for Vietnamese Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509) [16:52:17] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [16:54:08] (03CR) 10Elukey: [C:03+1] raid: Do not use the pipe symbol '|' as a separator for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/1168176 (https://phabricator.wikimedia.org/T395446) (owner: 10Jcrespo) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250716T1700) [17:00:33] (03PS1) 10Effie Mouzeli: prometheus::ops add job to scrape hCaptcha proxy metrics [puppet] - 10https://gerrit.wikimedia.org/r/1170186 (https://phabricator.wikimedia.org/T399211) [17:02:23] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [17:05:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T399249)', diff saved to https://phabricator.wikimedia.org/P79254 and previous config saved to /var/cache/conftool/dbconfig/20250716-170530-marostegui.json [17:05:34] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [17:07:14] (03PS1) 10Krinkle: [WIP] beta: redirect misc *.beta.wmflabs.org to *.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1170188 (https://phabricator.wikimedia.org/T289318) [17:13:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[17:13:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [17:15:15] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [17:19:08] (03PS2) 10Krinkle: [WIP] beta: redirect misc *.beta.wmflabs.org to *.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1170188 (https://phabricator.wikimedia.org/T289318) [17:20:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P79255 and previous config saved to /var/cache/conftool/dbconfig/20250716-172037-marostegui.json [17:23:21] (03PS3) 10Krinkle: [WIP] beta: redirect misc *.beta.wmflabs.org to *.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1170188 (https://phabricator.wikimedia.org/T289318) [17:23:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [17:27:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 22.14% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:35:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P79256 and previous config saved to /var/cache/conftool/dbconfig/20250716-173545-marostegui.json [17:39:34] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1006.eqiad.wmnet with OS bookworm [17:43:53] (03PS4) 10Krinkle: [WIP] beta: redirect misc *.beta.wmflabs.org to *.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1170188 (https://phabricator.wikimedia.org/T289318) [17:47:53] !log modify BGP attributes to swing pfw1-codfw.wikimedia.org traffic from cr1-codfw to cr2-codfw T399221 [17:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:57] T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221 [17:48:30] (03PS5) 10Krinkle: [WIP] beta: redirect misc *.beta.wmflabs.org to *.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1170188 (https://phabricator.wikimedia.org/T289318) [17:50:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T399249)', diff saved to https://phabricator.wikimedia.org/P79257 and previous config saved to /var/cache/conftool/dbconfig/20250716-175052-marostegui.json [17:50:57] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [17:51:08] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2189.codfw.wmnet with reason: Maintenance [17:51:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2189 (T399249)', diff saved to https://phabricator.wikimedia.org/P79258 and previous config saved to 
/var/cache/conftool/dbconfig/20250716-175115-marostegui.json [17:52:52] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on cr1-codfw with reason: downtime router before sfp swap [17:53:01] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11010414 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5332ad34-f45a-4d5c-9180-ace7ebb578e8) set by cmooney@cumin1003 for 0:15:00 on 1 host(... [17:58:45] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1006.eqiad.wmnet with reason: host reimage [17:58:57] (03PS6) 10Krinkle: [WIP] beta: redirect misc *.beta.wmflabs.org to *.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1170188 (https://phabricator.wikimedia.org/T289318) [17:59:32] FIRING: ErrorBudgetBurn: search-update-lag codfw - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:00:04] dancy and andre: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250716T1800). [18:00:29] o/ [18:04:08] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1006.eqiad.wmnet with reason: host reimage [18:04:15] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... 
[18:04:20] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [18:06:18] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11010441 (10cmooney) The replacement optic module arrived on site in the past hour and we have replaced it now. I have un-drained the Arelion backhaul circuit fo... [18:08:42] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170194 (https://phabricator.wikimedia.org/T392180) [18:08:43] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.45.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170194 (https://phabricator.wikimedia.org/T392180) (owner: 10TrainBranchBot) [18:08:44] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11010458 (10cmooney) iperf test is also clean: ` cmooney@cp5017:~$ iperf -s -i1 -u -w512k ------------------------------------------------------------ Server list... 
[18:09:32] RESOLVED: ErrorBudgetBurn: search-update-lag codfw - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:09:36] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170194 (https://phabricator.wikimedia.org/T392180) (owner: 10TrainBranchBot) [18:12:21] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11010479 (10cmooney) p:05High→03Medium [18:12:38] dancy: once the dust settles in group1, would it be alright if I sneak in some shellbox updates? (i.e., not mediawiki, but a direct service dependency) [18:13:01] (these are partially done, and got interrupted by an incident earlier) [18:13:33] Sure. [18:14:49] dancy: great, let me know when you're comfortable with that moving forward [18:16:54] 06SRE: let users talk with cortobot in private - https://phabricator.wikimedia.org/T399753 (10Dzahn) 03NEW [18:18:08] 06SRE: let (trusted) users talk with cortobot in private - https://phabricator.wikimedia.org/T399753#11010501 (10Dzahn) [18:18:33] 06SRE: let (trusted) users talk with cortobot in private - https://phabricator.wikimedia.org/T399753#11010505 (10Dzahn) [18:21:05] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.10 refs T392180 [18:21:09] T392180: 1.45.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T392180 [18:21:42] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1006.eqiad.wmnet with OS bookworm [18:22:50] (03PS1) 10Bvibber: API action=chartinfo internal helper for Charts stats [extensions/Chart] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1170197 (https://phabricator.wikimedia.org/T393950) [18:23:10] (03PS1) 10Bvibber: API action=chartinfo internal helper for Charts stats [extensions/Chart] (wmf/1.45.0-wmf.10) - 
10https://gerrit.wikimedia.org/r/1170199 (https://phabricator.wikimedia.org/T393950) [18:26:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/Chart] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1170197 (https://phabricator.wikimedia.org/T393950) (owner: 10Bvibber) [18:26:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/Chart] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170199 (https://phabricator.wikimedia.org/T393950) (owner: 10Bvibber) [18:29:24] swfrench-wmf: You should be good to go [18:29:43] dancy: ack, thanks! [18:33:34] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox: apply [18:34:12] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [18:34:25] FIRING: ProbeDown: Service aqs1012-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#aqs1012-b:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:34:44] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [18:34:59] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [18:35:31] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [18:35:46] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [18:36:18] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [18:36:35] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [18:37:07] !log 
swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [18:37:30] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [18:38:02] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [18:38:38] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [18:39:04] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [18:42:55] (03CR) 10Scott French: [C:03+1] "Sounds good. I got a bit delayed by other issues, but I'll aim to do this soon." [puppet] - 10https://gerrit.wikimedia.org/r/1111239 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [18:45:59] (03CR) 10Dzahn: [C:03+1] "being bold and running with Timo's +1" [puppet] - 10https://gerrit.wikimedia.org/r/1167306 (https://phabricator.wikimedia.org/T119846) (owner: 10Dzahn) [18:49:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T399249)', diff saved to https://phabricator.wikimedia.org/P79259 and previous config saved to /var/cache/conftool/dbconfig/20250716-184927-marostegui.json [18:49:31] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [18:51:48] (03CR) 10Dzahn: [C:03+1] "claime: could I ask for one more redirects.dat deploy? 
(no rush at all)" [puppet] - 10https://gerrit.wikimedia.org/r/1167306 (https://phabricator.wikimedia.org/T119846) (owner: 10Dzahn) [18:52:29] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox: apply [18:53:10] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [18:53:41] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [18:53:58] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [18:54:29] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [18:54:42] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [18:55:14] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [18:55:31] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [18:56:02] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [18:56:25] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [18:56:56] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [18:57:36] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [18:58:02] !log updated all shellbox instances to 2025-07-15-174312 images [18:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:24] FIRING: ErrorBudgetBurn: search-update-lag eqiad - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:03:04] swfrench-wmf: Lemme know when you're done. I need to run an image build command. [19:04:12] dancy: I think you should be good to go - my changes are done, and I'm mainly just monitoring for issues to shake out at the moment. [19:04:28] OK.. 
proceeding [19:04:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P79260 and previous config saved to /var/cache/conftool/dbconfig/20250716-190434-marostegui.json [19:04:40] !log dancy@deploy1003 Started scap build-images: (no justification provided) [19:05:27] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [19:05:30] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [19:05:31] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [19:05:33] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [19:05:34] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [19:05:37] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [19:06:50] !log dancy@deploy1003 build-images aborted: (no justification provided) (duration: 02m 09s) [19:08:32] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [19:08:34] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [19:08:35] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [19:08:37] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [19:08:38] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [19:08:41] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [19:08:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [19:09:25] FIRING: 
PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:14:51] !log dancy@deploy1003 Started scap build-images: Testing T398873 [19:15:05] T398873: Move nightly image build from releases-jenkins to deployment.eqiad.wmnet - https://phabricator.wikimedia.org/T398873 [19:17:44] (03PS1) 10Sbisson: CX Translation::getStatus: Fix method to properly return the status [extensions/ContentTranslation] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170202 (https://phabricator.wikimedia.org/T399732) [19:18:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/ContentTranslation] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170202 (https://phabricator.wikimedia.org/T399732) (owner: 10Sbisson) [19:18:44] (03PS7) 10Krinkle: [WIP] beta: redirect misc *.beta.wmflabs.org to *.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1170188 (https://phabricator.wikimedia.org/T289318) [19:19:25] !log dancy@deploy1003 Finished scap build-images: Testing T398873 (duration: 04m 34s) [19:19:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P79261 and previous config saved to /var/cache/conftool/dbconfig/20250716-191942-marostegui.json [19:21:40] !log dancy@deploy1003 Started scap build-images: testing [19:22:52] !log dancy@deploy1003 Finished scap build-images: testing (duration: 01m 11s) [19:23:18] (03PS1) 10DDesouza: miscweb: bump to latest images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170204 (https://phabricator.wikimedia.org/T398303) [19:25:29] (03CR) 10DDesouza: [C:03+2] miscweb: bump to latest images [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1170204 (https://phabricator.wikimedia.org/T398303) (owner: 10DDesouza) [19:26:18] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [19:26:20] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [19:26:22] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [19:26:23] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [19:26:25] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [19:26:27] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [19:26:54] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [19:26:56] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [19:26:58] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [19:26:59] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [19:27:01] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [19:27:03] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [19:27:12] (03PS4) 10Ssingh: hiera: service.yaml: use better aliasing for text/upload [puppet] - 10https://gerrit.wikimedia.org/r/1168192 [19:27:21] (03CR) 10Scott French: [C:03+2] configcluster.yaml - remove eventlogging from profile::etcd::tlsproxy::acls [puppet] - 10https://gerrit.wikimedia.org/r/1111239 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [19:27:50] (03CR) 10Ssingh: hiera: service.yaml: use better aliasing for text/upload (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1168192 (owner: 10Ssingh) [19:27:51] (03Merged) 10jenkins-bot: miscweb: bump to latest images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170204 (https://phabricator.wikimedia.org/T398303) (owner: 10DDesouza) [19:28:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: 
Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#11010782 (10VRiley-WMF) @elukey I believe we were getting this error due to it not being in a 10 gig rack. I have updated the location of ganeti1053. Will... [19:30:00] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [19:30:15] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [19:30:16] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [19:30:18] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6295/" [puppet] - 10https://gerrit.wikimedia.org/r/1168192 (owner: 10Ssingh) [19:30:24] RESOLVED: ErrorBudgetBurn: search-update-lag eqiad - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:30:32] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [19:30:33] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [19:30:49] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [19:31:11] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [19:31:23] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [19:31:24] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [19:31:38] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [19:31:39] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [19:31:53] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [19:32:50] (03PS8) 10Krinkle: [WIP] beta: redirect misc *.beta.wmflabs.org to *.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1170188 (https://phabricator.wikimedia.org/T289318) [19:34:50] !log marostegui@cumin1002 dbctl 
commit (dc=all): 'Repooling after maintenance db2189 (T399249)', diff saved to https://phabricator.wikimedia.org/P79263 and previous config saved to /var/cache/conftool/dbconfig/20250716-193449-marostegui.json [19:34:54] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [19:35:05] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance [19:47:40] (03PS1) 10DDesouza: Pre-deploy Readers Use Cases Survey v2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170206 (https://phabricator.wikimedia.org/T399736) [19:49:40] (03PS1) 10Ebernhardson: cirrus: configure managed cluster list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170207 [19:50:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170207 (owner: 10Ebernhardson) [19:50:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170206 (https://phabricator.wikimedia.org/T399736) (owner: 10DDesouza) [19:52:11] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [19:52:59] (03PS9) 10Krinkle: beta: redirect misc *.beta.wmflabs.org to *.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1170188 (https://phabricator.wikimedia.org/T289318) [19:56:53] (03PS10) 10Krinkle: beta: redirect misc *.beta.wmflabs.org to *.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1170188 (https://phabricator.wikimedia.org/T289318) [19:58:34] (03PS11) 10Krinkle: beta: redirect misc *.beta.wmflabs.org to *.beta.wmcloud.org [puppet] - 
10https://gerrit.wikimedia.org/r/1170188 (https://phabricator.wikimedia.org/T289318) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250716T2000). Please do the needful. [20:00:05] bvibber, stephanebisson, ebernhardson, and danisztls: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:28] o/ i'm here, can either spiderpig my own or if someone else wants to do a big batch [20:00:46] \o [20:00:53] o/ [20:01:07] mine is super trivial, only affects maint scripts. Can probably ship with whatever [20:01:37] o/ [20:03:33] we want separate deploys for the others? [20:03:41] always nice to test distinctly ;) [20:03:46] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:03:50] or if the pre-deploy doesn't need test i can bundle that [20:04:03] I need to test mine carefully, I'd rather do it on its own [20:04:29] ok i'll do my Chart api and ebernhardson's config change together first, then we move on to the others [20:05:09] thanks [20:05:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [20:06:26] bvibber: mine (the pre-deploy) can be tested but also is trivial and shouldn't affect anything as coverage is set to 0 [20:06:46] ok i'm having some trouble with deps on mine so i'm going to pull it and just do the config first [20:06:55] ah great i'll combine those two [20:07:21] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399671#11010915 (10Jclark-ctr) @jcrespo this 
server is currently out of warranty We might need to purchase replacement drives for this. I can check when i arrive on site tomorrow. [20:07:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170207 (owner: 10Ebernhardson) [20:07:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170206 (https://phabricator.wikimedia.org/T399736) (owner: 10DDesouza) [20:07:51] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on backup1007 - https://phabricator.wikimedia.org/T399671#11010916 (10Jclark-ctr) a:03Jclark-ctr [20:08:22] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1053.eqiad.wmnet with OS bookworm [20:08:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#11010917 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm [20:08:38] (03Merged) 10jenkins-bot: cirrus: configure managed cluster list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170207 (owner: 10Ebernhardson) [20:08:45] (03Merged) 10jenkins-bot: Pre-deploy Readers Use Cases Survey v2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170206 (https://phabricator.wikimedia.org/T399736) (owner: 10DDesouza) [20:09:08] FIRING: [3x] ProbeDown: Service aqs1012-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:09:08] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:09:09] !log bvibber@deploy1003 Started 
scap sync-world: Backport for [[gerrit:1170207|cirrus: configure managed cluster list]], [[gerrit:1170206|Pre-deploy Readers Use Cases Survey v2 (T399736)]] [20:09:13] T399736: Open-ended survey of English Wikipedia readers v2 - https://phabricator.wikimedia.org/T399736 [20:11:33] !log bvibber@deploy1003 ebernhardson, dani, bvibber: Backport for [[gerrit:1170207|cirrus: configure managed cluster list]], [[gerrit:1170206|Pre-deploy Readers Use Cases Survey v2 (T399736)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:11:53] it looks good [20:11:57] FIRING: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:12:13] :/ [20:12:26] !log bvibber@deploy1003 ebernhardson, dani, bvibber: Continuing with sync [20:13:36] ok few more minutes it should be done with this, then i'll do stephanebisson's 1170202 [20:13:45] then i'll fix the deps on mine ;) [20:14:46] (03PS1) 10Krinkle: beta: Remove routing for *.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170208 (https://phabricator.wikimedia.org/T289318) [20:14:57] FIRING: [2x] JobUnavailable: Reduced availability for job gerrit-replica in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:16:10] !incidents [20:16:11] 6474 (UNACKED) ProbeDown sre (103.102.166.224 ip4 text-https:443 probes/service http_text-https_ip4 eqsin) [20:16:15] !ack 474 [20:16:15] Attempt to ack incident 474 failed. 
[20:16:19] !ack 6474 [20:16:19] 6474 (ACKED) ProbeDown sre (103.102.166.224 ip4 text-https:443 probes/service http_text-https_ip4 eqsin) [20:16:28] text-lb eqsin ok [20:16:49] I wonder if this is related to the link we switched back to Arelion [20:16:57] RESOLVED: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:17:34] well, there it goes, you beat me to it [20:18:07] !log ebernhardson@deploy1003 Started deploy [airflow-dags/search@bc4d35c]: push updated rdf-spark-tools 0.3.159 artifact [20:18:12] yeah I am kinda convinced it's related to that [20:18:13] see [20:18:14] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170207|cirrus: configure managed cluster list]], [[gerrit:1170206|Pre-deploy Readers Use Cases Survey v2 (T399736)]] (duration: 09m 04s) [20:18:18] https://grafana.wikimedia.org/goto/chXhVpUHR?orgId=1 [20:18:18] T399736: Open-ended survey of English Wikipedia readers v2 - https://phabricator.wikimedia.org/T399736 [20:18:27] !log ebernhardson@deploy1003 Finished deploy [airflow-dags/search@bc4d35c]: push updated rdf-spark-tools 0.3.159 artifact (duration: 00m 20s) [20:18:46] no other change in traffic pattern [20:18:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170202 (https://phabricator.wikimedia.org/T399732) (owner: 10Sbisson) [20:18:50] thats not good sukhe [20:18:57] couldnt see a pattern [20:19:04] vgutierrez: yeah I am going to revert back to the ulsfo link [20:19:51] in the middle of deployments..was still trying to determine if those should be stopped [20:19:54] please do, sukhe [20:20:45] but if it already resolved before you revert.. 
was it just the switching itself? [20:20:58] (03Merged) 10jenkins-bot: CX Translation::getStatus: Fix method to properly return the status [extensions/ContentTranslation] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170202 (https://phabricator.wikimedia.org/T399732) (owner: 10Sbisson) [20:21:01] mutante: well the lag before that indicates some issues [20:21:08] ats ttfb metrics are awful [20:21:08] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T399734#11010938 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [20:21:20] !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1170202|CX Translation::getStatus: Fix method to properly return the status (T399732)]] [20:21:21] yeah I am going to do it [20:21:24] T399732: CX desktop editor: Draft restoration fails - https://phabricator.wikimedia.org/T399732 [20:21:24] oh, so this is just kafka [20:21:32] https://grafana.wikimedia.org/goto/im1B4tUNg?orgId=1 [20:21:32] starting now, reverting back to via ulsfo [20:21:36] thanks [20:21:43] not just kafka [20:21:58] all traffic between eqsin and codfw [20:22:32] !log drain IC-331929 Arelion eqsin->codfw [20:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:02] !log sukhe@cumin1003:~$ homer cr1-codfw* commit "drain eqsin transport" [20:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:35] !log bvibber@deploy1003 bvibber, sbisson: Backport for [[gerrit:1170202|CX Translation::getStatus: Fix method to properly return the status (T399732)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:23:57] !log sukhe@cumin1003:~$ homer cr3-eqsin* commit "drain codfw transport" [20:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:08] FIRING: [3x] ProbeDown: Service aqs1012-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:25:46] re: aqs1012 -> https://phabricator.wikimedia.org/T396970 [20:25:56] stephanebisson: how's it look on test server? [20:26:24] bvibber looks good, go for it [20:26:27] awesome [20:26:30] !log bvibber@deploy1003 bvibber, sbisson: Continuing with sync [20:26:42] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aqs1012.eqiad.wmnet with reason: T396970 [20:26:46] T396970: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970 [20:27:00] ok while that's running lemme fix my gerrit dependencies [20:27:13] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11010946 (10Dzahn) ` 20:24 <+jinxer-wm> FIRING: [3x] ProbeDown: Service aqs1012-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 20:25 < mutante> re:... [20:27:29] ok. we are back to eqsin -> ulsfo -> codfw [20:27:43] topranks: ^ just as an FYI, we had issues with Arelion link again even though it was stable for a few hours [20:27:54] great, glad that was fixable so quick [20:29:08] FIRING: [3x] ProbeDown: Service aqs1012-b:9042 has failed probes (tcp_cassandra_b_cql_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:29:23] hmmm :( [20:29:30] yeah :) [20:29:38] jinxer-wm: I just downtimed that, wtf [20:29:39] sukhe: thanks for taking care, all ok now? [20:29:47] topranks: yeah, drained it and back to ulsfo so all good [20:30:06] will keep an eye out anyway but was just an FYI for you for tomorrow. run! [20:30:38] ok [20:30:39] ok looks like my dep *is* correct it just didn't grok cross-repo deps correctly? 
:D [20:31:28] we may need to depool eqsin and retry the iperf tests to get some stats [20:31:35] but we can look tomorrow [20:31:38] ok, let's do that tomorrow I guess [20:31:38] yeah [20:31:44] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11010957 (10Dzahn) more alerts for host "Service aqs1012-b:9042". Unclear how that can be properly downtimed. Added the short lived silence in web UI of alerts.wikimedia.org [20:31:58] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170202|CX Translation::getStatus: Fix method to properly return the status (T399732)]] (duration: 10m 38s) [20:32:02] T399732: CX desktop editor: Draft restoration fails - https://phabricator.wikimedia.org/T399732 [20:32:05] both purged and ATS TTFB back to regular levels: https://grafana.wikimedia.org/goto/guzSIpUNR?orgId=1 [20:32:10] I will update task on when I think we observed it [20:32:19] stephanebisson: should be all done! enjoy :) [20:32:29] bvibber TY [20:32:29] [20:32:32] :D [20:33:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [extensions/Chart] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1170197 (https://phabricator.wikimedia.org/T393950) (owner: 10Bvibber) [20:33:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [extensions/Chart] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170199 (https://phabricator.wikimedia.org/T393950) (owner: 10Bvibber) [20:33:18] bvibber: thanks! 
[20:33:28] :) [20:33:59] (03PS1) 10DDesouza: miscweb(research-landing-page): bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170210 (https://phabricator.wikimedia.org/T219903) [20:34:02] (03Merged) 10jenkins-bot: API action=chartinfo internal helper for Charts stats [extensions/Chart] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1170197 (https://phabricator.wikimedia.org/T393950) (owner: 10Bvibber) [20:34:08] RESOLVED: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:34:22] ^ cool [20:34:42] pheew:) [20:34:48] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11010962 (10ssingh) Things were stable for a few hours even after @cmooney made the fix above but starting ~20:00 UTC, we had a page for text-https in eqsin and a... 
[20:36:15] (03CR) 10DDesouza: [C:03+2] miscweb(research-landing-page): bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170210 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [20:38:30] (03Merged) 10jenkins-bot: miscweb(research-landing-page): bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1170210 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [20:38:42] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [20:38:44] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [20:38:45] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [20:38:47] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [20:38:48] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [20:38:51] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [20:39:02] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [20:39:16] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [20:39:18] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [20:39:33] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [20:39:35] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [20:39:52] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [20:40:57] (03Merged) 10jenkins-bot: API action=chartinfo internal helper for Charts stats [extensions/Chart] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1170199 (https://phabricator.wikimedia.org/T393950) (owner: 10Bvibber) [20:41:20] !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1170197|API action=chartinfo internal helper for Charts stats (T393950)]], [[gerrit:1170199|API action=chartinfo internal helper for Charts stats (T393950)]] 
[20:41:25] T393950: Metrics for when new charts are created and embedded - https://phabricator.wikimedia.org/T393950 [20:51:25] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2207.codfw.wmnet with reason: Maintenance [20:51:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2207 (T399249)', diff saved to https://phabricator.wikimedia.org/P79265 and previous config saved to /var/cache/conftool/dbconfig/20250716-205132-marostegui.json [20:51:36] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250716T2100) [21:09:23] !log bvibber@deploy1003 bvibber: Backport for [[gerrit:1170197|API action=chartinfo internal helper for Charts stats (T393950)]], [[gerrit:1170199|API action=chartinfo internal helper for Charts stats (T393950)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:09:28] T393950: Metrics for when new charts are created and embedded - https://phabricator.wikimedia.org/T393950 [21:10:05] looks good :D [21:10:16] !log bvibber@deploy1003 bvibber: Continuing with sync [21:14:26] RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Wed 13 Aug 2025 08:40:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [21:18:22] mutante: apologies about the aqs alerts, I thought I had those silenced/downtimed [21:18:48] they're like cockroaches, they're tough to get rid of... 
[21:24:05] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170197|API action=chartinfo internal helper for Charts stats (T393950)]], [[gerrit:1170199|API action=chartinfo internal helper for Charts stats (T393950)]] (duration: 42m 44s) [21:24:09] T393950: Metrics for when new charts are created and embedded - https://phabricator.wikimedia.org/T393950 [21:24:13] whee [21:24:22] they sure do take longer when you changed an i18n message haha [21:28:35] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1053.eqiad.wmnet with OS bookworm [21:28:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#11011102 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm executed... [21:35:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:38:44] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ganeti1053.eqiad.wmnet with OS bookworm [21:38:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#11011153 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm [21:43:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:48:10] RESOLVED: BFDdown: BFD 
session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:48:49] (03PS2) 10Scott French: php8.3: initial release of 8.3 image stack [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1165153 (https://phabricator.wikimedia.org/T398246) [21:48:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T399249)', diff saved to https://phabricator.wikimedia.org/P79266 and previous config saved to /var/cache/conftool/dbconfig/20250716-214857-marostegui.json [21:49:04] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [21:58:49] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250716T2200) [22:01:17] (03CR) 10Scott French: [V:03+2] "Built locally with docker-pkg." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1165153 (https://phabricator.wikimedia.org/T398246) (owner: 10Scott French) [22:04:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P79267 and previous config saved to /var/cache/conftool/dbconfig/20250716-220405-marostegui.json [22:04:28] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#11011234 (10cmooney) I've updated the ticket with Arelion to advise we have been able to replace the optic, and despite the apparent improvement at first we still... 
[22:05:35] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [22:09:08] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:11:29] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1170186 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [22:19:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P79268 and previous config saved to /var/cache/conftool/dbconfig/20250716-221912-marostegui.json [22:21:51] 10ops-codfw, 06Data-Platform-SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778 (10RobH) 03NEW [22:22:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:23:08] 10ops-codfw, 06Data-Platform-SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11011288 (10RobH) a:03bking @bking or @btullis, Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operat... 
[22:23:25] 10ops-codfw, 06Data-Platform-SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11011293 (10RobH) [22:24:28] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779 (10RobH) 03NEW [22:24:55] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11011312 (10RobH) [22:25:33] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11011313 (10RobH) a:03bking @bking or @btullis, Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operat... [22:26:14] (03CR) 10Scott French: "Neat! I've never seen this class-based target enumeration before. In any case, this looks good as long as you wire the nginx job definitio" [puppet] - 10https://gerrit.wikimedia.org/r/1170186 (https://phabricator.wikimedia.org/T399211) (owner: 10Effie Mouzeli) [22:28:31] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1053.eqiad.wmnet with reason: host reimage [22:34:12] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1053.eqiad.wmnet with reason: host reimage [22:34:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T399249)', diff saved to https://phabricator.wikimedia.org/P79269 and previous config saved to /var/cache/conftool/dbconfig/20250716-223419-marostegui.json [22:34:24] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [22:34:35] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2225.codfw.wmnet with reason: Maintenance [22:34:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2225 (T399249)', diff saved to 
https://phabricator.wikimedia.org/P79270 and previous config saved to /var/cache/conftool/dbconfig/20250716-223442-marostegui.json [22:39:04] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [22:48:57] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [22:52:02] vriley@cumin1002 reimage (PID 643342) is awaiting input [22:54:08] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:09:25] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:35:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225 (T399249)', diff saved to https://phabricator.wikimedia.org/P79271 and previous config saved to /var/cache/conftool/dbconfig/20250716-233522-marostegui.json [23:35:26] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [23:38:06] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1170218 [23:38:06] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1170218 (owner: 10TrainBranchBot) [23:48:05] (03CR) 10BryanDavis: [C:03+1] "This 
looks like it will work great while we are still using "old style" mwmaint processes in Beta Cluster. When we finally catch up to FY2" [puppet] - 10https://gerrit.wikimedia.org/r/941479 (https://phabricator.wikimedia.org/T357877) (owner: 10Krinkle) [23:50:14] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1170218 (owner: 10TrainBranchBot) [23:50:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2225', diff saved to https://phabricator.wikimedia.org/P79273 and previous config saved to /var/cache/conftool/dbconfig/20250716-235029-marostegui.json