[01:35:00] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [02:02:10] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:36:27] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:12] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:02:10] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:52:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [04:23:25] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:23:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:26:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2165.codfw.wmnet with reason: Maintenance [05:26:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2165.codfw.wmnet with reason: Maintenance [05:27:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:30:14] (03PS1) 10Marostegui: db1193: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1028104 [05:32:57] (03CR) 10Marostegui: [C:03+2] db1193: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1028104 (owner: 10Marostegui) [05:35:00] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [05:42:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:50:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 33 hosts with reason: Primary switchover s8 T363977 [05:50:08] T363977: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T363977 [05:50:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2161 with weight 0 T363977', diff saved to https://phabricator.wikimedia.org/P61868 and previous config saved to /var/cache/conftool/dbconfig/20240506-055013-root.json [05:50:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s8 T363977 [05:51:20] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2161 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1025913 (https://phabricator.wikimedia.org/T363977) (owner: 10Gerrit maintenance bot) [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:07:25] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:49] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:09:51] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:12:57] !log Starting s8 codfw failover from db2165 to db2161 - T363977 [06:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:00] T363977: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T363977 [06:13:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2161 to s8 primary T363977', diff saved to https://phabricator.wikimedia.org/P61869 and previous config saved to /var/cache/conftool/dbconfig/20240506-061311-marostegui.json [06:14:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2165 T363977', diff saved to https://phabricator.wikimedia.org/P61870 and previous config saved to /var/cache/conftool/dbconfig/20240506-061416-root.json [06:16:15] (03PS1) 10Marostegui: db2165: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1028118 [06:16:46] (03CR) 10Marostegui: [C:03+2] db2165: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1028118 (owner: 10Marostegui) [06:17:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2165.codfw.wmnet with OS bookworm [06:17:34] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [06:28:04] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/1026439 (https://phabricator.wikimedia.org/T360439) (owner: 10Muehlenhoff) [06:28:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1193', diff saved to https://phabricator.wikimedia.org/P61871 and previous config saved to /var/cache/conftool/dbconfig/20240506-062814-root.json [06:28:25] (03CR) 10Muehlenhoff: [C:03+2] Remove 
obsolete certs for ldap-corp [puppet] - 10https://gerrit.wikimedia.org/r/1026797 (https://phabricator.wikimedia.org/T323820) (owner: 10Muehlenhoff) [06:29:24] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, I'll merge when the install7001 setup is complete" [homer/public] - 10https://gerrit.wikimedia.org/r/1026945 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [06:30:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1193.eqiad.wmnet with OS bookworm [06:30:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2102.codfw.wmnet with reason: Maintenance [06:30:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2102.codfw.wmnet with reason: Maintenance [06:31:37] (03PS1) 10Marostegui: Revert "db2165: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1027732 [06:35:06] (03PS1) 10Muehlenhoff: Make install7001 an installserver [puppet] - 10https://gerrit.wikimedia.org/r/1028229 (https://phabricator.wikimedia.org/T364016) [06:35:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2165.codfw.wmnet with reason: host reimage [06:37:08] (03CR) 10Muehlenhoff: [C:03+2] Make install7001 an installserver [puppet] - 10https://gerrit.wikimedia.org/r/1028229 (https://phabricator.wikimedia.org/T364016) (owner: 10Muehlenhoff) [06:37:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2165.codfw.wmnet with reason: host reimage [06:41:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2116.codfw.wmnet with reason: Maintenance [06:41:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2116.codfw.wmnet with reason: Maintenance [06:41:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T361627)', diff saved to https://phabricator.wikimedia.org/P61872 and previous config saved to 
/var/cache/conftool/dbconfig/20240506-064121-marostegui.json [06:41:24] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [06:43:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1193.eqiad.wmnet with reason: host reimage [06:46:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1193.eqiad.wmnet with reason: host reimage [06:51:34] (03PS1) 10Muehlenhoff: Add dummy keytab for install7001 [labs/private] - 10https://gerrit.wikimedia.org/r/1028236 (https://phabricator.wikimedia.org/T364016) [06:52:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T361627)', diff saved to https://phabricator.wikimedia.org/P61873 and previous config saved to /var/cache/conftool/dbconfig/20240506-065239-marostegui.json [06:52:44] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [06:53:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61874 and previous config saved to /var/cache/conftool/dbconfig/20240506-065351-root.json [06:53:57] (03CR) 10Marostegui: [C:03+2] Revert "db2165: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1027732 (owner: 10Marostegui) [06:54:18] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Add dummy keytab for install7001 [labs/private] - 10https://gerrit.wikimedia.org/r/1028236 (https://phabricator.wikimedia.org/T364016) (owner: 10Muehlenhoff) [06:56:23] (03PS1) 10Marostegui: Revert "db1193: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1027733 [06:58:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for 
host db2165.codfw.wmnet with OS bookworm [07:00:05] Amir1 and Urbanecm: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240506T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:01:27] FIRING: [4x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:03:53] (03PS1) 10Muehlenhoff: Enable install7001 as webproxy in magru [dns] - 10https://gerrit.wikimedia.org/r/1028245 (https://phabricator.wikimedia.org/T364016) [07:06:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61875 and previous config saved to /var/cache/conftool/dbconfig/20240506-070621-root.json [07:06:23] (03CR) 10Marostegui: [C:03+2] Revert "db1193: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1027733 (owner: 10Marostegui) [07:07:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1193.eqiad.wmnet with OS bookworm [07:07:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P61876 and previous config saved to /var/cache/conftool/dbconfig/20240506-070748-marostegui.json [07:08:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61877 and previous config saved to /var/cache/conftool/dbconfig/20240506-070857-root.json [07:10:01] (03PS1) 10Marostegui: es1040: Remove setup status [puppet] - 10https://gerrit.wikimedia.org/r/1028254 [07:10:51] (03CR) 10Hashar: "I have checked the xff log bucket via `mwlog1002.eqiad.wmnet` in 
`/srv/mw-log/xff.log` and we still have entries for mw-jobrunner.discover" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020277 (owner: 10Hnowlan) [07:10:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1020', diff saved to https://phabricator.wikimedia.org/P61878 and previous config saved to /var/cache/conftool/dbconfig/20240506-071051-root.json [07:10:57] (03CR) 10Marostegui: [C:03+2] es1040: Remove setup status [puppet] - 10https://gerrit.wikimedia.org/r/1028254 (owner: 10Marostegui) [07:11:06] moritzm: ok to merge your patch? [07:12:45] (03CR) 10Muehlenhoff: [C:03+2] Enable install7001 as webproxy in magru [dns] - 10https://gerrit.wikimedia.org/r/1028245 (https://phabricator.wikimedia.org/T364016) (owner: 10Muehlenhoff) [07:13:01] (03PS1) 10Marostegui: es1020: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1028259 [07:13:36] (03CR) 10Marostegui: [C:03+2] es1020: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1028259 (owner: 10Marostegui) [07:13:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1020.eqiad.wmnet with OS bookworm [07:15:01] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:15:09] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:17:49] marostegui: sorry, yes please [07:18:00] moritzm: doing it [07:18:06] thx [07:21:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61879 and previous config saved to /var/cache/conftool/dbconfig/20240506-072127-root.json [07:22:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2116', diff saved to https://phabricator.wikimedia.org/P61880 and previous config saved to /var/cache/conftool/dbconfig/20240506-072255-marostegui.json [07:24:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61881 and previous config saved to /var/cache/conftool/dbconfig/20240506-072403-root.json [07:25:01] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:25:11] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:25:23] PROBLEM - TFTP service on install7001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd [07:28:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1020.eqiad.wmnet with reason: host reimage [07:32:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1020.eqiad.wmnet with reason: host reimage [07:36:02] (03CR) 10JMeybohm: [V:03+1 C:03+1] "Yes. 
It's basically the firewall rule to allow access to the TCP ports" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026441 (https://phabricator.wikimedia.org/T350034) (owner: 10JMeybohm) [07:36:07] (03CR) 10JMeybohm: [V:03+1 C:03+2] wikifunctions: Allow prometheus to scrape metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026441 (https://phabricator.wikimedia.org/T350034) (owner: 10JMeybohm) [07:36:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61882 and previous config saved to /var/cache/conftool/dbconfig/20240506-073633-root.json [07:37:12] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache webproxy on magru recursors [07:37:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) webproxy on magru recursors [07:37:17] (03Merged) 10jenkins-bot: wikifunctions: Allow prometheus to scrape metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026441 (https://phabricator.wikimedia.org/T350034) (owner: 10JMeybohm) [07:38:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T361627)', diff saved to https://phabricator.wikimedia.org/P61883 and previous config saved to /var/cache/conftool/dbconfig/20240506-073803-marostegui.json [07:38:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2130.codfw.wmnet with reason: Maintenance [07:38:08] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [07:38:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2130.codfw.wmnet with reason: Maintenance [07:38:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2130 (T361627)', diff saved to https://phabricator.wikimedia.org/P61884 and previous config saved to 
/var/cache/conftool/dbconfig/20240506-073826-marostegui.json [07:39:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61885 and previous config saved to /var/cache/conftool/dbconfig/20240506-073909-root.json [07:39:35] (03CR) 10Muehlenhoff: [C:03+2] "Looks good. The server is setup now, I'm merging." [puppet] - 10https://gerrit.wikimedia.org/r/1026944 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [07:45:10] (03PS1) 10Marostegui: Revert "es1020: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1027734 [07:49:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T361627)', diff saved to https://phabricator.wikimedia.org/P61886 and previous config saved to /var/cache/conftool/dbconfig/20240506-074945-marostegui.json [07:49:48] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [07:51:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61887 and previous config saved to /var/cache/conftool/dbconfig/20240506-075139-root.json [07:51:55] (03CR) 10JMeybohm: [C:03+1] Remove golang 1.14 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026976 (owner: 10Elukey) [07:52:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:54:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61888 and previous config saved to /var/cache/conftool/dbconfig/20240506-075414-root.json [07:56:41] !log 
marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1020.eqiad.wmnet with OS bookworm [08:00:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1020 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61889 and previous config saved to /var/cache/conftool/dbconfig/20240506-080012-root.json [08:00:21] (03CR) 10Marostegui: [C:03+2] Revert "es1020: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1027734 (owner: 10Marostegui) [08:00:34] (03CR) 10Muehlenhoff: [C:03+1] "golang 1.14 fails to be built because backports has been archived for Buster. Looks good to remove. You can also link to T362518 in the co" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026976 (owner: 10Elukey) [08:01:08] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:01:30] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:02:46] (03PS1) 10Marostegui: es1020: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1028365 (https://phabricator.wikimedia.org/T364289) [08:03:17] (03CR) 10Marostegui: [C:03+2] es1020: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1028365 (https://phabricator.wikimedia.org/T364289) (owner: 10Marostegui) [08:04:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1025 T364289', diff saved to https://phabricator.wikimedia.org/P61890 and previous config saved to /var/cache/conftool/dbconfig/20240506-080423-root.json [08:04:27] T364289: Reimage external store hosts with Bookworm - https://phabricator.wikimedia.org/T364289 [08:04:53] !log marostegui@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P61891 and previous config saved to /var/cache/conftool/dbconfig/20240506-080452-marostegui.json [08:05:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1025.eqiad.wmnet with OS bookworm [08:06:17] (03PS1) 10Marostegui: es1025: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1028366 [08:06:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61892 and previous config saved to /var/cache/conftool/dbconfig/20240506-080645-root.json [08:06:50] 10ops-codfw, 06SRE, 06serviceops: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9773672 (10JMeybohm) >>! In T362938#9764496, @Jhancock.wm wrote: > @JMeybohm papaul helped me identify the missing disk. I replaced it with a compatible drive. please let me know if that fixed the issue. Tha... [08:07:07] (03CR) 10Marostegui: [C:03+2] es1025: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1028366 (owner: 10Marostegui) [08:07:25] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1019866 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [08:09:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61893 and previous config saved to /var/cache/conftool/dbconfig/20240506-080920-root.json [08:15:10] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2127.codfw.wmnet [08:15:12] (03CR) 10Filippo Giunchedi: [C:03+1] Update prometheus config to reflect matomo profile change [puppet] - 10https://gerrit.wikimedia.org/r/1021892 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [08:15:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1020 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61894 and previous config saved to /var/cache/conftool/dbconfig/20240506-081518-root.json [08:16:08] (03PS1) 10Muehlenhoff: Switch db2127 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028368 (https://phabricator.wikimedia.org/T349619) [08:20:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P61895 and previous config saved to /var/cache/conftool/dbconfig/20240506-082000-marostegui.json [08:20:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1025.eqiad.wmnet with reason: host reimage [08:21:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61896 and previous config saved to /var/cache/conftool/dbconfig/20240506-082151-root.json [08:22:16] (03CR) 10Muehlenhoff: [C:03+2] Switch db2127 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028368 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:22:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1025.eqiad.wmnet with 
reason: host reimage [08:24:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2165 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61897 and previous config saved to /var/cache/conftool/dbconfig/20240506-082426-root.json [08:30:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1020 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61898 and previous config saved to /var/cache/conftool/dbconfig/20240506-083024-root.json [08:31:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2127.codfw.wmnet [08:35:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T361627)', diff saved to https://phabricator.wikimedia.org/P61899 and previous config saved to /var/cache/conftool/dbconfig/20240506-083507-marostegui.json [08:35:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2141.codfw.wmnet with reason: Maintenance [08:35:15] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [08:35:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2141.codfw.wmnet with reason: Maintenance [08:36:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61900 and previous config saved to /var/cache/conftool/dbconfig/20240506-083657-root.json [08:38:13] (03PS1) 10Marostegui: Revert "es1025: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1027735 [08:40:58] (03PS59) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [08:44:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2145.codfw.wmnet with 
reason: Maintenance [08:44:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2145.codfw.wmnet with reason: Maintenance [08:44:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T361627)', diff saved to https://phabricator.wikimedia.org/P61901 and previous config saved to /var/cache/conftool/dbconfig/20240506-084422-marostegui.json [08:44:25] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [08:45:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1020 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61902 and previous config saved to /var/cache/conftool/dbconfig/20240506-084530-root.json [08:45:32] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete stub certs [labs/private] - 10https://gerrit.wikimedia.org/r/1026806 (owner: 10Muehlenhoff) [08:48:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1025.eqiad.wmnet with OS bookworm [08:55:34] 10ops-eqiad, 06SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9773787 (10fgiunchedi) Thank you all for looking into this! Generally LGTM, the only thing I would have done differently is to copy the partitioning from existing disks and then add the first partition to the r... 
[08:56:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T361627)', diff saved to https://phabricator.wikimedia.org/P61903 and previous config saved to /var/cache/conftool/dbconfig/20240506-085612-marostegui.json [08:56:17] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [08:57:37] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2149.codfw.wmnet [08:58:45] (03PS1) 10Muehlenhoff: Switch db2149 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028372 (https://phabricator.wikimedia.org/T349619) [09:00:18] 06SRE, 06SRE Observability: confd prom exporter cannot distinguish targets with a common base name - https://phabricator.wikimedia.org/T363924#9773807 (10fgiunchedi) Thank you for the investigation @Scott_French ! That sounds sensible to me and I'm happy to review patches for the o11y bits; on the general conf... 
[09:00:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1020 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61904 and previous config saved to /var/cache/conftool/dbconfig/20240506-090035-root.json [09:04:30] (03CR) 10Muehlenhoff: [C:03+2] Switch db2149 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028372 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:06:56] (03CR) 10Marostegui: [C:03+2] Revert "es1025: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1027735 (owner: 10Marostegui) [09:07:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61905 and previous config saved to /var/cache/conftool/dbconfig/20240506-090736-root.json [09:08:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1178', diff saved to https://phabricator.wikimedia.org/P61906 and previous config saved to /var/cache/conftool/dbconfig/20240506-090759-root.json [09:10:43] (03PS1) 10Marostegui: db1178: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1028376 [09:11:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1178.eqiad.wmnet with OS bookworm [09:11:20] (03CR) 10Marostegui: [C:03+2] db1178: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1028376 (owner: 10Marostegui) [09:11:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P61907 and previous config saved to /var/cache/conftool/dbconfig/20240506-091120-marostegui.json [09:12:35] (03CR) 10Zabe: [C:03+2] "Alright" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027562 (owner: 10Zabe) [09:15:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1020 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61908 and previous config saved to 
/var/cache/conftool/dbconfig/20240506-091541-root.json [09:17:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2149.codfw.wmnet [09:17:35] (03PS1) 10JMeybohm: Bump eventgate-main resources and replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028383 [09:18:40] (03CR) 10JMeybohm: [C:03+1] prometheus: use longer-expiration pki client certs for k8s [puppet] - 10https://gerrit.wikimedia.org/r/1025682 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi) [09:19:47] (03CR) 10JMeybohm: [C:03+2] Bump eventgate-main resources and replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028383 (owner: 10JMeybohm) [09:20:54] (03Merged) 10jenkins-bot: Bump eventgate-main resources and replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028383 (owner: 10JMeybohm) [09:21:17] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [09:22:07] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [09:22:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61909 and previous config saved to /var/cache/conftool/dbconfig/20240506-092244-root.json [09:23:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:23:46] (03PS1) 10DCausse: cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028388 [09:23:50] jouncebot: nowandnext [09:23:50] No deployments scheduled for the next 0 hour(s) and 36 minute(s) [09:23:50] In 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240506T1000) 
[09:24:04] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [09:25:37] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [09:25:40] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [09:25:54] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [09:26:03] (03CR) 10Muehlenhoff: [C:03+2] sites: update installserver for magru [homer/public] - 10https://gerrit.wikimedia.org/r/1026945 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [09:26:07] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [09:26:13] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [09:26:16] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [09:26:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P61910 and previous config saved to /var/cache/conftool/dbconfig/20240506-092627-marostegui.json [09:26:28] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [09:26:32] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [09:26:42] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [09:27:06] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [09:28:54] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [09:29:08] !log uploaded openjdk-8 8u412-ga-1~deb10u1 to buster-wikimedia (forward port of latest Java 8 security updates) [09:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1020 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61911 and previous config 
saved to /var/cache/conftool/dbconfig/20240506-093047-root.json [09:31:14] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028388 (owner: 10DCausse) [09:32:17] (03Merged) 10jenkins-bot: cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028388 (owner: 10DCausse) [09:35:00] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [09:37:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61912 and previous config saved to /var/cache/conftool/dbconfig/20240506-093749-root.json [09:39:09] (03PS2) 10Elukey: Remove golang 1.14 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026976 (https://phabricator.wikimedia.org/T362518) [09:39:18] (03CR) 10Elukey: [V:03+2 C:03+2] Remove golang 1.14 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026976 (https://phabricator.wikimedia.org/T362518) (owner: 10Elukey) [09:39:53] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [09:40:38] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [09:40:45] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:41:04] (03PS3) 10Elukey: Remove golang 1.14 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026976 (https://phabricator.wikimedia.org/T362518) [09:41:35] !log marostegui@cumin1002 dbctl commit (dc=all): 
'Repooling after maintenance db2145 (T361627)', diff saved to https://phabricator.wikimedia.org/P61913 and previous config saved to /var/cache/conftool/dbconfig/20240506-094135-marostegui.json [09:41:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2146.codfw.wmnet with reason: Maintenance [09:41:38] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [09:41:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2146.codfw.wmnet with reason: Maintenance [09:41:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T361627)', diff saved to https://phabricator.wikimedia.org/P61914 and previous config saved to /var/cache/conftool/dbconfig/20240506-094158-marostegui.json [09:42:15] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026976 (https://phabricator.wikimedia.org/T362518) (owner: 10Elukey) [09:42:18] (03PS4) 10Elukey: Remove golang 1.14 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026976 (https://phabricator.wikimedia.org/T362518) [09:42:40] (03CR) 10Elukey: "Forgot the changelog :) Rebased also on weekly rebuilds!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026976 (https://phabricator.wikimedia.org/T362518) (owner: 10Elukey) [09:43:34] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:43:43] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:46:27] 10ops-eqiad, 06DBA: db1178 not booting up - https://phabricator.wikimedia.org/T364300 (10Marostegui) 03NEW [09:48:50] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1027008 (https://phabricator.wikimedia.org/T364173) (owner: 10JHathaway) [09:52:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61915 and previous config saved to /var/cache/conftool/dbconfig/20240506-095255-root.json [09:53:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T361627)', diff saved to https://phabricator.wikimedia.org/P61916 and previous config saved to /var/cache/conftool/dbconfig/20240506-095302-marostegui.json [09:53:05] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240506T1000) [10:00:07] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:02:04] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2177.codfw.wmnet [10:03:28] (03PS1) 10Muehlenhoff: Switch db2177 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028400 
(https://phabricator.wikimedia.org/T349619) [10:06:35] (03CR) 10Muehlenhoff: [C:03+2] Switch db2177 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028400 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:08:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61917 and previous config saved to /var/cache/conftool/dbconfig/20240506-100801-root.json [10:08:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P61918 and previous config saved to /var/cache/conftool/dbconfig/20240506-100809-marostegui.json [10:10:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2177.codfw.wmnet [10:11:24] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2190.codfw.wmnet [10:13:36] (03CR) 10NMW03: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) (owner: 10Dreamrimmer) [10:13:43] (03CR) 10CI reject: [V:04-1] [ruwiki] Limitate the use of the ContentTranslation tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) (owner: 10Dreamrimmer) [10:16:40] (03PS1) 10Muehlenhoff: Switch db2190 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028406 (https://phabricator.wikimedia.org/T349619) [10:19:05] (03PS1) 10Marostegui: es2024: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1028407 [10:19:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2024', diff saved to https://phabricator.wikimedia.org/P61919 and previous config saved to /var/cache/conftool/dbconfig/20240506-101911-root.json [10:19:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Give some weight to es2023', diff saved to https://phabricator.wikimedia.org/P61920 and previous config saved to 
/var/cache/conftool/dbconfig/20240506-101934-marostegui.json [10:19:50] (03CR) 10Marostegui: [C:03+2] es2024: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1028407 (owner: 10Marostegui) [10:21:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es2024.codfw.wmnet with OS bookworm [10:23:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61921 and previous config saved to /var/cache/conftool/dbconfig/20240506-102307-root.json [10:23:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P61922 and previous config saved to /var/cache/conftool/dbconfig/20240506-102317-marostegui.json [10:23:35] (03CR) 10Muehlenhoff: [C:03+2] Switch db2190 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028406 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:31:27] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1178.eqiad.wmnet with OS bookworm [10:36:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2190.codfw.wmnet [10:36:25] (03PS1) 10Marostegui: Revert "es2024: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1027737 [10:38:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61923 and previous config saved to /var/cache/conftool/dbconfig/20240506-103814-root.json [10:38:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T361627)', diff saved to https://phabricator.wikimedia.org/P61924 and previous config saved to /var/cache/conftool/dbconfig/20240506-103825-marostegui.json [10:38:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2153.codfw.wmnet with reason: Maintenance 
[10:38:28] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [10:38:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2153.codfw.wmnet with reason: Maintenance [10:38:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T361627)', diff saved to https://phabricator.wikimedia.org/P61925 and previous config saved to /var/cache/conftool/dbconfig/20240506-103848-marostegui.json [10:42:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2024.codfw.wmnet with reason: host reimage [10:44:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2024.codfw.wmnet with reason: host reimage [10:49:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T361627)', diff saved to https://phabricator.wikimedia.org/P61926 and previous config saved to /var/cache/conftool/dbconfig/20240506-104925-marostegui.json [10:49:29] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [10:55:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:58:45] FIRING: [2x] Not accepting/receiving prefixes from anycast BGP peer: Alert for device asw1-b3-magru.mgmt.magru.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer 
[11:00:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:01:27] FIRING: [4x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:03:18] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2194.codfw.wmnet [11:04:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P61927 and previous config saved to /var/cache/conftool/dbconfig/20240506-110433-marostegui.json [11:04:51] (03PS1) 10Muehlenhoff: Switch db2194 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028427 (https://phabricator.wikimedia.org/T349619) [11:08:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:08:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2024.codfw.wmnet with OS bookworm [11:09:44] (03CR) 10Muehlenhoff: [C:03+2] Switch db2194 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028427 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:11:28] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1026977 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [11:12:15] FIRING: PHPFPMTooBusy: Not enough idle php7.4-fpm.service workers for Mediawiki api_appserver at eqiad 
#page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=api_appserver&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:12:15] FIRING: MediaWikiLatencyExceeded: Average latency high: eqiad api_appserver POST/200: 0.5859348746505147s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatency [11:13:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:16:32] !incidents [11:16:32] 4655 (ACKED) PHPFPMTooBusy api_appserver sre (php7.4-fpm.service eqiad) [11:16:32] 4654 (RESOLVED) PHPFPMTooBusy api_appserver sre (php7.4-fpm.service eqiad) [11:17:15] RESOLVED: PHPFPMTooBusy: Not enough idle php7.4-fpm.service workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=api_appserver&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:17:15] RESOLVED: MediaWikiLatencyExceeded: Average latency high: eqiad api_appserver POST/200: ... 
[11:17:15] 0.9095980530139138s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=api_appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:18:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:19:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P61928 and previous config saved to /var/cache/conftool/dbconfig/20240506-111940-marostegui.json [11:26:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2002.wikimedia.org [11:27:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2194.codfw.wmnet [11:30:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2002.wikimedia.org [11:34:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T361627)', diff saved to https://phabricator.wikimedia.org/P61929 and previous config saved to /var/cache/conftool/dbconfig/20240506-113448-marostegui.json [11:34:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2170.codfw.wmnet with reason: Maintenance [11:34:51] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [11:35:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2170.codfw.wmnet with reason: Maintenance [11:35:12] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Depooling db2170 (T361627)', diff saved to https://phabricator.wikimedia.org/P61930 and previous config saved to /var/cache/conftool/dbconfig/20240506-113511-marostegui.json [11:36:08] (03CR) 10Marostegui: [C:03+2] Revert "es2024: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1027737 (owner: 10Marostegui) [11:36:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61931 and previous config saved to /var/cache/conftool/dbconfig/20240506-113636-root.json [11:41:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [11:43:25] FIRING: [2x] ProbeDown: Service vrts2001:25 has failed probes (tcp_vrts_smtp_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#vrts2001:25 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:45:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T361627)', diff saved to https://phabricator.wikimedia.org/P61932 and previous config saved to /var/cache/conftool/dbconfig/20240506-114459-marostegui.json [11:45:04] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [11:47:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [11:48:20] RESOLVED: [2x] ProbeDown: Service vrts2001:25 has failed probes (tcp_vrts_smtp_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#vrts2001:25 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:51:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 5%: 
Repooling', diff saved to https://phabricator.wikimedia.org/P61933 and previous config saved to /var/cache/conftool/dbconfig/20240506-115142-root.json [11:52:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:58:28] jouncebot: nowandnext [11:58:28] No deployments scheduled for the next 1 hour(s) and 1 minute(s) [11:58:28] In 1 hour(s) and 1 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240506T1300) [11:58:40] (03CR) 10Urbanecm: [C:03+2] iglwiki: Enable GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027237 (https://phabricator.wikimedia.org/T364130) (owner: 10Urbanecm) [11:58:41] (03CR) 10Urbanecm: [C:03+2] Backport several WikimediaMessages patches [extensions/WikimediaMessages] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1027571 (https://phabricator.wikimedia.org/T217451) (owner: 10Urbanecm) [11:59:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027237 (https://phabricator.wikimedia.org/T364130) (owner: 10Urbanecm) [11:59:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/WikimediaMessages] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1027571 (https://phabricator.wikimedia.org/T217451) (owner: 10Urbanecm) [12:00:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P61934 and previous config saved to /var/cache/conftool/dbconfig/20240506-120007-marostegui.json [12:00:44] (03Merged) 10jenkins-bot: iglwiki: Enable GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027237 
(https://phabricator.wikimedia.org/T364130) (owner: 10Urbanecm) [12:06:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61935 and previous config saved to /var/cache/conftool/dbconfig/20240506-120648-root.json [12:10:24] (03CR) 10JMeybohm: [C:03+1] Add node20 production image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026873 (https://phabricator.wikimedia.org/T362681) (owner: 10Muehlenhoff) [12:15:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P61936 and previous config saved to /var/cache/conftool/dbconfig/20240506-121515-marostegui.json [12:16:05] (03CR) 10JMeybohm: [C:03+1] "Perfect, 🚢 it" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1027050 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [12:19:29] (03Merged) 10jenkins-bot: Backport several WikimediaMessages patches [extensions/WikimediaMessages] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1027571 (https://phabricator.wikimedia.org/T217451) (owner: 10Urbanecm) [12:19:37] finally merged [12:19:54] error: cannot open /srv/mediawiki-staging/php-1.43.0-wmf.3/.git/modules/extensions/Math/FETCH_HEAD: Permission denied [12:19:57] this doesn't look good... [12:20:55] (03PS1) 10Filippo Giunchedi: site: add prometheus7001 [puppet] - 10https://gerrit.wikimedia.org/r/1028456 (https://phabricator.wikimedia.org/T364016) [12:21:17] -rw-r--r-- 1 cdanis deployment 20K May 2 19:45 /srv/mediawiki-staging/php-1.43.0-wmf.3/.git/modules/extensions/Math/FETCH_HEAD [12:21:30] cdanis: something (umask?) 
went wrong when you were working with the host [12:21:32] lemme fix that [12:21:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61937 and previous config saved to /var/cache/conftool/dbconfig/20240506-122154-root.json [12:21:56] !log [urbanecm@deploy1002 ~]$ sudo /usr/local/sbin/fix-staging-perms # fixing permissions [12:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:08] ...chmod: missing operand after ‘g+s’ [12:22:08] grr [12:24:57] (03CR) 10Filippo Giunchedi: [C:03+2] site: add prometheus7001 [puppet] - 10https://gerrit.wikimedia.org/r/1028456 (https://phabricator.wikimedia.org/T364016) (owner: 10Filippo Giunchedi) [12:25:42] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1027237|iglwiki: Enable GrowthExperiments (T364130)]], [[gerrit:1027571|Backport several WikimediaMessages patches (T217451 T362538 T364213 T315774 T364269)]] [12:25:57] T364130: Enable Growth mentorship at igl.wikipedia.org - https://phabricator.wikimedia.org/T364130 [12:25:58] T217451: Remove RCFilters Guided Tours - https://phabricator.wikimedia.org/T217451 [12:25:58] T362538: Disable or redirect the feedback link for the IP Info infobox - https://phabricator.wikimedia.org/T362538 [12:25:58] T364213: Remove feedback link on Special:Investigate - https://phabricator.wikimedia.org/T364213 [12:25:59] T315774: WikimediaMessages has no PHPUnit tests - https://phabricator.wikimedia.org/T315774 [12:25:59] T364269: Drop user properties related to RC tours - https://phabricator.wikimedia.org/T364269 [12:26:33] !log filippo@cumin1002 START - Cookbook sre.ganeti.makevm for new host prometheus7001.magru.wmnet [12:26:34] !log filippo@cumin1002 START - Cookbook sre.dns.netbox [12:27:21] tnx godog for prometheus7001! 
[12:27:30] sure np fabfur [12:27:38] !log elukey@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=inference,name=eqiad [12:27:55] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=iglwiki growthexperiments # T364130 [12:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:06] I'll have a look if we need to update some hiera around to make magru hosts point to the correct prom server now (after we confirm all is fine) [12:28:08] !log depool inference-eqiad temporarily to move all services to mw-api-int-ro [12:28:46] !log [urbanecm@mwmaint1002 ~]$ foreachwikiindblist 'wikipedia - closed - private' extensions/WikimediaMaintenance/createExtensionTables.php growthexperiments # to ensure Growth tables are everywhere, cf. Icf99dc23a7) [12:29:53] !log filippo@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7001.magru.wmnet - filippo@cumin1002" [12:30:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T361627)', diff saved to https://phabricator.wikimedia.org/P61938 and previous config saved to /var/cache/conftool/dbconfig/20240506-123022-marostegui.json [12:30:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2173.codfw.wmnet with reason: Maintenance [12:30:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2173.codfw.wmnet with reason: Maintenance [12:30:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2186.codfw.wmnet with reason: Maintenance [12:30:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2186.codfw.wmnet with reason: Maintenance [12:31:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T361627)', diff saved to 
https://phabricator.wikimedia.org/P61939 and previous config saved to /var/cache/conftool/dbconfig/20240506-123102-marostegui.json [12:31:46] !log filippo@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7001.magru.wmnet - filippo@cumin1002" [12:31:46] !log filippo@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:31:46] !log filippo@cumin1002 START - Cookbook sre.dns.wipe-cache prometheus7001.magru.wmnet on all recursors [12:31:49] !log filippo@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7001.magru.wmnet on all recursors [12:32:10] !log filippo@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7001.magru.wmnet - filippo@cumin1002" [12:33:01] !log filippo@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7001.magru.wmnet - filippo@cumin1002" [12:33:27] !log filippo@cumin1002 START - Cookbook sre.hosts.reimage for host prometheus7001.magru.wmnet with OS bullseye [12:35:06] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2205.codfw.wmnet [12:37:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61940 and previous config saved to /var/cache/conftool/dbconfig/20240506-123700-root.json [12:37:18] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1024733 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [12:37:20] (03PS2) 10JMeybohm: New version of base.certificates module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026859 (https://phabricator.wikimedia.org/T362310) [12:37:20] (03PS2) 10JMeybohm: Make 
base.certificates compatible with chart modules and scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026860 (https://phabricator.wikimedia.org/T362310) [12:37:20] (03PS3) 10JMeybohm: New chart from scaffold: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026563 (https://phabricator.wikimedia.org/T362310) [12:37:20] (03PS5) 10JMeybohm: Add new chart: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310) [12:37:44] (03CR) 10JMeybohm: Add new chart: ratelimit (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [12:38:30] (03CR) 10CI reject: [V:04-1] New chart from scaffold: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026563 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [12:40:21] (03CR) 10JMeybohm: Add new chart: ratelimit (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [12:40:40] !incidents [12:40:40] 4655 (RESOLVED) PHPFPMTooBusy api_appserver sre (php7.4-fpm.service eqiad) [12:40:41] 4654 (RESOLVED) PHPFPMTooBusy api_appserver sre (php7.4-fpm.service eqiad) [12:40:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T361627)', diff saved to https://phabricator.wikimedia.org/P61941 and previous config saved to /var/cache/conftool/dbconfig/20240506-124049-marostegui.json [12:41:59] (03PS1) 10Filippo Giunchedi: wmnet: add prometheus.svc.magru [dns] - 10https://gerrit.wikimedia.org/r/1028464 (https://phabricator.wikimedia.org/T364016) [12:43:45] RESOLVED: [2x] Not accepting/receiving prefixes from anycast BGP peer: Device asw1-b3-magru.mgmt.magru.wmnet recovered from Not accepting/receiving prefixes from anycast BGP peer - 
https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [12:44:54] (03Abandoned) 10Ssingh: Revert "Revert "magru: depool geoip/text*"" [dns] - 10https://gerrit.wikimedia.org/r/1026627 (owner: 10Ssingh) [12:44:54] (03PS2) 10Filippo Giunchedi: wmnet: add prometheus.svc.magru [dns] - 10https://gerrit.wikimedia.org/r/1028464 (https://phabricator.wikimedia.org/T364016) [12:45:38] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:1027237|iglwiki: Enable GrowthExperiments (T364130)]], [[gerrit:1027571|Backport several WikimediaMessages patches (T217451 T362538 T364213 T315774 T364269)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:45:46] T364130: Enable Growth mentorship at igl.wikipedia.org - https://phabricator.wikimedia.org/T364130 [12:45:47] T217451: Remove RCFilters Guided Tours - https://phabricator.wikimedia.org/T217451 [12:45:47] T362538: Disable or redirect the feedback link for the IP Info infobox - https://phabricator.wikimedia.org/T362538 [12:45:48] T364213: Remove feedback link on Special:Investigate - https://phabricator.wikimedia.org/T364213 [12:45:48] T315774: WikimediaMessages has no PHPUnit tests - https://phabricator.wikimedia.org/T315774 [12:45:48] T364269: Drop user properties related to RC tours - https://phabricator.wikimedia.org/T364269 [12:46:36] (03PS1) 10Muehlenhoff: Switch db2205 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028465 (https://phabricator.wikimedia.org/T349619) [12:47:14] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1024734 (https://phabricator.wikimedia.org/T317574) (owner: 10JHathaway) [12:48:02] (03CR) 10Muehlenhoff: [C:03+2] Switch db2205 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028465 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:51:11] (03CR) 10Ssingh: [C:03+1] "Thanks!" 
[dns] - 10https://gerrit.wikimedia.org/r/1028464 (https://phabricator.wikimedia.org/T364016) (owner: 10Filippo Giunchedi) [12:51:26] !log urbanecm@deploy1002 Sync cancelled. [12:51:48] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [12:51:53] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1027237|iglwiki: Enable GrowthExperiments (T364130)]], [[gerrit:1027571|Backport several WikimediaMessages patches (T217451 T362538 T364213 T315774 T364269)]] [12:51:58] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [12:51:58] i...did not cancel anything [12:51:59] anyway [12:52:02] third try the charm? [12:52:05] T364130: Enable Growth mentorship at igl.wikipedia.org - https://phabricator.wikimedia.org/T364130 [12:52:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61942 and previous config saved to /var/cache/conftool/dbconfig/20240506-125206-root.json [12:52:07] T217451: Remove RCFilters Guided Tours - https://phabricator.wikimedia.org/T217451 [12:52:07] T362538: Disable or redirect the feedback link for the IP Info infobox - https://phabricator.wikimedia.org/T362538 [12:52:07] T364213: Remove feedback link on Special:Investigate - https://phabricator.wikimedia.org/T364213 [12:52:08] T315774: WikimediaMessages has no PHPUnit tests - https://phabricator.wikimedia.org/T315774 [12:52:08] T364269: Drop user properties related to RC tours - https://phabricator.wikimedia.org/T364269 [12:52:14] (03CR) 10Stevemunene: [C:03+2] Setup kubeconfigs for datahub-next on dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1025792 (https://phabricator.wikimedia.org/T363832) (owner: 10Stevemunene) [12:52:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2205.codfw.wmnet [12:53:20] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[12:54:52] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [12:55:15] (03PS1) 10Ssingh: admin_state: remove geoip depool for magru [dns] - 10https://gerrit.wikimedia.org/r/1028470 [12:55:26] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:55:43] (03CR) 10Ladsgroup: [C:03+1] create wikipedia-it-arbcom.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1025871 (https://phabricator.wikimedia.org/T363825) (owner: 10Dzahn) [12:55:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P61943 and previous config saved to /var/cache/conftool/dbconfig/20240506-125556-marostegui.json [12:56:19] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
[12:56:55] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:1027237|iglwiki: Enable GrowthExperiments (T364130)]], [[gerrit:1027571|Backport several WikimediaMessages patches (T217451 T362538 T364213 T315774 T364269)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:57:05] T364130: Enable Growth mentorship at igl.wikipedia.org - https://phabricator.wikimedia.org/T364130 [12:57:06] T217451: Remove RCFilters Guided Tours - https://phabricator.wikimedia.org/T217451 [12:57:06] T362538: Disable or redirect the feedback link for the IP Info infobox - https://phabricator.wikimedia.org/T362538 [12:57:06] T364213: Remove feedback link on Special:Investigate - https://phabricator.wikimedia.org/T364213 [12:57:07] T315774: WikimediaMessages has no PHPUnit tests - https://phabricator.wikimedia.org/T315774 [12:57:07] T364269: Drop user properties related to RC tours - https://phabricator.wikimedia.org/T364269 [12:57:10] !log urbanecm@deploy1002 urbanecm: Continuing with sync [12:57:37] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2209.codfw.wmnet [12:58:37] (03PS1) 10Muehlenhoff: Switch db2209 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028476 (https://phabricator.wikimedia.org/T349619) [12:58:39] urbanecm: uhh interesting. all I did was `scap backport` [12:58:44] (03CR) 10Ssingh: [C:03+2] admin_state: remove geoip depool for magru [dns] - 10https://gerrit.wikimedia.org/r/1028470 (owner: 10Ssingh) [12:58:54] cdanis: maybe scap backport has weird umask settings somewhere then? [12:58:59] possibly [12:59:06] or my shell configuration is at fault somehow [12:59:28] !log running authdns-update for removing depooling magru geoip/* [12:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240506T1300). [13:00:05] esanders and Superzerocool: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:16] i'll deploy today [13:00:32] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:00:37] (actually, still finishing up my patch) [13:00:52] edsanders: Superzerocool: hello! around? :) [13:01:01] Hi urbanecm :)= [13:01:09] (03PS3) 10Superzerocool: eswiki, commonswiki wikidatawiki: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026691 (https://phabricator.wikimedia.org/T364039) [13:01:40] (03CR) 10Urbanecm: [C:03+2] eswiki, commonswiki wikidatawiki: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026691 (https://phabricator.wikimedia.org/T364039) (owner: 10Superzerocool) [13:01:46] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . [13:01:48] Superzerocool: looks good, i'll deploy that in a moment. [13:02:25] thanks urbanecm :), my first patch after a looong time (before pandemics), so this is new for me... again :) [13:02:31] (03Merged) 10jenkins-bot: eswiki, commonswiki wikidatawiki: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026691 (https://phabricator.wikimedia.org/T364039) (owner: 10Superzerocool) [13:02:40] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[13:02:45] in this case, i don't actually need anything from you, as the patch cannot be tested or something :) [13:02:53] so should be seamless (for you) [13:03:12] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:03:45] edsanders: around? :) [13:03:45] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [13:03:58] yay!, thanks urbanecm... [13:04:24] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [13:05:09] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:05:30] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:07:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61944 and previous config saved to /var/cache/conftool/dbconfig/20240506-130712-root.json [13:07:50] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:09:13] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[13:11:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P61945 and previous config saved to /var/cache/conftool/dbconfig/20240506-131104-marostegui.json [13:12:21] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2271.codfw.wmnet, mw2409.codfw.wmnet, mw2438.codfw.wmnet, mw2331.codfw.wmnet, mw2392.codfw.wmnet, mw2393.codfw.wmnet, mw2338.codfw.wmnet, mw2325.codfw.wmnet, mw2275.codfw.wmnet, mw2408.codfw.wmnet, mw2269.codfw.wmnet, mw2327.codfw.wmnet, mw2433.codfw.wmnet, mw2270.codfw.wmnet, mw2441.codfw.wmnet, mw2274.codfw.wmnet, mw [13:12:21] fw.wmnet, mw2272.codfw.wmnet, mw2307.codfw.wmnet, mw2268.codfw.wmnet, mw2273.codfw.wmnet, mw2276.codfw.wmnet, mw2432.codfw.wmnet, mw2329.codfw.wmnet are marked down but pooled: api-https_443: Servers mw2284.codfw.wmnet, mw2286.codfw.wmnet, mw2328.codfw.wmnet, mw2326.codfw.wmnet, mw2298.codfw.wmnet, mw2288.codfw.wmnet, mw2261.codfw.wmnet, mw2324.codfw.wmnet, mw2283.codfw.wmnet, mw2397.codfw.wmnet, mw2358.codfw.wmnet, mw2330.codfw.wmnet, mw [13:12:21] fw.wmnet, mw2400.codfw.wmnet, mw2405.codfw.wmnet, mw2402.codfw.wmnet, mw2287.codfw.wmnet, mw2403.codfw.wmnet, mw2285.codfw.wmnet, mw2323.codfw.wmnet, mw2398.codfw.wmnet, mw2299.codfw.wm https://wikitech.wikimedia.org/wiki/PyBal [13:12:32] uhh [13:12:45] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1365.eqiad.wmnet, mw1413.eqiad.wmnet, mw1456.eqiad.wmnet, mw1372.eqiad.wmnet, mw1429.eqiad.wmnet, mw1401.eqiad.wmnet, mw1403.eqiad.wmnet, mw1373.eqiad.wmnet, mw1364.eqiad.wmnet, mw1436.eqiad.wmnet are marked down but pooled: api-https_443: Servers mw1358.eqiad.wmnet, mw1447.eqiad.wmnet, mw1426.eqiad.wmnet, mw1489.eqiad [13:12:45] mw1490.eqiad.wmnet, mw1443.eqiad.wmnet, mw1444.eqiad.wmnet, mw1359.eqiad.wmnet, mw1428.eqiad.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [13:12:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:12:54] that does not look good [13:12:56] that looks worrying [13:12:57] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1365.eqiad.wmnet, mw1413.eqiad.wmnet, mw1456.eqiad.wmnet, mw1372.eqiad.wmnet, mw1429.eqiad.wmnet, mw1401.eqiad.wmnet, mw1403.eqiad.wmnet, mw1373.eqiad.wmnet, mw1364.eqiad.wmnet, mw1436.eqiad.wmnet are marked down but pooled: api-https_443: Servers mw1358.eqiad.wmnet, mw1489.eqiad.wmnet, mw1410.eqiad.wmnet, mw1426.eqiad [13:12:57] mw1490.eqiad.wmnet, mw1443.eqiad.wmnet, mw1444.eqiad.wmnet, mw1359.eqiad.wmnet, mw1428.eqiad.wmnet, mw1450.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:13:07] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2271.codfw.wmnet, mw2438.codfw.wmnet, mw2331.codfw.wmnet, mw2392.codfw.wmnet, mw2338.codfw.wmnet, mw2275.codfw.wmnet, mw2269.codfw.wmnet, mw2327.codfw.wmnet, mw2433.codfw.wmnet, mw2270.codfw.wmnet, mw2274.codfw.wmnet, mw2272.codfw.wmnet, mw2307.codfw.wmnet, mw2407.codfw.wmnet, mw2268.codfw.wmnet, mw2273.codfw.wmnet, mw [13:13:07] fw.wmnet, mw2432.codfw.wmnet, mw2329.codfw.wmnet are marked down but pooled: api-https_443: Servers mw2284.codfw.wmnet, mw2286.codfw.wmnet, mw2328.codfw.wmnet, mw2326.codfw.wmnet, mw2298.codfw.wmnet, mw2288.codfw.wmnet, mw2261.codfw.wmnet, mw2283.codfw.wmnet, mw2358.codfw.wmnet, mw2330.codfw.wmnet, mw2402.codfw.wmnet, mw2287.codfw.wmnet, mw2285.codfw.wmnet, mw2404.codfw.wmnet, 
mw2323.codfw.wmnet, mw2398.codfw.wmnet, mw2299.codfw.wmnet, mw [13:13:07] fw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:13:18] er [13:15:21] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:15:29] do we have any issue in rendering the wikis? It is weird that pybal complains and we don't get a page for anything [13:15:35] like latency/50x/etc.. [13:15:42] (03PS1) 10Brouberol: global_config: Only expose the IP of the analytics meta master [puppet] - 10https://gerrit.wikimedia.org/r/1028486 (https://phabricator.wikimedia.org/T361955) [13:15:45] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:15:55] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1027237|iglwiki: Enable GrowthExperiments (T364130)]], [[gerrit:1027571|Backport several WikimediaMessages patches (T217451 T362538 T364213 T315774 T364269)]] (duration: 24m 01s) [13:15:57] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:16:01] 10SRE-swift-storage, 06Commons: Commons: File:Gnome-edit-delete.svg not found - https://phabricator.wikimedia.org/T363995#9774327 (10jcrespo) Here it is the 2 file versions (with the hash it can be checked they are the same files): sha1 | 36eb92a8d41c46e574f4e54ba352ec53211cdfbc {F50516466} s... 
[13:16:07] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:16:08] T364130: Enable Growth mentorship at igl.wikipedia.org - https://phabricator.wikimedia.org/T364130 [13:16:08] T217451: Remove RCFilters Guided Tours - https://phabricator.wikimedia.org/T217451 [13:16:08] T362538: Disable or redirect the feedback link for the IP Info infobox - https://phabricator.wikimedia.org/T362538 [13:16:10] T364213: Remove feedback link on Special:Investigate - https://phabricator.wikimedia.org/T364213 [13:16:12] T315774: WikimediaMessages has no PHPUnit tests - https://phabricator.wikimedia.org/T315774 [13:16:12] T364269: Drop user properties related to RC tours - https://phabricator.wikimedia.org/T364269 [13:16:50] seems like those all were baremetal appservers? (and not mw-on-k8s hosts) [13:17:33] sukhe: taavi: elukey: i assume y'all want to investigate what happened, before trying to run scap backport for something else? [13:17:33] (i'm on the ferry home from the hackathon with rather not-good mobile network connection :/) [13:17:44] urbanecm: yes please :) [13:17:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:17:58] taavi: safe trip then! [13:18:00] * urbanecm waits [13:18:13] sukhe: o/ do you have a min to check what pybal was complaining about? [13:18:36] yeah checking! [13:18:40] <3 [13:18:54] godog I think on magru hieradata we already have prometheus7001 in all required places, don't know if sukhe can confirm, but apparently we don't need to update anything [13:19:07] urbanecm: if you have time can we check together if mediawiki showed any sign of distress? 
50x, high latency, etc.. [13:19:12] also fatals [13:19:21] fabfur: yeah, he is working with that assumption [13:19:44] skimming through the MW logs [13:19:45] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2275/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028486 (https://phabricator.wikimedia.org/T361955) (owner: 10Brouberol) [13:19:56] so I see a ton of errors in a short timeframe, around 13:10 UTC [13:20:08] https://logstash.wikimedia.org/goto/f958a3a7af59c33ac2f2e903b3b2c3cd [13:20:24] matches [13:20:24] May 06 13:10:17 lvs2014 pybal[3504007]: [appservers-https_443] ERROR: Monitoring instance ProxyFetch reports server mw2309.codfw.wmnet (enabled/up/pooled) down: 500 Internal Server Error [13:20:32] (03PS5) 10Dreamrimmer: [ruwiki] Limitate the use of the ContentTranslation tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) [13:21:21] (03CR) 10CI reject: [V:04-1] [ruwiki] Limitate the use of the ContentTranslation tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) (owner: 10Dreamrimmer) [13:21:46] fabfur: nice, thank you [13:21:55] https://grafana.wikimedia.org/goto/mj3dwDLIR?orgId=1 [13:22:22] ok so we had a brief outage :( [13:23:21] taavi: remember that most mw* hosts are actually k8s-ized now [13:23:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:23:48] (03PS1) 10Marostegui: es2021: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1028491 [13:24:06] cdanis: yes, but a) none of the hosts mentioned the alert were kube* hosts and b) i spot-checked a few of them and they seemed to 
all be non-k8s [13:24:19] ah okay fair enough :) [13:24:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2021', diff saved to https://phabricator.wikimedia.org/P61946 and previous config saved to /var/cache/conftool/dbconfig/20240506-132424-root.json [13:24:30] the errors on the hosts also appear to be k8s hosts [13:24:43] (03CR) 10Marostegui: [C:03+2] es2021: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1028491 (owner: 10Marostegui) [13:24:59] and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/1027571 changes constructor signature for a very frequently called thing [13:25:18] oh [13:25:21] that patch explains [13:25:29] so i think the mw errors are my fault, should've deployed the patch in safer way :/ [13:25:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es2021.codfw.wmnet with OS bookworm [13:25:51] whether the pybal stuff is caused by MW stopping to work temporarily is beyond my expertise [13:25:52] (03PS6) 10Dreamrimmer: [ruwiki] Limitate the use of the ContentTranslation tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) [13:25:57] urbanecm: nono don't worry it happens! If we have explained the issue and if we are now in a good/safe position let's resume syncing [13:26:07] so what happened here is that the extension.json change was applied immediately, but the old class was replaced by the new one much later since php-fpm restarts are a separate check [13:26:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T361627)', diff saved to https://phabricator.wikimedia.org/P61947 and previous config saved to /var/cache/conftool/dbconfig/20240506-132612-marostegui.json [13:26:13] yeah... 
[13:26:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2174.codfw.wmnet with reason: Maintenance [13:26:15] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [13:26:18] urbanecm: yep yep basically it tried to healthcheck the appservers that returned 500 [13:26:21] urbanecm: all good! thanks for looking [13:26:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2174.codfw.wmnet with reason: Maintenance [13:26:31] super thanks all [13:26:32] okay, let's resume then [13:26:35] but on mw-on-k8s, it's just replacing the container image, so it's atomic [13:26:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T361627)', diff saved to https://phabricator.wikimedia.org/P61948 and previous config saved to /var/cache/conftool/dbconfig/20240506-132635-marostegui.json [13:26:39] rest of the patches do not have this issue [13:26:41] (03CR) 10CI reject: [V:04-1] [ruwiki] Limitate the use of the ContentTranslation tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) (owner: 10Dreamrimmer) [13:27:23] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1026691|eswiki, commonswiki wikidatawiki: lift IP cap for edit-a-thon (T364039)]] [13:27:33] T364039: Lift IP cap on this dates 14/05; 28/05; 04/06; 11/06; 18/06; 25/06 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T364039 [13:27:48] urbanecm: it happens.. I'm pretty sure I've taken the wikis properly down the same way before mw-on-k8s was a thing at least once [13:27:54] !log elukey@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=inference,name=eqiad [13:28:44] taavi: i used to pay close attention to this, when it affected most complex backports. 
now that extension.json is the only non-atomic thing to be synced, i tend to forget about it more often. [13:29:32] (03CR) 10Dreamrimmer: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) (owner: 10Dreamrimmer) [13:29:49] (03CR) 10Anzx: [ruwiki] Limitate the use of the ContentTranslation tool (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) (owner: 10Dreamrimmer) [13:30:36] 06SRE, 06Machine-Learning-Team, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9774366 (10elukey) And eqiad migrated as well, all done :) [13:30:52] 06SRE, 06Machine-Learning-Team, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9774368 (10elukey) [13:32:34] jouncebot: nowandnext [13:32:34] For the next 0 hour(s) and 27 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240506T1300) [13:32:34] In 1 hour(s) and 57 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240506T1530) [13:33:12] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudbackup1004.eqiad.wmnet with OS bookworm [13:35:00] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [13:35:10] (03PS1) 10DCausse: cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028495 [13:35:56] !log Run `mwscript userOptions.php 
--wiki=testwiki --delete` for "rcenhancedfilters-seen-tour", "wlenhancedfilters-seen-tour", "rcenhancedfilters-tried-highlight", "rcenhancedfilters-seen-highlight-button-counter" (T364269) [13:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:59] T364269: Drop user properties related to RC tours - https://phabricator.wikimedia.org/T364269 [13:37:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T361627)', diff saved to https://phabricator.wikimedia.org/P61949 and previous config saved to /var/cache/conftool/dbconfig/20240506-133728-marostegui.json [13:37:36] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [13:37:40] urbanecm: yes, if you still are [13:37:46] edsanders: yep yep [13:37:56] currently finishing syncing a diff patch [13:38:29] (03PS2) 10Esanders: Release DT visual enhancements to all except Wikipedia/Commons/Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026605 (https://phabricator.wikimedia.org/T352087) [13:38:33] (03CR) 10Urbanecm: [C:03+2] Release DT visual enhancements to all except Wikipedia/Commons/Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026605 (https://phabricator.wikimedia.org/T352087) (owner: 10Esanders) [13:38:41] merging, will deploy shortly [13:39:32] (03Merged) 10jenkins-bot: Release DT visual enhancements to all except Wikipedia/Commons/Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026605 (https://phabricator.wikimedia.org/T352087) (owner: 10Esanders) [13:39:46] PROBLEM - Check whether ferm is active by checking the default input chain on mw1408 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:39:54] PROBLEM - Check whether ferm is active by 
checking the default input chain on mw1451 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:41:12] (03PS29) 10Brouberol: global_config: add elasticearch instances [puppet] - 10https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894) [13:42:51] (03CR) 10Brouberol: "bking: thanks for the review! It took me a while and a fair amount of attempts, but I've managed to write a Puppet function that queries t" [puppet] - 10https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [13:44:38] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1026691|eswiki, commonswiki wikidatawiki: lift IP cap for edit-a-thon (T364039)]] (duration: 17m 14s) [13:44:40] T364039: Lift IP cap on this dates 14/05; 28/05; 04/06; 11/06; 18/06; 25/06 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T364039 [13:44:41] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2279/co" [puppet] - 10https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [13:44:44] finally [13:45:03] edsanders: deploying your patch now [13:45:10] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1026605|Release DT visual enhancements to all except Wikipedia/Commons/Wikidata (T352087)]] [13:45:14] T352087: [MILESTONE] Offer Usability Improvements as default-on feature at Phase 2 wikis (desktop) - https://phabricator.wikimedia.org/T352087 [13:47:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2021.codfw.wmnet with reason: host reimage [13:47:10] (03PS30) 10Brouberol: global_config: add elasticearch instances [puppet] - 10https://gerrit.wikimedia.org/r/1024613 
(https://phabricator.wikimedia.org/T331894) [13:48:19] (03PS31) 10Brouberol: global_config: add elasticearch instances [puppet] - 10https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894) [13:48:47] (03PS32) 10Brouberol: global_config: add elasticearch instances [puppet] - 10https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894) [13:49:34] !log urbanecm@deploy1002 esanders and urbanecm: Backport for [[gerrit:1026605|Release DT visual enhancements to all except Wikipedia/Commons/Wikidata (T352087)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:49:40] edsanders: can you take a look please? [13:50:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2021.codfw.wmnet with reason: host reimage [13:51:12] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1004.eqiad.wmnet with reason: host reimage [13:52:14] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2281/co" [puppet] - 10https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [13:52:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P61950 and previous config saved to /var/cache/conftool/dbconfig/20240506-135238-marostegui.json [13:53:45] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1004.eqiad.wmnet with reason: host reimage [13:54:01] !log filippo@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7001.magru.wmnet with OS bullseye [13:54:01] !log filippo@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7001.magru.wmnet [13:55:15] (03PS1) 10Marostegui: Revert "es2021: Disable 
notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1027742 [13:56:02] (03PS7) 10Dreamrimmer: [ruwiki] Limitate the use of the ContentTranslation tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) [13:57:21] edsanders: how is the testing looking please? [13:59:05] (03PS33) 10Brouberol: global_config: add elasticearch instances [puppet] - 10https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894) [13:59:36] (03PS1) 10Filippo Giunchedi: site: provision prometheus7001 with insetup [puppet] - 10https://gerrit.wikimedia.org/r/1028501 (https://phabricator.wikimedia.org/T364016) [13:59:38] (03PS1) 10Filippo Giunchedi: prometheus: use datacenters for snmp_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1028502 (https://phabricator.wikimedia.org/T364016) [13:59:39] (03PS1) 10Filippo Giunchedi: grafana: add magru prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1028503 (https://phabricator.wikimedia.org/T364016) [13:59:41] (03PS1) 10Filippo Giunchedi: trafficserver: add prometheus-magru.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1028504 (https://phabricator.wikimedia.org/T364016) [14:01:06] edsanders: hello? 
[14:01:19] (03CR) 10Muehlenhoff: [C:03+1] site: provision prometheus7001 with insetup [puppet] - 10https://gerrit.wikimedia.org/r/1028501 (https://phabricator.wikimedia.org/T364016) (owner: 10Filippo Giunchedi) [14:01:25] i am going to revert should there be no response in the next couple of mins [14:01:44] (03CR) 10Filippo Giunchedi: [C:03+2] site: provision prometheus7001 with insetup [puppet] - 10https://gerrit.wikimedia.org/r/1028501 (https://phabricator.wikimedia.org/T364016) (owner: 10Filippo Giunchedi) [14:02:31] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2282/co" [puppet] - 10https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [14:03:15] (03PS2) 10Ssingh: geo-maps: define initial mapping for South America (magru) [dns] - 10https://gerrit.wikimedia.org/r/1025366 (https://phabricator.wikimedia.org/T346722) [14:03:36] (03CR) 10Brouberol: [V:03+1] "Ok, this is _finally_ working as I expect it to be." 
[puppet] - 10https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [14:03:49] (03PS1) 10Vgutierrez: hiera: Enable benthos on ncredir@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1028506 (https://phabricator.wikimedia.org/T362776) [14:04:17] !log filippo@cumin1002 START - Cookbook sre.hosts.decommission for hosts prometheus7001.magru.wmnet [14:04:22] (03PS8) 10Dreamrimmer: [ruwiki] Limitate the use of the ContentTranslation tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) [14:04:57] (03CR) 10Ssingh: geo-maps: define initial mapping for South America (magru) (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1025366 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [14:05:27] (03CR) 10Hashar: [C:04-1] Gerrit: update mail soy templates to match upstream (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1027726 (owner: 10Paladox) [14:06:04] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 5 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1028506 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [14:07:08] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable benthos on ncredir@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1028506 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [14:07:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P61951 and previous config saved to /var/cache/conftool/dbconfig/20240506-140745-marostegui.json [14:08:31] !log filippo@cumin1002 START - Cookbook sre.dns.netbox [14:08:58] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2284/co" [puppet] - 
10https://gerrit.wikimedia.org/r/1028502 (https://phabricator.wikimedia.org/T364016) (owner: 10Filippo Giunchedi) [14:09:47] RECOVERY - Check whether ferm is active by checking the default input chain on mw1408 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:09:55] RECOVERY - Check whether ferm is active by checking the default input chain on mw1451 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:10:00] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC checks out https://puppet-compiler.wmflabs.org/output/1028502/2284/netmon1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1028502 (https://phabricator.wikimedia.org/T364016) (owner: 10Filippo Giunchedi) [14:11:03] !log filippo@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - filippo@cumin1002" [14:11:54] !log filippo@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: prometheus7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - filippo@cumin1002" [14:11:54] !log filippo@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:11:55] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts prometheus7001.magru.wmnet [14:15:46] (03PS3) 10Andrew Bogott: Revert "wmcs VM backups: move all backups to one host" [puppet] - 10https://gerrit.wikimedia.org/r/1023468 (https://phabricator.wikimedia.org/T332400) [14:15:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2021.codfw.wmnet with OS bookworm [14:16:27] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:30] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1004.eqiad.wmnet with OS bookworm [14:17:10] !log filippo@cumin1002 START - Cookbook sre.ganeti.makevm for new host prometheus7001.magru.wmnet [14:17:11] !log filippo@cumin1002 START - Cookbook sre.dns.netbox [14:18:08] 10SRE-swift-storage, 06Commons: Commons: File:Gnome-edit-delete.svg not found - https://phabricator.wikimedia.org/T363995#9774478 (10Urbanecm_WMF) I uploaded both of the recovered file versions under https://commons.wikimedia.org/wiki/File:Gnome-edit-delete.svg to Commons. [14:19:10] !log filippo@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7001.magru.wmnet - filippo@cumin1002" [14:19:48] (03CR) 10Muehlenhoff: [C:03+1] "Approved in this week's SRE IF meeting" [puppet] - 10https://gerrit.wikimedia.org/r/1026194 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [14:20:06] !log filippo@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7001.magru.wmnet - filippo@cumin1002" [14:20:06] !log filippo@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:20:06] !log filippo@cumin1002 START - Cookbook sre.dns.wipe-cache prometheus7001.magru.wmnet on all recursors [14:20:09] !log filippo@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7001.magru.wmnet on all recursors [14:20:12] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Revert "wmcs VM backups: move all backups to one host" [puppet] - 10https://gerrit.wikimedia.org/r/1023468 (https://phabricator.wikimedia.org/T332400) (owner: 10Andrew 
Bogott) [14:20:22] (03CR) 10Andrew Bogott: Revert "wmcs VM backups: move all backups to one host" [puppet] - 10https://gerrit.wikimedia.org/r/1023468 (https://phabricator.wikimedia.org/T332400) (owner: 10Andrew Bogott) [14:20:30] !log filippo@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7001.magru.wmnet - filippo@cumin1002" [14:20:48] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028495 (owner: 10DCausse) [14:21:20] !log filippo@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7001.magru.wmnet - filippo@cumin1002" [14:22:01] (03Merged) 10jenkins-bot: cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028495 (owner: 10DCausse) [14:22:17] (03PS1) 10CDanis: add btop to standard packages for bookworm+ [puppet] - 10https://gerrit.wikimedia.org/r/1028512 [14:22:37] (03CR) 10Brouberol: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1021892 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [14:22:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T361627)', diff saved to https://phabricator.wikimedia.org/P61952 and previous config saved to /var/cache/conftool/dbconfig/20240506-142253-marostegui.json [14:22:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2176.codfw.wmnet with reason: Maintenance [14:22:56] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [14:23:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2176.codfw.wmnet with 
reason: Maintenance [14:23:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T361627)', diff saved to https://phabricator.wikimedia.org/P61953 and previous config saved to /var/cache/conftool/dbconfig/20240506-142316-marostegui.json [14:23:18] (03CR) 10Marostegui: [C:03+2] Revert "es2021: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1027742 (owner: 10Marostegui) [14:23:20] (03PS5) 10Sohom Datta: Remove wmgCollectionArticleNamespaces config for enWS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019423 (https://phabricator.wikimedia.org/T361422) (owner: 10Dreamrimmer) [14:23:24] !log filippo@cumin1002 START - Cookbook sre.hosts.reimage for host prometheus7001.magru.wmnet with OS bullseye [14:23:36] (03PS4) 10Andrew Bogott: Revert "wmcs VM backups: move all backups to one host" [puppet] - 10https://gerrit.wikimedia.org/r/1023468 (https://phabricator.wikimedia.org/T332400) [14:23:39] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:23:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61954 and previous config saved to /var/cache/conftool/dbconfig/20240506-142344-root.json [14:23:55] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:24:44] (03CR) 10Sohom Datta: [C:03+1] [ruwiki] Limitate the use of the ContentTranslation tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) (owner: 10Dreamrimmer) [14:25:32] edsanders: no response => reverting [14:25:35] !log urbanecm@deploy1002 Sync cancelled. 
[14:26:08] (03PS1) 10Urbanecm: Revert "Release DT visual enhancements to all except Wikipedia/Commons/Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027476 [14:26:09] (03CR) 10TrainBranchBot: "urbanecm@deploy1002 created a revert of this change as Ie8b4b2b356837873033038664fd61b2a5681ac73" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026605 (https://phabricator.wikimedia.org/T352087) (owner: 10Esanders) [14:26:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027476 (owner: 10Urbanecm) [14:26:33] (03PS1) 10Marostegui: es2021: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1028513 [14:26:47] (03PS9) 10Sohom Datta: [ruwiki] Limit the use of the ContentTranslation tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) (owner: 10Dreamrimmer) [14:26:53] (03CR) 10CI reject: [V:04-1] es2021: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1028513 (owner: 10Marostegui) [14:27:17] (03Merged) 10jenkins-bot: Revert "Release DT visual enhancements to all except Wikipedia/Commons/Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027476 (owner: 10Urbanecm) [14:27:34] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1027476|Revert "Release DT visual enhancements to all except Wikipedia/Commons/Wikidata"]] [14:28:14] (03PS2) 10Marostegui: es2021: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1028513 (https://phabricator.wikimedia.org/T364289) [14:28:18] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:28:27] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:29:23] (03PS1) 10Ssingh: hiera: remove per-site override for ncredir [puppet] - 10https://gerrit.wikimedia.org/r/1028514 [14:29:33] (03CR) 
10Marostegui: [C:03+2] es2021: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1028513 (https://phabricator.wikimedia.org/T364289) (owner: 10Marostegui) [14:30:31] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2285/console" [puppet] - 10https://gerrit.wikimedia.org/r/1028514 (owner: 10Ssingh) [14:31:50] (03CR) 10Muehlenhoff: [C:03+1] "Sounds good" [puppet] - 10https://gerrit.wikimedia.org/r/1028512 (owner: 10CDanis) [14:32:07] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:1027476|Revert "Release DT visual enhancements to all except Wikipedia/Commons/Wikidata"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:32:14] (03CR) 10Muehlenhoff: [C:03+2] Switch db2209 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1028476 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:32:44] !log urbanecm@deploy1002 urbanecm: Continuing with sync [14:34:14] (03CR) 10CDanis: [C:03+2] add btop to standard packages for bookworm+ [puppet] - 10https://gerrit.wikimedia.org/r/1028512 (owner: 10CDanis) [14:34:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T361627)', diff saved to https://phabricator.wikimedia.org/P61955 and previous config saved to /var/cache/conftool/dbconfig/20240506-143453-marostegui.json [14:34:57] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [14:36:27] FIRING: [6x] JobUnavailable: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:00] (03PS10) 10Sohom Datta: [ruwiki] Limit the use of 
the ContentTranslation tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) (owner: 10Dreamrimmer) [14:38:23] (03PS1) 10Urbanecm: userOptions.php: Actually batch deletion [core] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1027744 (https://phabricator.wikimedia.org/T364311) [14:38:40] (03CR) 10Urbanecm: [C:03+2] userOptions.php: Actually batch deletion [core] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1027744 (https://phabricator.wikimedia.org/T364311) (owner: 10Urbanecm) [14:38:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61956 and previous config saved to /var/cache/conftool/dbconfig/20240506-143850-root.json [14:39:12] 10SRE-swift-storage, 06Commons: Commons: File:Gnome-edit-delete.svg not found - https://phabricator.wikimedia.org/T363995#9774546 (10jcrespo) >>! In T363995#9763970, @MatthewVernon wrote: > This leaves me with very little idea of what happened when. It looks like it ought to be possible to extract the object... 
[14:39:13] PROBLEM - Check whether ferm is active by checking the default input chain on parse1016 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:39:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [14:39:55] (03PS2) 10Ssingh: hiera: remove per-site override for ncredir [puppet] - 10https://gerrit.wikimedia.org/r/1028514 [14:40:21] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1036 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:41:02] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2286/console" [puppet] - 10https://gerrit.wikimedia.org/r/1028514 (owner: 10Ssingh) [14:43:59] (03PS1) 10Elukey: amd-pytorch21: fix the ROCm version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1028519 (https://phabricator.wikimedia.org/T362984) [14:44:35] (03CR) 10Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/images/ ==" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1028519 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey) [14:45:37] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:45:45] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1027476|Revert "Release DT visual enhancements to all except 
Wikipedia/Commons/Wikidata"]] (duration: 18m 11s) [14:45:49] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:46:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1027744 (https://phabricator.wikimedia.org/T364311) (owner: 10Urbanecm) [14:46:50] (03PS1) 10Brouberol: aliases: add datacenter-scoped cumin aliases for flink zk ensembles [puppet] - 10https://gerrit.wikimedia.org/r/1028520 (https://phabricator.wikimedia.org/T363975) [14:46:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2209.codfw.wmnet [14:48:54] (03PS1) 10Brouberol: zookeeper: use datacenter-local aliases for flink ensembles [cookbooks] - 10https://gerrit.wikimedia.org/r/1028521 (https://phabricator.wikimedia.org/T363975) [14:50:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P61957 and previous config saved to /var/cache/conftool/dbconfig/20240506-145001-marostegui.json [14:51:19] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9774602 (10Jhancock.wm) a:03Jhancock.wm [14:51:24] !log brouberol@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster cloudelastic: restart to pick up new JDK - brouberol@cumin2002 - T363975 [14:52:31] (03CR) 10JHathaway: [C:03+2] pcc: fix delete-canceled-pcc-run-dirs timer [puppet] - 10https://gerrit.wikimedia.org/r/1027008 (https://phabricator.wikimedia.org/T364173) (owner: 10JHathaway) [14:53:11] (03CR) 10JHathaway: [C:03+2] postfix: sasl ensure auth user matches from [puppet] - 10https://gerrit.wikimedia.org/r/1024733 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [14:53:20] (03CR) 10JHathaway: 
[C:03+2] postfix: add sasl auth header [puppet] - 10https://gerrit.wikimedia.org/r/1024734 (https://phabricator.wikimedia.org/T317574) (owner: 10JHathaway) [14:53:21] (03PS1) 10Jdrewniak: [Vector 2022] Deploy larger font-size and appearance menu to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028522 (https://phabricator.wikimedia.org/T362147) [14:53:39] (03PS2) 10JHathaway: postfix: add sasl auth header [puppet] - 10https://gerrit.wikimedia.org/r/1024734 (https://phabricator.wikimedia.org/T317574) [14:53:44] (03CR) 10JHathaway: [V:03+2 C:03+2] postfix: add sasl auth header [puppet] - 10https://gerrit.wikimedia.org/r/1024734 (https://phabricator.wikimedia.org/T317574) (owner: 10JHathaway) [14:53:53] (03CR) 10JHathaway: [C:03+2] puppetdb: remove unused hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/1026977 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway) [14:53:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61958 and previous config saved to /var/cache/conftool/dbconfig/20240506-145356-root.json [14:54:02] !log filippo@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host prometheus7001.magru.wmnet with OS bullseye [14:54:02] !log filippo@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host prometheus7001.magru.wmnet [14:54:32] (03CR) 10CI reject: [V:04-1] [Vector 2022] Deploy larger font-size and appearance menu to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028522 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdrewniak) [14:55:09] (03PS2) 10Jdrewniak: [Vector 2022] Deploy larger font-size and appearance menu to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028522 (https://phabricator.wikimedia.org/T362147) [14:55:32] (03CR) 10Scott French: [C:03+2] mathoid: add securityContext to all containers [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1027050 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [14:55:55] (03CR) 10CI reject: [V:04-1] [Vector 2022] Deploy larger font-size and appearance menu to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028522 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdrewniak) [14:55:59] !log installing less security updates [14:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:10] (03PS3) 10Jdrewniak: [Vector 2022] Deploy larger font-size and appearance menu to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028522 (https://phabricator.wikimedia.org/T362147) [14:56:37] (03Merged) 10jenkins-bot: mathoid: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1027050 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [15:00:13] FIRING: [6x] JobUnavailable: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:33] (03PS1) 10Fabfur: cumin:magru: added cumin aliases for magru DC [puppet] - 10https://gerrit.wikimedia.org/r/1028523 (https://phabricator.wikimedia.org/T346722) [15:01:25] (03CR) 10Muehlenhoff: [C:03+1] cumin:magru: added cumin aliases for magru DC [puppet] - 10https://gerrit.wikimedia.org/r/1028523 (https://phabricator.wikimedia.org/T346722) (owner: 10Fabfur) [15:01:52] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/mathoid: apply [15:02:14] (03CR) 10Filippo Giunchedi: "This and the corresponding private.git change broke puppet on thanos-fe hosts, please revert both changes until thanos-swift uses PKI" [puppet] - 10https://gerrit.wikimedia.org/r/1026625 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [15:02:18] (03CR) 10Ssingh: 
cumin:magru: added cumin aliases for magru DC (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1028523 (https://phabricator.wikimedia.org/T346722) (owner: 10Fabfur) [15:03:12] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [15:03:27] (03Merged) 10jenkins-bot: userOptions.php: Actually batch deletion [core] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1027744 (https://phabricator.wikimedia.org/T364311) (owner: 10Urbanecm) [15:03:45] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1027744|userOptions.php: Actually batch deletion (T364311)]] [15:03:57] T364311: userOptions.php fails to delete very large number of DB rows - https://phabricator.wikimedia.org/T364311 [15:04:49] 10ops-codfw, 06SRE, 10decommission-hardware: decommission db2112.codfw.wmnet - https://phabricator.wikimedia.org/T362793#9774649 (10Marostegui) @ABran-WMF this host was still present in zarcillo. [15:05:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P61959 and previous config saved to /var/cache/conftool/dbconfig/20240506-150508-marostegui.json [15:06:17] 10ops-codfw, 06SRE, 10decommission-hardware: decommission db2107.codfw.wmnet - https://phabricator.wikimedia.org/T362798#9774655 (10Marostegui) @ABran-WMF this host was still present in zarcillo. 
[15:06:18] (03PS2) 10Fabfur: cumin:magru: added cumin aliases for magru DC [puppet] - 10https://gerrit.wikimedia.org/r/1028523 (https://phabricator.wikimedia.org/T346722) [15:06:31] 10ops-codfw, 06SRE, 10decommission-hardware: decommission db2103.codfw.wmnet - https://phabricator.wikimedia.org/T362801#9774658 (10Marostegui) @ABran-WMF this host was still present in zarcillo [15:07:13] (03CR) 10Fabfur: cumin:magru: added cumin aliases for magru DC (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1028523 (https://phabricator.wikimedia.org/T346722) (owner: 10Fabfur) [15:08:02] (03CR) 10Muehlenhoff: [C:03+1] cumin:magru: added cumin aliases for magru DC [puppet] - 10https://gerrit.wikimedia.org/r/1028523 (https://phabricator.wikimedia.org/T346722) (owner: 10Fabfur) [15:09:03] (03PS1) 10Herron: pyrra: separate slo definitions from filesystem class [puppet] - 10https://gerrit.wikimedia.org/r/1028524 (https://phabricator.wikimedia.org/T302995) [15:09:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61960 and previous config saved to /var/cache/conftool/dbconfig/20240506-150902-root.json [15:09:14] RECOVERY - Check whether ferm is active by checking the default input chain on parse1016 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:09:30] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2287/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028523 (https://phabricator.wikimedia.org/T346722) (owner: 10Fabfur) [15:09:39] !log mwmaint1002: mwscript userOptions.php --wiki=loginwiki --delete rcenhancedfilters-seen-tour # T364269 [15:09:40] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/mathoid: apply [15:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:09:44] T364269: Drop user properties related to RC tours - https://phabricator.wikimedia.org/T364269 [15:10:33] (03CR) 10Ssingh: [C:03+1] cumin:magru: added cumin aliases for magru DC [puppet] - 10https://gerrit.wikimedia.org/r/1028523 (https://phabricator.wikimedia.org/T346722) (owner: 10Fabfur) [15:11:13] !log brouberol@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster cloudelastic: restart to pick up new JDK - brouberol@cumin2002 - T363975 [15:11:58] (03CR) 10Fabfur: [V:03+1 C:03+2] cumin:magru: added cumin aliases for magru DC [puppet] - 10https://gerrit.wikimedia.org/r/1028523 (https://phabricator.wikimedia.org/T346722) (owner: 10Fabfur) [15:12:06] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [15:12:56] PROBLEM - Check whether ferm is active by checking the default input chain on mw1491 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:13:49] (03CR) 10Herron: [C:03+2] pyrra: separate slo definitions from filesystem class [puppet] - 10https://gerrit.wikimedia.org/r/1028524 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [15:14:26] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete cert [puppet] - 10https://gerrit.wikimedia.org/r/1026438 (https://phabricator.wikimedia.org/T360439) (owner: 10Muehlenhoff) [15:16:01] !log [urbanecm@mwmaint1002 ~]$ mwscript userOptions.php --wiki=enwiki --delete rcenhancedfilters-seen-tour # T364269 [15:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:05] T364269: Drop user properties related to RC tours - https://phabricator.wikimedia.org/T364269 [15:16:12] !log [urbanecm@mwmaint1002 ~]$ foreachwikiindblist s2 userOptions.php --delete rcenhancedfilters-seen-tour # T364269 [15:16:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:26] !log [urbanecm@mwmaint1002 ~]$ foreachwikiindblist s{3-8} userOptions.php --delete rcenhancedfilters-seen-tour # T364269 [15:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:01] (03PS1) 10Muehlenhoff: New Cumin alias for analytics mariadb nodes [puppet] - 10https://gerrit.wikimedia.org/r/1028526 [15:18:56] 10ops-codfw, 06SRE, 06serviceops: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9774724 (10Jhancock.wm) Forgot I left it there. All yours now! [15:19:44] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1028503 (https://phabricator.wikimedia.org/T364016) (owner: 10Filippo Giunchedi) [15:20:08] ACKNOWLEDGEMENT - MD RAID on mw2382 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T364317 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:20:15] 10ops-codfw, 06SRE: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T364317 (10ops-monitoring-bot) 03NEW [15:20:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T361627)', diff saved to https://phabricator.wikimedia.org/P61961 and previous config saved to /var/cache/conftool/dbconfig/20240506-152016-marostegui.json [15:20:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2188.codfw.wmnet with reason: Maintenance [15:20:19] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [15:20:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2188.codfw.wmnet with reason: Maintenance [15:20:36] !log 
urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1027744|userOptions.php: Actually batch deletion (T364311)]] (duration: 16m 51s) [15:20:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T361627)', diff saved to https://phabricator.wikimedia.org/P61962 and previous config saved to /var/cache/conftool/dbconfig/20240506-152040-marostegui.json [15:20:42] T364311: userOptions.php fails to delete very large number of DB rows - https://phabricator.wikimedia.org/T364311 [15:20:47] urbanecm: sorry, had to deal with family stuff - will reschedule for later [15:21:13] edsanders: no worries. hope you're fine! [15:21:18] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1023468 (https://phabricator.wikimedia.org/T332400) (owner: 10Andrew Bogott) [15:21:19] yeah - all good [15:23:14] 10ops-codfw, 06SRE: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T364317#9774750 (10andrea.denisse) →14Duplicate dup:03T362938 [15:23:28] 10ops-codfw, 06SRE, 06serviceops: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9774748 (10andrea.denisse) [15:24:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61963 and previous config saved to /var/cache/conftool/dbconfig/20240506-152408-root.json [15:24:14] 10ops-codfw, 06SRE, 06serviceops: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9774754 (10andrea.denisse) Hi team, this alert keeps firing since April 18th. [15:24:36] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [15:25:49] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [15:30:04] jan_drewniak: Time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240506T1530). 
[15:31:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T361627)', diff saved to https://phabricator.wikimedia.org/P61964 and previous config saved to /var/cache/conftool/dbconfig/20240506-153101-marostegui.json [15:31:19] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [15:32:54] (03CR) 10Andrea Denisse: [C:03+2] "Hi, the `thanos-swift` hosts use a different set of certificates that were not removed from the private repository so I don't think this i" [puppet] - 10https://gerrit.wikimedia.org/r/1026625 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [15:33:42] (03PS2) 10Brouberol: zookeeper: use datacenter-local aliases for flink ensembles [cookbooks] - 10https://gerrit.wikimedia.org/r/1028521 (https://phabricator.wikimedia.org/T363975) [15:37:01] (03PS1) 10JMeybohm: ratelimit: Update ratelimit service to git 3fcc360 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1028532 (https://phabricator.wikimedia.org/T362310) [15:38:17] (03CR) 10JMeybohm: "== Step 0: scanning /home/jayme/code/wmf/operations/docker-images/production-images/images ==" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1028532 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [15:39:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61965 and previous config saved to /var/cache/conftool/dbconfig/20240506-153914-root.json [15:39:52] (03CR) 10Filippo Giunchedi: "Thank you Andrea! 
Indeed another valid solution would be to make thanos-fe stop referencing ssl/thanos-query.discovery.wmnet.key" [puppet] - 10https://gerrit.wikimedia.org/r/1026625 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [15:40:22] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1036 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:42:56] RECOVERY - Check whether ferm is active by checking the default input chain on mw1491 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:42:56] (03CR) 10JMeybohm: "@btullis@wikimedia.org is breaks compatibility with the spark-operator chart in favor of compatibility with out standard modules and scaff" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026860 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [15:46:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P61966 and previous config saved to /var/cache/conftool/dbconfig/20240506-154608-marostegui.json [15:52:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:54:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61967 and previous config saved to /var/cache/conftool/dbconfig/20240506-155420-root.json [15:54:47] (03CR) 10Andrea Denisse: [C:03+2] "I agree, I think the root cause of the issue is that after the cergen to CFSSL migration the thanos-fe hosts were referencing the old cert" [puppet] - 10https://gerrit.wikimedia.org/r/1026625 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [15:55:00] 
(03PS1) 10JMeybohm: Revert "Remove mw2382 as kubernetes node to prevent scap failures" [puppet] - 10https://gerrit.wikimedia.org/r/1028566 [15:55:00] (03CR) 10Vgutierrez: "please do not merge it till the applayer service is up & running" [puppet] - 10https://gerrit.wikimedia.org/r/1028504 (https://phabricator.wikimedia.org/T364016) (owner: 10Filippo Giunchedi) [15:56:54] (03CR) 10JMeybohm: [C:03+2] Revert "Remove mw2382 as kubernetes node to prevent scap failures" [puppet] - 10https://gerrit.wikimedia.org/r/1028566 (owner: 10JMeybohm) [15:57:54] 06SRE, 06SRE Observability: confd prom exporter cannot distinguish targets with a common base name - https://phabricator.wikimedia.org/T363924#9774906 (10Scott_French) Thanks, @fgiunchedi. Revisiting the patch I put together last week, there are two ways to go at this that come to mind - one similar to what w... [15:59:37] (03PS6) 10Dreamrimmer: Remove wmgCollectionArticleNamespaces config for enWS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019423 (https://phabricator.wikimedia.org/T361422) [16:01:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P61968 and previous config saved to /var/cache/conftool/dbconfig/20240506-160116-marostegui.json [16:03:53] (03PS2) 10Andrea Denisse: thanos: Update certificate names for Thanos hosts to match CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) [16:08:22] (03PS1) 10Andrea Denisse: Revert "ssl: Remove unnecessary dummy key from thanos-query hosts" [labs/private] - 10https://gerrit.wikimedia.org/r/1028567 [16:11:14] (03CR) 10Andrew Bogott: puppetserver-deploy-code: bail out if current branch is not 'production' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1026682 (https://phabricator.wikimedia.org/T364047) (owner: 10Andrew Bogott) [16:11:44] (03CR) 10Andrea Denisse: [V:03+2 C:03+2] Revert "ssl: Remove unnecessary dummy key 
from thanos-query hosts" [labs/private] - 10https://gerrit.wikimedia.org/r/1028567 (owner: 10Andrea Denisse) [16:13:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:16:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T361627)', diff saved to https://phabricator.wikimedia.org/P61969 and previous config saved to /var/cache/conftool/dbconfig/20240506-161624-marostegui.json [16:16:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2202.codfw.wmnet with reason: Maintenance [16:16:30] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [16:16:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2202.codfw.wmnet with reason: Maintenance [16:17:51] (03PS3) 10Andrea Denisse: thanos: Update certificate names for Thanos hosts to match CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) [16:18:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:25:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2212.codfw.wmnet with reason: Maintenance [16:25:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2212.codfw.wmnet with reason: Maintenance [16:25:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2212 
(T361627)', diff saved to https://phabricator.wikimedia.org/P61970 and previous config saved to /var/cache/conftool/dbconfig/20240506-162528-marostegui.json [16:25:31] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [16:30:46] (03CR) 10Vgutierrez: [C:03+1] "nice catch :)" [puppet] - 10https://gerrit.wikimedia.org/r/1028514 (owner: 10Ssingh) [16:33:16] (03CR) 10BCornwall: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1028514 (owner: 10Ssingh) [16:33:28] (03PS1) 10Elukey: ml-services: tune autoscaling for damaging, goodfaith and reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028552 (https://phabricator.wikimedia.org/T363336) [16:35:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T361627)', diff saved to https://phabricator.wikimedia.org/P61971 and previous config saved to /var/cache/conftool/dbconfig/20240506-163540-marostegui.json [16:35:47] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [16:36:59] (03PS4) 10Andrea Denisse: thanos: Update certificate names for Thanos hosts to match CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) [16:37:09] !log jayme@cumin1002 conftool action : set/pooled=yes; selector: name=mw2382.codfw.wmnet [16:38:24] 10ops-codfw, 06SRE, 06serviceops: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9775036 (10JMeybohm) 05Open→03Resolved a:03JMeybohm >>! In T362938#9774724, @Jhancock.wm wrote: > Forgot I left it there. All yours now! Thanks. There was mdadm metadata sill on the "new" disk, I... 
[16:38:45] !log jayme@cumin1002 START - Cookbook sre.hosts.remove-downtime for mw2382.codfw.wmnet [16:38:45] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2382.codfw.wmnet [16:41:36] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:42:02] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:44:36] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:45:53] !log disable puppet on A:ncredir to merge CR 1028514 [16:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:11] (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: remove per-site override for ncredir [puppet] - 10https://gerrit.wikimedia.org/r/1028514 (owner: 10Ssingh) [16:49:17] (03PS5) 10Andrea Denisse: thanos: Update certificate names for Thanos hosts to match CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) [16:50:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P61972 and previous config saved to /var/cache/conftool/dbconfig/20240506-165048-marostegui.json [16:52:29] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (NOOP 8 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [16:52:34] !log sudo cumin 'A:ncredir' 'run-puppet-agent --enable "merging CR 1028514"' [16:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:12] (03CR) 10Andrea Denisse: [V:03+1] "PCC 
results: https://puppet-compiler.wmflabs.org/output/1028546/2297/" [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [16:55:38] (03PS5) 10Dreamrimmer: Enable 'flood' user group at en.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019822 (https://phabricator.wikimedia.org/T351250) [16:56:23] (03PS1) 10Herron: pyrra: onboard etcd request/latency SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1028555 (https://phabricator.wikimedia.org/T302995) [16:58:10] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:59:08] (03PS1) 10JMeybohm: Add new mesh.configuration version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028557 (https://phabricator.wikimedia.org/T362310) [16:59:09] (03PS1) 10JMeybohm: mesh.configuration: Add support for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028558 (https://phabricator.wikimedia.org/T362310) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240506T1700) [17:00:04] ryankemper: gettimeofday() says it's time for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240506T1700) [17:01:08] (03CR) 10Andrea Denisse: [C:03+1] "Nice change, LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1028502 (https://phabricator.wikimedia.org/T364016) (owner: 10Filippo Giunchedi) [17:05:09] 10ops-eqiad, 06SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9775097 (10andrea.denisse) @fgiunchedi Good to know, thank you. Do you think we should do the syncing again to the new drive? 
[17:05:26] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:05:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P61973 and previous config saved to /var/cache/conftool/dbconfig/20240506-170556-marostegui.json [17:07:36] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 14 Jun 2024 01:28:50 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:07:36] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.348 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:07:54] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:17:24] (03PS6) 10Andrea Denisse: thanos: Update certificate names for Thanos hosts to match CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) [17:20:29] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (NOOP 8 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [17:21:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T361627)', diff saved to https://phabricator.wikimedia.org/P61974 and previous config saved to /var/cache/conftool/dbconfig/20240506-172103-marostegui.json [17:21:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2216.codfw.wmnet with reason: Maintenance 
[17:21:07] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [17:21:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2216.codfw.wmnet with reason: Maintenance [17:21:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T361627)', diff saved to https://phabricator.wikimedia.org/P61975 and previous config saved to /var/cache/conftool/dbconfig/20240506-172126-marostegui.json [17:31:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T361627)', diff saved to https://phabricator.wikimedia.org/P61976 and previous config saved to /var/cache/conftool/dbconfig/20240506-173143-marostegui.json [17:31:47] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [17:32:22] 10ops-magru: remote hands directions for racking and cabling magru - https://phabricator.wikimedia.org/T363368#9775258 (10RobH) 05Open→03Declined Items arrived and Jenn/Papaul got them all cabled up. No remote hands directions needed! 
[17:32:33] (03PS7) 10Andrea Denisse: thanos: Provision Thanos frontend TLS certificates with CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) [17:33:24] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9775267 (10RobH) 05Open→03Resolved a:03RobH [17:33:32] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9775264 (10RobH) 05Open→03Resolved a:03RobH [17:35:50] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (NOOP 8 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [17:36:51] (03PS8) 10Andrea Denisse: thanos: Provision Thanos frontend TLS certificates with CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) [17:39:57] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (NOOP 8 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [17:41:15] (03CR) 10Andrea Denisse: [V:03+1] "Hello team, I noticed that Puppet was failing on the thanos-fe hosts because only the thanos-be hosts were migrated to CFSSL." 
[puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [17:46:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P61977 and previous config saved to /var/cache/conftool/dbconfig/20240506-174651-marostegui.json [17:48:04] (03PS1) 10Scott French: confd: Extend confd-lint-wrap to accept a unique resource name [puppet] - 10https://gerrit.wikimedia.org/r/1028559 [17:48:04] (03PS1) 10Scott French: confd: prom exporter uses resource name to find state file [puppet] - 10https://gerrit.wikimedia.org/r/1028560 [17:50:30] PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:52:28] RECOVERY - Host asw-d-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.88 ms [17:53:49] 10SRE-swift-storage, 06Commons: Commons: File:Gnome-edit-delete.svg not found - https://phabricator.wikimedia.org/T363995#9775321 (10jcrespo) It was failing back in 2021: ` [2021-12-11 05:13:49,322] ERROR:swiftclient.service Object GET failed: http://ms-fe.svc.codfw.wmnet/v1/AUTH_mw/wikipedia-commons-local-pu... 
[17:55:28] (03CR) 10Scott French: "PCC diff for puppetmaster1001: https://puppet-compiler.wmflabs.org/output/1028560/2301/" [puppet] - 10https://gerrit.wikimedia.org/r/1028560 (owner: 10Scott French) [17:59:07] (03PS2) 10Scott French: confd: Extend confd-lint-wrap to accept a unique resource name [puppet] - 10https://gerrit.wikimedia.org/r/1028559 (https://phabricator.wikimedia.org/T363924) [17:59:09] (03PS2) 10Scott French: confd: prom exporter uses resource name to find state file [puppet] - 10https://gerrit.wikimedia.org/r/1028560 (https://phabricator.wikimedia.org/T363924) [18:01:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P61978 and previous config saved to /var/cache/conftool/dbconfig/20240506-180158-marostegui.json [18:09:14] 10ops-eqiad, 06SRE, 06DBA: db1178 not booting up - https://phabricator.wikimedia.org/T364300#9775360 (10VRiley-WMF) a:03VRiley-WMF [18:12:34] 06SRE, 06SRE Observability, 13Patch-For-Review: confd prom exporter cannot distinguish targets with a common base name - https://phabricator.wikimedia.org/T363924#9775362 (10Scott_French) Alright, giving the two scripts the same view of how the state file should be named (without dealing with confd's staged... 
[18:17:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T361627)', diff saved to https://phabricator.wikimedia.org/P61979 and previous config saved to /var/cache/conftool/dbconfig/20240506-181706-marostegui.json [18:17:11] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [18:18:28] !log [urbanecm@mwmaint1002 ~]$ mwscript userOptions.php --wiki=loginwiki --delete wlenhancedfilters-seen-tour # T364269 [18:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:30] T364269: Drop user properties related to RC tours - https://phabricator.wikimedia.org/T364269 [18:20:12] !log [urbanecm@mwmaint1002 ~]$ foreachwikiindblist s2 userOptions.php --delete wlenhancedfilters-seen-tour # T364269 [18:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:34] !log [urbanecm@mwmaint1002 ~]$ foreachwikiindblist s4 userOptions.php --delete wlenhancedfilters-seen-tour # T364269 [18:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:45] !log [urbanecm@mwmaint1002 ~]$ foreachwikiindblist s5 userOptions.php --delete wlenhancedfilters-seen-tour # T364269 [18:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:53] !log [urbanecm@mwmaint1002 ~]$ foreachwikiindblist s6 userOptions.php --delete wlenhancedfilters-seen-tour # T364269 [18:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:17] (03CR) 10Herron: [C:03+1] grafana: add magru prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1028503 (https://phabricator.wikimedia.org/T364016) (owner: 10Filippo Giunchedi) [18:35:51] (03CR) 10Herron: [C:03+1] "🧼" [puppet] - 10https://gerrit.wikimedia.org/r/1028502 (https://phabricator.wikimedia.org/T364016) (owner: 10Filippo Giunchedi) 
[18:51:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:56:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:01:27] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:01:41] (03PS2) 10Herron: pyrra: onboard etcd request/latency SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1028555 (https://phabricator.wikimedia.org/T302995) [19:04:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:05:40] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9775442 (10Papaul) [19:06:27] (03CR) 10Dzahn: "guess I forgot the ".m." 
variant this time" [dns] - 10https://gerrit.wikimedia.org/r/1025871 (https://phabricator.wikimedia.org/T363825) (owner: 10Dzahn) [19:06:30] (03CR) 10Herron: [C:03+2] "self merging to onboard this from grizzly, cc for awareness" [puppet] - 10https://gerrit.wikimedia.org/r/1028555 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [19:07:39] (03PS2) 10Dzahn: create wikipedia-it-arbcom.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1025871 (https://phabricator.wikimedia.org/T363825) [19:08:32] (03CR) 10Dzahn: [C:03+2] create wikipedia-it-arbcom.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1025871 (https://phabricator.wikimedia.org/T363825) (owner: 10Dzahn) [19:08:44] (03PS3) 10Dzahn: create wikipedia-it-arbcom.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1025871 (https://phabricator.wikimedia.org/T363825) [19:09:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:09:56] (03PS1) 10Andrea Denisse: Revert "ssl: Delete unused certificate for the thanos-query hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1028570 [19:15:58] !log [urbanecm@mwmaint1002 ~]$ foreachwikiindblist s3 userOptions.php --delete wlenhancedfilters-seen-tour # T364269 [19:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:01] T364269: Drop user properties related to RC tours - https://phabricator.wikimedia.org/T364269 [19:16:38] (03PS1) 10Ahmon Dancy: coredump.conf: Remove misconfigured KeepFree setting [puppet] - 10https://gerrit.wikimedia.org/r/1028565 [19:17:07] !log [urbanecm@mwmaint1002 ~]$ foreachwikiindblist s5 userOptions.php --delete rcenhancedfilters-tried-highlight # T364269 [19:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors 
- api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:17:38] !log [urbanecm@mwmaint1002 ~]$ foreachwikiindblist s6 userOptions.php --delete rcenhancedfilters-tried-highlight # T364269 [19:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:51] (03CR) 10Andrea Denisse: [C:03+2] Revert "ssl: Delete unused certificate for the thanos-query hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1028570 (owner: 10Andrea Denisse) [19:19:15] (03CR) 10CI reject: [V:04-1] coredump.conf: Remove misconfigured KeepFree setting [puppet] - 10https://gerrit.wikimedia.org/r/1028565 (owner: 10Ahmon Dancy) [19:19:25] (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1025871 (https://phabricator.wikimedia.org/T363825) (owner: 10Dzahn) [19:21:03] !log [urbanecm@mwmaint1002 ~]$ foreachwikiindblist s6 userOptions.php --delete rcenhancedfilters-seen-highlight-button-counter # T364269 [19:21:06] (03PS2) 10Ahmon Dancy: coredump.conf: Remove misconfigured KeepFree setting [puppet] - 10https://gerrit.wikimedia.org/r/1028565 [19:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:07] T364269: Drop user properties related to RC tours - https://phabricator.wikimedia.org/T364269 [19:21:31] (03CR) 10Andrea Denisse: [C:03+2] "Hi, I've reverted this change along with the CRT change and restored the certificates in the private repository until we merge patch #1028" [puppet] - 10https://gerrit.wikimedia.org/r/1026625 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [19:22:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:22:59] (03PS19) 10CDobbins: purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (https://phabricator.wikimedia.org/T360506) [19:23:30] !log [urbanecm@mwmaint1002 ~]$ mwscript userOptions.php --wiki=enwiki --delete wlenhancedfilters-seen-tour # T364269 [19:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:10] !log [urbanecm@mwmaint1002 ~]$ foreachwikiindblist s5 userOptions.php --delete rcenhancedfilters-seen-highlight-button-counter # T364269 [19:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:36] (03CR) 10Ssingh: "LGTM! Let's plan to merge this tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [19:26:37] (03PS1) 10Herron: pyrra: etcd: add namespace [puppet] - 10https://gerrit.wikimedia.org/r/1028586 [19:26:50] (03PS2) 10Herron: pyrra: etcd: add namespace [puppet] - 10https://gerrit.wikimedia.org/r/1028586 [19:27:19] (03CR) 10Dzahn: "make sense. but do we want to set an absolute value in bytes then? or none at all?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1028565 (owner: 10Ahmon Dancy) [19:27:51] (03PS1) 10Esanders: Revert "Revert "Release DT visual enhancements to all except Wikipedia/Commons/Wikidata"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028571 (https://phabricator.wikimedia.org/T352087) [19:28:13] (03CR) 10Ssingh: [C:03+1] purged: add PKI cert handling [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [19:28:52] (03PS3) 10Dzahn: delete civicrm-old.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1026986 [19:28:57] (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1026986 (owner: 10Dzahn) [19:30:16] 10SRE-swift-storage, 06Commons: Commons: File:Gnome-edit-delete.svg not found - https://phabricator.wikimedia.org/T363995#9775646 (10Krinkle) The burst of DELETE for thumbnails is presumably an editor using the "purge" action on-wiki, in an attempt to fix the problem. I did not another purge on May 5th, not k... [19:30:56] (03PS2) 10Esanders: Revert "Revert "Release DT visual enhancements to all except Wikipedia/Commons/Wikidata"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028571 (https://phabricator.wikimedia.org/T352087) [19:31:03] (03PS4) 10Paladox: Gerrit: update mail soy templates to match upstream [puppet] - 10https://gerrit.wikimedia.org/r/1027726 [19:31:06] (03CR) 10Paladox: Gerrit: update mail soy templates to match upstream (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1027726 (owner: 10Paladox) [19:31:33] (03CR) 10Dzahn: [C:03+1] "the part that, per code comment, a cert was reused for another service here seems to have complicated the matter. 
maybe that should be sep" [puppet] - 10https://gerrit.wikimedia.org/r/1025879 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [19:37:51] (03CR) 10Dzahn: [V:03+1] "https://github.com/GerritCodeReview/gerrit/blob/v3.8.5/resources/com/google/gerrit/server/mail/ChangeSubject.soy" [puppet] - 10https://gerrit.wikimedia.org/r/1027726 (owner: 10Paladox) [19:39:37] 10ops-eqiad, 06SRE, 06DBA: db1178 not booting up - https://phabricator.wikimedia.org/T364300#9775662 (10VRiley-WMF) Noticed that the server was boot looping at the memory stage. Tried a flea power drain and there was no change. Then moved onto the base configuration and slowly added components back into the... [19:39:38] (03CR) 10Dzahn: [V:03+1 C:03+1] "I confirm this looks like 3.8.5 upstream - https://github.com/GerritCodeReview/gerrit/blob/v3.8.5/resources/com/google/gerrit/server/mail/" [puppet] - 10https://gerrit.wikimedia.org/r/1027726 (owner: 10Paladox) [19:40:19] (03CR) 10Dzahn: [V:03+1 C:03+2] Gerrit: update mail soy templates to match upstream [puppet] - 10https://gerrit.wikimedia.org/r/1027726 (owner: 10Paladox) [19:43:03] (03CR) 10Ssingh: "A bit late sorry but templates/155.80.208.in-addr.arpa:11 1H IN PTR civicrm-old.wikimedia.org can also be removed!" 
[dns] - 10https://gerrit.wikimedia.org/r/1026986 (owner: 10Dzahn) [19:46:26] !log [urbanecm@mwmaint1002 ~]$ foreachwikiindblist s7 userOptions.php --delete wlenhancedfilters-seen-tour # T364269 [19:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:29] T364269: Drop user properties related to RC tours - https://phabricator.wikimedia.org/T364269 [19:47:06] (03PS1) 10Dzahn: remove PTR for civicrm-old.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1028590 [19:47:56] (03PS2) 10Dzahn: remove PTR for civicrm-old.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1028590 [19:49:03] (03CR) 10Ssingh: [C:03+1] remove PTR for civicrm-old.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1028590 (owner: 10Dzahn) [19:49:26] (03CR) 10Dzahn: [C:03+2] remove PTR for civicrm-old.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1028590 (owner: 10Dzahn) [19:52:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:58:42] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9775697 (10Papaul) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240506T2000). nyaa~ [20:00:05] jan_drewniak and esanders: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:18] o/ [20:00:23] also present [20:02:30] edsanders: I see yours is a config change, I can deploy that (and I'll do mine after) :) [20:02:43] thanks [20:03:39] (03CR) 10Herron: [C:03+2] pyrra: etcd: add namespace [puppet] - 10https://gerrit.wikimedia.org/r/1028586 (owner: 10Herron) [20:04:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028571 (https://phabricator.wikimedia.org/T352087) (owner: 10Esanders) [20:05:39] (03Merged) 10jenkins-bot: Revert "Revert "Release DT visual enhancements to all except Wikipedia/Commons/Wikidata"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028571 (https://phabricator.wikimedia.org/T352087) (owner: 10Esanders) [20:05:55] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:1028571|Revert "Revert "Release DT visual enhancements to all except Wikipedia/Commons/Wikidata"" (T352087)]] [20:05:59] T352087: [MILESTONE] Offer Usability Improvements as default-on feature at Phase 2 wikis (desktop) - https://phabricator.wikimedia.org/T352087 [20:09:25] !log jdrewniak@deploy1002 esanders and jdrewniak: Backport for [[gerrit:1028571|Revert "Revert "Release DT visual enhancements to all except Wikipedia/Commons/Wikidata"" (T352087)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:10:04] edsanders: the patch is ready to test [20:10:07] thanks [20:14:05] jan_drewniak: looks good [20:14:44] edsanders: ok continuing with sync [20:14:48] !log jdrewniak@deploy1002 esanders and jdrewniak: Continuing with sync [20:18:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:21:48] PROBLEM - Check whether ferm is active by checking the default input chain on mw1458 is CRITICAL: 
ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:22:14] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1027 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:23:42] (03CR) 10Jdrewniak: [C:03+2] [Vector 2022] Deploy larger font-size and appearance menu to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028522 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdrewniak) [20:27:29] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:1028571|Revert "Revert "Release DT visual enhancements to all except Wikipedia/Commons/Wikidata"" (T352087)]] (duration: 21m 33s) [20:27:32] T352087: [MILESTONE] Offer Usability Improvements as default-on feature at Phase 2 wikis (desktop) - https://phabricator.wikimedia.org/T352087 [20:28:09] edsanders: ok patch is live :) [20:28:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028522 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdrewniak) [20:29:15] (03PS4) 10Jdrewniak: [Vector 2022] Deploy larger font-size and appearance menu to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028522 (https://phabricator.wikimedia.org/T362147) [20:29:26] jan_drewniak: thanks, looks good [20:29:26] (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028522 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdrewniak) [20:31:45] (03Merged) 10jenkins-bot: [Vector 2022] Deploy larger font-size and appearance menu to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028522 (https://phabricator.wikimedia.org/T362147) (owner: 10Jdrewniak) [20:32:02] 
!log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:1028522|[Vector 2022] Deploy larger font-size and appearance menu to pilot wikis (T362147)]] [20:32:07] T362147: Deploy reading accessibility settings menu and new typography defaults to first set of wikis - https://phabricator.wikimedia.org/T362147 [20:34:33] !log jdrewniak@deploy1002 jdrewniak: Backport for [[gerrit:1028522|[Vector 2022] Deploy larger font-size and appearance menu to pilot wikis (T362147)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:34:59] !log jdrewniak@deploy1002 jdrewniak: Continuing with sync [20:47:12] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:1028522|[Vector 2022] Deploy larger font-size and appearance menu to pilot wikis (T362147)]] (duration: 15m 10s) [20:47:18] T362147: Deploy reading accessibility settings menu and new typography defaults to first set of wikis - https://phabricator.wikimedia.org/T362147 [20:50:16] 06SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Dennis Mburugu - https://phabricator.wikimedia.org/T364320#9775917 (10Dzahn) Is there a specific service you want to use? 
[20:51:48] RECOVERY - Check whether ferm is active by checking the default input chain on mw1458 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:52:14] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1027 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:53:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:54:46] (03PS1) 10Dzahn: admin: add linafaridwmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1028600 (https://phabricator.wikimedia.org/T364068) [20:55:36] (03CR) 10CI reject: [V:04-1] admin: add linafaridwmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1028600 (https://phabricator.wikimedia.org/T364068) (owner: 10Dzahn) [20:56:42] (03CR) 10Dzahn: [C:04-1] "guess I have to move the user out of "ldap_only" even if it's for access without ssh ... per "Tox tests for admin/data/data.yaml failed!"" [puppet] - 10https://gerrit.wikimedia.org/r/1028600 (https://phabricator.wikimedia.org/T364068) (owner: 10Dzahn) [20:59:36] (03PS2) 10Dzahn: admin: add linafaridwmde to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1028600 (https://phabricator.wikimedia.org/T364068) [21:00:05] Reedy, sbassett, Maryum, and manfredi: OwO what's this, a deployment window?? Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240506T2100). 
nyaa~ [21:01:19] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review, 03WMDE-TechWish-Sprint-2024-04-24: Requesting access to analytics-privatedata-users for linafaridwmde - https://phabricator.wikimedia.org/T364068#9775947 (10Dzahn) 05Open→03In progress p:05Triage→03Medium [21:01:57] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access) - https://phabricator.wikimedia.org/T363514#9775965 (10Dzahn) Waiting for manager approval [21:02:26] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access) - https://phabricator.wikimedia.org/T363514#9775966 (10Dzahn) 05In progress→03Open [21:19:04] (03CR) 10Volans: "One nit inline. Some logic around the file name might become cleaner using pathlib, but it's totally optional ;)" [puppet] - 10https://gerrit.wikimedia.org/r/1028559 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [21:23:11] (03CR) 10Volans: [C:03+1] "makes sense to me" [puppet] - 10https://gerrit.wikimedia.org/r/1028560 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [21:28:16] !log dancy@deploy1002 Installing scap version "4.82.0" for 320 hosts [21:29:05] !log dancy@deploy1002 Installation of scap version "4.82.0" completed for 320 hosts [21:32:18] (03PS1) 10Scott French: apertium: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028604 [21:32:18] (03PS1) 10Scott French: api-gateway: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028605 [21:42:46] (03PS3) 10Scott French: confd: Extend confd-lint-wrap to accept a unique resource name [puppet] - 10https://gerrit.wikimedia.org/r/1028559 (https://phabricator.wikimedia.org/T363924) [21:42:46] (03PS3) 10Scott French: confd: prom exporter uses resource name to find state file [puppet] - 10https://gerrit.wikimedia.org/r/1028560 (https://phabricator.wikimedia.org/T363924)
[21:45:37] (03PS1) 10Dzahn: bump version of static-bugzilla to 2024-05-06-213327 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028607 (https://phabricator.wikimedia.org/T361774) [21:46:07] (03CR) 10Dzahn: [V:03+1 C:03+2] "version string per https://gitlab.wikimedia.org/repos/sre/miscweb/bugzilla/-/jobs/256833" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028607 (https://phabricator.wikimedia.org/T361774) (owner: 10Dzahn) [21:47:29] (03Merged) 10jenkins-bot: bump version of static-bugzilla to 2024-05-06-213327 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028607 (https://phabricator.wikimedia.org/T361774) (owner: 10Dzahn) [21:49:28] (03CR) 10Scott French: "Agreed, yeah. When I come back to address the TODO, I can totally sprinkle in some pathlib." [puppet] - 10https://gerrit.wikimedia.org/r/1028559 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [21:49:59] (03CR) 10Scott French: "Thanks, Riccardo." [puppet] - 10https://gerrit.wikimedia.org/r/1028560 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [21:54:33] (03PS2) 10Scott French: apertium: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028604 (https://phabricator.wikimedia.org/T362978) [21:54:35] (03PS2) 10Scott French: api-gateway: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028605 (https://phabricator.wikimedia.org/T362978) [22:07:04] (03CR) 10Scott French: "For lack of a better strategy, I'm just going alphabetically. Your review is kindly requested :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028604 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [22:12:57] (03CR) 10Scott French: "Thanks in advance for the review, Janis." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1028605 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [22:13:40] !log dzahn@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [22:14:51] !log dzahn@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [22:18:13] !log dzahn@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [22:20:21] !log dzahn@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [22:20:41] !log dzahn@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [22:22:37] !log dzahn@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [23:01:27] FIRING: [5x] JobUnavailable: Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:01:38] >:( [23:02:32] The dashboard shows everything as happy... [23:08:21] interesting indeed [23:08:28] we should flag it for olly [23:08:36] asking -observability [23:09:03] Hi, I'm here. [23:09:29] oh hi! I just asked asked in -observability [23:09:33] any idea about the alert denisse? [23:10:20] Not yet, but I'm taking a look. [23:10:38] np, not urgent [23:10:45] we can also put this in a task! [23:13:13] Good idea, let me create one. [23:19:33] thanks! also happy to do it but I am stepping out for a bit and will follow up later [23:19:36] brett: ^ [23:19:41] if you are still around [23:19:54] I don't see any anomaly on this graph either: https://grafana.wikimedia.org/d/zCYRtYvWz/ncredir-overview [23:20:21] https://grafana.wikimedia.org/d/zCYRtYvWz/ncredir-overview?orgId=1&var-site=codfw [23:21:49] Thanks for looking into it [23:22:00] I've created this barebones task, I'm adding content to it: https://phabricator.wikimedia.org/T364354 [23:27:53] I've added a brief description and a couple of screenshots. 
[23:29:07] that alert was ACKed 14 days ago [23:29:10] and expiring now [23:29:47] it's on all DCs, but 14 days, 12 days ago, 6 days ago and last one 9 hours ago [23:30:23] That's true, I guess it could be related to the mtail to benthos switch. [23:32:26] valentin called it "known" [23:33:56] nevermind, that was another job, but he still acked it [23:35:02] I think it may be related to this task: https://phabricator.wikimedia.org/T362776 [23:36:10] I think that's very likely [23:36:22] because it matches the times when things happened too [23:36:36] like today it started on codfw and "Enable benthos on ncredir@codfw" was merged [23:36:36] I've tagged him on the task, thanks for uploading the alert on all DCs screenshot. [23:37:04] esams 6 days ago also matches [23:37:11] yep [23:38:39] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1027480 [23:38:39] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1027480 (owner: 10TrainBranchBot) [23:39:08] I think the alert fired because it was set-up to use the info gathered by mtail instead of benthos. [23:45:24] denisse: yea, .. the blue line at the bottom means ncredir@ulsfo. it starts flatlining 04/24, hours before there was "Enable benthos on ncredir@ulsfo" [23:48:02] Something interesting I notice is that the ncredir hosts have the "benthos@ncredir.service" and the "system-ncredirmtail.slice" units are both active. [23:48:16] I think the hosts should only have the "benthos" unit enabled. [23:51:38] Ah, I think I found the issue. 
[23:52:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:52:23] I think it's an error interpolating labels, I'm digging deeper. [23:59:26] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1027480 (owner: 10TrainBranchBot)