[00:01:41] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1028926 (owner: 10TrainBranchBot) [00:35:25] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9779420 (10Papaul) [00:37:00] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom lsw1-a1-codfw - https://phabricator.wikimedia.org/T364097#9779435 (10Papaul) 05Open→03Resolved [00:42:22] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9779440 (10Papaul) [00:48:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:12:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:35:26] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T364439 (10phaultfinder) 03NEW [02:36:05] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028927 [02:36:26] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T364439#9779494 (10phaultfinder) [02:36:28] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:29] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T364439#9779498 
(10phaultfinder) [03:00:13] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:03:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:03:50] PROBLEM - Host mr1-magru.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [03:08:52] RECOVERY - Host mr1-magru.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 122.69 ms [03:52:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [04:35:04] 10ops-eqiad, 06SRE, 06DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9779529 (10Marostegui) Great! Thank you [04:50:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2203.codfw.wmnet with reason: Maintenance [04:50:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2203.codfw.wmnet with reason: Maintenance [04:52:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1178.eqiad.wmnet with OS bookworm [04:57:09] (03CR) 10Marostegui: [C:03+1] "jcrespo when do you estimate backups will be running on es6 and es7?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1025714 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [04:57:21] 10ops-eqiad, 06SRE, 06DBA: db1178 not booting up - https://phabricator.wikimedia.org/T364300#9779540 (10Marostegui) 05Open→03Resolved Thanks @VRiley-WMF - the host is fine now and the reimage is going through. [05:02:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1167.eqiad.wmnet with reason: Maintenance [05:02:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1167.eqiad.wmnet with reason: Maintenance [05:03:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [05:03:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [05:04:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2025', diff saved to https://phabricator.wikimedia.org/P61994 and previous config saved to /var/cache/conftool/dbconfig/20240508-050408-root.json [05:04:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1178.eqiad.wmnet with reason: host reimage [05:04:13] (03PS1) 10Marostegui: es2025: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1028920 [05:04:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T361627)', diff saved to https://phabricator.wikimedia.org/P61995 and previous config saved to /var/cache/conftool/dbconfig/20240508-050419-marostegui.json [05:04:22] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [05:04:57] (03CR) 10Marostegui: [C:03+2] es2025: Disable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/1028920 (owner: 10Marostegui) [05:05:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es2025.codfw.wmnet with OS bookworm [05:07:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1178.eqiad.wmnet with reason: host reimage [05:08:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:17:03] (03PS1) 10Marostegui: Revert "db1178: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1028946 [05:21:02] (03PS1) 10Marostegui: Revert "es2025: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1028947 [05:26:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2025.codfw.wmnet with reason: host reimage [05:27:33] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:27:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1178.eqiad.wmnet with OS bookworm [05:28:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2025.codfw.wmnet with reason: host reimage [05:34:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T361627)', diff saved to https://phabricator.wikimedia.org/P61996 and previous config saved to /var/cache/conftool/dbconfig/20240508-053445-marostegui.json [05:34:49] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [05:40:17] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [05:45:19] RECOVERY - Host 
mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 237.17 ms [05:47:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2022', diff saved to https://phabricator.wikimedia.org/P61997 and previous config saved to /var/cache/conftool/dbconfig/20240508-054705-root.json [05:47:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Give more weight to es2021', diff saved to https://phabricator.wikimedia.org/P61998 and previous config saved to /var/cache/conftool/dbconfig/20240508-054742-marostegui.json [05:48:24] (03PS1) 10Marostegui: es2022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1028922 [05:48:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Give more weight to es2021', diff saved to https://phabricator.wikimedia.org/P61999 and previous config saved to /var/cache/conftool/dbconfig/20240508-054825-marostegui.json [05:49:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P62000 and previous config saved to /var/cache/conftool/dbconfig/20240508-054953-marostegui.json [05:50:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es2022.codfw.wmnet with OS bookworm [05:50:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Give more weight to es2021', diff saved to https://phabricator.wikimedia.org/P62001 and previous config saved to /var/cache/conftool/dbconfig/20240508-055023-marostegui.json [05:51:27] (03CR) 10Marostegui: [C:03+2] es2022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1028922 (owner: 10Marostegui) [05:53:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2025.codfw.wmnet with OS bookworm [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240508T0600) [06:02:13] (03CR) 10Marostegui: [C:03+2] Revert "es2025: Disable notifications" [puppet] - 
10https://gerrit.wikimedia.org/r/1028947 (owner: 10Marostegui) [06:03:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Give more weight to es2021', diff saved to https://phabricator.wikimedia.org/P62002 and previous config saved to /var/cache/conftool/dbconfig/20240508-060312-marostegui.json [06:05:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P62003 and previous config saved to /var/cache/conftool/dbconfig/20240508-060501-marostegui.json [06:09:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2022.codfw.wmnet with reason: host reimage [06:11:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2022.codfw.wmnet with reason: host reimage [06:12:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:17:57] (03CR) 10Valerio Bozzolan: "If the goal is to force the user to adopt the "canonical" arcanist/ and phorge/, seems good to me." 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1028887 (https://phabricator.wikimedia.org/T364426) (owner: 10Aklapper) [06:20:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T361627)', diff saved to https://phabricator.wikimedia.org/P62004 and previous config saved to /var/cache/conftool/dbconfig/20240508-062012-marostegui.json [06:20:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1171.eqiad.wmnet with reason: Maintenance [06:20:16] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [06:20:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1171.eqiad.wmnet with reason: Maintenance [06:35:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2022.codfw.wmnet with OS bookworm [06:36:46] 10ops-codfw, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364439#9779623 (10phaultfinder) [06:40:39] 10ops-codfw, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364439#9779632 (10phaultfinder) [06:44:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es5 T364443 [06:44:47] T364443: Switchover es5 codfw master (es2023 -> es2024) - https://phabricator.wikimedia.org/T364443 [06:44:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es5 T364443 [06:44:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1172.eqiad.wmnet with reason: Maintenance [06:45:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1172.eqiad.wmnet with reason: Maintenance [06:45:25] !log marostegui@cumin1002 dbctl 
commit (dc=all): 'Depooling db1172 (T361627)', diff saved to https://phabricator.wikimedia.org/P62005 and previous config saved to /var/cache/conftool/dbconfig/20240508-064523-marostegui.json [06:45:28] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [06:45:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2025 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62006 and previous config saved to /var/cache/conftool/dbconfig/20240508-064552-root.json [06:46:49] (03PS1) 10Marostegui: Revert "es2022: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1028949 [06:48:33] (03PS1) 10Marostegui: mariadb: Promote es2024 to es5 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/1029065 (https://phabricator.wikimedia.org/T364443) [06:51:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2022 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62007 and previous config saved to /var/cache/conftool/dbconfig/20240508-065127-root.json [06:51:30] (03CR) 10Marostegui: [C:03+2] Revert "es2022: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1028949 (owner: 10Marostegui) [06:56:37] (03CR) 10Muehlenhoff: "The patch is fine, but the commit message is misleading and/or wrong: Fetching tge GeoIP data works fine with Puppet 7, we have seven host" [puppet] - 10https://gerrit.wikimedia.org/r/1026193 (https://phabricator.wikimedia.org/T363415) (owner: 10Dzahn) [06:57:44] (03CR) 10Muehlenhoff: [C:03+2] query_sever::deploy::manual: Remove obsolete class [puppet] - 10https://gerrit.wikimedia.org/r/1028763 (https://phabricator.wikimedia.org/T316876) (owner: 10Muehlenhoff) [06:58:53] (03CR) 10Jcrespo: [C:03+1] "I plan to merge this this week, which means the first run will be the 14th of May." 
[puppet] - 10https://gerrit.wikimedia.org/r/1025714 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [06:59:43] (03CR) 10Marostegui: [C:03+1] "\o/ thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1025714 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [06:59:48] (03CR) 10Muehlenhoff: [C:03+2] query_service: Stop installing git-fat [puppet] - 10https://gerrit.wikimedia.org/r/1028799 (https://phabricator.wikimedia.org/T316876) (owner: 10Muehlenhoff) [07:00:05] Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240508T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:00:42] (03PS2) 10Muehlenhoff: Remove obsolete Hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/1028764 (https://phabricator.wikimedia.org/T316876) [07:00:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2025 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62008 and previous config saved to /var/cache/conftool/dbconfig/20240508-070058-root.json [07:01:59] !log uninstalling git-fat on buster hosts T364373 [07:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:14] T364373: Remove git-fat from Puppet - https://phabricator.wikimedia.org/T364373 [07:05:02] (03CR) 10Volans: "Thanks for the review!" 
[software/conftool] - 10https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: 10Volans) [07:06:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2022 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62009 and previous config saved to /var/cache/conftool/dbconfig/20240508-070632-root.json [07:10:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T361627)', diff saved to https://phabricator.wikimedia.org/P62010 and previous config saved to /var/cache/conftool/dbconfig/20240508-071047-marostegui.json [07:10:50] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [07:16:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2025 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62011 and previous config saved to /var/cache/conftool/dbconfig/20240508-071604-root.json [07:21:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2022 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62012 and previous config saved to /var/cache/conftool/dbconfig/20240508-072138-root.json [07:25:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P62014 and previous config saved to /var/cache/conftool/dbconfig/20240508-072554-marostegui.json [07:26:42] (03PS3) 10Muehlenhoff: Remove obsolete Hiera settings to allow dropping Python 2 [puppet] - 10https://gerrit.wikimedia.org/r/1028764 (https://phabricator.wikimedia.org/T316876) [07:31:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2025 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62015 and previous config saved to /var/cache/conftool/dbconfig/20240508-073109-root.json [07:33:11] !log jmm@cumin2002 
START - Cookbook sre.puppet.migrate-host for host db2150.codfw.wmnet [07:34:58] (03PS1) 10Muehlenhoff: Switch db2150 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029107 (https://phabricator.wikimedia.org/T349619) [07:36:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2022 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62016 and previous config saved to /var/cache/conftool/dbconfig/20240508-073644-root.json [07:41:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P62017 and previous config saved to /var/cache/conftool/dbconfig/20240508-074102-marostegui.json [07:44:19] (03CR) 10Volans: sre.hosts.decommission: ask on failure (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1018718 (https://phabricator.wikimedia.org/T361306) (owner: 10Volans) [07:46:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2025 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62018 and previous config saved to /var/cache/conftool/dbconfig/20240508-074620-root.json [07:46:39] (03CR) 10Volans: "Should we merge it?" [puppet] - 10https://gerrit.wikimedia.org/r/956955 (https://phabricator.wikimedia.org/T303534) (owner: 10Volans) [07:46:57] (03PS1) 10Marostegui: db-production.php: Enable writes on es6 and es7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029109 (https://phabricator.wikimedia.org/T364446) [07:47:09] (03CR) 10Marostegui: [C:04-2] "Not ready yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029109 (https://phabricator.wikimedia.org/T364446) (owner: 10Marostegui) [07:47:35] (03CR) 10Marostegui: [C:04-2] "Amir, is there anything else required from the MW side of things apart from this patch?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029109 (https://phabricator.wikimedia.org/T364446) (owner: 10Marostegui) [07:49:54] (03CR) 10Muehlenhoff: [C:03+2] Switch db2150 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029107 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [07:51:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2022 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62019 and previous config saved to /var/cache/conftool/dbconfig/20240508-075150-root.json [07:52:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:55:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2150.codfw.wmnet [07:56:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T361627)', diff saved to https://phabricator.wikimedia.org/P62020 and previous config saved to /var/cache/conftool/dbconfig/20240508-075610-marostegui.json [07:56:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1177.eqiad.wmnet with reason: Maintenance [07:56:16] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [07:56:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1177.eqiad.wmnet with reason: Maintenance [07:56:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T361627)', diff saved to https://phabricator.wikimedia.org/P62021 and previous config saved to /var/cache/conftool/dbconfig/20240508-075635-marostegui.json [07:57:39] !log depool/restart/repool ms-fe1012 [07:57:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:44] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2168.codfw.wmnet [07:58:43] (03CR) 10Marostegui: [C:03+2] Revert "db1178: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1028946 (owner: 10Marostegui) [07:59:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62022 and previous config saved to /var/cache/conftool/dbconfig/20240508-075906-root.json [07:59:25] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1028926 (owner: 10TrainBranchBot) [08:01:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2025 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62023 and previous config saved to /var/cache/conftool/dbconfig/20240508-080128-root.json [08:01:39] (03PS1) 10Muehlenhoff: mwlog: Apply python2 setting per role [puppet] - 10https://gerrit.wikimedia.org/r/1029114 [08:02:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es5 T364443 [08:02:46] (03PS1) 10Muehlenhoff: Switch db2168 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029115 (https://phabricator.wikimedia.org/T349619) [08:02:49] T364443: Switchover es5 codfw master (es2023 -> es2024) - https://phabricator.wikimedia.org/T364443 [08:03:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es5 T364443 [08:03:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set es2024 with weight 0 T364443', diff saved to https://phabricator.wikimedia.org/P62024 and previous config saved to /var/cache/conftool/dbconfig/20240508-080312-root.json [08:04:00] (03CR) 10Marostegui: [C:03+2] mariadb: Promote es2024 to es5 codfw master [puppet] - 
10https://gerrit.wikimedia.org/r/1029065 (https://phabricator.wikimedia.org/T364443) (owner: 10Marostegui) [08:06:38] !log Starting es5 codfw failover from es2023 to es2024 - T364443 [08:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2022 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62025 and previous config saved to /var/cache/conftool/dbconfig/20240508-080656-root.json [08:07:40] (03CR) 10Muehlenhoff: [C:03+2] Switch db2168 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029115 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:08:11] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1028929 [08:08:11] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1028929 (owner: 10TrainBranchBot) [08:08:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2023 T364443', diff saved to https://phabricator.wikimedia.org/P62026 and previous config saved to /var/cache/conftool/dbconfig/20240508-080812-root.json [08:08:23] T364443: Switchover es5 codfw master (es2023 -> es2024) - https://phabricator.wikimedia.org/T364443 [08:08:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Give some weight to es5 master', diff saved to https://phabricator.wikimedia.org/P62027 and previous config saved to /var/cache/conftool/dbconfig/20240508-080848-marostegui.json [08:11:08] (03PS1) 10Marostegui: es2023: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1029117 [08:11:31] (03CR) 10Marostegui: [C:03+2] es2023: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1029117 (owner: 10Marostegui) [08:11:50] (03PS2) 10Slyngshede: API Tokens: Allow authorized users to manage their API tokens. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1026458 [08:12:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es2023.codfw.wmnet with OS bookworm [08:14:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62028 and previous config saved to /var/cache/conftool/dbconfig/20240508-081412-root.json [08:14:32] (03CR) 10Slyngshede: API Tokens: Allow authorized users to manage their API tokens. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1026458 (owner: 10Slyngshede) [08:14:44] (03CR) 10Slyngshede: [C:03+2] API Tokens: Allow authorized users to manage their API tokens. [software/bitu] - 10https://gerrit.wikimedia.org/r/1026458 (owner: 10Slyngshede) [08:16:19] (03Merged) 10jenkins-bot: API Tokens: Allow authorized users to manage their API tokens. [software/bitu] - 10https://gerrit.wikimedia.org/r/1026458 (owner: 10Slyngshede) [08:16:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2025 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62029 and previous config saved to /var/cache/conftool/dbconfig/20240508-081633-root.json [08:19:14] (03PS1) 10Marostegui: es2022: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1029120 (https://phabricator.wikimedia.org/T364289) [08:20:03] (03CR) 10Marostegui: [C:03+2] es2022: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1029120 (https://phabricator.wikimedia.org/T364289) (owner: 10Marostegui) [08:20:16] (03PS1) 10Muehlenhoff: elasticsearch::tlsproxy: Stop passing certs to tlsproxy::localssl [puppet] - 10https://gerrit.wikimedia.org/r/1029121 (https://phabricator.wikimedia.org/T360439) [08:20:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2168.codfw.wmnet [08:21:01] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2182.codfw.wmnet [08:21:31] !log 
klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [08:22:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2022 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62030 and previous config saved to /var/cache/conftool/dbconfig/20240508-082202-root.json [08:22:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T361627)', diff saved to https://phabricator.wikimedia.org/P62031 and previous config saved to /var/cache/conftool/dbconfig/20240508-082231-marostegui.json [08:22:35] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [08:22:35] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [08:23:30] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [08:24:26] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1029121 (https://phabricator.wikimedia.org/T360439) (owner: 10Muehlenhoff) [08:24:43] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[08:24:48] (03PS2) 10Muehlenhoff: elasticsearch::tlsproxy: Stop passing certs to tlsproxy::localssl [puppet] - 10https://gerrit.wikimedia.org/r/1029121 (https://phabricator.wikimedia.org/T360439) [08:25:43] (03PS1) 10Muehlenhoff: Switch db2182 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029123 (https://phabricator.wikimedia.org/T349619) [08:29:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62032 and previous config saved to /var/cache/conftool/dbconfig/20240508-082917-root.json [08:29:33] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [08:29:41] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1028929 (owner: 10TrainBranchBot) [08:29:46] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [08:30:57] (03CR) 10Muehlenhoff: [C:03+2] Switch db2182 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029123 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:31:20] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [08:31:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2023.codfw.wmnet with reason: host reimage [08:32:10] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[08:33:56] (03PS2) 10Zabe: Reapply "beta: Set password hashing to 'B'" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027564 [08:35:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2023.codfw.wmnet with reason: host reimage [08:35:28] (03PS1) 10Marostegui: Revert "es2023: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1028950 [08:35:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2182.codfw.wmnet [08:36:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1029121 (https://phabricator.wikimedia.org/T360439) (owner: 10Muehlenhoff) [08:36:44] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2208.codfw.wmnet [08:37:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P62033 and previous config saved to /var/cache/conftool/dbconfig/20240508-083739-marostegui.json [08:41:40] (03CR) 10Ladsgroup: [C:03+1] db-production.php: Enable writes on es6 and es7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029109 (https://phabricator.wikimedia.org/T364446) (owner: 10Marostegui) [08:43:03] (03CR) 10Marostegui: [C:04-2] "\o/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029109 (https://phabricator.wikimedia.org/T364446) (owner: 10Marostegui) [08:44:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62034 and previous config saved to /var/cache/conftool/dbconfig/20240508-084422-root.json [08:47:41] (03CR) 10Ladsgroup: "It's fine in my opinion." 
[puppet] - 10https://gerrit.wikimedia.org/r/1028855 (https://phabricator.wikimedia.org/T363825) (owner: 10Zabe) [08:52:09] (03CR) 10Filippo Giunchedi: [C:03+1] mwlog: Apply python2 setting per role [puppet] - 10https://gerrit.wikimedia.org/r/1029114 (owner: 10Muehlenhoff) [08:52:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P62035 and previous config saved to /var/cache/conftool/dbconfig/20240508-085246-marostegui.json [08:53:18] (03PS1) 10Muehlenhoff: profile::swift::proxy_tls: Use Envoy unconditionally and drop Hiera flag [puppet] - 10https://gerrit.wikimedia.org/r/1029128 (https://phabricator.wikimedia.org/T357750) [08:56:14] 10ops-eqiad, 06SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9779915 (10fgiunchedi) >>! In T363660#9775097, @andrea.denisse wrote: > @fgiunchedi Good to know, thank you. Do you think we should do the syncing again to the new drive? Good question, I think we're good as-is... 
[08:56:47] (03PS1) 10Muehlenhoff: Switch db2208 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029129 (https://phabricator.wikimedia.org/T349619) [08:58:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2023.codfw.wmnet with OS bookworm [08:58:53] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1029128 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [08:59:26] (03PS1) 10Ladsgroup: pager: Use SelectQueryBuilder::rawTables in IndexPager [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1028952 (https://phabricator.wikimedia.org/T364428) [08:59:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62036 and previous config saved to /var/cache/conftool/dbconfig/20240508-085929-root.json [09:03:11] (03CR) 10Marostegui: [C:03+2] Revert "es2023: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1028950 (owner: 10Marostegui) [09:03:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2023 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62037 and previous config saved to /var/cache/conftool/dbconfig/20240508-090334-root.json [09:06:07] (03PS1) 10Marostegui: es1022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1029133 (https://phabricator.wikimedia.org/T364289) [09:06:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1022 T364289', diff saved to https://phabricator.wikimedia.org/P62038 and previous config saved to /var/cache/conftool/dbconfig/20240508-090621-root.json [09:06:26] T364289: Reimage external store hosts with Bookworm - https://phabricator.wikimedia.org/T364289 [09:07:04] (03CR) 10Marostegui: [C:03+2] es1022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1029133 (https://phabricator.wikimedia.org/T364289) (owner: 10Marostegui) [09:07:36] !log 
marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1022.eqiad.wmnet with OS bookworm [09:07:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T361627)', diff saved to https://phabricator.wikimedia.org/P62039 and previous config saved to /var/cache/conftool/dbconfig/20240508-090754-marostegui.json [09:07:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1192.eqiad.wmnet with reason: Maintenance [09:07:57] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [09:08:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1192.eqiad.wmnet with reason: Maintenance [09:08:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T361627)', diff saved to https://phabricator.wikimedia.org/P62040 and previous config saved to /var/cache/conftool/dbconfig/20240508-090817-marostegui.json [09:08:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:09:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1177 T363792', diff saved to https://phabricator.wikimedia.org/P62041 and previous config saved to /var/cache/conftool/dbconfig/20240508-090925-root.json [09:09:29] T363792: Upgrade s8 to MariaDB 10.6 - https://phabricator.wikimedia.org/T363792 [09:09:57] (03PS1) 10Marostegui: db1177: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1029135 [09:10:47] (03CR) 10Marostegui: [C:03+2] db1177: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1029135 (owner: 10Marostegui) [09:10:57] !log marostegui@cumin1002 START - Cookbook 
sre.hosts.reimage for host db1177.eqiad.wmnet with OS bookworm [09:11:30] (03CR) 10Filippo Giunchedi: thanos: Update TLS certificate in Envoy config to match CFSSL provisioning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1028876 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [09:14:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62042 and previous config saved to /var/cache/conftool/dbconfig/20240508-091434-root.json [09:15:57] (03CR) 10Filippo Giunchedi: thanos: Provision Thanos frontend TLS certificates with CFSSL (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [09:18:22] (03CR) 10Zabe: [C:03+2] Reapply "beta: Set password hashing to 'B'" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027564 (owner: 10Zabe) [09:18:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2023 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62043 and previous config saved to /var/cache/conftool/dbconfig/20240508-091841-root.json [09:19:05] jouncebot: nowandnext [09:19:05] No deployments scheduled for the next 0 hour(s) and 40 minute(s) [09:19:05] In 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240508T1000) [09:19:10] (03Merged) 10jenkins-bot: Reapply "beta: Set password hashing to 'B'" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027564 (owner: 10Zabe) [09:19:13] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9779963 (10fgiunchedi) Also cc {T356412} and @elukey since the thanos-fe work here will help with that task too [09:22:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
es1022.eqiad.wmnet with reason: host reimage [09:23:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1177.eqiad.wmnet with reason: host reimage [09:23:13] (03CR) 10Ladsgroup: "This can have performance implications, I think our performance tsar should take a look at this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027148 (owner: 10Zabe) [09:25:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1022.eqiad.wmnet with reason: host reimage [09:26:19] (03CR) 10Filippo Giunchedi: [C:03+1] confd: prom exporter uses resource name to find state file [puppet] - 10https://gerrit.wikimedia.org/r/1028560 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [09:26:47] (03CR) 10Filippo Giunchedi: [C:03+1] confd: Extend confd-lint-wrap to accept a unique resource name [puppet] - 10https://gerrit.wikimedia.org/r/1028559 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [09:27:00] (03CR) 10Ladsgroup: [C:03+1] hieradata: Add arbcom_itwiki to private wikis [puppet] - 10https://gerrit.wikimedia.org/r/1028855 (https://phabricator.wikimedia.org/T363825) (owner: 10Zabe) [09:27:02] (03CR) 10Filippo Giunchedi: [C:03+1] confd: confd-lint-wrap ignores positional args separator [puppet] - 10https://gerrit.wikimedia.org/r/1028897 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [09:27:11] (03CR) 10Filippo Giunchedi: [C:03+1] confd: insert positional argument separator in check_cmd [puppet] - 10https://gerrit.wikimedia.org/r/1028898 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [09:28:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1177.eqiad.wmnet with reason: host reimage [09:29:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62044 and previous config saved to 
/var/cache/conftool/dbconfig/20240508-092944-root.json [09:33:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2023 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62045 and previous config saved to /var/cache/conftool/dbconfig/20240508-093347-root.json [09:33:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T361627)', diff saved to https://phabricator.wikimedia.org/P62046 and previous config saved to /var/cache/conftool/dbconfig/20240508-093350-marostegui.json [09:33:54] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [09:39:37] (03PS2) 10Klausman: admin_mg: Add Cassandra ServiceEntry and VS for LiftWing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029139 (https://phabricator.wikimedia.org/T360428) [09:39:45] (03PS1) 10Marostegui: Revert "db1177: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1028953 [09:39:49] (03PS1) 10Marostegui: Revert "es1022: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1028954 [09:41:04] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host snapshot1011.eqiad.wmnet with OS bullseye [09:41:17] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.06 - 2024.05.26): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9780020 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host snapsh... 
[09:44:03] (03CR) 10Muehlenhoff: [C:03+2] Switch db2208 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029129 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:48:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2208.codfw.wmnet [09:48:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2023 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62047 and previous config saved to /var/cache/conftool/dbconfig/20240508-094853-root.json [09:49:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1177.eqiad.wmnet with OS bookworm [09:49:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P62048 and previous config saved to /var/cache/conftool/dbconfig/20240508-094905-marostegui.json [09:49:12] (03CR) 10MVernon: [C:04-1] "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1029128 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [09:49:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1022.eqiad.wmnet with OS bookworm [09:49:20] (03CR) 10Marostegui: [C:03+2] Revert "db1177: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1028953 (owner: 10Marostegui) [09:50:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62049 and previous config saved to /var/cache/conftool/dbconfig/20240508-095011-root.json [09:50:15] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.06 - 2024.05.26): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9780073 (10BTullis) [09:50:25] (03PS1) 10Muehlenhoff: Inline profile::swift::proxy_tls [puppet] - 10https://gerrit.wikimedia.org/r/1029140 (https://phabricator.wikimedia.org/T357750) 
[09:50:51] (03CR) 10Muehlenhoff: "Ok, can you please take care of this?" [puppet] - 10https://gerrit.wikimedia.org/r/1029128 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [09:51:40] (03CR) 10Effie Mouzeli: [C:03+1] coredump.conf: Remove misconfigured KeepFree setting [puppet] - 10https://gerrit.wikimedia.org/r/1028565 (owner: 10Ahmon Dancy) [09:53:29] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1011.eqiad.wmnet with reason: host reimage [09:53:53] (03CR) 10Marostegui: [C:03+2] Revert "es1022: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1028954 (owner: 10Marostegui) [09:54:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1022 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62050 and previous config saved to /var/cache/conftool/dbconfig/20240508-095405-root.json [09:56:20] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1011.eqiad.wmnet with reason: host reimage [09:58:20] !log depooling 6 codfw api appservers in advance of reimaging to k8s workers [09:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240508T1000) [10:01:28] (03CR) 10Ladsgroup: [C:03+2] "oh lovely" [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1028952 (https://phabricator.wikimedia.org/T364428) (owner: 10Ladsgroup) [10:03:15] (03CR) 10Hnowlan: [C:03+2] kubernetes: make 6 codfw api appservers k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1028847 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan) [10:03:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2023 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62051 and previous config saved to /var/cache/conftool/dbconfig/20240508-100359-root.json 
[10:04:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P62052 and previous config saved to /var/cache/conftool/dbconfig/20240508-100416-marostegui.json [10:05:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62053 and previous config saved to /var/cache/conftool/dbconfig/20240508-100517-root.json [10:09:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1022 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62054 and previous config saved to /var/cache/conftool/dbconfig/20240508-100910-root.json [10:11:13] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2218.codfw.wmnet [10:11:47] (03PS4) 10Volans: puppetdb: drop support for deprecated API v3 [software/cumin] - 10https://gerrit.wikimedia.org/r/954081 [10:12:12] just got a "AphrontConnectionLostQueryException: #2006: MySQL server has gone away" from phab, although that seems to have been a one-off [10:12:13] (03PS1) 10Muehlenhoff: Switch db2218 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029143 (https://phabricator.wikimedia.org/T349619) [10:12:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:13:45] (03CR) 10Muehlenhoff: [C:03+2] Switch db2218 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029143 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:15:54] (03CR) 10Elukey: "Left some comments, lemme know :) The DestinationRule may be good if we want connection pooling, not sure if it is something desirable or " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029139 (https://phabricator.wikimedia.org/T360428) 
(owner: 10Klausman) [10:17:05] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2396.codfw.wmnet with OS bullseye [10:17:07] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2397.codfw.wmnet with OS bullseye [10:17:09] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2401.codfw.wmnet with OS bullseye [10:17:11] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2402.codfw.wmnet with OS bullseye [10:17:12] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2398.codfw.wmnet with OS bullseye [10:17:12] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2399.codfw.wmnet with OS bullseye [10:18:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:18:47] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9780145 (10elukey) >>! In T360414#9779961, @fgiunchedi wrote: > Also cc {T356412} and @elukey since the thanos-fe work here will help wit... 
[10:19:02] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host snapshot1011.eqiad.wmnet with OS bullseye [10:19:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2023 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62055 and previous config saved to /var/cache/conftool/dbconfig/20240508-101905-root.json [10:19:10] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.06 - 2024.05.26): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9780147 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host snapshot10... [10:19:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T361627)', diff saved to https://phabricator.wikimedia.org/P62056 and previous config saved to /var/cache/conftool/dbconfig/20240508-101923-marostegui.json [10:19:26] (03Merged) 10jenkins-bot: pager: Use SelectQueryBuilder::rawTables in IndexPager [core] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1028952 (https://phabricator.wikimedia.org/T364428) (owner: 10Ladsgroup) [10:19:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1193.eqiad.wmnet with reason: Maintenance [10:19:27] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [10:19:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1193.eqiad.wmnet with reason: Maintenance [10:19:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1193 (T361627)', diff saved to https://phabricator.wikimedia.org/P62057 and previous config saved to /var/cache/conftool/dbconfig/20240508-101946-marostegui.json [10:19:53] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.06 - 
2024.05.26): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9780151 (10BTullis) [10:20:13] FIRING: [2x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:20:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62058 and previous config saved to /var/cache/conftool/dbconfig/20240508-102023-root.json [10:22:50] (03CR) 10MVernon: [C:04-1] "In theory, yes, but given the general state of the beta cluster and my TODO list, it's unlikely to happen any time soon sorry :-(" [puppet] - 10https://gerrit.wikimedia.org/r/1029128 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [10:23:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [10:24:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1022 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62059 and previous config saved to /var/cache/conftool/dbconfig/20240508-102416-root.json [10:25:55] (03PS9) 10Fabfur: fifo_log_demux: add new parameters for current release [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) [10:31:26] (03CR) 10Aklapper: "I ran `rm -rf projects/phabricator/chatlog/` and `rm src/translations/PhabricatorChatlog*` and `../arcanist/bin/arc liberate`. 
Afterwards," [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1022053 (https://phabricator.wikimedia.org/T318763) (owner: 10Pppery) [10:31:39] (03CR) 10Volans: [C:03+2] puppetdb: drop support for deprecated API v3 [software/cumin] - 10https://gerrit.wikimedia.org/r/954081 (owner: 10Volans) [10:32:53] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2396.codfw.wmnet with reason: host reimage [10:32:56] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2401.codfw.wmnet with reason: host reimage [10:32:59] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2402.codfw.wmnet with reason: host reimage [10:33:00] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2399.codfw.wmnet with reason: host reimage [10:33:03] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1028952|pager: Use SelectQueryBuilder::rawTables in IndexPager (T364428)]] [10:33:04] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2397.codfw.wmnet with reason: host reimage [10:33:06] T364428: InvalidArgumentException: Wikimedia\Rdbms\JoinGroupBase::table: $table must be either string, JoinGroup or SelectQueryBuilder (via IndexPager) - https://phabricator.wikimedia.org/T364428 [10:33:08] (03CR) 10Elukey: admin_mg: Add Cassandra ServiceEntry and VS for LiftWing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029139 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [10:33:26] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2398.codfw.wmnet with reason: host reimage [10:34:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2023 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62060 and previous config saved to /var/cache/conftool/dbconfig/20240508-103410-root.json [10:34:20] (03CR) 10Elukey: admin_mg: Add Cassandra ServiceEntry and VS for 
LiftWing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029139 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [10:34:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2218.codfw.wmnet [10:35:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62061 and previous config saved to /var/cache/conftool/dbconfig/20240508-103531-root.json [10:35:53] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1028952|pager: Use SelectQueryBuilder::rawTables in IndexPager (T364428)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:35:57] (03CR) 10Muehlenhoff: "Ok, fair enough. If the state of the cluster is currently degraded anyway (and doesn't reflect what's in prod anyway), then this will simp" [puppet] - 10https://gerrit.wikimedia.org/r/1029128 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [10:36:00] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [10:36:19] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2396.codfw.wmnet with reason: host reimage [10:38:43] (03Merged) 10jenkins-bot: puppetdb: drop support for deprecated API v3 [software/cumin] - 10https://gerrit.wikimedia.org/r/954081 (owner: 10Volans) [10:38:50] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2401.codfw.wmnet with reason: host reimage [10:39:09] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2220.codfw.wmnet [10:39:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62062 and previous config saved to /var/cache/conftool/dbconfig/20240508-103922-root.json [10:40:05] (03PS1) 10Muehlenhoff: Switch db2220 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029152 
(https://phabricator.wikimedia.org/T349619) [10:40:19] PROBLEM - Check whether ferm is active by checking the default input chain on parse2016 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:41:31] 10ops-codfw, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364439#9780200 (10phaultfinder) [10:41:43] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2398.codfw.wmnet with reason: host reimage [10:42:28] (03CR) 10Muehlenhoff: [C:03+2] Switch db2220 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029152 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:42:50] (03PS1) 10Slyngshede: Tag vertical tables in CSS. [software/bitu] - 10https://gerrit.wikimedia.org/r/1029157 [10:43:51] PROBLEM - Check whether ferm is active by checking the default input chain on mw1357 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:44:03] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2402.codfw.wmnet with reason: host reimage [10:45:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T361627)', diff saved to https://phabricator.wikimedia.org/P62063 and previous config saved to /var/cache/conftool/dbconfig/20240508-104503-marostegui.json [10:45:07] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [10:45:26] 10ops-codfw, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364439#9780212 (10phaultfinder) [10:46:01] (03CR) 10Slyngshede: [C:03+2] Tag vertical tables in CSS. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1029157 (owner: 10Slyngshede) [10:46:51] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2399.codfw.wmnet with reason: host reimage [10:46:55] (03PS1) 10FNegri: P:toolforge:redis_sentinel: set redis timeout [puppet] - 10https://gerrit.wikimedia.org/r/1029158 (https://phabricator.wikimedia.org/T363709) [10:47:44] (03Merged) 10jenkins-bot: Tag vertical tables in CSS. [software/bitu] - 10https://gerrit.wikimedia.org/r/1029157 (owner: 10Slyngshede) [10:48:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [10:48:46] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1028952|pager: Use SelectQueryBuilder::rawTables in IndexPager (T364428)]] (duration: 15m 42s) [10:48:49] T364428: InvalidArgumentException: Wikimedia\Rdbms\JoinGroupBase::table: $table must be either string, JoinGroup or SelectQueryBuilder (via IndexPager) - https://phabricator.wikimedia.org/T364428 [10:49:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2220.codfw.wmnet [10:50:30] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2397.codfw.wmnet with reason: host reimage [10:50:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62064 and previous config saved to /var/cache/conftool/dbconfig/20240508-105039-root.json [10:53:01] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host snapshot1011.eqiad.wmnet [10:54:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1022 (re)pooling @ 50%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P62065 and previous config saved to /var/cache/conftool/dbconfig/20240508-105428-root.json [10:55:15] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2396.codfw.wmnet with OS bullseye [10:57:16] !log volans@cumin1002 START - Cookbook sre.puppet.renew-cert for sretest1001.eqiad.wmnet: Renew puppet certificate - volans@cumin1002 [10:57:36] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2401.codfw.wmnet with OS bullseye [10:57:48] (03PS2) 10Btullis: Move stats misc_jobs from stat1007 to stat1011 [puppet] - 10https://gerrit.wikimedia.org/r/1028866 (https://phabricator.wikimedia.org/T353785) [10:58:37] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Comms to msw-d2-codfw down - https://phabricator.wikimedia.org/T364464 (10cmooney) 03NEW p:05Triage→03High [10:58:41] (03PS1) 10Muehlenhoff: Switch snapshot1011 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029161 (https://phabricator.wikimedia.org/T349619) [10:59:52] !log volans@cumin1002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for sretest1001.eqiad.wmnet: Renew puppet certificate - volans@cumin1002 [11:00:05] mvolz: #bothumor I � Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240508T1100). 
[11:00:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P62066 and previous config saved to /var/cache/conftool/dbconfig/20240508-110010-marostegui.json [11:00:35] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2398.codfw.wmnet with OS bullseye [11:00:43] (03CR) 10FNegri: "I'm not really sure if 10 minutes is the best value here, but my reasoning is that if a client has a connection that remains idle for more" [puppet] - 10https://gerrit.wikimedia.org/r/1029158 (https://phabricator.wikimedia.org/T363709) (owner: 10FNegri) [11:00:48] (03CR) 10Btullis: [V:03+1 C:03+2] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2336/console" [puppet] - 10https://gerrit.wikimedia.org/r/1026964 (https://phabricator.wikimedia.org/T362181) (owner: 10Btullis) [11:01:22] (03CR) 10Muehlenhoff: [C:03+2] Switch snapshot1011 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029161 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:02:23] (03CR) 10Aklapper: "My goal is to allow easier testing of this downstream extension against an upstream instance instead of requiring a WMF code instance. 
My " [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1028887 (https://phabricator.wikimedia.org/T364426) (owner: 10Aklapper) [11:02:38] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2402.codfw.wmnet with OS bullseye [11:02:53] (03CR) 10Aklapper: Make Translations extension work with upstream Phorge (031 comment) [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1028887 (https://phabricator.wikimedia.org/T364426) (owner: 10Aklapper) [11:03:31] !log volans@cumin1002 START - Cookbook sre.puppet.renew-cert for sretest1002.eqiad.wmnet: Renew puppet certificate - volans@cumin1002 [11:03:46] 10SRE-tools, 10Cloud-VPS, 06Infrastructure-Foundations, 10Spicerack: spicerack.puppet.PuppetHostsError: Unable to find CSR fingerprints for all hosts, detected errors are: Another puppet instance is already running and the waitforlock setting is set to 0; e... - https://phabricator.wikimedia.org/T361218#9780281 [11:05:04] (03PS1) 10Btullis: Move stats misc_jobs from stat1007 to stat1011 [puppet] - 10https://gerrit.wikimedia.org/r/1029163 (https://phabricator.wikimedia.org/T353785) [11:05:41] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2399.codfw.wmnet with OS bullseye [11:05:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62067 and previous config saved to /var/cache/conftool/dbconfig/20240508-110545-root.json [11:06:18] !log volans@cumin1002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for sretest1002.eqiad.wmnet: Renew puppet certificate - volans@cumin1002 [11:06:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host snapshot1011.eqiad.wmnet [11:06:27] !log volans@cumin1002 START - Cookbook sre.puppet.renew-cert for sretest1003.eqiad.wmnet: Renew puppet certificate - volans@cumin1002 [11:06:28] (03CR) 
10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2339/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028866 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [11:07:26] (03Abandoned) 10Btullis: Move stats misc_jobs from stat1007 to stat1011 [puppet] - 10https://gerrit.wikimedia.org/r/1029163 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [11:07:52] (03PS1) 10Btullis: Move stats misc_jobs from stat1007 to stat1011 [puppet] - 10https://gerrit.wikimedia.org/r/1029164 (https://phabricator.wikimedia.org/T353785) [11:08:48] !log volans@cumin1002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for sretest1003.eqiad.wmnet: Renew puppet certificate - volans@cumin1002 [11:09:19] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2340/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028866 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [11:09:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62068 and previous config saved to /var/cache/conftool/dbconfig/20240508-110933-root.json [11:09:54] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2397.codfw.wmnet with OS bullseye [11:10:19] RECOVERY - Check whether ferm is active by checking the default input chain on parse2016 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:10:42] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host snapshot1015.eqiad.wmnet [11:10:55] (03CR) 10CI reject: [V:04-1] Move stats misc_jobs from stat1007 to stat1011 [puppet] - 10https://gerrit.wikimedia.org/r/1029164 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [11:11:49] 
(03PS1) 10Muehlenhoff: Switch snapshot1015 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029165 (https://phabricator.wikimedia.org/T349619) [11:12:44] (03CR) 10Muehlenhoff: [C:03+2] Switch snapshot1015 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029165 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:13:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:13:51] RECOVERY - Check whether ferm is active by checking the default input chain on mw1357 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:14:33] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Comms to msw-d2-codfw down - https://phabricator.wikimedia.org/T364464#9780328 (10cmooney) [11:15:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P62069 and previous config saved to /var/cache/conftool/dbconfig/20240508-111518-marostegui.json [11:16:03] (03PS3) 10Btullis: Move stats misc_jobs from stat1007 to stat1011 [puppet] - 10https://gerrit.wikimedia.org/r/1028866 (https://phabricator.wikimedia.org/T353785) [11:16:23] (03Abandoned) 10Btullis: Move stats misc_jobs from stat1007 to stat1011 [puppet] - 10https://gerrit.wikimedia.org/r/1029164 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [11:17:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host snapshot1015.eqiad.wmnet [11:18:11] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2341/co" [puppet] - 10https://gerrit.wikimedia.org/r/1028866 
(https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [11:20:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1177 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62070 and previous config saved to /var/cache/conftool/dbconfig/20240508-112054-root.json [11:24:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62071 and previous config saved to /var/cache/conftool/dbconfig/20240508-112439-root.json [11:26:47] RECOVERY - BGP status on cr2-magru is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:30:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T361627)', diff saved to https://phabricator.wikimedia.org/P62072 and previous config saved to /var/cache/conftool/dbconfig/20240508-113025-marostegui.json [11:30:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1203.eqiad.wmnet with reason: Maintenance [11:30:29] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [11:30:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1203.eqiad.wmnet with reason: Maintenance [11:30:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T361627)', diff saved to https://phabricator.wikimedia.org/P62073 and previous config saved to /var/cache/conftool/dbconfig/20240508-113048-marostegui.json [11:34:10] (03CR) 10Muehlenhoff: [C:03+2] mwlog: Apply python2 setting per role [puppet] - 10https://gerrit.wikimedia.org/r/1029114 (owner: 10Muehlenhoff) [11:37:37] !log running homer commit for new codfw appservers [11:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:37] (03PS1) 
10Jelto: gitlab: enable custom exporter on all instances [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) [11:45:39] (03PS1) 10Jelto: prometheus::ops: scrape custom gitlab exporter [puppet] - 10https://gerrit.wikimedia.org/r/1029169 (https://phabricator.wikimedia.org/T354656) [11:47:12] (03PS1) 10Zabe: beta: Set password hashing to argon2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029170 [11:47:32] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2343/console" [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [11:49:04] (03PS2) 10Jelto: gitlab: enable custom exporter on all instances [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) [11:49:12] (03CR) 10Zabe: [C:03+2] beta: Set password hashing to argon2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029170 (owner: 10Zabe) [11:50:18] (03Merged) 10jenkins-bot: beta: Set password hashing to argon2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029170 (owner: 10Zabe) [11:50:35] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2344/co" [puppet] - 10https://gerrit.wikimedia.org/r/1029168 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [11:54:59] (03PS2) 10Jelto: prometheus::ops: scrape custom gitlab exporter [puppet] - 10https://gerrit.wikimedia.org/r/1029169 (https://phabricator.wikimedia.org/T354656) [11:56:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T361627)', diff saved to https://phabricator.wikimedia.org/P62074 and previous config saved to /var/cache/conftool/dbconfig/20240508-115616-marostegui.json [11:56:20] T361627: Create cuc_agent_id, cule_agent_id and 
cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [11:57:29] taavi, hi 👋 I see you added yourself as reviewer in https://gerrit.wikimedia.org/r/c/operations/puppet/+/527915/5 . Would you be able to merge and/or deploy that one and the other three changes in the relation chain (very similar issues for other language codes)? [11:57:40] !log installing tomcat security updates [11:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:38] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2345/co" [puppet] - 10https://gerrit.wikimedia.org/r/1029169 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [12:02:45] Jhs: yeah, I'll try to either do that myself or find someone else that knows that area better than I do [12:06:53] (03PS1) 10Btullis: Drop the deprecated dumps fetcher that pulls from stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/1029176 (https://phabricator.wikimedia.org/T353785) [12:08:12] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1170.eqiad.wmnet [12:08:20] taavi, ty 👍 [12:09:05] (03PS1) 10Muehlenhoff: Switch db1170 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029177 (https://phabricator.wikimedia.org/T349619) [12:11:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P62075 and previous config saved to /var/cache/conftool/dbconfig/20240508-121123-marostegui.json [12:11:37] (03PS3) 10Klausman: admin_mg: Add Cassandra ServiceEntry and VS for LiftWing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029139 (https://phabricator.wikimedia.org/T360428) [12:12:00] (03CR) 10Klausman: "My concern with Connection pooling is that Cassandra connections are stateful, e.g. 
you issue `USE NAMESPACE foo` and then you don't need " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029139 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [12:16:28] !log hnowlan@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(mw2396.codfw.wmnet|mw2397.codfw.wmnet|mw2398.codfw.wmnet|mw2399.codfw.wmnet|mw2401.codfw.wmnet|mw2402.codfw.wmnet),cluster=kubernetes,service=kubesvc [12:16:46] (03CR) 10Muehlenhoff: [C:03+2] Switch db1170 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029177 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:21:15] (03CR) 10Muehlenhoff: [C:03+2] Druid: historical/middlemanager: New options for using firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1026822 (owner: 10Muehlenhoff) [12:22:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1170.eqiad.wmnet [12:22:45] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1174.eqiad.wmnet [12:25:41] (03PS1) 10Muehlenhoff: Switch db1174 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029179 (https://phabricator.wikimedia.org/T349619) [12:26:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P62076 and previous config saved to /var/cache/conftool/dbconfig/20240508-122631-marostegui.json [12:28:44] (03PS1) 10Muehlenhoff: Configure an-test-druid to use firewall::service compatible firewall settings [puppet] - 10https://gerrit.wikimedia.org/r/1029180 [12:31:59] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1029180 (owner: 10Muehlenhoff) [12:32:08] (03CR) 10Muehlenhoff: [C:03+2] Switch db1174 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029179 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:34:35] (03CR) 10Elukey: [C:03+1] "Sure sure I didn't check the use-case specific requirements, 
I was just bringing the idea up. Fine for me to avoid the DestinationRule!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029139 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [12:36:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1174.eqiad.wmnet [12:36:40] (03PS2) 10Muehlenhoff: Configure an-test-druid to use firewall::service compatible firewall settings [puppet] - 10https://gerrit.wikimedia.org/r/1029180 [12:36:41] (03CR) 10Elukey: "I can take care of the deployment-prep upgrade for T356412, I'll report when done :) Hopefully ETA this week, would it work?" [puppet] - 10https://gerrit.wikimedia.org/r/1029128 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [12:37:35] (03CR) 10Muehlenhoff: "Thanks! No rush at all, the final refactor isn't ready anyway." [puppet] - 10https://gerrit.wikimedia.org/r/1029128 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [12:37:50] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1029180 (owner: 10Muehlenhoff) [12:38:17] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1181.eqiad.wmnet [12:39:19] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:39:36] (03PS1) 10Muehlenhoff: Switch db1181 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029181 (https://phabricator.wikimedia.org/T349619) [12:41:37] (03PS4) 10Klausman: admin_ng: Add Cassandra ServiceEntry and VS for LiftWing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029139 (https://phabricator.wikimedia.org/T360428) [12:41:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T361627)', diff saved to https://phabricator.wikimedia.org/P62078 and previous config saved to 
/var/cache/conftool/dbconfig/20240508-124138-marostegui.json [12:41:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1211.eqiad.wmnet with reason: Maintenance [12:41:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1211.eqiad.wmnet with reason: Maintenance [12:42:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T361627)', diff saved to https://phabricator.wikimedia.org/P62079 and previous config saved to /var/cache/conftool/dbconfig/20240508-124201-marostegui.json [12:42:20] (03CR) 10Klausman: [C:03+2] admin_ng: Add Cassandra ServiceEntry and VS for LiftWing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029139 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [12:44:58] (03CR) 10Muehlenhoff: [C:03+2] Switch db1181 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029181 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:45:20] (03Merged) 10jenkins-bot: admin_ng: Add Cassandra ServiceEntry and VS for LiftWing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029139 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [12:45:51] 10ops-eqiad, 06SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9780609 (10andrea.denisse) >>! In T363660#9779915, @fgiunchedi wrote: >>>! In T363660#9775097, @andrea.denisse wrote: >> @fgiunchedi Good to know, thank you. Do you think we should do the syncing again to the ne... 
[12:48:37] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:48:43] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:49:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1181.eqiad.wmnet [12:51:13] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:51:47] (03PS1) 10Zabe: Use encrypted Argon2 Hashes to store user passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647) [12:52:28] (03CR) 10Elukey: [V:03+1 C:03+2] role::swift::proxy: move eqiad envoys to PKI TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/1028859 (https://phabricator.wikimedia.org/T356412) (owner: 10Elukey) [12:52:29] (03PS2) 10Zabe: Use encrypted Argon2 Hashes to store user passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647) [12:52:55] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Comms to msw-d2-codfw down - https://phabricator.wikimedia.org/T364464#9780633 (10Papaul) @cmooney I think this is just a human error issue. We were racking all the lsw1-d* yesterday and maybe we accidentally bumped into the cable. We will check o... [12:53:05] (03PS1) 10Muehlenhoff: Configure analytics Druid nodes to use firewall::service compatible firewall settings [puppet] - 10https://gerrit.wikimedia.org/r/1029184 [12:53:07] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 14 Jun 2024 01:28:50 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:53:26] (03CR) 10CI reject: [V:04-1] Configure analytics Druid nodes to use firewall::service compatible firewall settings [puppet] - 10https://gerrit.wikimedia.org/r/1029184 (owner: 10Muehlenhoff) [12:53:46] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1191.eqiad.wmnet [12:54:23] (03PS3) 10Zabe: Use encrypted Argon2 Hashes to store user passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647) [12:54:25] (03PS1) 10Muehlenhoff: Switch db1191 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029185 (https://phabricator.wikimedia.org/T349619) [12:56:13] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:56:46] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:57:32] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe1010.eqiad.wmnet [12:57:38] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:58:11] !log depool/deploy/repool every node in the range ms-fe10[10-14] to upgrade envoy to PKI TLS certs [12:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:53] (03PS1) 10Klausman: Revert "admin_ng: Add Cassandra ServiceEntry and VS for LiftWing" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028959 [12:59:19] (03CR) 10Klausman: [C:03+2] Revert "admin_ng: Add Cassandra ServiceEntry and VS for LiftWing" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028959 (owner: 10Klausman) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240508T1300). 
[13:00:05] DreamRimmer and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:31] I am around [13:00:31] i can deploy if folks are around [13:01:03] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 14 Jun 2024 01:28:50 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:01:10] o/ [13:01:43] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51925 bytes in 6.657 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:02:13] (03Merged) 10jenkins-bot: Revert "admin_ng: Add Cassandra ServiceEntry and VS for LiftWing" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028959 (owner: 10Klausman) [13:02:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:02:41] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.286 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:03:37] (03CR) 10Zabe: [C:03+2] Add tm: as alias to template: on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026604 (https://phabricator.wikimedia.org/T363757) (owner: 10Dreamrimmer) [13:04:28] (03Merged) 10jenkins-bot: Add tm: as alias to template: on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026604 (https://phabricator.wikimedia.org/T363757) (owner: 10Dreamrimmer) [13:05:11] (03PS12) 10Sohom Datta: [ruwiki] Limit the use of the ContentTranslation tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) (owner: 
10Dreamrimmer) [13:05:14] (03CR) 10Zabe: [C:03+2] [ruwiki] Limit the use of the ContentTranslation tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) (owner: 10Dreamrimmer) [13:05:50] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1010.eqiad.wmnet [13:05:58] (03Merged) 10jenkins-bot: [ruwiki] Limit the use of the ContentTranslation tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019390 (https://phabricator.wikimedia.org/T362440) (owner: 10Dreamrimmer) [13:06:45] !log zabe@deploy1002 Started scap: Backport for [[gerrit:1026604|Add tm: as alias to template: on English Wikipedia (T363757)]], [[gerrit:1019390|[ruwiki] Limit the use of the ContentTranslation tool (T362440)]] [13:06:52] T363757: Add tm: as alias to template: on English Wikipedia - https://phabricator.wikimedia.org/T363757 [13:06:53] T362440: Set wgContentTranslationPublishRequirements for Russian Wikipedia - https://phabricator.wikimedia.org/T362440 [13:07:07] (03PS2) 10Muehlenhoff: Configure analytics Druid nodes to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1029184 [13:07:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T361627)', diff saved to https://phabricator.wikimedia.org/P62080 and previous config saved to /var/cache/conftool/dbconfig/20240508-130727-marostegui.json [13:07:29] (03CR) 10Muehlenhoff: [C:03+2] Switch db1191 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029185 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:07:31] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [13:09:03] anzx: your patch is still marked as 'work in progress' [13:10:32] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe1011.eqiad.wmnet 
[13:10:56] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1029184 (owner: 10Muehlenhoff) [13:11:32] !log zabe@deploy1002 zabe and dreamrimmer: Backport for [[gerrit:1026604|Add tm: as alias to template: on English Wikipedia (T363757)]], [[gerrit:1019390|[ruwiki] Limit the use of the ContentTranslation tool (T362440)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:11:40] (03PS1) 10Anzx: pswiki: update wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028616 (https://phabricator.wikimedia.org/T360851) [13:12:09] DreamRimmer: do you know how to test patches on mwdebug? [13:12:25] yes [13:12:38] alright:) [13:12:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1191.eqiad.wmnet [13:12:51] could you please test your two patches then [13:13:01] zabe: marked it as active now [13:13:10] doing [13:13:12] (03PS1) 10Bartosz Dziewoński: Update wgCdnMaxAge value and documentation to match Varnish [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029187 [13:13:54] (03CR) 10Effie Mouzeli: [C:03+2] admin_ng/helmfile_istio-gateway: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli) [13:14:09] both working [13:14:42] cool [13:14:43] syncing [13:14:45] !log zabe@deploy1002 zabe and dreamrimmer: Continuing with sync [13:15:34] (03PS1) 10Muehlenhoff: Switch public Druid nodes to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1029188 [13:15:48] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1011.eqiad.wmnet [13:16:00] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1202.eqiad.wmnet [13:16:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1029188 (owner: 10Muehlenhoff) [13:16:47] (03Merged) 10jenkins-bot: 
admin_ng/helmfile_istio-gateway: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025296 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli) [13:16:48] (03PS3) 10Muehlenhoff: Configure analytics Druid nodes to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1029184 [13:17:34] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe1012.eqiad.wmnet [13:17:39] (03PS1) 10Muehlenhoff: Switch db1202 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029189 (https://phabricator.wikimedia.org/T349619) [13:17:45] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1029184 (owner: 10Muehlenhoff) [13:18:26] (03CR) 10Zabe: [C:04-1] pswiki: update wordmark (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1028616 (https://phabricator.wikimedia.org/T360851) (owner: 10Anzx) [13:18:45] (03PS2) 10Bartosz Dziewoński: Update wgCdnMaxAge value and documentation to match Varnish [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029187 [13:18:49] (03CR) 10Muehlenhoff: [C:03+2] Switch db1202 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029189 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:19:05] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1017 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:21:14] !log jiji@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[13:22:34] zabe: i will check again and schedule https://gerrit.wikimedia.org/r/1028616 patch for later [13:22:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P62081 and previous config saved to /var/cache/conftool/dbconfig/20240508-132235-marostegui.json [13:23:07] alright:) [13:23:10] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1012.eqiad.wmnet [13:23:15] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1062 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:23:25] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2030 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:25:08] (03PS1) 10Fabfur: fifo-log-demux: removed unused resources [puppet] - 10https://gerrit.wikimedia.org/r/1029191 (https://phabricator.wikimedia.org/T355905) [13:25:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1202.eqiad.wmnet [13:26:18] (03CR) 10CDanis: [C:03+1] "thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029187 (owner: 10Bartosz Dziewoński) [13:26:28] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1227.eqiad.wmnet [13:27:10] !log jiji@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [13:27:29] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe1013.eqiad.wmnet [13:27:30] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:27:43] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [13:27:53] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[13:28:02] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:28:21] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1026604|Add tm: as alias to template: on English Wikipedia (T363757)]], [[gerrit:1019390|[ruwiki] Limit the use of the ContentTranslation tool (T362440)]] (duration: 21m 36s) [13:28:27] T363757: Add tm: as alias to template: on English Wikipedia - https://phabricator.wikimedia.org/T363757 [13:28:27] T362440: Set wgContentTranslationPublishRequirements for Russian Wikipedia - https://phabricator.wikimedia.org/T362440 [13:29:33] (03PS1) 10Muehlenhoff: Switch db1227 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029192 (https://phabricator.wikimedia.org/T349619) [13:29:37] (03PS6) 10Dreamrimmer: Enable 'flood' user group at en.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019822 (https://phabricator.wikimedia.org/T351250) [13:29:39] (03CR) 10Zabe: [C:03+2] Enable 'flood' user group at en.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019822 (https://phabricator.wikimedia.org/T351250) (owner: 10Dreamrimmer) [13:30:33] (03Merged) 10jenkins-bot: Enable 'flood' user group at en.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019822 (https://phabricator.wikimedia.org/T351250) (owner: 10Dreamrimmer) [13:31:50] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1013.eqiad.wmnet [13:32:33] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Comms to msw-d2-codfw down - https://phabricator.wikimedia.org/T364464#9780842 (10cmooney) >>! In T364464#9780633, @Papaul wrote: > @cmooney I think this is just a human error issue. We were racking all the lsw1-d* yesterday and maybe we accidenta... 
[13:32:47] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe1014.eqiad.wmnet [13:33:16] (03CR) 10Muehlenhoff: [C:03+2] Switch db1227 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029192 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:33:28] (03PS7) 10Dreamrimmer: Remove wmgCollectionArticleNamespaces config for enWS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019423 (https://phabricator.wikimedia.org/T361422) [13:33:30] (03CR) 10Zabe: [C:03+2] Remove wmgCollectionArticleNamespaces config for enWS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019423 (https://phabricator.wikimedia.org/T361422) (owner: 10Dreamrimmer) [13:34:16] (03Merged) 10jenkins-bot: Remove wmgCollectionArticleNamespaces config for enWS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1019423 (https://phabricator.wikimedia.org/T361422) (owner: 10Dreamrimmer) [13:35:13] !log zabe@deploy1002 Started scap: Backport for [[gerrit:1019822|Enable 'flood' user group at en.wikiquote (T351250)]], [[gerrit:1019423|Remove wmgCollectionArticleNamespaces config for enWS (T361422)]] [13:35:18] T351250: Enable 'flood' user group at en.wikiquote - https://phabricator.wikimedia.org/T351250 [13:35:18] T361422: Remove wmgCollectionArticleNamespaces config for enWS - https://phabricator.wikimedia.org/T361422 [13:35:47] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1014.eqiad.wmnet [13:37:18] (03PS4) 10Muehlenhoff: Configure analytics Druid nodes to use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1029184 [13:37:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P62082 and previous config saved to /var/cache/conftool/dbconfig/20240508-133742-marostegui.json [13:37:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1227.eqiad.wmnet [13:37:54] !log 
zabe@deploy1002 zabe and dreamrimmer: Backport for [[gerrit:1019822|Enable 'flood' user group at en.wikiquote (T351250)]], [[gerrit:1019423|Remove wmgCollectionArticleNamespaces config for enWS (T361422)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:38:02] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1236.eqiad.wmnet [13:38:17] DreamRimmer: can you test?:) [13:38:25] doing [13:38:32] !log jiji@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [13:39:05] (03PS1) 10Muehlenhoff: Switch db1236 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029195 (https://phabricator.wikimedia.org/T349619) [13:39:08] !log jiji@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [13:39:50] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1029184 (owner: 10Muehlenhoff) [13:40:06] looks good [13:40:09] Pseudobots? [13:40:29] (03CR) 10Jforrester: [C:03+1] "Great idea." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029187 (owner: 10Bartosz Dziewoński) [13:41:31] yeah that comes from https://gerrit.wikimedia.org/g/mediawiki/extensions/WikimediaMessages/+/31330d5871d013f4e461deea99236c46de980555/i18n/wikimedia/en.json#82 [13:41:33] !log uploaded tcp-mss-clamper 0.5 (bullseye|bookworm)-wikimedia (apt.wm.o) [13:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:43] (03CR) 10Muehlenhoff: [C:03+2] Switch db1236 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1029195 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:41:43] !log zabe@deploy1002 zabe and dreamrimmer: Continuing with sync [13:42:19] working... [13:42:38] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9780865 (10andrea.denisse) >>! In T360414#9780145, @elukey wrote: >>>! 
In T360414#9779961, @fgiunchedi wrote: >> Also cc {T356412} and @e... [13:43:31] !log update to tcp-mss-clamper 0.5 on ncredir6001 [13:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:36] (03PS3) 10Gergő Tisza: Update wgCdnMaxAge value and documentation to match Varnish [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029187 (owner: 10Bartosz Dziewoński) [13:45:09] (03CR) 10Gergő Tisza: "Changed to permalinks in case someone uses this to find the relevant configuration entries in the future." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029187 (owner: 10Bartosz Dziewoński) [13:45:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1236.eqiad.wmnet [13:45:38] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [13:46:29] PROBLEM - Check whether ferm is active by checking the default input chain on parse2015 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:46:41] RECOVERY - Host ps1-d2-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.07 ms [13:47:02] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [13:47:23] RECOVERY - Host asw-d-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.99 ms [13:47:35] PROBLEM - Check whether ferm is active by checking the default input chain on parse2008 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:49:05] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1017 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:51:44] Maybe we will need to ask en.wikiquote to update their messages [13:52:11] !log installing Java 17 security updates [13:52:12] sure, they can do that 
[13:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T361627)', diff saved to https://phabricator.wikimedia.org/P62083 and previous config saved to /var/cache/conftool/dbconfig/20240508-135250-marostegui.json [13:52:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1214.eqiad.wmnet with reason: Maintenance [13:52:55] it's their decision how the group should be named, but until then it will have the default name defined in wikimediamessages [13:52:55] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [13:53:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1214.eqiad.wmnet with reason: Maintenance [13:53:15] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1062 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:53:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T361627)', diff saved to https://phabricator.wikimedia.org/P62084 and previous config saved to /var/cache/conftool/dbconfig/20240508-135314-marostegui.json [13:53:25] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2030 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:54:06] (03CR) 10MVernon: [C:04-1] "> I can take care of the deployment-prep upgrade for T356412, I'll report when done :)" [puppet] - 10https://gerrit.wikimedia.org/r/1029128 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff) [13:54:08] (03PS1) 10Andrew Bogott: puppet-git-sync-upstream: run as 'gitpuppet' user [puppet] - 10https://gerrit.wikimedia.org/r/1029198 
(https://phabricator.wikimedia.org/T364047) [13:54:35] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1019822|Enable 'flood' user group at en.wikiquote (T351250)]], [[gerrit:1019423|Remove wmgCollectionArticleNamespaces config for enWS (T361422)]] (duration: 19m 22s) [13:54:44] T351250: Enable 'flood' user group at en.wikiquote - https://phabricator.wikimedia.org/T351250 [13:54:45] T361422: Remove wmgCollectionArticleNamespaces config for enWS - https://phabricator.wikimedia.org/T361422 [13:54:53] (03PS1) 10Muehlenhoff: failover idp to idp1003 [dns] - 10https://gerrit.wikimedia.org/r/1029199 [13:55:05] (03CR) 10Majavah: [C:04-1] "The script needs root to write to the Prometheus node-exporter directory, and switches to `gitpuppet` when needed." [puppet] - 10https://gerrit.wikimedia.org/r/1029198 (https://phabricator.wikimedia.org/T364047) (owner: 10Andrew Bogott) [13:55:25] DreamRimmer: should be all live [13:55:35] (03CR) 10Vgutierrez: "let's make the prometheus configuration optional so we can merge this and enable it on the hosts that we will use for testing fifo-log-dem" [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [13:55:35] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[13:55:52] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9780908 (10MoritzMuehlenhoff) [13:55:55] (03PS13) 10Andrea Denisse: thanos: Provision Thanos frontend TLS certificates with CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) [13:56:42] (03CR) 10Andrea Denisse: thanos: Provision Thanos frontend TLS certificates with CFSSL (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [13:57:26] zabe: yep [13:57:30] (03CR) 10Elukey: "The code looks good to me, what I am wondering is if all clients have the right ca-bundle (namely containing the root PKI CA cert). For ex" [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [13:57:33] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:57:45] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:58:06] thank you :) [13:58:36] yw [13:58:57] PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:59:00] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[13:59:24] (03PS1) 10Btullis: Initial import of ceph-csi-rbd chart for inspection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028931 (https://phabricator.wikimedia.org/T364472) [13:59:33] PROBLEM - Host ps1-d2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:59:34] (03PS1) 10Btullis: Add WMF annotations to the imported ceph-csi-rbd plugin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028932 (https://phabricator.wikimedia.org/T364472) [14:00:04] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (NOOP 8 CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [14:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240508T1400) [14:00:42] (03CR) 10Andrea Denisse: [V:03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/1028546/2347/" [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [14:00:45] (03CR) 10Slyngshede: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1029199 (owner: 10Muehlenhoff) [14:00:55] (03CR) 10Andrea Denisse: [V:03+1] thanos: Provision Thanos frontend TLS certificates with CFSSL (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [14:01:01] (03CR) 10Muehlenhoff: [C:03+2] failover idp to idp1003 [dns] - 10https://gerrit.wikimedia.org/r/1029199 (owner: 10Muehlenhoff) [14:01:54] (03PS10) 10Vgutierrez: fifo_log_demux: add new parameters for 0.7.3 release [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [14:02:03] RECOVERY - Host ps1-d2-codfw is UP: PING WARNING - Packet loss = 90%, RTA = 31.03 ms [14:02:14] (03CR) 10CI reject: [V:04-1] 
fifo_log_demux: add new parameters for 0.7.3 release [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [14:03:46] (03PS1) 10Zabe: Move wgGroupsAddToSelf and wgGroupsRemoveFromSelf to core-Permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029200 [14:03:50] (03PS11) 10Vgutierrez: fifo_log_demux: add new parameters for 0.7.3 release [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [14:04:41] PROBLEM - Host ps1-d2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:05:15] RECOVERY - Host ps1-d2-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.74 ms [14:05:31] RECOVERY - Host asw-d-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.93 ms [14:07:29] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [14:08:52] (03PS12) 10Vgutierrez: fifo_log_demux: add new parameters for 0.7.3 release [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [14:08:54] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:09:29] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[14:10:23] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [14:10:32] (03PS1) 10Zabe: beta: Test encrypted Argon2 password hashes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029201 [14:11:50] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Comms to msw-d2-codfw down - https://phabricator.wikimedia.org/T364464#9780928 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm port 47 on the maw was going up and down on it's own. replaced the rj-45 terminator. remained steady. [14:12:26] (03CR) 10Fabfur: fifo_log_demux: add new parameters for 0.7.3 release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [14:13:03] (03PS4) 10Zabe: Use encrypted Argon2 Hashes to store user passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647) [14:13:56] !log jiji@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [14:14:09] 10ops-codfw, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364439#9780938 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm uplink for msw2 was degraded and flapping. repaired. staying up now. [14:14:10] !log jiji@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [14:15:13] !log jiji@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [14:15:27] !log jiji@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[14:16:29] RECOVERY - Check whether ferm is active by checking the default input chain on parse2015 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:17:00] !log installing libgd2 security updates [14:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:35] RECOVERY - Check whether ferm is active by checking the default input chain on parse2008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:17:55] !log jiji@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:18:45] !log jiji@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:20:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T361627)', diff saved to https://phabricator.wikimedia.org/P62086 and previous config saved to /var/cache/conftool/dbconfig/20240508-142045-marostegui.json [14:20:49] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [14:21:37] FIRING: [2x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:22:49] !log jiji@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [14:23:38] !log jiji@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[14:25:11] (03PS13) 10Vgutierrez: fifo_log_demux,ATS: Support fifo-log-demux 0.7.3 [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [14:25:35] (03PS1) 10JMeybohm: Add CertProvider to hot reload TLS certs for gRPC service [software/envoyproxy/ratelimiter] - 10https://gerrit.wikimedia.org/r/1029205 [14:26:01] (03CR) 10Vgutierrez: fifo_log_demux,ATS: Support fifo-log-demux 0.7.3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [14:26:13] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Add node20 production image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1026873 (https://phabricator.wikimedia.org/T362681) (owner: 10Muehlenhoff) [14:26:48] (03CR) 10Fabfur: "Ok for me but I'd like to leave the -2 to be sure no-one merges that" [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [14:26:48] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 1 DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [14:27:40] (03PS1) 10Eevans: Reimage aqs1013 w/o preserving data [puppet] - 10https://gerrit.wikimedia.org/r/1029206 (https://phabricator.wikimedia.org/T364422) [14:28:21] (03CR) 10Vgutierrez: "cp4051 change added for testing purpose with pcc, it will be dropped before merging" [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [14:28:33] (03CR) 10Vgutierrez: [V:03+1] fifo_log_demux,ATS: Support fifo-log-demux 0.7.3 [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [14:29:07] (03CR) 10Zabe: [C:03+2] beta: Test encrypted Argon2 password hashes [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1029201 (owner: 10Zabe) [14:29:52] (03Merged) 10jenkins-bot: beta: Test encrypted Argon2 password hashes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029201 (owner: 10Zabe) [14:32:38] (03PS2) 10JMeybohm: Add CertProvider to hot reload TLS certs for gRPC service [software/envoyproxy/ratelimiter] - 10https://gerrit.wikimedia.org/r/1029205 [14:35:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P62087 and previous config saved to /var/cache/conftool/dbconfig/20240508-143552-marostegui.json [14:35:56] (03CR) 10Gergő Tisza: Use encrypted Argon2 Hashes to store user passwords (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647) (owner: 10Zabe) [14:36:28] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:34] (03CR) 10Fabfur: [C:03+1] "thanks for the new PS, ok for me!" [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [14:38:11] !log installing Java 11 security updates [14:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:49] (03PS3) 10JMeybohm: Add CertProvider to hot reload TLS certs for gRPC service [software/envoyproxy/ratelimiter] - 10https://gerrit.wikimedia.org/r/1029205 (https://phabricator.wikimedia.org/T362310) [14:39:26] 10ops-eqiad, 06SRE, 06DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9781008 (10Jclark-ctr) Replaced Backplane : cable that connects raid card<-> backplane / power control board. I did find a cable with a loose pin on the power control board (not replaced) but will be reaching out to Dell r... 
[14:40:21] (03PS14) 10Vgutierrez: fifo_log_demux,ATS: Support fifo-log-demux 0.7.3 [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [14:40:32] (03PS4) 10JMeybohm: Add CertProvider to hot reload TLS certs for gRPC service [software/envoyproxy/ratelimiter] - 10https://gerrit.wikimedia.org/r/1029205 (https://phabricator.wikimedia.org/T362310) [14:41:06] (03PS15) 10Vgutierrez: fifo_log_demux,ATS: Support fifo-log-demux 0.7.3 [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [14:41:07] 10ops-eqiad, 06SRE, 06DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9781016 (10Marostegui) You believe it is all good for us now to start getting this host back to production or you still want to test something else? [14:43:18] 10ops-eqiad, 06SRE, 06DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9781018 (10Jclark-ctr) I am powering it up now and will check idrac. 
[14:43:42] (03PS1) 10Volans: Drop Python support for 3.7, 3.8, add 3.11 [software/cumin] - 10https://gerrit.wikimedia.org/r/1029209 [14:43:42] (03PS1) 10Volans: Use importlib.metadata instead of pkg_resources [software/cumin] - 10https://gerrit.wikimedia.org/r/1029210 [14:44:18] (03CR) 10Fabfur: [C:03+2] fifo_log_demux,ATS: Support fifo-log-demux 0.7.3 [puppet] - 10https://gerrit.wikimedia.org/r/1028807 (https://phabricator.wikimedia.org/T364383) (owner: 10Fabfur) [14:44:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [14:44:44] !log installing Java 8 security updates [14:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:54] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [14:45:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1157 (T352010)', diff saved to https://phabricator.wikimedia.org/P62088 and previous config saved to /var/cache/conftool/dbconfig/20240508-144501-ladsgroup.json [14:45:05] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:46:24] 10ops-eqiad, 06SRE, 06DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9781027 (10Marostegui) Excellent thank you! 
[14:46:30] (03CR) 10CI reject: [V:04-1] Drop Python support for 3.7, 3.8, add 3.11 [software/cumin] - 10https://gerrit.wikimedia.org/r/1029209 (owner: 10Volans) [14:46:41] (03CR) 10CI reject: [V:04-1] Use importlib.metadata instead of pkg_resources [software/cumin] - 10https://gerrit.wikimedia.org/r/1029210 (owner: 10Volans) [14:51:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P62089 and previous config saved to /var/cache/conftool/dbconfig/20240508-145100-marostegui.json [14:51:46] (03CR) 10Gergő Tisza: Use encrypted Argon2 Hashes to store user passwords (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647) (owner: 10Zabe) [14:56:17] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9781043 (10Jhancock.wm) [14:56:39] (03CR) 10Xcollazo: [C:03+1] "LGTM from dumps point of view, as they are inactive, and current data would still be available." [puppet] - 10https://gerrit.wikimedia.org/r/1029176 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [14:58:20] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [14:58:32] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. 
[14:59:07] (03PS1) 10Vgutierrez: hiera: Set prometheus port on fifo-log-demux@cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/1029211 (https://phabricator.wikimedia.org/T364383) [14:59:51] (03CR) 10Scott French: [C:03+2] api-gateway: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028605 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [15:00:13] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:31] (03CR) 10Fabfur: [C:03+1] hiera: Set prometheus port on fifo-log-demux@cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/1029211 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [15:00:47] (03PS1) 10Addshore: Ignore mediawiki/tools/cli for gerrit replication [puppet] - 10https://gerrit.wikimedia.org/r/1029212 (https://phabricator.wikimedia.org/T333029) [15:00:50] (03Merged) 10jenkins-bot: api-gateway: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028605 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [15:01:22] (03PS1) 10Vgutierrez: hiera: Set prometheus port on fifo-log-demux@cp4044 [puppet] - 10https://gerrit.wikimedia.org/r/1029213 (https://phabricator.wikimedia.org/T364383) [15:02:26] (03CR) 10Fabfur: [C:03+1] hiera: Set prometheus port on fifo-log-demux@cp4044 [puppet] - 10https://gerrit.wikimedia.org/r/1029213 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [15:02:50] 06SRE, 06Infrastructure-Foundations, 10netops: Extend BGP peer automation via Netbox to include VMs - https://phabricator.wikimedia.org/T364480 (10cmooney) 03NEW p:05Triage→03Medium [15:02:51] 10ops-eqiad, 06SRE, 06DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9781087 (10Jclark-ctr) I 
believe we are good to reimage server OS looks corrupt. if you could just wait till tomorrow to put back in production while i wait for Dell to respond if they will send out new cable. [15:05:09] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply [15:05:10] (03CR) 10Elukey: [C:03+1] Reimage aqs1013 w/o preserving data [puppet] - 10https://gerrit.wikimedia.org/r/1029206 (https://phabricator.wikimedia.org/T364422) (owner: 10Eevans) [15:05:22] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [15:06:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T361627)', diff saved to https://phabricator.wikimedia.org/P62090 and previous config saved to /var/cache/conftool/dbconfig/20240508-150611-marostegui.json [15:06:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1216.eqiad.wmnet with reason: Maintenance [15:06:17] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [15:06:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1216.eqiad.wmnet with reason: Maintenance [15:06:51] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:07:05] (03CR) 10Vgutierrez: [C:03+2] hiera: Set prometheus port on fifo-log-demux@cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/1029211 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [15:07:25] (03PS1) 10Santiago Faci: Bumping mpic version for a new release v0.0.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029214 [15:08:04] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [15:08:13] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:09:34] !log 
swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [15:12:17] (03PS5) 10Zabe: Use encrypted Argon2 Hashes to store user passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647) [15:12:47] (03CR) 10Zabe: Use encrypted Argon2 Hashes to store user passwords (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647) (owner: 10Zabe) [15:12:56] (03Abandoned) 10Cathal Mooney: Add VM BGP for esams/drmrs/magru back to YAML for now [homer/public] - 10https://gerrit.wikimedia.org/r/1026956 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [15:13:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:14:52] (03PS1) 10Zabe: beta: Switch back to pbkdf2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029215 [15:15:06] (03CR) 10Zabe: [C:03+2] beta: Switch back to pbkdf2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029215 (owner: 10Zabe) [15:15:52] (03Merged) 10jenkins-bot: beta: Switch back to pbkdf2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029215 (owner: 10Zabe) [15:16:38] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [15:17:10] (03CR) 10Effie Mouzeli: [C:03+1] "I would also suggest setting Compress=no, since in the past this has caused problems. 
I attempted to sort this mess out in I079d4159f1082e" [puppet] - 10https://gerrit.wikimedia.org/r/1028565 (owner: 10Ahmon Dancy) [15:17:29] (03PS2) 10Btullis: Drop the deprecated dumps fetcher that pulls from stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/1029176 (https://phabricator.wikimedia.org/T353785) [15:17:39] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [15:20:16] (03CR) 10Ottomata: EventStreamConfig: Add webrequest.frontend.v1. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026506 (https://phabricator.wikimedia.org/T314956) (owner: 10Gmodena) [15:20:29] (03CR) 10Zabe: "The reason I would like to do this now, is because I want to do T112359 and it would make sense for this to happen first." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647) (owner: 10Zabe) [15:21:13] (03CR) 10Pppery: "It looks like I failed to commit the changes to .phutil_module_cache when I rebuilt the library map after deleting the files added by http" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1022053 (https://phabricator.wikimedia.org/T318763) (owner: 10Pppery) [15:21:22] !log bump apt package gitlab-ce to 16.9.7-ce.0 [15:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:35] (03CR) 10Ottomata: EventStreamConfig: Add webrequest.frontend.v1. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026506 (https://phabricator.wikimedia.org/T314956) (owner: 10Gmodena) [15:21:41] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [15:22:03] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [15:22:23] (03CR) 10Lucas Werkmeister (WMDE): "The deployment calendar for next week doesn’t exist yet, but I think we can deploy this on Monday or so. 
(I missed the backport window tod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027194 (https://phabricator.wikimedia.org/T364228) (owner: 10Lucas Werkmeister (WMDE)) [15:23:36] (03CR) 10Btullis: Drop the deprecated dumps fetcher that pulls from stat1007 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1029176 (https://phabricator.wikimedia.org/T353785) (owner: 10Btullis) [15:24:01] (03PS1) 10Vgutierrez: Revert "fifo_log_demux,ATS: Support fifo-log-demux 0.7.3" [puppet] - 10https://gerrit.wikimedia.org/r/1028961 [15:25:38] (03PS1) 10Vgutierrez: Revert "hiera: Set prometheus port on fifo-log-demux@cp4052" [puppet] - 10https://gerrit.wikimedia.org/r/1028962 [15:25:48] (03PS1) 10Elukey: Release new version [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1029218 (https://phabricator.wikimedia.org/T362984) [15:26:02] (03CR) 10Clare Ming: [C:03+2] Bumping mpic version for a new release v0.0.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029214 (owner: 10Santiago Faci) [15:26:43] (03CR) 10Vgutierrez: [C:03+2] Revert "hiera: Set prometheus port on fifo-log-demux@cp4052" [puppet] - 10https://gerrit.wikimedia.org/r/1028962 (owner: 10Vgutierrez) [15:26:57] (03Merged) 10jenkins-bot: Bumping mpic version for a new release v0.0.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029214 (owner: 10Santiago Faci) [15:26:59] (03CR) 10Elukey: "The new pristine/upstream tarballs/code have already been imported via gpb and pushed to the related branches. 
I have built the package on" [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1029218 (https://phabricator.wikimedia.org/T362984) (owner: 10Elukey) [15:28:09] (03CR) 10Vgutierrez: [C:03+2] Revert "fifo_log_demux,ATS: Support fifo-log-demux 0.7.3" [puppet] - 10https://gerrit.wikimedia.org/r/1028961 (owner: 10Vgutierrez) [15:28:20] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [15:28:36] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [15:32:51] 10ops-eqiad, 06SRE, 06DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9781202 (10Marostegui) Absolutely - just close this task once your part is done and we will take it from there [15:35:23] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [15:35:50] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [15:37:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1226.eqiad.wmnet with reason: Maintenance [15:37:24] (03PS6) 10Pppery: Phabricator: Delete chatlog group [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1022053 (https://phabricator.wikimedia.org/T318763) [15:37:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1226.eqiad.wmnet with reason: Maintenance [15:37:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T361627)', diff saved to https://phabricator.wikimedia.org/P62091 and previous config saved to /var/cache/conftool/dbconfig/20240508-153738-marostegui.json [15:37:43] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [15:38:39] !log imported tomcat9 9.0.43-2~deb11u10+wmf12u1 to component/tomcat9 for bookworm-wikimedia (rebasing our forward 
port to the latest security update) [15:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:55] (03PS1) 10Btullis: Move dumps::generation::worker::dumper_misc_crons_only role [puppet] - 10https://gerrit.wikimedia.org/r/1029220 (https://phabricator.wikimedia.org/T325228) [15:42:35] (03CR) 10BCornwall: [C:03+1] "Thanks for remembering that for me...." [puppet] - 10https://gerrit.wikimedia.org/r/1029191 (https://phabricator.wikimedia.org/T355905) (owner: 10Fabfur) [15:43:32] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2351/co" [puppet] - 10https://gerrit.wikimedia.org/r/1029220 (https://phabricator.wikimedia.org/T325228) (owner: 10Btullis) [15:45:53] (03CR) 10BCornwall: [V:03+1 C:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2352/co" [puppet] - 10https://gerrit.wikimedia.org/r/1029191 (https://phabricator.wikimedia.org/T355905) (owner: 10Fabfur) [15:50:35] !log tested fifo-log-demux 0.7.3 on cp4052, downgraded to 0.6.5 [15:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:39] (03PS2) 10Klausman: ml-services: further tune autoscaling for editquality-reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029222 (https://phabricator.wikimedia.org/T363336) [15:52:47] (03CR) 10Jforrester: [C:03+1] "Happy to sling it out tomorrow." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027194 (https://phabricator.wikimedia.org/T364228) (owner: 10Lucas Werkmeister (WMDE)) [15:54:03] (03CR) 10Elukey: [C:03+1] ml-services: further tune autoscaling for editquality-reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029222 (https://phabricator.wikimedia.org/T363336) (owner: 10Klausman) [15:54:16] (03CR) 10Klausman: [C:03+2] ml-services: further tune autoscaling for editquality-reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029222 (https://phabricator.wikimedia.org/T363336) (owner: 10Klausman) [15:55:11] (03Merged) 10jenkins-bot: ml-services: further tune autoscaling for editquality-reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029222 (https://phabricator.wikimedia.org/T363336) (owner: 10Klausman) [15:56:25] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [15:57:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T361627)', diff saved to https://phabricator.wikimedia.org/P62092 and previous config saved to /var/cache/conftool/dbconfig/20240508-155757-marostegui.json [15:58:01] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [15:58:16] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[15:58:58] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [15:59:59] (03PS1) 10Cathal Mooney: Increase timeout for Netbox Capirca script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1029226 [16:01:04] (03CR) 10JMeybohm: [C:03+1] "LGTM, but benthos also requires an update for https://phabricator.wikimedia.org/T359423 - might be worth it to bundle them" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028910 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [16:02:55] (03CR) 10JMeybohm: [C:03+1] benthos: add securityContext to all containers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028910 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [16:03:09] !log jelto@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [16:03:56] (03CR) 10JMeybohm: [C:03+1] blubberoid: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028911 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [16:05:18] (03CR) 10Btullis: [V:03+1] "The list of misc dumps that are included is:" [puppet] - 10https://gerrit.wikimedia.org/r/1029220 (https://phabricator.wikimedia.org/T325228) (owner: 10Btullis) [16:06:16] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[16:13:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P62093 and previous config saved to /var/cache/conftool/dbconfig/20240508-161305-marostegui.json [16:21:13] !log Deploying refinery [16:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:32] (03CR) 10Xcollazo: [C:03+1] "LGTM, however, let's wait till `snapshot1008` is idle." [puppet] - 10https://gerrit.wikimedia.org/r/1029220 (https://phabricator.wikimedia.org/T325228) (owner: 10Btullis) [16:21:46] !log sfaci@deploy1002 Started deploy [analytics/refinery@1c45ef4]: Regular analytics weekly train [analytics/refinery@1c45ef4d] [16:22:24] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.06 - 2024.05.26), 13Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9781322 (10BTullis) I have created https://gerrit.wikimedia.org/r/c/operations/puppet/+/102922... [16:23:07] (03CR) 10Btullis: [V:03+1] "Great, thanks. I will check back in about a week." 
[puppet] - 10https://gerrit.wikimedia.org/r/1029220 (https://phabricator.wikimedia.org/T325228) (owner: 10Btullis) [16:24:21] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [16:25:11] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [16:25:13] FIRING: [4x] JobUnavailable: Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:25:40] (03Abandoned) 10Andrea Denisse: thanos: Provision Thanos frontend TLS certificates with CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1028546 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [16:28:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P62094 and previous config saved to /var/cache/conftool/dbconfig/20240508-162812-marostegui.json [16:28:18] (03PS2) 10Andrea Denisse: thanos: Update TLS certificate in Envoy config to match CFSSL provisioning [puppet] - 10https://gerrit.wikimedia.org/r/1028876 (https://phabricator.wikimedia.org/T360414) [16:29:21] (03PS1) 10Cathal Mooney: Support VM BGP automation using Netbox flag for L3 POPs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1029231 (https://phabricator.wikimedia.org/T364480) [16:29:52] (03PS5) 10JMeybohm: Add CertProvider to hot reload TLS certs for gRPC service [software/envoyproxy/ratelimiter] - 10https://gerrit.wikimedia.org/r/1029205 (https://phabricator.wikimedia.org/T362310) [16:31:52] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (NOOP 16): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2353/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1028876 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) 
[16:33:24] (03CR) 10Andrea Denisse: [V:03+1] "PCC results show a NOOP, but I think they should generate the cert again with the new name. Please let me know what you think." [puppet] - 10https://gerrit.wikimedia.org/r/1028876 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [16:33:43] (03CR) 10Andrea Denisse: [V:03+1] thanos: Update TLS certificate in Envoy config to match CFSSL provisioning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1028876 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [16:38:06] (03PS1) 10Santiago Faci: Bumping mpic version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029234 [16:38:24] !log sfaci@deploy1002 Finished deploy [analytics/refinery@1c45ef4]: Regular analytics weekly train [analytics/refinery@1c45ef4d] (duration: 16m 37s) [16:38:47] !log sfaci@deploy1002 Started deploy [analytics/refinery@1c45ef4] (thin): Regular analytics weekly train THIN [analytics/refinery@1c45ef4d] [16:42:28] (03CR) 10Santiago Faci: [C:03+2] Bumping mpic version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029234 (owner: 10Santiago Faci) [16:42:40] !log sfaci@deploy1002 Finished deploy [analytics/refinery@1c45ef4] (thin): Regular analytics weekly train THIN [analytics/refinery@1c45ef4d] (duration: 03m 53s) [16:43:03] !log sfaci@deploy1002 Started deploy [analytics/refinery@1c45ef4] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@1c45ef4d] [16:43:21] (03Merged) 10jenkins-bot: Bumping mpic version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029234 (owner: 10Santiago Faci) [16:43:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T361627)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240508-164322-marostegui.json [16:43:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance 
[16:43:35] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [16:43:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [16:44:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:45:39] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [16:45:54] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [16:45:55] !log sfaci@deploy1002 Finished deploy [analytics/refinery@1c45ef4] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@1c45ef4d] (duration: 02m 52s) [16:46:41] (03PS3) 10Ahmon Dancy: coredump.conf: Remove misconfigured KeepFree setting [puppet] - 10https://gerrit.wikimedia.org/r/1028565 [16:46:41] (03PS1) 10Ahmon Dancy: coredump.conf: Disable compression [puppet] - 10https://gerrit.wikimedia.org/r/1029235 (https://phabricator.wikimedia.org/T236253) [16:47:37] (03CR) 10Ahmon Dancy: "Done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1029235 to keep this change focused." 
[puppet] - 10https://gerrit.wikimedia.org/r/1028565 (owner: 10Ahmon Dancy) [16:49:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:50:08] (03CR) 10CI reject: [V:04-1] coredump.conf: Disable compression [puppet] - 10https://gerrit.wikimedia.org/r/1029235 (https://phabricator.wikimedia.org/T236253) (owner: 10Ahmon Dancy) [16:50:10] (03CR) 10CI reject: [V:04-1] coredump.conf: Remove misconfigured KeepFree setting [puppet] - 10https://gerrit.wikimedia.org/r/1028565 (owner: 10Ahmon Dancy) [16:51:02] gah [16:52:04] (03PS4) 10Ahmon Dancy: coredump.conf: Remove misconfigured KeepFree setting [puppet] - 10https://gerrit.wikimedia.org/r/1028565 [16:52:04] (03PS2) 10Ahmon Dancy: coredump.conf: Disable compression [puppet] - 10https://gerrit.wikimedia.org/r/1029235 (https://phabricator.wikimedia.org/T236253) [16:53:46] (03CR) 10Scott French: [C:03+2] confd: confd-lint-wrap ignores positional args separator [puppet] - 10https://gerrit.wikimedia.org/r/1028897 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [16:55:29] (03CR) 10CI reject: [V:04-1] coredump.conf: Remove misconfigured KeepFree setting [puppet] - 10https://gerrit.wikimedia.org/r/1028565 (owner: 10Ahmon Dancy) [16:55:33] (03CR) 10CI reject: [V:04-1] coredump.conf: Disable compression [puppet] - 10https://gerrit.wikimedia.org/r/1029235 (https://phabricator.wikimedia.org/T236253) (owner: 10Ahmon Dancy) [16:55:49] 😢 [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240508T1700) [17:03:06] (03CR) 10Ahmon Dancy: "query_service::deploy::manual is still referenced by modules/query_service/manifests/common.pp and modules/query_service/spec/classes/quer" 
[puppet] - 10https://gerrit.wikimedia.org/r/1028763 (https://phabricator.wikimedia.org/T316876) (owner: 10Muehlenhoff) [17:03:33] !log sfaci@deploy1002 Started deploy [airflow-dags/analytics@1f72038]: (no justification provided) [17:04:02] !log sfaci@deploy1002 Finished deploy [airflow-dags/analytics@1f72038]: (no justification provided) (duration: 00m 29s) [17:09:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1209.eqiad.wmnet with reason: Maintenance [17:09:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1209.eqiad.wmnet with reason: Maintenance [17:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:12:32] (03CR) 10Volans: [C:03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1029226 (owner: 10Cathal Mooney) [17:16:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:16:57] (03PS1) 10Ladsgroup: FlaggedRevsStats: Fix migration to query builder [extensions/FlaggedRevs] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1028964 [17:24:34] (03CR) 10CI reject: [V:04-1] FlaggedRevsStats: Fix migration to query builder [extensions/FlaggedRevs] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1028964 (owner: 10Ladsgroup) [17:27:56] (03PS1) 10Dreamrimmer: quwiki: Set MetaNamespaceName to Wikipidiya 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029237 (https://phabricator.wikimedia.org/T355129) [17:31:45] (03CR) 10Ladsgroup: "recheck" [extensions/FlaggedRevs] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1028964 (owner: 10Ladsgroup) [17:33:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1163.eqiad.wmnet with reason: Maintenance [17:33:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1163.eqiad.wmnet with reason: Maintenance [17:33:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1163 (T361627)', diff saved to https://phabricator.wikimedia.org/P62095 and previous config saved to /var/cache/conftool/dbconfig/20240508-173353-marostegui.json [17:33:56] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [17:38:46] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: MD Raid monitoring: add the failed disk physical location to the auto-generated task - https://phabricator.wikimedia.org/T364496 (10Volans) 03NEW p:05Triage→03Medium [17:40:15] (03PS2) 10Cathal Mooney: Support VM BGP automation using Netbox flag for L3 POPs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1029231 (https://phabricator.wikimedia.org/T364480) [17:42:59] (03CR) 10Volans: Support VM BGP automation using Netbox flag for L3 POPs (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1029231 (https://phabricator.wikimedia.org/T364480) (owner: 10Cathal Mooney) [17:44:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T361627)', diff saved to https://phabricator.wikimedia.org/P62096 and previous config saved to /var/cache/conftool/dbconfig/20240508-174428-marostegui.json [17:44:32] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, 
cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [17:59:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P62097 and previous config saved to /var/cache/conftool/dbconfig/20240508-175936-marostegui.json [18:00:05] jeena and jnuche: That opportune time for a MediaWiki train - Utc-7 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240508T1800). [18:06:55] (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029239 (https://phabricator.wikimedia.org/T361398) [18:06:57] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029239 (https://phabricator.wikimedia.org/T361398) (owner: 10TrainBranchBot) [18:08:10] (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029239 (https://phabricator.wikimedia.org/T361398) (owner: 10TrainBranchBot) [18:14:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P62098 and previous config saved to /var/cache/conftool/dbconfig/20240508-181443-marostegui.json [18:21:37] FIRING: [2x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:23:47] (03PS1) 10Ladsgroup: Revert "logos: Add fawiki logo for 1,000,000 article" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029240 [18:23:59] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.4 refs T361398 [18:24:04] T361398: 1.43.0-wmf.4 deployment blockers - 
https://phabricator.wikimedia.org/T361398 [18:26:43] jouncebot: nowandnext [18:26:43] For the next 1 hour(s) and 33 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240508T1800) [18:26:43] In 1 hour(s) and 33 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240508T2000) [18:27:23] jeena: I have a couple of patches. e.g. https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/1028964 [18:27:44] Hi Amir1 [18:27:51] I still need to roll to group1 [18:27:55] sure [18:27:57] let me know [18:28:01] okay [18:29:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T361627)', diff saved to https://phabricator.wikimedia.org/P62099 and previous config saved to /var/cache/conftool/dbconfig/20240508-182951-marostegui.json [18:29:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance [18:29:55] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [18:30:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance [18:30:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T361627)', diff saved to https://phabricator.wikimedia.org/P62100 and previous config saved to /var/cache/conftool/dbconfig/20240508-183014-marostegui.json [18:32:22] There is an error, I'm not sure if it should block the train https://phabricator.wikimedia.org/T364499 [18:32:27] (03CR) 10Dzahn: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1029212 (https://phabricator.wikimedia.org/T333029) (owner: 10Addshore) [18:33:06] But the error count doesn't seem so high so I think I'll go forward [18:33:22] (03CR) 10Scott French: 
[C:03+2] confd: insert positional argument separator in check_cmd [puppet] - 10https://gerrit.wikimedia.org/r/1028898 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [18:33:47] (03PS1) 10TrainBranchBot: group1 wikis to 1.43.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029242 (https://phabricator.wikimedia.org/T361398) [18:33:57] (03CR) 10TrainBranchBot: [C:03+2] group1 wikis to 1.43.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029242 (https://phabricator.wikimedia.org/T361398) (owner: 10TrainBranchBot) [18:34:42] (03Merged) 10jenkins-bot: group1 wikis to 1.43.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029242 (https://phabricator.wikimedia.org/T361398) (owner: 10TrainBranchBot) [18:41:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T361627)', diff saved to https://phabricator.wikimedia.org/P62101 and previous config saved to /var/cache/conftool/dbconfig/20240508-184152-marostegui.json [18:41:56] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [18:43:51] PROBLEM - Check whether ferm is active by checking the default input chain on parse1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:45:29] (03PS3) 10Cathal Mooney: Support VM BGP automation using Netbox flag for L3 POPs [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1029231 (https://phabricator.wikimedia.org/T364480) [18:46:05] (03PS5) 10Scott French: confd: Extend confd-lint-wrap to accept a unique resource name [puppet] - 10https://gerrit.wikimedia.org/r/1028559 (https://phabricator.wikimedia.org/T363924) [18:46:05] (03PS5) 10Scott French: confd: prom exporter uses resource name to find state file [puppet] - 
10https://gerrit.wikimedia.org/r/1028560 (https://phabricator.wikimedia.org/T363924) [18:46:10] (03CR) 10Cathal Mooney: Support VM BGP automation using Netbox flag for L3 POPs (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1029231 (https://phabricator.wikimedia.org/T364480) (owner: 10Cathal Mooney) [18:47:31] (03CR) 10Scott French: "Thank you both for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1028559 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [18:49:21] (03CR) 10Dzahn: "Admittedly, this patch is an attempt to just not have to deal with setting up volatile on every (local / cloud) puppetmaster. By the way t" [puppet] - 10https://gerrit.wikimedia.org/r/1026193 (https://phabricator.wikimedia.org/T363415) (owner: 10Dzahn) [18:49:46] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.43.0-wmf.4 refs T361398 [18:49:52] T361398: 1.43.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T361398 [18:50:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T352010)', diff saved to https://phabricator.wikimedia.org/P62102 and previous config saved to /var/cache/conftool/dbconfig/20240508-185038-ladsgroup.json [18:50:43] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:53:32] (03CR) 10Gergő Tisza: "Looks good to me but 1) I think someone from Security should say they are OK with it 2) it should probably go to Beta first, as I don't th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647) (owner: 10Zabe) [18:54:59] Amir1: I am finished now [18:55:28] (03CR) 10Zabe: "Fair." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647) (owner: 10Zabe) [18:55:45] awesome [18:57:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P62103 and previous config saved to /var/cache/conftool/dbconfig/20240508-185700-marostegui.json [18:57:50] (03CR) 10Ladsgroup: [C:03+2] Revert "logos: Add fawiki logo for 1,000,000 article" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029240 (owner: 10Ladsgroup) [18:58:00] (03CR) 10Bartosz Dziewoński: [C:03+1] "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029187 (owner: 10Bartosz Dziewoński) [18:58:39] (03Merged) 10jenkins-bot: Revert "logos: Add fawiki logo for 1,000,000 article" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029240 (owner: 10Ladsgroup) [18:59:51] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1029240|Revert "logos: Add fawiki logo for 1,000,000 article"]] [19:02:32] (03CR) 10Ladsgroup: [C:03+2] FlaggedRevsStats: Fix migration to query builder [extensions/FlaggedRevs] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1028964 (owner: 10Ladsgroup) [19:02:42] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1029240|Revert "logos: Add fawiki logo for 1,000,000 article"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:03:28] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [19:05:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P62104 and previous config saved to /var/cache/conftool/dbconfig/20240508-190546-ladsgroup.json [19:08:07] (03PS1) 10Majavah: cawiki: Restore normal logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029245 (https://phabricator.wikimedia.org/T363057) [19:09:15] Amir1: ping me when I can deploy 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1029245 please? [19:09:22] sure [19:09:23] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2011 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:10:38] (03Merged) 10jenkins-bot: FlaggedRevsStats: Fix migration to query builder [extensions/FlaggedRevs] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1028964 (owner: 10Ladsgroup) [19:11:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 38.42% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:12:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P62105 and previous config saved to /var/cache/conftool/dbconfig/20240508-191207-marostegui.json [19:13:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:13:51] RECOVERY - Check whether ferm is active by checking the default input chain on parse1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:16:10] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1029240|Revert "logos: Add fawiki logo for 1,000,000 article"]] (duration: 16m 18s) [19:16:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.19% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:16:55] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1028964|FlaggedRevsStats: Fix migration to query builder]] [19:17:25] (03CR) 10Scott French: "Nice! A couple of mostly nits or clarifications. Otherwise looks good." [software/envoyproxy/ratelimiter] - 10https://gerrit.wikimedia.org/r/1029205 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [19:20:14] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1028964|FlaggedRevsStats: Fix migration to query builder]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:20:32] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [19:20:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P62106 and previous config saved to /var/cache/conftool/dbconfig/20240508-192054-ladsgroup.json [19:20:58] (03PS1) 10BBlack: admin: Restore analytics-product-users access for nshahquinn-wmf, hghani [puppet] - 10https://gerrit.wikimedia.org/r/1029270 (https://phabricator.wikimedia.org/T364359) [19:25:49] PROBLEM - Check whether ferm is active by checking the default input chain on mw1393 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:27:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T361627)', diff saved to https://phabricator.wikimedia.org/P62107 and previous config saved to /var/cache/conftool/dbconfig/20240508-192715-marostegui.json [19:27:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1186.eqiad.wmnet with reason: Maintenance [19:27:24] T361627: Create 
cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [19:27:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1186.eqiad.wmnet with reason: Maintenance [19:27:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T361627)', diff saved to https://phabricator.wikimedia.org/P62108 and previous config saved to /var/cache/conftool/dbconfig/20240508-192743-marostegui.json [19:29:15] PROBLEM - Check whether ferm is active by checking the default input chain on mw2401 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:29:18] (03CR) 10BBlack: [C:03+2] admin: Restore analytics-product-users access for nshahquinn-wmf, hghani [puppet] - 10https://gerrit.wikimedia.org/r/1029270 (https://phabricator.wikimedia.org/T364359) (owner: 10BBlack) [19:33:34] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1028964|FlaggedRevsStats: Fix migration to query builder]] (duration: 16m 39s) [19:33:50] jeena: I'm done! [19:33:53] taavi: ^ [19:34:12] 06SRE, 10SRE-Access-Requests, 06Movement-Insights: Restore nshahquinn-wmf and hghani to analytics-product-users - https://phabricator.wikimedia.org/T364359#9781941 (10BBlack) The patch should fix things up, let me know if there's still problems after ~half an hour to let the change propagate through the syst... 
[19:34:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029245 (https://phabricator.wikimedia.org/T363057) (owner: 10Majavah) [19:35:30] (03Merged) 10jenkins-bot: cawiki: Restore normal logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029245 (https://phabricator.wikimedia.org/T363057) (owner: 10Majavah) [19:36:01] !log taavi@deploy1002 Started scap: Backport for [[gerrit:1029245|cawiki: Restore normal logo (T363057)]] [19:36:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T352010)', diff saved to https://phabricator.wikimedia.org/P62109 and previous config saved to /var/cache/conftool/dbconfig/20240508-193601-ladsgroup.json [19:36:04] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [19:36:04] T363057: Changing logos and tagline for the 750k article milestone in the Catalan Wikipedia - https://phabricator.wikimedia.org/T363057 [19:36:07] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:36:12] (03CR) 10Gergő Tisza: Use encrypted Argon2 Hashes to store user passwords (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647) (owner: 10Zabe) [19:36:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [19:36:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T352010)', diff saved to https://phabricator.wikimedia.org/P62110 and previous config saved to /var/cache/conftool/dbconfig/20240508-193624-ladsgroup.json [19:38:37] !log taavi@deploy1002 taavi: Backport for [[gerrit:1029245|cawiki: Restore normal logo (T363057)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:38:53] !log taavi@deploy1002 taavi: 
Continuing with sync [19:39:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T361627)', diff saved to https://phabricator.wikimedia.org/P62111 and previous config saved to /var/cache/conftool/dbconfig/20240508-193920-marostegui.json [19:39:23] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2011 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:39:24] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [19:40:34] (03PS1) 10Ebernhardson: cirrus: Shift remaining public wikis in codfw to replacement updater [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029273 (https://phabricator.wikimedia.org/T363475) [19:40:38] (03PS1) 10Ebernhardson: cirrus: Expand codfw to serve writes to all public wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029274 (https://phabricator.wikimedia.org/T363475) [19:41:27] (03CR) 10CI reject: [V:04-1] cirrus: Shift remaining public wikis in codfw to replacement updater [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029273 (https://phabricator.wikimedia.org/T363475) (owner: 10Ebernhardson) [19:44:21] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1036 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:44:21] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1039 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:44:35] (03PS2) 10Ebernhardson: cirrus: Expand codfw to serve writes to all public wikis [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1029274 (https://phabricator.wikimedia.org/T363475) [19:45:17] (03CR) 10Ebernhardson: [C:03+2] cirrus: Expand codfw to serve writes to all public wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029274 (https://phabricator.wikimedia.org/T363475) (owner: 10Ebernhardson) [19:46:14] (03Merged) 10jenkins-bot: cirrus: Expand codfw to serve writes to all public wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029274 (https://phabricator.wikimedia.org/T363475) (owner: 10Ebernhardson) [19:46:43] PROBLEM - Check whether ferm is active by checking the default input chain on mw1396 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:48:55] !log ebernhardson@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:49:05] !log ebernhardson@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:51:31] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:1029245|cawiki: Restore normal logo (T363057)]] (duration: 15m 29s) [19:51:35] T363057: Changing logos and tagline for the 750k article milestone in the Catalan Wikipedia - https://phabricator.wikimedia.org/T363057 [19:54:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P62112 and previous config saved to /var/cache/conftool/dbconfig/20240508-195428-marostegui.json [19:55:49] RECOVERY - Check whether ferm is active by checking the default input chain on mw1393 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:59:15] RECOVERY - Check whether ferm is active by checking the default input chain on mw2401 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:00:05] RoanKattouw, Urbanecm, cjming, 
TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240508T2000). [20:00:05] ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:58] (03CR) 10Zabe: Use encrypted Argon2 Hashes to store user passwords (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647) (owner: 10Zabe) [20:01:40] \o [20:03:38] (03PS2) 10Ebernhardson: cirrus: Shift remaining public wikis in codfw to replacement updater [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029273 (https://phabricator.wikimedia.org/T363475) [20:03:49] it looks like i'm the only one on the list, i can ship [20:08:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029273 (https://phabricator.wikimedia.org/T363475) (owner: 10Ebernhardson) [20:09:27] (03Merged) 10jenkins-bot: cirrus: Shift remaining public wikis in codfw to replacement updater [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029273 (https://phabricator.wikimedia.org/T363475) (owner: 10Ebernhardson) [20:09:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P62113 and previous config saved to /var/cache/conftool/dbconfig/20240508-200935-marostegui.json [20:09:54] !log ebernhardson@deploy1002 Started scap: Backport for [[gerrit:1029273|cirrus: Shift remaining public wikis in codfw to replacement updater (T363475)]] [20:09:58] T363475: SUP: Shift Writes from Cirrus to SUP - https://phabricator.wikimedia.org/T363475 [20:11:19] Just got a 502 on wikidata, figured I should report: [20:11:20] Request from [ip] 
via cp1104.eqiad.wmnet, ATS/9.1.4 Error: 502, Broken pipe at 2024-05-08 20:10:33 GMT [20:12:34] !log ebernhardson@deploy1002 ebernhardson: Backport for [[gerrit:1029273|cirrus: Shift remaining public wikis in codfw to replacement updater (T363475)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:12:47] !log ebernhardson@deploy1002 ebernhardson: Continuing with sync [20:14:21] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1036 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:14:21] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1039 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:15:58] (03CR) 10Gergő Tisza: Use encrypted Argon2 Hashes to store user passwords (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647) (owner: 10Zabe) [20:16:43] RECOVERY - Check whether ferm is active by checking the default input chain on mw1396 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:17:41] PROBLEM - Check whether ferm is active by checking the default input chain on mw1379 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:20:35] PROBLEM - Check whether ferm is active by checking the default input chain on parse2008 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:21:15] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2039 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly 
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:24:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T361627)', diff saved to https://phabricator.wikimedia.org/P62114 and previous config saved to /var/cache/conftool/dbconfig/20240508-202446-marostegui.json [20:24:50] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [20:24:50] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1196.eqiad.wmnet with reason: Maintenance [20:25:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1196.eqiad.wmnet with reason: Maintenance [20:25:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [20:25:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [20:25:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T361627)', diff saved to https://phabricator.wikimedia.org/P62115 and previous config saved to /var/cache/conftool/dbconfig/20240508-202516-marostegui.json [20:25:54] !log ebernhardson@deploy1002 Finished scap: Backport for [[gerrit:1029273|cirrus: Shift remaining public wikis in codfw to replacement updater (T363475)]] (duration: 16m 00s) [20:25:57] T363475: SUP: Shift Writes from Cirrus to SUP - https://phabricator.wikimedia.org/T363475 [20:26:28] FIRING: [4x] JobUnavailable: Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:29:17] (03CR) 10Scott French: 
[C:03+1] "Thanks, Riccardo." [software/conftool] - 10https://gerrit.wikimedia.org/r/995053 (https://phabricator.wikimedia.org/T356423) (owner: 10Volans) [20:34:22] (03CR) 10Scott French: [C:03+2] confd: Extend confd-lint-wrap to accept a unique resource name [puppet] - 10https://gerrit.wikimedia.org/r/1028559 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [20:34:34] (03CR) 10SBassett: "Reedy has been following along with this work, AIUI, and is still a member of the Security Team. But I don't have any problem with the ge" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647) (owner: 10Zabe) [20:36:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T361627)', diff saved to https://phabricator.wikimedia.org/P62116 and previous config saved to /var/cache/conftool/dbconfig/20240508-203655-marostegui.json [20:37:00] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [20:38:29] (03CR) 10SBassett: "> This patch must be preceded by a patch that defines the secret in" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647) (owner: 10Zabe) [20:41:15] (03CR) 10Xcollazo: [C:03+1] "Reminder to self: when we merge this let's update https://wikitech.wikimedia.org/wiki/Dumps/Snapshot_hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1029220 (https://phabricator.wikimedia.org/T325228) (owner: 10Btullis) [20:47:41] RECOVERY - Check whether ferm is active by checking the default input chain on mw1379 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:50:35] RECOVERY - Check whether ferm is active by checking the default input chain on parse2008 is OK: OK ferm input default policy is set 
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:51:15] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2039 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:52:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P62117 and previous config saved to /var/cache/conftool/dbconfig/20240508-205203-marostegui.json [21:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240508T2100) [21:04:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 824.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:07:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P62118 and previous config saved to /var/cache/conftool/dbconfig/20240508-210711-marostegui.json [21:09:03] PROBLEM - MediaWiki CirrusSearch update rate - codfw on alert1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [21:09:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 844.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:12:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 800.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:14:00] Hm, everything seems nominal? [21:17:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 821.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:22:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T361627)', diff saved to https://phabricator.wikimedia.org/P62119 and previous config saved to /var/cache/conftool/dbconfig/20240508-212219-marostegui.json [21:22:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1206.eqiad.wmnet with reason: Maintenance [21:22:23] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [21:22:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1206.eqiad.wmnet with reason: Maintenance [21:22:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T361627)', diff saved to https://phabricator.wikimedia.org/P62121 and previous config saved to /var/cache/conftool/dbconfig/20240508-212242-marostegui.json [21:29:56] (03PS1) 10Ryan Kemper: wdqs: remove refs to query_service::deploy::manual [puppet] 
- 10https://gerrit.wikimedia.org/r/1029290 (https://phabricator.wikimedia.org/T316876) [21:29:58] (03PS1) 10Ryan Kemper: wdqs: remove config for wdqs cloud hosts [puppet] - 10https://gerrit.wikimedia.org/r/1029291 (https://phabricator.wikimedia.org/T316876) [21:34:13] (03CR) 10Bking: [C:03+1] wdqs: remove refs to query_service::deploy::manual [puppet] - 10https://gerrit.wikimedia.org/r/1029290 (https://phabricator.wikimedia.org/T316876) (owner: 10Ryan Kemper) [21:34:52] (03PS2) 10Ryan Kemper: wdqs: remove refs to query_service::deploy::manual [puppet] - 10https://gerrit.wikimedia.org/r/1029290 (https://phabricator.wikimedia.org/T316876) [21:35:02] (03CR) 10Muehlenhoff: "Actually, profile::query_service::deploy_mode is still set to 'manual' in hieradata/cloud/eqiad1/wikidata-query/common.yaml, with git-fat " [puppet] - 10https://gerrit.wikimedia.org/r/1029290 (https://phabricator.wikimedia.org/T316876) (owner: 10Ryan Kemper) [21:36:02] (03CR) 10Muehlenhoff: [C:03+1] "Nvm, I missed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1029291" [puppet] - 10https://gerrit.wikimedia.org/r/1029290 (https://phabricator.wikimedia.org/T316876) (owner: 10Ryan Kemper) [21:37:31] (03CR) 10Muehlenhoff: [C:03+1] wdqs: remove refs to query_service::deploy::manual (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1029290 (https://phabricator.wikimedia.org/T316876) (owner: 10Ryan Kemper) [21:40:40] (03PS3) 10Ryan Kemper: wdqs: remove refs to query_service::deploy::manual [puppet] - 10https://gerrit.wikimedia.org/r/1029290 (https://phabricator.wikimedia.org/T316876) [21:48:36] (03CR) 10Bking: [C:03+1] wdqs: remove refs to query_service::deploy::manual [puppet] - 10https://gerrit.wikimedia.org/r/1029290 (https://phabricator.wikimedia.org/T316876) (owner: 10Ryan Kemper) [21:48:45] (03CR) 10Ryan Kemper: wdqs: remove refs to query_service::deploy::manual (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1029290 (https://phabricator.wikimedia.org/T316876) (owner: 10Ryan 
Kemper) [21:49:08] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1029290 (https://phabricator.wikimedia.org/T316876) (owner: 10Ryan Kemper) [21:49:55] PROBLEM - Router interfaces on cr1-magru is CRITICAL: CRITICAL: host 195.200.68.128, interfaces up: 47, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:55:09] (03CR) 10Ryan Kemper: [C:03+2] wdqs: remove refs to query_service::deploy::manual [puppet] - 10https://gerrit.wikimedia.org/r/1029290 (https://phabricator.wikimedia.org/T316876) (owner: 10Ryan Kemper) [21:58:28] (03CR) 10Ryan Kemper: "Going to recheck to see if the extraneous failure is fixed after merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1029290" [puppet] - 10https://gerrit.wikimedia.org/r/1028565 (owner: 10Ahmon Dancy) [21:58:34] (03CR) 10Ryan Kemper: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1028565 (owner: 10Ahmon Dancy) [22:11:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T361627)', diff saved to https://phabricator.wikimedia.org/P62122 and previous config saved to /var/cache/conftool/dbconfig/20240508-221105-marostegui.json [22:11:20] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [22:21:37] FIRING: [2x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:26:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P62123 and previous config saved to /var/cache/conftool/dbconfig/20240508-222613-marostegui.json [22:33:59] (03CR) 10Ahmon Dancy: 
"recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1029235 (https://phabricator.wikimedia.org/T236253) (owner: 10Ahmon Dancy) [22:34:04] (03PS1) 10Dzahn: ci: include envoyproxy in ci_test role [puppet] - 10https://gerrit.wikimedia.org/r/1029295 (https://phabricator.wikimedia.org/T364510) [22:34:24] (03PS2) 10Dzahn: ci: include envoyproxy in ci_test role [puppet] - 10https://gerrit.wikimedia.org/r/1029295 (https://phabricator.wikimedia.org/T364510) [22:35:03] (03PS3) 10Dzahn: ci: include envoyproxy in ci_test role [puppet] - 10https://gerrit.wikimedia.org/r/1029295 (https://phabricator.wikimedia.org/T364510) [22:37:05] (03CR) 10Dzahn: [C:03+1] "lgtm (per Effie @ https://gerrit.wikimedia.org/r/c/operations/puppet/+/545558/8/modules/systemd/files/coredump-enabled.conf )" [puppet] - 10https://gerrit.wikimedia.org/r/1029235 (https://phabricator.wikimedia.org/T236253) (owner: 10Ahmon Dancy) [22:39:22] (03CR) 10Dzahn: [C:03+1] "to be fair though, it says the compress ratio can be up to 1:100 with LZ4 and I have no idea if we will run into disk space issues without" [puppet] - 10https://gerrit.wikimedia.org/r/1029235 (https://phabricator.wikimedia.org/T236253) (owner: 10Ahmon Dancy) [22:41:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P62124 and previous config saved to /var/cache/conftool/dbconfig/20240508-224120-marostegui.json [22:42:53] (03PS6) 10Scott French: confd: prom exporter uses resource name to find state file [puppet] - 10https://gerrit.wikimedia.org/r/1028560 (https://phabricator.wikimedia.org/T363924) [22:42:53] (03PS1) 10Scott French: confd: clean up confd-lint-wrap after error file fixes [puppet] - 10https://gerrit.wikimedia.org/r/1029296 (https://phabricator.wikimedia.org/T363924) [22:46:39] (03PS4) 10Dzahn: ci: include envoyproxy in ci_test role [puppet] - 10https://gerrit.wikimedia.org/r/1029295 (https://phabricator.wikimedia.org/T364510) [22:49:23] (03CR) 
10Dzahn: [V:03+1 C:03+2] "only affects the test server, so moving ahead to fix systemd status alert - https://puppet-compiler.wmflabs.org/output/1029295/2356/contin" [puppet] - 10https://gerrit.wikimedia.org/r/1029295 (https://phabricator.wikimedia.org/T364510) (owner: 10Dzahn) [22:53:03] !log contint1003 - systemctl start wmf_auto_restart_envoyproxy T364510 T358237 [22:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:11] T364510: SystemdUnitFailed - contint1003 - envoyproxy - https://phabricator.wikimedia.org/T364510 [22:53:11] T358237: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237 [22:53:38] 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, 10vm-requests, 13Patch-For-Review: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9782254 (10Dzahn) added envoy to contint1003 to fix T364510 [22:56:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T361627)', diff saved to https://phabricator.wikimedia.org/P62125 and previous config saved to /var/cache/conftool/dbconfig/20240508-225628-marostegui.json [22:56:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1207.eqiad.wmnet with reason: Maintenance [22:56:32] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [22:56:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1207.eqiad.wmnet with reason: Maintenance [22:56:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T361627)', diff saved to https://phabricator.wikimedia.org/P62126 and previous config saved to /var/cache/conftool/dbconfig/20240508-225652-marostegui.json [23:02:30] (03CR) 10Scott French: "This is follow-up from our discussion on 
I859e13d7801e19463ef111227b9b9c7f958dc03a. If you have cycles to review, that would be greatly ap" [puppet] - 10https://gerrit.wikimedia.org/r/1029296 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [23:08:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T361627)', diff saved to https://phabricator.wikimedia.org/P62127 and previous config saved to /var/cache/conftool/dbconfig/20240508-230800-marostegui.json [23:08:06] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [23:12:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:13:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:17:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:18:24] 06SRE, 06SRE Observability, 13Patch-For-Review: confd prom exporter cannot distinguish targets with a common base name - https://phabricator.wikimedia.org/T363924#9782278 (10Scott_French) a:03Scott_French Alright, two more patches left: one more to finish the functional part of this, and then a cleanup. A... 
[23:23:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240508-232308-marostegui.json [23:37:37] 06SRE, 10SRE-Access-Requests, 06Movement-Insights: Restore nshahquinn-wmf and hghani to analytics-product-users - https://phabricator.wikimedia.org/T364359#9782299 (10nshahquinn-wmf) 05Open→03Resolved a:03nshahquinn-wmf Thank you! Everything works again 😊 [23:38:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P62129 and previous config saved to /var/cache/conftool/dbconfig/20240508-233820-marostegui.json [23:38:42] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1028934 [23:38:42] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1028934 (owner: 10TrainBranchBot) [23:53:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T361627)', diff saved to https://phabricator.wikimedia.org/P62130 and previous config saved to /var/cache/conftool/dbconfig/20240508-235327-marostegui.json [23:53:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1218.eqiad.wmnet with reason: Maintenance [23:53:33] T361627: Create cuc_agent_id, cule_agent_id and cupe_agent_id columns in cu_changes, cu_log_event and cu_private_event tables respectively on WMF wikis - https://phabricator.wikimedia.org/T361627 [23:53:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1218.eqiad.wmnet with reason: Maintenance [23:53:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T361627)', diff saved to https://phabricator.wikimedia.org/P62131 and previous config saved to 
/var/cache/conftool/dbconfig/20240508-235350-marostegui.json