[00:08:21] (03CR) 10Dreamy Jazz: [C: 03+1] "👍" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004788 (https://phabricator.wikimedia.org/T346989) (owner: 10Tim Starling) [00:38:56] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1004693 [00:38:58] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1004693 (owner: 10TrainBranchBot) [01:03:22] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1004693 (owner: 10TrainBranchBot) [01:03:49] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T357944#9556514 (10phaultfinder) [01:23:11] PROBLEM - snapshot of s6 in codfw on backupmon1001 is CRITICAL: snapshot for s6 at codfw (db2097) taken more than 3 days ago: Most recent backup 2024-02-17 01:20:00 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [02:04:21] (03CR) 10Tim Starling: [C: 03+2] Set $wgLoginNotifyUseCheckUser = false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004788 (https://phabricator.wikimedia.org/T346989) (owner: 10Tim Starling) [02:05:08] (03Merged) 10jenkins-bot: Set $wgLoginNotifyUseCheckUser = false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004788 (https://phabricator.wikimedia.org/T346989) (owner: 10Tim Starling) [02:15:12] !log tstarling@deploy2002 Synchronized wmf-config/CommonSettings.php: Set $wgLoginNotifyUseCheckUser = false T346989 (duration: 08m 13s) [02:15:18] T346989: Deploy LoginNotify seen subnets table - https://phabricator.wikimedia.org/T346989 [02:21:53] RECOVERY - ensure kvm processes are running on cloudvirt1032 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [02:38:36] (JobUnavailable) firing: Reduced 
availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:48:35] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:51:25] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240220T0300) [03:08:06] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.42.0-wmf.19 [core] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1004694 (https://phabricator.wikimedia.org/T354437) [03:08:12] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.42.0-wmf.19 [core] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1004694 (https://phabricator.wikimedia.org/T354437) (owner: 10TrainBranchBot) [03:13:36] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:27:59] (03Merged) 10jenkins-bot: Branch commit for wmf/1.42.0-wmf.19 [core] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1004694 (https://phabricator.wikimedia.org/T354437) (owner: 10TrainBranchBot) [04:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240220T0400) [04:02:00] !log mwpresync@deploy2002 Pruned MediaWiki: 1.42.0-wmf.16 (duration: 01m 57s) [04:03:18] (03PS1) 10TrainBranchBot: testwikis wikis to 1.42.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004803 (https://phabricator.wikimedia.org/T354437) [04:03:20] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.42.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004803 (https://phabricator.wikimedia.org/T354437) (owner: 10TrainBranchBot) [04:04:03] (03Merged) 10jenkins-bot: testwikis wikis to 1.42.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004803 (https://phabricator.wikimedia.org/T354437) (owner: 10TrainBranchBot) [04:04:33] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.19 refs T354437 [04:04:38] T354437: 1.42.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T354437 [04:13:13] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT, AS2914/IPv4: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:18:39] PROBLEM - snapshot of s2 in codfw on backupmon1001 is CRITICAL: snapshot for s2 at codfw (db2097) taken more than 3 days ago: Most recent backup 2024-02-17 04:06:20 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [04:56:42] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.19 refs T354437 (duration: 52m 09s) [04:56:47] T354437: 1.42.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T354437 [05:08:51] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 110, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:39:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1170 for reimage', diff saved to https://phabricator.wikimedia.org/P57191 and previous config saved to 
/var/cache/conftool/dbconfig/20240220-053920-marostegui.json [05:40:19] (03PS1) 10Marostegui: db1170: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1004809 [05:41:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1170.eqiad.wmnet with OS bookworm [05:41:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2137 for reimage', diff saved to https://phabricator.wikimedia.org/P57192 and previous config saved to /var/cache/conftool/dbconfig/20240220-054156-marostegui.json [05:42:14] (03CR) 10Marostegui: [C: 03+2] db1170: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1004809 (owner: 10Marostegui) [05:44:41] (03PS1) 10Marostegui: db2137: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1004810 [05:44:56] (03PS2) 10Marostegui: wmnet: Promote es2020 to es4 master [dns] - 10https://gerrit.wikimedia.org/r/1004670 (https://phabricator.wikimedia.org/T356372) [05:45:04] (03PS2) 10Marostegui: db-production.php: Disable writes on es4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004668 (https://phabricator.wikimedia.org/T356372) [05:45:14] (03PS2) 10Marostegui: mariadb: Promote es2020 to es4 master [puppet] - 10https://gerrit.wikimedia.org/r/1004669 (https://phabricator.wikimedia.org/T356372) [05:45:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2137.codfw.wmnet with OS bookworm [05:46:22] (03CR) 10Marostegui: [C: 03+2] db-production.php: Disable writes on es4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004668 (https://phabricator.wikimedia.org/T356372) (owner: 10Marostegui) [05:46:33] (03CR) 10Marostegui: [C: 03+2] db2137: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1004810 (owner: 10Marostegui) [05:47:07] (03Merged) 10jenkins-bot: db-production.php: Disable writes on es4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004668 (https://phabricator.wikimedia.org/T356372) (owner: 10Marostegui) [05:50:27] !log 
marostegui@deploy2002 Started scap: Backport for [[gerrit:1004668|db-production.php: Disable writes on es4 (T356372)]] [05:50:32] T356372: Switchover es4 codfw master es2021 -> es2020 - https://phabricator.wikimedia.org/T356372 [05:52:02] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:1004668|db-production.php: Disable writes on es4 (T356372)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [05:52:24] !log marostegui@deploy2002 marostegui: Continuing with sync [05:53:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1170.eqiad.wmnet with reason: host reimage [05:54:31] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2137.codfw.wmnet with OS bookworm [05:55:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2137.codfw.wmnet with OS bookworm [05:55:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1170.eqiad.wmnet with reason: host reimage [06:00:04] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:1004668|db-production.php: Disable writes on es4 (T356372)]] (duration: 09m 36s) [06:00:25] T356372: Switchover es4 codfw master es2021 -> es2020 - https://phabricator.wikimedia.org/T356372 [06:01:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:01:48] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2137.codfw.wmnet with OS bookworm [06:03:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es4 T356372 [06:03:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es4 T356372 [06:04:05] 
!log marostegui@cumin1002 dbctl commit (dc=all): 'Set es2020 with weight 0 T356372', diff saved to https://phabricator.wikimedia.org/P57193 and previous config saved to /var/cache/conftool/dbconfig/20240220-060404-marostegui.json [06:04:30] (03PS1) 10Marostegui: Revert "db-production.php: Disable writes on es4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004715 [06:08:06] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote es2020 to es4 master [puppet] - 10https://gerrit.wikimedia.org/r/1004669 (https://phabricator.wikimedia.org/T356372) (owner: 10Marostegui) [06:08:25] !log Starting es4 codfw failover from es2021 to es2020 - T356372 [06:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:32] T356372: Switchover es4 codfw master es2021 -> es2020 - https://phabricator.wikimedia.org/T356372 [06:08:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2020 to es4 primary T356372', diff saved to https://phabricator.wikimedia.org/P57194 and previous config saved to /var/cache/conftool/dbconfig/20240220-060852-marostegui.json [06:09:37] (03CR) 10Marostegui: [C: 03+2] wmnet: Promote es2020 to es4 master [dns] - 10https://gerrit.wikimedia.org/r/1004670 (https://phabricator.wikimedia.org/T356372) (owner: 10Marostegui) [06:10:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2021 T356372', diff saved to https://phabricator.wikimedia.org/P57195 and previous config saved to /var/cache/conftool/dbconfig/20240220-061025-marostegui.json [06:10:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add weight to es2020', diff saved to https://phabricator.wikimedia.org/P57196 and previous config saved to /var/cache/conftool/dbconfig/20240220-061049-root.json [06:10:59] (03CR) 10Marostegui: [C: 03+2] Revert "db-production.php: Disable writes on es4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004715 (owner: 10Marostegui) [06:11:40] (03Merged) 10jenkins-bot: Revert "db-production.php: Disable writes on es4" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004715 (owner: 10Marostegui) [06:12:50] (03PS1) 10Marostegui: es2021: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1004811 (https://phabricator.wikimedia.org/T357905) [06:13:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1170.eqiad.wmnet with OS bookworm [06:13:10] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:1004715|Revert "db-production.php: Disable writes on es4"]] [06:14:23] (03CR) 10Marostegui: [C: 03+2] es2021: Migrate to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1004811 (https://phabricator.wikimedia.org/T357905) (owner: 10Marostegui) [06:14:38] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:1004715|Revert "db-production.php: Disable writes on es4"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [06:14:59] !log marostegui@deploy2002 marostegui: Continuing with sync [06:17:24] (03PS1) 10Marostegui: Revert "db1170: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1004716 [06:17:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 5%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57197 and previous config saved to /var/cache/conftool/dbconfig/20240220-061749-root.json [06:19:27] (03PS1) 10Marostegui: db1244: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1004812 [06:19:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1244', diff saved to https://phabricator.wikimedia.org/P57198 and previous config saved to /var/cache/conftool/dbconfig/20240220-061932-root.json [06:20:06] (03CR) 10Marostegui: [C: 03+2] Revert "db1170: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1004716 (owner: 10Marostegui) [06:20:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 1%: After migration', diff saved to https://phabricator.wikimedia.org/P57199 and previous 
config saved to /var/cache/conftool/dbconfig/20240220-062058-root.json [06:21:07] (03CR) 10Marostegui: [C: 03+2] db1244: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1004812 (owner: 10Marostegui) [06:22:42] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:1004715|Revert "db-production.php: Disable writes on es4"]] (duration: 09m 32s) [06:23:06] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870#9556783 (10Marostegui) [06:24:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1244.eqiad.wmnet with OS bookworm [06:25:09] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870#9556787 (10Marostegui) es2021 is no longer a master and it just need normal depooling cc @ABran-WMF [06:27:56] (03PS1) 10Marostegui: db1246: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1004962 [06:28:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1246', diff saved to https://phabricator.wikimedia.org/P57200 and previous config saved to /var/cache/conftool/dbconfig/20240220-062759-root.json [06:29:04] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1004962 (owner: 10Marostegui) [06:29:28] (03PS3) 10KartikMistry: Update MinT to 2024-02-20-062448-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/995170 (https://phabricator.wikimedia.org/T333969) [06:29:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1246.eqiad.wmnet with OS bookworm [06:31:19] (03CR) 10Marostegui: [C: 03+2] db1246: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1004962 (owner: 10Marostegui) [06:32:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 10%: After 
migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57201 and previous config saved to /var/cache/conftool/dbconfig/20240220-063254-root.json [06:32:58] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866#9556801 (10Marostegui) @cmooney is there anything pending here or can this be closed? [06:35:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2171 T354826', diff saved to https://phabricator.wikimedia.org/P57202 and previous config saved to /var/cache/conftool/dbconfig/20240220-063521-marostegui.json [06:35:26] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [06:36:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 5%: After migration', diff saved to https://phabricator.wikimedia.org/P57203 and previous config saved to /var/cache/conftool/dbconfig/20240220-063603-root.json [06:37:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1244.eqiad.wmnet with reason: host reimage [06:37:08] (03PS1) 10Marostegui: mariadb: Place db2171 in s5 [puppet] - 10https://gerrit.wikimedia.org/r/1004970 (https://phabricator.wikimedia.org/T354826) [06:38:45] (03CR) 10Marostegui: [C: 03+2] mariadb: Place db2171 in s5 [puppet] - 10https://gerrit.wikimedia.org/r/1004970 (https://phabricator.wikimedia.org/T354826) (owner: 10Marostegui) [06:39:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2171.codfw.wmnet with OS bookworm [06:39:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1244.eqiad.wmnet with reason: host reimage [06:40:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2171 multi-instance', diff saved to https://phabricator.wikimedia.org/P57204 and previous config saved to /var/cache/conftool/dbconfig/20240220-064014-marostegui.json 
[06:40:55] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T357952#9556808 (10Berete5212) [06:41:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Place db2171 in s5 depooled T354826', diff saved to https://phabricator.wikimedia.org/P57205 and previous config saved to /var/cache/conftool/dbconfig/20240220-064152-marostegui.json [06:41:58] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [06:42:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1246.eqiad.wmnet with reason: host reimage [06:44:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1246.eqiad.wmnet with reason: host reimage [06:47:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 25%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57206 and previous config saved to /var/cache/conftool/dbconfig/20240220-064758-root.json [06:48:35] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [06:51:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 10%: After migration', diff saved to https://phabricator.wikimedia.org/P57207 and previous config saved to /var/cache/conftool/dbconfig/20240220-065108-root.json [06:51:32] (03PS1) 10Marostegui: Revert "db1244: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1004717 [06:54:10] (03PS1) 10Marostegui: Revert "db1246: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1004718 [06:57:51] (03CR) 10Marostegui: [C: 03+2] Revert "db1244: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1004717 (owner: 10Marostegui) [06:58:15] !log marostegui@cumin1002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on db2171.codfw.wmnet with reason: host reimage [06:58:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 5%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57208 and previous config saved to /var/cache/conftool/dbconfig/20240220-065828-root.json [06:58:33] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [06:59:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1244.eqiad.wmnet with OS bookworm [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240220T0700) [07:00:04] kormat, marostegui, Amir1, and arnaudb: #bothumor I ♥ Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240220T0700). [07:02:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2171.codfw.wmnet with reason: host reimage [07:03:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 50%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57209 and previous config saved to /var/cache/conftool/dbconfig/20240220-070303-root.json [07:04:27] (03PS1) 10Marostegui: site.pp: Add db2171 to s5 [puppet] - 10https://gerrit.wikimedia.org/r/1004973 (https://phabricator.wikimedia.org/T354826) [07:04:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1246.eqiad.wmnet with OS bookworm [07:06:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 25%: After migration', diff saved to https://phabricator.wikimedia.org/P57210 and previous config saved to /var/cache/conftool/dbconfig/20240220-070613-root.json [07:06:46] (03CR) 10Marostegui: [C: 03+2] site.pp: Add db2171 to s5 [puppet] - 10https://gerrit.wikimedia.org/r/1004973 
(https://phabricator.wikimedia.org/T354826) (owner: 10Marostegui) [07:09:40] (03CR) 10Marostegui: [C: 03+2] Revert "db1246: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1004718 (owner: 10Marostegui) [07:09:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 5%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57211 and previous config saved to /var/cache/conftool/dbconfig/20240220-070948-root.json [07:09:58] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [07:13:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 10%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57212 and previous config saved to /var/cache/conftool/dbconfig/20240220-071333-root.json [07:18:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 75%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57213 and previous config saved to /var/cache/conftool/dbconfig/20240220-071808-root.json [07:21:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 50%: After migration', diff saved to https://phabricator.wikimedia.org/P57214 and previous config saved to /var/cache/conftool/dbconfig/20240220-072118-root.json [07:24:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 10%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57215 and previous config saved to /var/cache/conftool/dbconfig/20240220-072455-root.json [07:25:03] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [07:25:13] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 26554 [07:26:23] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 26554 [07:26:37] !log ayounsi@cumin1002 START - Cookbook 
sre.network.peering with action 'email' for AS: 18779 [07:26:54] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 18779 [07:26:57] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 60501 [07:27:09] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 60501 [07:27:17] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 56286 [07:27:30] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 56286 [07:28:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2171.codfw.wmnet with OS bookworm [07:28:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 25%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57216 and previous config saved to /var/cache/conftool/dbconfig/20240220-072838-root.json [07:31:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2170', diff saved to https://phabricator.wikimedia.org/P57217 and previous config saved to /var/cache/conftool/dbconfig/20240220-073139-root.json [07:32:38] (03PS1) 10Marostegui: db2170: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1004975 [07:32:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2170.codfw.wmnet with OS bookworm [07:33:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1170 (re)pooling @ 100%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57218 and previous config saved to /var/cache/conftool/dbconfig/20240220-073313-root.json [07:33:49] (03CR) 10Marostegui: [C: 03+2] db2170: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1004975 (owner: 10Marostegui) [07:34:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet 
[07:36:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 75%: After migration', diff saved to https://phabricator.wikimedia.org/P57219 and previous config saved to /var/cache/conftool/dbconfig/20240220-073623-root.json [07:36:26] (03PS1) 10Marostegui: db2171: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1004976 [07:36:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 5%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57220 and previous config saved to /var/cache/conftool/dbconfig/20240220-073658-root.json [07:37:04] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [07:37:18] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A7 from asw-a7-codfw to lsw1-a7-codfw - https://phabricator.wikimedia.org/T355867#9557370 (10ops-monitoring-bot) Draining ganeti2028.codfw.wmnet of running VMs [07:37:51] (03CR) 10Marostegui: [C: 03+2] db2171: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1004976 (owner: 10Marostegui) [07:38:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet [07:39:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2168', diff saved to https://phabricator.wikimedia.org/P57221 and previous config saved to /var/cache/conftool/dbconfig/20240220-073912-root.json [07:40:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 25%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57222 and previous config saved to /var/cache/conftool/dbconfig/20240220-074000-root.json [07:40:02] (03PS1) 10Marostegui: db2168: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1005016 [07:40:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2168.codfw.wmnet with OS bookworm 
[07:41:28] (03CR) 10Marostegui: [C: 03+2] db2168: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1005016 (owner: 10Marostegui) [07:41:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [07:42:16] (03PS1) 10Muehlenhoff: Remove cluster::management role from cumin1001 [puppet] - 10https://gerrit.wikimedia.org/r/1005019 (https://phabricator.wikimedia.org/T353419) [07:43:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 50%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57223 and previous config saved to /var/cache/conftool/dbconfig/20240220-074343-root.json [07:43:49] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [07:48:19] (03CR) 10Muehlenhoff: [C: 03+2] Remove cluster::management role from cumin1001 [puppet] - 10https://gerrit.wikimedia.org/r/1005019 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [07:50:43] (03PS1) 10Ayounsi: peering cookbook: handle more failure scenarios [cookbooks] - 10https://gerrit.wikimedia.org/r/1005021 [07:51:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2021 (re)pooling @ 100%: After migration', diff saved to https://phabricator.wikimedia.org/P57224 and previous config saved to /var/cache/conftool/dbconfig/20240220-075128-root.json [07:52:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 10%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57225 and previous config saved to /var/cache/conftool/dbconfig/20240220-075203-root.json [07:52:13] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [07:52:50] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2170.codfw.wmnet with 
reason: host reimage [07:55:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 50%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57226 and previous config saved to /var/cache/conftool/dbconfig/20240220-075505-root.json [07:55:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2170.codfw.wmnet with reason: host reimage [07:58:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 75%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57227 and previous config saved to /var/cache/conftool/dbconfig/20240220-075848-root.json [07:58:56] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [08:00:05] Amir1 and Urbanecm: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240220T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:00:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2168.codfw.wmnet with reason: host reimage [08:03:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2168.codfw.wmnet with reason: host reimage [08:04:33] (03PS1) 10Marostegui: Revert "db2170: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1004719 [08:07:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 25%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57228 and previous config saved to /var/cache/conftool/dbconfig/20240220-080708-root.json [08:07:13] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [08:10:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 75%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57229 and previous config saved to /var/cache/conftool/dbconfig/20240220-081010-root.json [08:10:34] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174#9557692 (10ABran-WMF) [08:13:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1244 (re)pooling @ 100%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57230 and previous config saved to /var/cache/conftool/dbconfig/20240220-081353-root.json [08:13:58] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [08:15:43] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists: In Mailman3 if a list has no owners, mail goes to root@ - https://phabricator.wikimedia.org/T281753#9557728 (10JJMC89) [08:15:55] (03PS1) 10Marostegui: Revert "db2168: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1004720 [08:16:05] (03CR) 10Marostegui: [C: 03+2] Revert "db2170: Disable notifications" [puppet] - 
10https://gerrit.wikimedia.org/r/1004719 (owner: 10Marostegui) [08:16:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2170.codfw.wmnet with OS bookworm [08:16:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 1%: After migration', diff saved to https://phabricator.wikimedia.org/P57231 and previous config saved to /var/cache/conftool/dbconfig/20240220-081627-root.json [08:17:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2138', diff saved to https://phabricator.wikimedia.org/P57232 and previous config saved to /var/cache/conftool/dbconfig/20240220-081740-root.json [08:18:31] (03PS1) 10Marostegui: db2138: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1005023 [08:19:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2138.codfw.wmnet with OS bookworm [08:19:52] (03CR) 10Marostegui: [C: 03+2] db2138: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1005023 (owner: 10Marostegui) [08:20:22] (03CR) 10Marostegui: [C: 03+2] Revert "db2168: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1004720 (owner: 10Marostegui) [08:20:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 5%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57233 and previous config saved to /var/cache/conftool/dbconfig/20240220-082043-root.json [08:20:48] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [08:22:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 50%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57234 and previous config saved to /var/cache/conftool/dbconfig/20240220-082213-root.json [08:23:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2168.codfw.wmnet with OS bookworm [08:25:15] !log marostegui@cumin1002 dbctl 
commit (dc=all): 'db1246 (re)pooling @ 100%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57235 and previous config saved to /var/cache/conftool/dbconfig/20240220-082515-root.json [08:31:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 5%: After migration', diff saved to https://phabricator.wikimedia.org/P57236 and previous config saved to /var/cache/conftool/dbconfig/20240220-083132-root.json [08:31:59] (03CR) 10Ayounsi: "Thanks !" [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [08:32:19] (03PS7) 10Ayounsi: Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) [08:35:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 10%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57237 and previous config saved to /var/cache/conftool/dbconfig/20240220-083547-root.json [08:35:54] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [08:37:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 75%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57238 and previous config saved to /var/cache/conftool/dbconfig/20240220-083718-root.json [08:37:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2138.codfw.wmnet with reason: host reimage [08:40:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2138.codfw.wmnet with reason: host reimage [08:41:29] (03PS1) 10Marostegui: db2167: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1005026 [08:41:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2167', diff saved to https://phabricator.wikimedia.org/P57239 and previous 
config saved to /var/cache/conftool/dbconfig/20240220-084136-root.json [08:42:43] (03CR) 10Marostegui: [C: 03+2] db2167: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1005026 (owner: 10Marostegui) [08:43:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2167.codfw.wmnet with OS bookworm [08:44:52] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [08:45:06] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [08:45:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:45:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:45:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T357189)', diff saved to https://phabricator.wikimedia.org/P57240 and previous config saved to /var/cache/conftool/dbconfig/20240220-084530-arnaudb.json [08:45:35] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [08:46:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [08:46:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [08:46:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1163 (T355609)', diff saved to https://phabricator.wikimedia.org/P57241 and previous config saved to /var/cache/conftool/dbconfig/20240220-084637-marostegui.json [08:46:42] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [08:49:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1161 (T357189)', diff saved to https://phabricator.wikimedia.org/P57242 and previous config saved to /var/cache/conftool/dbconfig/20240220-084901-arnaudb.json [08:50:41] (03PS1) 10Marostegui: Revert "db2138: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1004721 [08:50:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 25%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57243 and previous config saved to /var/cache/conftool/dbconfig/20240220-085052-root.json [08:50:58] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [08:51:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:52:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2171 (re)pooling @ 100%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57244 and previous config saved to /var/cache/conftool/dbconfig/20240220-085222-root.json [08:56:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 5%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57245 and previous config saved to /var/cache/conftool/dbconfig/20240220-085641-root.json [08:56:52] (03CR) 10Marostegui: [C: 03+2] Revert "db2138: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1004721 (owner: 10Marostegui) [08:56:57] !log dcausse@deploy2002 Started deploy [airflow-dags/search@a6356d2]: search: wdqs-updater reconcile, do not create the dag dynamically [08:57:26] !log dcausse@deploy2002 Finished deploy [airflow-dags/search@a6356d2]: search: wdqs-updater reconcile, do not create the dag dynamically (duration: 00m 28s) [08:58:40] (ProbeDown) firing: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:01:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2138.codfw.wmnet with OS bookworm [09:01:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin_ng: mw-parsoid stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004739 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [09:02:13] (03CR) 10Alexandros Kosiaris: [C: 03+2] deploy: Add mw-parsoid namespace stanzas [puppet] - 10https://gerrit.wikimedia.org/r/1004149 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [09:02:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2167.codfw.wmnet with reason: host reimage [09:03:40] (ProbeDown) resolved: (2) Service etherpad1003:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:04:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P57246 and previous config saved to /var/cache/conftool/dbconfig/20240220-090408-arnaudb.json [09:04:43] (03Merged) 10jenkins-bot: admin_ng: mw-parsoid stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004739 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [09:04:46] (03PS8) 10Ayounsi: Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) [09:05:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2167.codfw.wmnet with reason: host reimage [09:05:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2168 
(re)pooling @ 50%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57247 and previous config saved to /var/cache/conftool/dbconfig/20240220-090557-root.json [09:06:14] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [09:08:16] !log akosiaris@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:09:52] !log akosiaris@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:10:42] (03CR) 10CI reject: [V: 04-1] Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [09:11:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 10%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57248 and previous config saved to /var/cache/conftool/dbconfig/20240220-091146-root.json [09:13:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T355609)', diff saved to https://phabricator.wikimedia.org/P57249 and previous config saved to /var/cache/conftool/dbconfig/20240220-091321-marostegui.json [09:13:27] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [09:15:48] !log akosiaris@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:16:02] !log dcausse@deploy2002 Started deploy [airflow-dags/search@088b013]: search: wdqs updater set proper start date [09:16:28] !log dcausse@deploy2002 Finished deploy [airflow-dags/search@088b013]: search: wdqs updater set proper start date (duration: 00m 26s) [09:16:30] !log akosiaris@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[09:19:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P57250 and previous config saved to /var/cache/conftool/dbconfig/20240220-091914-arnaudb.json [09:21:01] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [09:21:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 75%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57251 and previous config saved to /var/cache/conftool/dbconfig/20240220-092102-root.json [09:21:08] (03PS1) 10Slyngshede: P:idp Switch idp test host to Java 17. [puppet] - 10https://gerrit.wikimedia.org/r/1005031 (https://phabricator.wikimedia.org/T357749) [09:21:10] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [09:21:10] (03CR) 10Hashar: match `pip2` path used by common `run.sh` (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1004754 (https://phabricator.wikimedia.org/T342346) (owner: 10Jaime Nuche) [09:21:26] (03PS3) 10Samtar: InitialiseSettings: Enable Edit Recovery on 3 projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004736 (https://phabricator.wikimedia.org/T355548) [09:21:44] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [09:21:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [09:22:05] (03PS9) 10Ayounsi: Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) [09:22:32] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[09:23:38] (03PS10) 10Ayounsi: Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) [09:23:56] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [09:24:48] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1005031 (https://phabricator.wikimedia.org/T357749) (owner: 10Slyngshede) [09:26:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2167.codfw.wmnet with OS bookworm [09:26:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 25%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57252 and previous config saved to /var/cache/conftool/dbconfig/20240220-092651-root.json [09:28:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P57253 and previous config saved to /var/cache/conftool/dbconfig/20240220-092827-marostegui.json [09:30:13] (03CR) 10CI reject: [V: 04-1] Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [09:33:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] mw-parsoid: Introduce it [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004157 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [09:33:47] (03PS1) 10Slyngshede: P:idp Add dummy OIDC secret for superset-next. 
[labs/private] - 10https://gerrit.wikimedia.org/r/1005034 [09:34:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T357189)', diff saved to https://phabricator.wikimedia.org/P57254 and previous config saved to /var/cache/conftool/dbconfig/20240220-093420-arnaudb.json [09:34:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [09:34:26] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [09:34:27] (03Merged) 10jenkins-bot: mw-parsoid: Introduce it [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004157 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [09:34:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [09:34:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T357189)', diff saved to https://phabricator.wikimedia.org/P57255 and previous config saved to /var/cache/conftool/dbconfig/20240220-093442-arnaudb.json [09:35:14] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [labs/private] - 10https://gerrit.wikimedia.org/r/1005034 (owner: 10Slyngshede) [09:35:35] (03CR) 10Slyngshede: [V: 03+2] P:idp Add dummy OIDC secret for superset-next. [labs/private] - 10https://gerrit.wikimedia.org/r/1005034 (owner: 10Slyngshede) [09:35:47] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] P:idp Add dummy OIDC secret for superset-next. 
[labs/private] - 10https://gerrit.wikimedia.org/r/1005034 (owner: 10Slyngshede) [09:36:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2168 (re)pooling @ 100%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57256 and previous config saved to /var/cache/conftool/dbconfig/20240220-093607-root.json [09:36:13] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [09:36:55] !log installing imagemagick security updates [09:36:58] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1394/co" [puppet] - 10https://gerrit.wikimedia.org/r/1005031 (https://phabricator.wikimedia.org/T357749) (owner: 10Slyngshede) [09:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T357189)', diff saved to https://phabricator.wikimedia.org/P57257 and previous config saved to /var/cache/conftool/dbconfig/20240220-093803-arnaudb.json [09:38:48] (PuppetZeroResources) firing: Puppet has failed generate resources on cumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:39:08] (03PS11) 10Ayounsi: Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) [09:39:48] (03CR) 10Ayounsi: "I can't figure out how to make the last test pass." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [09:41:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 50%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57258 and previous config saved to /var/cache/conftool/dbconfig/20240220-094156-root.json [09:42:03] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:idp Switch idp test host to Java 17. [puppet] - 10https://gerrit.wikimedia.org/r/1005031 (https://phabricator.wikimedia.org/T357749) (owner: 10Slyngshede) [09:42:18] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10User-aborrero: ACPI kernel failure on debian installer last step - https://phabricator.wikimedia.org/T357896#9558219 (10LSobanski) [09:43:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P57259 and previous config saved to /var/cache/conftool/dbconfig/20240220-094334-marostegui.json [09:46:38] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [09:46:57] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [09:49:55] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [09:53:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P57260 and previous config saved to /var/cache/conftool/dbconfig/20240220-095310-arnaudb.json [09:53:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 5%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57261 and previous config saved to /var/cache/conftool/dbconfig/20240220-095327-root.json [09:53:32] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [09:53:53] !log marostegui@cumin1002 dbctl 
commit (dc=all): 'Depool db2169', diff saved to https://phabricator.wikimedia.org/P57262 and previous config saved to /var/cache/conftool/dbconfig/20240220-095353-root.json [09:56:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2169.codfw.wmnet with OS bookworm [09:57:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 75%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57263 and previous config saved to /var/cache/conftool/dbconfig/20240220-095701-root.json [10:00:03] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [10:04:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2169 multiinstance', diff saved to https://phabricator.wikimedia.org/P57264 and previous config saved to /var/cache/conftool/dbconfig/20240220-100444-marostegui.json [10:04:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T355609)', diff saved to https://phabricator.wikimedia.org/P57265 and previous config saved to /var/cache/conftool/dbconfig/20240220-100449-marostegui.json [10:04:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [10:04:55] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [10:05:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [10:05:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T355609)', diff saved to https://phabricator.wikimedia.org/P57266 and previous config saved to /var/cache/conftool/dbconfig/20240220-100511-marostegui.json [10:06:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add db2169 to s6 depooled', diff saved to https://phabricator.wikimedia.org/P57267 and previous config saved to /var/cache/conftool/dbconfig/20240220-100623-marostegui.json [10:08:16] !log arnaudb@cumin1002 
dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P57268 and previous config saved to /var/cache/conftool/dbconfig/20240220-100816-arnaudb.json [10:08:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 10%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57269 and previous config saved to /var/cache/conftool/dbconfig/20240220-100832-root.json [10:08:37] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [10:10:09] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [10:10:24] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [10:12:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2138 (re)pooling @ 100%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57270 and previous config saved to /var/cache/conftool/dbconfig/20240220-101206-root.json [10:16:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2169.codfw.wmnet with reason: host reimage [10:18:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2169.codfw.wmnet with reason: host reimage [10:23:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T357189)', diff saved to https://phabricator.wikimedia.org/P57271 and previous config saved to /var/cache/conftool/dbconfig/20240220-102322-arnaudb.json [10:23:24] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [10:23:28] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [10:23:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 25%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57272 and previous config saved to 
/var/cache/conftool/dbconfig/20240220-102337-root.json [10:23:38] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [10:23:42] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [10:23:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T357189)', diff saved to https://phabricator.wikimedia.org/P57273 and previous config saved to /var/cache/conftool/dbconfig/20240220-102344-arnaudb.json [10:27:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T357189)', diff saved to https://phabricator.wikimedia.org/P57274 and previous config saved to /var/cache/conftool/dbconfig/20240220-102703-arnaudb.json [10:28:37] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:28:54] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:31:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T355609)', diff saved to https://phabricator.wikimedia.org/P57275 and previous config saved to /var/cache/conftool/dbconfig/20240220-103141-marostegui.json [10:31:47] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [10:34:04] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on cumin1001.eqiad.wmnet with reason: being taken down [10:34:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 5 days, 0:00:00 on cumin1001.eqiad.wmnet with reason: being taken down [10:34:28] PROBLEM - PyBal backends health check on lvs4008 is CRITICAL: PYBAL CRITICAL - CRITICAL - ncredirlb6_80: Servers ncredir4001.ulsfo.wmnet are marked down but pooled: ncredirlb_80: Servers ncredir4001.ulsfo.wmnet are marked down but pooled: ncredirlb_443: Servers ncredir4001.ulsfo.wmnet are marked down but pooled: ncredirlb6_443: Servers ncredir4001.ulsfo.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:36:30] RECOVERY - PyBal backends health check on lvs4008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:38:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 50%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57276 and previous config saved to /var/cache/conftool/dbconfig/20240220-103842-root.json [10:38:50] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [10:38:58] (ProbeDown) firing: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:39:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2169.codfw.wmnet with OS bookworm [10:39:28] looking [10:39:42] here if needed [10:42:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P57277 and previous config saved to /var/cache/conftool/dbconfig/20240220-104209-arnaudb.json [10:42:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 5%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57278 and previous config saved to 
/var/cache/conftool/dbconfig/20240220-104231-root.json [10:42:35] something is causing more than usual traffic/socket usage on the ncredir boxes in ulsfo [10:43:58] (ProbeDown) resolved: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:44:00] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:44:54] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:44:57] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:45:17] still looking [10:45:23] acked vo page [10:46:02] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:46:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2194', diff saved to https://phabricator.wikimedia.org/P57279 and previous config saved to /var/cache/conftool/dbconfig/20240220-104633-root.json [10:46:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P57280 and previous config saved to /var/cache/conftool/dbconfig/20240220-104647-marostegui.json [10:47:18] looks like there are a lot more requests to ncredir in ulsfo [10:47:40] PROBLEM - nova-compute proc minimum on cloudvirt1032 is CRITICAL: PROCS
CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [10:47:52] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:47:52] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 7.001 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:47:53] fabfur: Are the logs sent anywhere? [10:47:58] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51453 bytes in 6.623 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:48:11] claime: actually don't know, let me search [10:48:19] the wikitech page is out of date and has almost no information [10:48:27] I'm also searching for the logs atm [10:48:35] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:48:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2194.codfw.wmnet with OS bookworm [10:48:54] (ProbeDown) resolved: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:49:57] (ProbeDown) resolved: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:50:18] !log aborrero@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on cloudvirt1032.eqiad.wmnet with reason: nova-compute registration [10:50:32] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cloudvirt1032.eqiad.wmnet with reason: nova-compute registration [10:51:00] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:51:04] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:51:06] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:52:04] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:54:54] PROBLEM - PyBal backends health check on lvs4010 is CRITICAL: PYBAL CRITICAL - CRITICAL - ncredirlb6_80: Servers ncredir4002.ulsfo.wmnet are marked down but pooled: ncredirlb_80: Servers ncredir4002.ulsfo.wmnet are marked down but pooled: ncredirlb_443: Servers ncredir4002.ulsfo.wmnet are marked down but pooled: ncredirlb6_443: Servers ncredir4002.ulsfo.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:55:08] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:55:56] RECOVERY - PyBal backends health check on lvs4010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:56:29] !log Import CAS 6.6.12+wmf11u2 in apt-repo [10:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:37] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:59:01] PROBLEM - PyBal backends health check on lvs4010 is CRITICAL: PYBAL CRITICAL - CRITICAL - ncredirlb6_443: Servers ncredir4001.ulsfo.wmnet are marked down but pooled: ncredirlb_80: Servers ncredir4002.ulsfo.wmnet are marked down but pooled: ncredirlb_443: Servers ncredir4001.ulsfo.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:59:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2194 multi instance', diff saved to https://phabricator.wikimedia.org/P57281 and previous config saved to /var/cache/conftool/dbconfig/20240220-105959-marostegui.json [11:00:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to 
https://phabricator.wikimedia.org/P57282 and previous config saved to /var/cache/conftool/dbconfig/20240220-110004-arnaudb.json [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240220T1100) [11:00:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 75%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57283 and previous config saved to /var/cache/conftool/dbconfig/20240220-110008-root.json [11:00:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 10%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57284 and previous config saved to /var/cache/conftool/dbconfig/20240220-110011-root.json [11:00:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2190', diff saved to https://phabricator.wikimedia.org/P57285 and previous config saved to /var/cache/conftool/dbconfig/20240220-110020-root.json [11:00:43] PROBLEM - PyBal backends health check on lvs4008 is CRITICAL: PYBAL CRITICAL - CRITICAL - ncredirlb6_80: Servers ncredir4002.ulsfo.wmnet are marked down but pooled: ncredirlb_80: Servers ncredir4002.ulsfo.wmnet are marked down but pooled: ncredirlb6_443: Servers ncredir4002.ulsfo.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:00:54] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [11:00:58] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:01:07] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:01:09] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.323 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:01:15] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51453 bytes in 5.707 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:01:45] RECOVERY - PyBal backends health check on lvs4008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:01:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P57286 and previous config saved to /var/cache/conftool/dbconfig/20240220-110154-marostegui.json [11:03:25] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:04:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Place db2194 in s3 depooled T354826', diff saved to https://phabricator.wikimedia.org/P57287 and previous config saved to /var/cache/conftool/dbconfig/20240220-110444-marostegui.json [11:05:58] (ProbeDown) resolved: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:06:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2194.codfw.wmnet with reason: host reimage [11:08:05] RECOVERY - PyBal backends health check on lvs4010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:08:09] PROBLEM - Check unit 
status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:08:36] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: db2097 rebooted itself - https://phabricator.wikimedia.org/T357878#9558420 (10jcrespo) p:05Triage→03High [11:08:43] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: db2097 rebooted itself - https://phabricator.wikimedia.org/T357878#9558418 (10jcrespo) a:05Papaul→03jcrespo Papaul is on vacations, so it couldn't be him. :-( [11:08:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2194.codfw.wmnet with reason: host reimage [11:12:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:13:37] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:13:54] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:14:31] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2312.codfw.wmnet with OS bullseye [11:14:54] looking at the jobrunner errors [11:15:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T357189)', diff saved to https://phabricator.wikimedia.org/P57288 and 
previous config saved to /var/cache/conftool/dbconfig/20240220-111510-arnaudb.json [11:15:12] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance [11:15:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 25%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57289 and previous config saved to /var/cache/conftool/dbconfig/20240220-111516-root.json [11:15:17] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [11:15:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance [11:15:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2167 (re)pooling @ 100%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57290 and previous config saved to /var/cache/conftool/dbconfig/20240220-111525-root.json [11:15:26] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [11:15:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T357189)', diff saved to https://phabricator.wikimedia.org/P57291 and previous config saved to /var/cache/conftool/dbconfig/20240220-111531-arnaudb.json [11:17:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T355609)', diff saved to https://phabricator.wikimedia.org/P57292 and previous config saved to /var/cache/conftool/dbconfig/20240220-111700-marostegui.json [11:17:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1186.eqiad.wmnet with reason: Maintenance [11:17:05] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [11:17:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:17:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1186.eqiad.wmnet with reason: Maintenance [11:17:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T355609)', diff saved to https://phabricator.wikimedia.org/P57293 and previous config saved to /var/cache/conftool/dbconfig/20240220-111722-marostegui.json [11:17:31] possibly some issues with CirrusSearch jobs, jobrunners are getting 503s for cirrusSearchElasticaWrite [11:17:40] (KubernetesRsyslogDown) firing: rsyslog on mw2434:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2434 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:18:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T357189)', diff saved to https://phabricator.wikimedia.org/P57294 and previous config saved to /var/cache/conftool/dbconfig/20240220-111854-arnaudb.json [11:19:22] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2313.codfw.wmnet with OS bullseye [11:19:25] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2367.codfw.wmnet with OS bullseye [11:19:28] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2369.codfw.wmnet with OS bullseye [11:19:30] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2384.codfw.wmnet with OS bullseye [11:19:32] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2385.codfw.wmnet with OS bullseye [11:21:56] RECOVERY - nova-compute proc minimum on cloudvirt1032 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:21:58] (ProbeDown) firing: Service ncredir-https:443 has failed probes
(http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:26:57] (ProbeDown) resolved: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:28:37] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:28:54] (ProbeDown) resolved: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:29:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2194.codfw.wmnet with OS bookworm [11:29:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:29:33] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2384.codfw.wmnet with OS bullseye [11:29:39] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
mw2385.codfw.wmnet with OS bullseye [11:30:02] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2384.codfw.wmnet with OS bullseye [11:30:10] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2385.codfw.wmnet with OS bullseye [11:30:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 50%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57295 and previous config saved to /var/cache/conftool/dbconfig/20240220-113021-root.json [11:30:27] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [11:30:49] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2312.codfw.wmnet with reason: host reimage [11:33:25] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:33:35] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db2190.codfw.wmnet onto db2194.codfw.wmnet [11:33:38] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2312.codfw.wmnet with reason: host reimage [11:34:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P57296 and previous config saved to /var/cache/conftool/dbconfig/20240220-113401-arnaudb.json [11:34:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:35:05] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2369.codfw.wmnet with reason: host reimage [11:35:24] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime 
for 2:00:00 on mw2367.codfw.wmnet with reason: host reimage [11:37:21] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2313.codfw.wmnet with reason: host reimage [11:37:22] marostegui: Cannot execute Wikimedia\Rdbms\Database::runOnTransactionIdleCallbacks critical section while session state is out of sync. [11:37:30] ? [11:37:31] re: above MediaWikiHighErrorRate [11:37:46] Can it be something you're doing rn? [11:37:54] just asking, if not I'll dig [11:38:04] Let me check, but I doubt it [11:39:12] claime: this looks fine https://logstash.wikimedia.org/goto/8c7a440ebd236d6294f2ba72a20a54bc [11:39:13] sorry actually it's not that, wrong dashboard [11:39:14] claime: the MediaWikiHighErrorRate alert for jobrunners is because something in cirrussearch is broken [11:39:15] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:39:17] Everything seems to be normal with dbs [11:39:32] marostegui: yeah sorry, what hnowlan said above [11:39:37] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2367.codfw.wmnet with reason: host reimage [11:39:50] claime: no problem! [11:39:56] better to ask! [11:40:13] I've mentioned it to search, trying to find a useful dashboard to see what's failing [11:40:28] (doesn't help that there's like 3 different opensearch dashboards for mediawiki errors...) [11:40:33] or 5 [11:40:34] or 10 [11:40:36] marostegui: Is it OK to deploy cxserver now or should I wait till current window is over? [11:42:00] kart_: DBAs do not have any maintenance window now.
However, double check with claime and hnowlan as they are troubleshooting issues at the moment [11:42:11] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2369.codfw.wmnet with reason: host reimage [11:42:14] Sure. [11:43:07] claime: hnowlan: Can I deploy cxserver now? (And, probably MinT after that if that's OK too) [11:43:43] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:43:43] PROBLEM - SSH on mw2379 is CRITICAL: connect to address 10.192.5.5 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:43:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T355609)', diff saved to https://phabricator.wikimedia.org/P57297 and previous config saved to /var/cache/conftool/dbconfig/20240220-114349-marostegui.json [11:43:55] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [11:43:57] kart_: I don't see why not [11:44:39] Cool. Thanks! [11:45:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57298 and previous config saved to /var/cache/conftool/dbconfig/20240220-114526-root.json [11:45:31] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [11:45:43] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:45:43] RECOVERY - SSH on mw2379 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:46:07] Looks like wikibugs is absent? [11:47:25] jobqueue cirrussearch issues appear to have stabilised for now. 
couldn't get a proper view of search health while it was broken though :/ [11:47:45] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:48:10] kart_: unfortunately, yes: https://phabricator.wikimedia.org/T357729 [11:49:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P57299 and previous config saved to /var/cache/conftool/dbconfig/20240220-114906-arnaudb.json [11:49:19] it should rejoin when it has something to say, per -cloud [11:50:33] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [11:50:35] !log updating pdns-recursor to 4.8.6-1 on doh* hosts [11:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:00] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [11:51:53] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2312.codfw.wmnet with OS bullseye [11:52:59] It's not absent though? 
[11:54:35] (03CR) 10Majavah: "wikibugs test please ignore" [puppet] - 10https://gerrit.wikimedia.org/r/935093 (https://phabricator.wikimedia.org/T337259) (owner: 10Majavah) [11:54:37] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [11:54:43] PROBLEM - SSH on mw2379 is CRITICAL: connect to address 10.192.5.5 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:54:57] claime: ^ there it is [11:55:12] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [11:55:19] thanks :) [11:55:30] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2313.codfw.wmnet with OS bullseye [11:55:43] RECOVERY - SSH on mw2379 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:57:30] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for sdeckelmann-wmf - https://phabricator.wikimedia.org/T357847#9558593 (10mark) This is approved. 
[11:57:59] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2367.codfw.wmnet with OS bullseye [11:58:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P57300 and previous config saved to /var/cache/conftool/dbconfig/20240220-115855-marostegui.json [11:59:06] (03PS2) 10Ssingh: conftool: update schema for dnsbox for anycast authdns setups [puppet] - 10https://gerrit.wikimedia.org/r/1004205 (https://phabricator.wikimedia.org/T347054) [11:59:30] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:00:07] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:00:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57301 and previous config saved to /var/cache/conftool/dbconfig/20240220-120031-root.json [12:00:59] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2369.codfw.wmnet with OS bullseye [12:01:18] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [12:01:48] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9558616 (10aborrero) [12:02:14] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2384.codfw.wmnet with OS bullseye [12:02:24] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2384.codfw.wmnet with OS bullseye [12:02:27] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2385.codfw.wmnet with OS bullseye [12:02:43] PROBLEM - SSH on mw2379 is CRITICAL: connect to address 10.192.5.5 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:02:43] 
PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:03:01] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw2385.codfw.wmnet with OS bullseye [12:03:25] (SystemdUnitFailed) resolved: (2) httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:04:12] !log cxserver: Update to 2024-02-15-085232-production + Bump mesh.configuration to 1.7 (T333969, T352747, T355686, T255568) [12:04:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T357189)', diff saved to https://phabricator.wikimedia.org/P57302 and previous config saved to /var/cache/conftool/dbconfig/20240220-120412-arnaudb.json [12:04:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [12:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [12:04:34] T333969: Enable Opus models for languages lacking other Machine Translation options - https://phabricator.wikimedia.org/T333969 [12:04:34] T352747: Google is not listed as an option for Norwegian - https://phabricator.wikimedia.org/T352747 [12:04:34] T355686: Configure mesh listeners to allow IPv6 localhost (::) as well as IPv4 (127.0.0.1) - https://phabricator.wikimedia.org/T355686 [12:04:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1213 (T357189)', diff saved to https://phabricator.wikimedia.org/P57303 and previous config saved to /var/cache/conftool/dbconfig/20240220-120434-arnaudb.json [12:04:35] 
T255568: Envoy should listen on ipv6 and ipv4 - https://phabricator.wikimedia.org/T255568 [12:04:43] RECOVERY - SSH on mw2379 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:04:43] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:04:52] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [12:07:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T357189)', diff saved to https://phabricator.wikimedia.org/P57304 and previous config saved to /var/cache/conftool/dbconfig/20240220-120752-arnaudb.json [12:08:09] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:09:45] PROBLEM - SSH on mw2379 is CRITICAL: connect to address 10.192.5.5 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:09:45] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:12:41] hnowlan: those jobs are about asynchronous indexing, so it's unlikely to have a direct effect on Search (outside of a few pages not being searchable, or ranking not being accurate). Still needs to be looked at and fixed, but not an emergency. 
[12:13:33] (KubernetesCalicoDown) firing: mw2379.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2379.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:14:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P57305 and previous config saved to /var/cache/conftool/dbconfig/20240220-121402-marostegui.json [12:14:31] hmm mw2379 lost ipv4 bgp [12:16:58] !log Draining mw2379 [12:17:01] gehel: ack, thanks [12:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:37] claime: that's very suspicious - that host had bgp issues once before, something is up there. [12:17:42] yeah [12:17:43] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:18:06] topranks and I were looking into issues with that host when it was first being imaged too [12:18:43] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2385.codfw.wmnet with OS bullseye [12:18:51] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2384.codfw.wmnet with OS bullseye [12:18:58] * topranks looking [12:19:31] topranks: I'm draining and cordoning it, I was about to delete the calico pod to see if it reconnected, but I'm leaving it for you then [12:22:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P57306 and previous config saved to /var/cache/conftool/dbconfig/20240220-122258-arnaudb.json [12:25:51] claime: something odd with the networking on that host, still trying to work it out [12:25:57] can't ping the v4 gateway for 
some reason [12:27:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2434:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2434 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:29:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T355609)', diff saved to https://phabricator.wikimedia.org/P57307 and previous config saved to /var/cache/conftool/dbconfig/20240220-122907-marostegui.json [12:29:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1196.eqiad.wmnet with reason: Maintenance [12:29:14] the other two hosts connected to the same switch /subnet (mw2380 and mw2383) seem fine [12:29:18] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [12:29:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1196.eqiad.wmnet with reason: Maintenance [12:29:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:29:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:29:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T355609)', diff saved to https://phabricator.wikimedia.org/P57308 and previous config saved to /var/cache/conftool/dbconfig/20240220-122947-marostegui.json [12:36:39] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 294.7k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig 
[12:37:49] RECOVERY - SSH on mw2379 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:37:53] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:38:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P57309 and previous config saved to /var/cache/conftool/dbconfig/20240220-123804-arnaudb.json [12:38:33] (KubernetesCalicoDown) resolved: mw2379.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2379.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:46:39] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 204.7k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [12:46:43] (03PS1) 10Cathal Mooney: Change lvs2014 IP on private1-a3-codfw vlan [puppet] - 10https://gerrit.wikimedia.org/r/1005061 [12:48:03] !log marostegui@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet,service=s5 [12:48:07] !log marostegui@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet,service=s8 [12:50:07] (03PS1) 10Marostegui: clouddb1020: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1005062 (https://phabricator.wikimedia.org/T356838) [12:52:15] (03CR) 10Marostegui: "The host is depooled already" [puppet] - 10https://gerrit.wikimedia.org/r/1005062 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui) [12:53:11] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T357189)', diff saved to https://phabricator.wikimedia.org/P57310 and previous config saved to /var/cache/conftool/dbconfig/20240220-125311-arnaudb.json [12:53:13] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [12:53:20] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [12:53:27] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [12:54:47] (03CR) 10Majavah: [C: 03+1] "This seems to match other MariaDB version update patches, and the node is indeed depooled." [puppet] - 10https://gerrit.wikimedia.org/r/1005062 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui) [12:54:57] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Maintenance [12:55:07] (03CR) 10Marostegui: "Thanks, I have a meeting in 5 minutes, will upgrade + repool once done" [puppet] - 10https://gerrit.wikimedia.org/r/1005062 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui) [12:55:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Maintenance [12:55:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T357189)', diff saved to https://phabricator.wikimedia.org/P57311 and previous config saved to /var/cache/conftool/dbconfig/20240220-125516-arnaudb.json [12:57:20] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T355170#9558781 (10ayounsi) 05Stalled→03Resolved Not sure I understand this ticket, please re-open if needed or follow up in the other one. 
[12:58:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T355609)', diff saved to https://phabricator.wikimedia.org/P57312 and previous config saved to /var/cache/conftool/dbconfig/20240220-125814-marostegui.json [12:58:19] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [12:58:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T357189)', diff saved to https://phabricator.wikimedia.org/P57313 and previous config saved to /var/cache/conftool/dbconfig/20240220-125835-arnaudb.json [12:58:40] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240220T1300) [13:07:37] (03PS1) 10Ayounsi: Add sdeckelmann-wmf to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1005064 (https://phabricator.wikimedia.org/T357847) [13:08:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2190.codfw.wmnet onto db2194.codfw.wmnet [13:09:19] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove cumin1001 from alertmanager config and tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/1005054 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [13:10:36] (03CR) 10Slyngshede: [C: 03+2] Puppetmaster: Alert when unmerged changes exists in Puppet repo. [alerts] - 10https://gerrit.wikimedia.org/r/1003761 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:12:08] (03Merged) 10jenkins-bot: Puppetmaster: Alert when unmerged changes exists in Puppet repo. 
[alerts] - 10https://gerrit.wikimedia.org/r/1003761 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:13:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P57314 and previous config saved to /var/cache/conftool/dbconfig/20240220-131320-marostegui.json [13:13:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P57315 and previous config saved to /var/cache/conftool/dbconfig/20240220-131341-arnaudb.json [13:14:37] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1005064 (https://phabricator.wikimedia.org/T357847) (owner: 10Ayounsi) [13:17:00] (03PS1) 10Arturo Borrero Gonzalez: openstack: nova-compute: persist compute node id [puppet] - 10https://gerrit.wikimedia.org/r/1005065 (https://phabricator.wikimedia.org/T357631) [13:18:09] (03CR) 10CI reject: [V: 04-1] openstack: nova-compute: persist compute node id [puppet] - 10https://gerrit.wikimedia.org/r/1005065 (https://phabricator.wikimedia.org/T357631) (owner: 10Arturo Borrero Gonzalez) [13:20:03] (03CR) 10Ayounsi: [C: 03+2] Add sdeckelmann-wmf to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1005064 (https://phabricator.wikimedia.org/T357847) (owner: 10Ayounsi) [13:22:21] (03PS1) 10Slyngshede: R:idp_test Upgrade IDP test to Java 17 [puppet] - 10https://gerrit.wikimedia.org/r/1005086 (https://phabricator.wikimedia.org/T357749) [13:24:43] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1396/console" [puppet] - 10https://gerrit.wikimedia.org/r/1005086 (https://phabricator.wikimedia.org/T357749) (owner: 10Slyngshede) [13:25:27] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for sdeckelmann-wmf - 
https://phabricator.wikimedia.org/T357847#9558840 (10ayounsi) 05Open→03Resolved a:03ayounsi You should be good to go ! Please re-open if any issues. [13:25:58] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1397/co" [puppet] - 10https://gerrit.wikimedia.org/r/1005086 (https://phabricator.wikimedia.org/T357749) (owner: 10Slyngshede) [13:26:46] (03CR) 10Slyngshede: [V: 03+1] "We'll still need to go in an manually remove the Java 11 headless package." [puppet] - 10https://gerrit.wikimedia.org/r/1005086 (https://phabricator.wikimedia.org/T357749) (owner: 10Slyngshede) [13:28:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P57316 and previous config saved to /var/cache/conftool/dbconfig/20240220-132827-marostegui.json [13:28:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P57317 and previous config saved to /var/cache/conftool/dbconfig/20240220-132848-arnaudb.json [13:29:12] (03CR) 10Filippo Giunchedi: Load average for Swift cluster (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1004619 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:29:27] 10SRE, 10Infrastructure-Foundations: Migrate Spicerack logs from cumin1001 to cumin1002? - https://phabricator.wikimedia.org/T353523#9558864 (10MoritzMuehlenhoff) >>! In T353523#9412919, @Volans wrote: > Yes, it would be nice to sync `/var/log/spicerack/` and `/var/log/cumin` from `cumin1001` to `cumin1002` wh... [13:30:56] (03CR) 10Muehlenhoff: [C: 03+1] "That's expected and fine. 
It was even an intentional choice of profile::java to ensure that we don't break deployments not fully managed b" [puppet] - 10https://gerrit.wikimedia.org/r/1005086 (https://phabricator.wikimedia.org/T357749) (owner: 10Slyngshede) [13:32:17] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] R:idp_test Upgrade IDP test to Java 17 [puppet] - 10https://gerrit.wikimedia.org/r/1005086 (https://phabricator.wikimedia.org/T357749) (owner: 10Slyngshede) [13:32:41] (03CR) 10Ayounsi: Cookbook to renumber a host while changing its vlan (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [13:33:15] (03PS1) 10Marostegui: Revert "db2190: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1005067 [13:34:01] (03PS2) 10Arturo Borrero Gonzalez: openstack: nova-compute: persist compute node id [puppet] - 10https://gerrit.wikimedia.org/r/1005065 (https://phabricator.wikimedia.org/T357631) [13:34:25] (03CR) 10Muehlenhoff: [C: 03+2] Remove cumin1001 from alertmanager config and tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/1005054 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [13:34:59] (03CR) 10Ayounsi: Cookbook to renumber a host while changing its vlan (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [13:35:14] (03CR) 10CI reject: [V: 04-1] openstack: nova-compute: persist compute node id [puppet] - 10https://gerrit.wikimedia.org/r/1005065 (https://phabricator.wikimedia.org/T357631) (owner: 10Arturo Borrero Gonzalez) [13:36:07] (03CR) 10Hashar: [C: 03+1] "I have successfully built the image locally and even tested it with our `integration/zuul/deploy`. 
Thank you for the cleanup of the mess" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1004754 (https://phabricator.wikimedia.org/T342346) (owner: 10Jaime Nuche) [13:37:29] (03CR) 10Ayounsi: [C: 03+1] Enable BGP session status change logs on l3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/1004747 (owner: 10Cathal Mooney) [13:42:53] RECOVERY - MariaDB Replica Lag: x1 on db2097 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:42:53] RECOVERY - mysqld processes on db2097 is OK: PROCS OK: 3 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [13:42:53] RECOVERY - MariaDB Replica SQL: x1 on db2097 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:42:55] RECOVERY - MariaDB Replica SQL: s2 on db2097 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:42:57] RECOVERY - MariaDB Replica Lag: s6 on db2097 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:42:57] RECOVERY - MariaDB Replica IO: x1 on db2097 is OK: OK slave_io_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:42:59] RECOVERY - MariaDB Replica IO: s6 on db2097 is OK: OK slave_io_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:42:59] RECOVERY - MariaDB Replica Lag: s2 on db2097 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:43:01] RECOVERY - MariaDB Replica SQL: s6 on db2097 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:43:11] (03CR) 10Jaime Nuche: "Thanks for the extra testing!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1004754 (https://phabricator.wikimedia.org/T342346) (owner: 10Jaime Nuche) [13:43:25] RECOVERY - MariaDB read only s6 on db2097 is OK: Version 10.6.16-MariaDB, Uptime 52s, read_only: True, event_scheduler: True, 11.61 QPS, connection latency: 0.011510s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:43:25] RECOVERY - MariaDB read only x1 on db2097 is OK: Version 10.6.16-MariaDB, Uptime 48s, read_only: True, event_scheduler: True, 13.43 QPS, connection latency: 0.021238s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:43:25] RECOVERY - MariaDB read only s2 on db2097 is OK: Version 10.6.16-MariaDB, Uptime 1s, read_only: True, event_scheduler: True, 11.49 QPS, connection latency: 0.013861s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:43:27] RECOVERY - MariaDB Replica IO: s2 on db2097 is OK: OK slave_io_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:43:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T355609)', diff saved to https://phabricator.wikimedia.org/P57318 and previous config saved to /var/cache/conftool/dbconfig/20240220-134334-marostegui.json [13:43:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1206.eqiad.wmnet with reason: Maintenance [13:43:39] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [13:43:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1206.eqiad.wmnet with reason: Maintenance [13:43:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T357189)', diff saved to https://phabricator.wikimedia.org/P57319 and previous config saved to /var/cache/conftool/dbconfig/20240220-134354-arnaudb.json 
[13:43:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance [13:44:00] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [13:44:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T355609)', diff saved to https://phabricator.wikimedia.org/P57320 and previous config saved to /var/cache/conftool/dbconfig/20240220-134403-marostegui.json [13:44:09] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance [13:45:26] (03CR) 10Marostegui: [C: 03+2] Revert "db2190: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1005067 (owner: 10Marostegui) [13:45:33] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [13:45:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [13:45:49] moritzm: ok to merge? 
[13:46:08] yes, please [13:47:01] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance [13:47:07] moritzm: done [13:47:18] !log setting up mariadb instances at db2097 [13:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:22] thx [13:47:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance [13:47:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 5%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57321 and previous config saved to /var/cache/conftool/dbconfig/20240220-134734-root.json [13:47:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 1%: After migration', diff saved to https://phabricator.wikimedia.org/P57322 and previous config saved to /var/cache/conftool/dbconfig/20240220-134742-root.json [13:47:52] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [13:48:13] (03PS1) 10Fabfur: haproxy: initial work to support easy-ratelimiting [puppet] - 10https://gerrit.wikimedia.org/r/1005089 (https://phabricator.wikimedia.org/T306580) [13:48:29] (03PS1) 10Muehlenhoff: Remove cumin1001 from Ganeti config [puppet] - 10https://gerrit.wikimedia.org/r/1005090 (https://phabricator.wikimedia.org/T353419) [13:48:51] (03PS1) 10Marostegui: db2194: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1005091 [13:49:22] (03CR) 10CI reject: [V: 04-1] haproxy: initial work to support easy-ratelimiting [puppet] - 10https://gerrit.wikimedia.org/r/1005089 (https://phabricator.wikimedia.org/T306580) (owner: 10Fabfur) [13:49:36] (03CR) 10Marostegui: [C: 03+2] clouddb1020: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1005062 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui) [13:49:37] !log arnaudb@cumin1002 START 
- Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [13:49:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [13:49:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2111 (T357189)', diff saved to https://phabricator.wikimedia.org/P57323 and previous config saved to /var/cache/conftool/dbconfig/20240220-134958-arnaudb.json [13:50:05] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [13:50:11] (03CR) 10Marostegui: [C: 03+2] db2194: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1005091 (owner: 10Marostegui) [13:51:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 5%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57324 and previous config saved to /var/cache/conftool/dbconfig/20240220-135104-root.json [13:51:44] (03PS2) 10Fabfur: haproxy: initial work to support easy-ratelimiting [puppet] - 10https://gerrit.wikimedia.org/r/1005089 (https://phabricator.wikimedia.org/T306580) [13:52:34] ACKNOWLEDGEMENT - MariaDB Replica Lag: s2 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 292921.25 seconds Jcrespo catching up: T357878 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:52:34] ACKNOWLEDGEMENT - MariaDB Replica Lag: s6 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302607.43 seconds Jcrespo catching up: T357878 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:52:34] ACKNOWLEDGEMENT - MariaDB Replica Lag: x1 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 416037.07 seconds Jcrespo catching up: T357878 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:52:36] (03CR) 10Jgiannelos: mobileapps: add cassandra config in staging (031 comment) 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/993154 (https://phabricator.wikimedia.org/T350507) (owner: 10Hnowlan) [13:52:57] (03CR) 10CI reject: [V: 04-1] haproxy: initial work to support easy-ratelimiting [puppet] - 10https://gerrit.wikimedia.org/r/1005089 (https://phabricator.wikimedia.org/T306580) (owner: 10Fabfur) [13:54:06] !log marostegui@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=s8 [13:54:09] !log marostegui@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=s5 [13:54:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T357189)', diff saved to https://phabricator.wikimedia.org/P57325 and previous config saved to /var/cache/conftool/dbconfig/20240220-135420-arnaudb.json [13:55:10] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2005.codfw.wmnet with reason: sretest [13:55:34] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2005.codfw.wmnet with reason: sretest [13:55:50] (03PS3) 10Fabfur: haproxy: initial work to support easy-ratelimiting [puppet] - 10https://gerrit.wikimedia.org/r/1005089 (https://phabricator.wikimedia.org/T306580) [13:57:14] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1400/co" [puppet] - 10https://gerrit.wikimedia.org/r/1005089 (https://phabricator.wikimedia.org/T306580) (owner: 10Fabfur) [13:59:26] (03PS4) 10Fabfur: haproxy: initial work to support easy-ratelimiting [puppet] - 10https://gerrit.wikimedia.org/r/1005089 (https://phabricator.wikimedia.org/T306580) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240220T1400). 
nyaa~ [14:00:05] No Gerrit patches in the queue for this window AFAICS. [14:00:32] yup, looks like nothing to deploy [14:02:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 10%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57326 and previous config saved to /var/cache/conftool/dbconfig/20240220-140239-root.json [14:02:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 5%: After migration', diff saved to https://phabricator.wikimedia.org/P57327 and previous config saved to /var/cache/conftool/dbconfig/20240220-140247-root.json [14:02:56] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [14:03:35] (SystemdUnitFailed) firing: netbox_ganeti_codfw02_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:03:49] (03CR) 10Ayounsi: [V: 03+1] "FYI, I manually tested the command and it works as expected, so this is ready to be deployed anytime." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1003491 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [14:05:02] (03PS5) 10Fabfur: haproxy: initial work to support easy-ratelimiting [puppet] - 10https://gerrit.wikimedia.org/r/1005089 (https://phabricator.wikimedia.org/T306580) [14:05:05] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [14:06:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 10%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57328 and previous config saved to /var/cache/conftool/dbconfig/20240220-140609-root.json [14:06:20] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1402/co" [puppet] - 10https://gerrit.wikimedia.org/r/1005089 (https://phabricator.wikimedia.org/T306580) (owner: 10Fabfur) [14:08:00] (03PS6) 10Fabfur: haproxy: initial work to support easy-ratelimiting [puppet] - 10https://gerrit.wikimedia.org/r/1005089 (https://phabricator.wikimedia.org/T306580) [14:09:16] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1403/co" [puppet] - 10https://gerrit.wikimedia.org/r/1005089 (https://phabricator.wikimedia.org/T306580) (owner: 10Fabfur) [14:09:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P57329 and previous config saved to /var/cache/conftool/dbconfig/20240220-140926-arnaudb.json [14:15:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T355609)', diff saved to https://phabricator.wikimedia.org/P57330 and previous config saved to /var/cache/conftool/dbconfig/20240220-141525-marostegui.json [14:15:28] !log Uncordoning mw2379 [14:15:49] T355609: Make 
cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [14:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:18] (03CR) 10Ayounsi: [C: 03+1] Remove cumin1001 from Ganeti config [puppet] - 10https://gerrit.wikimedia.org/r/1005090 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [14:17:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 25%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57331 and previous config saved to /var/cache/conftool/dbconfig/20240220-141744-root.json [14:17:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 10%: After migration', diff saved to https://phabricator.wikimedia.org/P57332 and previous config saved to /var/cache/conftool/dbconfig/20240220-141752-root.json [14:18:00] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [14:18:35] (SystemdUnitFailed) resolved: netbox_ganeti_codfw02_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:19:47] !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2005.codfw.wmnet with OS bookworm [14:20:31] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [14:20:39] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] use `pip` of current Python installation in common `run.sh` [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1004754 (https://phabricator.wikimedia.org/T342346) (owner: 10Jaime Nuche) [14:21:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 25%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57333 and previous config saved to 
/var/cache/conftool/dbconfig/20240220-142114-root.json [14:21:46] !log launching build-production-images - T342346 [14:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:03] T342346: Refresh integration/zuul/deploy to work on Debian Bullseye - https://phabricator.wikimedia.org/T342346 [14:23:05] (03CR) 10Jelto: [C: 03+2] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004344 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [14:23:13] (03PS3) 10Arturo Borrero Gonzalez: openstack: nova-compute: persist compute node id [puppet] - 10https://gerrit.wikimedia.org/r/1005065 (https://phabricator.wikimedia.org/T357631) [14:24:07] (03Merged) 10jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004344 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [14:24:09] (03CR) 10Herron: [C: 03+1] "LGTM! one non blocking question inline" [puppet] - 10https://gerrit.wikimedia.org/r/1004680 (owner: 10Filippo Giunchedi) [14:24:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P57334 and previous config saved to /var/cache/conftool/dbconfig/20240220-142433-arnaudb.json [14:25:09] (03PS1) 10Slyngshede: P:idp Force Tomcat to use the default Java installation. [puppet] - 10https://gerrit.wikimedia.org/r/1005094 (https://phabricator.wikimedia.org/T357749) [14:25:40] (03CR) 10Ssingh: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1005061 (owner: 10Cathal Mooney) [14:26:27] (03PS2) 10Slyngshede: P:idp Force Tomcat to use the default Java installation. 
[puppet] - 10https://gerrit.wikimedia.org/r/1005094 (https://phabricator.wikimedia.org/T357749) [14:26:42] (03PS1) 10Muehlenhoff: Add puppetised java.security config file for hardened TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/1005095 (https://phabricator.wikimedia.org/T357749) [14:26:56] (03CR) 10Muehlenhoff: [C: 03+2] Remove cumin1001 from Ganeti config [puppet] - 10https://gerrit.wikimedia.org/r/1005090 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [14:28:46] (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm, profile::etherpad::service_ensure is set to stopped for etherpad1004 so it should not be started." [puppet] - 10https://gerrit.wikimedia.org/r/999973 (https://phabricator.wikimedia.org/T316421) (owner: 10Dzahn) [14:28:59] PROBLEM - Host mr1-esams.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:30:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P57336 and previous config saved to /var/cache/conftool/dbconfig/20240220-143032-marostegui.json [14:31:59] (03CR) 10Cathal Mooney: [C: 03+2] Change lvs2014 IP on private1-a3-codfw vlan [puppet] - 10https://gerrit.wikimedia.org/r/1005061 (owner: 10Cathal Mooney) [14:32:39] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1404/co" [puppet] - 10https://gerrit.wikimedia.org/r/1005094 (https://phabricator.wikimedia.org/T357749) (owner: 10Slyngshede) [14:32:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 50%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57337 and previous config saved to /var/cache/conftool/dbconfig/20240220-143249-root.json [14:32:55] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [14:32:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 
25%: After migration', diff saved to https://phabricator.wikimedia.org/P57338 and previous config saved to /var/cache/conftool/dbconfig/20240220-143258-root.json [14:34:27] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 105 probes of 735 (alerts on 90) - https://atlas.ripe.net/measurements/59935539/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:35:13] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 181 probes of 730 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:35:32] (03PS3) 10Slyngshede: P:idp Force Tomcat to use the default Java installation. [puppet] - 10https://gerrit.wikimedia.org/r/1005094 (https://phabricator.wikimedia.org/T357749) [14:36:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 50%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57339 and previous config saved to /var/cache/conftool/dbconfig/20240220-143619-root.json [14:37:40] (03CR) 10Jelto: [V: 03+1 C: 03+2] site: add etherpad role to etherpad1004 [puppet] - 10https://gerrit.wikimedia.org/r/999973 (https://phabricator.wikimedia.org/T316421) (owner: 10Dzahn) [14:37:43] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 178 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:37:55] (03PS4) 10Slyngshede: P:idp Force Tomcat to use the default Java installation. 
[puppet] - 10https://gerrit.wikimedia.org/r/1005094 (https://phabricator.wikimedia.org/T357749) [14:38:36] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:59] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1406/co" [puppet] - 10https://gerrit.wikimedia.org/r/1005094 (https://phabricator.wikimedia.org/T357749) (owner: 10Slyngshede) [14:39:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T357189)', diff saved to https://phabricator.wikimedia.org/P57340 and previous config saved to /var/cache/conftool/dbconfig/20240220-143939-arnaudb.json [14:39:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [14:39:51] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [14:39:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [14:40:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2123 (T357189)', diff saved to https://phabricator.wikimedia.org/P57341 and previous config saved to /var/cache/conftool/dbconfig/20240220-144001-arnaudb.json [14:43:24] (03PS5) 10Slyngshede: P:idp Force Tomcat to use the default Java installation. 
[puppet] - 10https://gerrit.wikimedia.org/r/1005094 (https://phabricator.wikimedia.org/T357749) [14:44:07] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services, 10Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421#9559322 (10Jelto) [14:44:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T357189)', diff saved to https://phabricator.wikimedia.org/P57342 and previous config saved to /var/cache/conftool/dbconfig/20240220-144414-arnaudb.json [14:44:31] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db1231.eqiad.wmnet [14:45:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P57343 and previous config saved to /var/cache/conftool/dbconfig/20240220-144539-marostegui.json [14:46:12] !log updating pdns-recursor to 4.8.6-1 on dns* [14:46:13] (03PS6) 10Slyngshede: P:idp Force Tomcat to use the default Java installation. 
[puppet] - 10https://gerrit.wikimedia.org/r/1005094 (https://phabricator.wikimedia.org/T357749) [14:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:26] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: Trigger savepoints in production envs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004199 (https://phabricator.wikimedia.org/T348685) (owner: 10Bking) [14:46:51] (03PS1) 10Muehlenhoff: Remove grants for cumin1001 [puppet] - 10https://gerrit.wikimedia.org/r/1005101 (https://phabricator.wikimedia.org/T353419) [14:47:23] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1408/co" [puppet] - 10https://gerrit.wikimedia.org/r/1005094 (https://phabricator.wikimedia.org/T357749) (owner: 10Slyngshede) [14:47:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 75%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57344 and previous config saved to /var/cache/conftool/dbconfig/20240220-144753-root.json [14:47:55] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:47:59] !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: dc=codfw,name=cp20(29|30).codfw.wmnet [14:47:59] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:48:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 50%: After migration', diff saved to https://phabricator.wikimedia.org/P57345 and previous config saved to /var/cache/conftool/dbconfig/20240220-144803-root.json [14:48:06] T354826: Re-arrange core multi-instance hosts - https://phabricator.wikimedia.org/T354826 [14:48:09] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:48:12] !log bking@deploy2002 helmfile [staging] DONE 
helmfile.d/services/rdf-streaming-updater: apply [14:48:35] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:48:45] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1231.eqiad.wmnet [14:49:13] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2029-2030].codfw.wmnet with reason: T355867 [14:49:18] T355867: Migrate servers in codfw rack A7 from asw-a7-codfw to lsw1-a7-codfw - https://phabricator.wikimedia.org/T355867 [14:49:21] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: OpenSent - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:49:21] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [14:49:30] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2029-2030].codfw.wmnet with reason: T355867 [14:49:37] RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 83.59 ms [14:49:56] !log disable puppet on A:cp to merge CR 1004126 [14:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:01] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [14:50:40] (ProbeDown) firing: (2) Service etherpad1004:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:51:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 75%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57346 and previous config saved to /var/cache/conftool/dbconfig/20240220-145124-root.json 
[14:52:43] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 88 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:53:19] 10SRE, 10Infrastructure-Foundations, 10netops: BGP peering from LSW to K8s hosts using loopback IP not IRB - https://phabricator.wikimedia.org/T357619#9559391 (10cmooney) [14:53:35] (SystemdUnitFailed) firing: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:54:10] 10SRE, 10Infrastructure-Foundations, 10netops: Update K8S BGP groups eqiad row e-f - https://phabricator.wikimedia.org/T357924#9559389 (10cmooney) 05Open→03Resolved All done, used statics to avoid any disruption to forwarding. [14:54:20] (03CR) 10Ssingh: [C: 03+2] P:cache::haproxy: add boolean to install component/haproxy26 [puppet] - 10https://gerrit.wikimedia.org/r/1004126 (https://phabricator.wikimedia.org/T352744) (owner: 10Ssingh) [14:54:53] (03PS1) 10Slyngshede: IDP: Add superset_k8s dummy secret. [labs/private] - 10https://gerrit.wikimedia.org/r/1005102 [14:55:04] !log bking@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [14:55:09] !log bking@deploy2002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [14:56:01] PROBLEM - Host mr1-esams.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:57:16] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [14:57:23] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [14:57:53] (03CR) 10Jaime Nuche: "Thanks Clément!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1004754 (https://phabricator.wikimedia.org/T342346) (owner: 10Jaime Nuche) [14:58:13] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2114.codfw.wmnet [14:58:20] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [labs/private] - 10https://gerrit.wikimedia.org/r/1005102 (owner: 10Slyngshede) [14:58:36] (JobUnavailable) firing: (2) Reduced availability for job etherpad in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:47] !log silencing etherpad1004.* until service installation is finished - T316421 [14:59:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P57347 and previous config saved to /var/cache/conftool/dbconfig/20240220-145920-arnaudb.json [14:59:50] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 189 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:00:02] !log bking@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [15:00:13] (03CR) 10Marostegui: "I will take care of merging this, as it needs grants removal from production" [puppet] - 10https://gerrit.wikimedia.org/r/1005101 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [15:00:13] !log bking@deploy2002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [15:00:31] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] IDP: Add superset_k8s dummy secret. 
[labs/private] - 10https://gerrit.wikimedia.org/r/1005102 (owner: 10Slyngshede) [15:00:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T355609)', diff saved to https://phabricator.wikimedia.org/P57348 and previous config saved to /var/cache/conftool/dbconfig/20240220-150046-marostegui.json [15:00:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1207.eqiad.wmnet with reason: Maintenance [15:00:49] (PuppetDisabled) firing: Puppet disabled on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=ganeti&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [15:01:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1207.eqiad.wmnet with reason: Maintenance [15:01:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T355609)', diff saved to https://phabricator.wikimedia.org/P57349 and previous config saved to /var/cache/conftool/dbconfig/20240220-150108-marostegui.json [15:01:27] jouncebot: next [15:01:27] In 0 hour(s) and 58 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240220T1600) [15:02:11] i missed the backport window :( but i've got a quick config patch to patch if anyone is still here & bored [15:02:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2114.codfw.wmnet [15:02:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 100%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57350 and previous config saved to /var/cache/conftool/dbconfig/20240220-150258-root.json [15:03:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 75%: After migration', diff saved to https://phabricator.wikimedia.org/P57351 and previous config 
saved to /var/cache/conftool/dbconfig/20240220-150307-root.json [15:03:36] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:04:52] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db[2146,2151].codfw.wmnet [15:05:00] (03PS1) 10Slyngshede: IDP: Add Superset next k8s dummy secret. [labs/private] - 10https://gerrit.wikimedia.org/r/1005105 [15:05:04] (03CR) 10Muehlenhoff: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1005101 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [15:05:34] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] IDP: Add Superset next k8s dummy secret. [labs/private] - 10https://gerrit.wikimedia.org/r/1005105 (owner: 10Slyngshede) [15:06:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 100%: After rearraging sections T354826', diff saved to https://phabricator.wikimedia.org/P57352 and previous config saved to /var/cache/conftool/dbconfig/20240220-150629-root.json [15:07:12] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1411/co" [puppet] - 10https://gerrit.wikimedia.org/r/1005094 (https://phabricator.wikimedia.org/T357749) (owner: 10Slyngshede) [15:09:18] (03PS1) 10Muehlenhoff: profile::mariadb::wmf_root_client: Remove cumin1001 from allow list [puppet] - 10https://gerrit.wikimedia.org/r/1005106 (https://phabricator.wikimedia.org/T353419) [15:09:34] !log sudo cumin 'A:cp' "run-puppet-agent --enable 'merging CR 1004126'" [15:10:16] 10SRE, 10ops-esams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219#9559471 (10RobH) [15:11:16] (03CR) 10Marostegui: [C: 03+1] 
profile::mariadb::wmf_root_client: Remove cumin1001 from allow list [puppet] - 10https://gerrit.wikimedia.org/r/1005106 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [15:13:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db[2146,2151].codfw.wmnet [15:13:32] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2005.codfw.wmnet with OS bookworm [15:14:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P57353 and previous config saved to /var/cache/conftool/dbconfig/20240220-151426-arnaudb.json [15:16:10] !log starting the Alert hosts upgrade to Bookworm - T333615 [15:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:15] T333615: Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 [15:16:41] !log depooled wdqs2009 & wdqs2020 (T355867) [15:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:52] T355867: Migrate servers in codfw rack A7 from asw-a7-codfw to lsw1-a7-codfw - https://phabricator.wikimedia.org/T355867 [15:17:55] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615#9559503 (10andrea.denisse) [15:18:06] (03PS1) 10Stang: zhwiki: Create group ipblock-exempt-granter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005109 (https://phabricator.wikimedia.org/T357991) [15:18:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2170 (re)pooling @ 100%: After migration', diff saved to https://phabricator.wikimedia.org/P57354 and previous config saved to /var/cache/conftool/dbconfig/20240220-151812-root.json [15:18:35] (SystemdUnitFailed) resolved: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:20:09] (03PS1) 10MVernon: aptrepo: add external repository for Ceph reef release [puppet] - 10https://gerrit.wikimedia.org/r/1005110 (https://phabricator.wikimedia.org/T279621) [15:20:48] (03PS2) 10MVernon: aptrepo: add external repository for Ceph reef release [puppet] - 10https://gerrit.wikimedia.org/r/1005110 (https://phabricator.wikimedia.org/T279621) [15:21:13] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9559543 (10Jhancock.wm) [15:23:24] (03CR) 10Arnaudb: [C: 03+1] aptrepo: add external repository for Ceph reef release [puppet] - 10https://gerrit.wikimedia.org/r/1005110 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:24:32] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 83 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/59935539/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:24:46] (03CR) 10MVernon: [C: 03+2] aptrepo: add external repository for Ceph reef release [puppet] - 10https://gerrit.wikimedia.org/r/1005110 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:25:35] (03PS1) 10Clément Goubert: sre.hosts.reimage: Fix dry-run failure [cookbooks] - 10https://gerrit.wikimedia.org/r/1005112 [15:25:42] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [15:27:02] RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 87.98 ms [15:28:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T355609)', diff saved to https://phabricator.wikimedia.org/P57355 and previous config saved to /var/cache/conftool/dbconfig/20240220-152807-marostegui.json [15:28:15] T355609: Make cuc_id a bigint 
- https://phabricator.wikimedia.org/T355609 [15:28:30] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [15:28:36] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 22.41 ms [15:29:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T357189)', diff saved to https://phabricator.wikimedia.org/P57356 and previous config saved to /var/cache/conftool/dbconfig/20240220-152933-arnaudb.json [15:29:35] 10SRE, 10ops-esams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219#9559631 (10RobH) [15:29:36] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [15:29:38] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [15:29:48] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 36 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:29:50] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [15:29:51] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [15:29:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [15:30:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2128 (T357189)', diff saved to https://phabricator.wikimedia.org/P57357 and previous config saved to /var/cache/conftool/dbconfig/20240220-153000-arnaudb.json [15:30:06] (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: Fix dry-run failure [cookbooks] - 
10https://gerrit.wikimedia.org/r/1005112 (owner: 10Clément Goubert) [15:30:20] !log import ceph-reef packages to apt1001 T279621 [15:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:32] T279621: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 [15:31:02] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [15:31:03] 10SRE, 10Infrastructure-Foundations, 10netops: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304#9559649 (10RobH) [15:31:11] 10SRE, 10ops-esams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219#9559638 (10RobH) 05Open→03Resolved a:05wiki_willy→03None Only two sub-tasks open, T350621 and T342239 which are both being taken care of on their own tasks.... [15:32:05] !log temp disable meta-monitoring on wikitech-static.w.o - T333615 [15:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:10] T333615: Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 [15:32:29] (03PS2) 10Clément Goubert: sre.hosts.reimage: Fix dry-run failure [cookbooks] - 10https://gerrit.wikimedia.org/r/1005112 [15:33:36] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 22.28 ms [15:33:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db[1168,1210,1226,1233].eqiad.wmnet with reason: Silence for reboot T356240 [15:33:58] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db[1168,1210,1226,1233].eqiad.wmnet with reason: Silence for reboot T356240 [15:34:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T357189)', diff saved to https://phabricator.wikimedia.org/P57358 and previous config saved to /var/cache/conftool/dbconfig/20240220-153410-arnaudb.json [15:34:26] (03PS2) 10C. 
Scott Ananian: Correctly turn on Parsoid read views by default on wikitech Talk pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003551 [15:34:38] (03PS3) 10C. Scott Ananian: Turn on Parsoid read views by default on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999062 (https://phabricator.wikimedia.org/T355566) [15:35:16] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 37 probes of 730 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:35:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1168 db1210 db1226 db1233 depool for T356240', diff saved to https://phabricator.wikimedia.org/P57359 and previous config saved to /var/cache/conftool/dbconfig/20240220-153557-arnaudb.json [15:36:58] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db1168.eqiad.wmnet [15:37:10] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db1210.eqiad.wmnet [15:37:31] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db1226.eqiad.wmnet [15:37:47] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db1233.eqiad.wmnet [15:39:46] (03CR) 10CDanis: [C: 03+1] wikimedia.org: add trace [dns] - 10https://gerrit.wikimedia.org/r/1005041 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [15:40:00] (03CR) 10CDanis: [C: 03+1] Add trace.w.o to CDN [puppet] - 10https://gerrit.wikimedia.org/r/1005043 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [15:41:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1210.eqiad.wmnet [15:41:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1226.eqiad.wmnet [15:41:45] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1168.eqiad.wmnet [15:42:15] !log arnaudb@cumin1002 END (PASS) 
- Cookbook sre.mysql.upgrade (exit_code=0) for db1233.eqiad.wmnet [15:43:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P57361 and previous config saved to /var/cache/conftool/dbconfig/20240220-154313-marostegui.json [15:46:00] !log re-enable meta-monitoring on wikitech-static.w.o - T333615 [15:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:07] (03CR) 10Marostegui: [C: 03+2] Remove grants for cumin1001 [puppet] - 10https://gerrit.wikimedia.org/r/1005101 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [15:46:11] T333615: Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 [15:46:28] !log When doing the alert hosts upgrade we encountered some issues that prevented us from properly reimaging the hosts to proceed with the upgrade. We're investigating this issue and will announce the new alert hosts upgrade date ASAP. - T333615 [15:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P57362 and previous config saved to /var/cache/conftool/dbconfig/20240220-154917-arnaudb.json [15:49:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57363 and previous config saved to /var/cache/conftool/dbconfig/20240220-154920-arnaudb.json [15:49:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57364 and previous config saved to /var/cache/conftool/dbconfig/20240220-154920-arnaudb.json [15:49:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57365 and previous config
saved to /var/cache/conftool/dbconfig/20240220-154924-arnaudb.json [15:49:30] PROBLEM - Host vrts1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:49:42] ^ guessing this not known? [15:49:42] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [15:50:12] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [15:50:13] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [15:50:36] RECOVERY - Host vrts1002 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [15:50:40] (ProbeDown) firing: Service vrts1002:1443 has failed probes (http_ticket_test_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1002:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:51:46] 10SRE, 10Infrastructure-Foundations, 10netops: BGP peering from LSW to K8s hosts using loopback IP not IRB - https://phabricator.wikimedia.org/T357619#9559790 (10cmooney) 05Open→03Resolved Config pushed out across the estate now, multihop config only added on CRs. 
[15:53:34] PROBLEM - Host vrts1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:53:45] (03PS2) 10Eevans: c-cqlsh is now deprecated; long live cqlsh-instance [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/1004235 [15:53:49] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [15:53:50] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [15:54:26] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [15:54:48] 10ops-codfw, 10serviceops: Issues reimaging servers in codfw - https://phabricator.wikimedia.org/T358001#9559853 (10hnowlan) [15:55:18] !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@b115452]: (no justification provided) [15:55:23] !log import ceph-reef packages to apt1001 T279621 [15:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:28] T279621: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 [15:55:40] (ProbeDown) firing: (2) Service vrts1002:25 has failed probes (tcp_vrts_smtp_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#vrts1002:25 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:55:52] !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@b115452]: (no justification provided) (duration: 00m 34s) [15:56:37] (03Abandoned) 10Jgiannelos: changeprop: Disable restbase/parsoid related rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/998992 (https://phabricator.wikimedia.org/T344945) (owner: 10Jgiannelos) [15:58:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P57366 and previous config saved to /var/cache/conftool/dbconfig/20240220-155820-marostegui.json [15:58:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Decommission task for old cp hosts 
(cp1075-1090) - https://phabricator.wikimedia.org/T352253#9559872 (10ssingh) Hi dc-ops team: quick question: have these hosts already been hardware decommissioned? [15:59:30] jouncebot: nowandnext [15:59:30] No deployments scheduled for the next 0 hour(s) and 0 minute(s) [15:59:30] In 0 hour(s) and 0 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240220T1600) [15:59:39] That feels like an edge case [15:59:39] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2005.codfw.wmnet with reason: host reimage [16:00:04] eoghan, jelto, and arnoldokoth: #bothumor I ♥ Unicode. All rise for SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240220T1600). [16:00:05] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2089*,elastic2062*,elastic2061* for switch maintenance - bking@cumin2002 - T355860 [16:00:08] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2089*,elastic2062*,elastic2061* for switch maintenance - bking@cumin2002 - T355860 [16:00:23] (03PS1) 10Reedy: Revert "Replace wfGetDB() with ICP getReplicaDatabase()" [extensions/AntiSpoof] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005076 (https://phabricator.wikimedia.org/T357995) [16:00:26] T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 [16:00:57] (03CR) 10Reedy: [C: 03+2] Revert "Replace wfGetDB() with ICP getReplicaDatabase()" [extensions/AntiSpoof] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005076 (https://phabricator.wikimedia.org/T357995) (owner: 10Reedy) [16:00:58] !log running `homer 'cr*codfw*' commit 'T351074'` for new k8s workers [16:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:17] T351074: Move servers from the appserver/api
cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [16:02:07] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2005.codfw.wmnet with reason: host reimage [16:02:07] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on asw-a-codfw,cr[1-2]-codfw,lsw1-a7-codfw.mgmt with reason: prepping for server uplink migration codfw rack a7 [16:02:15] (03PS1) 10CDobbins: puppet/modules/admin/data/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1005117 [16:02:24] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw-a-codfw,cr[1-2]-codfw,lsw1-a7-codfw.mgmt with reason: prepping for server uplink migration codfw rack a7 [16:02:28] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9559899 (10cmooney) [16:02:44] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A7 from asw-a7-codfw to lsw1-a7-codfw - https://phabricator.wikimedia.org/T355867#9559902 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=343ed6db-68dd-4330-8851-9631da7da8d5... [16:02:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Decommission task for old cp hosts (cp1075-1090) - https://phabricator.wikimedia.org/T352253#9559903 (10ssingh) For further context: we have a request from @dr0ptp4kt for running a Blazegraph experiment and we are trying to free up a cp node for him. So we were wond... [16:03:10] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866#9559896 (10cmooney) 05Open→03Resolved a:03cmooney >>! In T355866#9556801, @Marostegui wrote: > @cmooney is there anythin... 
[16:04:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P57367 and previous config saved to /var/cache/conftool/dbconfig/20240220-160423-arnaudb.json [16:04:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57368 and previous config saved to /var/cache/conftool/dbconfig/20240220-160429-arnaudb.json [16:04:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 50%: maintenance done', diff saved to https://phabricator.wikimedia.org/P57369 and previous config saved to /var/cache/conftool/dbconfig/20240220-160432-arnaudb.json [16:04:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57370 and previous config saved to /var/cache/conftool/dbconfig/20240220-160437-arnaudb.json [16:04:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57371 and previous config saved to /var/cache/conftool/dbconfig/20240220-160438-arnaudb.json [16:05:56] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudrabbit1003.eqiad.wmnet [16:05:59] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 22 hosts with reason: Migrating servers in codfw rack A7 to lsw1-a7-codfw [16:06:21] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 22 hosts with reason: Migrating servers in codfw rack A7 to lsw1-a7-codfw [16:06:27] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A7 from asw-a7-codfw to lsw1-a7-codfw - https://phabricator.wikimedia.org/T355867#9559917 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=47f3a57d-6476-4782-ba82-9c2dc99042c9... 
[16:06:53] RECOVERY - Host vrts1002 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [16:07:28] jouncebot: nowandnext [16:07:28] For the next 0 hour(s) and 52 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240220T1600) [16:07:28] In 0 hour(s) and 52 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240220T1700) [16:07:32] !log Commencing network maintenance migrating servers to new switch codfw rack A7 T355867 [16:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:49] T355867: Migrate servers in codfw rack A7 from asw-a7-codfw to lsw1-a7-codfw - https://phabricator.wikimedia.org/T355867 [16:07:49] PROBLEM - spamassassin on vrts1002 is CRITICAL: PROCS CRITICAL: 0 processes with args spamd https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin [16:08:49] RECOVERY - spamassassin on vrts1002 is OK: PROCS OK: 3 processes with args spamd https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin [16:09:15] !log hnowlan@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=(mw2312.codfw.wmnet|mw2313.codfw.wmnet|mw2367.codfw.wmnet|mw2369.codfw.wmnet) [16:10:40] (ProbeDown) resolved: (2) Service vrts1002:25 has failed probes (tcp_vrts_smtp_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#vrts1002:25 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:11:29] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_codfw [16:11:32] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_codfw [16:12:32] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1003.eqiad.wmnet [16:13:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db1207 (T355609)', diff saved to https://phabricator.wikimedia.org/P57372 and previous config saved to /var/cache/conftool/dbconfig/20240220-161326-marostegui.json [16:13:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1218.eqiad.wmnet with reason: Maintenance [16:13:32] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [16:13:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1218.eqiad.wmnet with reason: Maintenance [16:13:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T355609)', diff saved to https://phabricator.wikimedia.org/P57373 and previous config saved to /var/cache/conftool/dbconfig/20240220-161348-marostegui.json [16:14:11] PROBLEM - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [16:14:25] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudrabbit1002.eqiad.wmnet [16:17:05] 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744#9560023 (10ssingh) Thanks to @Muehlenhoff, we have imported the forward port of OpenSSL 1.1.1 and have build haproxy 2.6 against it. We will be reimaging a cp host to bookworm. Shar... [16:17:08] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A7 from asw-a7-codfw to lsw1-a7-codfw - https://phabricator.wikimedia.org/T355867#9560025 (10cmooney) All links moved successfully and all hosts responding to ping as before. 
[16:17:23] (03PS4) 10Arturo Borrero Gonzalez: openstack: nova-compute: persist compute node id [puppet] - 10https://gerrit.wikimedia.org/r/1005065 (https://phabricator.wikimedia.org/T357631) [16:17:52] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1005065 (https://phabricator.wikimedia.org/T357631) (owner: 10Arturo Borrero Gonzalez) [16:17:54] (03CR) 10Alexandros Kosiaris: serve https for jaeger-query oauth2-proxy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004239 (https://phabricator.wikimedia.org/T320555) (owner: 10CDanis) [16:18:46] (03CR) 10Brouberol: "Thanks for pushing these patches. I should have done so in the first place." [labs/private] - 10https://gerrit.wikimedia.org/r/1005102 (owner: 10Slyngshede) [16:18:54] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm [16:19:04] 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744#9560050 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4052.ulsfo.wmnet with OS bookworm [16:19:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T357189)', diff saved to https://phabricator.wikimedia.org/P57374 and previous config saved to /var/cache/conftool/dbconfig/20240220-161931-arnaudb.json [16:19:34] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [16:19:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 75%: maintenance done', diff saved to https://phabricator.wikimedia.org/P57375 and previous config saved to /var/cache/conftool/dbconfig/20240220-161937-arnaudb.json [16:19:41] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [16:19:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1168 
(re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57376 and previous config saved to /var/cache/conftool/dbconfig/20240220-161942-arnaudb.json [16:19:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57377 and previous config saved to /var/cache/conftool/dbconfig/20240220-161942-arnaudb.json [16:19:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57378 and previous config saved to /var/cache/conftool/dbconfig/20240220-161946-arnaudb.json [16:19:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [16:19:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T357189)', diff saved to https://phabricator.wikimedia.org/P57379 and previous config saved to /var/cache/conftool/dbconfig/20240220-161953-arnaudb.json [16:19:59] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A7 from asw-a7-codfw to lsw1-a7-codfw - https://phabricator.wikimedia.org/T355867#9560061 (10MatthewVernon) ms and thanos swift both OK post-move. 
[16:20:11] RECOVERY - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [16:20:20] (03Merged) 10jenkins-bot: Revert "Replace wfGetDB() with ICP getReplicaDatabase()" [extensions/AntiSpoof] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005076 (https://phabricator.wikimedia.org/T357995) (owner: 10Reedy) [16:20:51] (03PS1) 10Alexandros Kosiaris: jaeger: Add 4180 port to the network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005119 (https://phabricator.wikimedia.org/T320555) [16:21:23] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1002.eqiad.wmnet [16:21:56] (03CR) 10CDanis: [C: 03+1] jaeger: Add 4180 port to the network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005119 (https://phabricator.wikimedia.org/T320555) (owner: 10Alexandros Kosiaris) [16:22:15] (03CR) 10Cathal Mooney: [C: 03+2] Enable BGP session status change logs on l3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/1004747 (owner: 10Cathal Mooney) [16:23:09] (03CR) 10Majavah: "This seems to already work without this change:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005119 (https://phabricator.wikimedia.org/T320555) (owner: 10Alexandros Kosiaris) [16:24:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T357189)', diff saved to https://phabricator.wikimedia.org/P57380 and previous config saved to /var/cache/conftool/dbconfig/20240220-162408-arnaudb.json [16:24:14] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudrabbit1001.eqiad.wmnet [16:27:42] !log sukhe@cumin2002 END (ERROR) - 
Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bookworm [16:27:51] 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744#9560148 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4052.ulsfo.wmnet with OS bookworm executed with errors: - cp4052 (**FAIL**)... [16:27:57] (03PS5) 10Arturo Borrero Gonzalez: openstack: nova-compute: persist compute node id [puppet] - 10https://gerrit.wikimedia.org/r/1005065 (https://phabricator.wikimedia.org/T357631) [16:29:10] (03PS6) 10Arturo Borrero Gonzalez: openstack: nova-compute: persist compute node id [puppet] - 10https://gerrit.wikimedia.org/r/1005065 (https://phabricator.wikimedia.org/T357631) [16:29:18] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm [16:30:25] 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744#9560162 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4052.ulsfo.wmnet with OS bookworm [16:30:54] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1001.eqiad.wmnet [16:32:43] (03Merged) 10jenkins-bot: Enable BGP session status change logs on l3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/1004747 (owner: 10Cathal Mooney) [16:33:24] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1005065 (https://phabricator.wikimedia.org/T357631) (owner: 10Arturo Borrero Gonzalez) [16:34:23] 10SRE, 10Wikimedia-Mailing-lists: Set up mailing list for zh.wikipedia - https://phabricator.wikimedia.org/T358011#9560067 (10JJMC89) [16:34:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 100%: maintenance done', diff saved to https://phabricator.wikimedia.org/P57381 and 
previous config saved to /var/cache/conftool/dbconfig/20240220-163442-arnaudb.json [16:34:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57382 and previous config saved to /var/cache/conftool/dbconfig/20240220-163447-arnaudb.json [16:34:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1210 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57383 and previous config saved to /var/cache/conftool/dbconfig/20240220-163447-arnaudb.json [16:34:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57384 and previous config saved to /var/cache/conftool/dbconfig/20240220-163451-arnaudb.json [16:35:21] !log reedy@deploy2002 Synchronized php-1.42.0-wmf.19/extensions/AntiSpoof/: T357995 (duration: 11m 02s) [16:35:55] T357995: `Database error` error page when creating account on the beta cluster - https://phabricator.wikimedia.org/T357995 [16:39:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P57385 and previous config saved to /var/cache/conftool/dbconfig/20240220-163915-arnaudb.json [16:41:13] (03CR) 10Filippo Giunchedi: [C: 03+1] jaeger: Add 4180 port to the network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005119 (https://phabricator.wikimedia.org/T320555) (owner: 10Alexandros Kosiaris) [16:41:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T355609)', diff saved to https://phabricator.wikimedia.org/P57386 and previous config saved to /var/cache/conftool/dbconfig/20240220-164134-marostegui.json [16:41:49] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [16:42:20] (03PS7) 10Arturo Borrero Gonzalez: openstack: nova-compute: persist compute node id [puppet] - 
10https://gerrit.wikimedia.org/r/1005065 (https://phabricator.wikimedia.org/T357631) [16:42:30] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for cp[2029-2030].codfw.wmnet [16:42:32] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp[2029-2030].codfw.wmnet [16:43:23] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,name=cp20(29|30).codfw.wmnet [16:43:32] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1005065 (https://phabricator.wikimedia.org/T357631) (owner: 10Arturo Borrero Gonzalez) [16:43:37] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2005.codfw.wmnet with OS bookworm [16:44:28] (03PS2) 10Hnowlan: changeprop: clean up k8s jobrunner references [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004066 (https://phabricator.wikimedia.org/T349796) [16:45:26] (03CR) 10CI reject: [V: 04-1] changeprop: clean up k8s jobrunner references [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004066 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:48:34] (03PS3) 10Hnowlan: changeprop: clean up k8s jobrunner references [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004066 (https://phabricator.wikimedia.org/T349796) [16:48:58] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:54:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P57387 and previous config saved to /var/cache/conftool/dbconfig/20240220-165421-arnaudb.json [16:54:45] (03PS1) 10Hnowlan: mw-jobrunner: reduce replicas further [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005121 (https://phabricator.wikimedia.org/T349796) [16:55:03] (03CR) 10Alexandros Kosiaris: [C: 03+1] changeprop: clean up 
k8s jobrunner references [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004066 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:55:12] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [16:55:30] (03PS2) 10CDobbins: puppet/modules/admin/data/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1005117 [16:55:32] (03PS1) 10CDobbins: admin: update data.yaml for cdobbins [puppet] - 10https://gerrit.wikimedia.org/r/1005122 [16:56:10] (03Abandoned) 10CDobbins: puppet/modules/admin/data/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1005117 (owner: 10CDobbins) [16:56:30] 10SRE, 10Wikimedia-Mailing-lists: Not receiving posts or moderation messages - https://phabricator.wikimedia.org/T358020#9560351 (10JJMC89) [16:56:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P57388 and previous config saved to /var/cache/conftool/dbconfig/20240220-165641-marostegui.json [16:57:58] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [16:59:06] PROBLEM - Host mr1-esams.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [16:59:30] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9560354 (10akosiaris) [17:00:05] jhathaway and rzl: Your horoscope predicts another Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240220T1700). [17:00:05] tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:00:36] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 55 probes of 802 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:02:18] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 181 probes of 730 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:05:32] (03CR) 10CDobbins: [C: 03+2] admin: update data.yaml for cdobbins [puppet] - 10https://gerrit.wikimedia.org/r/1005122 (owner: 10CDobbins) [17:07:10] (03Abandoned) 10C. Scott Ananian: [ParserOutput] allow rollback of render id [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1003833 (https://phabricator.wikimedia.org/T356368) (owner: 10C. Scott Ananian) [17:07:25] (03CR) 10CDobbins: admin: update data.yaml for cdobbins [puppet] - 10https://gerrit.wikimedia.org/r/1005122 (owner: 10CDobbins) [17:08:15] tgr|away: hello! sorry to be late, taking a look [17:09:00] oh, it looks like this is merged already, thanks cwhite [17:09:10] tgr|away: do you still need anything in the puppet window? 
[17:09:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T357189)', diff saved to https://phabricator.wikimedia.org/P57389 and previous config saved to /var/cache/conftool/dbconfig/20240220-170928-arnaudb.json [17:09:30] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [17:09:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [17:09:45] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [17:09:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T357189)', diff saved to https://phabricator.wikimedia.org/P57390 and previous config saved to /var/cache/conftool/dbconfig/20240220-170949-arnaudb.json [17:10:06] (03CR) 10Ssingh: [C: 03+1] "Looks good! We will merge this tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/1005122 (owner: 10CDobbins) [17:11:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P57391 and previous config saved to /var/cache/conftool/dbconfig/20240220-171147-marostegui.json [17:13:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T357189)', diff saved to https://phabricator.wikimedia.org/P57392 and previous config saved to /var/cache/conftool/dbconfig/20240220-171358-arnaudb.json [17:16:07] rzl: sorry should have asked, but since it was already merged I assumed it was taken care of [17:17:06] rzl: sorry I am late! 
[17:17:16] jhathaway: nah all good, you just figured that out faster than me :P [17:17:22] TBH I have no idea if it needs some sort of deploy step [17:17:41] nod, me neither [17:18:03] let me check some Logstash logs [17:18:16] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4052.ulsfo.wmnet with OS bookworm [17:18:26] 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744#9560423 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4052.ulsfo.wmnet with OS bookworm completed: - cp4052 (**PASS**) - Removed... [17:18:34] it needs a cumin'ed `systemctl restart logstash` but my guess is cwhite will have done that right after merging? [17:20:01] Seems to be in use already. Sorry for the noise! [17:20:12] rad, no worries [17:20:29] puppet window complete, then :) [17:23:27] rzl: puppet manages the restart of logstash. should all be good :) [17:24:23] cwhite: oh cool! I was looking at https://wikitech.wikimedia.org/wiki/Logstash#Configuration_changes, is that out of date? 
[17:24:45] good question, I'll have a look [17:24:56] RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 84.42 ms [17:26:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T355609)', diff saved to https://phabricator.wikimedia.org/P57393 and previous config saved to /var/cache/conftool/dbconfig/20240220-172653-marostegui.json [17:26:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1219.eqiad.wmnet with reason: Maintenance [17:27:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1219.eqiad.wmnet with reason: Maintenance [17:27:11] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [17:27:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T355609)', diff saved to https://phabricator.wikimedia.org/P57394 and previous config saved to /var/cache/conftool/dbconfig/20240220-172716-marostegui.json [17:29:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P57395 and previous config saved to /var/cache/conftool/dbconfig/20240220-172904-arnaudb.json [17:30:36] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 35 probes of 802 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:32:18] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 39 probes of 730 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:35:10] (03CR) 10Andrew Bogott: [C: 03+1] "This looks good. The pcc failures are for nodes that don't exist anymore." 
[puppet] - 10https://gerrit.wikimedia.org/r/1005065 (https://phabricator.wikimedia.org/T357631) (owner: 10Arturo Borrero Gonzalez) [17:40:53] (03CR) 10CDanis: [C: 03+1] "I think (but am not 100% sure) that Alex applied this in production to check and then sent the patch for review" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005119 (https://phabricator.wikimedia.org/T320555) (owner: 10Alexandros Kosiaris) [17:44:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P57396 and previous config saved to /var/cache/conftool/dbconfig/20240220-174411-arnaudb.json [17:46:35] (03CR) 10Ssingh: [C: 03+2] cp4052: install haproxy 2.6 from component/haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1004128 (https://phabricator.wikimedia.org/T352744) (owner: 10Ssingh) [17:47:20] (03PS2) 10Jdlrobson: Enable desktop diff for anonymous users as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997585 (https://phabricator.wikimedia.org/T350181) [17:56:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T355609)', diff saved to https://phabricator.wikimedia.org/P57397 and previous config saved to /var/cache/conftool/dbconfig/20240220-175605-marostegui.json [17:56:20] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [17:57:44] (03PS2) 10JHathaway: etcd: disable the diff output for client config with passwords [puppet] - 10https://gerrit.wikimedia.org/r/1003112 (https://phabricator.wikimedia.org/T356459) [17:59:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T357189)', diff saved to https://phabricator.wikimedia.org/P57398 and previous config saved to /var/cache/conftool/dbconfig/20240220-175917-arnaudb.json [17:59:20] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [17:59:23] T357189: Drop 
iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [17:59:33] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [17:59:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T357189)', diff saved to https://phabricator.wikimedia.org/P57399 and previous config saved to /var/cache/conftool/dbconfig/20240220-175938-arnaudb.json [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240220T1800) [18:03:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T357189)', diff saved to https://phabricator.wikimedia.org/P57400 and previous config saved to /var/cache/conftool/dbconfig/20240220-180342-arnaudb.json [18:11:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P57401 and previous config saved to /var/cache/conftool/dbconfig/20240220-181111-marostegui.json [18:18:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P57402 and previous config saved to /var/cache/conftool/dbconfig/20240220-181850-arnaudb.json [18:22:33] !log reprepro -C component/haproxy26 include bookworm-wikimedia haproxy_2.6.16-1~bpo12+1_amd64.changes: T352744 [18:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:49] T352744: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 [18:26:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P57403 and previous config saved to /var/cache/conftool/dbconfig/20240220-182617-marostegui.json [18:31:10] !log pool cp4052: bookworm cp host with haproxy 2.6 built against OpenSSL 1.1.1: T352744 [18:31:15] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:26] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet,service=(cdn|ats-be) [18:31:33] T352744: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 [18:33:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P57404 and previous config saved to /var/cache/conftool/dbconfig/20240220-183356-arnaudb.json [18:41:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T355609)', diff saved to https://phabricator.wikimedia.org/P57405 and previous config saved to /var/cache/conftool/dbconfig/20240220-184124-marostegui.json [18:41:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1228.eqiad.wmnet with reason: Maintenance [18:41:42] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [18:41:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1228.eqiad.wmnet with reason: Maintenance [18:41:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1228 (T355609)', diff saved to https://phabricator.wikimedia.org/P57406 and previous config saved to /var/cache/conftool/dbconfig/20240220-184157-marostegui.json [18:48:35] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [18:49:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T357189)', diff saved to https://phabricator.wikimedia.org/P57407 and previous config saved to /var/cache/conftool/dbconfig/20240220-184903-arnaudb.json [18:49:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
db2192.codfw.wmnet with reason: Maintenance [18:49:10] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [18:49:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2192.codfw.wmnet with reason: Maintenance [18:49:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T357189)', diff saved to https://phabricator.wikimedia.org/P57408 and previous config saved to /var/cache/conftool/dbconfig/20240220-184925-arnaudb.json [18:53:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T357189)', diff saved to https://phabricator.wikimedia.org/P57409 and previous config saved to /var/cache/conftool/dbconfig/20240220-185322-arnaudb.json [18:56:31] (03PS1) 10Kamila Součková: shellbox: add PHP-FPM process_control_timeout setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) [19:00:04] jeena and brennen: MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240220T1900). Please do the needful. 
[19:00:13] (03PS2) 10Kamila Součková: shellbox: add PHP-FPM process_control_timeout setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) [19:00:49] (PuppetDisabled) firing: Puppet disabled on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=ganeti&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [19:01:34] (03PS1) 10Ssingh: P:cache::base: add script to check versions of varnish and varnishkafka [puppet] - 10https://gerrit.wikimedia.org/r/1005140 [19:01:58] o/ (but forgot i was backup today, and sitting at a coffeeshop sorting e-mail, so it may be a few if i need to roll it out) [19:02:17] brennen: jeena is on it [19:02:22] cool cool [19:02:28] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005141 (https://phabricator.wikimedia.org/T354437) [19:02:30] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005141 (https://phabricator.wikimedia.org/T354437) (owner: 10TrainBranchBot) [19:03:11] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005141 (https://phabricator.wikimedia.org/T354437) (owner: 10TrainBranchBot) [19:03:13] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1412/co" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh) [19:03:35] and you didn't forget, I'm flailing at train schedules, sorry for the surprise, sync'd with jeena to be sure she could do it this week but missed you. [19:06:48] oh no worries. 
[19:07:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T355609)', diff saved to https://phabricator.wikimedia.org/P57410 and previous config saved to /var/cache/conftool/dbconfig/20240220-190722-marostegui.json [19:07:40] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [19:08:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P57411 and previous config saved to /var/cache/conftool/dbconfig/20240220-190829-arnaudb.json [19:11:35] (03PS2) 10Ssingh: P:cache::base: add script to check versions of varnish and varnishkafka [puppet] - 10https://gerrit.wikimedia.org/r/1005140 [19:12:48] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.19 refs T354437 [19:13:04] T354437: 1.42.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T354437 [19:21:10] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:21:44] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:22:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P57412 and previous config saved to /var/cache/conftool/dbconfig/20240220-192229-marostegui.json [19:23:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.222 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:23:36] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51452 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:23:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to 
https://phabricator.wikimedia.org/P57413 and previous config saved to /var/cache/conftool/dbconfig/20240220-192335-arnaudb.json [19:26:38] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1003112 (https://phabricator.wikimedia.org/T356459) (owner: 10JHathaway) [19:30:46] (03PS2) 10CDanis: jaeger: Add 4180 port to the network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005119 (https://phabricator.wikimedia.org/T320555) (owner: 10Alexandros Kosiaris) [19:31:54] (03CR) 10CDanis: [C: 03+2] jaeger: Add 4180 port to the network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005119 (https://phabricator.wikimedia.org/T320555) (owner: 10Alexandros Kosiaris) [19:32:46] (03Merged) 10jenkins-bot: jaeger: Add 4180 port to the network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005119 (https://phabricator.wikimedia.org/T320555) (owner: 10Alexandros Kosiaris) [19:35:56] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [19:36:02] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[19:37:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P57414 and previous config saved to /var/cache/conftool/dbconfig/20240220-193735-marostegui.json [19:38:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T357189)', diff saved to https://phabricator.wikimedia.org/P57415 and previous config saved to /var/cache/conftool/dbconfig/20240220-193842-arnaudb.json [19:38:48] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [19:39:26] (03PS1) 10Ryan Kemper: wdqs: whitelist iconclass endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1005147 (https://phabricator.wikimedia.org/T357533) [19:40:10] (03CR) 10Bking: [C: 03+1] wdqs: whitelist iconclass endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1005147 (https://phabricator.wikimedia.org/T357533) (owner: 10Ryan Kemper) [19:41:06] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs: whitelist iconclass endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1005147 (https://phabricator.wikimedia.org/T357533) (owner: 10Ryan Kemper) [19:43:49] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart [19:44:23] (03CR) 10CDanis: [C: 03+2] Add trace.w.o to CDN [puppet] - 10https://gerrit.wikimedia.org/r/1005043 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [19:48:32] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T347624, testing 961878 patch) xfer categories from wdqs2024.codfw.wmnet -> wdqs2025.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [19:48:34] !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) (T347624, testing 961878 patch) xfer categories from wdqs2024.codfw.wmnet -> wdqs2025.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [19:48:38] T347624: Refactor sre.wdqs.data-transfer to use new spicerack class api - 
https://phabricator.wikimedia.org/T347624 [19:52:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T355609)', diff saved to https://phabricator.wikimedia.org/P57416 and previous config saved to /var/cache/conftool/dbconfig/20240220-195242-marostegui.json [19:52:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1232.eqiad.wmnet with reason: Maintenance [19:52:48] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [19:52:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1232.eqiad.wmnet with reason: Maintenance [19:53:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T355609)', diff saved to https://phabricator.wikimedia.org/P57417 and previous config saved to /var/cache/conftool/dbconfig/20240220-195303-marostegui.json [19:54:34] (03PS1) 10Ryan Kemper: cloudelastic: decom cloudelastic100[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/1005151 (https://phabricator.wikimedia.org/T357780) [19:56:09] (03CR) 10Bking: [C: 03+1] cloudelastic: decom cloudelastic100[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/1005151 (https://phabricator.wikimedia.org/T357780) (owner: 10Ryan Kemper) [19:57:12] (03CR) 10Ryan Kemper: [C: 03+2] cloudelastic: decom cloudelastic100[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/1005151 (https://phabricator.wikimedia.org/T357780) (owner: 10Ryan Kemper) [20:01:33] !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudelastic[1001-1004].wikimedia.org [20:08:52] (03PS3) 10Jdlrobson: Enable night mode on mobile test servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003873 (https://phabricator.wikimedia.org/T357759) [20:09:13] (03PS1) 10Ebernhardson: cirrus: Add conditional -backfill releases (take two) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005155 [20:10:24] (03PS3) 10Jdlrobson: Enable desktop diff for anonymous users on 
enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997585 (https://phabricator.wikimedia.org/T350181) [20:11:29] (03PS2) 10Ebernhardson: cirrus: Add conditional -backfill releases (take two) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005155 [20:12:09] 10SRE-Access-Requests, 10LDAP-Access-Requests: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9561229 (10taavi) Hey. You should start by creating the new developer account via https://idm.wikimedia.org. Also Wikitech usernames come directly from developer account names (the... [20:13:56] (03CR) 10Ebernhardson: [C: 03+2] cirrus: Add conditional -backfill releases (take two) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005155 (owner: 10Ebernhardson) [20:15:01] (03Merged) 10jenkins-bot: cirrus: Add conditional -backfill releases (take two) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005155 (owner: 10Ebernhardson) [20:16:25] 10SRE-Access-Requests, 10LDAP-Access-Requests: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9561232 (10brion) Thanks, in process of creating dev account bvibber ... [20:18:28] (03PS1) 10CDanis: Ask Ingress to serve trace.wikimedia.org altname [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005156 (https://phabricator.wikimedia.org/T320555) [20:18:44] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [20:19:26] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [20:19:28] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1004 is CRITICAL: CRITICAL - elasticsearch http://localhost:9600/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9600): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [20:20:08] someone working on these? [20:20:14] oh ryan ok [20:21:14] (03CR) 10RLazarus: [C: 03+1] Ask Ingress to serve trace.wikimedia.org altname [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005156 (https://phabricator.wikimedia.org/T320555) (owner: 10CDanis) [20:23:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T355609)', diff saved to https://phabricator.wikimedia.org/P57419 and previous config saved to /var/cache/conftool/dbconfig/20240220-202300-marostegui.json [20:23:13] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [20:24:17] 10ops-eqiad, 10decommission-hardware: decommission cloudelastic100[1-4].wikimedia.org - https://phabricator.wikimedia.org/T358046#9561242 (10RKemper) [20:25:09] 10ops-eqiad, 10decommission-hardware: decommission cloudelastic100[1-4].wikimedia.org - https://phabricator.wikimedia.org/T358046#9561242 (10RKemper) [20:25:58] sukhe: oops, forgot to downtime [20:27:14] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [20:27:51] (03PS2) 10CDanis: Ask Ingress to serve trace.wikimedia.org altname [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005156 (https://phabricator.wikimedia.org/T320555) [20:28:35] (03PS1) 10Dbrant: Move account vanishing contact form to Meta wiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005161 (https://phabricator.wikimedia.org/T343536) [20:30:54] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic[1001-1004].wikimedia.org decommissioned, removing all IPs except the asset tag one - ryankemper@cumin2002" [20:31:01] ryankemper: all good, thanks! [20:31:04] (03CR) 10RLazarus: [C: 03+1] Ask Ingress to serve trace.wikimedia.org altname [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005156 (https://phabricator.wikimedia.org/T320555) (owner: 10CDanis) [20:31:44] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:31:46] (03CR) 10CDanis: [C: 03+2] Ask Ingress to serve trace.wikimedia.org altname [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005156 (https://phabricator.wikimedia.org/T320555) (owner: 10CDanis) [20:31:49] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:32:01] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic[1001-1004].wikimedia.org decommissioned, removing all IPs except the asset tag one - ryankemper@cumin2002" [20:32:02] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:32:03] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudelastic[1001-1004].wikimedia.org [20:33:26] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission cloudelastic100[1-4].wikimedia.org - https://phabricator.wikimedia.org/T358046#9561307 (10RKemper) [20:34:31] (03Merged) 10jenkins-bot: Ask Ingress to serve trace.wikimedia.org altname [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005156 (https://phabricator.wikimedia.org/T320555) (owner: 10CDanis) [20:35:30] !log 
cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [20:35:41] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [20:38:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P57420 and previous config saved to /var/cache/conftool/dbconfig/20240220-203806-marostegui.json [20:41:20] (03PS2) 10CDanis: wikimedia.org: add trace [dns] - 10https://gerrit.wikimedia.org/r/1005041 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [20:41:29] (03CR) 10CDanis: [C: 03+2] wikimedia.org: add trace [dns] - 10https://gerrit.wikimedia.org/r/1005041 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [20:46:48] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9561338 (10bvibber) bvibber LDAP account is, and bvibber phabricator account is now connected to it. I'm not sure if I have any special permissions that need migrating over... [20:52:17] (03PS1) 10CDanis: jaeger: make oidc client_id match CAS config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005163 (https://phabricator.wikimedia.org/T320555) [20:53:04] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9561362 (10bvibber) Can't log into gerrit with bvibber, it says "Authentication failed." 
[20:53:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P57421 and previous config saved to /var/cache/conftool/dbconfig/20240220-205312-marostegui.json [20:54:05] (03CR) 10CDanis: [C: 03+2] jaeger: make oidc client_id match CAS config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005163 (https://phabricator.wikimedia.org/T320555) (owner: 10CDanis) [20:55:05] (03Merged) 10jenkins-bot: jaeger: make oidc client_id match CAS config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005163 (https://phabricator.wikimedia.org/T320555) (owner: 10CDanis) [20:55:29] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [20:55:36] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [20:55:52] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:56:05] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:56:32] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:56:44] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240220T2100). [21:00:05] cscott, bvibber, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:32] i can deploy [21:00:50] yo [21:01:09] cscott: are you around? 
[21:01:16] if not, bvibber i can start with yours [21:01:20] whee [21:01:22] lol [21:02:04] bvibber: looks like yours is already merged -- do you need a backport? [21:03:08] o/ [21:03:25] yeah backport to run on current commons [21:03:30] which iirc is wmf.19? lemme check [21:03:41] cscott is around :) [21:03:42] i'm here! [21:03:49] sorry libera.chat punted me [21:03:53] 18 and 19 should cover me [21:04:08] great! then bvibber, while you prep your patches, i'll go ahead with cscott's [21:04:11] woo [21:04:20] o/ [21:04:25] (03PS3) 10Clare Ming: Correctly turn on Parsoid read views by default on wikitech Talk pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003551 (owner: 10C. Scott Ananian) [21:04:29] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:04:34] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:04:41] (03PS1) 10Brion VIBBER: Fix for regression in audio track suppression logic [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005081 (https://phabricator.wikimedia.org/T357942) [21:04:49] bvibber: can you add them to the cal when they're ready? [21:04:57] (03PS2) 10Brion VIBBER: Fix for regression in audio track suppression logic [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005081 (https://phabricator.wikimedia.org/T357942) [21:05:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003551 (owner: 10C. Scott Ananian) [21:05:45] cjming: btw, wikitech doesn't have a canary, so there won't be much i can test at the canary stage except that it doesn't break the main wikis. 
[21:06:17] (03PS1) 10Brion VIBBER: Fix for regression in audio track suppression logic [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1005082 (https://phabricator.wikimedia.org/T357942) [21:06:21] cscott: yup [21:06:30] (03Merged) 10jenkins-bot: Correctly turn on Parsoid read views by default on wikitech Talk pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003551 (owner: 10C. Scott Ananian) [21:06:36] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9561435 (10taavi) >>! In T358044#9561362, @bvibber wrote: > Can't log into gerrit with bvibber, it says "Authentication failed." Gerrit (and a few other services) thinks... [21:07:06] !log cjming@deploy2002 Started scap: Backport for [[gerrit:1003551|Correctly turn on Parsoid read views by default on wikitech Talk pages]] [21:07:44] anybody want to +2 my backports? :D https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TimedMediaHandler/+/1005081 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TimedMediaHandler/+/1005082 [21:08:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T355609)', diff saved to https://phabricator.wikimedia.org/P57422 and previous config saved to /var/cache/conftool/dbconfig/20240220-210819-marostegui.json [21:08:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1234.eqiad.wmnet with reason: Maintenance [21:08:25] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [21:08:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1234.eqiad.wmnet with reason: Maintenance [21:08:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T355609)', diff saved to https://phabricator.wikimedia.org/P57423 and previous config saved to /var/cache/conftool/dbconfig/20240220-210840-marostegui.json [21:08:45] !log 
cjming@deploy2002 cscott and cjming: Backport for [[gerrit:1003551|Correctly turn on Parsoid read views by default on wikitech Talk pages]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:09:16] cscott: on test servers [21:09:33] (03CR) 10Clare Ming: [C: 03+2] Fix for regression in audio track suppression logic [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1005082 (https://phabricator.wikimedia.org/T357942) (owner: 10Brion VIBBER) [21:09:41] (03CR) 10Clare Ming: [C: 03+2] Fix for regression in audio track suppression logic [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005081 (https://phabricator.wikimedia.org/T357942) (owner: 10Brion VIBBER) [21:10:18] cjming: ok, i'll just briefly check that the config on enwiki etc hasn't changed using the canaries [21:10:34] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9561461 (10bvibber) Gerrit won't let me remove the LDAP-given email: Error 409 (Conflict): Cannot remove e-mail 'brion@wikimedia.org' which is directly associated with LDAP... [21:11:28] cjming: ok confirmed that i didn't break enwiki at least, ok to proceed [21:11:37] :) [21:11:46] cscott: great - syncing [21:11:50] !log cjming@deploy2002 cscott and cjming: Continuing with sync [21:12:54] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9561466 (10bvibber) Ok confirmed after dancing around the email settings I was able to remove the conflicting one and I now have a working "bvibber" gerrit. 
:D [21:15:06] (03PS4) 10Clare Ming: Enable desktop diff for anonymous users on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997585 (https://phabricator.wikimedia.org/T350181) (owner: 10Jdlrobson) [21:19:59] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:1003551|Correctly turn on Parsoid read views by default on wikitech Talk pages]] (duration: 12m 53s) [21:20:07] cscott: should be live! [21:20:22] cjming ok testing [21:20:24] bvibber: while we wait for CI to finish on yours, I'll move onto Jon's patches [21:20:39] ok [21:20:47] (03CR) 10BCornwall: "Thanks for your feedback. That's a good idea about adding the comment. More generally.... Is this an issue for all SLO metrics? A quick lo" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973871 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [21:21:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997585 (https://phabricator.wikimedia.org/T350181) (owner: 10Jdlrobson) [21:21:28] cjming: looks good, thanks! [21:21:37] cscott, looks like it is working ... woo hoo! very first parsoid-based default rendering in prod! :) [21:21:37] cscott: glad to hear! [21:22:00] (03Merged) 10jenkins-bot: Enable desktop diff for anonymous users on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997585 (https://phabricator.wikimedia.org/T350181) (owner: 10Jdlrobson) [21:22:23] !log cjming@deploy2002 Started scap: Backport for [[gerrit:997585|Enable desktop diff for anonymous users on enwiki (T350181)]] [21:22:31] T350181: Enable desktop diff page on mobile site - https://phabricator.wikimedia.org/T350181 [21:23:51] (03CR) 10RLazarus: "I don't know of any other graphs with the same problem. 
It may be an artifact of collecting these SLO timeseries with mtail, which no othe" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973871 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [21:23:52] !log cjming@deploy2002 jdlrobson and cjming: Backport for [[gerrit:997585|Enable desktop diff for anonymous users on enwiki (T350181)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:23:55] Jdlrobson: 1st patch up if you want to test [21:24:24] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:24:30] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:25:44] cjming: on it thanks [21:27:38] cjming: LGTM [21:27:47] cool - syncing [21:27:51] !log cjming@deploy2002 jdlrobson and cjming: Continuing with sync [21:28:37] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:28:43] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:29:07] (03PS2) 10BCornwall: slo_definitions: Switch to using haproxy_sli_bad [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973871 (https://phabricator.wikimedia.org/T341606) [21:29:39] (03CR) 10BCornwall: "I've added a note in the dashboard's description." 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973871 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [21:29:44] (03Merged) 10jenkins-bot: Fix for regression in audio track suppression logic [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1005082 (https://phabricator.wikimedia.org/T357942) (owner: 10Brion VIBBER) [21:29:51] (03Merged) 10jenkins-bot: Fix for regression in audio track suppression logic [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005081 (https://phabricator.wikimedia.org/T357942) (owner: 10Brion VIBBER) [21:30:04] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [21:31:16] (03PS4) 10Clare Ming: Enable night mode on mobile test servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003873 (https://phabricator.wikimedia.org/T357759) (owner: 10Jdlrobson) [21:35:42] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:997585|Enable desktop diff for anonymous users on enwiki (T350181)]] (duration: 13m 19s) [21:35:49] T350181: Enable desktop diff page on mobile site - https://phabricator.wikimedia.org/T350181 [21:35:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003873 (https://phabricator.wikimedia.org/T357759) (owner: 10Jdlrobson) [21:36:03] (03PS1) 10CDanis: jaeger: oauth proxy don't verify upstream [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005167 (https://phabricator.wikimedia.org/T320555) [21:36:34] (03Merged) 10jenkins-bot: Enable night mode on mobile test servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003873 (https://phabricator.wikimedia.org/T357759) (owner: 10Jdlrobson) [21:37:39] !log cjming@deploy2002 Started scap: Backport for [[gerrit:1003873|Enable night mode on mobile test servers (T357759)]] [21:37:45] T357759: Deploy night mode on the minerva skin on test wiki - 
https://phabricator.wikimedia.org/T357759 [21:39:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T355609)', diff saved to https://phabricator.wikimedia.org/P57424 and previous config saved to /var/cache/conftool/dbconfig/20240220-213904-marostegui.json [21:39:10] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [21:39:11] !log cjming@deploy2002 cjming and jdlrobson: Backport for [[gerrit:1003873|Enable night mode on mobile test servers (T357759)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:39:14] Jdlrobson: 1st patch live, 2nd patch on test servers [21:39:43] cjming: looking! [21:39:54] bvibber: yours might be too (on test servers that is) since they were +2'd separately? [21:40:18] (03CR) 10CDanis: [C: 03+2] jaeger: oauth proxy don't verify upstream [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005167 (https://phabricator.wikimedia.org/T320555) (owner: 10CDanis) [21:40:20] cjming: they affect job queue runners only so won't show up on the debug server :) just gotta push 'n' pray [21:40:21] cjming: LGTM thanks! please sync! 
[21:40:33] bvibber: gotcha
[21:40:37] !log cjming@deploy2002 cjming and jdlrobson: Continuing with sync
[21:41:10] (Merged) jenkins-bot: jaeger: oauth proxy don't verify upstream [deployment-charts] - https://gerrit.wikimedia.org/r/1005167 (https://phabricator.wikimedia.org/T320555) (owner: CDanis)
[21:42:27] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[21:42:40] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[21:46:11] (PS1) CDanis: actually get the triple negative correct [deployment-charts] - https://gerrit.wikimedia.org/r/1005173 (https://phabricator.wikimedia.org/T320555)
[21:46:26] (CR) CDanis: [C: +2] actually get the triple negative correct [deployment-charts] - https://gerrit.wikimedia.org/r/1005173 (https://phabricator.wikimedia.org/T320555) (owner: CDanis)
[21:46:48] (CR) BBlack: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1004205 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh)
[21:47:19] (Merged) jenkins-bot: actually get the triple negative correct [deployment-charts] - https://gerrit.wikimedia.org/r/1005173 (https://phabricator.wikimedia.org/T320555) (owner: CDanis)
[21:47:40] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[21:47:43] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[21:47:50] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[21:47:55] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:48:06] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[21:48:17] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[21:48:40] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:1003873|Enable night mode on mobile test servers (T357759)]] (duration: 11m 01s)
[21:48:45] T357759: Deploy night mode on the minerva skin on test wiki - https://phabricator.wikimedia.org/T357759
[21:49:19] Jdlrobson: 2nd patch should be live!
[21:49:26] !log cjming@deploy2002 Started scap: Backport for [[gerrit:1005081|Fix for regression in audio track suppression logic (T357942)]], [[gerrit:1005082|Fix for regression in audio track suppression logic (T357942)]]
[21:49:31] T357942: Regression in iOS video playlist output (logic error in refactor) - https://phabricator.wikimedia.org/T357942
[21:50:55] !log cjming@deploy2002 brion and cjming: Backport for [[gerrit:1005081|Fix for regression in audio track suppression logic (T357942)]], [[gerrit:1005082|Fix for regression in audio track suppression logic (T357942)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:50:57] !log cjming@deploy2002 brion and cjming: Continuing with sync
[21:54:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P57426 and previous config saved to /var/cache/conftool/dbconfig/20240220-215410-marostegui.json
[21:54:41] (PS1) CDanis: jaeger: oauth proxy fix skip_verify config name [deployment-charts] - https://gerrit.wikimedia.org/r/1005177 (https://phabricator.wikimedia.org/T320555)
[21:54:50] (CR) CDanis: [C: +2] jaeger: oauth proxy fix skip_verify config name [deployment-charts] - https://gerrit.wikimedia.org/r/1005177 (https://phabricator.wikimedia.org/T320555) (owner: CDanis)
[21:55:28] (PS4) C. Scott Ananian: Turn on Parsoid read views by default on officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/999062 (https://phabricator.wikimedia.org/T355566)
[21:55:42] (Merged) jenkins-bot: jaeger: oauth proxy fix skip_verify config name [deployment-charts] - https://gerrit.wikimedia.org/r/1005177 (https://phabricator.wikimedia.org/T320555) (owner: CDanis)
[21:56:15] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[21:56:27] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[21:56:58] (CR) Subramanya Sastry: [C: +1] Turn on Parsoid read views by default on officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/999062 (https://phabricator.wikimedia.org/T355566) (owner: C. Scott Ananian)
[21:58:48] thanks for your help today cjming !
[21:58:50] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:1005081|Fix for regression in audio track suppression logic (T357942)]], [[gerrit:1005082|Fix for regression in audio track suppression logic (T357942)]] (duration: 09m 24s)
[21:58:55] bvibber: your backports should be live!
[21:58:58] T357942: Regression in iOS video playlist output (logic error in refactor) - https://phabricator.wikimedia.org/T357942
[21:59:14] Jdlrobson: happy to help :)
[21:59:36] woohoo
[22:00:10] cjming: and confirmed fixed :D thx!
[22:00:19] bvibber: yay!
[22:00:30] !log end of UTC late backport window
[22:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:09:02] SRE, CAS-SSO, Infrastructure-Foundations: Enable webauthn in CAS to replace U2F - https://phabricator.wikimedia.org/T311236#9561647 (Scott_French) FYI, I've added an outdated block to the U2F-based enrollment procedure in https://wikitech.wikimedia.org/wiki/CAS-SSO (as it no longer works). Just menti...
[22:09:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P57427 and previous config saved to /var/cache/conftool/dbconfig/20240220-220917-marostegui.json
[22:18:34] !log Starting refinery deployment
[22:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:51] !log sfaci@deploy2002 Started deploy [analytics/refinery@d078656]: Regular analytics weekly train [analytics/refinery@d0786561]
[22:24:06] (PS4) BCornwall: slo_definitions: Use trafficserver_backend_sli_bad [grafana-grizzly] - https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606)
[22:24:14] (CR) BCornwall: "Thanks for your input. That makes sense. Since I've appended to the description, I'll mark this as Done." [grafana-grizzly] - https://gerrit.wikimedia.org/r/973871 (https://phabricator.wikimedia.org/T341606) (owner: BCornwall)
[22:24:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T355609)', diff saved to https://phabricator.wikimedia.org/P57428 and previous config saved to /var/cache/conftool/dbconfig/20240220-222423-marostegui.json
[22:24:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[22:24:31] (CR) BCornwall: "Done" [grafana-grizzly] - https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606) (owner: BCornwall)
[22:24:33] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[22:24:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[22:24:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T355609)', diff saved to https://phabricator.wikimedia.org/P57429 and previous config saved to /var/cache/conftool/dbconfig/20240220-222445-marostegui.json
[22:34:10] !log sfaci@deploy2002 Finished deploy [analytics/refinery@d078656]: Regular analytics weekly train [analytics/refinery@d0786561] (duration: 13m 19s)
[22:35:21] !log sfaci@deploy2002 Started deploy [analytics/refinery@d078656]: Regular analytics weekly train [analytics/refinery@d0786561]
[22:35:42] !log sfaci@deploy2002 Finished deploy [analytics/refinery@d078656]: Regular analytics weekly train [analytics/refinery@d0786561] (duration: 00m 21s)
[22:35:56] !log sfaci@deploy2002 Started deploy [analytics/refinery@d078656] (thin): Regular analytics weekly train THIN [analytics/refinery@d0786561]
[22:36:02] !log sfaci@deploy2002 Finished deploy [analytics/refinery@d078656] (thin): Regular analytics weekly train THIN [analytics/refinery@d0786561] (duration: 00m 05s)
[22:36:19] !log sfaci@deploy2002 Started deploy [analytics/refinery@d078656] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d0786561]
[22:39:48] !log sfaci@deploy2002 Finished deploy [analytics/refinery@d078656] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d0786561] (duration: 03m 29s)
[22:48:35] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:52:44] !log Deployed refinery using scap, then deployed onto hdfs
[22:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T355609)', diff saved to https://phabricator.wikimedia.org/P57430 and previous config saved to /var/cache/conftool/dbconfig/20240220-225311-marostegui.json
[22:53:17] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[23:00:49] (PuppetDisabled) firing: Puppet disabled on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=ganeti&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[23:08:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P57431 and previous config saved to /var/cache/conftool/dbconfig/20240220-230817-marostegui.json
[23:23:19] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:23:20] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:23:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P57432 and previous config saved to /var/cache/conftool/dbconfig/20240220-232326-marostegui.json
[23:24:16] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:24:18] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:24:58] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:25:08] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:38:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T355609)', diff saved to https://phabricator.wikimedia.org/P57433 and previous config saved to /var/cache/conftool/dbconfig/20240220-233832-marostegui.json
[23:38:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[23:38:38] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[23:38:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[23:58:33] SRE, Wikimedia-Mailing-lists: Set up mailing list for zh.wikipedia - https://phabricator.wikimedia.org/T358011#9561878 (Ladsgroup) a:Ladsgroup Waiting to check some stuff.
[23:59:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance