[00:08:25] PROBLEM - Check systemd state on aphlict1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:14:59] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:17:17] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:17:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:28:39] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:34:01] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:35:29] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:40:41] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:51:03] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:53:19] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:54:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[01:00:25] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:09:21] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:18:27] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:20:45] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:21:33] PROBLEM - SSH on ms-be1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:37:45] (JobUnavailable) firing: (6) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:43:29] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:45:45] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:47:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:57:07] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:57:08] (PS1) AndyRussG: CentralNotice: Set ESI test string [mediawiki-config] - https://gerrit.wikimedia.org/r/843590 (https://phabricator.wikimedia.org/T308799)
[01:59:25] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221018T0200)
[02:03:13] RECOVERY - SSH on wdqs2005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:07:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:48] (PS1) TrainBranchBot: Branch commit for wmf/1.40.0-wmf.6 [core] (wmf/1.40.0-wmf.6) - https://gerrit.wikimedia.org/r/843534 (https://phabricator.wikimedia.org/T320511)
[02:07:50] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.40.0-wmf.6 [core] (wmf/1.40.0-wmf.6) - https://gerrit.wikimedia.org/r/843534 (https://phabricator.wikimedia.org/T320511) (owner: TrainBranchBot)
[02:10:31] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:10:47] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:12:49] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:17:37] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:22:17] (CR) CI reject: [V: -1] Branch commit for wmf/1.40.0-wmf.6 [core] (wmf/1.40.0-wmf.6) - https://gerrit.wikimedia.org/r/843534 (https://phabricator.wikimedia.org/T320511) (owner: TrainBranchBot)
[02:22:41] RECOVERY - SSH on ms-be1040.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:24:27] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:26:45] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:32:56] SRE, Traffic, Patch-For-Review: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (AndyRussG) @BBlack @Vgutierrez hiii! The CentralNotice change will go out with the main cluster deploy train this week! The related config change to set the requested st...
[02:33:35] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:35:51] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:42:41] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:44:59] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:54:57] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 57, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221018T0300)
[03:03:58] SRE, SRE-OnFire, Data-Persistence, Wikimedia-Incident: s6 master failure - https://phabricator.wikimedia.org/T320990 (RLazarus) Draft: https://wikitech.wikimedia.org/wiki/Incidents/2022-10-15_s6_master_failure
[03:15:25] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:17:22] SRE-tools, Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (wiki_willy) Hi @Volans - thanks for reaching out with the suggestion. We definitely could start doing that going forward. In this additional tab on the Accounting Spreadsh...
[03:21:23] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:28:01] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:30:02] SRE, Observability-Metrics, serviceops, Maps (Kartotherian): Get Kartotherian SLO metrics into Prometheus - https://phabricator.wikimedia.org/T320748 (lmata)
[03:30:17] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:32:18] SRE, Maps, Observability-Metrics, observability, and 2 others: SLO dashboards with N latency targets - https://phabricator.wikimedia.org/T320749 (lmata)
[03:46:25] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:51:21] (PS1) Sohom Datta: Enable source links on Translation ns on enwikisource and thwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/843593 (https://phabricator.wikimedia.org/T53980)
[03:52:43] (PS2) Sohom Datta: Enable source links on Translation ns on enwikisource and thwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/843593 (https://phabricator.wikimedia.org/T53980)
[03:55:19] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:57:35] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:58:55] SRE, Patch-For-Review: Deploy MediaWiki config change to use OpenSSL for PBKDF2 password hashing - https://phabricator.wikimedia.org/T320929 (Legoktm) @Urbanecm please clarify what part of this change do you want SRE to coordinate :)
[04:00:25] SRE, Patch-For-Review: Deploy MediaWiki config change to use OpenSSL for PBKDF2 password hashing - https://phabricator.wikimedia.org/T320929 (Legoktm)
[04:02:19] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:29:27] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:36:19] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:36:27] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:43:04] SRE, Patch-For-Review: Deploy MediaWiki config change to use OpenSSL for PBKDF2 password hashing - https://phabricator.wikimedia.org/T320929 (PleaseStand) >>! In T320929#8323596, @Legoktm wrote: > @Urbanecm please clarify what part of this change do you want SRE to coordinate :) For context, here are th...
[04:52:21] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:54:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[04:54:37] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:05:59] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:21:55] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:28:43] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:28:43] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:30:59] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:31:01] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:46:55] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:46:57] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:49:09] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:51:29] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:56:01] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:00:04] kormat, marostegui, and Amir1: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221018T0600)
[06:07:21] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:18:37] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:19:01] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:20:51] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:26:11] RECOVERY - BGP status on cr1-drmrs is OK: BGP OK - up: 55, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:26:23] (CR) Sohom Datta: [C: -1] "Need to wait for the roll-out of 1.40-wmf.6" [mediawiki-config] - https://gerrit.wikimedia.org/r/843593 (https://phabricator.wikimedia.org/T53980) (owner: Sohom Datta)
[06:36:31] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:37:25] Hi, CI failure of T321021 is resolved, anyone knows how to fix the branching of 1.40.0-wmf.6?
[06:37:26] T321021: MediaWiki core CI failing: OutputPageTest::testAddBodyClasses - https://phabricator.wikimedia.org/T321021
[06:43:31] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:45:35] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:45:47] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:57:52] Func: looking\
[06:59:07] (PS1) Majavah: OutputPageTest: Adjust testAddBodyClasses [core] (wmf/1.40.0-wmf.6) - https://gerrit.wikimedia.org/r/843562 (https://phabricator.wikimedia.org/T321021)
[06:59:27] (CR) Majavah: [C: +2] OutputPageTest: Adjust testAddBodyClasses [core] (wmf/1.40.0-wmf.6) - https://gerrit.wikimedia.org/r/843562 (https://phabricator.wikimedia.org/T321021) (owner: Majavah)
[06:59:43] (PS2) Majavah: Branch commit for wmf/1.40.0-wmf.6 [core] (wmf/1.40.0-wmf.6) - https://gerrit.wikimedia.org/r/843534 (https://phabricator.wikimedia.org/T320511) (owner: TrainBranchBot)
[06:59:47] (CR) Majavah: [C: +2] Branch commit for wmf/1.40.0-wmf.6 [core] (wmf/1.40.0-wmf.6) - https://gerrit.wikimedia.org/r/843534 (https://phabricator.wikimedia.org/T320511) (owner: TrainBranchBot)
[06:59:50] that should do it]
[06:59:59] sigh. /me can't spell today
[07:00:04] Amir1 and Urbanecm: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221018T0700). nyaa~
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:09:26] (PS1) Func: Follow-up 76d1135: Use better practice in the code [skins/Vector] (wmf/1.40.0-wmf.6) - https://gerrit.wikimedia.org/r/843886 (https://phabricator.wikimedia.org/T319447)
[07:13:21] taavi: thank you, I just got the notification.
[07:13:41] I noticed the above patch didn't catch the train. I think it should be backported since we renamed a newly added preference, and don't want the wrong name to go into prod.
[07:14:43] Amir1 and Urbanecm: I hope it's not too late to apply for a backport?
[07:14:44] (CR) Majavah: [C: +2] Follow-up 76d1135: Use better practice in the code [skins/Vector] (wmf/1.40.0-wmf.6) - https://gerrit.wikimedia.org/r/843886 (https://phabricator.wikimedia.org/T319447) (owner: Func)
[07:15:01] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:15:29] Func: wmf.6 hasn't yet been checked out on the deployment servers, so I'll just +2 it and it'll go out with the train itself
[07:15:36] (Merged) jenkins-bot: OutputPageTest: Adjust testAddBodyClasses [core] (wmf/1.40.0-wmf.6) - https://gerrit.wikimedia.org/r/843562 (https://phabricator.wikimedia.org/T321021) (owner: Majavah)
[07:15:39] oh ok thank you
[07:16:09] (usually the checkout is automated, but it fails if the branch commit fails to merge)
[07:17:17] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:17:28] (Merged) jenkins-bot: Branch commit for wmf/1.40.0-wmf.6 [core] (wmf/1.40.0-wmf.6) - https://gerrit.wikimedia.org/r/843534 (https://phabricator.wikimedia.org/T320511) (owner: TrainBranchBot)
[07:21:51] taavi: Is that going to require a manual submodule update since the backported skin commit merged before the core branch commit?
[07:22:37] PleaseStand: huh? the branch commit was merged, and the skin commit is still in CI
[07:22:50] the first commit to fix the tests was in core
[07:24:45] taavi: You're right, the skin commit indeed has not merged yet
[07:29:32] (Merged) jenkins-bot: Follow-up 76d1135: Use better practice in the code [skins/Vector] (wmf/1.40.0-wmf.6) - https://gerrit.wikimedia.org/r/843886 (https://phabricator.wikimedia.org/T319447) (owner: Func)
[07:30:27] taavi: And now I see the "Update git submodules" commit as expected
[07:31:27] taavi: thank you for the 1.40.0-wmf.6 CI fix up :]
[07:31:29] PROBLEM - SSH on mw1309.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:33:37] Func: I will get your patch on the deployment, I have to run the 1.40.0-wmf.6 deployment manually since some test failed during the night
[07:33:39] your patch will be in
[07:35:17] !log Rebased /srv/mediawiki-staging/php-1.40.0-wmf.6 for de15f77aa428e3aacf6b66938fb7bdb45ef91443 ( T321021 ) and 0f8be847d9d81882ad5c1e54c2b45cc4d918eb97 ( T319447 )
[07:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:24] T319447: Create user preference to turn fixed width on and off - https://phabricator.wikimedia.org/T319447
[07:35:24] T321021: MediaWiki core CI failing: OutputPageTest::testAddBodyClasses - https://phabricator.wikimedia.org/T321021
[07:36:09] I should probably redo php-1.40.0-wmf.6 from scratch
[07:36:27] SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (dcaro) > You also have to bear in mind, with some tasks like like Ceph initial syncing, that a well tuned/performant system will use whatever bandwidth is...
[07:37:25] !log Scratched /srv/mediawiki-staging/php-1.40.0-wmf.6 entirely and doing `scap prep` instead
[07:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:59] !log `scap stage-train 1.40.0-wmf.6` # T320511
[07:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:03] T320511: 1.40.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T320511
[07:44:29] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:46:14] (PS1) TrainBranchBot: testwikis wikis to 1.40.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/843868 (https://phabricator.wikimedia.org/T320511)
[07:46:16] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.40.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/843868 (https://phabricator.wikimedia.org/T320511) (owner: TrainBranchBot)
[07:46:43] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:47:03] (Merged) jenkins-bot: testwikis wikis to 1.40.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/843868 (https://phabricator.wikimedia.org/T320511) (owner: TrainBranchBot)
[07:47:32] !log hashar@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.6 refs T320511
[07:47:37] T320511: 1.40.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T320511
[07:49:45] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:51:04] SRE, ops-eqiad, DBA, Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (jcrespo) Please dc ops, request for a disk replacement, as this host should be under warranty.
[07:57:30] SRE, Discovery-Search, serviceops, serviceops-collab, Technical-Debt: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (Clement_Goubert) @mpopov @Gehel This was clarified in our ServiceOps meeting. We are not touching the `search` and `search-https` services.
[08:00:05] hashar and dduvall: Dear deployers, time to do the MediaWiki train - Utc-0+Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221018T0800).
[08:09:05] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:10:49] PROBLEM - SSH on wdqs2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:11:57] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:14:13] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:21:56] (CR) Filippo Giunchedi: "I can confirm this is working as expected and supercedes the existing per-host alerts:" [puppet] - https://gerrit.wikimedia.org/r/841886 (https://phabricator.wikimedia.org/T314118) (owner: Filippo Giunchedi)
[08:23:37] !log hashar@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.6 refs T320511 (duration: 36m 04s)
[08:23:43] T320511: 1.40.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T320511
[08:26:49] SRE, ops-codfw, DBA, Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (jcrespo)
[08:26:50] !log scap clean auto # T320511
[08:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:54] SRE, ops-codfw, DBA: es2021 (B3) lost power supply redundancy - https://phabricator.wikimedia.org/T314559 (jcrespo) Resolved→Open @Papaul, this is reocurring- my guess is the cable is unfit so it got loose again. Assuming it is that (or if you can provide further insight), maybe requesting a...
[08:27:35] going to promote group0
[08:28:44] SRE, ops-codfw, DBA: es2021 (B3) lost power supply redundancy - https://phabricator.wikimedia.org/T314559 (jcrespo)
[08:28:49] !log hashar@deploy1002 Pruned MediaWiki: 1.40.0-wmf.4 (duration: 02m 11s)
[08:29:49] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:30:44] (CR) Vgutierrez: sre: test warning on pybal backends being down for long (1 comment) [alerts] - https://gerrit.wikimedia.org/r/841905 (https://phabricator.wikimedia.org/T320627) (owner: Filippo Giunchedi)
[08:30:48] (PS1) TrainBranchBot: group0 wikis to 1.40.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/843871 (https://phabricator.wikimedia.org/T320511)
[08:30:54] (CR) TrainBranchBot: [C: +2] group0 wikis to 1.40.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/843871 (https://phabricator.wikimedia.org/T320511) (owner: TrainBranchBot)
[08:31:40] (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/843535
[08:31:42] (Merged) jenkins-bot: group0 wikis to 1.40.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/843871 (https://phabricator.wikimedia.org/T320511) (owner: TrainBranchBot)
[08:32:45] RECOVERY - SSH on mw1309.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:34:25] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:35:44] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.6 refs T320511
[08:35:48] T320511: 1.40.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T320511
[08:41:25] (CR) Giuseppe Lavagetto: [C: +1] confd: remove check_confd_template icinga check [puppet] - https://gerrit.wikimedia.org/r/841886 (https://phabricator.wikimedia.org/T314118) (owner: Filippo Giunchedi)
[08:43:26] (CR) Filippo Giunchedi: [C: +2] confd: remove check_confd_template icinga check [puppet] - https://gerrit.wikimedia.org/r/841886 (https://phabricator.wikimedia.org/T314118) (owner: Filippo Giunchedi)
[08:44:49] and I am rolling back
[08:45:40] SRE, Infrastructure-Foundations, netops: BFD flapping between cr1-eqiad and cr2-drmrs - https://phabricator.wikimedia.org/T321034 (cmooney) I think I may have solved this, although through nothing logical, similar to the earlier BGP bounce restoring the IPv6. I disabled OSPF for the interface and re...
[08:50:42] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.40.0-wmf.6" # T320511
[08:50:47] T320511: 1.40.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T320511
[08:51:06] (CR) DCausse: [C: +1] cirrus: Correct comments in ProductionServices.php [mediawiki-config] - https://gerrit.wikimedia.org/r/843521 (https://phabricator.wikimedia.org/T262630) (owner: Ebernhardson)
[08:52:51] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[08:54:05] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[08:54:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[08:54:51] SRE, serviceops, Performance-Team (Radar): Remove nutcracker from mediawiki chart - https://phabricator.wikimedia.org/T321042 (Clement_Goubert) a:jijiki→Clement_Goubert
[08:55:00] SRE, serviceops-radar, Patch-For-Review, SRE Observability (FY2022/2023-Q1), User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (fgiunchedi)
[08:55:31] PROBLEM - Check systemd state on mw1439 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:56:15] (PS2) Clément Goubert: mwdebug: Disable nutcracker [deployment-charts] - https://gerrit.wikimedia.org/r/843425 (https://phabricator.wikimedia.org/T316296)
[08:56:40] (CR) Btullis: systemd: drop timer-specific alert in favor of generic alert (1 comment) [puppet] - https://gerrit.wikimedia.org/r/841924 (https://phabricator.wikimedia.org/T303253) (owner: Filippo Giunchedi)
[09:03:51] <_joe_> uhm
[09:06:45] (PS1) Giuseppe Lavagetto: confd: use the v3 style srv records [puppet] - https://gerrit.wikimedia.org/r/843873 (https://phabricator.wikimedia.org/T320397)
[09:06:47] (PS1) Giuseppe Lavagetto: jobrunner: add php7.4 to the list of services [puppet] - https://gerrit.wikimedia.org/r/843874
[09:07:18] (PS2) Giuseppe Lavagetto: jobrunner: add php7.4 to the list of services [puppet] - https://gerrit.wikimedia.org/r/843874
[09:07:54] SRE, Infrastructure-Foundations, netops: BFD flapping between cr1-eqiad and cr2-drmrs - https://phabricator.wikimedia.org/T321034 (ayounsi) Open→Resolved a:cmooney Awesome, thanks! I cleared the Icinga downtimes now that it's all back to normal.
[09:10:51] RECOVERY - SSH on wdqs2005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:12:45] (CR) Clément Goubert: [C: +1] jobrunner: add php7.4 to the list of services [puppet] - https://gerrit.wikimedia.org/r/843874 (owner: Giuseppe Lavagetto)
[09:15:36] (PS3) Clément Goubert: mwdebug: Disable nutcracker [deployment-charts] - https://gerrit.wikimedia.org/r/843425 (https://phabricator.wikimedia.org/T321042)
[09:22:17] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:28:48] SRE, Discovery-Search, serviceops, serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (Clement_Goubert) Disregard the above related patch, I fumbled the Bug id.
[09:28:56] (CR) Giuseppe Lavagetto: [C: +2] jobrunner: add php7.4 to the list of services [puppet] - https://gerrit.wikimedia.org/r/843874 (owner: Giuseppe Lavagetto)
[09:29:37] SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (ayounsi) Not sure what happened, but there are many outstanding diffs on switches: `lang=diff [edit interfaces xe-0/0/17] - description "kafka-jumbo1010 {#20220240}"; + de...
[09:31:48] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ayounsi) p:05Medium→03High [09:34:57] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: test warning on pybal backends being down for long (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/841905 (https://phabricator.wikimedia.org/T320627) (owner: 10Filippo Giunchedi) [09:41:59] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mwdebug: Disable nutcracker [deployment-charts] - 10https://gerrit.wikimedia.org/r/843425 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert) [09:42:22] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [09:42:40] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [09:43:43] (03PS1) 10Clément Goubert: mediawiki: Remove all nutcracker templates and refs [deployment-charts] - 10https://gerrit.wikimedia.org/r/843878 (https://phabricator.wikimedia.org/T321042) [09:44:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] add profile::docker::gvisor [puppet] - 10https://gerrit.wikimedia.org/r/841575 (https://phabricator.wikimedia.org/T316706) (owner: 10Ori) [09:44:50] (03CR) 10Giuseppe Lavagetto: [C: 03+2] service::docker: allow runtime to be specified [puppet] - 10https://gerrit.wikimedia.org/r/841574 (https://phabricator.wikimedia.org/T316706) (owner: 10Ori) [09:46:03] (03PS1) 10Clément Goubert: mwdebug: Remove nutcracker config values [deployment-charts] - 10https://gerrit.wikimedia.org/r/843880 (https://phabricator.wikimedia.org/T321042) [09:49:07] (03PS1) 10Urbanecm: Revert "Add multiple integration tests for Hooks.php" [extensions/CheckUser] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/843906 (https://phabricator.wikimedia.org/T321041) [09:49:41] hashar: hi, ^^ should fix T321041 
[09:52:41] ...and i just saw your message you're afk, hijacking the remainder of train window to fix the blocker [09:53:23] (03PS1) 10Urbanecm: Revert "group0 wikis to 1.40.0-wmf.6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843536 (https://phabricator.wikimedia.org/T320511) [09:53:53] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "already deployed, but not merged in gerrit" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843536 (https://phabricator.wikimedia.org/T320511) (owner: 10Urbanecm) [09:54:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/CheckUser] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/843906 (https://phabricator.wikimedia.org/T321041) (owner: 10Urbanecm) [09:54:14] fyi zabe ^^ [09:54:27] 10SRE, 10Discovery-Search, 10serviceops, 10serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10Clement_Goubert) Restored the trafficserver search.wikimedia.org removal patch. As I understand it, removing this mapping will stop traffic to the... 
[10:03:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:08:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:10:10] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:10:39] (03Merged) 10jenkins-bot: Revert "Add multiple integration tests for Hooks.php" [extensions/CheckUser] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/843906 (https://phabricator.wikimedia.org/T321041) (owner: 10Urbanecm) [10:10:43] finally [10:11:26] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:843906|Revert "Add multiple integration tests for Hooks.php" (T321041)]] [10:11:51] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:843906|Revert "Add multiple integration tests for Hooks.php" (T321041)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [10:17:50] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:843906|Revert "Add multiple integration tests for Hooks.php" (T321041)]] (duration: 06m 24s) [10:17:57] 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10BTullis) Thanks @Ladsgroup - noted. 
So it seems that we have five topic or team based `-alerts` lists on mailman already:... [10:18:35] * urbanecm done [10:23:29] urbanecm: thanks. I will catch up after lunch [10:39:41] sounds good [10:40:18] (03PS1) 10Filippo Giunchedi: analytics: move kerberos::systemd_timer and deps to send_mail param [puppet] - 10https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) [10:41:04] (03CR) 10CI reject: [V: 04-1] analytics: move kerberos::systemd_timer and deps to send_mail param [puppet] - 10https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [10:46:27] (03PS2) 10Filippo Giunchedi: analytics: move kerberos::systemd_timer and deps to send_mail param [puppet] - 10https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) [10:47:10] (03CR) 10CI reject: [V: 04-1] analytics: move kerberos::systemd_timer and deps to send_mail param [puppet] - 10https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [10:49:25] (03PS3) 10Filippo Giunchedi: analytics: move kerberos::systemd_timer and deps to send_mail param [puppet] - 10https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) [10:49:34] third time might be the charm [10:51:49] (03CR) 10Clément Goubert: [C: 03+2] mwdebug: Disable nutcracker [deployment-charts] - 10https://gerrit.wikimedia.org/r/843425 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert) [10:55:39] (03Merged) 10jenkins-bot: mwdebug: Disable nutcracker [deployment-charts] - 10https://gerrit.wikimedia.org/r/843425 (https://phabricator.wikimedia.org/T321042) (owner: 10Clément Goubert) [10:57:08] !log Disabling nutcracker on k8s-experimental mwdebug - T321042 [10:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:13] T321042: Remove nutcracker from mediawiki chart - https://phabricator.wikimedia.org/T321042 [10:59:15] !log 
cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:05:47] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:06:19] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:06:55] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:08:16] (03PS2) 10Clément Goubert: mediawiki: Remove all nutcracker templates and refs [deployment-charts] - 10https://gerrit.wikimedia.org/r/843878 (https://phabricator.wikimedia.org/T321042) [11:08:50] !log Nutcrackerd disabled on k8s-experimental mwdebug - T321042 [11:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:54] T321042: Remove nutcracker from mediawiki chart - https://phabricator.wikimedia.org/T321042 [11:12:57] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Remove nutcracker from mediawiki chart - https://phabricator.wikimedia.org/T321042 (10Clement_Goubert) ` root@deploy1002:/srv/deployment-charts/helmfile.d/services/mwdebug# kube_env mwdebug codfw root@deploy1002:/srv/deployment-charts/hel... [11:16:55] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10cmooney) > I'll give the test a go somewhere just to see what the throughput bottleneck looks like in grafana 👍 Cool. If you're starting with iperf I c... [11:22:33] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10Volans) @wiki_willy for the format whatever is easier for you based on your workflow, here a couple of alternative options that comes to mind, but feel free to propose somet... 
[11:23:28] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:24:02] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:24:18] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:26:58] (03PS4) 10Btullis: Add dse-k8s-worker as a permitted alias for the reboot-nodes cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/840186 (https://phabricator.wikimedia.org/T310196) [11:27:30] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:28:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:28:34] (03CR) 10Volans: [C: 04-1] "I think there's a problem with the logic" [cookbooks] - 10https://gerrit.wikimedia.org/r/840186 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [11:31:58] (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843930 (https://phabricator.wikimedia.org/T320511) [11:32:01] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843930 (https://phabricator.wikimedia.org/T320511) (owner: 10TrainBranchBot) [11:32:02] lets try again ;) [11:32:45] (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843930 (https://phabricator.wikimedia.org/T320511) (owner: 10TrainBranchBot) [11:35:10] hashar: fingers crossed! 
[11:37:06] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.6 refs T320511 [11:37:11] T320511: 1.40.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T320511 [11:39:41] (03PS1) 10Btullis: Add cumin aliases for dse-k8s in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/843932 (https://phabricator.wikimedia.org/T310196) [11:41:04] (03CR) 10Btullis: Add dse-k8s-worker as a permitted alias for the reboot-nodes cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/840186 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [11:41:21] (03CR) 10Volans: "If those are useful sure go ahead, if the only reason to add them is for I85d27351ba563d54ceaf9954a83ff1458f3c6d7e then you can just add t" [puppet] - 10https://gerrit.wikimedia.org/r/843932 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [11:43:52] (03CR) 10Volans: [C: 04-1] "added additional optimization comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/840186 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [11:44:24] (03CR) 10Btullis: "Whoops. I accidentally left my +2 on as well as the -1 - Sorry." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/840186 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [11:45:09] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/843936 (https://phabricator.wikimedia.org/T320961) [11:45:36] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/843936 (https://phabricator.wikimedia.org/T320961) (owner: 10Kosta Harlan) [11:49:41] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/843936 (https://phabricator.wikimedia.org/T320961) (owner: 10Kosta Harlan) [11:50:14] (03CR) 10Btullis: Add cumin aliases for dse-k8s in eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/843932 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [11:50:57] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [11:51:52] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [11:52:36] !log kharlan@deploy1002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [11:54:50] (03PS1) 10Kosta Harlan: GrowthExperiments: Enable AddLink backend for bat_smg and be_x_old [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843937 (https://phabricator.wikimedia.org/T304549) [11:55:06] !log kharlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [11:55:31] !log kharlan@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [11:55:38] (03PS10) 10Giuseppe Lavagetto: New organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 [11:57:25] !log kharlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [12:14:16] PROBLEM - SSH on wdqs2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:16:52] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37587/console" [puppet] - 10https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [12:30:12] (03PS1) 10Matthias Mullie: Add default value for search-thumbnail-extra-namespaces [core] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/843913 (https://phabricator.wikimedia.org/T320337) [12:41:10] (03CR) 10ArielGlenn: [C: 03+1] "Two big thumbs up from me, nice to see the licensing situation being cleared up at last!" [puppet] - 10https://gerrit.wikimedia.org/r/842760 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:54:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [12:55:24] (03CR) 10ArielGlenn: "I ran pcc and a bunch of things were removed, I was expecting this to be a noop. Am I misunderstanding the purpose of the patch? See https" [puppet] - 10https://gerrit.wikimedia.org/r/842934 (owner: 10Krinkle) [12:55:36] 10SRE, 10SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for hshaikh - https://phabricator.wikimedia.org/T321068 (10HShaikh) [12:59:05] (03PS2) 10Clément Goubert: mwdebug: Remove nutcracker config values [deployment-charts] - 10https://gerrit.wikimedia.org/r/843880 (https://phabricator.wikimedia.org/T321042) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221018T1300). [13:00:05] MatmaRex, AndyRussG, and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221018T1300) [13:00:18] I can’t deploy today, sorry [13:00:21] hello [13:00:25] hi [13:02:47] (03PS2) 10Kosta Harlan: GrowthExperiments: Enable AddLink backend for bat_smg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843937 (https://phabricator.wikimedia.org/T304549) [13:03:19] I can deploy my own patch but would be pressed for time to do the others, I'm afraid. [13:04:08] hiii [13:04:11] oh, they are all config patches, I guess it is something I could do [13:04:58] MatmaRex: I'll start with yours [13:05:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843580 (https://phabricator.wikimedia.org/T320683) (owner: 10Bartosz Dziewoński) [13:05:40] thanks [13:06:25] kostajh: thanks!!!! 
[13:06:27] (03Merged) 10jenkins-bot: Add "Clear Affordances" to DiscussionTools beta feature on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843580 (https://phabricator.wikimedia.org/T320683) (owner: 10Bartosz Dziewoński) [13:06:53] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:843580|Add "Clear Affordances" to DiscussionTools beta feature on most wikis (T320683)]] [13:06:58] T320683: [Config Change] Add Clear Affordances to beta feature at Phase 1 wikis (desktop) - https://phabricator.wikimedia.org/T320683 [13:07:18] !log kharlan@deploy1002 kharlan and matmarex: Backport for [[gerrit:843580|Add "Clear Affordances" to DiscussionTools beta feature on most wikis (T320683)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [13:07:35] MatmaRex: please test on mwdebu1001 [13:07:44] mwdebug1001, even [13:08:10] looking [13:08:37] seems good [13:09:52] ack [13:13:53] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:843580|Add "Clear Affordances" to DiscussionTools beta feature on most wikis (T320683)]] (duration: 07m 00s) [13:14:09] AndyRussG: ok, on to yours :) [13:14:19] okiiii :) :) [13:14:29] !log kharlan@deploy1002 backport aborted: (duration: 00m 02s) [13:14:43] (just a second) [13:15:22] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:15:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843590 (https://phabricator.wikimedia.org/T308799) (owner: 10AndyRussG) [13:16:00] this will insert an html comment on wikis in group 0 (and then on other subsequent groups, when the train reaches them later this week) [13:16:35] (03Merged) 10jenkins-bot: CentralNotice: Set ESI test string [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843590 
(https://phabricator.wikimedia.org/T308799) (owner: 10AndyRussG) [13:16:50] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:17:00] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:843590|CentralNotice: Set ESI test string (T308799 T320734)]] [13:17:23] !log kharlan@deploy1002 kharlan and andyrussg: Backport for [[gerrit:843590|CentralNotice: Set ESI test string (T308799 T320734)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [13:18:34] hi kostajh, would you mind running a maint script to purge the cache for one file? [13:19:09] AndyRussG: please test on mwdebug1001.eqiad.wmnet [13:19:17] koi: sure, which script? [13:19:26] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:20:54] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48827 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:21:16] (03CR) 10David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [13:21:19] it's the "purgeList.php", the command should be `echo "https://en.wikipedia.org/static/images/mobile/copyright/wiktionary-tagline-fr.svg" | mwscript purgeList.php #T320840` [13:22:16] kostajh: looks great, thanks!! 
[13:22:34] ok, syncing [13:25:16] koi: ok [13:26:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10EChetty) [13:26:33] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:843590|CentralNotice: Set ESI test string (T308799 T320734)]] (duration: 09m 33s) [13:27:06] koi: do I need to pass a --wiki argument to mwscript? [13:27:16] I think no [13:27:35] looks like it’s sometimes used but not required https://sal.toolforge.org/production?p=0&q=purgeList.php&d= [13:27:53] (03PS1) 10Dom Walden: Add IP address of deployment-cache-text07. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843943 (https://phabricator.wikimedia.org/T321072) [13:27:53] this is a follow-up from the 12 September script run? https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2022-09-12 [13:28:22] alright, running it [13:28:26] no, it's just for T320840 [13:28:27] T320840: Vector 2022: Wrong tagline for other site displayed under logo - https://phabricator.wikimedia.org/T320840 [13:28:34] done [13:28:38] thanks! 
[13:28:44] (03PS3) 10Kosta Harlan: GrowthExperiments: Enable AddLink backend for bat_smg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843937 (https://phabricator.wikimedia.org/T304549) [13:28:59] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843937 (https://phabricator.wikimedia.org/T304549) (owner: 10Kosta Harlan) [13:29:05] (03CR) 10JMeybohm: [C: 03+1] "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/842765 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:29:05] yw [13:29:47] (03Merged) 10jenkins-bot: GrowthExperiments: Enable AddLink backend for bat_smg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843937 (https://phabricator.wikimedia.org/T304549) (owner: 10Kosta Harlan) [13:30:10] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:843937|GrowthExperiments: Enable AddLink backend for bat_smg (T304549)]] [13:30:15] T304549: Deploy "add a link" to 5th round of wikis - https://phabricator.wikimedia.org/T304549 [13:30:34] !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:843937|GrowthExperiments: Enable AddLink backend for bat_smg (T304549)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:30:50] (03PS1) 10Samtar: reverse-proxy-staging: Update -cache-text07/-cache-upload07 IPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843944 (https://phabricator.wikimedia.org/T321072) [13:31:43] (03PS2) 10Samtar: reverse-proxy-staging: Update -cache-text07/-cache-upload07 IPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843944 (https://phabricator.wikimedia.org/T321072) [13:33:10] 10SRE, 10ops-eqiad: Decommission old AQS cluster nodes - https://phabricator.wikimedia.org/T302277 (10EChetty) [13:35:01] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:843937|GrowthExperiments: Enable AddLink backend for 
bat_smg (T304549)]] (duration: 04m 50s) [13:35:17] (03PS1) 10Hashar: Remove motd plugin [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/843945 (https://phabricator.wikimedia.org/T321075) [13:36:10] (03CR) 10JMeybohm: [C: 04-1] "There are still some errors but maybe I have introduced those even before your change...not sure" [deployment-charts] - 10https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) (owner: 10Clément Goubert) [13:36:18] (03PS2) 10Hashar: Remove motd plugin [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/843945 (https://phabricator.wikimedia.org/T321075) [13:36:45] alright, that is the end of backports for now [13:37:01] Lucas_WMDE: should I log some message or just... vanish from this channel? :) [13:37:15] (03PS1) 10Hashar: Remove motd plugin and its config [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/843946 (https://phabricator.wikimedia.org/T321075) [13:37:18] you can !log something like UTC afternoon backport+config window done, if you want :) [13:37:41] (03CR) 10Chad: [C: 03+2] Remove motd plugin [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/843945 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [13:37:43] and then swoosh your cape and vanish ;) [13:37:59] kostajh: thx again! 
:) [13:38:07] (03PS1) 10Hashar: Remove motd plugin [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/843947 (https://phabricator.wikimedia.org/T321075) [13:38:35] !log UTC afternoon backport+config window done \o/ [13:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:44] \o/ [13:38:45] (03PS1) 10Hashar: Remove motd plugin and its config [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/843948 (https://phabricator.wikimedia.org/T321075) [13:38:53] thanks Lucas_WMDE [13:38:55] yw AndyRussG [13:38:56] ciao [13:39:03] thanks for doing the backport kostajh! [13:39:51] (03CR) 10Clément Goubert: Remove references to deprecated kubeyaml (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) (owner: 10Clément Goubert) [13:41:23] (03Abandoned) 10Dom Walden: Add IP address of deployment-cache-text07. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843943 (https://phabricator.wikimedia.org/T321072) (owner: 10Dom Walden) [13:42:00] (03CR) 10CI reject: [V: 04-1] Remove motd plugin [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/843947 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [13:42:14] (03PS1) 10Hashar: gerrit: remove etc/motd.config [puppet] - 10https://gerrit.wikimedia.org/r/843949 (https://phabricator.wikimedia.org/T321075) [13:42:30] (03CR) 10CI reject: [V: 04-1] Remove motd plugin [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/843945 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [13:43:24] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:43:52] heh, I might sneak in one more config patch if that is OK [13:44:41] (03PS1) 10Kosta Harlan: GrowthExperiments: Enable AddLink backend for be_x_oldwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/843950 (https://phabricator.wikimedia.org/T304549) [13:44:56] ^ Lucas_WMDE: any objections to that? [13:45:10] nope [13:45:16] (03CR) 10Hnowlan: "Thanks for this! This lgtm but I am not sure I am qualified to approve - I'll try to get a second opinion ." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/805476 (https://phabricator.wikimedia.org/T167420) (owner: 10TheDJ) [13:45:45] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843950 (https://phabricator.wikimedia.org/T304549) (owner: 10Kosta Harlan) [13:47:23] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843950 (https://phabricator.wikimedia.org/T304549) (owner: 10Kosta Harlan) [13:48:13] (03Merged) 10jenkins-bot: GrowthExperiments: Enable AddLink backend for be_x_oldwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843950 (https://phabricator.wikimedia.org/T304549) (owner: 10Kosta Harlan) [13:48:34] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:843950|GrowthExperiments: Enable AddLink backend for be_x_oldwiki (T304549)]] [13:48:39] T304549: Deploy "add a link" to 5th round of wikis - https://phabricator.wikimedia.org/T304549 [13:48:57] !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:843950|GrowthExperiments: Enable AddLink backend for be_x_oldwiki (T304549)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [13:49:17] (03PS1) 10Elukey: ml-services: update revscoring-editquality-goodfaith's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/843951 (https://phabricator.wikimedia.org/T320374) [13:53:30] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:843950|GrowthExperiments: Enable AddLink backend for be_x_oldwiki (T304549)]] (duration: 04m 56s) [13:54:04] (03CR) 10Chad: [C: 03+2] Remove 
motd plugin and its config [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/843946 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [13:54:24] (03Merged) 10jenkins-bot: Remove motd plugin and its config [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/843946 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [13:54:30] (03CR) 10Chad: [C: 03+2] Remove motd plugin [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/843947 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [13:54:36] (03CR) 10Chad: [C: 03+2] Remove motd plugin and its config [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/843948 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [13:54:49] (03CR) 10Chad: [C: 03+1] gerrit: remove etc/motd.config [puppet] - 10https://gerrit.wikimedia.org/r/843949 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [13:55:28] (03CR) 10Elukey: [C: 03+2] ml-services: update revscoring-editquality-goodfaith's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/843951 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey) [13:57:34] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:58:18] (03CR) 10CI reject: [V: 04-1] Remove motd plugin [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/843947 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [13:58:34] (03Merged) 10jenkins-bot: Remove motd plugin and its config [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/843948 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [13:59:33] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[14:00:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:03:12] this is my deployment sigh [14:05:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:05:59] (03PS3) 10Clément Goubert: Remove references to deprecated kubeyaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) [14:07:33] 10SRE, 10SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for hshaikh - https://phabricator.wikimedia.org/T321068 (10herron) p:05Triage→03Medium [14:10:55] (03PS1) 10Volans: doc: add directory for documentation [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/843953 [14:12:20] (03CR) 10Vgutierrez: [C: 03+1] "I wasn't aware of this, thanks for taking care @TheresNoTime" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843944 (https://phabricator.wikimedia.org/T321072) (owner: 10Samtar) [14:15:05] (03PS4) 10Clément Goubert: Remove references to deprecated kubeyaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) [14:16:16] (03CR) 10Volans: [V: 03+2 C: 03+2] "Adding the source code for the file published at:" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/843953 (owner: 10Volans) [14:17:19] 10SRE, 10SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for hshaikh - https://phabricator.wikimedia.org/T321068 (10herron) Thanks for the request 
@HShaikh, we'll just need a couple comments of approval added here to the task before proceeding. * @odimitrijevic @Ottomata could you please r... [14:19:18] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damilare Adedoyin - https://phabricator.wikimedia.org/T319057 (10herron) p:05Triage→03Medium [14:20:54] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damilare Adedoyin - https://phabricator.wikimedia.org/T319057 (10herron) Thanks @XenoRyet! @odimitrijevic @Ottomata could you please approve for analytics-privatedata-users? Thanks in advance! [14:22:44] (03CR) 10Clément Goubert: Remove references to deprecated kubeyaml (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) (owner: 10Clément Goubert) [14:23:21] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics for devnull - https://phabricator.wikimedia.org/T318104 (10herron) 05Open→03Stalled [14:23:38] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for appledora - https://phabricator.wikimedia.org/T321086 (10Isaac) [14:25:19] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for appledora - https://phabricator.wikimedia.org/T321086 (10Isaac) Hey SRE/Analytics/Legal -- we have a new contractor onboard: @Appledora. She needs access to HDFS and the stat machines for a new research project. Don't hesitate t... [14:26:19] (03PS1) 10Daniel Kinzler: Enable parsoid cache warming on testwiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/843955 (https://phabricator.wikimedia.org/T320535) [14:32:01] 10SRE, 10SRE-Access-Requests: Please add eigyan (essexigyan) to Restricted Group - https://phabricator.wikimedia.org/T318983 (10herron) a:03thcipriani [14:36:07] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for appledora - https://phabricator.wikimedia.org/T321086 (10herron) [14:38:55] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for appledora - https://phabricator.wikimedia.org/T321086 (10herron) p:05Triage→03Medium Hello! @KFrancis could you please confirm that we have an NDA on file for @Appledora? @odimitrijevic @Ottomata could you please approve f... [14:41:24] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (10herron) p:05Triage→03Medium [14:41:35] (03PS1) 10Xcollazo: Modify jupyterhub files to point to conda-analytics instead of anaconda-wmf. [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) [14:42:33] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10Jclark-ctr) opened ticket with Dell Create Service Request: Service Tag F1H2KQ3 [14:42:36] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10Jclark-ctr) a:03Jclark-ctr [14:44:07] 10SRE, 10SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for hshaikh - https://phabricator.wikimedia.org/T321068 (10Aklapper) @herron: Please note that the Phab account @lanebecker is linked to a [self-created SUL account](https://meta.wikimedia.org/wiki/Special:CentralAuth?target=Lanebecke... [14:50:07] (03CR) 10David Caro: "@Majavah I addressed your comments, do you mind doing a last review?" 
[puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) (owner: 10David Caro) [14:50:09] (03CR) 10AOkoth: [C: 03+2] vrts: fix download link [puppet] - 10https://gerrit.wikimedia.org/r/843579 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [14:50:29] (03CR) 10Ssingh: [C: 03+1] "As discussed in the Traffic meeting, let's merge this for now." [software/latency-measurement] - 10https://gerrit.wikimedia.org/r/833848 (https://phabricator.wikimedia.org/T315536) (owner: 10BCornwall) [14:53:53] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [14:56:36] 10SRE, 10ops-eqiad, 10DBA, 10Sustainability (Incident Followup): Check DIMM A6 on db1131 - https://phabricator.wikimedia.org/T320994 (10Jclark-ctr) @RLazarus Server is out of warranty When can you depool server and i can swap dimm with a server that was recently decom. [14:56:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:57:50] something bad happening? 
[14:58:04] (03PS1) 10Cparle: Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843960 (https://phabricator.wikimedia.org/T312235) [14:59:12] 10SRE, 10ops-eqiad, 10DBA, 10Sustainability (Incident Followup): Check DIMM A6 on db1131 - https://phabricator.wikimedia.org/T320994 (10Ladsgroup) I can shut it down and downtime it for you, when does it work (I don't want to leave it out of replication for too long) [14:59:42] (03CR) 10BCornwall: [V: 03+2 C: 03+2] Add latency measurement program [software/latency-measurement] - 10https://gerrit.wikimedia.org/r/833848 (https://phabricator.wikimedia.org/T315536) (owner: 10BCornwall) [15:00:04] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:00:26] (03CR) 10CI reject: [V: 04-1] Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843960 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [15:01:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:02:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.021 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:02:14] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damilare Adedoyin - https://phabricator.wikimedia.org/T319057 (10Ottomata) > Reason for access: Access to full superset information, especially for the banner bump investigation If you only need access to data via the Superset G... 
[15:02:50] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for appledora - https://phabricator.wikimedia.org/T321086 (10Ottomata) Approved! [15:03:40] (03PS1) 10Kosta Harlan: labs: Enable GrowthExperiments new impact module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843963 (https://phabricator.wikimedia.org/T313393) [15:03:48] (03CR) 10Vgutierrez: [C: 03+2] reverse-proxy-staging: Update -cache-text07/-cache-upload07 IPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843944 (https://phabricator.wikimedia.org/T321072) (owner: 10Samtar) [15:03:48] 10SRE, 10SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for hshaikh - https://phabricator.wikimedia.org/T321068 (10Ottomata) Approved. [15:04:29] (03Merged) 10jenkins-bot: reverse-proxy-staging: Update -cache-text07/-cache-upload07 IPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843944 (https://phabricator.wikimedia.org/T321072) (owner: 10Samtar) [15:04:33] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [15:07:13] 10SRE, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10netops: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (10cmooney) I had a good chat with @aborrero today on some ideas on how to progress towards this goal. Some notes / additional thoughts... 
[15:09:23] (03CR) 10Klausman: [C: 03+1] Add a new production images for spark and spark-operator (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [15:10:47] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10VirginiaPoundstone) a:03hnowlan [15:13:11] 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10Ladsgroup) >>! In T315486#8324480, @BTullis wrote: > Thanks @Ladsgroup - noted. So it seems that we have five topic or te... [15:13:46] 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10Ladsgroup) Another thing. Mailman2 had many many issues but mailman3 (the current infra) is much easier to use and handle. [15:14:01] 10SRE, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10netops: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (10aborrero) >>! In T314847#8325727, @cmooney wrote: > I had a good chat with @aborrero today on some ideas on how to progress towards th... [15:14:38] (03CR) 10JMeybohm: [C: 03+1] "I'd say this is good to go. Nice work!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [15:14:56] (03PS5) 10Btullis: Add dse-k8s-worker as a permitted alias for the reboot-nodes cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/840186 (https://phabricator.wikimedia.org/T310196) [15:15:25] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [15:15:25] 10SRE, 10ops-eqiad: Decommission old AQS cluster nodes - https://phabricator.wikimedia.org/T302277 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [15:16:00] 10SRE, 10ops-eqiad: Decommission old AQS cluster nodes - https://phabricator.wikimedia.org/T302277 (10Jclark-ctr) 05Open→03Resolved completed steps for decom process [15:16:51] 10SRE, 10ops-eqiad, 10DBA, 10Sustainability (Incident Followup): Check DIMM A6 on db1131 - https://phabricator.wikimedia.org/T320994 (10Jclark-ctr) a:03Jclark-ctr I am available now if you are [15:17:22] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the fixes and improvements!" [cookbooks] - 10https://gerrit.wikimedia.org/r/840186 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [15:17:38] 10SRE, 10ops-eqiad, 10DBA, 10Sustainability (Incident Followup): Check DIMM A6 on db1131 - https://phabricator.wikimedia.org/T320994 (10Ladsgroup) Sure, give me half an hour to shut it down properly. 
[15:20:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: DIMM replacement [15:20:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: DIMM replacement [15:22:48] 10SRE, 10ops-eqiad: eqaid: duplicate serial: - https://phabricator.wikimedia.org/T320772 (10Jclark-ctr) a:03Jclark-ctr [15:23:56] (03CR) 10Btullis: [C: 03+2] Add dse-k8s-worker as a permitted alias for the reboot-nodes cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/840186 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [15:26:26] 10SRE, 10ops-eqiad: eqaid: duplicate serial: - https://phabricator.wikimedia.org/T320772 (10Jclark-ctr) 05Open→03Resolved corrected duplicate serials [15:26:28] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add a new production images for spark and spark-operator (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [15:28:14] (03Merged) 10jenkins-bot: Add dse-k8s-worker as a permitted alias for the reboot-nodes cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/840186 (https://phabricator.wikimedia.org/T310196) (owner: 10Btullis) [15:30:22] (03CR) 10Ladsgroup: [C: 03+1] "Generally looks good to me but let's wait until next week for Manuel to come back in case it would need more work beside merging the patch" [puppet] - 10https://gerrit.wikimedia.org/r/831542 (https://phabricator.wikimedia.org/T317534) (owner: 10Gergő Tisza) [15:30:48] PROBLEM - IPMI Sensor Status on restbase1018 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:31:19] (03PS1) 10Jdlrobson: Move icons to dedicated folder [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/843965 [15:32:06] (03CR) 10CI reject: [V: 04-1] Move icons to dedicated folder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843965 (owner: 10Jdlrobson) [15:32:25] (03PS2) 10Jdlrobson: Move icons to dedicated folder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843965 [15:33:07] (03CR) 10CI reject: [V: 04-1] Move icons to dedicated folder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843965 (owner: 10Jdlrobson) [15:33:39] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:35:40] (03CR) 10Sergio Gimeno: [C: 03+1] labs: Enable GrowthExperiments new impact module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843963 (https://phabricator.wikimedia.org/T313393) (owner: 10Kosta Harlan) [15:36:39] 10SRE, 10ops-eqiad, 10DBA, 10Sustainability (Incident Followup): Check DIMM A6 on db1131 - https://phabricator.wikimedia.org/T320994 (10Jclark-ctr) Replaced DIMM A6 with recently Decom host [15:39:33] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): PXE boot failure on cloudvirt1023 - https://phabricator.wikimedia.org/T319042 (10Jclark-ctr) @papaul were you able to make any progress with @ayounsi? [15:41:03] (03CR) 10Filippo Giunchedi: systemd: drop timer-specific alert in favor of generic alert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/841924 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [15:41:15] (03CR) 10Ssingh: [C: 03+1] "Same as the other related commit, +1 as per the Traffic meeting discussion." [software/latency-measurement] - 10https://gerrit.wikimedia.org/r/833855 (owner: 10BCornwall) [15:41:24] (03CR) 10Ssingh: [C: 03+1] "Same as the other related commit, +1 as per the Traffic meeting discussion."
[software/latency-measurement] - 10https://gerrit.wikimedia.org/r/833851 (owner: 10BCornwall) [15:41:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Jclark-ctr) 05Open→03Resolved Completed lvs connection moves [15:42:21] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for appledora - https://phabricator.wikimedia.org/T321086 (10herron) [15:47:30] (03CR) 10Hashar: "recheck" [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/843947 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [15:47:44] (03PS2) 10Cwhite: hiera: upgrade codfw to opensearch v2 [puppet] - 10https://gerrit.wikimedia.org/r/828110 (https://phabricator.wikimedia.org/T304440) [15:48:14] 10SRE, 10serviceops: service implementation tracking: arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319429 (10LSobanski) a:05Dzahn→03None Doesn't look like collab, unassigning from Daniel. [15:48:58] 10SRE, 10serviceops: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (10LSobanski) a:05Dzahn→03None Doesn't look like collab, unassigning from Daniel. 
[15:49:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10LSobanski) [15:49:58] !log hashar@deploy1002 Started deploy [gerrit/gerrit@da5de16]: gerrit2002: remove motd plugin and its config # T321075 [15:50:03] T321075: Remove Gerrit motd plugin - https://phabricator.wikimedia.org/T321075 [15:50:08] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@da5de16]: gerrit2002: remove motd plugin and its config # T321075 (duration: 00m 10s) [15:51:22] !log hashar@deploy1002 Started deploy [gerrit/gerrit@da5de16]: gerrit1001: remove motd plugin and its config # T321075 [15:51:30] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@da5de16]: gerrit1001: remove motd plugin and its config # T321075 (duration: 00m 08s) [15:53:41] 10SRE, 10ops-eqiad, 10DBA, 10Sustainability (Incident Followup): Check DIMM A6 on db1131 - https://phabricator.wikimedia.org/T320994 (10Jclark-ctr) 05Open→03Resolved [15:53:46] 10SRE, 10SRE-OnFire, 10Data-Persistence, 10Wikimedia-Incident: s6 master failure - https://phabricator.wikimedia.org/T320990 (10Jclark-ctr) [15:54:18] (03PS3) 10Jdlrobson: Move icons to dedicated folder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843965 [15:54:20] (03PS1) 10Jdlrobson: Standardize wordmark names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843969 [15:54:49] the stupid me has to restart Gerrit :( [15:54:58] (03CR) 10CI reject: [V: 04-1] Move icons to dedicated folder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843965 (owner: 10Jdlrobson) [15:55:25] !log Stopping Gerrit due to a mistake in deploying plugin (forgot to reinstall the builtin plugins) [15:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:59] * bd808 sees that gerrit is being restarted [15:57:09] (03CR) 10Clément Goubert: "This runs without errors on my VM, fixing a few errors that were present on master at the same time. 
Adding joe for confirmation approach " [deployment-charts] - 10https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) (owner: 10Clément Goubert) [15:58:06] 10SRE, 10SRE-OnFire, 10Data-Persistence, 10Wikimedia-Incident: s6 master failure - https://phabricator.wikimedia.org/T320990 (10Ladsgroup) After DIMM replacement, it looks good now. I will slowly start repooling it now. [15:59:04] (03CR) 10Hashar: [C: 03+2] Remove motd plugin [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/843945 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [15:59:32] (03CR) 10Hashar: [C: 03+2] Remove motd plugin [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/843947 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [16:00:05] jbond and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221018T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:40] PROBLEM - confd service on sretest1002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.139: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:00:42] PROBLEM - Confd template for /etc/ferm/conf.d/00_defs_requestctl on sretest1002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.139: Connection reset by peer https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:00:46] (03CR) 10Hashar: "I have deployed the Gerrit change which has removed /srv/deployment/gerrit/gerrit/etc/motd.config" [puppet] - 10https://gerrit.wikimedia.org/r/843949 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [16:00:50] PROBLEM - Check for large files in client bucket on sretest1002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.139: Connection reset by peer https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [16:02:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P35547 and previous config saved to /var/cache/conftool/dbconfig/20221018-160209-ladsgroup.json [16:03:23] (03CR) 10CI reject: [V: 04-1] Remove motd plugin [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/843945 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [16:06:58] RECOVERY - confd service on sretest1002 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:07:08] RECOVERY - Check for large files in client bucket on sretest1002 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [16:07:26] (03PS1) 10Cwhite: opensearch: allow gc_log_flags reuse [puppet] - 10https://gerrit.wikimedia.org/r/844006 (https://phabricator.wikimedia.org/T304440) [16:08:21] (03PS2) 10Jdlrobson: Standardize wordmark names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843969 [16:09:22] (03CR) 
10CI reject: [V: 04-1] Standardize wordmark names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843969 (owner: 10Jdlrobson) [16:09:36] (03CR) 10Cwhite: [C: 03+2] "PCC NOOP: https://puppet-compiler.wmflabs.org/pcc-worker1003/37593/" [puppet] - 10https://gerrit.wikimedia.org/r/844006 (https://phabricator.wikimedia.org/T304440) (owner: 10Cwhite) [16:10:18] (03PS3) 10Cwhite: hiera: upgrade codfw to opensearch v2 [puppet] - 10https://gerrit.wikimedia.org/r/828110 (https://phabricator.wikimedia.org/T304440) [16:11:14] (03PS4) 10Jdlrobson: Move icons to dedicated folder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843965 [16:11:17] (03PS3) 10Jdlrobson: Standardize wordmark names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843969 [16:11:36] (03Merged) 10jenkins-bot: Remove motd plugin [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/843947 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [16:12:38] (03PS1) 10Cwhite: beta-logs: enable gc_log on collector nodes to match production [puppet] - 10https://gerrit.wikimedia.org/r/844007 (https://phabricator.wikimedia.org/T304440) [16:12:40] (03CR) 10CI reject: [V: 04-1] Move icons to dedicated folder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843965 (owner: 10Jdlrobson) [16:12:42] (03CR) 10CI reject: [V: 04-1] Standardize wordmark names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843969 (owner: 10Jdlrobson) [16:15:43] (03CR) 10Hashar: [C: 03+2] "There is something sketchy with Maven Central, I will have to investigate. 
Maybe there is a rate limit of some sort 😞" [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/843945 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [16:16:45] (03PS5) 10Jdlrobson: Move icons to dedicated folder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843965 [16:16:47] (03PS4) 10Jdlrobson: Standardize wordmark names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843969 [16:16:49] (03PS1) 10Jdlrobson: Logos: yaml format change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843995 [16:17:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P35548 and previous config saved to /var/cache/conftool/dbconfig/20221018-161714-ladsgroup.json [16:20:16] 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T320384 (10KFrancis) Hi all, I am confirming the NDA has been signed. Please proceed with the access request. Thanks! [16:21:03] (03CR) 10Cwhite: [C: 03+2] "PCC OK: https://puppet-compiler.wmflabs.org/pcc-worker1001/37594/" [puppet] - 10https://gerrit.wikimedia.org/r/828110 (https://phabricator.wikimedia.org/T304440) (owner: 10Cwhite) [16:28:48] (03CR) 10CI reject: [V: 04-1] Remove motd plugin [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/843945 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [16:29:16] (03CR) 10Hashar: "grbmbl, I guess I will try again tomorrow." 
[software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/843945 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [16:29:32] <_joe_> !isspull [16:29:32] Error: This command can only be used in #wikimedia-ops [16:29:39] <_joe_> !issync [16:29:39] Syncing #wikimedia-operations (requested by joe_oblivian) [16:29:41] Set /cs flags #wikimedia-operations sirenbot +o [16:32:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P35549 and previous config saved to /var/cache/conftool/dbconfig/20221018-163219-ladsgroup.json [16:35:03] (03PS2) 10Cparle: Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843960 (https://phabricator.wikimedia.org/T312235) [16:35:08] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 5650 [16:35:30] PROBLEM - OpenSearch health check for shards on 9200 on logstash2025 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f6f8b6e59b0: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [16:35:30] org/wiki/Search%23Administration [16:36:36] RECOVERY - OpenSearch health check for shards on 9200 on logstash2025 is OK: OK - elasticsearch status production-elk7-codfw: cluster_name: production-elk7-codfw, status: green, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 632, active_shards: 1370, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [16:36:36] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 
https://wikitech.wikimedia.org/wiki/Search%23Administration [16:36:40] (03CR) 10jenkins-bot: Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843960 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [16:37:06] 10SRE, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10netops: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (10taavi) > Probably makes sense to choose a /16 from 172.16.0.0/12 for the supernet, and allocate per-rack /24s from this. Please keep i... [16:38:05] (03PS1) 10Cparle: Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) [16:38:22] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 5650 [16:39:22] (03Abandoned) 10Cparle: Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843960 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [16:40:11] (03CR) 10CI reject: [V: 04-1] Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [16:41:08] (03CR) 10Dzahn: [C: 03+1] vrts: fix download link [puppet] - 10https://gerrit.wikimedia.org/r/843579 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [16:41:29] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for appledora - https://phabricator.wikimedia.org/T321086 (10KFrancis) @herron I am confirming an NDA is on file. Please proceed with the access request. Thanks! 
[16:42:22] (03CR) 10Dzahn: [C: 03+2] gerrit: remove etc/motd.config [puppet] - 10https://gerrit.wikimedia.org/r/843949 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [16:43:41] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [16:44:48] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:44:53] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:45:34] (03CR) 10Dzahn: [C: 03+2] "@hashar - done - I already ran "unlink" (not rm) on 2 prod hosts:" [puppet] - 10https://gerrit.wikimedia.org/r/843949 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [16:45:48] mutante: awesome thank you ;) [16:46:16] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:46:26] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:46:27] hashar: np, please just do the devtools part [16:46:37] that saves me having to load different keys [16:47:59] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Jclark-ctr) @ayounsi these servers were removed from the racks; Dell had sent the wrong configuration. New servers were installed and took those names after removing informat...
[16:54:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [16:54:42] mutante: oops, I forgot about devtools. Doing it now [16:54:46] (03PS25) 10Herron: prometheus: enable prometheus web access via proxy with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) [16:55:40] hashar: thanks [16:57:27] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: admin: Add validation checks for missing realname and email in data.yaml - https://phabricator.wikimedia.org/T320937 (10Dzahn) [17:00:30] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: admin: Add validation checks for missing realname and email in data.yaml - https://phabricator.wikimedia.org/T320937 (10Dzahn) The entry for mvolz has another issue. It exists both in the shell admin and the ldap_only section. Users should not exist in... 
[17:01:35] (03PS1) 10Dzahn: admin: fix duplicate entry for mvolz [puppet] - 10https://gerrit.wikimedia.org/r/843998 (https://phabricator.wikimedia.org/T320937) [17:02:00] (03PS2) 10Dzahn: admin: fix duplicate entry for mvolz [puppet] - 10https://gerrit.wikimedia.org/r/843998 (https://phabricator.wikimedia.org/T320937) [17:07:08] (03CR) 10Urbanecm: [C: 03+1] labs: Enable GrowthExperiments new impact module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843963 (https://phabricator.wikimedia.org/T313393) (owner: 10Kosta Harlan) [17:08:13] (03PS1) 10Dzahn: admin: add missing realname field for eugene_chernov [puppet] - 10https://gerrit.wikimedia.org/r/843999 (https://phabricator.wikimedia.org/T320937) [17:11:22] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for appledora - https://phabricator.wikimedia.org/T321086 (10herron) [17:13:55] (03CR) 10Legoktm: [C: 03+1] mysql: new image for mysql backups (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/842993 (https://phabricator.wikimedia.org/T254636) (owner: 10BryanDavis) [17:14:57] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (10KFrancis) @herron I am confirming an NDA is on file. Please proceed with the access request. Thanks! [17:19:02] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (10herron) [17:27:06] 10SRE, 10Wikimedia-Mailing-lists: Reassign owner of wikibaseug mailing list - https://phabricator.wikimedia.org/T321090 (10Ladsgroup) a:03Ladsgroup If @Masssly is okay, I can do it. Just give me your email address. you can email it to me if you don't want to disclose it publicly. 
[17:33:10] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/844009 [17:35:36] (03PS1) 10Jdlrobson: i18n: Fix typo and simplify preference description [skins/Vector] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/844027 [17:35:46] (03PS2) 10Jdlrobson: i18n: Fix typo and simplify preference description [skins/Vector] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/844027 (https://phabricator.wikimedia.org/T321038) [17:41:51] (03PS2) 10Urbanecm: labs: Enable GrowthExperiments new impact module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843963 (https://phabricator.wikimedia.org/T313393) (owner: 10Kosta Harlan) [17:41:53] jouncebot: nowandnext [17:41:53] No deployments scheduled for the next 0 hour(s) and 18 minute(s) [17:41:54] In 0 hour(s) and 18 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221018T1800) [17:41:57] (03CR) 10Urbanecm: [C: 03+2] labs: Enable GrowthExperiments new impact module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843963 (https://phabricator.wikimedia.org/T313393) (owner: 10Kosta Harlan) [17:42:11] (03PS3) 10Kosta Harlan: labs: Allow usage of GrowthExperiments NewImpact module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843963 (https://phabricator.wikimedia.org/T313393) [17:44:53] hi vgutierrez, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/843944 showed up as an undeployed commit at deploy1002. it looks like a beta-only change, can you confirm that please? 
[17:45:36] urbanecm: indeed, that targets the beta cluster [17:45:58] ack, pulled it to deploy1002 then :) [17:46:06] thx [17:46:12] np [17:46:38] (03CR) 10Urbanecm: labs: Allow usage of GrowthExperiments NewImpact module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843963 (https://phabricator.wikimedia.org/T313393) (owner: 10Kosta Harlan) [17:46:41] (03CR) 10Urbanecm: [C: 03+2] labs: Allow usage of GrowthExperiments NewImpact module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843963 (https://phabricator.wikimedia.org/T313393) (owner: 10Kosta Harlan) [17:47:23] (03Merged) 10jenkins-bot: labs: Allow usage of GrowthExperiments NewImpact module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843963 (https://phabricator.wikimedia.org/T313393) (owner: 10Kosta Harlan) [17:51:54] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:00:04] hashar and dduvall: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221018T1800). [18:05:05] (03PS1) 10Kosta Harlan: labs: Beta wikis to use NewImpact module by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844046 (https://phabricator.wikimedia.org/T311299) [18:10:46] 10SRE, 10Cloud Services Proposals, 10Infrastructure-Foundations, 10netops: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (10cmooney) >> /32 Service IPs should be from the cloud realm public /24 (185.15.56.0/24) if the service needs to be reachable from inter... 
[18:20:46] RECOVERY - SSH on wdqs2005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:21:01] 10SRE, 10SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for hshaikh - https://phabricator.wikimedia.org/T321068 (10Dzahn) >>! In T321068#8325564, @Aklapper wrote: > @lanebecker is linked to a [self-created SUL account](https://meta.wikimedia.org/wiki/Special:CentralAuth?target=Lanebecker)... [18:27:32] 10SRE, 10SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for hshaikh - https://phabricator.wikimedia.org/T321068 (10Dzahn) This is concerning. I opened a ticket with ITS about this and CCed you Andre. [18:28:13] 10SRE, 10Infrastructure-Foundations, 10netops: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 (10cmooney) FWIW I didn't get to the bottom of the MTU difference. But I was able to confirm that it is a real issue, i.e. there is a 4-byte "blackhole" where the switches will transmit packets wit... [18:35:31] (03PS1) 10Dzahn: phabricator: temp add other phab hosts to dump client hosts [puppet] - 10https://gerrit.wikimedia.org/r/844048 (https://phabricator.wikimedia.org/T313360) [18:36:34] (03CR) 10Dzahn: [C: 03+2] phabricator: temp add other phab hosts to dump client hosts [puppet] - 10https://gerrit.wikimedia.org/r/844048 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [18:36:42] 10SRE, 10SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for hshaikh - https://phabricator.wikimedia.org/T321068 (10RhinosF1) https://meta.wikimedia.org/w/index.php?title=Special:Log&logid=36701617 - the currently linked account doesn't seem to be self created. [18:37:08] mutante: pm? 
[18:45:42] (03CR) 10CDanis: [C: 03+2] Re-introduce newconnrate [puppet] - 10https://gerrit.wikimedia.org/r/842539 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [18:47:57] 10SRE, 10SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for hshaikh - https://phabricator.wikimedia.org/T321068 (10RhinosF1) @lanebecker seems to have been updated since @Aklapper's comment and is now pointing at the WMF SUL account which was created by ITS. [18:48:30] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:49:32] 10SRE, 10SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for hshaikh - https://phabricator.wikimedia.org/T321068 (10Dzahn) Thanks @RhinosF1 please ignore my previous comments. I have also told ITS to delete the ticket I opened. [18:51:34] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:52:14] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/844010 [18:52:30] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:52:54] RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:55:52] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:57:48] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:58:40] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:58:48] !log rsyncing phab dump file - pull from phab1000 to all other hosts T313360 [18:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:54] T313360: Setup rsync for phab data on disk - https://phabricator.wikimedia.org/T313360 [18:59:42] 10SRE, 10Wikimedia-Mailing-lists: Reassign owner of wikibaseug mailing list - https://phabricator.wikimedia.org/T321090 (10GreenReaper) Thank you. My address is included in the original post at the end of the first paragraph. It's available elsewhere, and gets enough spam as it is, so I'm not that concerned ab... [18:59:50] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.247 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:02:31] (03CR) 10BCornwall: [V: 03+2 C: 03+2] "Thanks!" [software/latency-measurement] - 10https://gerrit.wikimedia.org/r/833851 (owner: 10BCornwall) [19:02:43] (03CR) 10BCornwall: [V: 03+2 C: 03+2] "Thanks!" [software/latency-measurement] - 10https://gerrit.wikimedia.org/r/833855 (owner: 10BCornwall) [19:09:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (10RobH) [19:10:00] 10SRE, 10Wikimedia-Mailing-lists: Reassign owner of wikibaseug mailing list - https://phabricator.wikimedia.org/T321090 (10Masssly) >>! In T321090#8326324, @Ladsgroup wrote: > If @Masssly is okay, I can do it. Just give me your email address. you can email it to me if you don't want to disclose it publicly. Y... 
[19:10:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (10RobH) [19:17:06] (03CR) 10Hashar: [C: 03+2] Remove motd plugin [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/843945 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [19:18:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q2:rack/setup/install dbprov1004 - https://phabricator.wikimedia.org/T321122 (10RobH) [19:18:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q2:rack/setup/install dbprov1004 - https://phabricator.wikimedia.org/T321122 (10RobH) [19:19:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q2:rack/setup/install dbprov1004 - https://phabricator.wikimedia.org/T321122 (10RobH) [19:21:20] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10wiki_willy) Got it, thanks @Volans! I'll sync up with my team to get their thoughts and feedback on Thursday, and get back to you afterwards. [19:22:13] 10SRE, 10SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for hshaikh - https://phabricator.wikimedia.org/T321068 (10lanebecker) Yeah, sorry folks. Not entirely sure how I managed to connect to my personal account, but it has been fixed. Approved! 
[19:24:51] (03Merged) 10jenkins-bot: Remove motd plugin [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/843945 (https://phabricator.wikimedia.org/T321075) (owner: 10Hashar) [19:28:36] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (10RobH) [19:29:00] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (10RobH) [19:43:21] 10SRE, 10SRE-Access-Requests: Requesting access to ssh for jupyter notebooks for hshaikh - https://phabricator.wikimedia.org/T321068 (10herron) [19:48:20] (03PS1) 10Herron: admin: add ssh key for hshaikh [puppet] - 10https://gerrit.wikimedia.org/r/844053 (https://phabricator.wikimedia.org/T321068) [19:54:17] (03PS1) 10Stang: Fix broken wordmarks in Bengali projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844054 (https://phabricator.wikimedia.org/T321124) [19:56:36] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damilare Adedoyin - https://phabricator.wikimedia.org/T319057 (10herron) [19:58:04] (03PS1) 10Stang: tumwiki: Update project logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844055 (https://phabricator.wikimedia.org/T320473) [19:58:34] (03PS6) 10Samtar: Move icons to dedicated folder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843965 (owner: 10Jdlrobson) [19:59:39] (03PS5) 10Samtar: Standardize wordmark names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843969 (owner: 10Jdlrobson) [20:00:04] RoanKattouw, Urbanecm, cjming, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221018T2000). [20:00:04] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:12] * TheresNoTime can deploy :D [20:00:21] present [20:00:45] hi Jdlrobson :) going to start with those two logo changes [20:01:02] * urbanecm waves to TheresNoTime [20:01:22] ^^ [20:01:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843965 (owner: 10Jdlrobson) [20:01:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843969 (owner: 10Jdlrobson) [20:02:08] (03PS1) 10Herron: admin: add damilare to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/844056 (https://phabricator.wikimedia.org/T319057) [20:02:22] (03Merged) 10jenkins-bot: Move icons to dedicated folder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843965 (owner: 10Jdlrobson) [20:02:26] (03Merged) 10jenkins-bot: Standardize wordmark names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843969 (owner: 10Jdlrobson) [20:02:44] 10SRE, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10cmooney) Myself and @ayounsi were able to narrow down the issue a bit more during testing yesterday. It seems the iss... 
[20:02:54] !log samtar@deploy1002 Started scap: Backport for [[gerrit:843965|Move icons to dedicated folder]], [[gerrit:843969|Standardize wordmark names]] [20:03:20] (03PS1) 10Dzahn: phabricator: create /srv/homes and allow rsyncing it [puppet] - 10https://gerrit.wikimedia.org/r/844057 (https://phabricator.wikimedia.org/T313360) [20:03:21] !log samtar@deploy1002 samtar and jdlrobson: Backport for [[gerrit:843965|Move icons to dedicated folder]], [[gerrit:843969|Standardize wordmark names]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:03:33] Jdlrobson: could you double-check no icons have broken etc? [20:04:54] will do [20:05:09] is it on a debug server? [20:05:35] Jdlrobson: yes sorry, mwdebug1001 etc :) [20:09:16] (03PS2) 10Dzahn: phabricator: create /srv/homes and allow rsyncing it [puppet] - 10https://gerrit.wikimedia.org/r/844057 (https://phabricator.wikimedia.org/T313360) [20:09:35] TheresNoTime: this lgtm [20:09:44] merging [20:09:55] s/merging/syncing [20:12:30] (03PS3) 10Ryan Kemper: Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/835596 (https://phabricator.wikimedia.org/T222349) [20:12:35] (03PS4) 10Ryan Kemper: Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/835596 (https://phabricator.wikimedia.org/T222349) [20:12:41] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/835596 (https://phabricator.wikimedia.org/T222349) (owner: 10Ryan Kemper) [20:14:01] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:843965|Move icons to dedicated folder]], [[gerrit:843969|Standardize wordmark names]] (duration: 11m 07s) [20:14:21] live :) moving on to 844027 now [20:14:35] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/844027 
(https://phabricator.wikimedia.org/T321038) (owner: 10Jdlrobson) [20:15:03] (03PS5) 10Ryan Kemper: Mount labstore to wcqs/wdqs instance for dumps reload [puppet] - 10https://gerrit.wikimedia.org/r/835596 (https://phabricator.wikimedia.org/T222349) [20:17:37] Jdlrobson: enwiki seems to be missing the globe logo on vector 2022, possibly related to your recent changes? [20:19:29] (ack ^) [20:19:48] taavi: you're quicker, just was writing that... [20:20:15] looks like this with timeless https://usercontent.irccloud-cdn.com/file/MS96uBiA/image.png [20:20:46] my fault, I've not purged [20:21:07] just did `echo 'https://en.wikipedia.org/static/images/icons/wikipedia.png' | mwscript purgeList.php`, seems to be there now? [20:21:09] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/37596/phab1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/844057 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [20:21:20] seems better now [20:21:26] can you purge all the other URLs too please? :) [20:21:40] yes, is there an easier way of doing that..? [20:21:47] write a for cycle [20:22:10] 10SRE, 10Wikimedia-Mailing-lists: Reassign owner of wikibaseug mailing list - https://phabricator.wikimedia.org/T321090 (10Ladsgroup) 05Open→03Resolved I added you and Lorenza as owners. Have fun. [20:22:26] TheresNoTime: something like `for filename in wikipedia.png foo.png bar.png; do echo "https://en.wikipedia.org/static/images/icons/$filename" | mwscript purgeList.php; done` [20:22:29] (03CR) 10Aftab: "Thanks for patch 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844054 (https://phabricator.wikimedia.org/T321124) (owner: 10Stang) [20:22:56] urbanecm: ta! 
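A variant of the purge loop quoted above, as a minimal sketch: it reads paths from stdin and prefixes a base URL with sed. The function name `build_purge_urls`, the `BASE_URL` variable, and the `foo.png` path are hypothetical; only `mwscript purgeList.php` and the `wikipedia.png` URL come from the channel.

```shell
#!/bin/sh
# Hypothetical helper (not from the channel): turn URL paths on stdin
# into full URLs by prepending a base URL.
# BASE_URL is an assumed override; the default mirrors the host used above.
build_purge_urls() {
    base="${BASE_URL:-https://en.wikipedia.org}"
    # Drop blank lines, then prepend the base URL to every remaining path.
    sed -e '/^[[:space:]]*$/d' -e "s|^|${base}|"
}

# Example input: wikipedia.png is from the channel, foo.png is a placeholder.
printf '%s\n' \
    /static/images/icons/wikipedia.png \
    /static/images/icons/foo.png | build_purge_urls

# To actually purge, the output would be piped on, as was done in the channel:
#   ... | build_purge_urls | mwscript purgeList.php
```

The sketch only prints the URLs, so the list can be reviewed before anything is purged.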
[20:22:59] or copy all the URLs from gerrit, put them to a txtfile, run a sed to prepend with https://en.wikipedia.org, and feed them all to purgeList.php [20:23:14] * TheresNoTime was thinking more along the "how to get from gerrit" lines, yeah [20:23:48] do we need to revert? [20:24:01] i think the purge was sufficient [20:24:06] taavi: what do you think? [20:24:12] cool :) (and phew!) [20:24:16] works for me now [20:26:52] 10SRE, 10Wikimedia-Mailing-lists: Reassign owner of wikibaseug mailing list - https://phabricator.wikimedia.org/T321090 (10GreenReaper) As much as mailing list administration is fun. :-) Thanks a lot for the prompt assistance! [20:27:45] everything purged now [20:29:20] (03Merged) 10jenkins-bot: i18n: Fix typo and simplify preference description [skins/Vector] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/844027 (https://phabricator.wikimedia.org/T321038) (owner: 10Jdlrobson) [20:29:30] 10SRE, 10Traffic: Create program to interact with Atlas RIPE API - https://phabricator.wikimedia.org/T315536 (10BCornwall) 05Open→03Resolved [20:29:46] !log samtar@deploy1002 Started scap: Backport for [[gerrit:844027|i18n: Fix typo and simplify preference description (T321038)]] [20:29:51] (03PS3) 10Vlad.shapik: WIP: Provide additional tests to cover errors and exceptions [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/841183 (https://phabricator.wikimedia.org/T318406) [20:29:51] T321038: [[MediaWiki:Prefs-help-skin-limited-width/en]] (typo) replace "experinence" by "experience" - https://phabricator.wikimedia.org/T321038 [20:30:18] (03PS1) 10Stang: arwiki: Fix editeditorprotected restriction level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844060 (https://phabricator.wikimedia.org/T321111) [20:31:59] (03PS1) 10CDanis: Fix newconnrate, as haproxy rates aren't normalized [puppet] - 10https://gerrit.wikimedia.org/r/844061 (https://phabricator.wikimedia.org/T306580) [20:33:17] (03CR) 10CDanis: [C: 03+2] Fix newconnrate, as haproxy 
rates aren't normalized [puppet] - 10https://gerrit.wikimedia.org/r/844061 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [20:34:26] !log samtar@deploy1002 samtar and jdlrobson: Backport for [[gerrit:844027|i18n: Fix typo and simplify preference description (T321038)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [20:34:32] Jdlrobson: live on mwdebug, did you want to check that typo fix is okay now, or should I just sync? [20:36:03] did it run a full scap? otherwise an i18n change won't make any difference... [20:36:09] (03PS4) 10Vlad.shapik: WIP: Provide additional tests to cover errors and exceptions [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/841183 (https://phabricator.wikimedia.org/T318406) [20:36:25] checking [20:36:59] Reedy: hm, not following? It's rebuilt the languages if that's what you mean? [20:37:22] `scap backport` runs a full scap underneath [20:37:29] LGTM [20:37:38] (syncing) [20:37:47] (you can see it on https://www.mediawiki.org/wiki/Special:Preferences#mw-prefsection-rendering) [20:38:54] (03PS1) 10BryanDavis: striker: Bump container version to 2022-10-18-161910-production [puppet] - 10https://gerrit.wikimedia.org/r/844063 (https://phabricator.wikimedia.org/T316991) [20:39:30] (03PS5) 10Vlad.shapik: WIP: Provide additional tests to cover errors and exceptions [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/841183 (https://phabricator.wikimedia.org/T318406) [20:42:01] (03PS6) 10Vlad.shapik: WIP: Provide additional tests to cover errors and exceptions [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/841183 (https://phabricator.wikimedia.org/T318406) [20:43:44] (sync ongoing, but scap "feels slow") [20:43:55] *slower than normal [20:43:57] (03CR) 10Stang: Fix broken wordmarks in Bengali projects (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844054 
(https://phabricator.wikimedia.org/T321124) (owner: 10Stang) [20:45:27] TheresNoTime: I'm seeing the message update now :) [20:45:37] hi TheresNoTime, is there still time left for a few more patches? [20:45:53] koi: sure :) [20:46:04] can you add them to the calendar? [20:46:17] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:844027|i18n: Fix typo and simplify preference description (T321038)]] (duration: 16m 31s) [20:46:22] T321038: [[MediaWiki:Prefs-help-skin-limited-width/en]] (typo) replace "experinence" by "experience" - https://phabricator.wikimedia.org/T321038 [20:46:42] added [20:46:50] (03PS2) 10Samtar: tumwiki: Update project logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844055 (https://phabricator.wikimedia.org/T320473) (owner: 10Stang) [20:47:30] (03CR) 10Jdlrobson: [C: 03+1] tumwiki: Update project logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844055 (https://phabricator.wikimedia.org/T320473) (owner: 10Stang) [20:47:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844055 (https://phabricator.wikimedia.org/T320473) (owner: 10Stang) [20:48:00] (03CR) 10Jdlrobson: [C: 03+1] Fix broken wordmarks in Bengali projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844054 (https://phabricator.wikimedia.org/T321124) (owner: 10Stang) [20:48:27] (03Merged) 10jenkins-bot: tumwiki: Update project logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844055 (https://phabricator.wikimedia.org/T320473) (owner: 10Stang) [20:48:50] !log samtar@deploy1002 Started scap: Backport for [[gerrit:844055|tumwiki: Update project logo (T320473)]] [20:48:55] T320473: Requesting permanent logo change for tum.wikipedia.org - https://phabricator.wikimedia.org/T320473 [20:49:19] !log samtar@deploy1002 samtar and stang: Backport for [[gerrit:844055|tumwiki: Update project logo (T320473)]] synced to the testservers: mwdebug2002.codfw.wmnet, 
mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:49:28] koi: can you test? ^ [20:50:01] !log phabricator - on new machines, find / -uid 497 -exec chown phd {} \; to fix privileges. (and then the same for -gid 498) The user phd used to be 497:498 (uid:gid) on old hosts but has been replaced with proper systemd system user using 920:920 T313360 [20:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:05] T313360: Setup rsync for phab data on disk - https://phabricator.wikimedia.org/T313360 [20:50:08] TheresNoTime: LGTM! [20:50:22] syncing [20:50:32] * TheresNoTime will remember to purge the cache this time.. [20:52:10] 10SRE, 10ops-eqiad, 10DBA, 10Sustainability (Incident Followup): Check DIMM A6 on db1131 - https://phabricator.wikimedia.org/T320994 (10RLazarus) Thanks John! [20:53:33] (03PS2) 10Samtar: arwiki: Fix editeditorprotected restriction level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844060 (https://phabricator.wikimedia.org/T321111) (owner: 10Stang) [20:54:09] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:844055|tumwiki: Update project logo (T320473)]] (duration: 05m 19s) [20:54:14] T320473: Requesting permanent logo change for tum.wikipedia.org - https://phabricator.wikimedia.org/T320473 [20:54:20] live & purged [20:54:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [20:54:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844060 (https://phabricator.wikimedia.org/T321111) (owner: 10Stang) [20:55:21] (03Merged) 10jenkins-bot: arwiki: Fix 
editeditorprotected restriction level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844060 (https://phabricator.wikimedia.org/T321111) (owner: 10Stang) [20:55:45] !log samtar@deploy1002 Started scap: Backport for [[gerrit:844060|arwiki: Fix editeditorprotected restriction level (T321111)]] [20:55:50] T321111: Fix $wgRestrictionLevels for arwiki - https://phabricator.wikimedia.org/T321111 [20:56:08] !log samtar@deploy1002 samtar and stang: Backport for [[gerrit:844060|arwiki: Fix editeditorprotected restriction level (T321111)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:56:16] koi: ^ on mwdebug, can you test? [20:56:25] looking [20:58:21] TheresNoTime: the new restriction level presented in action=protect, new right appears in special:usergrouprights, so I thought LGTM [20:58:43] great :) syncing [21:02:53] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:844060|arwiki: Fix editeditorprotected restriction level (T321111)]] (duration: 07m 08s) [21:02:58] all live :) [21:03:00] T321111: Fix $wgRestrictionLevels for arwiki - https://phabricator.wikimedia.org/T321111 [21:03:18] !log closing UTC late backport window [21:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:45] !log otrs1001 - emptied exim paniclog [21:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:57] (03PS1) 10CDanis: For newconnrate, log to syslog both IP and rate [puppet] - 10https://gerrit.wikimedia.org/r/844065 (https://phabricator.wikimedia.org/T306580) [21:15:02] (03PS2) 10CDanis: For newconnrate, log to syslog both IP and rate [puppet] - 10https://gerrit.wikimedia.org/r/844065 (https://phabricator.wikimedia.org/T306580) [21:21:30] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/844011 [21:22:06] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:24:08] PROBLEM - SSH on wdqs2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:25:27] (03PS1) 10CDanis: haproxy newconnrate: half the interval & threshold [puppet] - 10https://gerrit.wikimedia.org/r/844066 (https://phabricator.wikimedia.org/T306580) [21:26:36] (03CR) 10CDanis: [C: 03+2] "PCC LGTM (matches my manual testing)" [puppet] - 10https://gerrit.wikimedia.org/r/844065 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [21:26:50] (03CR) 10CDanis: [C: 03+2] haproxy newconnrate: half the interval & threshold [puppet] - 10https://gerrit.wikimedia.org/r/844066 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [21:34:46] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (3) VMs requested for aux-k8s-etcd - https://phabricator.wikimedia.org/T321134 (10jhathaway) [21:35:04] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (3) VMs requested for aux-k8s-etcd - https://phabricator.wikimedia.org/T321134 (10jhathaway) a:03jhathaway [21:35:27] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (3) VMs requested for aux-k8s-etcd - https://phabricator.wikimedia.org/T321134 (10jhathaway) [21:48:43] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (2) VMs requested for aux-k8s-ctrl - https://phabricator.wikimedia.org/T321137 (10jhathaway) [21:48:45] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (2) VMs requested for aux-k8s-ctrl - https://phabricator.wikimedia.org/T321137 (10jhathaway) a:03jhathaway [21:48:55] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (2) VMs requested for aux-k8s-ctrl - https://phabricator.wikimedia.org/T321137 (10jhathaway) [21:50:14] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (2) VMs requested for aux-k8s-worker - 
https://phabricator.wikimedia.org/T321138 (10jhathaway) [21:50:22] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (2) VMs requested for aux-k8s-worker - https://phabricator.wikimedia.org/T321138 (10jhathaway) a:03jhathaway [21:50:39] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (2) VMs requested for aux-k8s-worker - https://phabricator.wikimedia.org/T321138 (10jhathaway) [21:51:16] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (2) VMs requested for aux-k8s-ctrl - https://phabricator.wikimedia.org/T321137 (10jhathaway) [21:51:48] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:51:52] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (2) VMs requested for aux-k8s-ctrl - https://phabricator.wikimedia.org/T321137 (10jhathaway) [21:55:52] (03PS3) 10Stang: zhwiki: Add 20 years logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842918 (https://phabricator.wikimedia.org/T320859) [21:58:47] (03PS4) 10Stang: zhwiki: Add 20 years logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842918 (https://phabricator.wikimedia.org/T320859) [21:58:49] (03PS4) 10Stang: zhwiki: Update 20 years logos in logos.php and IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842919 (https://phabricator.wikimedia.org/T320859) [21:59:33] (03CR) 10CI reject: [V: 04-1] zhwiki: Update 20 years logos in logos.php and IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842919 (https://phabricator.wikimedia.org/T320859) (owner: 10Stang) [22:05:11] (03PS5) 10Stang: zhwiki: Add 20 years logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842918 (https://phabricator.wikimedia.org/T320859) [22:05:14] (03PS5) 10Stang: zhwiki: Update 20 years logos in logos.php and IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842919 (https://phabricator.wikimedia.org/T320859) [22:06:00] (03CR) 10CI 
reject: [V: 04-1] zhwiki: Update 20 years logos in logos.php and IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842919 (https://phabricator.wikimedia.org/T320859) (owner: 10Stang) [22:07:26] (03PS6) 10Stang: zhwiki: Update 20 years logos in logos.php and IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842919 (https://phabricator.wikimedia.org/T320859) [22:16:03] (03PS2) 10Stang: Fix broken wordmarks in Bengali projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844054 (https://phabricator.wikimedia.org/T321124) [22:17:22] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [22:18:34] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:22:01] (03PS2) 10BryanDavis: bullseye: add bzip2 and zstd compression programs [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/842992 (https://phabricator.wikimedia.org/T294607) [22:22:07] (03PS2) 10BryanDavis: mariadb: new image for mariadb/mysql backups [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/842993 (https://phabricator.wikimedia.org/T254636) [22:23:08] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:23:25] (03CR) 10BryanDavis: "PS2 renames the image mariadb per discussion about versioning. I think we decided there that a version isn't needed at this time?" 
[docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/842993 (https://phabricator.wikimedia.org/T254636) (owner: BryanDavis) [22:29:56] PROBLEM - Disk space on aphlict1001 is CRITICAL: DISK CRITICAL - free space: / 635 MB (3% inode=88%): /tmp 635 MB (3% inode=88%): /var/tmp 635 MB (3% inode=88%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=aphlict1001&var-datasource=eqiad+prometheus/ops [22:43:42] (PS1) Arlolra: Disable wgParserEnableLegacyMediaDOM on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/844073 (https://phabricator.wikimedia.org/T314318) [22:43:44] (PS1) Arlolra: Disable wgParserEnableLegacyMediaDOM on viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/844074 (https://phabricator.wikimedia.org/T314318) [22:52:40] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:01:46] SRE, Traffic, Patch-For-Review: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (AndyRussG) The change has rolled out to group 0 wikis, and should go to groups 1 and 2 this week. An example of a page with the ESI comment in the base HTML is [[ https:... [23:06:22] SRE, Wikimedia-Mailing-lists: Archive wikifr-l Mailing list - https://phabricator.wikimedia.org/T320312 (Dzahn) I get it if it's just for our responsibility, but if there is an expectation that actually deletes it from the Internet.. just saying it's already been archived 14 times by archive.org and ther...
[23:13:23] (CR) Aftab: "@Stang I just realised i align the text in the Wikipedia-tagline-bn.svg file to top-center; i've updated the commons file again (23:02, 18" [mediawiki-config] - https://gerrit.wikimedia.org/r/844054 (https://phabricator.wikimedia.org/T321124) (owner: Stang) [23:58:51] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1007.eqiad.wmnet with OS bullseye [23:58:57] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-presto1007.eqiad.wmnet with OS bullseye